Schema-on-read vs schema-on-write
- what is the schema versus no schema opposition?
- is there always some schema in the database?
- how implicit schema differs from explicit schema?
Schema-less / schema-on-read
The term schema-less (or schema-on-read) refers to a data processing approach where the structure or schema of the data is interpreted or applied just when the data is read from the storage rather than when it is initially loaded into or written to the persistency layer.
In this approach, the data is stored in its raw or semi-structured form.
Now is the good time to ask the question:
Is there always a schema in the database?
Even in schema-less databases, there are some expectations about the data structure. One may expect key-value pairs or document structures, which form a kind of an implicit schema. However, this schema is not enforced at the database level, allowing for more flexibility in data storage and retrieval. It means that the database system itself does not impose strict rules or requirements on the structure of the data being stored. So each document can have different fields and structures. It allows for more dynamic and agile development process. Documents, fields, entities can be created, updated, transformed and removed during development of an application.
But there is a drawback: changing the structure of data may lead to some inconsistency between the software layer and the persistency layer, for example, if a field has been renamed or its type changed and there are new and old documents in the same database. Depending on a detailed implementation, some operations using such critical field, like sorting, may fail.
Schema-on-read approach gives flexibility during development (no need for traditional data migration known from relational databases). It also gives scalability which makes the maintenance process cost- and asset-efficient. This matters significantly in the context of contemporary distributed systems, orchestration (Kubernetes) and cloud-native application development. Another quality is adaptability to business requirements.
The opposite of schema-less / schema-on-read is the schema-on-write concept.
Schema-enforced / schema-on-write
Before data is inserted into the database, it must conform to a predefined schema, otherwise it is rejected or throws an error. In traditional relational databases, you must define a schema upfront: specifying the tables, columns, and their data types before inserting any data. Any attempt to insert data that doesn’t conform to this schema will result in an error.
Schema-less systems are very fast. They offer easy to query / easy to store possibilities of working with databases, favouring availability over consistency and integrity.
On the other hand, the schema-on-write approach means that database system must conduct verification of queries, to check whether it matches required data structure, and so on. The verification process is not straightforward and it certainly takes some time, which implies some time overhead. Then, after the initial checks, the database system applies optimization of queries. Why? After all, it should not be too slow! In this way, the traditional, realational DBMS (SQL-like) could catch up its speed drawbacks, comparing to the NoSQL solutions.
Typically, the schema-on-write approach is related to relational databases, e.g. RDMS using SQL dialects.
Schema-on-write in details: how a SQL statement is processed in database?
In case of the SQL example, every SQL statement (in general, every query sent to the database) is subjected to verification and optimization. But before it happens, first it needs to be parsed by SQL engine.
Parsing
The parser breaks down statement input into parse tree (syntax tree).
What is syntax tree?
An abstract syntax tree (AST) is a data structure used in computer science to represent the structure of a program. It is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurring in the text (variable, assignment, operator). To put it simply, the syntax tree is a kind of a graph that depicts the structure of a given statement, its elements and relations between them.
Validation
A query (statement) that has been already parsed is subjected to the validation process. Database tables, columns, and permissions are checked. First, the syntax analysis happens. It checks if the statement matches required SQL dialect (and if it uses SQL at all, and not tagalog, for instance!). Then, the semantic analysis takes place. It checks whether the statement matches the database structure, regarding the naming of tables and columns. Also it verifies the sender’s permissions to execute given operation.
Optimization
How SQL statement is optimized?
SQL query optimization happens in the database query processing pipeline. It involves the query optimizer. The query optimizer job is to determine the most efficient query execution plan by various actions:
- query rewriting & transformation (e.g. commands reordering, changing subqueries into joins, etc.)
- cost-based optimization
- index selection
- join order change
- filtering and predicate can be moved closer to the data source
- check if there is a possibility of parallel execution
- memory management
- check if there is a possibility of query plan caching
Execution
The final stage is the query execution. The chosen execution plan is passed to the query executor, which carries out the actual retrieval, modification, or insertion of data. Data are accessed and the result is generated and returned to the sender.
What algorithms are used to calculate the most efficient query execution plan?
Specific algorithms used by the query optimizer can vary between different database systems. Common techniques include:
- cost of each query estimations (in terms of the CPU, I/O, memory usage), it uses mathematical models to count the cost (so-called cost models)
- proper search algorithms selection
- application of statistical information
- join ordering heuristics
- and others.
Are execution plans cached in databases?
Yes, the query execution plans might be cached as it gives better performance. Many RDBMS employ cache to store and reuse execution plans for frequently executed SQL queries. This caching mechanism helps to avoid the cost of repeatedly optimizing and generating execution plans for the same queries.
What are steps for query execution plan caching?
- Query compilation: When a SQL query is first executed, the database system goes through the parsing, optimization, and execution steps to generate an execution plan tailored to the current state of the database.
- Query plan storage: After optimization, the generated execution plan is stored in the query plan cache associated with the specific query text or a hash of the query text.
- During subsequent executions, when the same or a similar query is executed again, the database system checks the query plan cache first.
- Cache hit: if a matching execution plan is found in the cache (a cache hit), the system can directly reuse the stored plan instead of going through the entire optimization process again.
- Cache miss: if there is no matching execution plan in the cache (a cache miss), the database system reoptimizes the query and generates a new execution plan, which is then stored in the cache for future use.
Schema-less vs schema-enforced in practice: NoSQL vs SQL
As we could see, schema-enforced solutions, related with SQL databases, imply a lot of validations, verifications, checks and optimizations before, during and after statement execution (see SQL triggers, for instance). This may suggest that they are generally more secure. Certainly, here, the emphasis is being put on data integrity and constraints. Schema-enforce not only allows for writing more secure applications, especially when dealing with sensitive data, critical sections and crucial operations, but even requires serious approach to data consistency and security.
Here, there are multiple security layers:
RDBMS infrastructure (this aspect is present in NoSQL databases as well)
- security of physical storage (bare metal, cloud)
- protection of the connection between application and database (e.g REST API calls)
- Identity and Access Management (IAM): access to the database as admin, user, application)
RDMS verification & constraints layer
- data consistency & integrity requirements
- data constraints
- authorization check to execute operations / to access databases and tables
- query verification
Software layer
- code should be consitent with database constraints
- may add additional authenticaton & authorization layer, like user roles and authorities
- ORM framework adds another layer of protection
- there are prepared statements to protect against SQL injection
- custom validation for input, data and database access may be applied
SQL-like languages have highly structured schemas and often use normalization. They put a strong emphasis on ACID, transactions, data integrity and consistency Their drawback is lower speed and worse scalability. Here, the NoSQL solutions offers much more:
- they are based on flexible schema, not relational tables
- storage formats: JSON, BSON (binary JSON) key-value pairs, key-document pairs, BSON, graphs
- easier for horizontal scaling (more nodes or servers)
- flexible schemas (document-bases, key-value pairs, column-based, graphs)
- often uses denormalization
- NoSQL favors system availability and fault tolerance instead of consistency / integrity
- JSON-based queries, SQL languages, other DDL languages
There are various types of NoSQL databases:
- Documents store (MongoDb). Data as documents (JSON, BSON). Each document is a set of key-value pairs or key-document pairs.
- Key-value pairs (Redis)
- Column stores (Apache Cassandra). Data organized in columns instead of rows. Well-suited for analytics and time-series data.
- Graph databases
- Object-oriented: store data as objects.
- Multimodel - different approaches in one base
- others: XML, NewSQL, time-series
Schema-less solutions are ideal for scenarios with large amounts of unstructured or semi-structured data, high read/write throughput, and the need for horizontal scalability. They are common in web applications, content management, real-time big data processing, and distributed systems.
A particular feature for speed and scalability is sharding. Sharding is a database architecture strategy where a large database is partitioned into smaller, more manageable pieces called shards. Each shard is an independent database server with independent storage. It stores a subset of the overall data. The goal of sharding is to distribute the data and the associated workload across multiple servers, improving performance, scalability, and resource utilization.
Used, for example, in Redis, ElasticSearch, MongoDB and Postgres / MySQL (with extensions for sharding).
NoSQL - secure or not?
NoSQL systems are facing the same challenges as traditional databases. They were subjected to data breaches as well (see for example Mexican voters registry leak, where data were stored in MongoDB, 2016). NoSQL injection attacks might be possible. Some sources claim the authorization and encryption mechanisms are weaker in NoSQL, but this opinion is disputable and has no clear argumentation.
If a client communicates with a NoSQL database via plain text, it poses risk of the man-in-the-middle attack. But the same can be true in case of a relational database if the network traffic is not properly secured.
When comparing SQL with NoSQL solutions in terms of security, the only thing that seems to be in favour the former is the strict schema that requires predefined data structure and applies internal validation of statements. In addition, it requires stronger data integrity and consistency, often with explicit data constraints.
However, it does not guarantee security. On the other hand, it is possible, that schema-less persistency supported by a proper security layer and data validation could be compensated on the application side.