Enter the Bee

Schema-on-read vs schema-on-write

2024-04-10T21:37:00+00:00

What is the opposition between schema and no schema?
Is there always some schema in the database?
How does an implicit schema differ from an explicit one?

Schema-less / schema-on-read

The term schema-less (or schema-on-read) refers to a data processing approach where the structure or schema of the data is interpreted or applied just when the data is read from the storage rather than when it is initially loaded into or written to the persistency layer.

In this approach, the data is stored in its raw or semi-structured form.

Now is a good time to ask the question:

Is there always a schema in the database?

Even in schema-less databases, there are some expectations about the data structure. One may expect key-value pairs or document structures, which form a kind of implicit schema. However, this schema is not enforced at the database level: the database system itself does not impose strict rules on the structure of the stored data, so each document can have different fields and structures. This allows for a more dynamic and agile development process. Documents, fields and entities can be created, updated, transformed and removed while the application is being developed.

But there is a drawback: changing the structure of data may lead to inconsistency between the software layer and the persistency layer, for example when a field has been renamed or its type changed and old and new documents coexist in the same database. Depending on the implementation, some operations that rely on such a field, like sorting, may fail.
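The drawback described above can be sketched in a few lines of Java. This is only an illustration, not how any particular database works: documents are modelled as plain maps, and the field names surname and lastName, as well as the class and method names, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical example: "documents" are plain maps, as a schema-less store might return them.
class SchemaOnReadDemo {

    // The implicit schema lives in the reader: it accepts both the old
    // field name ("surname") and the new one ("lastName").
    static String readLastName(Map<String, Object> doc) {
        Object value = doc.containsKey("lastName") ? doc.get("lastName") : doc.get("surname");
        // The type is checked only at read time; anything unexpected is treated as missing.
        return (value instanceof String) ? (String) value : null;
    }

    // Sorting works only because documents with an unreadable field are skipped first.
    static List<String> sortedLastNames(List<Map<String, Object>> docs) {
        List<String> names = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            String name = readLastName(doc);
            if (name != null) {
                names.add(name);
            }
        }
        names.sort(String::compareTo);
        return names;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = List.of(
                Map.of("surname", "Nowak"),      // old document
                Map.of("lastName", "Kowalski"),  // new document, renamed field
                Map.of("lastName", 42));         // new document, changed type
        System.out.println(sortedLastNames(docs)); // prints [Kowalski, Nowak]
    }
}
```

The point of the sketch is that the application, not the database, has to know about every historical shape of a document.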

The schema-on-read approach gives flexibility during development (no need for the traditional data migrations known from relational databases). It also scales well, which makes maintenance cost- and asset-efficient. This matters significantly in the context of contemporary distributed systems, orchestration (Kubernetes) and cloud-native application development. Another quality is adaptability to business requirements.

The opposite of schema-less / schema-on-read is the schema-on-write concept.

Schema-enforced / schema-on-write

Before data is inserted into the database, it must conform to a predefined schema, otherwise it is rejected or throws an error. In traditional relational databases, you must define a schema upfront: specifying the tables, columns, and their data types before inserting any data. Any attempt to insert data that doesn’t conform to this schema will result in an error.

Schema-less systems can be very fast, because the database does not validate the structure of the data on write. They make data easy to store and easy to query, favouring availability over consistency and integrity.

On the other hand, the schema-on-write approach means that the database system must verify each statement, for example to check whether it matches the required data structure. This verification is not trivial and adds some time overhead. Then, after the initial checks, the database system optimizes the query. Why? After all, it should not be too slow! In this way, a traditional relational DBMS (SQL-based) can make up for its speed disadvantage compared to NoSQL solutions.

Typically, the schema-on-write approach is associated with relational databases, e.g. RDBMS using SQL dialects.

Schema-on-write in detail: how is a SQL statement processed in the database?

In the SQL case, every statement (in general, every query sent to the database) is subjected to verification and optimization. But before that happens, it first needs to be parsed by the SQL engine.

Parsing

The parser breaks the input statement down into a parse tree (syntax tree).

What is a syntax tree?

An abstract syntax tree (AST) is a data structure used in computer science to represent the structure of a program. It is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurring in the text (variable, assignment, operator). To put it simply, the syntax tree is a kind of a graph that depicts the structure of a given statement, its elements and relations between them.
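To make the idea of a syntax tree more tangible, here is a minimal sketch of a tree for the condition age > 18 AND active = true. The node types (AstNode, ColumnRef, Constant, BinaryOp) are hypothetical and invented for this illustration; real SQL parsers use far richer structures.

```java
// Hypothetical node types for a tiny expression tree.
interface AstNode {
    String render();
}

// Leaf node: a reference to a column.
final class ColumnRef implements AstNode {
    private final String name;

    ColumnRef(String name) {
        this.name = name;
    }

    @Override
    public String render() {
        return name;
    }
}

// Leaf node: a literal value such as 18 or true.
final class Constant implements AstNode {
    private final Object value;

    Constant(Object value) {
        this.value = value;
    }

    @Override
    public String render() {
        return String.valueOf(value);
    }
}

// Inner node: an operator with a left and a right subtree.
final class BinaryOp implements AstNode {
    private final String operator;
    private final AstNode left;
    private final AstNode right;

    BinaryOp(String operator, AstNode left, AstNode right) {
        this.operator = operator;
        this.left = left;
        this.right = right;
    }

    @Override
    public String render() {
        return "(" + left.render() + " " + operator + " " + right.render() + ")";
    }
}

class AstDemo {
    // Builds the tree for: age > 18 AND active = true
    static AstNode sampleTree() {
        return new BinaryOp("AND",
                new BinaryOp(">", new ColumnRef("age"), new Constant(18)),
                new BinaryOp("=", new ColumnRef("active"), new Constant(true)));
    }

    public static void main(String[] args) {
        // Walking the tree reproduces the expression with explicit grouping.
        System.out.println(sampleTree().render()); // prints ((age > 18) AND (active = true))
    }
}
```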

Validation

A query (statement) that has already been parsed is subjected to the validation process. Database tables, columns, and permissions are checked. First, syntax analysis happens. It checks whether the statement conforms to the required SQL dialect (and whether it uses SQL at all, and not Tagalog, for instance!). Then, semantic analysis takes place. It checks whether the statement matches the database structure, i.e. whether the referenced tables and columns exist. It also verifies the sender’s permissions to execute the given operation.

Optimization

How is a SQL statement optimized?

SQL query optimization happens in the database query processing pipeline. It involves the query optimizer. The query optimizer’s job is to determine the most efficient query execution plan through various actions:

query rewriting & transformation (e.g. reordering operations, changing subqueries into joins, etc.)
cost-based optimization
index selection
join reordering
predicate pushdown (moving filters and predicates closer to the data source)
checking whether parallel execution is possible
memory management
checking whether the query plan can be cached

Execution

The final stage is the query execution. The chosen execution plan is passed to the query executor, which carries out the actual retrieval, modification, or insertion of data. Data are accessed and the result is generated and returned to the sender.

What algorithms are used to calculate the most efficient query execution plan?

Specific algorithms used by the query optimizer can vary between different database systems. Common techniques include:

estimating the cost of each candidate plan (in terms of CPU, I/O and memory usage) using mathematical models (so-called cost models)
selecting suitable search algorithms
application of statistical information
join ordering heuristics
and others.

Are execution plans cached in databases?

Yes, query execution plans may be cached, as this gives better performance. Many RDBMSs employ a cache to store and reuse execution plans for frequently executed SQL queries. This caching mechanism helps to avoid the cost of repeatedly optimizing and generating execution plans for the same queries.

What are the steps of query execution plan caching?

Query compilation: When a SQL query is first executed, the database system goes through the parsing, optimization, and execution steps to generate an execution plan tailored to the current state of the database.
Query plan storage: After optimization, the generated execution plan is stored in the query plan cache associated with the specific query text or a hash of the query text.
During subsequent executions, when the same or a similar query is executed again, the database system checks the query plan cache first.
Cache hit: if a matching execution plan is found in the cache (a cache hit), the system can directly reuse the stored plan instead of going through the entire optimization process again.
Cache miss: if there is no matching execution plan in the cache (a cache miss), the database system reoptimizes the query and generates a new execution plan, which is then stored in the cache for future use.
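The steps above can be sketched as a small lookup table keyed by the query text. This is a deliberately simplified model: the PlanCache class, the string used as a "plan" and the optimizer function passed to the constructor are assumptions made for the example, and real databases also handle invalidation, parameters and memory limits.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified model of a query plan cache keyed by the query text.
class PlanCache {
    private final Map<String, String> plansByQuery = new HashMap<>();
    private final Function<String, String> optimizer; // stands in for the real optimizer
    int optimizerCalls = 0;                           // counts cache misses

    PlanCache(Function<String, String> optimizer) {
        this.optimizer = optimizer;
    }

    String planFor(String queryText) {
        String key = queryText.trim(); // real systems normalize or hash the text
        String plan = plansByQuery.get(key);
        if (plan == null) {
            // Cache miss: run the (expensive) optimizer and store the result.
            optimizerCalls++;
            plan = optimizer.apply(key);
            plansByQuery.put(key, plan);
        }
        // Cache hit: the stored plan is returned without calling the optimizer.
        return plan;
    }

    public static void main(String[] args) {
        PlanCache cache = new PlanCache(query -> "PLAN(" + query + ")");
        cache.planFor("SELECT * FROM customer");
        cache.planFor("SELECT * FROM customer");
        System.out.println(cache.optimizerCalls); // prints 1
    }
}
```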

Schema-less vs schema-enforced in practice: NoSQL vs SQL

As we have seen, schema-enforced solutions, associated with SQL databases, involve a lot of validations, verifications, checks and optimizations before, during and after statement execution (see SQL triggers, for instance). This may suggest that they are generally more secure. Certainly, the emphasis here is on data integrity and constraints. Schema enforcement not only allows for writing more secure applications, especially when dealing with sensitive data, critical sections and crucial operations, but even requires a serious approach to data consistency and security.

Here, there are multiple security layers:

RDBMS infrastructure (this aspect is present in NoSQL databases as well)

security of physical storage (bare metal, cloud)
protection of the connection between application and database (e.g. REST API calls)
Identity and Access Management (IAM): access to the database as admin, user or application

RDBMS verification & constraints layer

data consistency & integrity requirements
data constraints
authorization check to execute operations / to access databases and tables
query verification

Software layer

code should be consistent with database constraints
may add an additional authentication & authorization layer, like user roles and authorities
ORM framework adds another layer of protection
there are prepared statements to protect against SQL injection
custom validation for input, data and database access may be applied

SQL databases have highly structured schemas and often use normalization. They put a strong emphasis on ACID, transactions, data integrity and consistency. Their drawback is lower speed and worse scalability. Here, NoSQL solutions offer much more:

they are based on flexible schema, not relational tables
storage formats: JSON, BSON (binary JSON), key-value pairs, key-document pairs, graphs
easier horizontal scaling (more nodes or servers)
flexible schemas (document-based, key-value pairs, column-based, graphs)
often uses denormalization
NoSQL favors system availability and fault tolerance instead of consistency / integrity
JSON-based queries, SQL languages, other DDL languages

There are various types of NoSQL databases:

Document stores (MongoDB). Data stored as documents (JSON, BSON). Each document is a set of key-value pairs or key-document pairs.
Key-value pairs (Redis)
Column stores (Apache Cassandra). Data organized in columns instead of rows. Well-suited for analytics and time-series data.
Graph databases
Object-oriented: store data as objects.
Multi-model: different approaches in one database
others: XML, NewSQL, time-series

Schema-less solutions are ideal for scenarios with large amounts of unstructured or semi-structured data, high read/write throughput, and the need for horizontal scalability. They are common in web applications, content management, real-time big data processing, and distributed systems.

A particular feature for speed and scalability is sharding. Sharding is a database architecture strategy where a large database is partitioned into smaller, more manageable pieces called shards. Each shard is an independent database server with independent storage. It stores a subset of the overall data. The goal of sharding is to distribute the data and the associated workload across multiple servers, improving performance, scalability, and resource utilization.

Sharding is used, for example, in Redis, Elasticsearch, MongoDB and Postgres / MySQL (with extensions for sharding).
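One simple way to decide which shard stores a given record is to hash its key and take the remainder modulo the number of shards. The sketch below shows only this one strategy; the ShardRouter class is hypothetical, and the systems mentioned above use their own, more elaborate schemes (hash slots, ranges, consistent hashing).

```java
// Hypothetical router that maps a record key to one of N shards by hashing.
class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // floorMod keeps the result in the range [0, shardCount), even for negative hash codes.
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        // The same key is always routed to the same shard.
        System.out.println(router.shardFor("customer-42") == router.shardFor("customer-42"));
    }
}
```

A known weakness of plain modulo hashing is that changing the number of shards moves most keys to a different shard, which is one reason why production systems prefer other schemes.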

NoSQL - secure or not?

NoSQL systems are facing the same challenges as traditional databases. They were subjected to data breaches as well (see for example Mexican voters registry leak, where data were stored in MongoDB, 2016). NoSQL injection attacks might be possible. Some sources claim the authorization and encryption mechanisms are weaker in NoSQL, but this opinion is disputable and has no clear argumentation.

If a client communicates with a NoSQL database via plain text, it poses risk of the man-in-the-middle attack. But the same can be true in case of a relational database if the network traffic is not properly secured.

When comparing SQL with NoSQL solutions in terms of security, the only thing that seems to be in favour of the former is the strict schema, which requires a predefined data structure and applies internal validation of statements. In addition, it requires stronger data integrity and consistency, often with explicit data constraints.

However, this does not guarantee security. On the other hand, it is possible that the lack of an enforced schema in schema-less persistency can be compensated for on the application side by a proper security layer and data validation.

Defensive programming in Java

2023-11-11T16:53:00+00:00

In this article we will discuss the principles of defensive programming in Java. Defensive programming involves writing code that anticipates and protects against potential errors and unexpected situations to ensure the reliability and robustness of software.

Ensure proper exception handling

As a reminder: all exceptions (including errors) are subclasses of Throwable class.

Throwable objects - commonly called Exceptions - can be either exceptions or errors.

Because of that, Throwable has two subclasses - Error and Exception:

Throwable -> Error

Throwable -> Exception

What is the difference between error and exception?

Errors in Java represent serious, usually unrecoverable problems that occur at runtime. They are rather related to a sudden change of state of the application and its environment. They are typically caused by issues outside the control of the program, such as system failures, hardware problems, or severe environmental conditions.

Errors are unchecked: they are not meant to be caught or handled by the application code.

Examples of errors: OutOfMemoryError, StackOverflowError

Exceptions are related to abnormal conditions or unexpected situations during the execution of a program. Exceptions are often caused by faulty code or invalid inputs.

Exceptions are either meant to be caught and handled by the application (when they are “checked”), or they should be avoided by correct program logic, including input validation (when they are “unchecked”).

Proper handling of checked exceptions requires try-catch blocks to catch exceptions, implementing error-handling logic (exception handlers, e.g. using Spring framework) or recovery mechanisms (also possible with Spring).

Examples of unchecked exceptions: infamous NullPointerException, ArrayIndexOutOfBoundsException. <--- should not be caught & handled, should be avoided by correct coding!

Examples of checked exceptions: IOException, FileNotFoundException, SQLException. <--- we should predict them and be prepared (catch & handle)!

So exceptions can be checked or unchecked. Unchecked exceptions are subclasses of RuntimeException. All other exceptions are checked: they descend from the Exception class, but not from RuntimeException.

Throwable -> Exception -> RuntimeException -> (unchecked exceptions)

Throwable -> Exception -> (checked exceptions)

To sum up, unchecked issues are Errors, RuntimeException and its descendants.

The Exception class and its descendants other than RuntimeException are checked issues, and they require proper handling. When such an exception is thrown, control is transferred to the nearest matching exception handler. In Java, checked exceptions are tracked by the compiler.

Exception rethrowing and chaining

Exception rethrowing is simply throwing a caught exception again in the catch clause.

Exception chaining is wrapping a caught exception in a new one (of another class). It is also useful if a checked exception occurs in a method that is not allowed to throw a checked exception. You can catch the checked exception and chain it to an unchecked one.

Use initCause(), getCause() or a constructor that takes a cause to pass the original exception so that it can be retrieved later.
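A minimal sketch of chaining a checked exception to an unchecked one could look as follows. The class and method names (ChainingDemo, readConfig, loadConfig) are made up for the example; only the exception classes and getCause() come from the JDK.

```java
import java.io.IOException;

class ChainingDemo {

    // Hypothetical low-level operation that fails with a checked exception.
    static String readConfig() throws IOException {
        throw new IOException("config file is missing");
    }

    // This method does not declare any checked exception, so the IOException
    // is wrapped in an unchecked one and kept as its cause.
    static String loadConfig() {
        try {
            return readConfig();
        } catch (IOException e) {
            throw new IllegalStateException("Cannot load configuration", e);
        }
    }

    public static void main(String[] args) {
        try {
            loadConfig();
        } catch (IllegalStateException e) {
            // The original exception is still available through getCause().
            System.out.println(e.getMessage() + " <- " + e.getCause().getMessage());
        }
    }
}
```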

Exception handling and security

Now it is important to ask what proper exception handling means for application security.

Handling exceptions may seem trivial, but often it is not, and it can lead to unreadable, overcomplicated spaghetti code with nested try-catch clauses.

Not all business scenarios are “happy path”.

When exception handling is simple and robust, potential vulnerabilities or sensitive information exposure are handled safely to prevent security breaches. This increases software integrity. Control over the system is enhanced, with greater stability and easier monitoring. It makes it possible to prepare the right responses to exceptional situations.

Close the resources: try-with-resources, AutoCloseable vs Closeable

Exception handling in resource management is another problem.

// suppose there is some collection to iterate through:
var input = new ArrayList<String>();
// a resource comes into play...
PrintWriter out = new PrintWriter("output.txt");
for (String row : input) {
    out.println(row.toLowerCase());
}
// if an exception occurs above, this line is never reached:
out.close();

Try-with-resources guarantees that the resources are closed regardless of whether an exception occurs:

var out = new PrintWriter("output.txt");
try (out) {
    for (String row : input)
        out.println(row.toLowerCase());
} // out.close() is called implicitly!

out.close() method is called behind the scenes, because PrintWriter implements AutoCloseable. The resources will be closed when try-with-resources exits, no matter if an exception has been thrown. And no need to worry about closing resources when the code executes normally. It happens automagically!

Closeable extends AutoCloseable, so it can be used with try-with-resources to close the resources automatically. The difference lies in the close() signature: AutoCloseable.close() throws Exception, while Closeable.close() throws IOException. Closeable is the older interface, specifically designed for I/O-related classes.

When there are two resources declared and initialized in try() clause (which is a valid and acceptable case):

try (Scanner in = new Scanner(Paths.get("/usr/share/dict/words"));
     PrintWriter out = new PrintWriter("output.txt")) {
    while (in.hasNext())
        out.println(in.next().toLowerCase());
}

Here:

Resources are closed in reverse order of their initialization: out is closed before in.
If the PrintWriter throws an exception, the try-with-resources mechanism still closes in and propagates the exception from out.

Suppression mechanism

A very interesting aspect is discussed by Cay S. Horstmann in his Core Java:

Some close methods can throw exceptions. If that happens when the try block completed normally, the exception is thrown to the caller. However, if another exception had been thrown, causing the close methods of the resources to be called, and one of them throws an exception, that exception is likely to be of lesser importance than the original one. In this situation, the original exception gets rethrown, and the exceptions from calling close are caught and attached as “suppressed” exceptions.

After catching the first exception, it is possible to get to the suppressed exceptions using getSuppressed():

try {
    // something here throws the more important exception
} catch (IOException ex) {
    Throwable[] secondaryExceptions = ex.getSuppressed();
    // here we can inspect the suppressed exceptions
}
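The suppression mechanism can be observed with a small, self-contained experiment. FailingResource is a hypothetical resource written only for this sketch: its close() method always throws, so the exception from the try block and the exception from close() meet.

```java
class SuppressionDemo {

    // Hypothetical resource whose close() always fails.
    static class FailingResource implements AutoCloseable {
        @Override
        public void close() {
            throw new IllegalStateException("close failed");
        }
    }

    // Returns the exception that escapes the try-with-resources statement.
    static RuntimeException run() {
        RuntimeException caught = null;
        try (FailingResource resource = new FailingResource()) {
            throw new IllegalArgumentException("primary failure");
        } catch (RuntimeException e) {
            caught = e;
        }
        return caught;
    }

    public static void main(String[] args) {
        RuntimeException e = run();
        // The exception from the try block wins...
        System.out.println(e.getMessage());                    // prints primary failure
        // ...and the exception from close() is attached to it as suppressed.
        System.out.println(e.getSuppressed()[0].getMessage()); // prints close failed
    }
}
```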

Do not throw from finally.

The suppression mechanism works only for try-with-resources. Do not throw exceptions in a finally clause. If the try block terminates with an exception, this exception is masked by the exception thrown in the finally clause.

Do not return from finally.

If the try block has a return statement and there is another return in finally, the latter (the return from finally) overrides the former (from try).
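Both pitfalls are easy to reproduce. The sketch below is an illustration only (the class and method names are invented); the @SuppressWarnings("finally") annotation merely silences the compiler warning that this kind of code deserves.

```java
class FinallyPitfallsDemo {

    // The return in finally replaces the value returned from try.
    @SuppressWarnings("finally")
    static String returnFromFinally() {
        try {
            return "from try";
        } finally {
            return "from finally";
        }
    }

    // The exception thrown in finally masks the one thrown in try.
    @SuppressWarnings("finally")
    static String messageOfEscapingException() {
        String message = null;
        try {
            try {
                throw new IllegalStateException("original");
            } finally {
                throw new IllegalArgumentException("thrown in finally");
            }
        } catch (RuntimeException e) {
            message = e.getMessage();
        }
        return message;
    }

    public static void main(String[] args) {
        System.out.println(returnFromFinally());          // prints from finally
        System.out.println(messageOfEscapingException()); // prints thrown in finally
    }
}
```

In the second method the original IllegalStateException is lost completely: it is neither rethrown nor attached as a suppressed exception.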

Resource management and security

While try-with-resources itself is not a security feature per se, it does play a role in promoting secure coding practices and preventing resource leaks that could lead to security vulnerabilities.

By automatically managing the closing of resources, try-with-resources reduces the likelihood of resource leaks. Leaked resources, such as open file handles or network connections, can pose security risks and impact the stability of an application.

Properly closing resources prevents resource exhaustion attacks where an attacker intentionally consumes available resources (e.g., file descriptors) to degrade system performance or cause denial-of-service conditions.

For security-sensitive resources like cryptographic streams or database connections, try-with-resources ensures that they are properly released, reducing the window of opportunity for potential security vulnerabilities related to resource mismanagement.

Last but not least, with try-with-resources, handling exceptions related to resource cleanup is simplified.

Assertions

Java supports assertions, which are boolean expressions that the programmer believes will be true at that point in the code. They are useful during development and testing to catch potential issues early. They can be disabled in production code (actually, assertions are disabled by default).

Instead of checking the condition with if:

if (x < 0) throw...

which stays in the code and is evaluated every time, even in production, we can use assertions.

Enable assertions:

java -ea MainClass

and then put into code:

assert x > 0 : "x should be greater than 0";

We can enable assertions for single classes or packages as well:

java -ea:MyCustomClass -ea:com.mycompany.somepackage... MainClass

Assertions are handled by the class loader. When they are disabled (by default!), the class loader removes all assertion code so that it won’t slow execution (e.g. in production environment).

We can customize disabling assertions as well using -da flag (a.k.a. switch):

java -ea:... -da:MyCustomClass MainClass

During the development phase, assertions can serve as a form of security audit by checking that security-critical conditions or assumptions are met. However, assertions should not be relied upon as the primary means of enforcing security.
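Since assertions can be switched off, one common way to combine them with ordinary checks is sketched below: an explicit check guards the input, and an assertion documents an internal assumption. This division of labour is a convention assumed for the example, not a rule stated above, and the squareRoot method is hypothetical.

```java
class AssertionDemo {

    // The argument check always runs; the assertion runs only with -ea.
    static double squareRoot(double x) {
        if (x < 0) {
            throw new IllegalArgumentException("x must not be negative: " + x);
        }
        double result = Math.sqrt(x);
        // Internal sanity check, skipped when assertions are disabled (the default).
        assert result >= 0 : "square root should not be negative";
        return result;
    }

    public static void main(String[] args) {
        System.out.println(squareRoot(9.0)); // prints 3.0
        try {
            squareRoot(-1.0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints x must not be negative: -1.0
        }
    }
}
```

Run it with java -ea AssertionDemo to have the assert line evaluated as well.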

Many faces of defensive programming

Other rules of defensive programming are related to:

Null-checks
Immutable classes and defensive copying
Input validation (already discussed a propos SQL injection)
Proper logging
Design patterns & the use of tested and known algorithms

It is particularly important not to reinvent the wheel when it comes to existing algorithms. They are well-grounded in computer science theory, battle-tested by a wide community and measurable in terms of efficiency and outcome. Writing your own implementation is not always a good idea, as it may lead to inefficient behaviour and unpredictable results. And unless you are a cryptography maven, do not try to put your own cryptography solutions into industry-grade code.

TBC in next articles.

SQL injection and how to mitigate

2023-10-01T20:23:00+00:00

Previously on SQL: Intro to SQL security

SQL injection attack

SQL injection - a vulnerability that occurs when an attacker is able to manipulate an SQL query or inject their own SQL command (a part of an SQL query) into it.

SQL injection attack - happens when an attack vector targets an SQL injection vulnerability and an attacker takes advantage of it, injecting malicious SQL code.

SQL injection is widespread because it is easily detected and exploited!

Possible results of SQL injection attack: unauthorized access to database, unauthorized read of data, such as user login and passwords, data manipulation, taking control over the operating system.

Malicious SQL

Suppose there is a web application that exposes the endpoint:

https://somecompany.com/customers/search?last_name=something

This endpoint allows searching for customers of this company in the application database. Let’s imagine the app is connected to the customers database, similar to the one discussed in previous posts. A request sent to the endpoint executes some application code that performs a search in the database. Let’s say it is a method that triggers the following query:

SELECT * FROM customer WHERE last_name LIKE '%something%' AND active=true

where something stands for user input, typed in the browser, e.g. into an HTML input field.

In Java, such a method could look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DatabaseQueryExecutor {

    public static void executeSearchQuery(String lastName) {
        Connection connection = null;
        Statement statement = null;

        try {
            // Set up the database connection (replace with your database URL, username, and password)
            String dbUrl = "jdbc:mysql://your-database-url";
            String dbUser = "your-username";
            String dbPassword = "your-password";

            connection = DriverManager.getConnection(dbUrl, dbUser, dbPassword);
            statement = connection.createStatement();

            // Build the SQL query by concatenating user input (this is the vulnerable part!)
            String sqlQuery = "SELECT * FROM customer WHERE last_name LIKE '%" + lastName + "%' AND active = true";

            // Execute the SQL query
            ResultSet resultSet = statement.executeQuery(sqlQuery);

            // Process the results (you can replace this with your specific logic)
            while (resultSet.next()) {
                // Retrieve and process data here
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            // Close resources in the reverse order of their creation
            try {
                if (statement != null) {
                    statement.close();
                }
                if (connection != null) {
                    connection.close();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}

With a URL request for the word malicious':

https://somecompany.com/customers/search?last_name=malicious'

the SQL becomes:

SELECT * FROM customer WHERE last_name LIKE '%malicious'%' AND active=true

but there is a syntax error due to the unescaped character (apostrophe / single quote symbol).

The single quote symbol opens and closes a string; adding an extra one closes the string prematurely and opens another one that is never closed.

So when you put a special character into an input box of a web application and the result is similar to this message: Incorrect syntax near..., it might suggest that the input has been directly injected into some backend SQL query. Apparently, the input character was executed as part of the SQL query, which ended with an SQL error message. The conclusion: a bad actor (as bad as the actors from Mick Herron’s Slough House series) can try to tinker with this vulnerability, looking for an opportunity to execute an SQL injection attack.

There is a way to get rid of this error: comment out the rest of the query, which should not be executed, with two dashes:

-- this is SQL comment

Note that in some flavours (e.g. MySQL, MariaDB), the two dashes must be followed by at least one whitespace or control character (such as a space, tab or newline). See documentation.

With the comment trick, the URL will look like https://somecompany.com/customers/search?last_name=malicious'-- , which triggers this query underneath:

SELECT * FROM customer WHERE last_name LIKE '%malicious'-- %' AND active=true

(Kramdown in Jekyll has been set to MySQL, so in a Markdown file we can see that with a trailing space it works perfectly fine. It should look fine also for other plugins.)

By commenting out the end of the query, the second part of the WHERE condition has been disabled.

As a result, the query returns all customers with the given name (just put something instead of malicious), regardless of whether they are active.

Let’s add the infamous OR 1=1-- to the URL:

https://somecompany.com/customers/search?last_name=malicious' OR 1=1--

so that we trigger another SQL query through the above-mentioned Java method (which is, by the way, strong evidence of bad practice):

SELECT * FROM customer WHERE last_name LIKE '%malicious' OR 1=1-- %' AND active=true

As the OR condition is a tautology (always true), the database will return all customers irrespective of their name.

By adding a boolean SQL expression that evaluates to true and attaching it to the WHERE clause, one can make the whole condition always true. Open sesame!

Because the rest of the query has been commented out, the WHERE clause is reduced to last_name LIKE '%malicious' OR 1=1. The name filter no longer matters: the expression after OR is true for every row, so every row is returned.
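To see how the three inputs discussed above end up in the SQL text, it is enough to reproduce the string concatenation without any database. The sketch below is an illustration only: InjectionDemo and buildQuery are hypothetical names, and the query text follows the SQL examples shown earlier.

```java
class InjectionDemo {

    // Builds the query text by concatenation, like the vulnerable method (do not do this!).
    static String buildQuery(String lastName) {
        return "SELECT * FROM customer WHERE last_name LIKE '%" + lastName + "%' AND active=true";
    }

    public static void main(String[] args) {
        // Harmless input: the query keeps its intended structure.
        System.out.println(buildQuery("something"));
        // A stray quote breaks the syntax.
        System.out.println(buildQuery("malicious'"));
        // The comment trick cuts off the second condition.
        System.out.println(buildQuery("malicious'-- "));
        // The tautology makes the remaining condition true for every row.
        System.out.println(buildQuery("malicious' OR 1=1-- "));
    }
}
```

In every case the user input changes the structure of the statement, not just a value inside it, and this is exactly what parameterized queries prevent.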

Methods of SQL injection attack

Union-based

Using the previously discussed mechanisms, in this scenario a UNION clause is attached to the initial query. By trial and error (using brute force or the ORDER BY command), an attacker can learn the number of columns in a given table. They can also discover database metadata from information_schema.tables (in some database engines), such as the names of tables and their columns. In the following steps, the perpetrator can see the content of the database, including user logins, passwords (hashed or not) and the like. I will not paste examples of SQL queries or detailed instructions, as they are easily accessible on the internet and in books.

Error-based

Error messages can contain information useful to execute the attack. An example is casting string to integer or concatenating character to the result of version function. Actual syntax depends on SQL dialect.

Blind (content-based)

Using boolean condition and substring function, blind attack is similar to bruteforcing. Its goal is to get the table content by fetching entries by single characters.

Blind (time-based)

This type of SQL injection is based on measuring the time between request and response.

Stacked queries

In Java, the java.sql.Statement.executeQuery() method is meant to execute a single query. If executing multiple statements in one call has been enabled, a malicious actor can attach another SQL query to the initial one, executing an SQL injection attack. For example, by attaching the DROP DATABASE users-- command.

In badly secured applications, where stored passwords are not hashed, it is possible to bypass login gateway. A hacker can also modify / update data, execute code remotely, load or save a file on a disk, and finally, inject commands into OS.

How to detect vulnerability?

At the very basic level, SQL injection vulnerabilities can be noticed on the front-end side (UI) and in REST API behaviour (given that the REST API fetches data from the database). Playing with web application search engines and endpoints through the UI and HTTP calls might be useful. Use the examples described earlier or find more SQL injection test cases for this purpose.

At the same time, the code base should be checked for vulnerabilities. Linters, code-style checkers, security scanners may detect potential dangers. As we could see earlier, the main problem lies in String concatenation.

Mitigation - query parametrization, prepared statements, stored procedures

From the programmer’s point of view, SQL injection can be mitigated using query parametrization: the input is not treated as a literal part of the SQL text, but as a separate value, which is bound to a placeholder in a pre-prepared SQL query. Because the query structure is fixed before the value is supplied, the input cannot change it, which seriously limits the possibility of SQL injection. In this solution, there is no String concatenation, so the risk is lower.

In Java, we can also use PreparedStatement. The refactored method looks like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SafeDatabaseQueryExecutor {

    public static void executeSearchQuery(String lastName) {
        Connection connection = null;
        PreparedStatement preparedStatement = null;

        try {
            // Set up the database connection (replace with your database URL, username, and password)
            String dbUrl = "jdbc:mysql://your-database-url";
            String dbUser = "your-username";
            String dbPassword = "your-password";

            connection = DriverManager.getConnection(dbUrl, dbUser, dbPassword);

            // Construct the SQL query with a placeholder for the last name
            String sqlQuery = "SELECT * FROM customer WHERE last_name LIKE ? AND active = true";
            preparedStatement = connection.prepareStatement(sqlQuery);

            // Set the parameter for the prepared statement
            preparedStatement.setString(1, "%" + lastName + "%");

            // Execute the SQL query
            ResultSet resultSet = preparedStatement.executeQuery();

            // Process the results (replace with your specific logic)
            while (resultSet.next()) {
                // Retrieve and process data here
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            // Close resources in the reverse order of their creation
            try {
                if (preparedStatement != null) {
                    preparedStatement.close();
                }
                if (connection != null) {
                    connection.close();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}

Stored procedures are predefined queries stored on the database server, allowing repeated execution across various applications. When implemented appropriately, stored procedures provide a level of defense against SQL injection attacks comparable to that of prepared statements. Controlling permissions for executing a stored procedure enables us to regulate access and, if needed, limit direct interaction with the underlying table, thus mitigating the potential impact of SQL injection.

Similar to prepared statements, stored procedures support parameterized queries, treating user input as data rather than executable SQL code. Parameters passed to the procedure are bound as typed values instead of being parsed as SQL, which prevents malicious code from being injected, as long as the procedure itself does not build SQL dynamically from them.

Stored procedures, like SQL queries, can be vulnerable to injection. To prevent this, parameterize the stored procedure queries instead of concatenating parameters.

Do not concatenate the query inside a stored procedure:

    SET @Statement = CONCAT('SELECT * FROM customer WHERE name = ''', customer_name, '''');

I would rather go for a prepared statement with a parameter:

PREPARE statement FROM 'SELECT * FROM customer WHERE name = ?';

More on avoiding SQL injection in Java and other languages (e.g. Python): Bobby Tables and OWASP

Frontend vs backend input validation? Both!

Sanitization involves the removal of undesirable characters (such as curly braces, quotes, slashes and backslashes) or unsafe code from user-supplied data. Validation, on the other hand, ensures that user-supplied data adheres to the expected format defined by the database. For instance, we can verify input length and reject excessively long inputs, as well as enforce specific formats for email addresses and dates. This approach makes it harder for attackers to submit specially crafted input values containing malicious SQL statements.

While sanitizing and validating input contributes to controlling the input in SQL queries, it’s important to note that it’s not foolproof. Attackers may employ techniques like double encoding to circumvent these safeguards.

Front-end validation is useful and user-friendly. The user receives immediate feedback if an input is invalid (along with the reason, at least in well-designed applications). Thanks to that, the backend does not need to handle multiple incorrect requests coming from the UI.

But front-end validation can be bypassed by a bad actor (a malicious user or an attacker). Validation on the front-end side alone will not protect against SQL injection. Angular, JavaScript, TypeScript and other front-end code is visible as script in the browser’s developer tools and can be manipulated and exploited.

Not to mention that in a client-server architecture one may send a direct HTTP call (e.g. via Postman or curl) to the backend application, bypassing UI validation. Such a call delivers a payload that has not been validated to the backend API, and the payload could contain SQL to be executed on the backend side during the attack.

In fact, front-end validation should always be paired with backend validation, for example with the Java and Spring validation APIs.

Whitelisting vs blacklisting: don’t rely on blacklisting

A whitelist (allowlist) enables us to establish precise rules that exclusively permit specific characters or patterns in the input, ensuring the rejection of any malicious input.

Compared to a blacklist (denylist), a whitelist proves to be a superior strategy for thwarting SQL injection attacks. By explicitly defining the permissible input types, a whitelist leaves less room for maneuvering, unlike a blacklist, which can be circumvented by attackers through input variation. In essence, a whitelist provides greater control over the accepted input.

Require proper formatting for text, date and numerical values. Use selection controls (drop-downs, calendars) where possible.
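
An allowlist rule can be sketched on the backend side as follows. This is a minimal illustration, not a complete validation layer: the class name LastNameValidator and the rule itself (1 to 50 characters; letters, spaces, hyphens and apostrophes only) are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical allowlist validator for a last-name search field.
class LastNameValidator {

    // Assumed rule: 1 to 50 characters; letters of any script, spaces, apostrophes and hyphens.
    private static final Pattern ALLOWED = Pattern.compile("^[\\p{L} '\\-]{1,50}$");

    // Returns true only when the whole input matches the allowlist.
    static boolean isValid(String input) {
        return input != null && ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("O'Brien"));                    // true
        System.out.println(isValid("Smith; DROP TABLE customer")); // false: ';' is not on the list
    }
}
```

Note that a realistic allowlist for names still has to accept the apostrophe, so this kind of check complements parametrized queries rather than replacing them.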

Restricted access - Principle of least privilege (PoLP)

According to this well-known rule:

Every module (such as a process, a user, or a program, depending on the subject) must be able to access only the information and resources that are necessary for its legitimate purpose. Saltzer, Jerome H.; Schroeder, Michael D. (1975). “The protection of information in computer systems”

Ensure that database accounts used by your application have the least necessary privileges to reduce the impact of a successful attack.

When establishing a database user for your application, it’s essential to carefully consider the privileges assigned to that user. For instance, does your application necessitate full access to read, write, and modify all databases? Should it have the authority to truncate or drop tables? By restricting your application’s access to the database, you can mitigate the potential impact of SQL injection attacks. Instead of relying on a single database user for your application, it is advisable to create multiple database users and associate them with specific application roles. Security vulnerabilities often propagate like a chain reaction, so it’s imperative to remain vigilant about each link in the chain to prevent significant harm.

Database hardening

An app should not have admin privileges when connecting to the database. Even if someone injects some malicious code, chances are the damage will be limited.
Users should be separated when connecting to a given database, even within one application. Preferably, databases should be separated too. This should be doable in microservices. Thus, SQL injection into one table most probably won’t extend to other tables, and breaking into one database will not hurt the others.
Disable stacked queries so that another SQL query cannot be appended to the initial one during an attack. Fetching or altering the data step by step is more tedious and time-consuming than deleting a table or a database at once.
Check the given database engine for vulnerabilities. Disable dangerous options.
The database should not have root permissions in the operating system.

ORM layer

Object-relational mapping (ORM) layer can also be your line of defense.

But do not shoot yourself in the foot: the easy-to-use Hibernate framework, applied with only superficial knowledge, can be far less efficient than low-level JPA & JDBC solutions, not to mention pure SQL written by an expert in a given database engine!

An ORM layer translates data between the database and objects bidirectionally, reducing explicit SQL queries and minimizing the risk of SQL injection. However, when custom queries are needed, Hibernate in Java introduces Hibernate Query Language (HQL), requiring careful use of the createQuery() function to mitigate injection risks. Despite the benefits, it’s crucial to acknowledge that ORM libraries must convert logic back to SQL statements, necessitating trust in proper parameter escaping. To ensure the absence of SQL injection vulnerabilities, regularly scan for known weaknesses and avoid outdated library versions.

SQL security basics

2023-09-11T20:23:00+00:00

Previously on SQL: Advanced SQL for Java developers: cursor, function, index.

Normalized database vs denormalized database

A normalized database is optimized for minimizing redundancy, not for the lowest possible read time. Such a database contains many tables and relies on joins and rather complex queries.

In a normalized database, data are organized into multiple related tables. Each table is designed to store a specific type of data, and relationships between tables are established through foreign keys.

Storage efficiency is better, as data are stored in the most space-efficient manner. Read efficiency is harder to achieve, because a query often has to retrieve data from multiple related tables. While this can be computationally expensive, it allows for flexibility in querying the data.

With proper normalization, data consistency is usually easier to maintain, as changes to data only need to be made in one place (the corresponding table). Normalized databases are typically favored for systems where data integrity and consistency are critical, such as financial and transactional systems.

A denormalized database is optimized for read time, not for minimizing redundancy. Such a database keeps as many columns in one table as possible, so there is no need to create many smaller tables. It does not look like a clean solution, but reads are faster. Data are stored in fewer tables, and there may be duplication of data across tables. This is done to reduce the need for JOIN operations. Denormalization can be more storage-intensive because it involves redundancy, but it can be a valid design choice if it serves specific performance needs.

Security in normalized and denormalized database

Normalization aims to minimize data redundancy and ensure data integrity. By breaking data into smaller, related tables, it reduces the risk of data anomalies, such as insertion, update, and deletion anomalies. Normalized databases are typically favored for systems where data integrity and consistency are critical, such as financial and transactional systems.

In normalized databases, the primary concern is managing data relationships, and security measures should focus on access controls, auditing, and preventing unauthorized changes to the database schema, as the structure is more complex.

Denormalization can lead to some loss of data integrity, as redundancy increases the risk of anomalies. Here, data duplication can be a problem.

In denormalized databases, security measures need to consider data duplication, as the same data might exist in multiple places. Special attention must be paid to keeping all copies of the data secure and ensuring consistency in access controls.

Integrity

In SQL databases, data integrity refers to the accuracy and consistency of data stored in the database. There are several types of data integrity in SQL databases, each serving a specific purpose. These integrity constraints help ensure that data remains reliable and valid.

Entity integrity: No duplicate rows exist in a table.

Entity integrity ensures that each row (record) in a table is uniquely identifiable, typically through a primary key. This means that each row must have a unique value in its primary key column, preventing duplicate records in the table.

Domain integrity: Restricting the type of values that one can insert in order to enforce correct values (in Java, for example, using enums may be helpful).

Domain integrity enforces that data values in a column meet specific criteria, such as data type, format, and allowable values. Common examples include ensuring that a date column contains valid dates or that an integer column contains only whole numbers.

Referential integrity: Records that are used by other records cannot be deleted (using constraints).

Referential integrity establishes and enforces relationships between tables through foreign keys. It ensures that data in a foreign key column in one table corresponds to data in the primary key column of another table. This constraint prevents orphaned records and maintains the consistency of relationships. Here, for example, you won’t be able to delete a record from one table which is related to another table via constraint. Either you delete both of them, or none.

Cascading actions, such as ON DELETE CASCADE and ON UPDATE CASCADE, are often associated with referential integrity. They define what should happen to related records when a referenced record is deleted or updated. Cascading actions can help maintain data consistency.

Custom integrity / custom constraints

User-defined integrity allows users to define custom constraints or business rules to maintain data integrity. This can include rules specific to a particular application or domain, ensuring that data adheres to business logic: validation rules, data calculations, and workflow-related checks.

Triggers are event-driven actions that can be executed automatically in response to changes in the database. Triggers can enforce custom data integrity rules and actions.

A combination of integrity requirements may occur, like domain key integrity, which combines elements of both domain and entity integrity by ensuring that the primary key values in a table are unique and also meet domain constraints, such as data type and format requirements.

Integrity and security

Database integrity, in the context of security, refers to the fundamental aspect of ensuring the accuracy, consistency, and reliability of the data stored in a database as a means to enhance data security.

Ensuring data integrity means that data stored in the database is accurate and free from errors. Accuracy is crucial for making informed decisions and avoiding security incidents that could arise from erroneous data.

Consistency prevents data anomalies that might be exploited for security breaches.

Part of data integrity involves validating and sanitizing data input to the database. This practice minimizes the risk of SQL injection and other security vulnerabilities that could compromise the integrity and security of the database.

Data integrity helps prevent unauthorized changes or tampering with the data. This safeguards against both accidental and malicious alterations to the data.

Data corruption can lead to security risks, as corrupted data might have unpredictable consequences on the application. Data integrity measures help minimize the risk of data corruption by ensuring that data is stored consistently and accurately.

Even within an organization, there is a risk of insider threats. Data integrity measures, such as access controls and audit trails, help detect and prevent unauthorized access, alterations, or exfiltration of data by employees or insiders.

Last but not least, maintaining data quality and ensuring data recovery and continuity are also important features of integrity that serve the purpose of security. In the event of a security incident or data breach, maintaining data integrity ensures that backup and recovery processes can restore a reliable and consistent database state. Data integrity is essential for business continuity and disaster recovery planning.

In summary, database integrity plays a crucial role in data security by ensuring the accuracy, consistency, and reliability of the data stored in the database. When data is trustworthy, it reduces the likelihood of security incidents, minimizes vulnerabilities, and supports the overall security of the database and the applications that rely on it. Data integrity and security measures work hand in hand to protect sensitive information and maintain the integrity of the database.

Idempotent vs deterministic function

Although the two terms are sometimes confused, these are two different concepts.

An idempotent function is a function that has the same effect whether it is applied once or multiple times: the result is the same. Idempotency is used to ensure that an operation requested multiple times has the effect of being performed only once. An example of an idempotent function is SQL DELETE. Deleting a resource one time or multiple times has the same result: the resource has been deleted.

In mathematics: f(f(x)) = f(x)

On the other hand, a deterministic function is a function in which the output is completely determined by the input. In other words, given the same input, a deterministic function will always produce the same output, making it predictable and consistent.

For each input x, there is exactly one corresponding output y = f(x).

Deterministic functions are commonly used in various fields, including computer science, cryptography, and databases. They are valuable for ensuring data consistency and predictability, as they guarantee that the same input will always result in the same output.

A common use case for deterministic functions is unit testing: for the same input data, the result of a unit test is always the same. There should not be any other factors impacting the result. If such a test starts to fail, it means the code has been broken.

In SQL, examples of deterministic functions are mathematical and date and time functions: SELECT ABS(-9), SELECT 2 - 2, SELECT DATEADD(day, 5, '2023-01-01').

Non-deterministic SQL functions include those that return a random number, generate a UUID or return the current user.
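
The difference between the two properties can be illustrated outside SQL as well. The following Java sketch is only an analogy; the class and method names are made up for this example. It shows an idempotent operation, a function that is deterministic but not idempotent, and a non-deterministic one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class IdempotentVsDeterministic {

    // Idempotent: removing the same id once or many times leaves the same state.
    static void delete(Set<Integer> ids, int id) {
        ids.remove(Integer.valueOf(id));
    }

    // Deterministic and idempotent: abs(abs(x)) == abs(x).
    static int abs(int x) {
        return Math.abs(x);
    }

    // Deterministic but not idempotent: increment(increment(x)) != increment(x).
    static int increment(int x) {
        return x + 1;
    }

    // Non-deterministic: a different result on each call.
    static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        Set<Integer> ids = new HashSet<>(Arrays.asList(1, 2, 3));
        delete(ids, 2);
        Set<Integer> afterFirstDelete = new HashSet<>(ids);
        delete(ids, 2); // second call changes nothing
        System.out.println(ids.equals(afterFirstDelete));             // true
        System.out.println(abs(abs(-9)) == abs(-9));                  // true
        System.out.println(increment(increment(1)) == increment(1));  // false
        System.out.println(newId().equals(newId()));                  // false
    }
}
```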

Idempotent and deterministic SQL functions - implications for security

Idempotent functions can help prevent data corruption and security incidents. For example, an idempotent SQL DELETE, or an UPDATE that sets a column to a fixed value, ensures that critical data is not accidentally or maliciously deleted or modified multiple times. Idempotent functions are often used within transactions to ensure that a sequence of operations can be safely retried without introducing data inconsistencies (atomic operations).

One of their features is preventing unwanted side effects. Idempotent functions follow the Security Through Predictable Behavior principle. The predictability of idempotent functions can help prevent unwanted side effects or actions that could lead to security incidents. When a function’s behavior is consistent, it is easier to anticipate and control its impact on the database. In disaster recovery and backup scenarios, idempotent operations can be valuable for restoring the database to a known state without introducing additional inconsistencies.

Deterministic functions are critical for data integrity and help in auditing and compliance efforts. Deterministic functions are commonly used in cryptographic operations. They ensure consistent encryption and decryption, which is essential for data security.

Database monitoring, SQL optimization & transactions

2023-09-01T08:00:00+00:00

What is a query execution plan?

A query plan, also known as an execution plan or query execution plan, is a detailed, step-by-step blueprint that the database management system (DBMS) uses to execute a specific SQL query. The query plan is generated by the query optimizer, a component of the DBMS responsible for determining the most efficient way to execute a query based on the database schema, indexes, statistics, and other factors.

The query plan provides insights into how the DBMS will retrieve and process the data to satisfy the query, including details on which indexes will be used, the order of table access, and the algorithms employed for sorting and joining data. Understanding and analyzing the query plan can be crucial for optimizing the performance of SQL queries.

How to find a query execution plan?

Many relational database systems support the EXPLAIN statement, which provides information about how a query will be executed without actually executing it. For example:

EXPLAIN SELECT column1, column2 FROM my_table WHERE column1 = 'empty' AND column2 = 'non_empty';

Different database systems have specific commands to obtain query plans. For example:

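PostgreSQL: use EXPLAIN or EXPLAIN ANALYZE (the latter actually executes the query)
MySQL: use EXPLAIN, or EXPLAIN ANALYZE since version 8.0.18 (EXPLAIN EXTENDED was removed in 8.0)
SQL Server: use SET SHOWPLAN_XML ON
MariaDB: EXPLAIN with many more options…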

Feel free to check documentation / manual for given SQL flavour. Almost all is there!

Some database management tools provide graphical representations of query plans, for example, DBeaver

IntelliJ IDEA supports two types of execution plans:

Explain Plan: the result is shown in a mixed tree and table format on a dedicated Plan tab.
Explain Plan (Raw): the result is shown in a table format.

How to interpret a query execution plan?

Table access: look for information on how tables are accessed, including whether full table scans or index scans are used. Consider whether indexes are being utilized effectively.

Joins: check how joins between tables are executed. Different join algorithms (nested loops, hash joins, merge joins) have different performance characteristics. The choice of the join algorithm depends on the size of the tables and the available indexes.

Filter predicates: examine the conditions used to filter rows. Ensure that indexes are used for selective conditions and that the query is leveraging the available statistics.

Sorting and group operations: check for any sorting or grouping operations. Determine if the query plan is using indexes or other methods to satisfy these operations.

Index usage: verify that indexes are being used efficiently. Check if the indexes cover the columns needed for the query and if they are selective.

Parallel execution: some query plans may involve parallel execution, where multiple processes are used to speed up the query. Understand if and how parallelism is being employed.

Use flame graph!

In IntelliJ IDEA, Flame Graph is a part of the Query Execution Plan feature. A flame graph in the context of SQL typically refers to a visualization technique used for profiling and analyzing the performance of SQL queries or database operations. While flame graphs, in the broader sense, are often associated with the visualization of stack traces in programming, a flame graph in the SQL domain focuses on representing the execution flow and time distribution of SQL queries. Sometimes flame graphs represent not only SQL queries, but also the REST API requests associated with them, along with the microservices that handle the complete execution flow (for example, in Datadog).

Get familiar with database profiler

A database profiler is a tool or feature provided by database management systems (DBMS) to capture and analyze information about the execution of SQL queries and operations against the database. Profilers are valuable for performance tuning, optimization, and troubleshooting, as they allow database administrators and developers to identify bottlenecks, inefficient queries, and areas for improvement in the database system.

Profilers provide access to query execution plans, but they offer many more features.

Profilers capture detailed statistics about the execution of SQL queries, including the time taken for execution, resource usage (CPU, memory, disk I/O), and the number of rows affected.

They can report on locking and blocking issues, helping to identify situations where transactions are contending for the same resources and causing delays. Profilers provide information about the start and end of transactions, as well as the duration and resource consumption of transactions. This can be essential for understanding the impact of transactions on overall system performance. You will find details about active database sessions and connections, including the users accessing the database, the duration of their sessions, and the resources they are consuming. Profilers may log errors and exceptions encountered during query execution. Some of them offer real-time monitoring capabilities.

Finally, profilers may support the creation of triggers or events that automatically capture information when specific conditions are met. For example, a profiler might capture information whenever a query takes longer than a defined threshold.

How to optimize database?

Use indexes wisely

Ensure that tables are appropriately indexed based on the queries being executed. Analyze if the existing indexes are being utilized effectively.

Update statistics

Regularly update table statistics to provide the query optimizer with accurate information about the distribution of data in tables.

Consider query rewriting

In some cases, rewriting the query or restructuring the schema can lead to more efficient query plans.

Avoid functions on indexed columns

Avoid using functions on columns involved in WHERE clauses, as it may prevent the use of indexes.
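
As an illustration, a filter such as "all rows from a given year" can be expressed as a range on the bare column instead of a function call on it. The sketch below is written under assumptions: the table orders, the column created_at and the class YearRange are hypothetical, and whether an index is really used depends on the database engine (some engines also support indexes on expressions).

```java
import java.time.LocalDate;

// Hypothetical helper: turns "all rows from a given year" into a half-open date range.
class YearRange {

    // Wraps the column in a function, so a plain index on created_at usually cannot be used.
    static final String WITH_FUNCTION =
            "SELECT * FROM orders WHERE YEAR(created_at) = ?";

    // Compares the bare column against a range, so a plain index on created_at can be used.
    static final String WITH_RANGE =
            "SELECT * FROM orders WHERE created_at >= ? AND created_at < ?";

    final LocalDate fromInclusive;
    final LocalDate toExclusive;

    YearRange(int year) {
        this.fromInclusive = LocalDate.of(year, 1, 1);
        this.toExclusive = LocalDate.of(year + 1, 1, 1);
    }

    public static void main(String[] args) {
        YearRange range = new YearRange(2023);
        System.out.println(WITH_RANGE);
        // The two bounds would be bound to the two placeholders of WITH_RANGE.
        System.out.println(range.fromInclusive + " .. " + range.toExclusive); // 2023-01-01 .. 2024-01-01
    }
}
```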

Thoroughly check joins

Ensure that join conditions are well-defined and that indexes are available for columns used in join conditions.

Review transactions

Think about transactions: are they used effectively? Is locking strategy adequate to the purpose?

What is a transaction?

In SQL, a transaction is a sequence of one or more SQL statements that are executed as a single, indivisible unit of work. The properties of a transaction are often described by the ACID properties, which stand for Atomicity, Consistency, Isolation, and Durability.

In general, transaction phases are:

When to use transactions?

Transactions are a good solution in the following situations:

When not to use transactions?

In these cases, avoid transactions:

simple read-only operations
where high concurrency is a top priority and conflicts are unlikely
individual, independent operations
performance-critical scenarios
where data is cached or denormalized
bulk data loading or large-scale batch processing
non-critical data, short-lived operations
logging & auditing

Deadlock

A deadlock in SQL occurs when two or more transactions are blocked, each waiting for the other to release a lock on a resource, resulting in a circular waiting condition. It must be avoided at all costs.
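
One common way to rule out the circular wait is to make every transaction acquire its locks in the same order. The Java sketch below is an in-memory analogy, not database code: the Account class and its ReentrantLock stand in for rows and row locks, and all names are assumptions made for this example.

```java
import java.util.concurrent.locks.ReentrantLock;

// In-memory analogue of two transactions that each need locks on the same two rows.
class OrderedLockingSketch {

    static final class Account {
        final int id;
        long balance;
        final ReentrantLock lock = new ReentrantLock(); // stands in for a row lock

        Account(int id, long balance) {
            this.id = id;
            this.balance = balance;
        }
    }

    // Always lock the account with the smaller id first, whatever the transfer direction,
    // so two transfers can never wait for each other in a cycle.
    static void transfer(Account from, Account to, long amount) {
        Account first = from.id <= to.id ? from : to;
        Account second = from.id <= to.id ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    // Runs transfers in opposite directions on two threads and returns the final balances.
    static long[] runOpposingTransfers(int transfersPerThread) {
        Account a = new Account(1, 1_000);
        Account b = new Account(2, 1_000);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < transfersPerThread; i++) transfer(a, b, 1);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < transfersPerThread; i++) transfer(b, a, 1);
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for transfers", e);
        }
        return new long[] {a.balance, b.balance};
    }

    public static void main(String[] args) {
        long[] balances = runOpposingTransfers(10_000);
        System.out.println(balances[0] + " " + balances[1]); // 1000 1000
    }
}
```

If each thread locked "from" first and "to" second, the two threads could each hold one lock and wait forever for the other one, which is exactly the circular wait described above.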

Optimistic lock

Locking is a way of preventing lost updates. An optimistic lock checks whether the value to be updated has changed since it was last read. The optimistic locking approach allows multiple transactions to proceed with their operations without acquiring locks on the data. Instead, it relies on a mechanism to detect conflicts and resolve them at the time of committing the changes.

The transaction does not acquire any lock.

First, it reads data (1) and records some form of a version identifier associated with the data (e.g., a timestamp, a version number, a hash value).

Then the second phase starts: update (2).

During the validation phase (3), it checks for any modifications made by another transaction in the meantime. This is typically done by comparing the recorded version identifier with the current version of the data.

Commit / rollback phase (4): if there are no changes, commit. If there are changes, perform a rollback or conflict resolution.
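
The four phases can be sketched in plain Java with a version number. This is an in-memory analogy under stated assumptions, not the way any particular database or ORM implements it: the class OptimisticLockSketch, the Versioned record and the single "row" are made up for this example.

```java
import java.util.concurrent.atomic.AtomicReference;

class OptimisticLockSketch {

    // Immutable snapshot of a row: the value plus its version number.
    static final class Versioned {
        final String value;
        final long version;

        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final AtomicReference<Versioned> row =
            new AtomicReference<>(new Versioned("initial", 0));

    // Phase 1: read the data together with its version; no lock is taken.
    Versioned read() {
        return row.get();
    }

    // Phases 3 and 4: validate the recorded version, then commit or report a conflict.
    // In SQL this is typically an UPDATE with "WHERE version = ?" and a check of the affected row count.
    boolean tryCommit(Versioned readSnapshot, String newValue) {
        Versioned current = row.get();
        if (current.version != readSnapshot.version) {
            return false; // someone else committed in the meantime
        }
        // compareAndSet makes validation and commit a single atomic step.
        return row.compareAndSet(current, new Versioned(newValue, current.version + 1));
    }

    public static void main(String[] args) {
        OptimisticLockSketch store = new OptimisticLockSketch();
        Versioned seenByA = store.read();
        Versioned seenByB = store.read();
        // Phase 2 (update) happens on each caller's side, here simply as a new value.
        System.out.println(store.tryCommit(seenByA, "from A")); // true
        System.out.println(store.tryCommit(seenByB, "from B")); // false: the version has changed
        System.out.println(store.read().value);                 // from A
    }
}
```

The caller whose commit is rejected decides what to do next: re-read and retry, merge, or give up, which corresponds to the rollback or conflict resolution step.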

When is optimistic locking a good strategy?

With high concurrency requirements: in scenarios where high levels of concurrent access to the data are crucial, optimistic locking can be more suitable. It allows multiple transactions to read and modify data concurrently, reducing contention and increasing overall system performance.

With low risk of conflicts: when the likelihood of conflicts between transactions is low, optimistic locking is an efficient choice. If the data is not frequently updated by multiple transactions simultaneously, the overhead of acquiring and releasing locks may be unnecessary.

For short transactions: optimistic locking is well-suited for short-duration transactions where the time between reading and updating the data is minimal. Short transactions reduce the window during which conflicts might occur, making it less likely for two transactions to modify the same data concurrently.

When optimizing read-heavy workloads: in situations where the workload is predominantly read-heavy, and write operations are infrequent, optimistic locking can be effective. Readers are not impeded by locks, and conflicts during write operations are addressed when they occur.

To reduce lock contention: optimistic locking helps in reducing lock contention (competition for acquiring locks). By allowing multiple transactions to read data simultaneously and only checking for conflicts at the time of update, contention is minimized.

Optimistic locking is often more compatible with distributed systems. In scenarios where data is distributed across multiple nodes or databases, acquiring locks might be challenging or impractical. Optimistic locking allows for a more decentralized approach.

Optimistic locking is commonly used in scenarios where data may be edited offline, and changes need to be merged with the central database. Each offline editor can make changes independently, and conflicts are resolved when attempting to merge the changes. It seems to be the way the deck synchronization in Anki works.

It is also a more scalable solution for systems with a large number of transactions and a desire to reduce the load on the database caused by acquiring and releasing locks.

Pessimistic lock

It is another way of preventing lost updates. A pessimistic lock explicitly forces other threads to wait until an update is done.

In this strategy, a lock is acquired and held until commit / rollback. In many cases, it is an exclusive lock. Another type of pessimistic lock is a shared lock, which might later be escalated to an exclusive lock.

Pessimistic locking may lead to deadlocks. Pessimistic locking is often associated with higher isolation levels, with more consistency and less concurrency.
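
The "lock first, then read and update" idea can be sketched in plain Java. This is an in-memory analogy only: the ReentrantLock stands in for an exclusive row lock (in SQL, something like SELECT ... FOR UPDATE), and the class name and the stock scenario are assumptions made for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class PessimisticLockSketch {

    private final ReentrantLock rowLock = new ReentrantLock(); // stands in for an exclusive row lock
    private int stock;

    PessimisticLockSketch(int initialStock) {
        this.stock = initialStock;
    }

    // Acquire the lock before reading and hold it until the update is done.
    boolean reserveOne() {
        rowLock.lock(); // other threads wait here
        try {
            if (stock == 0) {
                return false;
            }
            stock--;
            return true;
        } finally {
            rowLock.unlock(); // corresponds to commit / rollback releasing the lock
        }
    }

    // Several threads compete for a limited stock; returns the number of successful reservations.
    static int runConcurrentReservations(int initialStock, int threadCount, int attemptsPerThread) {
        PessimisticLockSketch item = new PessimisticLockSketch(initialStock);
        AtomicInteger successes = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < attemptsPerThread; i++) {
                    if (item.reserveOne()) {
                        successes.incrementAndGet();
                    }
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reservations", e);
        }
        return successes.get();
    }

    public static void main(String[] args) {
        // 4 threads x 50 attempts = 200 attempts for 100 items: exactly 100 succeed.
        System.out.println(runConcurrentReservations(100, 4, 50)); // 100
    }
}
```

Without the lock, two threads could both read the same stock value and both decrement it, which is the lost update this strategy is meant to prevent.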

When is pessimistic locking a good strategy?

Where certain sections of code or database operations are critical and must be executed without interference from other transactions, pessimistic locking can be beneficial. This ensures that only one transaction at a time can access or modify the protected resource.

When maintaining data integrity is a top priority, pessimistic locking can be appropriate. For example, if an application enforces business rules that require consistency in data relationships, acquiring locks during transactions helps prevent concurrent modifications that could violate those rules.

In situations where transactions involve resource-intensive operations or complex calculations, pessimistic locking can be used to avoid conflicts and ensure that a transaction completes without interference from other transactions. Also, when transactions involve multiple steps or span different parts of the application, pessimistic locking can be used to ensure that the entire transaction is executed atomically without interference from other transactions.

Pessimistic locking is effective in preventing race conditions, where multiple transactions compete to read or modify the same data simultaneously. By acquiring locks, the system can control access and avoid conflicts.

In batch processing scenarios where large volumes of data are processed, pessimistic locking can help maintain order and prevent concurrent transactions from affecting each other. This is especially important when the order of processing is crucial.

To maintain consistency in distributed systems with shared resources, pessimistic locking can be used to ensure that only one node at a time makes modifications to a shared resource.

What is ACID?

Transactions should be designed and implemented according to the ACID rules.

Atomicity (A) ensures that a transaction is treated as a single, indivisible unit of work. Either all the changes made within the transaction are committed to the database, or none of them are. If any part of the transaction fails, the entire transaction is rolled back to its previous state.

Consistency (C) ensures that a transaction brings the database from one valid state to another. The database must satisfy certain integrity constraints before and after the transaction. If a transaction violates any integrity constraints, the database is left unchanged.

Isolation (I) ensures that the execution of one transaction is isolated from the execution of other transactions, even if they are executed concurrently.

Durability (D) guarantees that once a transaction is committed, its effects are permanent and survive subsequent system failures. The changes made by the transaction are stored in non-volatile storage (such as disk) and can be recovered even if the system crashes or restarts.
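
Atomicity and consistency can be illustrated with a small in-memory analogy. This is a single-threaded sketch under stated assumptions, not a transaction manager: the class AtomicTransferSketch, the account names and the "no negative balance" rule are made up for this example, and isolation and durability are not covered.

```java
import java.util.HashMap;
import java.util.Map;

class AtomicTransferSketch {

    private Map<String, Long> balances = new HashMap<>();

    void open(String account, long amount) {
        balances.put(account, amount);
    }

    long balanceOf(String account) {
        return balances.getOrDefault(account, 0L);
    }

    // Either both steps become visible, or none of them does.
    void transfer(String from, String to, long amount) {
        Map<String, Long> working = new HashMap<>(balances); // work on a private copy
        working.merge(to, amount, Long::sum);                 // step 1: credit
        long remaining = working.getOrDefault(from, 0L) - amount; // step 2: debit
        if (remaining < 0) {
            // Integrity rule violated: the copy is discarded, which acts as a rollback.
            throw new IllegalStateException("insufficient funds");
        }
        working.put(from, remaining);
        balances = working; // commit: the new state replaces the old one in one step
    }

    public static void main(String[] args) {
        AtomicTransferSketch bank = new AtomicTransferSketch();
        bank.open("alice", 100);
        bank.open("bob", 0);
        bank.transfer("alice", "bob", 30);
        try {
            bank.transfer("alice", "bob", 500); // fails after the credit step
        } catch (IllegalStateException e) {
            System.out.println("rolled back: " + e.getMessage());
        }
        // The failed transfer left no trace, although its first step had already run.
        System.out.println(bank.balanceOf("alice") + " " + bank.balanceOf("bob")); // 70 30
    }
}
```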

SQL cheatsheet: part 7

2023-08-14T20:23:00+00:00

Previously on SQL: Advanced SQL for Java developers: procedure, view

What is a SQL cursor?

It allows you to retrieve and manipulate rows from a result set one at a time. Mainly used for iteration. Rows under cursor can be transformed (e.g. updated, deleted). Other purposes: pagination, data validation.

Cursor: example of use

The provided code is a SQL script that creates a stored procedure in a database. This stored procedure uses a cursor to iterate through records in a table and display the values of the columns in each row as it traverses them.

-- db cursor is a kind of a pointer similar to Java iterator
-- implemented as stored procedure
-- cursor iterates through records one by one
DELIMITER $$
CREATE PROCEDURE pointer()
BEGIN
    -- variables holding values of columns in the row currently being traversed by the cursor
    DECLARE cursor_company_id INT;
    DECLARE cursor_company_name VARCHAR(255);
    DECLARE cursor_country_code VARCHAR(4);
    -- boolean variable (false / true flag) showing if cursor iteration is finished
    DECLARE iteration_completed BIT DEFAULT 0;
    -- cursor declaration
    DECLARE company_cursor CURSOR FOR
    SELECT company_id, name, hq_country FROM company;
    -- handler of continue type launched when not found occurs
    -- not found means no more rows to iterate
    -- in case of not found flag is raised
    DECLARE CONTINUE HANDLER FOR NOT FOUND
    SET iteration_completed = 1;
    -- opening cursor and fetching first row, procedure starts
    -- first row is mapped to declared variables
    OPEN company_cursor;
    FETCH company_cursor INTO cursor_company_id, cursor_company_name, cursor_country_code;
    -- 'WHILE' loop (do as long as no empty rows)
    WHILE iteration_completed = 0 DO
        -- display the variables holding the values of the row currently under the cursor
        SELECT cursor_company_id, cursor_company_name, cursor_country_code;
        -- map the next row to the declared variables, repeat the flow
        FETCH company_cursor INTO cursor_company_id, cursor_company_name, cursor_country_code;
    END WHILE;
    -- procedure ends, cursor closed
    CLOSE company_cursor;
END$$
DELIMITER ;

Let’s go through it step by step.

DELIMITER $$: This statement changes the delimiter used in the SQL script to $$. It allows you to define the stored procedure using multiple SQL statements, because the semicolons inside the procedure body no longer end the CREATE PROCEDURE statement.

CREATE PROCEDURE pointer(): This line begins the definition of the pointer stored procedure. The procedure has no parameters.

BEGIN: This keyword marks the beginning of the procedure’s executable code block.

DECLARE statements: In this section, several local variables are declared for storing values from the rows as the cursor iterates through the result set. These variables include cursor_company_id, cursor_company_name, cursor_country_code, and iteration_completed.

cursor_company_id: It will hold the company_id value from the current row.
cursor_company_name: It will hold the name value from the current row.
cursor_country_code: It will hold the hq_country value from the current row.
iteration_completed: This is a boolean variable used to indicate whether the cursor iteration is finished. It’s initialized to 0 (false).

DECLARE company_cursor CURSOR FOR …: Here, a cursor named company_cursor is declared. The cursor is associated with a SELECT statement that retrieves data from the company table, specifically the company_id, name, and hq_country columns.

DECLARE CONTINUE HANDLER FOR NOT FOUND …: This line declares a handler for the NOT FOUND condition. It means that when the cursor reaches the end of the result set (no more rows to iterate), the iteration_completed variable will be set to 1, indicating that the cursor iteration is completed.

OPEN company_cursor;: This statement opens the cursor, allowing it to start iterating through the rows of the result set.

FETCH company_cursor INTO …: This line fetches the first row from the cursor’s result set and maps the values of the columns (company_id, name, and hq_country) to the corresponding declared variables.

WHILE iteration_completed = 0 DO … END WHILE;: This section of the code creates a WHILE loop. The loop will continue executing as long as the iteration_completed variable is 0. Inside the loop, it displays the values of the declared variables containing the current row’s data and then fetches the next row.

CLOSE company_cursor;: After the loop finishes, this statement closes the cursor to release resources.

END: This keyword marks the end of the stored procedure definition.

In summary, this stored procedure (pointer) iterates through the records of the company table using a cursor. It displays the values of each row’s columns as it traverses them. The loop continues until there are no more rows to fetch, at which point the cursor is closed, and the procedure ends.
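The same open / fetch / loop / close flow can be sketched on the client side. This is an analogy only, written with Python's built-in sqlite3 module rather than a MySQL stored procedure; the sample rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT, hq_country TEXT)"
)
conn.executemany(
    "INSERT INTO company VALUES (?, ?, ?)",
    [(1, "Sakura", "JPN"), (2, "Wisla", "POL")],
)

# OPEN: executing the query gives a cursor positioned before the first row
cursor = conn.execute(
    "SELECT company_id, name, hq_country FROM company ORDER BY company_id"
)
visited = []
row = cursor.fetchone()        # FETCH the first row
while row is not None:         # None plays the role of the NOT FOUND handler
    visited.append(row)        # "display" the row currently under the cursor
    row = cursor.fetchone()    # FETCH the next row, repeat the flow
cursor.close()                 # CLOSE releases the cursor
print(visited)
```

As in the procedure, the first fetch happens before the loop and every iteration ends with another fetch, so the loop stops exactly when there is nothing left to read.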

What is a SQL cursor for?

In SQL, a cursor is a database object that allows you to retrieve and manipulate rows from a result set one at a time. Cursors are commonly used within stored procedures or other database objects to navigate through the records in a result set, perform operations on each record, and manage the flow of data processing.

Common use cases:

Iterating through records - cursors are used to loop through the rows returned by a query one by one, allowing you to perform actions on each row.

Processing and transforming data - cursors are helpful when you need to apply complex calculations, transformations, or business logic to individual rows within the result set.

Data validation and error handling - cursors can be used to validate data, perform data integrity checks, and handle exceptions or errors on a per-row basis.

Cursor-based pagination - cursors can be used for paginating large result sets. You fetch a certain number of rows at a time, improving performance and reducing memory consumption.

Various kinds of SQL cursor

There are multiple types of cursor, depending on the SQL flavour:

Forward-only. This type of cursor can only navigate forward through the result set, making it suitable for read-only operations.

Scrollable cursors can move both forward and backward within the result set, allowing you to revisit previous rows.

A static cursor populates the result set at the time of cursor creation, and the query result is cached for the lifetime of the cursor. A static cursor can move both forward and backward. It is slower and uses more memory than other cursor types, so you should use it only if scrolling is required and other types of cursors are not suitable. No UPDATE, INSERT, or DELETE operations are reflected in a static cursor (unless the cursor is closed and reopened). By default, static cursors are scrollable. SQL Server static cursors are always read-only.

A dynamic cursor allows you to see the data update, deletion and insertion in the data source while the cursor is open. Hence, a dynamic cursor is sensitive to any changes to the data source and supports update, delete operations. By default, dynamic cursors are scrollable.

MySQL cursor is read-only, non-scrollable and asensitive.

Read-only: you cannot update data in the underlying table through the cursor.
Non-scrollable: you can only fetch rows in the order determined by the SELECT statement. You cannot fetch rows in reverse order, skip rows or jump to a specific row in the result set.
Asensitive: a cursor can be asensitive or insensitive. An insensitive cursor uses a temporary copy of the data, whereas an asensitive cursor may point to the actual data (in MySQL the server may or may not make a copy of the result table). Without the temporary copy an asensitive cursor is faster, but any change made to the data from other connections can affect the rows it sees. Therefore it is safer not to update data that is being used by an asensitive cursor.

Function

In SQL, a function is a database object that allows you to encapsulate a set of SQL statements or expressions into a reusable and named unit. Functions take zero or more input parameters, perform specific operations or calculations, and return a result. SQL functions can be used in queries, data manipulation, and various SQL statements to simplify and modularize database operations. There are different types of SQL functions: scalar functions (return a single value), table-valued functions (return a result set), aggregations, date-time functions and the like. This is another story related to SQL server internals.

How can we program a SQL function to compute a logarithm:

-- CREATE FUNCTION
DELIMITER $$
CREATE FUNCTION logarithm(
    base INT,
    n INT
    )
    RETURNS DOUBLE DETERMINISTIC -- deterministic: the same arguments always give the same result
BEGIN
    -- local variable holding the effective base, 2 by default
    DECLARE a INT DEFAULT 2;
    -- n must be > 0, return NULL as the logarithm cannot be computed
    IF n <= 0 THEN
        RETURN NULL;
    END IF;
    -- log base cannot be 0 or 1, keep the default 2 in such a case
    IF base > 1 THEN
        SET a = base;
    END IF;
    RETURN LOG(a, n);
END$$

-- HOW TO CALL
SELECT logarithm(2, 128)$$
DELIMITER ;

Function vs stored procedure

Function:

database object, a set of SQL statements or expressions wrapped into a reusable and named single unit
returns value(s) (single value - scalar function, result set - table-valued functions).
they can be used as expressions in queries, such as SELECT, WHERE…
scalar functions can be used to modify data, but they are typically designed for computations and transformations
designed to be used as read-only, they cannot contain transaction control statements like COMMIT or ROLLBACK

Stored procedures:

may or may not return values - it is optional
called explicitly with CALL (MySQL) or EXECUTE / EXEC (SQL Server), cannot be used directly in a SELECT statement or a WHERE clause
can include data modification statements (INSERT, UPDATE, DELETE)
suitable for operations that involve both reading and writing data
can contain transaction control statements (COMMIT, ROLLBACK), allowing for explicit control over transactions

Triggers

A trigger is a set of instructions or a program that is automatically executed (“triggered”) in response to specific events on a particular table or view. These events can include data manipulation language (DML) events like INSERT, UPDATE, DELETE, or data definition language (DDL) events like CREATE, ALTER, or DROP. Triggers are often used to enforce business rules, maintain referential integrity, and automate certain tasks.

DML triggers

These triggers respond to data manipulation language (DML) events, such as INSERT, UPDATE, and DELETE operations on a table. Common use cases include enforcing data integrity rules, auditing changes, and automating specific actions based on data modifications. Example of an AFTER INSERT trigger:

CREATE TRIGGER AfterInsertTrigger
AFTER INSERT
ON Employees
FOR EACH ROW
BEGIN
    -- Trigger logic, e.g., update a related table, log the change, etc.
END;

DDL triggers

These triggers respond to data definition language (DDL) events, such as CREATE, ALTER, and DROP operations on a database or table. Common use cases include restricting certain schema modifications, logging schema changes, or implementing specific actions when database objects are altered. Not every database supports them (MySQL, for example, has no DDL triggers). Example of a BEFORE CREATE trigger, in Oracle-style syntax:

CREATE TRIGGER BeforeCreateTrigger
BEFORE CREATE
ON DATABASE
BEGIN
    -- Trigger logic, e.g., check if the user has permission to create a table
END;

What does the trigger consist of?

Each trigger consists of three elements:

trigger event (when it should happen?) - specifies the event that causes the trigger to be executed (e.g., AFTER INSERT, BEFORE UPDATE)
trigger condition (why it should happen?) - optionally specifies a condition that must be true for the trigger to execute
trigger action (what should happen?) - contains SQL statements or procedures that are executed when the trigger is fired
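The three elements can be seen together in one small trigger. The sketch below is an illustration under assumptions: it runs on SQLite through Python's sqlite3 module (trigger syntax differs slightly between databases), and the employees and salary_audit tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary INTEGER NOT NULL);
CREATE TABLE salary_audit (employee_id INTEGER, old_salary INTEGER, new_salary INTEGER);

CREATE TRIGGER audit_salary_raise
AFTER UPDATE OF salary ON employees   -- trigger event
FOR EACH ROW
WHEN NEW.salary > OLD.salary          -- trigger condition
BEGIN                                 -- trigger action
    INSERT INTO salary_audit VALUES (OLD.id, OLD.salary, NEW.salary);
END;
""")

conn.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Alice", 1000))
conn.execute("UPDATE employees SET salary = 1200 WHERE name = ?", ("Alice",))  # raise: audited
conn.execute("UPDATE employees SET salary = 900 WHERE name = ?", ("Alice",))   # cut: condition is false
audit_rows = conn.execute(
    "SELECT employee_id, old_salary, new_salary FROM salary_audit"
).fetchall()
print(audit_rows)
```

Only the raise is written to the audit table, because the condition filters out the second update even though the event fired.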

How are triggers used?

to enforce business rules
to enforce / maintain referential integrity rules
to audit and log changes, including schema changes
to automate data modifications
to restrict certain schema modifications to database objects

What is the risk of using triggers?

They introduce additional complexity and can impact performance! Overuse of triggers can make database behavior less transparent and harder to manage. Therefore, triggers are often employed for tasks that are best handled within the database layer (meta-level, database management), such as enforcing integrity constraints or automating certain actions, rather than for general application logic.

What is SQL index?

A SQL index consists of a data structure that stores a sorted or hashed subset of the columns from a database table, along with pointers to the corresponding rows, to facilitate efficient and quick data retrieval operations.

Index: more insight

An index is a separate set of data, built from the indexed field (column) and a pointer to the full record containing that field. SQL indexes work by providing (theoretically) a faster way to retrieve data from a database table. Indexing creates a data structure that maps specific column values to their corresponding rows. It is smaller than the full records, takes less disk space, and is sorted, which allows binary search, so it is faster to search through. As an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through.

According to nice explanation in MySQL manual:

Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially.

This is similar to hashing in data structures, like Hash Map. In fact, some SQL indexing methods are using hashing. Most of MySQL indexes use B-Trees, some use R-Trees and hashes.

OK, but what are B-Trees? The MySQL manual offers a helpful glossary of terms with the B-Tree concept explained. A B-Tree is a data structure, but it is not the same as a binary tree: a B-Tree node can have many children, while a binary tree node has at most two.

The index allows the database to avoid a full table scan (row by row), resulting in significantly faster query execution (in theory). Instead of going through all the rows in the table, the database directly accesses the row(s) that match the condition.

Why am I writing “in theory”?

Indexes are especially beneficial for SELECT, WHERE, JOIN, and ORDER BY clauses, as they help the database engine quickly pinpoint the desired data. However, it’s important to note that indexes come with some trade-offs. They consume storage space and can slightly slow down write operations (INSERT, UPDATE, DELETE) because the index must be updated when the data changes.

Postgres manual is also a great source of knowledge on indexes.

Index - detailed explanation

Data stored on disk-based storage devices is organized into blocks, which serve as the fundamental unit of disk access. Each block is accessed as a whole, representing the smallest disk access (atomic) operation. The structure of disk blocks resembles that of linked lists, with each block consisting of a data section and a pointer indicating the location of the next node or block. Importantly, these blocks do not necessarily need to be stored consecutively on the disk.

Search operation on unsorted data is called linear search.

What is linear search and why does it require (n+1)/2 accesses on average when searching an unordered list with n elements?

Searching mechanism of linear search

In a linear search, you start searching from the beginning of the list and examine each element one by one until you find the target element or determine that it doesn’t exist in the list. You stop as soon as you find a match.

Best-case scenario

The best-case scenario is when the target element is found in the first position of the list. In this case, only one access is required.

Worst-case scenario

The worst-case scenario is when the target element is located at the end of the list or is not present at all. In this case, you must access every element in the list to determine that the target element is not there.

Average-case scenario

To calculate the average number of accesses, you need to consider all possible positions of the target element in the list. On average, you would expect to find it somewhere in the middle, requiring roughly (n+1)/2 accesses. This average assumes that the target element is equally likely to be in any position in the list.

The formula (n+1)/2 represents the arithmetic mean or average of all possible access scenarios in a linear search. It provides a reasonable estimate of the expected number of accesses needed to find an element when the position of the target element is not known in advance.

This is a result of the way linear searches work and is based on the concept of “expected number of accesses.” It’s just an estimation.

In practice, the actual number of accesses in a specific search may vary, but the (n+1)/2 formula provides a useful average estimation for the linear search algorithm’s performance.
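The (n+1)/2 figure is easy to check empirically. The sketch below counts accesses for every possible target position and averages them; the function names are made up for this illustration, and it assumes each position is equally likely, as stated above.

```python
def linear_search_accesses(items, target):
    """Return how many elements are examined before target is found."""
    for accesses, item in enumerate(items, start=1):
        if item == target:
            return accesses
    return len(items)  # not found: every element was examined


def average_accesses(n):
    """Average accesses over all n equally likely target positions."""
    items = list(range(n))
    total = sum(linear_search_accesses(items, target) for target in items)
    return total / n


print(average_accesses(9))    # 5.0, which is (9 + 1) / 2
print(average_accesses(100))  # 50.5, which is (100 + 1) / 2
```

The sum of accesses is 1 + 2 + ... + n = n(n+1)/2, so dividing by n gives exactly (n+1)/2.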

What in case of non-unique fields?

(n+1)/2 is appropriate only if we search for a unique value (one that cannot be duplicated), so once it is found, there is no need to search any further. If the searched field is a non-key field (i.e. it doesn’t contain unique entries), we must find all records that match, so the entire table must be searched. That requires n block accesses.

What if data are sorted?

When data is stored in a sorted field, you can employ a Binary Search algorithm to locate specific values. Binary Search is highly efficient and typically requires log2(n) block accesses to find a particular value. Here, n represents the number of elements or records in the sorted field. (In contrast, a linear search in an unsorted field might require n block accesses in the worst case, which is significantly less efficient.)
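To get a feeling for the log2(n) claim, here is a minimal sketch of a binary search that counts its accesses. It works on an in-memory Python list, not on disk blocks, so treat it as an illustration of the growth rate only; the function name is an assumption of this example.

```python
import math


def binary_search_accesses(sorted_items, target):
    """Return (found, accesses) for a classic binary search on a sorted list."""
    low, high, accesses = 0, len(sorted_items) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        accesses += 1                     # one element examined
        if sorted_items[mid] == target:
            return True, accesses
        if sorted_items[mid] < target:
            low = mid + 1                 # discard the lower half
        else:
            high = mid - 1                # discard the upper half
    return False, accesses


n = 1_000_000
items = list(range(n))
limit = math.floor(math.log2(n)) + 1      # upper bound on accesses for n elements
probes = [binary_search_accesses(items, t) for t in (0, n // 3, n - 1, n)]
print(limit, probes)
```

For a million elements the bound is 20 accesses, compared with up to 1,000,000 for a linear scan.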

What about duplication problem?

In a sorted field, once a value higher than the target value is found during the search, you can be confident that the target value doesn’t exist in the remaining portion of the field. This is because, in a sorted field, all values are ordered, and any duplicate values would be adjacent to each other. Therefore, you don’t need to continue searching for duplicate values once a higher value is encountered.

The combined effect of using Binary Search and the elimination of duplicate searches in a sorted field results in a substantial performance increase compared to an unsorted field. It allows for quicker and more efficient retrieval of data, especially when searching for specific values or performing range-based queries.

In summary, the key advantages of using a sorted field include the efficiency of Binary Search and the ability to eliminate duplicate searches, both of which significantly enhance query performance and data retrieval speed.
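Because duplicates sit next to each other in sorted data, all of them can be located with two binary searches instead of a full scan. A small sketch using Python's standard bisect module (the helper name find_all_sorted is hypothetical):

```python
from bisect import bisect_left, bisect_right


def find_all_sorted(sorted_values, target):
    """Return the index range [start, end) covering every occurrence of target."""
    start = bisect_left(sorted_values, target)   # first position holding target
    end = bisect_right(sorted_values, target)    # first position holding a larger value
    return start, end


values = [1, 3, 3, 3, 7, 9]
print(find_all_sorted(values, 3))  # (1, 4): the three duplicates are adjacent
print(find_all_sorted(values, 5))  # (4, 4): empty range, the value is absent
```

An empty range means the value does not exist, and no element beyond the first larger value had to be inspected.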

Advantages of using indexes

And now, indexing comes into play, offering some benefits:

avoiding a full table scan (row by row), using trees and hashing in searching
index is a given field + pointer to the record, so it is fewer data than original record
speed up SELECT, WHERE, JOIN, and ORDER BY
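Whether a query really avoids the full table scan can be checked by asking the database for its query plan. The sketch below is an illustration on SQLite via Python's sqlite3 module (MySQL has its own EXPLAIN with a different output format); the index name idx_company_country and the sample data are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT, hq_country TEXT)"
)
conn.executemany(
    "INSERT INTO company (name, hq_country) VALUES (?, ?)",
    [(f"company_{i}", "JPN" if i % 50 == 0 else "POL") for i in range(1000)],
)


def query_plan(conn):
    """Return SQLite's description of how the lookup would be executed."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM company WHERE hq_country = ?", ("JPN",)
    ).fetchall()
    return " ".join(row[-1] for row in rows)  # the last column is the readable detail


plan_without_index = query_plan(conn)  # a scan of the whole table
conn.execute("CREATE INDEX idx_company_country ON company (hq_country)")
plan_with_index = query_plan(conn)     # a search that names the index
print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan depends on the SQLite version, but the first plan reports a scan of the table and the second one mentions the index.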

As this great StackOverflow article explains:

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.

To sum up, indexing takes advantage of the fact that the data are sorted, which makes it possible to use searching algorithms that are more efficient than a simple linear search.

Are there any drawbacks of indexes?

Unfortunately, nothing comes for free. Here, I would like to quote StackOverflow again:

The downside to indexing is that these indices require additional space on the disk since the indices are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

In short:

index takes additional storage space (it needs additional data structure that stores a sorted or hashed subset of the columns)
index can slightly slow down write operations (INSERT, UPDATE, DELETE) because the index must be updated when the data changes

When to use indexes:

high-cardinality columns (high uniqueness of data in a particular column)
frequent searches
large tables
JOIN, GROUP BY, ORDER BY
unique constraints (PRIMARY KEY, UNIQUE)

When not to use indexes:

small tables
sequential data, increasing or decreasing, like timestamps: the benefits of indexing might be limited, as new values are continuously added at one end of the index
frequent write operations
low-cardinality columns
temporary tables

What is this cardinality, after all?

Cardinality means degree of uniqueness of data values contained in a particular column. High-cardinality refers to columns with values that are very uncommon or unique - a good use case to apply indexes: e.g. user_id (which is unique). Data of normal-cardinality would be: address, name, etc. And finally, examples of low-cardinality data are booleans, flags, Y/N switch, etc. - do not use indexes on such columns!
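Cardinality can be expressed as a simple ratio of distinct values to all values. The helper below is a hypothetical illustration, not a database feature; in a real database you would compare COUNT(DISTINCT column) with COUNT(*).

```python
def cardinality_ratio(values):
    """Share of distinct values in a column: close to 1.0 means high cardinality."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


user_ids = list(range(1, 1001))                # every value is unique
is_active = [i % 2 == 0 for i in range(1000)]  # only True / False

print(cardinality_ratio(user_ids))   # 1.0, a good candidate for an index
print(cardinality_ratio(is_active))  # 0.002, an index would hardly help
```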

SQL cheatsheet: part 6

2023-07-20T04:23:00+00:00

Previously on SQL: Medium SQL for Java developers: recapitulation

Procedure

A SQL stored procedure is a named set of SQL statements saved in the database. In other words, it works like a function that executes a pre-programmed query. A procedure can accept arguments. Here is an example of a stored procedure that selects all columns from the table company according to the country code argument passed to this procedure.

-- STORED PROCEDURE - SQL query saved directly in db as a function
-- such function accepts argument being used to execute query within stored procedure
-- procedure can be then called multiple times with different args

-- HOW TO CREATE
DELIMITER $$
CREATE PROCEDURE get_company_by_country_code(
    country_code VARCHAR(4)
)
BEGIN
    IF country_code IS NULL THEN
        SELECT 'Function argument is null';
    ELSE
        SELECT *
        FROM company
        WHERE company.hq_country = country_code;
    END IF;
END $$

-- HOW TO CALL
CALL get_company_by_country_code('JPN') -- returns result
$$
CALL get_company_by_country_code(null) -- returns hard-coded answer
$$
CALL get_company_by_country_code() -- returns error as arg is expected
$$

I asked ChatGPT to explain the procedure. Here is detailed explanation:

The code sets the delimiter to $$. This is used to specify the end of the stored procedure definition since it contains semicolons (;) within its body.

The CREATE PROCEDURE statement is used to define the stored procedure get_company_by_country_code with the country_code parameter.

The BEGIN keyword indicates the start of the stored procedure’s body.

The code checks if the country_code parameter is NULL using the IF statement.

If the country_code is NULL, the code executes the SELECT ‘Function argument is null’; statement. This statement will return a single row with the string value ‘Function argument is null’.

If the country_code is not NULL, the code executes the SELECT * FROM company WHERE company.hq_country = country_code; statement. This statement selects all columns (*) from the company table where the hq_country column matches the provided country_code parameter.

The END IF; statement denotes the end of the IF block.

The END $$ statement denotes the end of the stored procedure definition, using the previously set delimiter.

Overall, this stored procedure retrieves company data based on the provided country_code. If the country_code parameter is NULL, it returns the string ‘Function argument is null’. Otherwise, it selects all columns from the company table where the hq_country column matches the provided country_code.

The explanation created by ChatGPT contains rather obvious, self-explanatory statements, but it can be helpful, nevertheless.

Why should we use stored procedures?

The Oracle JDBC tutorial explains this issue.

JDBC means Java Database Connectivity.

It’s a Java API (a.k.a. “abstraction layer”) designed to standardize and simplify the process of connecting Java software to relational database management systems (RDBMS).

In short:

Stored procedures are precompiled database scripts (group of statements) that can be executed from a database client, such as a Java application, using JDBC. Stored procedures offer better performance, reduced network traffic, and improved security. They can encapsulate complex SQL logic and business rules. JDBC provides the CallableStatement interface for executing stored procedures.

The CallableStatement interface allows the use of SQL statements to call stored procedures.

You can prepare and execute stored procedures using CallableStatement. You can use execute(), executeQuery(), and executeUpdate() methods to invoke stored procedures. They can have input and output parameters, or parameters that are both input and output. Error handling for stored procedures is also explained, including handling exceptions using SQLException.

What about advantages of stored procedures in Java, including security context?

Easier maintenance

1. Encapsulation of logic and operations

Stored procedures allow you to encapsulate business logic and database operations into a single unit, looking like a simple script. This helps enforce data integrity rules and security constraints. By centralizing the implementation of data operations within the stored procedure, you can ensure that security checks, access controls, and validation rules are consistently applied across different parts of the application.

2. Reusability of code

Stored procedures can be called and executed from various parts of an application or by multiple users. This promotes code reuse and reduces redundancy, as the same logic can be executed without rewriting it. Multiple Java applications or components can invoke the same stored procedure, reducing duplication of code and promoting consistency across different parts of your application.

3. Transaction management

Stored procedures can be used to define complex transactions that involve multiple SQL statements. This allows for consistent and reliable data modifications, with the ability to roll back changes if necessary. Storing SQL logic in stored procedures allows for centralized management and versioning of database operations. Modifications to the SQL code can be made in the stored procedures without requiring changes to the Java codebase.

This separation of concerns makes it easier to track and manage changes, ensuring that security updates and fixes can be applied more efficiently.

4. Database decoupling

A stored procedure is a helpful tool when thinking about vendor-specific database independence (like, for example, future data migration). Using stored procedures can help abstract the underlying database implementation from your Java code. By relying on stored procedures, you can write database-agnostic Java code that can work with different database systems without major modifications. This can simplify database migrations or switching to a different database platform in the future.

Better performance

Stored procedures can enhance performance by reducing network traffic. They are typically compiled and optimized by the database server during creation or the first execution. This can lead to faster execution times compared to dynamically generating SQL statements in Java code. By offloading data processing to the database server, you can reduce network latency and utilize the database’s query optimization capabilities. Instead of sending multiple SQL statements over the network, a single call to the stored procedure is made, reducing the overhead of multiple round trips.

It could make a difference if someone tried to DDoS your application.

Increased security

1. Security enhanced by limited access

Stored procedures allow you to grant permissions to execute the procedure without granting direct access to the underlying database tables. This means that users or applications can interact with the database only through the stored procedure, and they don’t have direct control over the underlying data. It provides a layer of abstraction and restricts unauthorized access to sensitive data.

2. Safe parametrized queries

Stored procedures typically use parameterized queries, where user input is passed as parameters rather than directly concatenating them into SQL statements. This helps prevent SQL injection attacks, a common security vulnerability where malicious input is injected into SQL statements. By using parameterized queries, stored procedures can ensure that user input is properly sanitized and reduce the risk of SQL injection attacks.
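The difference between concatenating input and binding it as a parameter can be shown in a few lines. The sketch below does not use a stored procedure: it runs plain queries on SQLite through Python's sqlite3 module, purely to illustrate the principle, and the sample rows and the malicious string are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT, hq_country TEXT)"
)
conn.executemany(
    "INSERT INTO company (name, hq_country) VALUES (?, ?)",
    [("Acme", "USA"), ("Sakura", "JPN"), ("Wisla", "POL")],
)

malicious = "JPN' OR '1'='1"

# Unsafe: the input becomes part of the SQL text, so the injected OR clause is executed.
unsafe_rows = conn.execute(
    "SELECT name FROM company WHERE hq_country = '" + malicious + "'"
).fetchall()

# Safe: the input is bound as a value and is only ever compared as a plain string.
safe_rows = conn.execute(
    "SELECT name FROM company WHERE hq_country = ?", (malicious,)
).fetchall()

print(len(unsafe_rows), safe_rows)
```

The concatenated query returns every company, while the parameterized one returns nothing, because no country code equals the whole malicious string.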

3. Auditing and logging for security control

Stored procedures provide a natural point for implementing auditing and logging mechanisms. You can log the execution of stored procedures, capturing details such as who executed the procedure, when it was executed, and what parameters were used. This can help with compliance requirements, troubleshooting, and identifying potential security breaches or suspicious activities.

View

A SQL view can be considered a virtual table that consolidates data from one or more tables. Unlike physical tables, a view doesn’t store data itself and exists only logically in the database, where its definition is saved.

Each view in a database must have a unique name, just like a regular SQL table. It is defined by a SQL query that retrieves data from the underlying database tables. A view can incorporate tables from a single database or multiple databases.

Example of SQL view:

-- VIEW acts like reusable saved SELECT
-- is stored directly in db

CREATE VIEW company_left_join_customer AS
SELECT company.*,
       c.customer_id AS customer_number, -- column names in a view must be unique, hence the alias
       c.first_name,
       c.last_name,
       c.registration_date
FROM
    company
        LEFT JOIN customer c ON company.customer_id = c.customer_id;

-- example of use
SELECT * FROM company_left_join_customer; -- view alias

-- customer, current branch, current turnover
CREATE VIEW customer_branch_turnover_current AS
SELECT
    customer.customer_id,
    customer.first_name,
    customer.last_name,
    customer.registration_date,
    b.branch_name,
    bc.from_date AS branch_since,
    t.turnover,
    t.from_date AS monthly_turnover_since
FROM
    customer
        INNER JOIN
    branch_customers bc
    ON customer.customer_id = bc.customer_id
        INNER JOIN
    branch b
    ON bc.branch_id = b.branch_id
        INNER JOIN
    turnover t
    ON customer.customer_id = t.customer_id
WHERE
        bc.to_date = '9999-01-01'
  AND
        t.to_date = '9999-01-01';

View versus procedure

What’s the difference between procedure and view?

SQL view is a virtual table or tables, similar to a product of SELECT query, optionally with JOIN.

No logic: There is no procedural logic in that, no conditional statements nor loops.

No parameters: Views don’t accept parameters.

No storage: A view doesn’t store data itself but provides a way to present data from one or more underlying tables or other views.

Views are primarily used for data retrieval and presentation.

Read-only: Views are typically treated as read-only (some databases allow updates through simple views). They can be used to simplify complex queries, filter data, and provide a consistent interface for users or applications.

In terms of security, views can be used to restrict access to specific columns or rows in a table, but they don’t provide the same level of security and control as stored procedures.

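In terms of performance, a regular view is typically expanded into its underlying query when it is used, so its main benefit is a pre-defined, reusable representation of data rather than speed.

Complex views, especially views built on top of other views, may introduce performance issues.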
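The "no storage" property can be checked with a small experiment. The sketch below is only an illustration: it uses SQLite through Python's standard `sqlite3` module instead of MySQL, and the `customer` table, the `active` column and the `active_customer` view are hypothetical names made up for this example.

```python
import sqlite3

# In-memory SQLite database; all table, column and view names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT, active INTEGER);
    INSERT INTO customer VALUES (1, 'Anna', 1), (2, 'Ben', 0);
    -- The database saves only this SELECT, not its result.
    CREATE VIEW active_customer AS
        SELECT customer_id, first_name FROM customer WHERE active = 1;
""")


def view_rows(connection):
    """Return the current content of the view, ordered by id."""
    return connection.execute(
        "SELECT customer_id, first_name FROM active_customer ORDER BY customer_id"
    ).fetchall()


before = view_rows(conn)
# Change the underlying table; the view definition itself is untouched.
conn.execute("UPDATE customer SET active = 1 WHERE customer_id = 2")
after = view_rows(conn)

print(before)  # rows visible through the view before the update
print(after)   # rows visible through the view after the update
```

The second read returns the newly activated customer as well, because the view re-runs its SELECT against the table each time it is queried.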

On the other hand, a SQL stored procedure is like a pre-programmed query, often with custom parameters. Stored procedures support conditional logic, error handling, and the ability to return multiple result sets. Stored procedures can be used for data manipulation, such as CRUD operations. They can significantly improve security and performance.

Summary (updated 15.11.2023)

What is a stored procedure? What is it used for?

Stored procedures are precompiled database scripts (group of statements) that can be executed from a database client, such as a Java application, using JDBC. Stored procedures offer better performance, reduced network traffic, and improved security. They can encapsulate complex SQL logic and business rules. JDBC provides the CallableStatement interface for executing stored procedures.

What is a database view?

A SQL view can be considered a virtual table that consolidates data from one or more tables. Contrary to a physical table, a view doesn’t store data itself; only its definition is saved in the database. Unlike procedures, a view has no logic (it is only for presentation): no parameters, no storage, and it is typically read-only.

SQL cheatsheet: part 5

2023-06-20T04:23:00+00:00

Previously on SQL: CRUD, n+1, migrations

Summary: SQL basics for Java devs

Practice your SQL skills, so that you do not feel you need to start from scratch over and over again, in particular if you’re a beginner or you do not work with SQL very often (which is not uncommon). SQL problems can be like a Nemesis: I saw senior architects’ hair turning grey because of complex SQL issues affecting company performance. They were stammering while trying to admit that there was a bug that no one was able to resolve easily.

You can exercise with SQL katas on the various programming websites that I’ve mentioned earlier.

You can also use any of the SQL playgrounds (a.k.a. fiddles) available on the web, like db-fiddle, to run and test simple queries. It is more like shadow-boxing: you try something, and when it fails, you need to counter your imaginary opponent. For testing more advanced queries and their performance, a local database, a Docker database or a remote cloud database would be better, along with an SQL client such as Workbench or IntelliJ.

Create a basic SQL schema - to have a sample table:

CREATE TABLE test (
    id INT
);
INSERT INTO test (id) VALUES (1);
INSERT INTO test (id) VALUES (2);

Let’s check what this fiddle offers:

SELECT @@version;
-- 5.7.38

and now sample queries to play with - note that this fiddle requires backticks, which weren’t needed in previous examples:

ALTER TABLE `test` add customer_id int;
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE `TABLE_NAME` = 'test';
SELECT `COLUMN_NAME` FROM `INFORMATION_SCHEMA`.`COLUMNS` WHERE `TABLE_NAME` = 'test';
SELECT 'anything';

Now, remember that SELECT can evaluate Boolean expressions, and it can also perform arithmetic calculations:

SELECT 1 < 0 AS boolean_value;
SELECT IF(2 + 2 = 4, 'TRUE', 'FALSE') AS two_plus_two_is_four;

The SQL playground I tested did not have problems with more advanced XOR gate example (see first part of this SQL series):

SET @false_xor := 'gate returns false';
SET @true_xor := 'gate returns true';
SELECT
    IF(0 XOR 0, @true_xor, @false_xor) AS '0 XOR 0',
    IF(0 XOR 1, @true_xor, @false_xor) AS '0 XOR 1',
    IF(1 XOR 0, @true_xor, @false_xor) AS '1 XOR 0',
    IF(1 XOR 1, @true_xor, @false_xor) AS '1 XOR 1';

A basic SELECT can wrap a column in a null check: if the value of the field is NULL, the given substitute is returned instead:

SELECT IFNULL(customer_id, 'it is null, though!') AS 'null checked customer_id' FROM test WHERE id = 1;

---

**Query #1**

    SELECT IFNULL(customer_id, 'it is null, though!') AS 'null checked customer_id' FROM test WHERE id = 1;

| null checked customer_id |
| ------------------------ |
| it is null, though!      |

---

Let’s extend the table with some text column:

-- create new column:
ALTER TABLE `test` add country varchar(3);
-- and then add something there:
INSERT INTO test (id, customer_id, country) VALUES (7, null, 'FIN');
INSERT INTO test (id, customer_id, country) VALUES (8, null, 'NOR');

or update existing records with new values:

UPDATE test SET country = 'FIN' WHERE id = 1;
UPDATE test SET country = 'SWE' WHERE id = 2;

To check equality, use equal sign:

SELECT * FROM test WHERE country = 'FIN'

To make a loose comparison, use the percent sign as a wildcard on the chosen side (or both sides) of the lookup expression:

SELECT * FROM test WHERE country LIKE '%N%';
SELECT * FROM test WHERE country LIKE '%N';
SELECT * FROM test WHERE country LIKE 'N%';

The percent sign replaces zero or more characters. The underscore replaces exactly one character:

SELECT * FROM test WHERE country LIKE '__N';

Last but not least, the COALESCE function returns the first non-null value of those listed in parentheses:

SELECT COALESCE(country, 'Unknown') FROM test;

It is used to replace a null value with a substitute:

FIN
Unknown
FIN
NOR

Aggregations

Let’s create new schema with two tables to test aggregations:

CREATE TABLE country
(
    id   INT,
    code VARCHAR(3)
);
INSERT INTO country (id, code)
VALUES (1, 'SWE'),
       (2, 'FIN'),
       (3, 'NOR'),
       (4, 'ISL'),
       (5, 'DNK');


CREATE TABLE player
(
    id    INT,
    name  VARCHAR(5),
    city  VARCHAR(10),
    games INT
);

INSERT INTO player (id, name, city, games)
VALUES (1, 'Swen', 'Kiruna', 10),
       (2, 'Antti', 'Kotka', 11),
       (3, 'Marit', 'Bergen', 13),
       (4, 'Katja', 'Keflavik', 4),
       (5, 'Karin', 'Odense', 22);

Here are the most common aggregate functions (SUM is another one):

SELECT COUNT(*) AS count_all_records,
       MAX(games),
       MIN(games),
       AVG(games)
FROM player;

Counting occurrences of each name (and grouping by the same name):

SELECT name, COUNT(*) AS occurrences
FROM player
GROUP BY (name);

Counting occurrences of names containing ‘K’:

SELECT name, COUNT(name) AS occurrences
FROM player
WHERE name LIKE '%K%'
GROUP BY (name);

Group by the number of games played and count how many players reached that number, keeping only the groups with fewer than 13 players:

SELECT games, COUNT(games) AS players_with_this_qty_of_games
FROM player
GROUP BY (games)
HAVING COUNT(games) < 13;

Remember: WHERE is used before GROUP BY, HAVING after GROUP BY.

Both can be used in the same query!
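To see both filters working in one query, here is a minimal sketch. It runs on SQLite through Python's standard `sqlite3` module rather than on the MySQL fiddle, and the `game_result` table with its rows is a hypothetical data set invented for this example, so that at least one group contains more than one row.

```python
import sqlite3

# Hypothetical sample data: several players per country.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE game_result (player TEXT, country TEXT, games INTEGER);
    INSERT INTO game_result VALUES
        ('Swen', 'SWE', 10), ('Karin', 'SWE', 22),
        ('Antti', 'FIN', 11), ('Mika', 'FIN', 3),
        ('Marit', 'NOR', 13);
""")

# WHERE only: rows with fewer than 10 games are dropped before grouping.
where_only = conn.execute("""
    SELECT country, COUNT(*) AS experienced_players
    FROM game_result
    WHERE games >= 10
    GROUP BY country
    ORDER BY country
""").fetchall()

# WHERE and HAVING: the groups built from the remaining rows are filtered again.
where_and_having = conn.execute("""
    SELECT country, COUNT(*) AS experienced_players
    FROM game_result
    WHERE games >= 10          -- row filter, applied before GROUP BY
    GROUP BY country
    HAVING COUNT(*) >= 2       -- group filter, applied after GROUP BY
    ORDER BY country
""").fetchall()

print(where_only)
print(where_and_having)
```

WHERE cannot reference `COUNT(*)`, because the groups do not exist yet when it is evaluated; that is the job of HAVING.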

Joining

UNION combines the results of one query with the results of another query, often on the same table, but omits duplicated rows (rows returned by both queries). You can also UNION different tables, but the number of columns selected from one table must equal the number of columns selected from the other.

SELECT * FROM player WHERE games > 10
UNION
SELECT * FROM player WHERE name LIKE '%K%'

UNION ALL allows duplicates, so Karin would be listed twice.
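The difference can be verified on the `player` data from above. This is a sketch only: it uses SQLite through Python's standard `sqlite3` module instead of the MySQL fiddle, and it selects just the `name` column to keep the output short. Like MySQL with its default collation, SQLite's LIKE is case-insensitive for ASCII letters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (id INTEGER, name TEXT, city TEXT, games INTEGER);
    INSERT INTO player VALUES
        (1, 'Swen', 'Kiruna', 10), (2, 'Antti', 'Kotka', 11),
        (3, 'Marit', 'Bergen', 13), (4, 'Katja', 'Keflavik', 4),
        (5, 'Karin', 'Odense', 22);
""")


def names(set_operator):
    """Run both SELECTs combined with the given operator and return the names."""
    sql = f"""
        SELECT name FROM player WHERE games > 10
        {set_operator}
        SELECT name FROM player WHERE name LIKE '%K%'
        ORDER BY name
    """
    return [name for (name,) in conn.execute(sql)]


union_names = names("UNION")          # duplicates removed
union_all_names = names("UNION ALL")  # duplicates kept

print(union_names)
print(union_all_names)
```

Karin satisfies both conditions, so she appears once with UNION and twice with UNION ALL.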

Let’s change id column names to precisely indicate what id they are referring to.

CREATE TABLE country (
                         country_id INT,
                         code VARCHAR(3)
);
INSERT INTO country (country_id, code) VALUES (1, 'SWE'),
(2, 'FIN'), (3, 'NOR'), (4, 'ISL'), (5, 'DNK');


CREATE TABLE player (
                        player_id INT,
                        name VARCHAR(5),
                        city VARCHAR(10),
                        games INT
);
INSERT INTO player (player_id, name, city, games) VALUES (1, 'Swen', 'Kiruna', 10),
(2, 'Antti', 'Kotka', 11),(3, 'Marit', 'Bergen', 13),(4, 'Katja', 'Keflavik', 4),(5, 'Karin', 'Odense', 22);

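Let’s extend the schema with a country_id column in player that acts as a foreign key to country. A join only needs matching column values; declaring PRIMARY KEY and FOREIGN KEY constraints is good practice but not strictly required for the query to work.

ALTER TABLE player ADD country_id INT;
-- this does the trick: by coincidence, country_id should be the same as player id, so let's take advantage of that
UPDATE player SET country_id = player_id;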

INNER JOIN joins tables by matching key columns, typically a primary key and a foreign key. The inner join syntax looks like this:

SELECT * FROM
    customer
INNER JOIN
    field f ON customer.customer_id = f.customer_id

Apply it to our schema:

SELECT * FROM player INNER JOIN country ON player.country_id = country.country_id;

Alternative syntax with USING. The foreign key column country_id in the player table has the same name as the country_id column in the country table. This is required when using the USING keyword, so that SQL knows which columns connect the tables.

-- USING
SELECT * FROM
    company
        INNER JOIN
    customer USING(customer_id);

With Player and Country tables:

SELECT * FROM player INNER JOIN country USING (country_id)

INNER JOIN can be applied to more than two tables. You can also join using third, “helper” table.

CROSS JOIN produces a Cartesian product (each row x each row) if no WHERE is specified. With a WHERE condition, it behaves like an inner join:

SELECT * FROM player CROSS JOIN country WHERE country.country_id = player.country_id;

LEFT JOIN, RIGHT JOIN, OUTER JOIN

What is the difference between them?

ChatGPT offers concise summary:

LEFT JOIN

Also known as a LEFT OUTER JOIN.
Returns all the rows from the left table (the table mentioned before the LEFT JOIN clause) and the matching rows from the right table (the table mentioned after the LEFT JOIN clause).
If there are no matching rows in the right table, NULL values are returned for the columns of the right table.
This type of join ensures that all rows from the left table are included in the result, with the possibility of additional data from the right table if a match exists.

SELECT employees.name, departments.department_name
FROM employees
LEFT JOIN departments ON employees.department_id = departments.id;

RIGHT JOIN

Also known as a RIGHT OUTER JOIN.
Returns all the rows from the right table and the matching rows from the left table.
If there are no matching rows in the left table, NULL values are returned for the columns of the left table.
This join is less commonly used than the LEFT JOIN but has the same purpose, ensuring that all rows from the right table are included in the result.

SELECT employees.name, departments.department_name
FROM employees
RIGHT JOIN departments ON employees.department_id = departments.id;

FULL OUTER JOIN (OUTER JOIN)

Combines the result sets of both the left and right tables.
Returns all the rows from both tables and matches rows where the join condition is met.
If there are no matches in either table, NULL values are returned for the columns from the table without a match.
The result includes all rows from both tables, ensuring that no data is excluded.

SELECT employees.name, departments.department_name
FROM employees
FULL OUTER JOIN departments ON employees.department_id = departments.id;

It’s important to note that not all database systems support RIGHT JOIN and FULL OUTER JOIN directly, and you may need to use alternative methods to achieve the same results in those cases, such as swapping the order of tables or using UNION clauses.
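One such alternative is to build a full outer join out of two LEFT JOINs glued together with UNION. The sketch below is an illustration under assumptions: it runs on SQLite through Python's standard `sqlite3` module, and the `employees` and `departments` rows are invented sample data matching the column names used in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, department_name TEXT);
    CREATE TABLE employees (name TEXT, department_id INTEGER);
    INSERT INTO departments VALUES (1, 'Research'), (2, 'Sales');
    -- Bob has no department; Sales has no employee.
    INSERT INTO employees VALUES ('Ada', 1), ('Bob', NULL);
""")

full_outer = conn.execute("""
    -- every employee, with the department if there is one
    SELECT e.name, d.department_name
    FROM employees e
    LEFT JOIN departments d ON e.department_id = d.id
    UNION
    -- every department, with the employee if there is one
    -- (tables swapped, which plays the role of a RIGHT JOIN)
    SELECT e.name, d.department_name
    FROM departments d
    LEFT JOIN employees e ON e.department_id = d.id
""").fetchall()

for row in full_outer:
    print(row)
```

Note that UNION also removes genuinely duplicated rows, so this emulation is only equivalent to FULL OUTER JOIN when the selected columns identify each row uniquely.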

See previous article on JOINs: union vs join, left join, right join, inner vs outer join

Update: 15.11.2023 - other questions

Inner join vs outer join: what’s the difference?

An inner join returns only the rows from both tables that satisfy the specified join condition (can be joined by indicated field). Rows that do not have matching values in the joined columns are excluded from the result set.

An outer join returns all the rows from one table and the matching rows from the other table, being connected by indicated field. But if there is no match, the result will contain NULL values for columns from the table that does not have a matching row.

How does the SQL `GROUP BY` clause work?

GROUP BY clause is used to group rows that have the same values in specified columns into summary rows, often for the purpose of applying aggregate functions to each group:

SELECT security_branch, COUNT(user_id) as user_count, MAX(last_login_datetime) as latest_login
FROM cybersecurity_users
GROUP BY security_branch;

with result:

+---------------------+------------+----------------------+
| security_branch     | user_count | latest_login         |
+---------------------+------------+----------------------+
| Threat Analysis     | 25         | 2023-11-15T08:30:00Z |
| Incident Response   | 18         | 2023-09-28T15:45:00Z |
| Penetration Testing | 12         | 2023-07-05T12:10:00Z |
| Security Operations | 30         | 2023-08-10T18:22:30Z |
| Compliance          | 15         | 2023-09-02T09:55:45Z |
+---------------------+------------+----------------------+

What is ORM?

ORM stands for Object-Relational Mapping. It is a programming technique that allows you to interact with a relational database using object-oriented programming. ORM consists in mirroring logical entries (entities) from database tables as entities written in a programming language on the application side.

Key features:

Mapping: ORM systems map database tables to classes; each row in a table corresponds to an instance of a class, and each column corresponds to an attribute or property of that class.
Data abstraction: ORM abstracts away the details of database interactions, so you deal with objects and classes, not with SQL queries.
CRUD: ORM systems provide methods and APIs for performing CRUD (Create, Read, Update, Delete) operations on database entities.
Relationships: ORM systems handle relationships between entities, such as one-to-one, one-to-many, and many-to-many relationships.
Portability: ORM systems often provide a level of database portability, allowing developers to switch between different database management systems (e.g., MySQL, PostgreSQL, Oracle) with minimal code changes. The ORM system abstracts the differences in SQL syntax and handles them internally.
Performance optimization: ORM systems may include features for optimizing database access, such as lazy loading (loading data on demand), caching, and query optimization.

SQL cheatsheet: part 4

2023-05-18T05:23:00+00:00

Previously on SQL: union vs join, left join, right join, inner vs outer join

CRUD

CRUD is an acronym that stands for Create, Read, Update, and Delete. It is a set of four basic operations that are commonly used in the context of database management systems and web development to manage data.

Here’s a breakdown of each operation:

Create (C): This operation involves creating new records or entities in a database. It typically involves inserting data into a database table or creating a new object in an object-oriented programming context.

Read (R): This operation involves retrieving or reading existing data from a database. It allows you to query and fetch specific records or information from a database. Reading data could involve retrieving a single record, a subset of records, or all the records in a table.

Update (U): This operation involves modifying or updating existing data in a database. It allows you to make changes to specific records or fields within a record. Updating data could involve modifying values, adding new information, or altering existing data.

Delete (D): This operation involves removing or deleting existing data from a database. It allows you to delete specific records or objects from a database. Deleting data could involve removing a single record, a subset of records, or all the records in a table.

These four operations form the fundamental building blocks for performing data manipulation within a database or application. They provide the basic functionality to create, retrieve, update, and delete data, enabling developers to perform various operations on data stored in a system.

CRUD applications are often (mistakenly) perceived by programmers as super-simple, even trivial, and certainly not impressive at all. In fact, many corporate-grade Java applications are CRUD-type or, at least, have a part of the code that executes CRUD requests. Such an application, of course, must be designed as business-oriented software that plays an important role for the client. Otherwise, nobody would pay for a simple CRUD app.

More experienced engineers will tell you that even CRUD operations are not so simple to execute properly and require some thinking to be planned correctly. CRUD requests are commonly performed on complex, internally related entities, and they involve a lot of records. Then performance and cost-efficiency come into play. Hasty and superficial use of simple SQL queries, very often offered out of the box by ORM frameworks (like Hibernate / JPA), may lead to performance issues, like the “n+1” problem.

n+1 problem

The “n+1 problem” is a term commonly used in the context of database querying, particularly in Object-Relational Mapping (ORM) frameworks. It refers to an issue that arises when retrieving data from a database with relationships between entities.

In an n+1 problem scenario, let’s say you have two entities with a one-to-many relationship, such as “Blog” and “Comment,” where each blog can have multiple comments. When you want to fetch a list of blogs and their associated comments, the n+1 problem occurs if the ORM framework generates n+1 queries to the database.

Here’s how it typically unfolds:

The initial query retrieves a list of blogs from the database.
For each blog in the result set, the ORM framework executes an additional query to fetch the associated comments for that specific blog.
This leads to n+1 queries, where n represents the number of blogs fetched in the initial query.

The problem with the n+1 approach is that it incurs additional overhead and can result in significant performance issues, especially when dealing with large datasets. Each additional query introduces network latency and database overhead, causing the overall retrieval process to be slower and less efficient.

To mitigate the n+1 problem, ORM frameworks often provide ways to eager load or prefetch related data, allowing you to fetch the necessary information in a single query or a reduced number of queries. By doing so, you can avoid the performance pitfalls associated with the n+1 problem.
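The difference in the number of round trips can be made visible without any ORM. The following sketch imitates what a lazily loading framework does, using plain SQL on SQLite through Python's standard `sqlite3` module. The `blog` and `comment` tables, the `run` helper and the query counter are all hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blog (blog_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comment (comment_id INTEGER PRIMARY KEY, blog_id INTEGER, body TEXT);
    INSERT INTO blog VALUES (1, 'Views'), (2, 'Joins'), (3, 'Keys');
    INSERT INTO comment VALUES (1, 1, 'nice'), (2, 1, 'thanks'), (3, 2, 'clear');
""")

query_count = 0


def run(sql, params=()):
    """Execute a query and count it, to make the number of round trips visible."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def load_naive():
    """One query for the blogs, then one extra query per blog (n+1)."""
    blogs = run("SELECT blog_id, title FROM blog ORDER BY blog_id")
    result = {}
    for blog_id, title in blogs:
        comments = run(
            "SELECT body FROM comment WHERE blog_id = ? ORDER BY comment_id",
            (blog_id,),
        )
        result[title] = [body for (body,) in comments]
    return result


def load_joined():
    """A single query with a LEFT JOIN, grouped in application code."""
    rows = run("""
        SELECT b.title, c.body
        FROM blog b
        LEFT JOIN comment c ON c.blog_id = b.blog_id
        ORDER BY b.blog_id, c.comment_id
    """)
    result = {}
    for title, body in rows:
        comments = result.setdefault(title, [])
        if body is not None:  # blogs without comments yield one row with NULL
            comments.append(body)
    return result


query_count = 0
naive = load_naive()
naive_queries = query_count

query_count = 0
joined = load_joined()
joined_queries = query_count

print(naive_queries, joined_queries)
```

Both functions return the same data; only the number of executed queries differs, and it grows with the number of blogs in the naive variant.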

But beware: even such a solution may not fully resolve the issue. Sometimes, plain SQL is a better (but harder) way to deal with data.

Use ORM frameworks with moderation.

It is very important to know SQL basics and be aware of more complex topics, to be able to predict possible traps.

A little more on CREATE

As part of this course, let’s make a short recap of fundamental SELECT queries, basic SQL syntax and operators.

In the first lesson, we created database tables. Obviously, CREATE operations correspond to the Create part of the CRUD acronym. At first glance, database and table creation seemed to be a complex and difficult task, but in reality, it is easier than other queries.

Creation is usually done step by step, meaning one table after another, over time, as each is needed. See Flyway or Liquibase migrations. There is no need to write the whole script at once.
Creation is a one-time act. An application may create or recreate the database structure during deployment, but usually this is not repeated while the application is running nor on demand (e.g. via REST endpoints).
If creation fails, no problem, nothing is lost. The creation / migration script will be fixed.

Flyway and Liquibase: what are database migrations

Database migration refers to the process of modifying the structure or schema of a database in a controlled and organized manner. It involves making changes to the database schema, such as adding or modifying tables, columns, constraints, or indexes, while ensuring that existing data is properly migrated or transformed to accommodate the new structure.

Database migrations are typically performed to introduce changes in an application’s data model, accommodate new features, fix issues, or improve performance. The process is crucial when working with evolving software systems that require continuous updates to the database schema.

Flyway and Liquibase are both popular database migration tools that help developers manage and version control database schema changes. They provide a systematic approach to perform and track database migrations, ensuring smooth and controlled updates to the database structure.

Flyway is an open-source database migration tool. It allows developers to define database changes using SQL scripts or Java-based migrations and tracks the execution of these scripts. Flyway maintains a metadata table in the database to keep track of which migrations have been applied. When running an application, Flyway automatically checks the metadata table and applies any pending migrations, keeping the database schema up to date. Flyway supports a wide range of databases and integrates well with various build tools and frameworks.

Liquibase is another popular open-source database migration tool. It follows a similar approach as Flyway but offers additional features and flexibility. Liquibase allows developers to define database changes using XML, YAML, JSON, or SQL formats. It tracks migrations using a changelog file that specifies the sequence of changes to be applied. Liquibase supports various databases and provides features like rollback support, preconditions, and more advanced change types. It also offers integration with different build tools and frameworks.

Flyway has a simpler and more lightweight design, focusing on simplicity and ease of use. It encourages convention over configuration and follows a strictly ordered migration approach.

Liquibase provides more flexibility and customization options. It supports a wider range of change types, offers advanced features like rollbacks, and allows more fine-grained control over migrations. Flyway uses SQL-based migrations by default, whereas Liquibase supports multiple file formats for defining changes (XML, YAML, JSON, or SQL). Both tools provide integrations with various build tools, frameworks, and Continuous Integration/Continuous Deployment (CI/CD) pipelines.
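The core idea shared by both tools, recording which changes have already been applied and running only the pending ones, can be sketched in a few lines. This is not how Flyway or Liquibase are implemented; it is a simplified illustration on SQLite through Python's standard `sqlite3` module, and the `schema_history` table, the `MIGRATIONS` list and the `migrate` function are hypothetical names.

```python
import sqlite3

# Hypothetical ordered migrations: (version, SQL statement).
MIGRATIONS = [
    (1, "CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE company ADD COLUMN hq_country TEXT"),
]


def migrate(conn, migrations):
    """Apply pending migrations in version order and return the applied versions."""
    # Metadata table that remembers which versions have already been run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history (version INTEGER PRIMARY KEY)"
    )
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_history")}
    newly_applied = []
    for version, statement in sorted(migrations):
        if version in applied:
            continue  # already executed during an earlier run
        conn.execute(statement)
        conn.execute("INSERT INTO schema_history (version) VALUES (?)", (version,))
        newly_applied.append(version)
    conn.commit()
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn, MIGRATIONS)
second_run = migrate(conn, MIGRATIONS)  # nothing pending
third_run = migrate(
    conn,
    MIGRATIONS + [(3, "ALTER TABLE company ADD COLUMN established_date TEXT")],
)
columns = [row[1] for row in conn.execute("PRAGMA table_info(company)")]

print(first_run, second_run, third_run)
print(columns)
```

Running the same set of migrations twice changes nothing the second time, which is what makes it safe to trigger the migration step on every application start.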

More on SELECT

SELECT statements do the Read part of CRUD, so they are relatively safe to execute, since data won’t be modified, but there might be pitfalls.

Enough theory. Let’s recall some practical skills:

-- select all columns matching both (AND) given conditions (note how operators were used for text and date values):
SELECT * FROM company WHERE hq_country='JPN' AND `established_date` < '1987-06-26';

-- select given columns matching at least one (OR) of two conditions
SELECT name, hq_country FROM company WHERE hq_country='JPN' OR hq_country='KOR';

-- a more flexible way of searching; limit the results
SELECT * FROM company WHERE name LIKE 'S%' LIMIT 2;

GROUP BY and COUNT are commonly used for getting some numerical values:

-- group by (counts rows grouped by country)
-- name may be replaced by any column
SELECT COUNT(name), hq_country FROM company GROUP BY hq_country;

Sort (order) the results. Ascending is the default ordering, so the ASC keyword is redundant here:

-- order result
SELECT birth_date, first_name, last_name FROM customer WHERE first_name LIKE 'Fran%' ORDER BY first_name, last_name ASC; -- ASC is redundant
SELECT birth_date, first_name, last_name FROM customer WHERE first_name LIKE 'Fran%' ORDER BY `birth_date` DESC;

But descending is not the default, so do not forget the DESC keyword.

We said that reading data is only a relatively safe operation, because data are not modified. But the other side of the coin is that selecting data is not free: sometimes it heavily impacts the database, which is doing all the hard work for us, especially when we write a complex, inefficient query that should have been optimized.

Generally, SQL and databases are designed and optimized for data handling, even when dealing with large amounts of data. Example: it might not be the best idea to map 100K records to ORM entities, then to Data Transfer Objects or other Java objects, in order to perform operations such as sort or filter on them through Java streams.

On the other hand, a database might not necessarily be optimized for a given use case. Not to mention that sometimes it is cheaper to fetch a bigger chunk of data in one query and then process it programmatically, just to avoid the n+1 problem.

Quid pro quo.

Update

Once the database has been created and data inserted, the data can be updated (this is the Update part of CRUD). Modifying data is doubly burdensome. First, the data to be updated must be selected beforehand, according to some criteria. Here, as we said before, there might be some performance issues, no matter whether we want to make a single update (one time, “by hand”) or a regular one, as part of the normal flow of the application.

Secondly, we are changing the data. We can lose some information or break the data integrity.

SELECT some data first. If the SELECT works correctly, then you can think of an UPDATE.

-- update record
UPDATE company SET name = 'Seoul 88' WHERE name = 'SEOUL_88';
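The "SELECT first" habit can also be automated: preview the rows with the same condition, run the UPDATE inside a transaction, and commit only if the number of affected rows matches the preview. The sketch below shows one possible way to do it, on SQLite through Python's standard `sqlite3` module; the sample rows and the `CONDITION` constant are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO company VALUES (1, 'SEOUL_88'), (2, 'Sunrise Ltd.'), (3, 'SEOUL_88');
""")

# The same condition is reused by the preview and by the update.
CONDITION = "name = ?"
params = ("SEOUL_88",)

# Step 1: preview which rows would be touched.
preview = conn.execute(
    f"SELECT company_id FROM company WHERE {CONDITION} ORDER BY company_id", params
).fetchall()

# Step 2: run the UPDATE and compare the affected row count with the preview.
cursor = conn.execute(
    f"UPDATE company SET name = 'Seoul 88' WHERE {CONDITION}", params
)
updated = cursor.rowcount
if updated == len(preview):
    conn.commit()    # exactly the previewed rows were changed
else:
    conn.rollback()  # something unexpected happened, undo the change

final_rows = conn.execute(
    "SELECT company_id, name FROM company ORDER BY company_id"
).fetchall()
print(preview, updated)
print(final_rows)
```

The rollback branch is the safety net: as long as the transaction is not committed, a wrong UPDATE can still be undone.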

UPDATE with JOIN:

-- update record copying column from joined table
UPDATE company
    INNER JOIN customer
        ON company.customer_id = customer.customer_id
SET company.name = CONCAT(
        company.name,
        '_',
        customer.first_name,
        '_',
        customer.last_name);

REVERSE function:

-- REVERSE name
UPDATE company
    INNER JOIN
    customer
ON company.customer_id = customer.customer_id
SET company.name = REVERSE(company.name);

Subtract from or add to a date:

-- SUB / ADD DATE
UPDATE company
SET established_date = DATE_SUB(established_date, INTERVAL 1 YEAR)
WHERE
        established_date > '2020-01-01';

More painstaking tricks:

-- insert space before last three chars:
-- (e.g. Entity Ltd instead of EntityLtd)
-- remove last three chars
-- concat string, space and last three chars
UPDATE company
SET name = CONCAT(LEFT(name, CHAR_LENGTH(name) - 3), ' ', RIGHT(name, 3))
WHERE
        established_date > '2020-01-01';

-- subtract 2 years from the date for an even year, odd id and given country
-- subtract 1 year for an even year, even id and given country
UPDATE company
SET established_date = (
    CASE
        WHEN
            EXTRACT(YEAR FROM established_date) % 2 = 0
            AND company_id % 2 != 0
            AND hq_country = 'USA'
        THEN DATE_SUB(established_date, INTERVAL 2 YEAR)
        WHEN
            EXTRACT(YEAR FROM established_date) % 2 = 0
            AND company_id % 2 = 0
            AND hq_country = 'USA'
        THEN DATE_SUB(established_date, INTERVAL 1 YEAR)
        ELSE established_date
    END
);

-- funny thing, UPDATE date (if null, use current) by reversing year
UPDATE company
SET established_date = CONCAT(
    -- a CASE should be added for the null check...
    REVERSE(EXTRACT(YEAR from CURDATE())),
                              '-',
                              EXTRACT(MONTH from CURDATE()),
                              '-',
                              EXTRACT(DAY from CURDATE()))
WHERE
        name = 'Ale Lipa';

Delete

Finally, the last item of CRUD: data deletion. It is risky because of potential unwanted data loss. DELETE is not executed very frequently “in real application life”. Also, an external customer or user of corporate-grade software hardly ever has an easy, overt way to trigger direct data deletion. More often, it is a multistep process for security reasons. And there should be backups… but as all security experts know, sometimes there are no backups.

Here, we can use the same trick as with the update. First write the query with SELECT instead of DELETE. If it selects exactly what we wanted, we can swap the keywords (DELETE instead of SELECT).

Simple SQL syntax for delete looks like:

-- DELETE row duplicates (copies)
DELETE FROM
    company
WHERE
    company.name LIKE '%\_COPY'; -- the backslash makes the underscore literal; unescaped, _ matches any single character
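
One detail matters a lot in a DELETE: in a LIKE pattern the underscore is itself a wildcard, so a pattern meant to match a literal `_COPY` suffix can match more rows than intended. The sketch below shows the effect with a SELECT, on SQLite through Python's standard `sqlite3` module. The company names are invented sample data, and `!` is an arbitrarily chosen escape character declared with the standard `ESCAPE` clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (name TEXT);
    INSERT INTO company VALUES ('Sunrise_COPY'), ('SunriseXCOPY'), ('Sunset Co.');
""")


def matching_names(like_clause):
    """Return the names matched by the given LIKE clause, sorted by name."""
    sql = f"SELECT name FROM company WHERE name LIKE {like_clause} ORDER BY name"
    return [name for (name,) in conn.execute(sql)]


# Unescaped: _ stands for any single character, so 'SunriseXCOPY' matches too.
unescaped = matching_names("'%_COPY'")

# Escaped: !_ is a literal underscore because of the ESCAPE clause.
escaped = matching_names("'%!_COPY' ESCAPE '!'")

print(unescaped)
print(escaped)
```

Checking the pattern with a SELECT like this before turning it into a DELETE shows immediately whether rows without the underscore would be removed as well.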

DELETE with JOIN:

-- JOIN and DELETE
-- joining three tables, delete records from two (branch remains intact)
DELETE customer, bc FROM
customer
INNER JOIN
    branch_customers bc
ON
    customer.customer_id = bc.customer_id
INNER JOIN
    branch b
ON
    bc.branch_id = b.branch_id
WHERE
    customer.last_name LIKE '%smith%'
AND
    bc.to_date = '9999-01-01'
AND
    customer.gender = 'M'

Table removal:

-- remove table
DROP TABLE company;

But do not do this in production (or any other important environment) unless you are told to do so, and even then, double-check it with someone.

TBC

SQL cheatsheet: part 3

2023-04-26T01:23:00+00:00

Previously on SQL: aggregations, group by, where vs having

Today let’s talk about joining results of different searches.

Union

UNION merges multiple queries into one result. Here we are selecting sample rows that do not exist in any table, together with their column aliases:

-- UNION merges multiple queries as one result
SELECT
    1 AS id, 'Sunrise Ltd.' AS name
UNION
SELECT
    2 AS id, 'Sunset Co.' AS name;

UNION acts like union operator known from set theory, algebra of sets and Boolean algebra.

Only distinct rows are included. Two rows are both kept only if they differ in at least one field:

-- only distinct rows are included: prints only one record
SELECT
    1 AS id, 'Sunrise Ltd.' AS name
UNION
SELECT
    1 AS id, 'Sunrise Ltd.' AS name;

-- selects both records:
SELECT
    1 AS id, 'Sunrise Ltd.' AS name
UNION
SELECT
    1 AS id, 'Sunset Ltd.' AS name;

UNION ALL allows duplicated results:

-- UNION ALL allows duplicated rows
SELECT
    * FROM company
UNION ALL
SELECT
    * FROM company;

Now test it on real data - records matching the first condition (WHERE hq_country = 'JPN') are not re-selected by the second part of the query (SELECT * FROM company):

SELECT * FROM company WHERE hq_country = 'JPN'
UNION
SELECT * FROM company

This works like a simple SELECT * FROM company - it does not duplicate the results:

SELECT * FROM company
UNION
SELECT * FROM company

Finally, a clean and logical example of a union of two selects. It takes everything from the first set and adds everything from the second one:

SELECT * FROM company WHERE hq_country = 'JPN'
UNION
SELECT * FROM company WHERE hq_country = 'KOR'

Of course, it is possible to union results from different tables.

-- UNION from different tables is possible but the result set must have same number of columns
-- error:
SELECT
    * FROM company
UNION ALL
SELECT
    * FROM customer;

It does not work, returning [21000][1222] The used SELECT statements have a different number of columns.

Let’s correct it, adjusting requested number of columns:

-- works:
SELECT
    name AS company_or_customer_name, customer_id as id FROM company
UNION ALL
SELECT
    CONCAT(last_name, ' ', first_name), customer_id FROM customer;

In some flavours (Postgres, Oracle), the columns selected in both SELECT clauses must have compatible types. MySQL and MariaDB are more lenient and convert the values implicitly.

Inner join

INNER JOIN connects records from two (or even more) tables.

To match a record from one table to the relevant record in another table, it uses fields (columns) marked as keys: a primary key and a foreign key. The foreign key of a record in one table points to the primary key of the relevant record in the connected table.

Usually, id values are used as primary and foreign keys.

-- INNER JOIN returns records with matching values in both tables (here: customer_id)
SELECT * FROM
    customer
INNER JOIN
    field f ON customer.customer_id = f.customer_id
WHERE
    f.field_name = 'Engineer';

Step-by-step explanation of the script:

-- take all records from ``customer`` table
SELECT * FROM customer
-- connect to records from ``field`` table
INNER JOIN field f
-- but only when ``customer_id`` in ``customer`` table (for given record) matches ``customer_id`` in ``field`` table
ON customer.customer_id = f.customer_id
-- Additional condition: do it only if ``field_name`` in ``field`` table is ``Engineer`` (and discard all the rest).
WHERE f.field_name = 'Engineer';

Primary to foreign key connection:

customer.customer_id -- primary key in ``customer`` table
= 
f.customer_id -- foreign key in ``field`` table

Primary key - a unique field or combination of fields; only one row with a given PK may exist in a table.

Foreign key - a field or combination of fields that indicates the primary key of a row in another table. It may be unique or not.
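
The two definitions can be checked against a real engine. The sketch below uses SQLite via Python's sqlite3 module (an assumption, chosen so the example runs without a server). The column list, including field_id, is hypothetical. SQLite enforces foreign keys only after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- primary key: unique per row
        last_name   TEXT
    );
    CREATE TABLE field (
        field_id    INTEGER PRIMARY KEY,   -- hypothetical column
        field_name  TEXT,
        customer_id INTEGER REFERENCES customer (customer_id)  -- foreign key
    );
    INSERT INTO customer VALUES (1, 'Kowalski');
    -- the same foreign key value may repeat: it does not have to be unique
    INSERT INTO field VALUES (1, 'Engineer', 1);
    INSERT INTO field VALUES (2, 'Teacher', 1);
""")

def try_insert(sql):
    """Run one INSERT and report whether a key constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

# A second row with the same primary key is not allowed
duplicate_pk = try_insert("INSERT INTO customer VALUES (1, 'Nowak')")
# A foreign key must point to an existing primary key
dangling_fk = try_insert("INSERT INTO field VALUES (3, 'Pilot', 99)")

print(duplicate_pk, dangling_fk)  # prints: rejected rejected
```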

INNER JOIN returns only the records for which the join condition is satisfied, so a record whose foreign key is NULL is left out. It is logical: NULL never equals any value, so there is no way to connect such a record with another table (primary keys themselves are never NULL).

-- INNER JOIN shows only the records from company whose customer_id is NOT NULL and matches a customer
-- use OUTER JOINS: LEFT / RIGHT JOIN etc. if you expect records with null fields to be included
SELECT * FROM
    company
INNER JOIN
    customer c ON company.customer_id = c.customer_id;
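
A small sketch of that behaviour follows. It assumes SQLite via Python's sqlite3 module instead of MySQL, and hypothetical rows, purely for illustration. The company with a NULL customer_id disappears from the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: 'Globex' has no customer assigned
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE company  (company_id INTEGER PRIMARY KEY, name TEXT, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Kowalski'), (2, 'Nowak');
    INSERT INTO company  VALUES (10, 'Acme', 1), (11, 'Globex', NULL);
""")

joined = conn.execute("""
    SELECT company.name, c.last_name
    FROM company
    INNER JOIN customer c ON company.customer_id = c.customer_id
""").fetchall()

print(joined)  # prints: [('Acme', 'Kowalski')]
```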

Using

Instead of explicitly connecting primary key to foreign key, we can indicate it via USING, provided that the key column has the same name in both tables:

-- USING
SELECT * FROM
    company
        INNER JOIN
    customer USING(customer_id);
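
One visible difference between ON and USING shows up with SELECT *: with USING the shared column is listed once. The sketch below checks this on SQLite through Python's sqlite3 module; the engine and the table layouts are assumptions made for the sake of a runnable example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layouts sharing the customer_id column name
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE company  (company_id INTEGER PRIMARY KEY, name TEXT, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Kowalski');
    INSERT INTO company  VALUES (10, 'Acme', 1);
""")

# Explicit condition: customer_id appears once per table
on_cols = [d[0] for d in conn.execute("""
    SELECT * FROM company
    INNER JOIN customer ON company.customer_id = customer.customer_id
""").description]

# USING: the shared column appears only once in the result
using_cols = [d[0] for d in conn.execute("""
    SELECT * FROM company
    INNER JOIN customer USING (customer_id)
""").description]

print(on_cols)
print(using_cols)
```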

Inner join on more than two tables

Inner join can connect records from more than two tables, provided that they contain relevant ids (foreign keys). It is useful when multiple conditions using information from various tables are required.

-- INNER JOIN with two more tables, both containing customer_id
SELECT  * FROM
    customer
        INNER JOIN
    field f on customer.customer_id = f.customer_id
        INNER JOIN
    turnover t on customer.customer_id = t.customer_id
WHERE
    customer.registration_date > '2000-01-01'
   OR (
            customer.birth_date < '1980-01-01'
        AND
            t.turnover < 10000
    )
   OR (
            customer.birth_date < '1960-01-01'
        AND
            f.field_name NOT LIKE  '%Engineer%'
    );

Another example: joining through a third, helper table:

-- JOIN through third table
SELECT *
FROM customer
         INNER JOIN
     branch_customers
     ON
         customer.customer_id = branch_customers.customer_id
         INNER JOIN
     branch
     ON
         branch_customers.branch_id = branch.branch_id

Use case - find the number of customers per branch:

-- how many CUSTOMERS per BRANCH ?
SELECT branch_name, COUNT(*) AS number_of_customers
FROM customer
         INNER JOIN
     branch_customers
     ON customer.customer_id = branch_customers.customer_id
         INNER JOIN
     branch
     ON branch_customers.branch_id = branch.branch_id
GROUP BY branch_name
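
Here is the same query run on a few made-up rows, as a rough sketch. SQLite via Python's sqlite3 module and all the sample data are assumptions; an ORDER BY is added only to make the output stable. A customer who belongs to no branch is not counted anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: customer 2 belongs to both branches, customer 3 to none
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE branch   (branch_id INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE branch_customers (branch_id INTEGER, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Kowalski'), (2, 'Nowak'), (3, 'Wisniewski');
    INSERT INTO branch   VALUES (1, 'North'), (2, 'South');
    INSERT INTO branch_customers VALUES (1, 1), (1, 2), (2, 2);
""")

counts = conn.execute("""
    SELECT branch_name, COUNT(*) AS number_of_customers
    FROM customer
    INNER JOIN branch_customers
        ON customer.customer_id = branch_customers.customer_id
    INNER JOIN branch
        ON branch_customers.branch_id = branch.branch_id
    GROUP BY branch_name
    ORDER BY branch_name
""").fetchall()

print(counts)  # prints: [('North', 2), ('South', 1)]
```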

Cross join

Cross join joins each row of the first table with each row of the second table. This join type is also known as Cartesian join.

-- CROSS JOIN joins all rows from one table with all rows of the second table
-- without a condition it makes a Cartesian product
-- with the WHERE condition below it returns the same rows as an INNER JOIN
SELECT * FROM
    customer
        CROSS JOIN
    company
WHERE customer.customer_id = company.customer_id
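
To get a feeling for the size of a Cartesian product, here is a minimal sketch on hypothetical data (3 customers, 2 companies), again assuming SQLite through Python's sqlite3 module. It also compares the filtered cross join with an INNER JOIN on the same condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: 3 customers and 2 companies
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE company  (company_id INTEGER PRIMARY KEY, name TEXT, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Kowalski'), (2, 'Nowak'), (3, 'Wisniewski');
    INSERT INTO company  VALUES (10, 'Acme', 1), (11, 'Globex', 2);
""")

# No condition: every customer is paired with every company (3 * 2 rows)
cross = conn.execute("SELECT * FROM customer CROSS JOIN company").fetchall()

# With the condition, only matching pairs remain
filtered = conn.execute("""
    SELECT * FROM customer CROSS JOIN company
    WHERE customer.customer_id = company.customer_id
""").fetchall()

inner = conn.execute("""
    SELECT * FROM customer INNER JOIN company
    ON customer.customer_id = company.customer_id
""").fetchall()

print(len(cross), len(filtered), len(inner))  # prints: 6 2 2
```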

Left, right and outer join

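LEFT JOIN shows everything from the left table, plus the matching records from the right table; where there is no match, the right-table columns are NULL.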

-- LEFT JOIN shows all rows from left table (company) - even the records that cannot be joined
-- with customer table records due to NULL customer_id in company table
SELECT * FROM
    company
LEFT JOIN
    customer c ON company.customer_id = c.customer_id

RIGHT JOIN takes every record from the right table, plus the matching records from the left table.

-- on the other hand, RIGHT JOIN shows all rows from right table (customer)
-- - even the records that cannot be joined with company table
-- due to missing customer_id in company table
SELECT * FROM
    company
RIGHT JOIN
    customer c ON company.customer_id = c.customer_id
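
The sketch below shows both directions on a few hypothetical rows. It assumes SQLite via Python's sqlite3 module; because older SQLite versions have no RIGHT JOIN, the right-join result is obtained here by a LEFT JOIN with the tables swapped, which returns the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: 'Globex' has no customer, 'Nowak' has no company
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE company  (company_id INTEGER PRIMARY KEY, name TEXT, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Kowalski'), (2, 'Nowak');
    INSERT INTO company  VALUES (10, 'Acme', 1), (11, 'Globex', NULL);
""")

# All companies, with NULL (None in Python) where no customer matches
left_rows = conn.execute("""
    SELECT company.name, c.last_name
    FROM company
    LEFT JOIN customer c ON company.customer_id = c.customer_id
    ORDER BY company.company_id
""").fetchall()

# All customers: same rows as company RIGHT JOIN customer
right_like_rows = conn.execute("""
    SELECT company.name, c.last_name
    FROM customer c
    LEFT JOIN company ON company.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(left_rows)        # prints: [('Acme', 'Kowalski'), ('Globex', None)]
print(right_like_rows)  # prints: [('Acme', 'Kowalski'), (None, 'Nowak')]
```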

FULL OUTER JOIN lists all records from left and right, even if they have null as their id (so that they cannot be normally joined).

-- FULL OUTER JOIN lists all rows from both tables
-- no matter if NULL
-- FULL OUTER JOIN is not supported in MySQL
-- workaround: https://www.xaprb.com/blog/2006/05/26/how-to-write-full-outer-join-in-mysql/
SELECT * FROM
    company
        LEFT OUTER JOIN
    customer c ON company.customer_id = c.customer_id
UNION
SELECT * FROM
    company
        RIGHT OUTER JOIN
    customer c ON company.customer_id = c.customer_id
ORDER BY company_id DESC
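
The idea of the workaround can be tested on a tiny data set. This is only a sketch: it assumes SQLite via Python's sqlite3 module, hypothetical rows, and a second LEFT JOIN with swapped tables in place of RIGHT JOIN (older SQLite versions lack it).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: one matched pair, one lonely company, one lonely customer
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE company  (company_id INTEGER PRIMARY KEY, name TEXT, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Kowalski'), (2, 'Nowak');
    INSERT INTO company  VALUES (10, 'Acme', 1), (11, 'Globex', NULL);
""")

full_outer = conn.execute("""
    SELECT company.name, c.last_name
    FROM company
    LEFT JOIN customer c ON company.customer_id = c.customer_id
    UNION
    SELECT company.name, c.last_name
    FROM customer c
    LEFT JOIN company ON company.customer_id = c.customer_id
""").fetchall()

print(full_outer)
```

The matched pair comes from both halves, and UNION keeps it once. One caveat: UNION also removes genuine duplicates, so if the joined result can legitimately contain two identical rows, this workaround returns fewer rows than a real FULL OUTER JOIN would.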

Another workaround for FULL OUTER JOIN:

-- workaround of FULL OUTER JOIN without using LEFT / RIGHT JOIN
SELECT * FROM
    company
        INNER JOIN
    customer
    ON
            company.customer_id = customer.customer_id
UNION
SELECT *, NULL, NULL, NULL, NULL, NULL, NULL FROM
    company
WHERE
    NOT
            company.customer_id
            IN
            (
                SELECT DISTINCT
                    company.customer_id
                FROM
                    company
                        INNER JOIN
                    customer
                    ON
                            company.customer_id = customer.customer_id
            )
   OR
    company.customer_id IS NULL
UNION
SELECT
       NULL, NULL, NULL, NULL, NULL,
       customer.customer_id, customer.birth_date, customer.first_name,
       customer.last_name, customer.gender, customer.registration_date
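FROM customer
-- keep only customers without a company, matched customers are already in the first SELECT
WHERE
    customer.customer_id
    NOT IN
    (
        SELECT company.customer_id FROM company WHERE company.customer_id IS NOT NULL
    )
ORDER BY company_id DESC
    LIMIT 10;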

UNION workarounds for JOIN

-- UNION workaround instead of OUTER JOIN (without LEFT / RIGHT JOIN)
-- customer_id must be not null: NOT IN never matches a NULL customer_id
SELECT name, company.customer_id FROM
    company
        INNER JOIN
    customer
        ON
    company.customer_id = customer.customer_id
UNION
SELECT name, company.customer_id FROM
    company
WHERE
    NOT
        company.customer_id
    IN
        (
        SELECT DISTINCT
            company.customer_id
        FROM
            company
                INNER JOIN
            customer
                ON
            company.customer_id = customer.customer_id
    )

-- above workaround with all columns from both tables included
-- and rows with null customer_id
SELECT * FROM -- returns columns from company and customer
    company
        INNER JOIN
    customer
    ON
            company.customer_id = customer.customer_id
UNION
SELECT *, NULL, NULL, NULL, NULL, NULL, NULL FROM
-- returns columns from company only (no join), hence NULLs to replace the missing columns from customer
    company
WHERE
    NOT
            company.customer_id
            IN
            (
                SELECT DISTINCT
                    company.customer_id
                FROM
                    company
                        INNER JOIN
                    customer
                    ON
                            company.customer_id = customer.customer_id
            )
    OR
        company.customer_id IS NULL;
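
Why is the extra OR ... IS NULL needed? A comparison with NULL is never true, so NOT IN silently skips rows with a NULL customer_id. The sketch below demonstrates it with a simplified subquery on hypothetical rows, assuming SQLite via Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: 'Globex' has NULL customer_id, 'Initech' points to a missing customer
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE company  (company_id INTEGER PRIMARY KEY, name TEXT, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Kowalski');
    INSERT INTO company  VALUES (10, 'Acme', 1), (11, 'Globex', NULL), (12, 'Initech', 99);
""")

# NULL NOT IN (...) evaluates to NULL, so 'Globex' is filtered out
without_null_check = conn.execute("""
    SELECT name FROM company
    WHERE company.customer_id NOT IN (SELECT customer_id FROM customer)
    ORDER BY company_id
""").fetchall()

# The explicit IS NULL test brings the row back
with_null_check = conn.execute("""
    SELECT name FROM company
    WHERE company.customer_id NOT IN (SELECT customer_id FROM customer)
       OR company.customer_id IS NULL
    ORDER BY company_id
""").fetchall()

print(without_null_check)  # prints: [('Initech',)]
print(with_null_check)     # prints: [('Globex',), ('Initech',)]
```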

TBC