How does a relational database work

阿新 • Published: 2018-12-29

When it comes to relational databases, I can’t help thinking that something is missing. They’re used everywhere. There are many different databases: from the small and useful SQLite to the powerful Teradata. But there are only a few articles that explain how a database works. You can google “how does a relational database work” yourself to see how few results there are. Moreover, those articles are short. Now, if you look for the latest trendy technologies (Big Data, NoSQL or JavaScript), you’ll find more in-depth articles explaining how they work.

Are relational databases too old and too boring to be explained outside of university courses, research papers and books?

As a developer, I HATE using something I don’t understand. And, if databases have been used for 40 years, there must be a reason. Over the years, I’ve spent hundreds of hours to really understand these weird black boxes I use every day. Relational databases are very interesting because they’re based on useful and reusable concepts. If understanding a database interests you but you’ve never had the time or the will to dig into this wide subject, you should like this article.

Though the title of this article is explicit, the aim of this article is NOT to understand how to use a database. Therefore, you should already know how to write a simple join query and basic CRUD queries; otherwise you might not understand this article. This is the only thing you need to know, I’ll explain everything else.

I’ll start with some computer science stuff like time complexity. I know that some of you hate this concept but, without it, you can’t understand the cleverness inside a database. Since it’s a huge topic, I’ll focus on what I think is essential: the way a database handles an SQL query. I’ll only present the basic concepts behind a database so that at the end of the article you’ll have a good idea of what’s happening under the hood.

Since it’s a long and technical article that involves many algorithms and data structures, take your time to read it. Some concepts are more difficult to understand; you can skip them and still get the overall idea.

For the more knowledgeable of you, this article is more or less divided into 3 parts:

An overview of low-level and high-level database components
An overview of the query optimization process
An overview of the transaction and buffer pool management

Back to basics

A long time ago (in a galaxy far, far away….), developers had to know exactly the number of operations they were coding. They knew by heart their algorithms and data structures because they couldn’t afford to waste the CPU and memory of their slow computers.

In this part, I’ll remind you about some of these concepts because they are essential to understand a database. I’ll also introduce the notion of database index.

O(1) vs O(n²)

Nowadays, many developers don’t care about time complexity … and they’re right!

But when you deal with a large amount of data (I’m not talking about thousands) or if you’re fighting for milliseconds, it becomes critical to understand this concept. And guess what, databases have to deal with both situations! I won’t bore you for long, just long enough to get the idea. This will help us later to understand the concept of cost-based optimization.

The concept

The time complexity is used to see how long an algorithm will take for a given amount of data. To describe this complexity, computer scientists use the mathematical big O notation. This notation is used with a function that describes how many operations an algorithm needs for a given amount of input data.

For example, when I say “this algorithm is in O( some_function() )”, it means that for a certain amount of data the algorithm needs some_function(a_certain_amount_of_data) operations to do its job.

What’s important is not the amount of data but the way the number of operations increases when the amount of data increases. The time complexity doesn’t give the exact number of operations but a good idea.

In this figure, you can see the evolution of different types of complexities. I used a logarithmic scale to plot it. In other words, the amount of data is quickly increasing from 1 to 1 billion. We can see that:

The O(1) or constant complexity stays constant (otherwise it wouldn’t be called constant complexity).
The O(log(n)) stays low even with billions of elements.
The worst complexity is the O(n²) where the number of operations quickly explodes.
The two other complexities, O(n) and O(n*log(n)), are quickly increasing.

Examples

With a low amount of data, the difference between O(1) and O(n²) is negligible. For example, let’s say you have an algorithm that needs to process 2000 elements.

An O(1) algorithm will cost you 1 operation
An O(log(n)) algorithm will cost you about 7 operations (using the natural logarithm; about 11 in base 2)
An O(n) algorithm will cost you 2 000 operations
An O(n*log(n)) algorithm will cost you 14 000 operations
An O(n²) algorithm will cost you 4 000 000 operations

The difference between O(1) and O(n²) seems huge (4 million operations) but you’ll lose at most a few milliseconds, less than the time it takes to blink. Indeed, current processors can handle billions of simple operations per second. This is why performance and optimization are not an issue in many IT projects.

As I said, it’s still important to know this concept when facing a huge amount of data. If this time the algorithm needs to process 1 000 000 elements (which is not that big for a database):

An O(1) algorithm will cost you 1 operation
An O(log(n)) algorithm will cost you about 14 operations (using the natural logarithm; about 20 in base 2)
An O(n) algorithm will cost you 1 000 000 operations
An O(n*log(n)) algorithm will cost you 14 000 000 operations
An O(n²) algorithm will cost you 1 000 000 000 000 operations

I didn’t do the math but I’d say with the O(n²) algorithm you have the time to take a coffee (even a second one!). If you put another 0 on the amount of data, you’ll have the time to take a long nap.
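To put numbers on the two lists above, here is a minimal Python sketch. It assumes the natural logarithm (which is what the rounded figures above roughly correspond to) and treats each complexity as an exact operation count, which the big O notation does not actually promise. The function name is mine.

```python
import math

def operation_counts(n):
    """Rough operation counts for n elements, one entry per complexity class."""
    log_n = math.log(n)  # natural logarithm; another base only changes a constant factor
    return {
        "O(1)": 1,
        "O(log n)": round(log_n),
        "O(n)": n,
        "O(n log n)": round(n * log_n),
        "O(n^2)": n * n,
    }

for size in (2_000, 1_000_000):
    print(size, operation_counts(size))
```

The exact values differ slightly from the rounded ones in the lists, but the orders of magnitude are the same, and that is the only thing that matters here.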

Going deeper

To give you an idea:

A search in a good hash table gives an element in O(1)
A search in a well-balanced tree gives a result in O(log(n))
A search in an array gives a result in O(n)
The best sorting algorithms have an O(n*log(n)) complexity.
A bad sorting algorithm has an O(n²) complexity

Note: In the next parts, we’ll see these algorithms and data structures.

There are multiple types of time complexity:

the average case scenario
the best case scenario
and the worst case scenario

The time complexity is often the worst case scenario.

I only talked about time complexity but complexity also works for:

the memory consumption of an algorithm
the disk I/O consumption of an algorithm

Of course there are worse complexities than n², like:

n⁴: that sucks! Some of the algorithms I’ll mention have this complexity.
3ⁿ: that sucks even more! One of the algorithms we’re going to see in the middle of this article has this complexity (and it’s really used in many databases).
factorial n : you’ll never get your results, even with a low amount of data.
nⁿ: if you end-up with this complexity, you should ask yourself if IT is really your field…

Note: I didn’t give you the real definition of the big O notation but just the idea. You can read this article on Wikipedia for the real (asymptotic) definition.

Merge Sort

What do you do when you need to sort a collection? What? You call the sort() function … ok, good answer… But for a database you have to understand how this sort() function works.

There are several good sorting algorithms so I’ll focus on the most important one: the merge sort. You might not understand right now why sorting data is useful but you should after the part on query optimization. Moreover, understanding the merge sort will help us later to understand a common database join operation called the merge join.

Merge

Like many useful algorithms, the merge sort is based on a trick: merging 2 sorted arrays of size N/2 into a N-element sorted array only costs N operations. This operation is called a merge.

Let’s see what this means with a simple example:

You can see on this figure that to construct the final sorted array of 8 elements, you only need to iterate once over the two 4-element arrays. Since both 4-element arrays are already sorted:

1) you compare both current elements in the 2 arrays (current=first for the first time)
2) then take the lowest one to put it in the 8-element array
3) and go to the next element in the array from which you took the lowest element
and repeat 1,2,3 until you reach the last element of one of the arrays.
Then you take the rest of the elements of the other array to put them in the 8-element array.

This works because both 4-element arrays are sorted and therefore you don’t need to “go back” in these arrays.
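The steps above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample values are mine (the figure's values are not reproduced here).

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list in len(left) + len(right) steps."""
    result = []
    i = j = 0
    # Steps 1 to 3: compare the current elements, take the lowest, advance in that list.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One list is exhausted: copy the rest of the other one.
    result.extend(left[i:])
    result.extend(right[j:])
    return result

print(merge([1, 3, 5, 7], [2, 4, 6, 8]))
```

Each index only moves forward, which is the "no going back" property described above.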

Now that we’ve understood this trick, here is my pseudocode of the merge sort.

array mergeSort(array a)
   //an array of 0 or 1 element is already sorted
   if(length(a)<=1)
      return a;
   end if

   //recursive calls
   [left_array right_array] := split_into_2_equally_sized_arrays(a);
   array new_left_array := mergeSort(left_array);
   array new_right_array := mergeSort(right_array);

   //merging the 2 small ordered arrays into a big one
   array result := merge(new_left_array,new_right_array);
   return result;

The merge sort breaks the problem into smaller problems then finds the results of the smaller problems to get the result of the initial problem (note: this kind of algorithm is called divide and conquer). If you don’t understand this algorithm, don’t worry; I didn’t understand it the first time I saw it. If it can help you, I see this algorithm as a two-phase algorithm:

The division phase where the array is divided into smaller arrays
The sorting phase where the small arrays are put together (using the merge) to form a bigger array.
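For readers who prefer running code to pseudocode, here is one possible Python version of the merge sort. It is a plain, non-optimized sketch (it creates new lists at every level), not what a database actually ships.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge_sort(a):
    """Return a sorted copy of the list a."""
    if len(a) <= 1:
        return a  # nothing to divide any more
    middle = len(a) // 2
    # Division phase: recurse on each half.
    left = merge_sort(a[:middle])
    right = merge_sort(a[middle:])
    # Sorting phase: merge the two sorted halves.
    return merge(left, right)

print(merge_sort([5, 2, 8, 1, 9, 3, 7, 4]))
```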

Division phase

During the division phase, the array is divided into unitary arrays using 3 steps. The formal number of steps is log(N) (since N=8, log(N) = 3).

How do I know that?

I’m a genius! In one word: mathematics. The idea is that each step divides the size of the initial array by 2. The number of steps is the number of times you can divide the initial array by two. This is the exact definition of logarithm (in base 2).

Sorting phase

In the sorting phase, you start with the unitary arrays. During each step, you apply multiple merges and the overall cost of each step is N=8 operations:

In the first step you have 4 merges that cost 2 operations each
In the second step you have 2 merges that cost 4 operations each
In the third step you have 1 merge that costs 8 operations

Since there are log(N) steps and each step costs N operations, the overall cost is N * log(N) operations.
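If you want to check this count, the small sketch below lists the merges of each step for an array whose size is a power of two (an assumption that keeps the arithmetic clean). The function name is mine.

```python
import math

def merge_steps(n):
    """For n a power of two, return (number_of_merges, cost_per_merge) for each step."""
    steps = []
    size = 2  # size of the arrays produced by the current step
    while size <= n:
        steps.append((n // size, size))
        size *= 2
    return steps

steps = merge_steps(8)
total = sum(merges * cost for merges, cost in steps)
print(steps)                            # one tuple per step of the sorting phase
print(total, 8 * int(math.log2(8)))     # total cost next to N * log2(N)
```

Every step costs N operations, and there are log2(N) steps, so both printed totals agree.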

The power of the merge sort

Why is this algorithm so powerful?

Because:

You can modify it in order to reduce the memory footprint, in a way that you don’t create new arrays but you directly modify the input array.

Note: this kind of algorithm is called in-place.

You can modify it in order to use disk space and a small amount of memory at the same time without a huge disk I/O penalty. The idea is to load in memory only the parts that are currently processed. This is important when you need to sort a multi-gigabyte table with only a memory buffer of 100 megabytes.

Note: this kind of algorithm is called external sorting.
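To make the idea of external sorting more concrete, here is a toy sketch in Python. It assumes the data is a stream of integers, that only chunk_size of them fit in memory at once, and it uses temporary files as the "disk"; a real database works on pages and buffers, not on text lines. The function name and parameters are mine.

```python
import heapq
import tempfile

def external_sort(numbers, chunk_size):
    """Yield the integers of `numbers` in sorted order, keeping at most
    chunk_size of them in memory while the sorted runs are created."""
    runs = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and spill it to a temporary file, one value per line.
        chunk.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{value}\n" for value in chunk)
        run.seek(0)
        runs.append(run)
        chunk.clear()

    for value in numbers:
        chunk.append(value)
        if len(chunk) == chunk_size:
            flush()
    if chunk:
        flush()

    # Merge phase: heapq.merge pulls values lazily from each sorted run.
    streams = [(int(line) for line in run) for run in runs]
    yield from heapq.merge(*streams)
    for run in runs:
        run.close()

print(list(external_sort([5, 3, 9, 1, 8, 2, 7], chunk_size=3)))
```

The merge phase is the same merge as before, generalized to more than two sorted inputs.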

You can modify it to run on multiple processes/threads/servers.

For example, the distributed merge sort is one of the key components of Hadoop (which is THE framework in Big Data).

This algorithm can turn lead into gold (true fact!).

This sorting algorithm is used in most (if not all) databases but it’s not the only one. If you want to know more, you can read this research paper that discusses the pros and cons of the common sorting algorithms in a database.

Array, Tree and Hash table

Now that we understand the idea behind time complexity and sorting, I have to tell you about 3 data structures. It’s important because they’re the backbone of modern databases. I’ll also introduce the notion of database index.

Array

The two-dimensional array is the simplest data structure. A table can be seen as an array. For example:

This 2-dimensional array is a table with rows and columns:

Each row represents a subject
The columns are the features that describe the subjects.
Each column stores a certain type of data (integer, string, date …).

Though it’s great to store and visualize data, when you need to look for a specific value it sucks.

For example, if you want to find all the guys who work in the UK, you’ll have to look at each row to find if the row belongs to the UK. This will cost you N operations (N being the number of rows) which is not bad but could there be a faster way? This is where trees come into play.
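As a small illustration of this N-operation search, here is a sketch in Python. The table content and the function name are made up for the example.

```python
# A made-up table: each row is a dict, each key is a column.
people = [
    {"id": 1, "name": "Alice", "country": "UK"},
    {"id": 2, "name": "Bob", "country": "FR"},
    {"id": 3, "name": "Carol", "country": "UK"},
    {"id": 4, "name": "Dave", "country": "US"},
]

def full_scan(table, column, value):
    """Look at every row: N comparisons for N rows, whatever the result size."""
    return [row for row in table if row[column] == value]

print(full_scan(people, "country", "UK"))
```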

Note: Most modern databases provide advanced arrays to store tables efficiently like heap-organized tables or index-organized tables. But it doesn’t change the problem of fast searching for a specific condition on a group of columns.

Tree and database index

A binary search tree is a binary tree with a special property, the key in each node must be:

greater than all keys stored in the left sub-tree
smaller than all keys stored in the right sub-tree

Let’s see what it means visually

The idea

This tree has N=15 elements. Let’s say I’m looking for 208:

I start with the root whose key is 136. Since 136<208, I look at the right sub-tree of the node 136.
398>208 so, I look at the left sub-tree of the node 398
250>208 so, I look at the left sub-tree of the node 250
200<208 so, I look at the right sub-tree of the node 200. But 200 doesn’t have a right subtree, the value doesn’t exist (because if it did exist it would be in the right subtree of 200)

Now let’s say I’m looking for 40

I start with the root whose key is 136. Since 136>40, I look at the left sub-tree of the node 136.
80>40 so, I look at the left sub-tree of the node 80
40= 40, the node exists. I extract the id of the row inside the node (it’s not in the figure) and look at the table for the given row id.
Knowing the row id lets me know where the data is precisely on the table and therefore I can get it instantly.

In the end, both searches cost me the number of levels inside the tree. If you read carefully the part on the merge sort you should see that there are log(N) levels (as long as the tree is balanced). So the cost of the search is log(N), not bad!
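Here is a minimal Python sketch of these two searches. The tree only contains the nodes named in the walkthrough above (the rest of the 15-node figure is left out) and the row ids are invented for the example.

```python
class Node:
    def __init__(self, key, row_id, left=None, right=None):
        self.key = key
        self.row_id = row_id  # hypothetical location of the row in the table
        self.left = left
        self.right = right

def search(node, key):
    """Return (row_id or None, number of nodes visited)."""
    visited = 0
    while node is not None:
        visited += 1
        if key == node.key:
            return node.row_id, visited
        # Smaller keys live in the left sub-tree, greater keys in the right one.
        node = node.left if key < node.key else node.right
    return None, visited

tree = Node(136, "row-136",
            left=Node(80, "row-80", left=Node(40, "row-40")),
            right=Node(398, "row-398",
                       left=Node(250, "row-250", left=Node(200, "row-200"))))

print(search(tree, 208))  # 208 is not in the tree
print(search(tree, 40))   # 40 is found on the third level
```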

Back to our problem

But this stuff is very abstract so let’s go back to our problem. Instead of a stupid integer, imagine the string that represents the country of someone in the previous table. Suppose you have a tree that contains the column “country” of the table:

If you want to know who is working in the UK
you look at the tree to get the node that represents the UK
inside the “UK node” you’ll find the locations of the rows of the UK workers.

This search only costs you log(N) operations instead of N operations if you directly use the array. What you’ve just imagined was a database index.

You can build a tree index for any group of columns (a string, an integer, 2 strings, an integer and a string, a date …) as long as you have a function to compare the keys (i.e. the group of columns) so that you can establish an order among the keys (which is the case for any basic types in a database).
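To give a feel for what such an index does, here is a sketch in Python. Be careful: it uses a sorted list with a binary search (the bisect module) as a stand-in for the tree, because both give a log(N) lookup on ordered keys. The table and the function names are assumptions made for the example.

```python
import bisect

# A made-up table: (name, country)
table = [("Alice", "UK"), ("Bob", "FR"), ("Carol", "UK"), ("Dave", "US")]

def build_index(rows, column):
    """Return (sorted keys, row positions per key) for the given column number."""
    positions = {}
    for position, row in enumerate(rows):
        positions.setdefault(row[column], []).append(position)
    keys = sorted(positions)
    return keys, [positions[key] for key in keys]

def lookup(index, key):
    """Find the row positions for key with a binary search on the sorted keys."""
    keys, row_positions = index
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return row_positions[i]
    return []

country_index = build_index(table, column=1)
print([table[position] for position in lookup(country_index, "UK")])
```

The only requirement on the keys is that they can be compared, which is why the same code would work for a tuple of columns.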

B+Tree Index

Although this tree works well to get a specific value, there is a BIG problem when you need to get multiple elements between two values. It will cost O(N) because you’ll have to look at each node in the tree and check if it’s between these 2 values (for example, with an in-order traversal of the tree). Moreover this operation is not disk I/O friendly since you’ll have to read the full tree. We need to find a way to efficiently do a range query. To answer this problem, modern databases use a modified version of the previous tree called B+Tree. In a B+Tree:

only the lowest nodes (the leaves) store information (the location of the rows in the associated table)
the other nodes are just here to route to the right node during the search.

As you can see, there are more nodes (twice more). Indeed, you have additional nodes, the “decision nodes” that will help you to find the right node (that stores the location of the rows in the associated table). But the search complexity is still in O(log(N)) (there is just one more level). The big difference is that the lowest nodes are linked to their successors.

With this B+Tree, if you’re looking for values between 40 and 100:

You just have to look for 40 (or the closest value after 40 if 40 doesn’t exist) like you did with the previous tree.
Then gather the successors of 40 using the direct links to the successors until you reach 100.

Let’s say you found M successors and the tree has N nodes. The search for a specific node costs log(N) like the previous tree. But, once you have this node, you get the M successors in M operations with the links to their successors. This search only costs M + log(N) operations vs N operations with the previous tree. Moreover, you don’t need to read the full tree (just M + log(N) nodes), which means less disk usage. If M is low (like 200 rows) and N large (1 000 000 rows) it makes a BIG difference.
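The range search can be sketched as follows in Python. This is a simplification, not a real B+Tree: the leaves are linked to their successors as described, but the decision nodes are replaced by a single binary search on the first key of each leaf. All names and keys are invented for the example.

```python
import bisect

class Leaf:
    def __init__(self, keys):
        self.keys = keys   # sorted keys stored in this leaf
        self.next = None   # link to the successor leaf

def build_leaves(sorted_keys, leaf_size):
    """Cut the sorted keys into leaves and link each leaf to its successor."""
    leaves = [Leaf(sorted_keys[i:i + leaf_size])
              for i in range(0, len(sorted_keys), leaf_size)]
    for leaf, successor in zip(leaves, leaves[1:]):
        leaf.next = successor
    return leaves

def range_scan(leaves, low, high):
    """Return all keys k with low <= k <= high."""
    if not leaves:
        return []
    # Routing step (stand-in for the decision nodes): find the leaf that may hold `low`.
    # A real tree would descend its decision nodes instead of rebuilding this list.
    first_keys = [leaf.keys[0] for leaf in leaves]
    leaf = leaves[max(bisect.bisect_right(first_keys, low) - 1, 0)]
    result = []
    # Then follow the links between leaves until we go past `high`.
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result

leaves = build_leaves([10, 20, 40, 50, 80, 100, 120, 136, 200], leaf_size=3)
print(range_scan(leaves, 40, 100))
```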

But there are new problems (again!). If you add or remove a row in a database (and therefore in the associated B+Tree index):

you have to keep the order between nodes inside the B+Tree otherwise you won’t be able to find nodes inside the mess.
you have to keep the lowest possible number of levels in the B+Tree otherwise the time complexity in O(log(N)) will become O(N).

In other words, the B+Tree needs to be self-ordered and self-balanced. Thankfully, this is possible with smart deletion and insertion operations. But this comes with a cost: the insertion and deletion in a B+Tree are in O(log(N)). This is why some of you have heard that using too many indexes is not a good idea. Indeed, you’re slowing down the fast insertion/update/deletion of a row in a table since the database needs to update the indexes of the table with a costly O(log(N)) operation per index. Moreover, adding indexes means more workload for the transaction manager (we will see this manager at the end of the article).

For more details, you can look at the Wikipedia article about B+Tree. If you want an example of a B+Tree implementation in a database, look at this article and this article from a core developer of MySQL. They both focus on how InnoDB (the default storage engine of MySQL) handles indexes.

Note: I was told by a reader that, because of low-level optimizations, the B+Tree needs to be fully balanced.

Hash table

Our last important data structure is the hash table. It’s very useful when you want to quickly look for values. Moreover, understanding the hash table will help us later to understand a common database join operation called the hash join. This data structure is also used by a database to store some internal stuff (like the lock table or the buffer pool; we’ll see both concepts later).

The hash table is a data structure that quickly finds an element with its key. To build a hash table you need to define:

a key for your elements
a hash function for the keys. The computed hashes of the keys give the locations of the elements (called buckets).
a function to compare the keys. Once you found the right bucket you have to find the element you’re looking for inside the bucket using this comparison.

A simple example

Let’s have a visual example:

This hash table has 10 buckets. Since I’m lazy I only drew 5 buckets but I know you’re smart so I let you imagine the 5 others. The Hash function I used is the modulo 10 of the key. In other words I only keep the last digit of the key of an element to find its bucket:

if the last digit is 0 the element ends up in the bucket 0,
if the last digit is 1 the element ends up in the bucket 1,
if the last digit is 2 the element ends up in the bucket 2,
…

The compare function I used is simply the equality between 2 integers.

Let’s say you want to get the element 78:

The hash table computes the hash code for 78 which is 8.
It looks in the bucket 8, and the first element it finds is 78.
It gives you back the element 78
The search only costs 2 operations (1 for computing the hash value and the other for finding the element inside the bucket).

Now, let’s say you want to get the element 59:

The hash table computes the hash code for 59 which is 9.
It looks in the bucket 9, and the first element it finds is 99. Since 99!=59, element 99 is not the right element.
Using the same logic, it looks at the second element (9), the third (79), … , and the last (29).
The element doesn’t exist.
The search costs 7 operations.
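The two searches can be reproduced with a small Python sketch. It only stores the elements named in the text (78, 99, 9, 79 and 29); the figure's bucket 9 holds more elements than that, so the failed search costs 5 operations here instead of 7. The class and its methods are mine.

```python
class HashTable:
    def __init__(self, bucket_count):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # Hash function: the key modulo the number of buckets.
        return self.buckets[key % len(self.buckets)]

    def add(self, key):
        self._bucket(key).append(key)

    def search(self, key):
        """Return (found, operations): 1 operation for the hash, 1 per comparison."""
        operations = 1
        for candidate in self._bucket(key):
            operations += 1
            if candidate == key:  # compare function: integer equality
                return True, operations
        return False, operations

table = HashTable(10)
for key in (78, 99, 9, 79, 29):
    table.add(key)

print(table.search(78))  # found immediately in bucket 8
print(table.search(59))  # every element of bucket 9 is compared before giving up
```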

A good hash function

As you can see, depending on the value you’re looking for, the cost is not the same!

If I now change the hash function with the modulo 1 000 000 of the key (i.e. taking the last 6 digits), the second search only costs 1 operation because there are no elements in the bucket 000059. The real challenge is to find a good hash function that will create buckets that contain a very small amount of elements.
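To see the effect of this change, the sketch below measures the fullest bucket for the same keys with both hash functions. The list of keys is a made-up sample and the function name is mine.

```python
from collections import Counter

def largest_bucket(keys, bucket_count):
    """Size of the fullest bucket when keys are hashed with key % bucket_count."""
    sizes = Counter(key % bucket_count for key in keys)
    return max(sizes.values(), default=0)

keys = [78, 99, 9, 79, 29, 59]
print(largest_bucket(keys, 10))         # modulo 10: most keys share bucket 9
print(largest_bucket(keys, 1_000_000))  # modulo 1 000 000: one key per bucket
```

The price of the second hash function is the number of buckets to manage, which is why choosing a hash function is a trade-off and not just "take the biggest modulo".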

In my example, finding a good hash function is easy. But this is a simple example, finding a good hash function is more difficult when the key is:

a string (for example the last name of a person)
2 strings (for example the last name and the first name of a person)
2 strings and a date (for example the last name, the first name and the birth date of a person)
…

With a good hash function, the search in a hash table is in O(1).

Array vs hash table

Why not use an array?

Hum, you’re asking a good question.

A hash table can be half loaded in memory and the other buckets can stay on disk.
With an array you have to use a contiguous space in memory. If you’re loading a large table it’s very difficult to have enough contiguous space.
With a hash table you can choose the key you want (for example the country AND the last name of a person).

For more information, you can read my article on the Java HashMap which is an efficient hash table implementation; you don’t need to understand Java to understand the concepts inside this article.

Global overview

We’ve just seen the basic components inside a database. We now need to step back to see the big picture.

A database is a collection of information that can easily be accessed and modified. But a simple bunch of files could do the same. In fact, the simplest databases like SQLite are nothing more than a bunch of files. But SQLite is a well-crafted bunch of files because it allows you to:

use transactions that ensure data are safe and coherent
quickly process data even when you’re dealing with millions of data

More generally, a database can be seen as the following figure:

Before writing this part, I’ve read multiple books/papers and every source had its own way to represent a database. So, don’t focus too much on how I organized this database or how I named the processes because I made some choices to fit the plan of this article. What matters are the different components; the overall idea is that a database is divided into multiple components that interact with each other.

The core components:

The process manager: Many databases have a pool of processes/threads that needs to be managed. Moreover, in order to gain nanoseconds, some modern databases use their own threads instead of the Operating System threads.
The network manager: Network I/O is a big issue, especially for distributed databases. That’s why some databases have their own manager.
File system manager: Disk I/O is the first bottleneck of a database. Having a manager that will perfectly handle the Operating System file system or even replace it is important.
The memory manager: To avoid the disk I/O penalty a large quantity of RAM is required. But if you handle a large amount of memory, you need an efficient memory manager, especially when you have many queries using memory at the same time.
Security Manager: for managing the authentication and the authorizations of the users
Client manager: for managing the client connections
…

The tools:

Backup manager: for saving and restoring a database.
Recovery manager: for restarting the database in a coherent state after a crash
Monitor manager: for logging the activity of the database and providing tools to monitor a database
Administration manager: for storing metadata (like the names and the structures of the tables) and providing tools to manage databases, schemas, tablespaces, …
…

The query Manager:

Query parser: to check if a query is valid
Query rewriter: to pre-optimize a query
Query optimizer: to optimize a query
Query executor: to compile and execute a query

The data manager:

Transaction manager: to handle transactions
Cache manager: to put data in memory before using them and put data in memory before writing them on disk
Data access manager: to access data on disk

For the rest of this article, I’ll focus on how a database manages an SQL query through the following processes:

the client manager
the query manager
the data manager (I’ll also include the recovery manager in this part)

Client manager

The client manager is the part that handles the communications with the client. The client can be a (web) server or an end-user/end-application. The client manager provides different ways to access the database through a set of well-known APIs: JDBC, ODBC, OLE-DB …

It can also provide proprietary database access APIs.

When you connect to a database:

The manager first checks your authentication (your login and password) and then checks if you have the authorizations to use the database. These access rights are set by your DBA.
Then, it checks if there is a process (or a thread) available to manage your query.
It also checks if the database is not under heavy load.
It can wait a moment to get the required resources. If this wait reaches a timeout, it closes the connection and gives a readable error message.
Then it sends your query to the query manager and your query is processed
Since the query processing is not an “all or nothing” thing, as soon as it gets data from the query manager, it stores the partial results in a buffer and starts sending them to you.
In case of problem, it stops the connection, gives you a readable explanation and releases the resources.

Query manager

This part is where the power of a database lies. During this part, an ill-written query is transformed into a fast executable code. The code is then executed and the results are returned to the client manager. It’s a multiple-step operation:

the query is first parsed to see if it’s valid
it’s then rewritten to remove useless operations and add some pre-optimizations
it’s then optimized to improve performance and transformed into an execution and data access plan.
then the plan is compiled
at last, it’s executed

In this part, I won’t talk a lot about the last 2 points because they’re less important.

After reading this part, if you want a better understanding I recommend reading:

The initial research paper (1979) on cost based optimization: Access Path Selection in a Relational Database Management System. This article is only 12 pages and understandable with an average level in computer science.
A very good and in-depth presentation on how DB2 9.X optimizes queries here
A very good presentation on how PostgreSQL optimizes queries here. It’s the most accessible document since it’s more a presentation on “let’s see what query plans PostgreSQL gives in these situations“ than a “let’s see the algorithms used by PostgreSQL”.
The official SQLite documentation about optimization. It’s “easy” to read because SQLite uses simple rules. Moreover, it’s the only official documentation that really explains how it works.
A good presentation on how SQL Server 2005 optimizes queries here
A white paper about optimization in Oracle 12c here
2 theoretical courses on query optimization from the authors of the book “DATABASE SYSTEM CONCEPTS” here and there. A good read that focuses on disk I/O cost but a good level in CS is required.
Another theoretical course that I find more accessible but that only focuses on join operators and disk I/O.

Query parser

Each SQL statement is sent to the parser where it is checked for correct syntax. If you made a mistake in your query the parser will reject the query. For example, if you wrote “SLECT …” instead of “SELECT …”, the story ends here.

But this goes deeper. It also checks that the keywords are used in the right order. For example a WHERE before a SELECT will be rejected.

Then, the tables and the fields inside the query are analyzed. The parser uses the metadata of the database to check:

If the tables exist
If the fields of the tables exist
If the operations for the types of the fields are possible (for example you can’t compare an integer with a string, you can’t use a substring() function on an integer)

Then it checks if you have the authorizations to read (or write) the tables in the query. Again, these access rights on tables are set by your DBA.

During this parsing, the SQL query is transformed into an internal representation (often a tree)

If everything is ok then the internal representation is sent to the query rewriter.
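You can watch these checks happen with any database. The sketch below uses SQLite through Python's sqlite3 module, only because it needs no server; the table and the queries are made up. SQLite reports these problems when it prepares the statement, before executing anything. Note that SQLite is very permissive about types, so the type checks described above are not shown.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE person (person_key INTEGER PRIMARY KEY, last_name TEXT)")

def try_query(sql):
    """Return None if the statement is accepted, otherwise the error message."""
    try:
        connection.execute(sql)
        return None
    except sqlite3.Error as error:
        return str(error)

print(try_query("SLECT * FROM person"))             # misspelled keyword
print(try_query("WHERE person_key = 1 SELECT *"))   # keywords in the wrong order
print(try_query("SELECT * FROM persons"))           # table that doesn't exist
print(try_query("SELECT first_name FROM person"))   # field that doesn't exist
print(try_query("SELECT last_name FROM person"))    # valid query
```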

Query rewriter

At this step, we have an internal representation of a query. The aim of the rewriter is:

to pre-optimize the query
to avoid unnecessary operations
to help the optimizer to find the best possible solution

The rewriter executes a list of known rules on the query. If the query fits a pattern of a rule, the rule is applied and the query is rewritten. Here is a non-exhaustive list of (optional) rules:

View merging: If you’re using a view in your query, the view is replaced by the SQL code of the view.
Subquery flattening: Having subqueries is very difficult to optimize so the rewriter will try to modify a query with a subquery to remove the subquery.

For example

SELECT PERSON.*
FROM PERSON
WHERE PERSON.person_key IN
(SELECT MAILS.person_key
FROM MAILS
WHERE MAILS.mail LIKE 'christophe%');

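Will be replaced by the following join (this is only equivalent as long as a person can’t match several mails; otherwise the database has to use a semi-join to avoid duplicate rows)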

SELECT PERSON.*
FROM PERSON, MAILS
WHERE PERSON.person_key = MAILS.person_key
and MAILS.mail LIKE 'christophe%';

Removal of unnecessary operators: For example if you use a DISTINCT whereas you have a UNIQUE constraint that prevents the data from being non-unique, the DISTINCT keyword is removed.
Redundant join elimination: If you have twice the same join condition because one join condition is hidden in a view or if by transitivity there is a useless join, it’s removed.
Constant arithmetic evaluation: If you write something that requires a calculation, then it’s computed once during the rewriting. For example WHERE AGE > 10+2 is transformed into WHERE AGE > 12 and TODATE(“some date”) is transformed into the date in the datetime format.
(Advanced) Partition Pruning: If you’re using a partitioned table, the rewriter is able to find what partitions to use.
(Advanced) Materialized view rewrite: If you have a materialized view that matches a subset of the predicates in your query, the rewriter checks if the view is up to date and modifies the query to use the materialized view instead of the raw tables.
(Advanced) Custom rules: If you have custom rules to modify a query (like Oracle policies), then the rewriter executes these rules
(Advanced) OLAP transformations: analytical/windowing functions, star joins, rollup … are also transformed (but I’m not sure if it’s done by the rewriter or the optimizer; since both processes are very close, it must depend on the database).
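The subquery flattening example above can be checked on a toy dataset. The sketch below uses SQLite through Python's sqlite3 module; the rows are invented. It also shows why a rewriter has to be careful: as soon as a person has two matching mails, the plain join returns that person twice whereas the IN version doesn't.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE PERSON (person_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE MAILS (person_key INTEGER, mail TEXT);
    INSERT INTO PERSON VALUES (1, 'Christophe'), (2, 'Alice');
    INSERT INTO MAILS VALUES (1, 'christophe@example.com'), (2, 'alice@example.com');
""")

with_subquery = """
    SELECT PERSON.* FROM PERSON
    WHERE PERSON.person_key IN
      (SELECT MAILS.person_key FROM MAILS WHERE MAILS.mail LIKE 'christophe%')
"""
flattened = """
    SELECT PERSON.* FROM PERSON, MAILS
    WHERE PERSON.person_key = MAILS.person_key
      AND MAILS.mail LIKE 'christophe%'
"""

def run(sql):
    return connection.execute(sql).fetchall()

# One matching mail per person: both queries return the same rows.
print(run(with_subquery) == run(flattened))

# A second matching mail for the same person: the join now duplicates the row.
connection.execute("INSERT INTO MAILS VALUES (1, 'christophe.work@example.com')")
print(len(run(with_subquery)), len(run(flattened)))
```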

This rewritten query is then sent to the query optimizer where the fun begins!

Statistics

Before we see how a database optimizes a query we need to speak about statistics because without them a database is stupid. If you don’t tell the database to analyze its own data, it will not do it and it will make (very) bad assumptions.

But what kind of information does a database need?

I have to (briefly) talk about how databases and Operating systems store data. They’re using a minimum unit called a page or a block (4 or 8 kilobytes by default). This means that if you only need 1 Kbytes it will cost you one page anyway. If the page takes 8 Kbytes then you’ll waste 7 Kbytes.
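The arithmetic behind this waste is simple; here it is as a tiny Python sketch, assuming 8-kilobyte pages by default (the function name is mine):

```python
import math

def pages_needed(data_bytes, page_bytes=8 * 1024):
    """Return (pages allocated, bytes left unused in the last page)."""
    pages = math.ceil(data_bytes / page_bytes)
    return pages, pages * page_bytes - data_bytes

print(pages_needed(1024))  # 1 KB of data in 8 KB pages
print(pages_needed(8193))  # one byte too many for a single page
```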

Back to the statistics! When you ask a database to gather statistics, it computes values like:

The number of rows/pages in a table
For each column in a table:
- distinct data values
- the length of data values (min, max, average)
- data range information (min, max, average)
Information on the indexes of the table.

These statistics help the optimizer estimate the disk I/O, CPU and memory usage of the query.
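
To give an idea of what “gathering statistics” means, here is a small Python sketch that computes the values listed above for a table held in memory as a list of dictionaries. It is an assumption-laden toy: the function names are hypothetical, the number of pages and the index information are left out, and a real database works on its pages on disk, not on Python objects.

```python
from statistics import mean


def gather_column_statistics(rows, column):
    """Basic statistics for one column of an in-memory table."""
    values = [row[column] for row in rows]
    lengths = [len(str(value)) for value in values]
    return {
        "distinct_values": len(set(values)),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": mean(lengths),
        "min_value": min(values),
        "max_value": max(values),
    }


def gather_table_statistics(rows):
    """Row count plus the statistics of every column (rows must not be empty)."""
    columns = list(rows[0].keys())
    return {
        "row_count": len(rows),
        "columns": {c: gather_column_statistics(rows, c) for c in columns},
    }
```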

The statistics for each column are very important. For example, suppose a table PERSON needs to be joined on 2 columns: LAST_NAME and FIRST_NAME. With the statistics, the database knows that there are only 1 000 different values of FIRST_NAME and 1 000 000 different values of LAST_NAME. Therefore, the database will join the data on LAST_NAME, FIRST_NAME instead of FIRST_NAME, LAST_NAME because it produces far fewer comparisons: the LAST_NAME values are unlikely to be the same, so most of the time a comparison on the first 2 (or 3) characters of the LAST_NAME is enough.

But these are basic statistics. You can ask a database to compute advanced statistics called histograms. Histograms are statistics that describe the distribution of the values inside the columns. For example:

the most frequent values
the quantiles
…

These extra statistics will help the database to find an even better query plan, especially for equality predicates (ex: WHERE AGE = 18) or range predicates (ex: WHERE AGE > 10 AND AGE < 40), because the database will have a better idea of the number of rows concerned by these predicates (note: the technical word for this concept is selectivity).
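
Here is a rough Python sketch of how a histogram can be turned into a selectivity estimate for a range predicate. I’m assuming an equal-width histogram and a uniform distribution inside each bucket, which is only one of several possible designs; the function names are hypothetical and real databases use more refined structures.

```python
def build_histogram(values, low, high, bucket_count):
    """Count how many values fall in each of bucket_count equal-width buckets."""
    width = (high - low) / bucket_count
    counts = [0] * bucket_count
    for value in values:
        index = min(int((value - low) / width), bucket_count - 1)
        counts[index] += 1
    return counts


def estimate_range_selectivity(counts, low, high, range_low, range_high):
    """Estimated fraction of rows with range_low < value < range_high.

    Inside a bucket the values are assumed to be spread uniformly, so a
    bucket that is half covered by the range contributes half of its rows.
    """
    width = (high - low) / len(counts)
    matching = 0.0
    for i, count in enumerate(counts):
        bucket_low = low + i * width
        bucket_high = bucket_low + width
        overlap = max(0.0, min(bucket_high, range_high) - max(bucket_low, range_low))
        matching += count * overlap / width
    return matching / sum(counts)
```

With 100 rows whose AGE goes from 0 to 99 and 10 buckets, the estimate for AGE > 10 AND AGE < 40 is about 0.3, i.e. roughly 30 rows. The estimate is only as good as the uniformity assumption: on skewed data, a bucket that is partially covered can be badly over- or under-counted.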

The statistics are stored in the metadata of the database. For example, you can see the statistics for the (non-partitioned) tables:

in USER/ALL/DBA_TABLES and USER/ALL/DBA_TAB_COLUMNS for Oracle
in SYSCAT.TABLES and SYSCAT.COLUMNS for DB2.

The statistics have to be up to date. There is nothing worse than a database thinking a table has only 500 rows when it has 1 000 000 rows. The only drawback of the statistics is that they take time to compute. This is why they’re not automatically computed by default in most databases. With millions of rows, it becomes difficult to compute them. In this case, you can choose to compute only the basic statistics or to compute the stats on a sample of the database.

For example, when I was working on a project dealing with hundreds of millions of rows in each table, I chose to compute the statistics on only 10%, which led to a huge gain in time. As it turned out, it was a bad decision because occasionally the 10% chosen by Oracle 10G for a specific column of a specific table was very different from the overall 100% (which is very unlikely to happen for a table with 100M rows). This wrong statistic led to a query occasionally taking 8 hours instead of 30 seconds; finding the root cause was a nightmare. This example shows how important the statistics are.

Note: Of course, there are more advanced statistics specific to each database. If you want to know more, read the documentation of the databases. That being said, I’ve tried to understand how the statistics are used and the best official documentation I found was the one from PostgreSQL.

Query optimizer

All modern databases use Cost Based Optimization (or CBO) to optimize queries. The idea is to put a cost on every operation and find the best way to reduce the cost of the query by using the cheapest chain of operations to get the result.
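
Stripped to the bone, the idea can be sketched like this in Python. The operation names and the cost numbers in the usage example are made up for illustration, and plan_cost and cheapest_plan are hypothetical helpers; a real optimizer estimates the costs from the statistics and cannot afford to enumerate every possible plan.

```python
def plan_cost(plan):
    """Total cost of a plan given as a chain of (operation, cost) pairs."""
    return sum(cost for _operation, cost in plan)


def cheapest_plan(candidate_plans):
    """Pick the chain of operations with the lowest total cost."""
    return min(candidate_plans, key=plan_cost)


# Made-up candidates for the same query.
candidates = [
    [("full scan of the table", 1000), ("filter the rows", 100)],
    [("range scan of an index", 15), ("fetch the matching rows", 50)],
]
best = cheapest_plan(candidates)  # the second chain, with a total cost of 65
```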

To understand how a cost optimizer works, I think it’s good to have an example to “feel” the complexity behind this task. In this part I’ll present the 3 common ways to join 2 tables and we will quickly see that even a simple join query is a nightmare to optimize. After that, we’ll see how real optimizers do this job.

For these joins, I’ll focus on their time complexity, but a database optimizer computes their CPU cost, disk I/O cost and memory requirement. The difference between time complexity and CPU cost is that time complexity is very approximate (it’s for lazy guys like me). For the CPU cost, I would have to count every operation like an addition, an “if statement”, a multiplication, an iteration … Moreover:

Each high level code operation has a specific number of low level CPU operations.
The cost of a CPU operation is not the same (in terms of CPU cycles) whether you’re using an Intel Core i7, an Intel Pentium 4, an AMD Opteron… In other words, it depends on the CPU architecture.

Using the time complexity is easier (at least for me) and with it we can still get the concept of CBO. I’ll sometimes speak about disk I/O since it’s an important concept. Keep in mind that the bottleneck is most of the time the disk I/O and not the CPU usage.

Indexes

We talked about indexes when we saw the B+Trees. Just remember that these indexes are already sorted.

FYI, there are other types of indexes like bitmap indexes. They don’t offer the same cost in terms of CPU, disk I/O and memory as B+Tree indexes.

Moreover, many modern databases can dynamically create temporary indexes just for the current query if they can improve the cost of the execution plan.

Access Path

Before applying your join operators, you first need to get your data. Here is how you can get your data.

Note: Since the real problem with all the access paths is the disk I/O, I won’t talk a lot about time complexity.

Full scan

If you’ve ever read an execution plan, you must have seen the term full scan (or just scan). A full scan is simply the database reading a table or an index entirely. In terms of disk I/O, a table full scan is obviously more expensive than an index full scan.

Range Scan

There are other types of scan, like the index range scan. It is used, for example, when you use a predicate like “WHERE AGE > 20 AND AGE < 40”.

Of course, you need to have an index on the field AGE to use this index range scan.

We already saw in the first part that the time cost of a range query is something like log(N) + M, where N is the number of entries in this index and M an estimation of the number of rows inside this range. Both N and M values are known thanks to the statistics (Note: M is derived from the selectivity of the predicate AGE > 20 AND AGE < 40). Moreover, for a range scan you don’t need to read the full index, so it’s less expensive in terms of disk I/O than a full scan.
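
To see where log(N) + M comes from, here is a small Python sketch where a sorted list plays the role of the index (an assumption for illustration: a real index is a B+Tree stored on disk pages). The two binary searches cost about log(N) each, and copying the slice costs M. The name index_range_scan is mine.

```python
from bisect import bisect_left, bisect_right


def index_range_scan(sorted_keys, low, high):
    """Return the keys k such that low < k < high from a sorted list of keys."""
    start = bisect_right(sorted_keys, low)  # first position with key > low
    end = bisect_left(sorted_keys, high)    # first position with key >= high
    return sorted_keys[start:end]           # the M keys inside the range


ages = [18, 20, 25, 30, 40, 41]
matching = index_range_scan(ages, 20, 40)  # AGE > 20 AND AGE < 40 -> [25, 30]
```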

Unique scan

If you only need one value from an index you can use the unique scan.

Access by row id

Most of the time, if the database uses an index, it will have to look for the rows associated with the index entries. To do so, it will use an access by row id.

For example, if you do something like:

SELECT LASTNAME, FIRSTNAME from PERSON WHERE AGE = 28

If you have an index for PERSON on the column AGE, the optimizer will use the index to find all the persons who are 28, then it will ask for the associated rows in the table, because the index only has information about the age and you want to know the lastname and the firstname.

But, if now you do something like:

SELECT TYPE_PERSON.CATEGORY from PERSON ,TYPE_PERSON
WHERE PERSON.AGE = TYPE_PERSON.AGE

The index on PERSON will be used to join with TYPE_PERSON, but the table PERSON will not be accessed by row id since you’re not asking for information from this table.

Though it works great for a few accesses, the real issue with this operation is the disk I/O. If you need too many accesses by row id the database might choose a full scan.
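
The first query above can be mimicked with a small Python sketch. The assumptions are mine: the table is a list of rows, the row id is simply the position in that list, and the index is a dictionary from a column value to the row ids holding it. build_index and select_names_by_age are hypothetical names.

```python
from collections import defaultdict


def build_index(table, column):
    """Map each value of the column to the row ids (positions) that hold it."""
    index = defaultdict(list)
    for row_id, row in enumerate(table):
        index[row[column]].append(row_id)
    return index


def select_names_by_age(table, age_index, age):
    """SELECT LASTNAME, FIRSTNAME FROM PERSON WHERE AGE = age, using the index."""
    row_ids = age_index.get(age, [])  # the index only knows ages and row ids
    return [
        (table[row_id]["LASTNAME"], table[row_id]["FIRSTNAME"])  # access by row id
        for row_id in row_ids
    ]
```

Each element of row_ids costs one access to the table, which is why this path stops being attractive when the predicate matches a large part of the rows.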

Other paths

I didn’t present all the access paths. If you want to know more, you can read the Oracle documentation. The names might not be the same for the other databases, but the underlying concepts are the same.

Join operators

So, we know how to get our data, let’s join them!

I’ll present the 3 common join operators: Merge Join, Hash Join and Nested Loop Join. But before that, I need to introduce new vocabulary: inner relation and outer relation. A relation can be:

a table
an index
an intermediate result from a previous operation (for example the result of a previous join)

When you’re joining two relations, the join algorithms manage the two relations differently. In the rest of the article, I’ll assume that:

the outer relation is the left data set
the inner relation is the right data set

For example, A JOIN B is the join between A and B where A is the outer relation and B the inner relation.

Most of the time, the cost of A JOIN B is not the same as the cost of B JOIN A.

In this part, I’ll also assume that the outer relation has N elements and the inner relation M elements. Keep in mind that a real optimizer knows the values of N and M with the statistics.

Note: N and M are the cardinalities of the relations.

Nested loop join

The nested loop join is the easiest one.

Here is the idea:

for each row in the outer relation
you look at all the rows in the inner relation to see if there are rows that match

Here is the pseudocode:

nested_loop_join(array outer, array inner)
  for each row a in outer
    for each row b in inner
      if (match_join_condition(a,b))
        write_result_in_output(a,b)
      end if
    end for
  end for

Since it’s a double iteration, the time complexity is O(N*M).
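
If you prefer something you can run, here is the same algorithm as a short Python sketch. The relations are plain lists and the join condition is passed as a function, which is my own choice for the illustration.

```python
def nested_loop_join(outer, inner, match_join_condition):
    """Join two relations by comparing every outer row with every inner row."""
    result = []
    for a in outer:              # N iterations
        for b in inner:          # M iterations for each outer row
            if match_join_condition(a, b):
                result.append((a, b))
    return result


outer_ages = [28, 35, 41]
inner_ages = [35, 41, 50]
joined = nested_loop_join(outer_ages, inner_ages, lambda a, b: a == b)
```

Here joined contains the pairs (35, 35) and (41, 41), and the condition has been evaluated 3 * 3 = 9 times.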

In terms of disk I/O, for each of the N rows in the outer relation, the inner loop needs to read M rows from the inner relation. This algorithm needs to read N + N*M rows from disk. But, if the inner relation is small enough, you can put the relation in memory and just have M + N reads. With this modification, the inner relation must be the smallest one since it has more chance to fit in memory.

In terms of time complexity it makes no difference, but in terms of disk I/O it’s way better to read both relations only once.
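
To put numbers on it, here is a tiny Python sketch of the two formulas above. The function name is hypothetical and the model is deliberately naive: it counts rows, not pages, exactly like the paragraph does.

```python
def nested_loop_rows_read(n, m, inner_fits_in_memory):
    """Rows read from disk by a nested loop join (outer has n rows, inner has m)."""
    if inner_fits_in_memory:
        return n + m      # each relation is read from disk once
    return n + n * m      # the inner relation is re-read for every outer row
```

With N = 1 000 and M = 100, the join reads 101 000 rows when the inner relation has to be re-read each time, and only 1 100 rows when it fits in memory.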

Of course, the inner relation can be replaced by an index, which will be better for the disk I/O.

Since this algorithm is very simple, here is another version that is more disk I/O friendly if the inner relation is too big to fit in memory. Here is the idea:
