PostgreSQL Indexes: First Principles30 Jan 2016
We have all heard about indexes. Yeah, that thing that it’s automatically added to the Primary Key column that enables fast data retrieval and stuff. Sure, but have you ever asked yourself if there are multiple types or implementations of indexes? Or maybe, what type of indexes your favourite RDBMS implements? In this blog post, we will take a step back to the beginning, exploring what indexes are, what is their role, types of indexes, metrics and so on. And all of this in PostgreSQL.
What’s the Problem?
When explaining indexes, everyone uses the phonebook analogy. The problem is that the phonebook analogy is so good, thinking of any other one is a waste of time. But, before we jump into indexes, we need to understand the problem that they solve.
So, think of a phonebook. The phonebook is a book with all of the names and phone numbers of the people in a city or a country. Now, let’s say we want to find John Doe’s phone number. Knowing that the phone book is alphabetically ordered we look for the page where the surname Doe is. There, we look for John and his phone number. Good, that is efficient enough for a phonebook.
Now, imagine if our phonebook was not alphabetically sorted? Hell, right? We would need to go through all of the pages, reading every name in the phonebook, until we find our John Doe. This is called sequential searching. You go over all of the records, until you find the person whose phone number you are looking for.
One does not need to be a genius to understand that this is super inefficient.
The problem with database tables is that the data in them is unordered. If we
people table, containing all information about people, to lookup the
person with the full name John Doe, we would need to execute:
Easy enough. But although this query is fast, under the hood, the database will hit every row of the table, until it finds the record it is looking for. I am sure your side-project won’t have this problem soon, but imagine a table with millions of records, without indexes. Data retrieval would take seconds, sometimes maybe minutes. Imagine waiting 30 seconds for a list of videos on YouTube?
I guess the phonebook example painted the picture well for you. Now, what’s interesting is that the phonebooks, actually have indexes. But, they are different. They are not like a book index, which shows what chapter starts on which page. Or maybe they do, I haven’t seen a phonebook in a while.
Nevertheless, indexes have different architectures/indexing methods. And the phonebook has a clustered index. Clustering means that the data is in a distinct order, resulting the row data to be stored in order. If this confuses you, think again about the phonebook - the records in the book are ordered alphabetically. Regardless of how simple this might seem to you, it’s a clustered index. The way the data is ordered (clusters) makes it really easy to search and find the needed phone number. So, a clustered index is an index which physically orders the data (the actual bits on disk) in a certain way, and when new data arrives it is saved in the same order.
A caveat with the clustered indexes is that only one can be created on a given database table. That occurs due to their nature - they enforce a data order. Also, clustered indexes increase the write time, because when new data arrives, all of the data has to be rearranged. But on the bright side - they can greatly increase the reading speed of the table.
To summarize - clustered indexes order the data physically (on disk) in clusters.
I know you guessed it - if there are clustered indexes, the opposite type has to exist. And you are right, non-clustered indexes are a thing. They are the type that we know to use most, but I guess not everyone knows how their implementation works.
Non-clustered indexes are indexes that keep a separate ordering list that has pointers to the physical rows. It’s basically like a book index, it knows on what page a certain chapter starts and ends. Now, unlike the clustered index, a table can have many non-clustered indexes. But, the caveat with these indexes is that each new index will increase the time it takes to write new records.
So, to summarize - non-clustered indexes do not order the data physically, they just keep a list of the data order.
So, after we covered the index architecture, we can explore how indexes work in PostgreSQL. But first things first - let’s play with some indexes.
Indexes in PostgreSQL are manipulated via a set of commands. For this example, I will use a table with 1000 records, representing 1000 users. The table schema will look like this:
If you want to follow along with this tutorial, you can easily create this table with the following query:
Having the table now, we can dump any amount of data we want. I will use the
following CSV file,
which contains 1000 user records. Dumping the data in the table is easy using
COPY command. If you are unfamiliar with the command, check out
my post which explains it in depth.
In our case, the actual command is:
Having the data in the database now, let’s see the real impact of the indexes.
PostgreSQL creates a query plan for each query it receives. When writing your
queries, it is very important to choose the best query plan, which adheres to the
query structure and the properties of the data. The query plans are really easy
to produce in PostgreSQL. It comes with a really neat command called
shows the query plan, which contains all of the relevant info.
For example, let’s see the query plan for the query that returns all of the users:
This will return the following output:
As you can notice, this simple query requires a Sequential Scan, because
there is no
WHERE clause in it. This means that it will have to scan each row
of the table and return it.
In the parentheses we notice couple of values. The cost is a range of arbitrary units (that closely resemble disk page fetches) - starting from the expected before the output phase can begin, to the estimated total cost of this query. Then, the numbers of rows output that were estimated by the query planner. Last, the width represents the average width of the rows, in bytes.
Let’s add a
WHERE clause to the query:
The query will return the user whose email is
[email protected]. Perfect.
EXPLAIN the query, and see its query plan:
As you can see, although the
WHERE clause was added, and the result set was
reduced to only one row, the time didn’t go down. Actually, it increased. This
happens because the query will issue a Sequential Scan on the table.
Let’s see if adding an index to the
Adding an Index
Adding an index in PostgreSQL is done via the
CREATE INDEX command:
Or, in our specific case:
After we run this command, let’s see a description of the table:
You can see in the “Indexes” section, the new index appears. After we added the
index, let’s see the query plan for the
Whoa! The projected query cost dropped from the original 22.50 to 8.29. So, what changed? As you can notice, instead of a Sequential Scan, now PostgreSQL will issue a Index Scan. We won’t go into details here, because types and approaches of scanning is topic for another blogpost. Simply said, Postgres will issue a scan based on the indexes, therefore finding the exact location of the user that we are looking for and returning it. This is a much cheaper operation then using a sequential scan.
Sequential isn’t Always the Worst
Let’s see a different example. Let’s say, we want to return all of the users that
id between 100 and 200:
Looks about right, right? It uses a Index Scan, with an Index Cond(itional)
searching for all
ids between 100 and 200. Good enough.
How about we try to return all users with
ids between 200 and 800?
Hold on, what happened here? Although there is an index on the primary key
column, PostgreSQL decides that doing a sequential scan on the
user table is
more performant than an index scan. An index scan would require finding 600
indexes and returning each and every one of those records whose index were found.
On the other hand, a sequential scan would just go over each of the records and
filter out the unwanted rows.
So, although this is a contrived example, you can see in some situations an index scan will not be as performant as a sequential scan.
Indexes are a powerful way to improve the performance of your tables, but they have to be used carefully. Often, indexes can actually stand in the way of your queries, if not used properly.
Next time we will see what are the index types in PostgreSQL and how we can leverage them.