# Postgres 102 - Explains and Advanced Features

In this tutorial, we’re going to cover some good-to-know Postgres features and tricks for query optimization. The only prerequisite is basic SQL and having Docker/Postgres installed.

## Setting up our environment

First things first, let’s make some mock data for us to play around with. We’re going to make a table of domains (e.g. espn.com) and domain categories (e.g. sports publishers):
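I’ll keep the schema minimal (the names and columns here are just one reasonable choice):

```sql
create table domains (
    id serial primary key,
    domain text not null   -- e.g. 'espn.com'
);

create table categories (
    id serial primary key,
    name text not null     -- e.g. 'Sports'
);
```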

And we’re going to insert some data:
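A couple of generate_series calls will do (the row counts are arbitrary):

```sql
-- 1000 mock domains and 100 mock categories
insert into domains (domain)
select 'domain' || i || '.com'
from generate_series(0, 999) as i;

insert into categories (name)
select 'category_' || i
from generate_series(0, 99) as i;
```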

Sidenote
generate_series is a wonderful little function that does exactly what you think it does. It’s fantastic for making mock data.

Next, let’s make a table that connects the two together:
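A classic many-to-many join table. I’ll also spread the domains across the categories; any assignment works for mock data:

```sql
create table domain_categories (
    domain_id   integer references domains (id),
    category_id integer references categories (id),
    primary key (domain_id, category_id)
);

-- deal each domain into one of the 100 categories
insert into domain_categories (domain_id, category_id)
select id, (id % 100) + 1
from domains;
```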

Now it’s very easy to get a list of domains in any category:
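One join does it:

```sql
select d.domain
from domains d
join domain_categories dc on dc.domain_id = d.id
where dc.category_id = 42;
```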

Now let’s make things a little more complicated. Let’s add parent categories so we can get a categories topology. Imagine a “Sports” category that can contain “Football” and “Basketball” categories. To do this, we need a table that defines the parent-child relationships for categories:
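I’ll call it category_links, and seed it so category 1 is the parent of categories 2 through 10:

```sql
create table category_links (
    parent_id integer references categories (id),
    child_id  integer references categories (id),
    primary key (parent_id, child_id)
);

-- category 1 ("Sports") contains categories 2 through 10
insert into category_links (parent_id, child_id)
select 1, i
from generate_series(2, 10) as i;
```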

Okay, so far so good. How can we get a list of domains in parent categories? One option is to do a simple join:
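We hop through category_links on the way to domain_categories:

```sql
-- domains in the children of category 1
select distinct d.domain
from category_links cl
join domain_categories dc on dc.category_id = cl.child_id
join domains d on d.id = dc.domain_id
where cl.parent_id = 1;
```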

However, what if we want the domains of a category and we don’t know whether it’s a parent category or not? One possible solution is to run both queries and union the results together:
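Roughly:

```sql
-- domains attached directly to category 1...
select d.domain
from domain_categories dc
join domains d on d.id = dc.domain_id
where dc.category_id = 1
union
-- ...plus domains attached to its children
select d.domain
from category_links cl
join domain_categories dc on dc.category_id = cl.child_id
join domains d on d.id = dc.domain_id
where cl.parent_id = 1;
```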

But A) this is super ugly and B) it stops working if our parent categories get parent categories of their own. In other words, we are only solving a graph that’s two layers deep. A better solution is to use something called a “recursive common table expression”. But first, you should understand a normal “common table expression” (CTE).
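For example, here’s our “domains in a category” query with the subquery pulled out and named (the name is arbitrary):

```sql
with category_1_domain_ids as (
    select domain_id
    from domain_categories
    where category_id = 1
)
select d.domain
from category_1_domain_ids c
join domains d on d.id = c.domain_id;
```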

The with syntax here is the CTE. It’s a useful tool to cache subqueries, and I often find CTEs much cleaner than inline subqueries because you can give them names!

A recursive CTE is slightly different because it allows a CTE to iterate on itself:
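A base case seeds the CTE, then the recursive case joins back into it until no new rows appear. This collects category 1 plus all of its descendants, however deep:

```sql
with recursive subcategories as (
    -- base case: the category we start from
    select id
    from categories
    where id = 1

    union all

    -- recursive case: the children of everything found so far
    select cl.child_id
    from category_links cl
    join subcategories sc on sc.id = cl.parent_id
)
select *
from subcategories;
```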

Now our join will work no matter how many layers we have!
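We just join domain_categories against the recursive CTE instead of a single category id:

```sql
with recursive subcategories as (
    select id from categories where id = 1
    union all
    select cl.child_id
    from category_links cl
    join subcategories sc on sc.id = cl.parent_id
)
select distinct d.domain
from subcategories sc
join domain_categories dc on dc.category_id = sc.id
join domains d on d.id = dc.domain_id;
```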

However, it’s quite a bit of work to write this out every time. If we were using an ORM, we’d be reading a lot of documentation to get this syntax down. To avoid this, we can write a view:
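One way to shape it is a view of every (category, domain) pair with descendants included. I’ll call it category_domains:

```sql
create view category_domains as
with recursive category_tree as (
    -- every category is trivially its own ancestor
    select id as ancestor_id, id as category_id
    from categories

    union all

    -- walk down the links to pick up descendants
    select ct.ancestor_id, cl.child_id
    from category_tree ct
    join category_links cl on cl.parent_id = ct.category_id
)
select ct.ancestor_id as category_id, dc.domain_id
from category_tree ct
join domain_categories dc on dc.category_id = ct.category_id;
```

Now `select * from category_domains where category_id = 1;` reads just like a table.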

The database will pretend this is a table and even join other tables to it! However, be very careful with views because as you add filters and joins, the query planner may be very confused, as we’ll see later.

If the data in your view doesn’t change very often, one common tool is a materialized view. Mviews, as they’re commonly called, allow you to cache the results of the view and only refresh them manually:
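Same shape, but the results are stored on disk until you refresh them:

```sql
create materialized view category_domains_mat as
select * from category_domains;

-- recompute the stored results (this blocks reads; see below)
refresh materialized view category_domains_mat;
```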

Keep in mind though: refresh materialized view will block reads. If you add a unique index to your mview (as you should), you can use refresh materialized view concurrently, which will refresh your mview without blocking reads.
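Concretely:

```sql
-- the unique index that makes a concurrent refresh possible
create unique index on category_domains_mat (category_id, domain_id);

refresh materialized view concurrently category_domains_mat;
```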

## Foreign data wrappers

When you have a lot of data, it’s common to split your tables across multiple databases. To simulate this, let’s create another Docker container. This time, we’ll add a “link” so the second container can network with the first:
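Assuming the first container is named pg1, something like this (the names, password, and ports are arbitrary):

```sh
docker run --name pg2 --link pg1:pg1 \
    -e POSTGRES_PASSWORD=postgres -p 5433:5432 -d postgres
```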

We’ll be using a Postgres extension called postgres_fdw that allows you to communicate with other Postgres instances. There are a lot of cool Postgres extensions out there: they range from new data types to foreign data wrappers to even new storage engines and indices.
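The setup sketch looks like this on the second instance (the server name, credentials, and schema are placeholders for your own):

```sql
create extension postgres_fdw;

-- point at the first instance; the --link makes the 'pg1' hostname resolvable
create server pg1 foreign data wrapper postgres_fdw
    options (host 'pg1', dbname 'postgres');

create user mapping for current_user server pg1
    options (user 'postgres', password 'postgres');

-- expose the remote tables locally as foreign tables
import foreign schema public from server pg1 into public;
```

After this, querying domains on the second instance quietly fetches rows from the first.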

There are also foreign data wrappers for MySQL, Redis, Dynamo, you name it.

## Optimizing queries

Okay, now the fun stuff :). Let’s create a brand_domain_creative table. We use something more-or-less shaped like this table in Moat Pro and it tells us the estimated impressions (score) a creative had on a certain day on a certain domain.
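More or less shaped like this:

```sql
create table brand_domain_creative (
    start_date  date    not null,
    brand_id    integer not null,
    domain_id   integer not null,
    creative_id integer not null,
    score       integer not null   -- estimated impressions
);
```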

Neat! Next we’re going to fill it with ~60M rows of simulated data. This may take a short while.
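A cross join of generate_series calls gets us there. I’m assuming 334 brands, 1000 domains, and 3 creatives each over January and February of 2017, which lands just shy of 60M rows:

```sql
insert into brand_domain_creative
select d::date, brand_id, domain_id, creative_id,
       (random() * 100)::integer as score
from generate_series('2017-01-01'::timestamp,
                     '2017-02-28'::timestamp,
                     '1 day') as d
cross join generate_series(0, 333) as brand_id
cross join generate_series(0, 999) as domain_id
cross join generate_series(0, 2)   as creative_id;
```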

Now we can do queries like “what are the top 10 brands in January for domain 0?”
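Like so:

```sql
select brand_id, sum(score) as total_score
from brand_domain_creative
where domain_id = 0
  and start_date between '2017-01-01' and '2017-01-31'
group by brand_id
order by total_score desc
limit 10;
```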

Yikes! Never mind. That’s taking way too long. Ctrl+C to interrupt the query and get out of there.

To see what happened, we can use explain. It’ll output the database’s execution plan for the query:
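Just prefix the query with explain (this doesn’t run it, it only plans it):

```sql
explain
select brand_id, sum(score) as total_score
from brand_domain_creative
where domain_id = 0
  and start_date between '2017-01-01' and '2017-01-31'
group by brand_id
order by total_score desc
limit 10;
```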

You’ll notice that each step has a cost estimate. Postgres keeps statistics about your tables to estimate how long each step will take so it can intelligently choose the optimal strategy. In this case, the statistics are a wee bit off: it thinks the domain will have ~1500 rows in this date range when in actuality, it’s around 31k. We can tell it to refresh its statistics via an analyze command:
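It only samples the table, so it’s quick:

```sql
analyze brand_domain_creative;
```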

Now our explain looks something like this:
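Roughly this shape (I’ve elided the cost numbers, and the exact aggregate node may differ on your machine):

```
 Limit
   ->  Sort
         Sort Key: (sum(score)) DESC
         ->  HashAggregate
               Group Key: brand_id
               ->  Seq Scan on brand_domain_creative
                     Filter: ((start_date >= '2017-01-01') AND (start_date <= '2017-01-31') AND (domain_id = 0))
```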

Even with more accurate statistics, the database doesn’t have any other option. To execute this query, it needs to go through every single row, filter, and aggregate. How can we give it a shortcut? An index of course!

What is an index? Essentially, think of it as the index at the back of a very large book. If you want to find every page that mentions the word aardvark, it’s much faster to look up its entry there than to read every page.

By default, Postgres indices are b-trees under the hood because they’re very versatile. However, you can choose other index types if you know what you’re doing.
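Let’s build one that covers everything our query touches (we’ll pick this shape apart later; the index name is arbitrary):

```sql
create index brand_domain_creative_idx
    on brand_domain_creative (start_date, domain_id, brand_id, creative_id);
```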

Building this index took a long time because the computer had to go through every single data point. When you think about it, two and a half minutes to organize 60M data points sounds pretty great. Dang, computers are cool.

Sidenote
Why are they called B-trees? Rudolf Bayer and Ed McCreight invented them while working at Boeing, so the B could stand for “Boeing”, “Bayer”, or even “balanced”. McCreight says they couldn’t decide between the options. They couldn’t name it Boeing without lawyers being involved, so the company missed out on some great free advertising.

Now let’s try this query again!

That was significantly faster. And it’ll be even faster if you run it again, because of caching.

Caching? How does caching work? Well, the operating system’s page cache plays the biggest role here, but Postgres also has shared_buffers, its own cache of recently used:

• Table data
• Indexes

(Query execution plans get cached too, but per connection for prepared statements, not in shared_buffers.)

Keeping all of this consistent while writes are coming in is some serious voodoo magic, so if you see a Postgres contributor, buy them a beer.

Let’s see how our query got faster via an explain analyze. explain analyze is like an explain, but it also runs the query to give you more information. The verbose and buffers options give you debug information about each step.
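Like so:

```sql
explain (analyze, verbose, buffers)
select brand_id, sum(score) as total_score
from brand_domain_creative
where domain_id = 0
  and start_date between '2017-01-01' and '2017-01-31'
group by brand_id
order by total_score desc
limit 10;
```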

Let’s interpret this explain from inside out.

• “Bitmap index scan”: our index is large enough to take up several blocks. Because of the way b-trees work, we can walk the index, collect every entry that matches our conditions, and build a bitmap (an array of booleans) marking which heap blocks contain matching rows. This took 1.27s.

• “Bitmap heap scan”: armed with that bitmap, we read the marked heap blocks and pull out the matching rows. This took almost no time at 0.16s and resulted in 31k rows.

• “Sort”: Looks like Postgres is sorting the rows with quicksort to make it easier to…

• “GroupAggregate”: Group the rows together by brand_id and sum the scores (334 resulting rows).

• “Sort”: Sort our grouped rows based on sum(score) DESC using top-N heapsort

• “Limit”: trim our results down to the top 10.

Sidenote
Quicksort is in-place, so it makes sense they chose that for 31k rows. Top-N heapsort is a sort where you only keep the top N, which is significantly cheaper. It only makes sense if you do a limit after your sort.

Can we do better? Sure! Seems like the slow part is getting stuff from the index. We have to read 118k buffers here and only 31k buffers to actually get the data (gee, I’m starting to suspect our buffers are exactly 10k rows each).

Why does the index need to read so many blocks? Well, it’s because of the shape of our index. Our index looks like this: (start_date, domain_id, brand_id, creative_id). This means if you laid our index entries out in a list, it would look like this:
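Sketching the entries in index order (illustrative, not real output):

```
(start_date,  domain_id, brand_id, creative_id)
('2017-01-01',        0,        0,           0)
('2017-01-01',        0,        0,           1)
 ...   -- ~1k entries for domain 0 on Jan 1
('2017-01-01',        1,        0,           0)
 ...   -- then domains 1 through 999
('2017-01-02',        0,        0,           0)
 ...   -- domain 0 again, ~1M entries later
```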

So in every 1M index entries, only 1k of them are relevant to our query at hand. Thus, we can assume that we have to read a lot of different index blocks to collect our heap.

What happens if we make a new index organized by (domain_id, start_date)? Then our index blocks are significantly closer together and our b-tree doesn’t have to make keys for creative_id/brand_id.
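Let’s find out:

```sql
create index brand_domain_creative_domain_date_idx
    on brand_domain_creative (domain_id, start_date);
```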

Great Neptune’s trident, that was fast! Let’s run the explain analyze again and see how things changed.

As expected, we got our rows out significantly faster (144ms). Interestingly, the DB switched from a GroupAggregate to a HashAggregate even though the only step that should have been affected was the index scan. Databases are mysterious beasts. In this case, it bought us 2ms. Huzzah!

Sidenote
Another common reason for slow bitmap index scans is a lack of vacuuming. By default, Postgres keeps old versions of rows around for MVCC (multi-version concurrency control) and they can linger in your index as well. Vacuum frequently, kids.

## Performance tuning

Let’s try another query: which brands have appeared on the most domains?
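Something like this, say over January (which is about half our mock data):

```sql
select brand_id, count(distinct domain_id) as domains
from brand_domain_creative
where start_date between '2017-01-01' and '2017-01-31'
group by brand_id
order by domains desc
limit 10;
```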

Yikes. This is going to be slow again. Our indices can only eliminate half of our dataset. What can we do?

One solution is to have a (start_date, brand, domain) index. Maybe this way, Postgres doesn’t need the actual rows to perform the query:
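Worth a shot:

```sql
create index brand_domain_creative_covering_idx
    on brand_domain_creative (start_date, brand_id, domain_id);
```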

Programmer uses index! It’s not very effective!

Whoa, wtf? Why is it doing a sequential scan on the rows? Even an analyze doesn’t change this. If it used the index, it would go through 31 (days) * 334 (brands) * 1000 (domains) = 10.354M index entries. That’s almost 6 times fewer than going through 60M rows!

Well, the difference is that index disk access is random whereas the sequential scan is, well, sequential. The optimizer gives random page reads a cost of 4 by default (the random_page_cost setting), versus 1 for sequential reads (seq_page_cost). And keep in mind that reading one index block involves reading a number of other index blocks because that’s how b-trees work.

But wait! My computer has an SSD! Shouldn’t they be weighted the same? Well, you can tune that in your Postgres config by changing random_page_cost to 1 :). Or, you can do it temporarily in your session.
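set only lasts for your current connection:

```sql
set random_page_cost = 1;
```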

Wtf Rodrigo? You lied! It’s still doing a sequential scan!

Well, as a hack, if you set seq_page_cost to 9999, you can see what its index plan would look like:
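The 9999 is arbitrary; anything huge works:

```sql
set seq_page_cost = 9999;

explain
select brand_id, count(distinct domain_id) as domains
from brand_domain_creative
where start_date between '2017-01-01' and '2017-01-31'
group by brand_id
order by domains desc
limit 10;
```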

Huh, so the database doesn’t have a way to do an index scan and a GroupAggregate at the same time! So it’s forced to index scan all 31M entries! Maybe there’s a good reason for that - database programming is hard because there are a ton of corner cases.

If you want to research it, pull requests are welcome ;).

## Pre-aggregation

First, let’s set our seq_page_cost back to normal:
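reset puts a setting back to its configured default:

```sql
reset seq_page_cost;
reset random_page_cost;  -- if you changed this one too
```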

So how can we make the above query faster? Well, if the database is determined to do a sequential scan, we can simply give it fewer rows! In this query, we don’t need the creative column. So what if we removed it and rolled up all the creative scores into brands?
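I’ll call the rollup brand_domain:

```sql
create table brand_domain as
select start_date, brand_id, domain_id, sum(score) as score
from brand_domain_creative
group by start_date, brand_id, domain_id;
```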

The contents of this table are ~3x smaller! So it makes sense that our query time went down by that much.

And what happens if we add an index?
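Same covering shape as before, just on the smaller table:

```sql
create index brand_domain_idx
    on brand_domain (start_date, brand_id, domain_id);
```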

Well, now we get our 10M row method :).

But here’s one more optimization we can do! We reduced brand_domain_creative to brand_domain, and in our business logic, we frequently do month-long queries. What happens if we roll up start_date to the nearest month?
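I’ll call this cut monthly_brand, bucketing start_date with date_trunc:

```sql
create table monthly_brand as
select date_trunc('month', start_date)::date as start_date,
       brand_id, domain_id, sum(score) as score
from brand_domain
group by date_trunc('month', start_date), brand_id, domain_id;
```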

Now, this query turns into:
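A scan over a few hundred thousand rows instead of tens of millions:

```sql
select brand_id, count(distinct domain_id) as domains
from monthly_brand
where start_date = '2017-01-01'
group by brand_id
order by domains desc
limit 10;
```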

Woosh! The moral of the story: you can learn to be a database whisperer, but usually the simplest approach (pre-compute as much as you can) is the right answer.

However, every neat trick in CS comes at a cost and pre-aggregation is no exception. Specifically:

• Disk space

• Compute time

• Out of sync tables

One last thing: what if you have a query where you want to see the brands with the most domains from ‘2017-01-25’ to ‘2017-02-28’? In this case, the optimal query involves getting the daily rows from brand_domain and the monthly rows from monthly_brand:
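One way to write it, with the union deduplicating the (brand, domain) pairs across the two cuts:

```sql
select brand_id, count(distinct domain_id) as domains
from (
    -- the ragged end of January, day by day
    select brand_id, domain_id
    from brand_domain
    where start_date between '2017-01-25' and '2017-01-31'

    union

    -- all of February from the monthly cut
    select brand_id, domain_id
    from monthly_brand
    where start_date = '2017-02-01'
) pairs
group by brand_id
order by domains desc
limit 10;
```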

Pretty freakin’ fast. In Moat Pro, we have every cut imaginable, from brand to monthly_creative_hostname to alltime_tag_brand_creative. We have a query engine that chooses the right cuts and aggregates their results together.