Postgres Advanced Features

So you already know some Postgres? Okay. In this tutorial, we're going to cover some good-to-know Postgres features and tricks for query optimization. The only prerequisites are basic SQL and a working Docker/Postgres install.

Setting up our environment

Let's start with a little docker container running Postgres:

$ docker run --name pg1 -d -P postgres
37cd118ecd727522bd904bffb672956c070dc7a5bc1a207a61de4b628a653b3b

# Find the port
$ docker port pg1
5432/tcp -> 0.0.0.0:32769

# Assuming you have the psql client installed, connect to it:
$ psql -h localhost -p 32769 -U postgres postgres

Adding mock data

First things first, let's make some mock data for us to play around with. We're going to make a table of domains (e.g. espn.com) and domain categories (e.g. sports publishers):

create table domain(id serial primary key, url text);
create table domain_category(id serial primary key);

And we're going to insert some data:

insert into domain select id from generate_series(0, 999) id;
insert into domain_category select id from generate_series(0, 49) id;

Sidenote

generate_series is a wonderful little function that does exactly what you think it does. It's fantastic for making mock data.
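
For instance, it also takes a step argument and even works with timestamps. A couple of throwaway examples, safe to run anywhere:

-- Count by twos
select generate_series(0, 10, 2);

-- One row per week in January 2017
select generate_series('2017-01-01'::timestamp, '2017-01-31'::timestamp, interval '1 week');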

Next, let's make a table that connects the two together:

create table domain_category_domain(
    id serial primary key,
    domain_category_id integer not null references domain_category(id),
    domain_id integer not null references domain(id)
);

insert into domain_category_domain(domain_category_id, domain_id)
select domain_category_id, domain_id
from generate_series(0, 49) as domain_category_id,
     generate_series(0, 999) as domain_id
where random() < 0.02;

Now it's very easy to get a list of domains in any category:

select count(domain_id)
from domain_category_domain
where domain_category_id = 0;
 count
-------
    23
(1 row)

Time: 1.526 ms

Now let's make things a little more complicated. Let's add parent categories so we can build a category hierarchy. Imagine a "Sports" category that contains "Football" and "Basketball" categories. To do this, we need a table that defines the parent-child relationships between categories:

create table domain_category_relation(
    id serial primary key,
    parent_id integer references domain_category(id) not null,
    child_id integer references domain_category(id) not null
);

insert into domain_category(id)
select id
from generate_series(50, 54) as id;

insert into domain_category_relation(child_id, parent_id)
select child_id, (child_id / 10) + 50
from generate_series(0, 49) as child_id;

select * from domain_category_relation where parent_id=50;
 id | parent_id | child_id
----+-----------+----------
  1 |        50 |        0
  2 |        50 |        1
  3 |        50 |        2
  4 |        50 |        3
  5 |        50 |        4
  6 |        50 |        5
  7 |        50 |        6
  8 |        50 |        7
  9 |        50 |        8
 10 |        50 |        9
(10 rows)

Okay, so far so good. How can we get a list of domains in parent categories? One option is to do a simple join:

select dcr.parent_id, dcd.domain_id
from domain_category_relation dcr
join domain_category_domain dcd
on dcd.domain_category_id = dcr.child_id
where dcr.parent_id=50;

However, what if we want the domains of a category and we don't know whether it's a parent category or not? One possible solution is to do both queries and union the results together:

select dcr.parent_id, dcd.domain_id
from domain_category_relation dcr
join domain_category_domain dcd
on dcd.domain_category_id = dcr.child_id
where dcr.parent_id=50
union
select domain_category_id, domain_id
from domain_category_domain
where domain_category_id=50;

But A) this is super ugly and B) it stops working if our parent categories get parent categories of their own. In other words, we're only solving a graph that's two layers deep. A better solution is to use something called a "recursive common table expression". But first, you should understand a normal "common table expression" (CTE):

with cached_result as (
    select 1
) select *
from cached_result;

The with syntax here is the CTE. It's a useful tool for factoring out subqueries, and I often find CTEs much cleaner than inline subqueries because you can give them names!
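
For example, here's a throwaway query against the tables we just built, with two named CTEs chained together:

with category_zero_domains as (
    -- domains tagged directly with category 0
    select domain_id
    from domain_category_domain
    where domain_category_id = 0
), category_zero_urls as (
    -- join back to the domain table for their urls
    select d.url
    from category_zero_domains czd
    join domain d on d.id = czd.domain_id
)
select count(*) from category_zero_urls;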

A recursive CTE is slightly different because it allows a CTE to iterate on itself:

with recursive flattened_domain_category_domain (domain_category_id, domain_id) as (
    select domain_category_id, domain_id
    from domain_category_domain
    union
    select dcr.parent_id, fdcd.domain_id
    from domain_category_relation dcr
    join flattened_domain_category_domain fdcd
    on fdcd.domain_category_id = dcr.child_id
) select domain_category_id, domain_id from flattened_domain_category_domain;

Now our join will work no matter how many layers we have!

However, it's quite a bit of work to write this out every time. If we were using an ORM, we'd be reading a lot of documentation to get this syntax down. To avoid this, we can write a view:

create view flattened_domain_category_domain_view(domain_category_id, domain_id) as
with recursive flattened_domain_category_domain(domain_category_id, domain_id) as (
    select domain_category_id, domain_id
    from domain_category_domain
    union
    select dcr.parent_id, fdcd.domain_id
    from domain_category_relation dcr
    join flattened_domain_category_domain fdcd
    on fdcd.domain_category_id = dcr.child_id
) select domain_category_id, domain_id from flattened_domain_category_domain;

select count(domain_id)
from flattened_domain_category_domain_view
where domain_category_id = 50;
 count
-------
   198
(1 row)

Time: 6.328 ms

The database will pretend this is a table and even let you join other tables to it! However, be careful with views: as you add filters and joins, the query planner can get confused, as we'll see later.

If the data in your view doesn't change very often, one common tool is a materialized view. Mviews, as they're commonly called, allow you to cache the results of the view and refresh them manually:

create materialized view flattened_domain_category_domain_mv(domain_category_id, domain_id) as
with recursive flattened_domain_category_domain(domain_category_id, domain_id) as (
    select domain_category_id, domain_id
    from domain_category_domain
    union
    select dcr.parent_id, fdcd.domain_id
    from domain_category_relation dcr
    join flattened_domain_category_domain fdcd
    on fdcd.domain_category_id = dcr.child_id
) select domain_category_id, domain_id from flattened_domain_category_domain;

-- Yes, you can add indices to make them really fast, like a normal table!
create unique index on flattened_domain_category_domain_mv(domain_category_id, domain_id);
create index on flattened_domain_category_domain_mv(domain_id);

refresh materialized view flattened_domain_category_domain_mv;

Keep in mind though: refresh materialized view will block reads. If you add a unique index to your mview (as you should), you can use refresh materialized view concurrently, which will refresh your mview without blocking reads.
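
Since we already created a unique index on our mview, that looks like this:

-- Refreshes the mview without locking out readers (requires the unique index above)
refresh materialized view concurrently flattened_domain_category_domain_mv;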

Foreign data wrappers

When you have a lot of data, it's common to split your tables between multiple databases. To simulate this, let's create another docker instance. This time, we'll add a "link" so the second docker instance can network to the first.

# Note the `link` option this time
$ docker run --name pg2 -d -P --link pg1:pg1 postgres
679d635ad021d3744fbd07c27f3120778484e0fb93d8632d486d207b0a348e5c

$ docker port pg2
5432/tcp -> 0.0.0.0:32773

$ psql -h localhost -p 32773 -U postgres postgres

We'll be using a Postgres extension called postgres_fdw that lets you talk to other Postgres instances. There are a lot of cool Postgres extensions out there: they range from new data types to foreign data wrappers to whole new storage engines and index types.
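
If you're curious what's bundled with your install, you can list the available extensions (postgres_fdw ships with the standard contrib package):

select name, default_version, comment
from pg_available_extensions
order by name;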

-- Presentation only
-- Disable indices for now
update pg_index set indisvalid = false where indexrelid = 'brand_domain_creative_start_date_domain_id_creative_id_bran_idx'::regclass;
update pg_index set indisvalid = false where indexrelid = 'brand_domain_creative_domain_id_start_date_idx'::regclass;
update pg_index set indisvalid = false where indexrelid = 'brand_domain_creative_start_date_brand_id_domain_id_idx'::regclass;

create extension postgres_fdw;

create server pg1 foreign data wrapper postgres_fdw options (host 'pg1', port '5432', dbname 'postgres');
create user mapping for postgres server pg1 options (user 'postgres');
create foreign table flattened_domain_category_domain (
    domain_category_id integer not null, domain_id integer not null
) server pg1 options (schema_name 'public', table_name 'flattened_domain_category_domain_mv');

select count(domain_id)
from flattened_domain_category_domain
where domain_category_id = 50;
 count
-------
   198
(1 row)

Time: 3.517 ms

There are also foreign data wrappers to MySQL, Redis, Dynamo, you name it.

Optimizing queries

Okay, now the fun stuff :). Let's create a brand_domain_creative table. We use something more-or-less shaped like this table in Moat Pro and it tells us the estimated impressions (score) a creative had on a certain day on a certain domain.

create table brand_domain_creative(
    start_date date not null,
    brand_id integer not null,
    domain_id integer not null,
    creative_id integer not null,
    score float not null
);

Neat! Next we're going to fill it with ~60M rows of simulated data. This may take a few minutes.

insert into brand_domain_creative(
    start_date, brand_id, domain_id, creative_id, score
)
select
    '2017-01-01'::date + date_offset,
    (creative_id / 3) :: integer,
    domain_id,
    creative_id,
    random() * 1000
from
    generate_series(0, 59) as date_offset,
    generate_series(0, 999) as domain_id,
    generate_series(0, 999) as creative_id;
INSERT 0 60000000
Time: 140552.498 ms

Now we can do queries like "what are the top 10 brands in January for domain 0?"

select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;

Yikes! Never mind. That's taking way too long. Hit Ctrl+C to interrupt the query and get out of there.

To see what happened, we can use explain. It outputs the database's execution plan for the query:

explain select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1432263.79..1432263.81 rows=10 width=12)
   ->  Sort  (cost=1432263.79..1432264.29 rows=200 width=12)
         Sort Key: (sum(score)) DESC
         ->  GroupAggregate  (cost=1432246.22..1432259.47 rows=200 width=12)
               Group Key: brand_id
               ->  Sort  (cost=1432246.22..1432249.97 rows=1500 width=12)
                     Sort Key: brand_id
                     ->  Seq Scan on brand_domain_creative  (cost=0.00..1432167.09 rows=1500 width=12)
                           Filter: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date) AND (domain_id = 0))
(9 rows)

You'll notice that each step has a cost estimate. Postgres collects statistics about your tables to estimate how long each step will take so it can choose the cheapest strategy. In this case, the statistics are a wee bit off: it thinks this domain will have ~1,500 rows in the date range when in actuality it's around 31k. We can tell it to refresh its statistics with an analyze command:

analyze brand_domain_creative;
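
If you're curious about what analyze actually collects, the per-column statistics it gathers are exposed in the pg_stats view:

-- A peek at the statistics the planner uses for our table
select attname, n_distinct, null_frac, correlation
from pg_stats
where tablename = 'brand_domain_creative';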

Now our explain looks something like this:

                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1434628.86..1434628.88 rows=10 width=12)
   ->  Sort  (cost=1434628.86..1434629.69 rows=334 width=12)
         Sort Key: (sum(score)) DESC
         ->  GroupAggregate  (cost=1434393.70..1434621.64 rows=334 width=12)
               Group Key: brand_id
               ->  Sort  (cost=1434393.70..1434468.57 rows=29947 width=12)
                     Sort Key: brand_id
                     ->  Seq Scan on brand_domain_creative  (cost=0.00..1432167.12 rows=29947 width=12)
                           Filter: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date) AND (domain_id = 0))
(9 rows)

Even with more accurate statistics, the database doesn't have any other option. To execute this query, it needs to scan every single row. How can we give it a shortcut? An index, of course!

create index on brand_domain_creative(start_date, domain_id, creative_id, brand_id);

-- Presentation only
update pg_index set indisvalid = true where indexrelid = 'brand_domain_creative_start_date_domain_id_creative_id_bran_idx'::regclass;
CREATE INDEX
Time: 155261.995 ms

What is an index? Essentially, think of it as the index at the back of a very large book. If you want to find every page that mentions the word aardvark, it's much faster to look up the entry in the book's index than to read every page.

By default, Postgres indices are b-trees under the hood because they're very versatile. However, you can choose other index types if you know what you're doing.
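
For illustration only (no need to run this for the rest of the tutorial), picking another index type is just a using clause. A BRIN index, for example, keeps one small summary per range of table blocks, so it's tiny compared to a b-tree, but it only pays off when matching rows are physically clustered on disk:

-- Illustrative only: a block-range (BRIN) index on the date column
create index on brand_domain_creative using brin(start_date);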

Building that b-tree took a long time because the computer had to go through every single row. When you think about it, though, two and a half minutes to organize 60M data points sounds pretty great. Dang, computers are cool.

Sidenote

Why are they called B-trees? Rudolf Bayer and Ed McCreight invented it while working at Boeing, so it could stand for "Boeing", "Bayer", or even "balance" tree. McCreight says they couldn't decide between the options. They couldn't name it Boeing without lawyers being involved, so the company missed out on some great free advertising.

Now let's try this query again!

select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
 brand_id |       sum
----------+------------------
      319 | 54501.5292000026
       29 | 53102.4147509597
      279 | 52780.3690684959
      311 | 52639.5615208894
      318 | 52623.2881098986
       44 | 52072.1511933953
      278 | 51771.6338969767
       35 | 51655.6670391001
      115 | 51610.0491145626
      200 | 51545.3661591746
(10 rows)

Time: 17177.307 ms

That was significantly faster. And it'll be even faster if you run it again, thanks to caching:

 brand_id |       sum
----------+------------------
      319 | 54501.5292000026
       29 | 53102.4147509597
      279 | 52780.3690684959
      311 | 52639.5615208894
      318 | 52623.2881098986
       44 | 52072.1511933953
      278 | 51771.6338969767
       35 | 51655.6670391001
      115 | 51610.0491145626
      200 | 51545.3661591746
(10 rows)

Time: 1362.534 ms

Caching? How does caching work? The operating system's page cache plays the biggest role here, but Postgres also keeps caches of its own (most notably shared_buffers) for recently used things like:

  • Table data
  • Indexes
  • Query execution plans

Keeping all of this consistent while writes are coming in is some serious voodoo magic, so if you see a Postgres contributor, buy them a beer.
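
You can check how much memory is set aside for shared_buffers (and raise it in postgresql.conf if you have RAM to spare):

-- Typically 128MB out of the box
show shared_buffers;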

Let's see why our query got faster via an explain analyze. explain analyze is like explain, but it also runs the query to give you real measurements. The verbose and buffers options add debug information about each step.

explain (analyze, buffers, verbose) select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
                                                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=957420.30..957420.33 rows=10 width=12) (actual time=1441.634..1441.636 rows=10 loops=1)
   Output: brand_id, (sum(score))
   Buffers: shared read=149777
   ->  Sort  (cost=957420.30..957421.14 rows=334 width=12) (actual time=1441.632..1441.632 rows=10 loops=1)
         Output: brand_id, (sum(score))
         Sort Key: (sum(brand_domain_creative.score)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared read=149777
         ->  GroupAggregate  (cost=957185.14..957413.08 rows=334 width=12) (actual time=1435.458..1441.548 rows=334 loops=1)
               Output: brand_id, sum(score)
               Group Key: brand_domain_creative.brand_id
               Buffers: shared read=149777
               ->  Sort  (cost=957185.14..957260.01 rows=29947 width=12) (actual time=1435.431..1437.536 rows=31000 loops=1)
                     Output: brand_id, score
                     Sort Key: brand_domain_creative.brand_id
                     Sort Method: quicksort  Memory: 2222kB
                     Buffers: shared read=149777
                     ->  Bitmap Heap Scan on public.brand_domain_creative  (cost=862903.05..954958.56 rows=29947 width=12) (actual time=1277.881..1430.520 rows=31000 loops=1)
                           Output: brand_id, score
                           Recheck Cond: ((brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date) AND (brand_domain_creative.domain_id = 0))
                           Heap Blocks: exact=31000
                           Buffers: shared read=149777
                           ->  Bitmap Index Scan on brand_domain_creative_start_date_domain_id_creative_id_bran_idx  (cost=0.00..862895.56 rows=29947 width=0) (actual time=1272.282..1272.282 rows=31000 loops=1)
                                 Index Cond: ((brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date) AND (brand_domain_creative.domain_id = 0))
                                 Buffers: shared read=118777
 Planning time: 0.225 ms
 Execution time: 1441.689 ms
(27 rows)

Let's interpret this explain from the inside out.

  • "Bitmap index scan": our index is large enough to take up several blocks. Because of the way B-Trees work, we can make a map that can tell us which blocks contain indices that match our conditions. The resulting array of booleans is our bitmap. We then use this bitmap to read the relevant index blocks and collect all the indices that match our conditions. This took 1.27s.

  • "Bitmap heap scan": Armed with our indices, we create a bitmap of heap blocks to read and then we read them. This took almost no time at 0.16s and resulted in 31k rows.

  • "Sort": Looks like Postgres is sorting the rows with quicksort to make it easier to...

  • "GroupAggregate": Group the rows together by brand_id and sum the scores (334 resulting rows).

  • "Sort": Sort our grouped rows based on sum(score) DESC using top-N heapsort

  • "Limit": limit our results from 100 to 10.

Sidenote

Quicksort is in-place, so it makes sense they chose that for 31k rows. Top-N heapsort is a sort where you only keep the Top-N, which is significantly less complex. It only makes sense if you do a limit after your sort.

Can we do better? Sure! It looks like the slow part is getting entries out of the index: we have to read ~118k buffers for the index scan but only 31k buffers to actually fetch the data (roughly one heap block per matching row, since the matching rows are scattered all over the table).

Why does the index need to read so many blocks? Well, it's because of its shape. Our index is keyed on (start_date, domain_id, creative_id, brand_id). This means that if you laid its entries out in order, they would look like this:

('2017-01-01', 0, 0, 0)
... ~1000 more entries ...
('2017-01-01', 1, 0, 0)
... ~1M more entries
('2017-01-02', 0, 0, 0)

So in every 1M index entries, only 1k of them are relevant to our query at hand. Thus, we can assume that we have to read a lot of different index blocks to collect our heap.

What happens if we make a new index organized by (domain_id, start_date)? Then the index entries we care about sit significantly closer together, and the b-tree doesn't have to store keys for creative_id/brand_id.

create index on brand_domain_creative(domain_id, start_date);

-- Presentation only
update pg_index set indisvalid = true where indexrelid = 'brand_domain_creative_domain_id_start_date_idx'::regclass;
CREATE INDEX
Time: 91935.165 ms
select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
 brand_id |       sum
----------+------------------
      319 | 54501.5292000026
       29 | 53102.4147509597
      279 | 52780.3690684959
      311 | 52639.5615208894
      318 | 52623.2881098986
       44 | 52072.1511933953
      278 | 51771.6338969767
       35 | 51655.6670391001
      115 | 51610.0491145626
      200 | 51545.3661591746
(10 rows)

Time: 152.984 ms

Great Neptune's trident, that was fast! Let's see how things changed:

                                                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=92930.19..92930.22 rows=10 width=12) (actual time=154.177..154.180 rows=10 loops=1)
   Output: brand_id, (sum(score))
   Buffers: shared read=31088
   ->  Sort  (cost=92930.19..92931.03 rows=334 width=12) (actual time=154.175..154.176 rows=10 loops=1)
         Output: brand_id, (sum(score))
         Sort Key: (sum(brand_domain_creative.score)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared read=31088
         ->  HashAggregate  (cost=92919.64..92922.98 rows=334 width=12) (actual time=154.047..154.111 rows=334 loops=1)
               Output: brand_id, sum(score)
               Group Key: brand_domain_creative.brand_id
               Buffers: shared read=31088
               ->  Bitmap Heap Scan on public.brand_domain_creative  (cost=714.39..92769.90 rows=29947 width=12) (actual time=15.778..145.068 rows=31000 loops=1)
                     Output: start_date, brand_id, domain_id, creative_id, score
                     Recheck Cond: ((brand_domain_creative.domain_id = 0) AND (brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date))
                     Heap Blocks: exact=31000
                     Buffers: shared read=31088
                     ->  Bitmap Index Scan on brand_domain_creative_domain_id_start_date_idx  (cost=0.00..706.90 rows=29947 width=0) (actual time=9.210..9.210 rows=31000 loops=1)
                           Index Cond: ((brand_domain_creative.domain_id = 0) AND (brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date))
                           Buffers: shared read=88
 Planning time: 0.145 ms
 Execution time: 154.279 ms
(22 rows)

As expected, we got our rows out significantly faster (144ms). Interestingly, the DB switched from a GroupAggregate to a HashAggregate even though the only step that should have been affected was the index scan. Databases are mysterious beasts. In this case, it bought us 2ms. Huzzah!

Sidenote

Another common reason for slow bitmap index scans is a lack of vacuuming. By default, Postgres keeps old versions of rows around for MVCC (multi-version concurrency control), and they can linger in your indexes as well. Vacuum frequently, kids.
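
Autovacuum normally handles this in the background, but you can also run it by hand:

-- Reclaims dead row versions; verbose prints what got cleaned up and
-- analyze refreshes the planner statistics while we're at it
vacuum (verbose, analyze) brand_domain_creative;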

Performance tuning

Let's try another query: which brands have appeared on the most domains?

select brand_id, count(distinct domain_id)
from brand_domain_creative
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;

Yikes. This is going to be slow again. Our indices can only eliminate half of our dataset. What can we do?

One solution is to have a (brand_id, start_date, domain_id) index. Maybe that way, Postgres won't need the actual rows to answer the query:

create index on brand_domain_creative(brand_id, start_date, domain_id);

-- Presentation only
update pg_index set indisvalid = true where indexrelid = 'brand_domain_creative_brand_id_start_date_domain_id_idx'::regclass;

Programmer uses index! It's not very effective!

explain (analyze, buffers, verbose) select brand_id, count(distinct domain_id)
from brand_domain_creative
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6630126.72..6630126.75 rows=10 width=12) (actual time=72016.241..72016.243 rows=10 loops=1)
   Output: brand_id, (count(DISTINCT domain_id))
   Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
   ->  Sort  (cost=6630126.72..6630127.56 rows=334 width=12) (actual time=72016.239..72016.240 rows=10 loops=1)
         Output: brand_id, (count(DISTINCT domain_id))
         Sort Key: (count(DISTINCT brand_domain_creative.domain_id)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
         ->  GroupAggregate  (cost=6398171.16..6630119.50 rows=334 width=12) (actual time=59799.011..72015.292 rows=334 loops=1)
               Output: brand_id, count(DISTINCT domain_id)
               Group Key: brand_domain_creative.brand_id
               Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
               ->  Sort  (cost=6398171.16..6475486.16 rows=30926000 width=8) (actual time=59762.502..65814.582 rows=31000000 loops=1)
                     Output: brand_id, domain_id
                     Sort Key: brand_domain_creative.brand_id
                     Sort Method: external merge  Disk: 545432kB
                     Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
                     ->  Seq Scan on public.brand_domain_creative  (cost=0.00..1282166.00 rows=30926000 width=8) (actual time=0.016..28675.620 rows=31000000 loops=1)
                           Output: brand_id, domain_id
                           Filter: ((brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date))
                           Rows Removed by Filter: 29000000
                           Buffers: shared hit=16284 read=365882
 Planning time: 0.659 ms
 Execution time: 72123.210 ms

Whoa, wtf? Why is it doing a sequential scan over the rows? Even an analyze doesn't change this. If it used the index, it would only have to walk entries for 31 (days) * 334 (brands) * 1000 (domains) = 10.354M distinct index keys. That's far less data than reading all 60M full-width rows!

Well, the difference is that index disk access is random whereas a sequential scan is, well, sequential. By default, the optimizer assigns random reads a random_page_cost of 4 (versus a seq_page_cost of 1). And keep in mind that getting to one index block involves reading a number of other index blocks along the way, because that's how b-trees work.

But wait! My computer has an SSD! Shouldn't they be weighted the same? Well, you can tune that in your Postgres config by changing random_page_cost to 1 :). Or you can do it temporarily in your session:

set random_page_cost=1;
explain select brand_id, count(distinct domain_id)
from brand_domain_creative
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Limit  (cost=6086504.22..6086504.25 rows=10 width=12)
   ->  Sort  (cost=6086504.22..6086505.06 rows=334 width=12)
         Sort Key: (count(DISTINCT domain_id)) DESC
         ->  GroupAggregate  (cost=5854548.66..6086497.00 rows=334 width=12)
               Group Key: brand_id
               ->  Sort  (cost=5854548.66..5931863.66 rows=30926000 width=8)
                     Sort Key: brand_id
                     ->  Seq Scan on brand_domain_creative  (cost=0.00..1282166.00 rows=30926000 width=8)
                           Filter: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date))
(9 rows)

Wtf Rodrigo? You lied! It's still doing a sequential scan!

Well, as a hack, if you set seq_page_cost to 9999, you can see what its index plan would look like:

set seq_page_cost=9999;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1118169096.56..1118169096.59 rows=10 width=12)
   ->  Sort  (cost=1118169096.56..1118169097.40 rows=334 width=12)
         Sort Key: (count(DISTINCT domain_id)) DESC
         ->  GroupAggregate  (cost=0.56..1118169089.34 rows=334 width=12)
               Group Key: brand_id
               ->  Index Only Scan using brand_domain_creative_brand_id_start_date_domain_id_idx on brand_domain_creative  (cost=0.56..1118014456.00 rows=30926000 width=8)
                     Index Cond: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date))
(7 rows)

Huh, so the database doesn't have a way to do the index scan and the GroupAggregate any more cleverly than this, so it's forced to index scan through all 31M entries! Maybe there's a good reason for that: database programming is hard and there are a ton of corner cases.

If you want to research it, pull requests are welcome ;).

Pre-aggregation

First, let's set our seq_page_cost back to normal:

set seq_page_cost=1;

So how can we make the above query faster? Well, if the planner is dead set on doing a sequential scan, we can simply give it fewer rows! In this query, we don't need the creative_id column. So what if we removed it and rolled all the per-creative scores up into brands?

create table brand_domain(
    start_date date not null,
    brand_id integer not null,
    domain_id integer not null,
    score float not null
);
insert into brand_domain(start_date, brand_id, domain_id, score)
select
    start_date,
    brand_id,
    domain_id,
    sum(score)
from
    brand_domain_creative
group by
    start_date, brand_id, domain_id
order by
    start_date, brand_id, domain_id;

select brand_id, count(distinct domain_id)
from brand_domain
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
 brand_id | count
----------+-------
        1 |  1000
        2 |  1000
        3 |  1000
        4 |  1000
        5 |  1000
        6 |  1000
        7 |  1000
        8 |  1000
        9 |  1000
        0 |  1000
(10 rows)

Time: 10877.772 ms

This table has about a third as many rows as brand_domain_creative (60 days * 334 brands * 1000 domains ≈ 20M instead of 60M), so it makes sense that the query got several times faster.
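
You can check the sizes yourself with pg_relation_size (which measures just the table's heap, ignoring indexes):

select pg_size_pretty(pg_relation_size('brand_domain_creative')) as daily_by_creative,
       pg_size_pretty(pg_relation_size('brand_domain')) as daily_by_brand;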

And what happens if we add an index?

create index on brand_domain(brand_id, start_date, domain_id);

explain (analyze, buffers, verbose) select brand_id, count(distinct domain_id)
from brand_domain
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
                                                                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1434911.01..1434911.04 rows=10 width=12) (actual time=8470.640..8470.642 rows=10 loops=1)
   Output: brand_id, (count(DISTINCT domain_id))
   Buffers: shared hit=10288 read=142735
   ->  Sort  (cost=1434911.01..1434911.85 rows=334 width=12) (actual time=8470.639..8470.641 rows=10 loops=1)
         Output: brand_id, (count(DISTINCT domain_id))
         Sort Key: (count(DISTINCT brand_domain.domain_id)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=10288 read=142735
         ->  GroupAggregate  (cost=0.56..1434903.80 rows=334 width=12) (actual time=25.915..8469.810 rows=334 loops=1)
               Output: brand_id, count(DISTINCT domain_id)
               Group Key: brand_domain.brand_id
               Buffers: shared hit=10288 read=142735
               ->  Index Only Scan using brand_domain_brand_id_start_date_domain_id_idx on public.brand_domain  (cost=0.56..1383494.52 rows=10281188 width=8) (actual time=1.181..6603.681 rows=10354000 loops=1)
                     Output: brand_id, start_date, domain_id
                     Index Cond: ((brand_domain.start_date >= '2017-01-01'::date) AND (brand_domain.start_date <= '2017-01-31'::date))
                     Heap Fetches: 10354000
                     Buffers: shared hit=10288 read=142735
 Planning time: 0.112 ms
 Execution time: 8470.682 ms
(19 rows)

Time: 8472.057 ms

Well, now we get our 10M-row index-only scan :).

But there's one more optimization we can do! We reduced brand_domain_creative to brand_domain, and in our business logic we frequently run month-long queries. What happens if we roll start_date up to the month as well?

create table monthly_brand_domain(
    start_date date not null,
    brand_id integer not null,
    domain_id integer not null,
    score float not null
);
create unique index on monthly_brand_domain(start_date, brand_id, domain_id);
insert into monthly_brand_domain(
    start_date, brand_id, domain_id, score
) select
    date_trunc('month', start_date)::date,
    brand_id,
    domain_id,
    sum(score)
from
    brand_domain
group by
    date_trunc('month', start_date)::date, brand_id, domain_id
order by
    date_trunc('month', start_date)::date, brand_id, domain_id;

Now, this query turns into:

select brand_id, count(distinct domain_id)
from monthly_brand_domain
where start_date = '2017-01-01'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
 brand_id | count
----------+-------
        1 |  1000
        2 |  1000
        3 |  1000
        4 |  1000
        5 |  1000
        6 |  1000
        7 |  1000
        8 |  1000
        9 |  1000
        0 |  1000
(10 rows)

Time: 93.158 ms

Woosh! The moral of the story here is, you can learn to be a database whisperer, but normally the simplest approach (pre-compute as much as you can) is the right answer.

However, every neat trick in CS comes at a cost and pre-aggregation is no exception. Specifically:

  • Extra disk space

  • Compute time to build and maintain the rollups

  • Rollup tables drifting out of sync with the source data (see the sketch below)
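
That last one deserves a sketch. A simple (if blunt) way to keep our monthly rollup fresh is to rebuild one month at a time inside a transaction; a real pipeline would schedule something like this, or do it incrementally:

begin;
-- Throw away the month we're about to rebuild...
delete from monthly_brand_domain where start_date = '2017-01-01';
-- ...and recompute it from the daily rollup
insert into monthly_brand_domain(start_date, brand_id, domain_id, score)
select date_trunc('month', start_date)::date, brand_id, domain_id, sum(score)
from brand_domain
where start_date >= '2017-01-01' and start_date < '2017-02-01'
group by date_trunc('month', start_date)::date, brand_id, domain_id;
commit;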

One last thing: what if you have a query where you want to see the brands with the most domains from '2017-01-25' to '2017-02-28'? In this case, the optimal query involves getting the daily rows from brand_domain and the monthly rows from monthly_brand_domain:

select s.brand_id, count(distinct s.domain_id)
from (
    select brand_id, domain_id
    from brand_domain
    where start_date >= '2017-01-25'
    and start_date <= '2017-01-31'
    union all
    select brand_id, domain_id
    from monthly_brand_domain
    where start_date = '2017-02-01'
) s
group by brand_id
limit 10;
 brand_id | count
----------+-------
        0 |  1000
        1 |  1000
        2 |  1000
        3 |  1000
        4 |  1000
        5 |  1000
        6 |  1000
        7 |  1000
        8 |  1000
        9 |  1000
(10 rows)

Time: 2283.934 ms

Pretty freakin' fast. In Moat Pro, we have every cut imaginable, from brand to monthly_creative_hostname to alltime_tag_brand_creative. We have a query engine that chooses the right cuts and aggregates their results together.