Postgres Advanced Features
By Rodrigo Menezes (@rclmenezes)
So you already know some Postgres? Okay. In this tutorial, we're going to cover some good-to-know Postgres features and tricks for query optimization. The only prerequisite is basic SQL and having Docker/Postgres installed.
Setting up our environment
Let's start with a little docker container running Postgres:
$ docker run --name pg1 -d -P postgres
37cd118ecd727522bd904bffb672956c070dc7a5bc1a207a61de4b628a653b3b
# Find the port
$ docker port pg1
5432/tcp -> 0.0.0.0:32769
# Assuming you have the psql client installed, connect to it:
$ psql -h localhost -p 32769 -U postgres postgres
Adding mock data
First things first, let's make some mock data for us to play around with. We're going to make a table of domains (e.g. espn.com) and domain categories (e.g. sports publishers):
create table domain(id serial primary key, url text);
create table domain_category(id serial primary key);
And we're going to insert some data:
insert into domain select id from generate_series(0, 999) id;
insert into domain_category select id from generate_series(0, 49) id;
Sidenote
generate_series is a wonderful little function that does exactly what you think it does. It's fantastic for making mock data.
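For example (a quick illustration), it returns one row per value in the range you give it:
select * from generate_series(1, 3);
 generate_series
-----------------
               1
               2
               3
(3 rows)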
Next, let's make a table that connects the two together:
create table domain_category_domain(
id serial primary key,
domain_category_id integer not null references domain_category(id),
domain_id integer not null references domain(id)
);
insert into domain_category_domain(domain_category_id, domain_id)
select domain_category_id, domain_id
from generate_series(0, 49) as domain_category_id,
generate_series(0, 999) as domain_id
where random() < 0.02;
Now it's very easy to get a list of domains in any category:
select count(domain_id)
from domain_category_domain
where domain_category_id = 0;
count
-------
23
(1 row)
Time: 1.526 ms
Now let's make things a little more complicated. Let's add parent categories so we can build a category hierarchy. Imagine a "Sports" category that contains "Football" and "Basketball" categories. To do this, we need a table that defines the parent-child relationships for categories:
create table domain_category_relation(
id serial primary key,
parent_id integer references domain_category(id) not null,
child_id integer references domain_category(id) not null
);
insert into domain_category(id)
select id
from generate_series(50, 54) as id;
insert into domain_category_relation(child_id, parent_id)
select child_id, (child_id / 10) + 50
from generate_series(0, 49) as child_id;
select * from domain_category_relation where parent_id=50;
id | parent_id | child_id
----+-----------+----------
1 | 50 | 0
2 | 50 | 1
3 | 50 | 2
4 | 50 | 3
5 | 50 | 4
6 | 50 | 5
7 | 50 | 6
8 | 50 | 7
9 | 50 | 8
10 | 50 | 9
(10 rows)
Okay, so far so good. How can we get a list of domains in parent categories? One option is to do a simple join:
select dcr.parent_id, dcd.domain_id
from domain_category_relation dcr
join domain_category_domain dcd
on dcd.domain_category_id = dcr.child_id
where dcr.parent_id=50;
However, what if we want the domains of a category and we don't know whether it's a parent category or not? One possible solution is to do both queries and union the results together:
select dcr.parent_id, dcd.domain_id
from domain_category_relation dcr
join domain_category_domain dcd
on dcd.domain_category_id = dcr.child_id
where dcr.parent_id=50
union
select domain_category_id, domain_id
from domain_category_domain
where domain_category_id=50;
But A) this is super ugly and B) it stops working if our parent categories get parent categories. In other words, we are only solving a graph that's two layers deep. A better solution is to use something called a "recursive common table expression". But first, you should understand a normal "common table expression" (CTE):
with cached_result as (
select 1
) select *
from cached_result;
The with syntax here is the CTE. It's a useful tool for caching subqueries, and I often find CTEs much cleaner than actual subqueries because you can give them names!
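As a small illustration against the tables we just built (not part of the original walkthrough), you can name the CTE after what it actually holds:
with domains_per_category as (
    select domain_category_id, count(*) as num_domains
    from domain_category_domain
    group by domain_category_id
)
select domain_category_id, num_domains
from domains_per_category
order by num_domains desc
limit 5;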
A recursive CTE is slightly different because it allows a CTE to iterate on itself:
with recursive flattened_domain_category_domain (domain_category_id, domain_id) as (
select domain_category_id, domain_id
from domain_category_domain
union
select dcr.parent_id, fdcd.domain_id
from domain_category_relation dcr
join flattened_domain_category_domain fdcd
on fdcd.domain_category_id = dcr.child_id
) select domain_category_id, domain_id from flattened_domain_category_domain;
Now our join will work no matter how many layers we have!
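To convince yourself, here's a hedged illustration: category id 55 is made up for this example (it isn't used anywhere else in this tutorial), stacked as a "grandparent" on top of categories 50-54. The exact same CTE resolves it:
-- Hypothetical third layer: category 55 contains categories 50-54
insert into domain_category(id) values (55);
insert into domain_category_relation(child_id, parent_id)
select child_id, 55 from generate_series(50, 54) as child_id;
with recursive flattened_domain_category_domain (domain_category_id, domain_id) as (
    select domain_category_id, domain_id
    from domain_category_domain
    union
    select dcr.parent_id, fdcd.domain_id
    from domain_category_relation dcr
    join flattened_domain_category_domain fdcd
    on fdcd.domain_category_id = dcr.child_id
) select count(domain_id)
from flattened_domain_category_domain
where domain_category_id = 55;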
However, it's quite a bit of work to write this out every time. If we were using an ORM, we'd be reading a lot of documentation to get this syntax down. To avoid this, we can write a view:
create view flattened_domain_category_domain_view(domain_category_id, domain_id) as
with recursive flattened_domain_category_domain(domain_category_id, domain_id) as (
select domain_category_id, domain_id
from domain_category_domain
union
select dcr.parent_id, fdcd.domain_id
from domain_category_relation dcr
join flattened_domain_category_domain fdcd
on fdcd.domain_category_id = dcr.child_id
) select domain_category_id, domain_id from flattened_domain_category_domain;
select count(domain_id)
from flattened_domain_category_domain_view
where domain_category_id = 50;
count
-------
198
(1 row)
Time: 6.328 ms
The database will pretend this is a table and will even join other tables to it! However, be careful with views: as you add filters and joins, the query planner can get very confused, as we'll see later.
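Since it behaves like a table, you can join it like one. A quick illustration (the url column is empty in our mock data, but the join shape is what matters):
select d.id, d.url
from flattened_domain_category_domain_view v
join domain d on d.id = v.domain_id
where v.domain_category_id = 50
limit 5;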
If the data in your view doesn't change very often, one common tool is a materialized view. Mviews, as they're commonly called, allow you to cache the results of the view and only refresh them manually:
create materialized view flattened_domain_category_domain_mv(domain_category_id, domain_id) as
with recursive flattened_domain_category_domain(domain_category_id, domain_id) as (
select domain_category_id, domain_id
from domain_category_domain
union
select dcr.parent_id, fdcd.domain_id
from domain_category_relation dcr
join flattened_domain_category_domain fdcd
on fdcd.domain_category_id = dcr.child_id
) select domain_category_id, domain_id from flattened_domain_category_domain;
-- Yes, you can add indices to make them really fast, like a normal table!
create unique index on flattened_domain_category_domain_mv(domain_category_id, domain_id);
create index on flattened_domain_category_domain_mv(domain_id);
refresh materialized view flattened_domain_category_domain_mv;
Keep in mind though: refresh materialized view will block reads. If you add a unique index to your mview (as you should), you can use refresh materialized view concurrently, which will refresh your mview without blocking reads.
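Since we already created a unique index above, the non-blocking version is simply:
refresh materialized view concurrently flattened_domain_category_domain_mv;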
Foreign data wrappers
When you have a lot of data, it's common to split your tables between multiple databases. To simulate this, let's create another docker instance. This time, we'll add a "link" so the second docker instance can network to the first.
# Note the `link` option this time
$ docker run --name pg2 -d -P --link pg1:pg1 postgres
679d635ad021d3744fbd07c27f3120778484e0fb93d8632d486d207b0a348e5c
$ docker port pg2
5432/tcp -> 0.0.0.0:32773
$ psql -h localhost -p 32773 -U postgres postgres
We'll be using a Postgres extension called postgres_fdw that allows you to communicate with other Postgres instances. There are a lot of cool Postgres extensions out there: they range from new data types to foreign data wrappers to entirely new storage engines and index types.
-- Presentation only
-- Disable indices for now
update pg_index set indisvalid = false where indexrelid = 'brand_domain_creative_start_date_domain_id_creative_id_bran_idx'::regclass;
update pg_index set indisvalid = false where indexrelid = 'brand_domain_creative_domain_id_start_date_idx'::regclass;
update pg_index set indisvalid = false where indexrelid = 'brand_domain_creative_start_date_brand_id_domain_id_idx'::regclass;
create extension postgres_fdw;
create server pg1 foreign data wrapper postgres_fdw options (host 'pg1', port '5432', dbname 'postgres');
create user mapping for postgres server pg1 options (user 'postgres');
create foreign table flattened_domain_category_domain (
domain_category_id integer not null, domain_id integer not null
) server pg1 options (schema_name 'public', table_name 'flattened_domain_category_domain_mv');
select count(domain_id)
from flattened_domain_category_domain
where domain_category_id = 50;
count
-------
198
(1 row)
Time: 3.517 ms
There are also foreign data wrappers for MySQL, Redis, Dynamo, you name it.
Optimizing queries
Okay, now the fun stuff :). Let's create a brand_domain_creative table. We use something more-or-less shaped like this table in Moat Pro; it tells us the estimated impressions (score) a creative had on a certain day on a certain domain.
create table brand_domain_creative(
start_date date not null,
brand_id integer not null,
domain_id integer not null,
creative_id integer not null,
score float not null
);
Neat! Next, we're going to fill it with ~60M rows of simulated data. This may take a few minutes.
insert into brand_domain_creative(
start_date, brand_id, domain_id, creative_id, score
)
select
'2017-01-01'::date + date_offset,
(creative_id / 3) :: integer,
domain_id,
creative_id,
random() * 1000
from
generate_series(0, 59) as date_offset,
generate_series(0, 999) as domain_id,
generate_series(0, 999) as creative_id;
INSERT 0 60000000
Time: 140552.498 ms
Now we can do queries like "what are the top 10 brands in January for domain 0?"
select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
Yikes! Never mind. That's taking way too long. Ctrl+C to interrupt the query and get out of there.
To see what happened, we can use an explain query. It'll output the database's execution plan for running the query:
explain select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1432263.79..1432263.81 rows=10 width=12)
-> Sort (cost=1432263.79..1432264.29 rows=200 width=12)
Sort Key: (sum(score)) DESC
-> GroupAggregate (cost=1432246.22..1432259.47 rows=200 width=12)
Group Key: brand_id
-> Sort (cost=1432246.22..1432249.97 rows=1500 width=12)
Sort Key: brand_id
-> Seq Scan on brand_domain_creative (cost=0.00..1432167.09 rows=1500 width=12)
Filter: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date) AND (domain_id = 0))
(9 rows)
You'll notice that each step has a cost estimate. Postgres keeps statistics about your tables to estimate how long each step will take so it can intelligently choose the optimal strategy. In this case, the statistics are a wee bit off: it thinks the domain will have ~1500 rows in this date range when in actuality it's around 31k. We can tell it to re-analyze the table's contents via an analyze command:
analyze brand_domain_creative;
Now our explain looks something like this:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1434628.86..1434628.88 rows=10 width=12)
-> Sort (cost=1434628.86..1434629.69 rows=334 width=12)
Sort Key: (sum(score)) DESC
-> GroupAggregate (cost=1434393.70..1434621.64 rows=334 width=12)
Group Key: brand_id
-> Sort (cost=1434393.70..1434468.57 rows=29947 width=12)
Sort Key: brand_id
-> Seq Scan on brand_domain_creative (cost=0.00..1432167.12 rows=29947 width=12)
Filter: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date) AND (domain_id = 0))
(9 rows)
Even with more accurate statistics, the database doesn't have any other option: to execute this query, it needs to scan every single row. How can we give it a shortcut? An index, of course!
create index on brand_domain_creative(start_date, domain_id, creative_id, brand_id);
-- Presentation only
update pg_index set indisvalid = true where indexrelid = 'brand_domain_creative_start_date_domain_id_creative_id_bran_idx'::regclass;
CREATE INDEX
Time: 155261.995 ms
What is an index? Essentially, think of it as the index at the back of a very large book: if you want to find every page that mentions the word aardvark, it's much faster to look it up there than to read every page.
By default, Postgres indices are b-trees under the hood because they're very versatile. However, you can choose other index types if you know what you're doing.
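For instance, here's a hedged aside (we won't use this index anywhere below): a BRIN index stores tiny summaries of ranges of table blocks, and it can be a good fit for a large column whose values correlate with physical row order, as start_date often does in append-only tables.
-- Illustration only: a BRIN index on start_date; much smaller than a b-tree,
-- but only effective when values correlate with physical row order
create index brand_domain_creative_start_date_brin
    on brand_domain_creative using brin (start_date);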
Building our b-tree index took a long time because the computer had to go through every single data point. When you think about it, two and a half minutes to organize 60M data points sounds pretty great. Dang, computers are cool.
Sidenote
Why are they called B-trees? Rudolf Bayer and Ed McCreight invented it while working at Boeing, so it could stand for "Boeing", "Bayer", or even "balance" tree. McCreight says they couldn't decide between the options. They couldn't name it Boeing without lawyers being involved, so the company missed out on some great free advertising.
Now let's try this query again!
select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
brand_id | sum
----------+------------------
319 | 54501.5292000026
29 | 53102.4147509597
279 | 52780.3690684959
311 | 52639.5615208894
318 | 52623.2881098986
44 | 52072.1511933953
278 | 51771.6338969767
35 | 51655.6670391001
115 | 51610.0491145626
200 | 51545.3661591746
(10 rows)
Time: 17177.307 ms
That was significantly faster. And it'll be even faster if you run it again, because of caching:
brand_id | sum
----------+------------------
319 | 54501.5292000026
29 | 53102.4147509597
279 | 52780.3690684959
311 | 52639.5615208894
318 | 52623.2881098986
44 | 52072.1511933953
278 | 51771.6338969767
35 | 51655.6670391001
115 | 51610.0491145626
200 | 51545.3661591746
(10 rows)
Time: 1362.534 ms
Caching? How does caching work? Well, the operating system's page cache plays the biggest role here, but Postgres also has shared_buffers that cache recent information, including:
- Table data
- Indexes
- Query execution plans
Keeping all of this consistent while writes are coming in is some serious voodoo magic, so if you see a Postgres contributor, buy them a beer.
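If you're curious how much memory your instance devotes to shared_buffers, you can check the setting (the stock Docker image just uses the Postgres default, which is fairly small):
show shared_buffers;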
Let's see how our query got faster via an explain analyze. An explain analyze is like an explain, but it also runs the query to give you more information. The verbose and buffers options give you debug information about each step.
explain (analyze, buffers, verbose) select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=957420.30..957420.33 rows=10 width=12) (actual time=1441.634..1441.636 rows=10 loops=1)
Output: brand_id, (sum(score))
Buffers: shared read=149777
-> Sort (cost=957420.30..957421.14 rows=334 width=12) (actual time=1441.632..1441.632 rows=10 loops=1)
Output: brand_id, (sum(score))
Sort Key: (sum(brand_domain_creative.score)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared read=149777
-> GroupAggregate (cost=957185.14..957413.08 rows=334 width=12) (actual time=1435.458..1441.548 rows=334 loops=1)
Output: brand_id, sum(score)
Group Key: brand_domain_creative.brand_id
Buffers: shared read=149777
-> Sort (cost=957185.14..957260.01 rows=29947 width=12) (actual time=1435.431..1437.536 rows=31000 loops=1)
Output: brand_id, score
Sort Key: brand_domain_creative.brand_id
Sort Method: quicksort Memory: 2222kB
Buffers: shared read=149777
-> Bitmap Heap Scan on public.brand_domain_creative (cost=862903.05..954958.56 rows=29947 width=12) (actual time=1277.881..1430.520 rows=31000 loops=1)
Output: brand_id, score
Recheck Cond: ((brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date) AND (brand_domain_creative.domain_id = 0))
Heap Blocks: exact=31000
Buffers: shared read=149777
-> Bitmap Index Scan on brand_domain_creative_start_date_domain_id_creative_id_bran_idx (cost=0.00..862895.56 rows=29947 width=0) (actual time=1272.282..1272.282 rows=31000 loops=1)
Index Cond: ((brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date) AND (brand_domain_creative.domain_id = 0))
Buffers: shared read=118777
Planning time: 0.225 ms
Execution time: 1441.689 ms
(27 rows)
Let's interpret this explain from inside out.
- "Bitmap index scan": our index is large enough to take up several blocks. Because of the way b-trees work, we can build a map that tells us which blocks contain index entries matching our conditions. The resulting array of booleans is our bitmap. We then use this bitmap to read the relevant index blocks and collect all the entries that match our conditions. This took 1.27s.
- "Bitmap heap scan": armed with those entries, we create a bitmap of heap blocks to read and then we read them. This took almost no time at ~0.16s and produced 31k rows.
- "Sort": Postgres sorts the rows with quicksort to make it easier to...
- "GroupAggregate": group the rows together by brand_id and sum the scores (334 resulting rows).
- "Sort": sort our grouped rows by sum(score) DESC using a top-N heapsort.
- "Limit": limit our results from 334 rows to 10.
Sidenote
Quicksort is in-place, so it makes sense that Postgres chose it for 31k rows. A top-N heapsort is a sort where you only keep the top N elements, which is significantly cheaper. It only makes sense if you apply a limit after your sort.
Can we do better? Sure! It seems like the slow part is getting entries out of the index: we read ~118k buffers for the index scan but only ~31k buffers to actually fetch the data.
Why does the index need to read so many blocks? Well, it's because of the shape of our index. Our index is on (start_date, domain_id, creative_id, brand_id). This means that if you laid the index entries out in order, they would look like this:
('2017-01-01', 0, 0, 0)
... ~1000 more entries ...
('2017-01-01', 1, 0, 0)
... ~1M more entries
('2017-01-02', 0, 0, 0)
So in every 1M index entries, only 1k of them are relevant to our query at hand. Thus, we can assume that we have to read a lot of different index blocks to collect our heap.
What happens if we make a new index organized by (domain_id, start_date)? Then our index blocks are significantly closer together and our b-tree doesn't have to make keys for creative_id/brand_id.
create index on brand_domain_creative(domain_id, start_date);
-- Presentation only
update pg_index set indisvalid = true where indexrelid = 'brand_domain_creative_domain_id_start_date_idx'::regclass;
CREATE INDEX
Time: 91935.165 ms
select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
brand_id | sum
----------+------------------
319 | 54501.5292000026
29 | 53102.4147509597
279 | 52780.3690684959
311 | 52639.5615208894
318 | 52623.2881098986
44 | 52072.1511933953
278 | 51771.6338969767
35 | 51655.6670391001
115 | 51610.0491145626
200 | 51545.3661591746
(10 rows)
Time: 152.984 ms
Great Neptune's trident, that was fast! Let's see how things changed:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=92930.19..92930.22 rows=10 width=12) (actual time=154.177..154.180 rows=10 loops=1)
Output: brand_id, (sum(score))
Buffers: shared read=31088
-> Sort (cost=92930.19..92931.03 rows=334 width=12) (actual time=154.175..154.176 rows=10 loops=1)
Output: brand_id, (sum(score))
Sort Key: (sum(brand_domain_creative.score)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared read=31088
-> HashAggregate (cost=92919.64..92922.98 rows=334 width=12) (actual time=154.047..154.111 rows=334 loops=1)
Output: brand_id, sum(score)
Group Key: brand_domain_creative.brand_id
Buffers: shared read=31088
-> Bitmap Heap Scan on public.brand_domain_creative (cost=714.39..92769.90 rows=29947 width=12) (actual time=15.778..145.068 rows=31000 loops=1)
Output: start_date, brand_id, domain_id, creative_id, score
Recheck Cond: ((brand_domain_creative.domain_id = 0) AND (brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date))
Heap Blocks: exact=31000
Buffers: shared read=31088
-> Bitmap Index Scan on brand_domain_creative_domain_id_start_date_idx (cost=0.00..706.90 rows=29947 width=0) (actual time=9.210..9.210 rows=31000 loops=1)
Index Cond: ((brand_domain_creative.domain_id = 0) AND (brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date))
Buffers: shared read=88
Planning time: 0.145 ms
Execution time: 154.279 ms
(22 rows)
As expected, we got our rows out significantly faster (144ms). Interestingly, the DB switched from a GroupAggregate to a HashAggregate even though the only step that should have been affected was the index scan. Databases are mysterious beasts. In this case, it bought us 2ms. Huzzah!
Sidenote
Another common reason for slow bitmap index scans is a lack of vacuuming. By default, Postgres keeps old versions of rows around for MVCC (multi-version concurrency control), and they can linger in your index as well. Vacuum frequently, kids.
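For reference, a manual pass that also refreshes the planner statistics looks like this (autovacuum normally handles this for you):
vacuum analyze brand_domain_creative;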
Performance tuning
Let's try another query: what are brands that have appeared on the most domains?
select brand_id, count(distinct domain_id)
from brand_domain_creative
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
Yikes. This is going to be slow again. Our indices can only eliminate half of our dataset. What can we do?
One solution is to add a (brand_id, start_date, domain_id) index. Maybe this way, Postgres won't need the actual rows to perform the query:
create index on brand_domain_creative(brand_id, start_date, domain_id);
-- Presentation only
update pg_index set indisvalid = true where indexrelid = 'brand_domain_creative_brand_id_start_date_domain_id_idx'::regclass;
Programmer uses index! It's not very effective!
explain (analyze, buffers, verbose) select brand_id, count(distinct domain_id)
from brand_domain_creative
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=6630126.72..6630126.75 rows=10 width=12) (actual time=72016.241..72016.243 rows=10 loops=1)
Output: brand_id, (count(DISTINCT domain_id))
Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
-> Sort (cost=6630126.72..6630127.56 rows=334 width=12) (actual time=72016.239..72016.240 rows=10 loops=1)
Output: brand_id, (count(DISTINCT domain_id))
Sort Key: (count(DISTINCT brand_domain_creative.domain_id)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
-> GroupAggregate (cost=6398171.16..6630119.50 rows=334 width=12) (actual time=59799.011..72015.292 rows=334 loops=1)
Output: brand_id, count(DISTINCT domain_id)
Group Key: brand_domain_creative.brand_id
Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
-> Sort (cost=6398171.16..6475486.16 rows=30926000 width=8) (actual time=59762.502..65814.582 rows=31000000 loops=1)
Output: brand_id, domain_id
Sort Key: brand_domain_creative.brand_id
Sort Method: external merge Disk: 545432kB
Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
-> Seq Scan on public.brand_domain_creative (cost=0.00..1282166.00 rows=30926000 width=8) (actual time=0.016..28675.620 rows=31000000 loops=1)
Output: brand_id, domain_id
Filter: ((brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date))
Rows Removed by Filter: 29000000
Buffers: shared hit=16284 read=365882
Planning time: 0.659 ms
Execution time: 72123.210 ms
Whoa, wtf? Why is it doing a sequential scan on the rows? Even an analyze doesn't change this. If it used the index, it would only have to walk about 31 (days) * 334 (brands) * 1000 (domains) = 10.354M distinct index keys. That's roughly 6 times fewer than scanning all 60M rows!
Well, the difference is that index disk access is random, whereas the sequential scan is, well, sequential. The optimizer gives random reads an estimated random_page_cost of 4 by default. And keep in mind that reading one index block involves reading a number of other index blocks, because that's how b-trees work.
But wait! My computer has an SSD! Shouldn't random and sequential reads be weighted the same? Well, you can tune that in your Postgres config by changing random_page_cost to 1 :). Or, you can do it temporarily in your session:
set random_page_cost=1;
explain select brand_id, count(distinct domain_id)
from brand_domain_creative
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Limit (cost=6086504.22..6086504.25 rows=10 width=12)
-> Sort (cost=6086504.22..6086505.06 rows=334 width=12)
Sort Key: (count(DISTINCT domain_id)) DESC
-> GroupAggregate (cost=5854548.66..6086497.00 rows=334 width=12)
Group Key: brand_id
-> Sort (cost=5854548.66..5931863.66 rows=30926000 width=8)
Sort Key: brand_id
-> Seq Scan on brand_domain_creative (cost=0.00..1282166.00 rows=30926000 width=8)
Filter: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date))
(9 rows)
Wtf, Rodrigo? You lied! It's still doing a sequential scan!
Well, as a hack, if you set seq_page_cost to 9999, you can see what its index plan would look like:
set seq_page_cost=9999;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1118169096.56..1118169096.59 rows=10 width=12)
-> Sort (cost=1118169096.56..1118169097.40 rows=334 width=12)
Sort Key: (count(DISTINCT domain_id)) DESC
-> GroupAggregate (cost=0.56..1118169089.34 rows=334 width=12)
Group Key: brand_id
-> Index Only Scan using brand_domain_creative_brand_id_start_date_domain_id_idx on brand_domain_creative (cost=0.56..1118014456.00 rows=30926000 width=8)
Index Cond: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date))
(7 rows)
Huh, so the database doesn't have a way to combine the index scan with the GroupAggregate without visiting every matching entry, so it's forced to scan roughly 31M index entries! Maybe there's a good reason for that: database programming is hard because there are a ton of corner cases.
If you want to research it, pull requests are welcome ;).
Pre-aggregation
First, let's set our seq_page_cost back to normal:
set seq_page_cost=1;
So how can we make the above query faster? Well, if it's determined to do a sequential scan, we can simply give it fewer rows! In this query, we don't need the creative column. So what if we removed it and rolled all the creative scores up into brands?
create table brand_domain(
start_date date not null,
brand_id integer not null,
domain_id integer not null,
score float not null
);
insert into brand_domain(start_date, brand_id, domain_id, score)
select
start_date,
brand_id,
domain_id,
sum(score)
from
brand_domain_creative
group by
start_date, brand_id, domain_id
order by
start_date, brand_id, domain_id;
select brand_id, count(distinct domain_id)
from brand_domain
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
brand_id | count
----------+-------
1 | 1000
2 | 1000
3 | 1000
4 | 1000
5 | 1000
6 | 1000
7 | 1000
8 | 1000
9 | 1000
0 | 1000
(10 rows)
Time: 10877.772 ms
The contents of this table are ~3x smaller (one row per brand instead of one per creative), so it makes sense that our query got a lot faster.
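You can sanity-check the size difference yourself (exact numbers will vary by machine and row overhead):
select pg_size_pretty(pg_total_relation_size('brand_domain_creative')) as brand_domain_creative_size,
       pg_size_pretty(pg_total_relation_size('brand_domain')) as brand_domain_size;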
And what happens if we add an index?
create index on brand_domain(brand_id, start_date, domain_id);
explain (analyze, buffers, verbose) select brand_id, count(distinct domain_id)
from brand_domain
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1434911.01..1434911.04 rows=10 width=12) (actual time=8470.640..8470.642 rows=10 loops=1)
Output: brand_id, (count(DISTINCT domain_id))
Buffers: shared hit=10288 read=142735
-> Sort (cost=1434911.01..1434911.85 rows=334 width=12) (actual time=8470.639..8470.641 rows=10 loops=1)
Output: brand_id, (count(DISTINCT domain_id))
Sort Key: (count(DISTINCT brand_domain.domain_id)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10288 read=142735
-> GroupAggregate (cost=0.56..1434903.80 rows=334 width=12) (actual time=25.915..8469.810 rows=334 loops=1)
Output: brand_id, count(DISTINCT domain_id)
Group Key: brand_domain.brand_id
Buffers: shared hit=10288 read=142735
-> Index Only Scan using brand_domain_brand_id_start_date_domain_id_idx on public.brand_domain (cost=0.56..1383494.52 rows=10281188 width=8) (actual time=1.181..6603.681 rows=10354000 loops=1)
Output: brand_id, start_date, domain_id
Index Cond: ((brand_domain.start_date >= '2017-01-01'::date) AND (brand_domain.start_date <= '2017-01-31'::date))
Heap Fetches: 10354000
Buffers: shared hit=10288 read=142735
Planning time: 0.112 ms
Execution time: 8470.682 ms
(19 rows)
Time: 8472.057 ms
Well, now we actually get our ~10M-entry index-only scan :).
But here's one more optimization we can do! We reduced brand_domain_creative to brand_domain, and in our business logic we frequently run month-long queries. What happens if we roll start_date up to the nearest month?
create table monthly_brand_domain(
start_date date not null,
brand_id integer not null,
domain_id integer not null,
score float not null
);
create unique index on monthly_brand_domain(start_date, brand_id, domain_id);
insert into monthly_brand_domain(
start_date, brand_id, domain_id, score
) select
date_trunc('month', start_date)::date,
brand_id,
domain_id,
sum(score)
from
brand_domain
group by
date_trunc('month', start_date)::date, brand_id, domain_id
order by
date_trunc('month', start_date)::date, brand_id, domain_id;
Now, this query turns into:
select brand_id, count(distinct domain_id)
from monthly_brand_domain
where start_date = '2017-01-01'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
brand_id | count
----------+-------
1 | 1000
2 | 1000
3 | 1000
4 | 1000
5 | 1000
6 | 1000
7 | 1000
8 | 1000
9 | 1000
0 | 1000
(10 rows)
Time: 93.158 ms
Woosh! The moral of the story here is, you can learn to be a database whisperer, but normally the simplest approach (pre-compute as much as you can) is the right answer.
However, every neat trick in CS comes at a cost, and pre-aggregation is no exception. Specifically:
- Disk space
- Compute time
- Tables that drift out of sync (one way to manage this is sketched below)
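That last one is the painful one. As a hedged sketch (assuming new rows only ever arrive for recent dates, which is an assumption about your pipeline, not something from this tutorial), you can periodically rebuild a trailing window of the rollup in one transaction. The date '2017-02-27' below just stands in for "a few days ago" in our mock data:
-- Sketch only: rebuild the tail end of the rollup so it picks up late-arriving rows
begin;
delete from brand_domain
where start_date >= '2017-02-27';
insert into brand_domain(start_date, brand_id, domain_id, score)
select start_date, brand_id, domain_id, sum(score)
from brand_domain_creative
where start_date >= '2017-02-27'
group by start_date, brand_id, domain_id;
commit;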
One last thing: what if you have a query where you want to see the brands with the most domains from '2017-01-25' to '2017-02-28'? In this case, the optimal query involves getting the daily rows from brand_domain and the monthly rows from monthly_brand_domain:
select s.brand_id, count(distinct s.domain_id)
from (
select brand_id, domain_id
from brand_domain
where start_date >= '2017-01-25'
and start_date <= '2017-01-31'
union all
select brand_id, domain_id
from monthly_brand_domain
where start_date = '2017-02-01'
) s
group by brand_id
order by count(distinct s.domain_id) desc
limit 10;
brand_id | count
----------+-------
0 | 1000
1 | 1000
2 | 1000
3 | 1000
4 | 1000
5 | 1000
6 | 1000
7 | 1000
8 | 1000
9 | 1000
(10 rows)
Time: 2283.934 ms
Pretty freakin' fast. In Moat Pro, we have every cut imaginable, from brand to monthly_creative_hostname to alltime_tag_brand_creative. We have a query engine that chooses the right cuts and aggregates their results together.