Postgres 102 - Explains and Advanced Features

In this tutorial, we’re going to cover some good-to-know Postgres features and tricks for query optimization. The only prerequisite is basic SQL and having Docker/Postgres installed.

Setting up our environment

Let’s start with a little docker container running Postgres:

$ docker run --name pg1 -d -P postgres
37cd118ecd727522bd904bffb672956c070dc7a5bc1a207a61de4b628a653b3b
# Find the port
$ docker port pg1
5432/tcp -> 0.0.0.0:32769
# Assuming you have the psql client installed, connect to it:
$ psql -h localhost -p 32769 -U postgres postgres

Adding mock data

First things first, let’s make some mock data for us to play around with. We’re going to make a table of domains (e.g. espn.com) and domain categories (e.g. sports publishers):

create table domain(id serial primary key, url text);
create table domain_category(id serial primary key);

And we’re going to insert some data:

insert into domain select id from generate_series(0, 999) id;
insert into domain_category select id from generate_series(0, 49) id;

Sidenote
generate_series is a wonderful little function that does exactly what you think it does. It’s fantastic for making mock data.

Next, let’s make a table that connects the two together:

create table domain_category_domain(
id serial primary key,
domain_category_id integer not null references domain_category(id),
domain_id integer not null references domain(id)
);
insert into domain_category_domain(domain_category_id, domain_id)
select domain_category_id, domain_id
from generate_series(0, 49) as domain_category_id,
generate_series(0, 999) as domain_id
where random() < 0.02;

Now it’s very easy to get a list of domains in any category:

select count(domain_id)
from domain_category_domain
where domain_category_id = 0;
count
-------
23
(1 row)
Time: 1.526 ms

Now let’s make things a little more complicated. Let’s add parent categories so we can build a category hierarchy. Imagine a “Sports” category that contains “Football” and “Basketball” categories. To do this, we need a table that defines the parent-child relationships between categories:

create table domain_category_relation(
id serial primary key,
parent_id integer references domain_category(id) not null,
child_id integer references domain_category(id) not null
);
insert into domain_category(id)
select id
from generate_series(50, 54) as id;
insert into domain_category_relation(child_id, parent_id)
select child_id, (child_id / 10) + 50
from generate_series(0, 49) as child_id;
select * from domain_category_relation where parent_id=50;
postgres=# select * from domain_category_relation where parent_id=50;
id | parent_id | child_id
----+-----------+----------
1 | 50 | 0
2 | 50 | 1
3 | 50 | 2
4 | 50 | 3
5 | 50 | 4
6 | 50 | 5
7 | 50 | 6
8 | 50 | 7
9 | 50 | 8
10 | 50 | 9
(10 rows)

Okay, so far so good. How can we get a list of domains in parent categories? One option is to do a simple join:

select dcr.parent_id, dcd.domain_id
from domain_category_relation dcr
join domain_category_domain dcd
on dcd.domain_category_id = dcr.child_id
where dcr.parent_id=50;

However, what if we want the domains of a category and we don’t know whether it’s a parent category or not? One possible solution is to run both queries and union the results together:

select dcr.parent_id, dcd.domain_id
from domain_category_relation dcr
join domain_category_domain dcd
on dcd.domain_category_id = dcr.child_id
where dcr.parent_id=50
union
select domain_category_id, domain_id
from domain_category_domain
where domain_category_id=50;

But A) this is super ugly and B) it stops working if our parent categories get parent categories of their own. In other words, we are only solving a graph that’s two layers deep. A better solution is to use something called a “recursive common table expression”. But first, you should understand a normal “common table expression” (CTE):

with cached_result as (
select 1
) select *
from cached_result;

The with syntax here is the CTE. It’s a useful tool for caching subqueries, and I often find CTEs much cleaner than actual subqueries because you can give them names!

A recursive CTE is slightly different because it allows a CTE to iterate on itself:

with recursive flattened_domain_category_domain (domain_category_id, domain_id) as (
select domain_category_id, domain_id
from domain_category_domain
union
select dcr.parent_id, fdcd.domain_id
from domain_category_relation dcr
join flattened_domain_category_domain fdcd
on fdcd.domain_category_id = dcr.child_id
) select domain_category_id, domain_id from flattened_domain_category_domain;

Now our join will work no matter how many layers we have!

However, it’s quite a bit of work to write this out every time. If we were using an ORM, we’d be reading a lot of documentation to get this syntax down. To avoid this, we can write a view:

create view flattened_domain_category_domain_view(domain_category_id, domain_id) as
with recursive flattened_domain_category_domain(domain_category_id, domain_id) as (
select domain_category_id, domain_id
from domain_category_domain
union
select dcr.parent_id, fdcd.domain_id
from domain_category_relation dcr
join flattened_domain_category_domain fdcd
on fdcd.domain_category_id = dcr.child_id
) select domain_category_id, domain_id from flattened_domain_category_domain;
select count(domain_id)
from flattened_domain_category_domain_view
where domain_category_id = 50;
count
-------
198
(1 row)
Time: 6.328 ms

The database will pretend this is a table and will even join other tables to it! However, be very careful with views: as you add filters and joins, the query planner can get very confused, as we’ll see later.

If the data in your view doesn’t change very often, one common tool is a materialized view. Mviews, as they’re commonly called, allow you to cache the results of the view and only refresh them manually:

create materialized view flattened_domain_category_domain_mv(domain_category_id, domain_id) as
with recursive flattened_domain_category_domain(domain_category_id, domain_id) as (
select domain_category_id, domain_id
from domain_category_domain
union
select dcr.parent_id, fdcd.domain_id
from domain_category_relation dcr
join flattened_domain_category_domain fdcd
on fdcd.domain_category_id = dcr.child_id
) select domain_category_id, domain_id from flattened_domain_category_domain;
-- Yes, you can add indices to make them really fast, like a normal table!
create unique index on flattened_domain_category_domain_mv(domain_category_id, domain_id);
create index on flattened_domain_category_domain_mv(domain_id);
refresh materialized view flattened_domain_category_domain_mv;

Keep in mind though: refresh materialized view will block reads. If you add a unique index to your mview (as you should), you can use refresh materialized view concurrently, which will refresh your mview without blocking reads.
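
For example, with the unique index we created above in place, a non-blocking refresh is a one-liner:

-- Safe to run while readers are still querying the mview
refresh materialized view concurrently flattened_domain_category_domain_mv;

The trade-off is that a concurrent refresh is slower, since Postgres computes the new result set and then diffs it against the old one.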

Foreign data wrappers

When you have a lot of data, it’s common to split your tables between multiple databases. To simulate this, let’s create another docker instance. This time, we’ll add a “link” so the second docker instance can network to the first.

# Note the `link` option this time
$ docker run --name pg2 -d -P --link pg1:pg1 postgres
679d635ad021d3744fbd07c27f3120778484e0fb93d8632d486d207b0a348e5c
$ docker port pg2
5432/tcp -> 0.0.0.0:32773
$ psql -h localhost -p 32773 -U postgres postgres

We’ll be using a Postgres extension called postgres_fdw that allows you to communicate with other Postgres instances. There are a lot of cool Postgres extensions out there: they range from new data types to foreign data wrappers to entirely new storage engines and index types.

-- Presentation only
-- Disable indices for now
update pg_index set indisvalid = false where indexrelid = 'brand_domain_creative_start_date_domain_id_creative_id_bran_idx'::regclass;
update pg_index set indisvalid = false where indexrelid = 'brand_domain_creative_domain_id_start_date_idx'::regclass;
update pg_index set indisvalid = false where indexrelid = 'brand_domain_creative_start_date_brand_id_domain_id_idx'::regclass;
create extension postgres_fdw;
create server pg1 foreign data wrapper postgres_fdw options (host 'pg1', port '5432', dbname 'postgres');
create user mapping for postgres server pg1 options (user 'postgres');
create foreign table flattened_domain_category_domain (
domain_category_id integer not null, domain_id integer not null
) server pg1 options (schema_name 'public', table_name 'flattened_domain_category_domain_mv');
select count(domain_id)
from flattened_domain_category_domain
where domain_category_id = 50;
count
-------
198
(1 row)
Time: 3.517 ms

There are also foreign data wrappers for MySQL, Redis, Dynamo, you name it.

Optimizing queries

Okay, now the fun stuff :). Let’s create a brand_domain_creative table. We use something more-or-less shaped like this in Moat Pro; it tells us the estimated impressions (score) a creative had on a certain day on a certain domain.

create table brand_domain_creative(
start_date date not null,
brand_id integer not null,
domain_id integer not null,
creative_id integer not null,
score float not null
);

Neat! Next we’re going to fill it with ~60M rows of simulated data. This may take a short while.

insert into brand_domain_creative(
start_date, brand_id, domain_id, creative_id, score
)
select
'2017-01-01'::date + date_offset,
(creative_id / 3) :: integer,
domain_id,
creative_id,
random() * 1000
from
generate_series(0, 59) as date_offset,
generate_series(0, 999) as domain_id,
generate_series(0, 999) as creative_id;
INSERT 0 60000000
Time: 140552.498 ms

Now we can do queries like “what are the top 10 brands in January for domain 0?”

select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;

Yikes! Never mind. That’s taking way too long. Ctrl+C to interrupt the query and get out of there.

To see what happened, we can use explain. It’ll output the database’s execution plan for the query:

explain select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1432263.79..1432263.81 rows=10 width=12)
-> Sort (cost=1432263.79..1432264.29 rows=200 width=12)
Sort Key: (sum(score)) DESC
-> GroupAggregate (cost=1432246.22..1432259.47 rows=200 width=12)
Group Key: brand_id
-> Sort (cost=1432246.22..1432249.97 rows=1500 width=12)
Sort Key: brand_id
-> Seq Scan on brand_domain_creative (cost=0.00..1432167.09 rows=1500 width=12)
Filter: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date) AND (domain_id = 0))
(9 rows)

You’ll notice that each step has a cost estimate. Postgres keeps statistics about your tables so it can estimate how long each step will take and intelligently choose the optimal strategy. In this case, the statistics are a wee bit off: it thinks the domain will have ~1500 rows in this date range when, in actuality, it’s around 31k. We can tell it to re-analyze the table’s contents with an analyze command:

analyze brand_domain_creative;

Now our explain looks something like this:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1434628.86..1434628.88 rows=10 width=12)
-> Sort (cost=1434628.86..1434629.69 rows=334 width=12)
Sort Key: (sum(score)) DESC
-> GroupAggregate (cost=1434393.70..1434621.64 rows=334 width=12)
Group Key: brand_id
-> Sort (cost=1434393.70..1434468.57 rows=29947 width=12)
Sort Key: brand_id
-> Seq Scan on brand_domain_creative (cost=0.00..1432167.12 rows=29947 width=12)
Filter: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date) AND (domain_id = 0))
(9 rows)

Even with more accurate statistics, the database doesn’t have any other option. To execute this query, it needs to scan every single row. How can we give it a shortcut? An index, of course!

create index on brand_domain_creative(start_date, domain_id, creative_id, brand_id);
-- Presentation only
update pg_index set indisvalid = true where indexrelid = 'brand_domain_creative_start_date_domain_id_creative_id_bran_idx'::regclass;
CREATE INDEX
Time: 155261.995 ms

What is an index? Essentially, think of it as the index at the back of a very large book. If you want to find each page that mentions the word aardvark, it’s much faster to look up the entry in that index than to read every page.

By default, Postgres indices are b-trees under the hood because they’re very versatile. However, you can choose other index types if you know what you’re doing.
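
As a sketch of what that could look like (not something this tutorial’s queries need), here is a BRIN index, which stores one tiny summary per range of table blocks and can work well for columns like start_date that correlate with insert order:

-- BRIN: very small index, but only effective when values correlate with physical row order
create index on brand_domain_creative using brin (start_date);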

Building this index took a long time because the computer had to go through every single data point. When you think about it, two and a half minutes to organize 60M data points sounds pretty great. Dang, computers are cool.


Sidenote
Why are they called B-trees? Rudolf Bayer and Ed McCreight invented it while working at Boeing, so it could stand for “Boeing”, “Bayer”, or even “balance” tree. McCreight says they couldn’t decide between the options. They couldn’t name it Boeing without lawyers being involved, so the company missed out on some great free advertising.

Now let’s try this query again!

select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
brand_id | sum
----------+------------------
319 | 54501.5292000026
29 | 53102.4147509597
279 | 52780.3690684959
311 | 52639.5615208894
318 | 52623.2881098986
44 | 52072.1511933953
278 | 51771.6338969767
35 | 51655.6670391001
115 | 51610.0491145626
200 | 51545.3661591746
(10 rows)
Time: 17177.307 ms

That was significantly faster. And it’ll be faster if you run it again because of caching:

brand_id | sum
----------+------------------
319 | 54501.5292000026
29 | 53102.4147509597
279 | 52780.3690684959
311 | 52639.5615208894
318 | 52623.2881098986
44 | 52072.1511933953
278 | 51771.6338969767
35 | 51655.6670391001
115 | 51610.0491145626
200 | 51545.3661591746
(10 rows)
Time: 1362.534 ms

Caching? How does caching work? Well, the operating system’s page cache plays the biggest role here, but Postgres also keeps its own caches (most notably shared_buffers) for recent information, including:

  • Table data
  • Indexes
  • Query execution plans

Keeping all of this consistent while writes are coming in is some serious voodoo magic, so if you see a Postgres contributor, buy them a beer.
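
You can check how much memory is set aside for that cache with a one-liner:

show shared_buffers;

The default is a conservative 128MB, so this is one of the first settings people bump on a dedicated database server.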

Let’s see why our query was faster via an explain analyze. explain analyze is like an explain, but it also runs the query to give you more information. verbose and buffers add debug information about each step.

explain (analyze, buffers, verbose) select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=957420.30..957420.33 rows=10 width=12) (actual time=1441.634..1441.636 rows=10 loops=1)
Output: brand_id, (sum(score))
Buffers: shared read=149777
-> Sort (cost=957420.30..957421.14 rows=334 width=12) (actual time=1441.632..1441.632 rows=10 loops=1)
Output: brand_id, (sum(score))
Sort Key: (sum(brand_domain_creative.score)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared read=149777
-> GroupAggregate (cost=957185.14..957413.08 rows=334 width=12) (actual time=1435.458..1441.548 rows=334 loops=1)
Output: brand_id, sum(score)
Group Key: brand_domain_creative.brand_id
Buffers: shared read=149777
-> Sort (cost=957185.14..957260.01 rows=29947 width=12) (actual time=1435.431..1437.536 rows=31000 loops=1)
Output: brand_id, score
Sort Key: brand_domain_creative.brand_id
Sort Method: quicksort Memory: 2222kB
Buffers: shared read=149777
-> Bitmap Heap Scan on public.brand_domain_creative (cost=862903.05..954958.56 rows=29947 width=12) (actual time=1277.881..1430.520 rows=31000 loops=1)
Output: brand_id, score
Recheck Cond: ((brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date) AND (brand_domain_creative.domain_id = 0))
Heap Blocks: exact=31000
Buffers: shared read=149777
-> Bitmap Index Scan on brand_domain_creative_start_date_domain_id_creative_id_bran_idx (cost=0.00..862895.56 rows=29947 width=0) (actual time=1272.282..1272.282 rows=31000 loops=1)
Index Cond: ((brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date) AND (brand_domain_creative.domain_id = 0))
Buffers: shared read=118777
Planning time: 0.225 ms
Execution time: 1441.689 ms
(27 rows)

Let’s interpret this explain from the inside out.

  • “Bitmap index scan”: our index is large enough to take up many blocks. Postgres walks the index entries that match our conditions and, instead of fetching each row immediately, records which heap blocks contain matches. The resulting array of booleans is our bitmap. This took ~1.27s.

  • “Bitmap heap scan”: armed with our bitmap, we read the matching heap blocks in physical order and re-check the conditions. This took almost no time at ~0.16s and resulted in 31k rows.

  • “Sort”: Looks like Postgres is sorting the rows with quicksort to make it easier to…

  • “GroupAggregate”: Group the rows together by brand_id and sum the scores (334 resulting rows).

  • “Sort”: Sort our grouped rows based on sum(score) DESC using top-N heapsort

  • “Limit”: trim our results from 334 grouped rows down to 10.


Sidenote
Quicksort is in-place, so it makes sense they chose that for 31k rows. Top-N heapsort is a sort where you only keep the Top-N, which is significantly less complex. It only makes sense if you do a limit after your sort.

Can we do better? Sure! Seems like the slow part is getting stuff from the index. We have to read 118k buffers here and only 31k buffers to actually get the data (gee, I’m starting to suspect our buffers are exactly 10k rows each).

Why does the index need to read so many blocks? Well, it’s because of the shape of our index: (start_date, domain_id, creative_id, brand_id). This means that if you laid the index entries out in order, they would look like this:

('2017-01-01', 0, 0, 0)
... ~1000 more entries ...
('2017-01-01', 1, 0, 0)
... ~1M more entries
('2017-01-02', 0, 0, 0)

So in every 1M index entries, only 1k of them are relevant to our query at hand. Thus, we can assume that we have to read a lot of different index blocks to collect our heap.

What happens if we make a new index organized by (domain_id, start_date)? Then our index blocks are significantly closer together and our b-tree doesn’t have to make keys for creative_id/brand_id.

create index on brand_domain_creative(domain_id, start_date);
-- Presentation only
update pg_index set indisvalid = true where indexrelid = 'brand_domain_creative_domain_id_start_date_idx'::regclass;
CREATE INDEX
Time: 91935.165 ms
select brand_id, sum(score)
from brand_domain_creative
where domain_id = 0
and start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by sum(score) desc
limit 10;
brand_id | sum
----------+------------------
319 | 54501.5292000026
29 | 53102.4147509597
279 | 52780.3690684959
311 | 52639.5615208894
318 | 52623.2881098986
44 | 52072.1511933953
278 | 51771.6338969767
35 | 51655.6670391001
115 | 51610.0491145626
200 | 51545.3661591746
(10 rows)
Time: 152.984 ms

Great Neptune’s trident, that was fast! Let’s see how things changed:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=92930.19..92930.22 rows=10 width=12) (actual time=154.177..154.180 rows=10 loops=1)
Output: brand_id, (sum(score))
Buffers: shared read=31088
-> Sort (cost=92930.19..92931.03 rows=334 width=12) (actual time=154.175..154.176 rows=10 loops=1)
Output: brand_id, (sum(score))
Sort Key: (sum(brand_domain_creative.score)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared read=31088
-> HashAggregate (cost=92919.64..92922.98 rows=334 width=12) (actual time=154.047..154.111 rows=334 loops=1)
Output: brand_id, sum(score)
Group Key: brand_domain_creative.brand_id
Buffers: shared read=31088
-> Bitmap Heap Scan on public.brand_domain_creative (cost=714.39..92769.90 rows=29947 width=12) (actual time=15.778..145.068 rows=31000 loops=1)
Output: start_date, brand_id, domain_id, creative_id, score
Recheck Cond: ((brand_domain_creative.domain_id = 0) AND (brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date))
Heap Blocks: exact=31000
Buffers: shared read=31088
-> Bitmap Index Scan on brand_domain_creative_domain_id_start_date_idx (cost=0.00..706.90 rows=29947 width=0) (actual time=9.210..9.210 rows=31000 loops=1)
Index Cond: ((brand_domain_creative.domain_id = 0) AND (brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date))
Buffers: shared read=88
Planning time: 0.145 ms
Execution time: 154.279 ms
(22 rows)

As expected, we got our rows out significantly faster (144ms). Interestingly, the DB switched from a GroupAggregate to a HashAggregate even though the only step that should have been affected was the index scan. Databases are mysterious beasts. In this case, it bought us 2ms. Huzzah!


Sidenote
Another common reason for slow Bitmap Index Scans is a lack of vacuuming. By default, Postgres keeps old versions of rows around for MVCC (multi-version concurrency control) and they can linger in your indexes as well. Vacuum frequently, kids.
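
A manual pass over our big table is a single statement (autovacuum usually takes care of this, but it can fall behind on write-heavy tables):

-- Reclaims dead row versions and refreshes the planner's statistics
vacuum analyze brand_domain_creative;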

Performance tuning

Let’s try another query: which brands have appeared on the most domains?

select brand_id, count(distinct domain_id)
from brand_domain_creative
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;

Yikes. This is going to be slow again. Our indices can only eliminate half of our dataset. What can we do?

One solution is to add a (brand_id, start_date, domain_id) index. Maybe this way, Postgres won’t need the actual rows to answer the query:

create index on brand_domain_creative(brand_id, start_date, domain_id);
-- Presentation only
update pg_index set indisvalid = true where indexrelid = 'brand_domain_creative_brand_id_start_date_domain_id_idx'::regclass;

Programmer uses index! It’s not very effective!

explain (analyze, buffers, verbose) select brand_id, count(distinct domain_id)
from brand_domain_creative
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=6630126.72..6630126.75 rows=10 width=12) (actual time=72016.241..72016.243 rows=10 loops=1)
Output: brand_id, (count(DISTINCT domain_id))
Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
-> Sort (cost=6630126.72..6630127.56 rows=334 width=12) (actual time=72016.239..72016.240 rows=10 loops=1)
Output: brand_id, (count(DISTINCT domain_id))
Sort Key: (count(DISTINCT brand_domain_creative.domain_id)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
-> GroupAggregate (cost=6398171.16..6630119.50 rows=334 width=12) (actual time=59799.011..72015.292 rows=334 loops=1)
Output: brand_id, count(DISTINCT domain_id)
Group Key: brand_domain_creative.brand_id
Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
-> Sort (cost=6398171.16..6475486.16 rows=30926000 width=8) (actual time=59762.502..65814.582 rows=31000000 loops=1)
Output: brand_id, domain_id
Sort Key: brand_domain_creative.brand_id
Sort Method: external merge Disk: 545432kB
Buffers: shared hit=16284 read=365882, temp read=199023 written=199023
-> Seq Scan on public.brand_domain_creative (cost=0.00..1282166.00 rows=30926000 width=8) (actual time=0.016..28675.620 rows=31000000 loops=1)
Output: brand_id, domain_id
Filter: ((brand_domain_creative.start_date >= '2017-01-01'::date) AND (brand_domain_creative.start_date <= '2017-01-31'::date))
Rows Removed by Filter: 29000000
Buffers: shared hit=16284 read=365882
Planning time: 0.659 ms
Execution time: 72123.210 ms

Whoa, wtf? Why is it doing a sequential scan on the rows? Even an analyze doesn’t change this. If it used the index, it would go through 31 (days) * 334 (brands) * 1000 (domains) = 10.354M index entries. That’s roughly 6 times fewer than going through all 60M rows!

Well, the difference is that index access is random disk I/O whereas the sequential scan is, well, sequential. The optimizer weights random page reads with random_page_cost, which defaults to “4” (versus 1 for sequential reads). And keep in mind that reaching one index leaf block involves reading the index blocks above it, because that’s how b-trees work.

But wait! My computer has an SSD! Shouldn’t they be weighted the same? Well, you can tune that in your Postgres config by changing random_page_cost to 1 :). Or, you can do it temporarily in your session:

set random_page_cost=1;
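
As an aside, set only lasts for your current session. If you decide the new value matches your hardware, you can persist it (on Postgres 9.4+) without editing postgresql.conf by hand, roughly:

alter system set random_page_cost = 1;
select pg_reload_conf();
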
explain select brand_id, count(distinct domain_id)
from brand_domain_creative
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Limit (cost=6086504.22..6086504.25 rows=10 width=12)
-> Sort (cost=6086504.22..6086505.06 rows=334 width=12)
Sort Key: (count(DISTINCT domain_id)) DESC
-> GroupAggregate (cost=5854548.66..6086497.00 rows=334 width=12)
Group Key: brand_id
-> Sort (cost=5854548.66..5931863.66 rows=30926000 width=8)
Sort Key: brand_id
-> Seq Scan on brand_domain_creative (cost=0.00..1282166.00 rows=30926000 width=8)
Filter: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date))
(9 rows)

Wtf Rodrigo? You lied! It’s still doing a sequential scan!

Well, as a hack, if you set seq_page_cost to 9999, you can see what its index plan would look like:

set seq_page_cost=9999;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1118169096.56..1118169096.59 rows=10 width=12)
-> Sort (cost=1118169096.56..1118169097.40 rows=334 width=12)
Sort Key: (count(DISTINCT domain_id)) DESC
-> GroupAggregate (cost=0.56..1118169089.34 rows=334 width=12)
Group Key: brand_id
-> Index Only Scan using brand_domain_creative_brand_id_start_date_domain_id_idx on brand_domain_creative (cost=0.56..1118014456.00 rows=30926000 width=8)
Index Cond: ((start_date >= '2017-01-01'::date) AND (start_date <= '2017-01-31'::date))
(7 rows)

Huh, so the database doesn’t have a way to walk the index and do the GroupAggregate at the same time, so it’s forced to index scan all 31M matching entries! Maybe there’s a good reason for that: database programming is hard because there are a ton of corner cases.

If you want to research it, pull requests are welcome ;).

Pre-aggregation

First, let’s set our seq_page_cost back to normal:

set seq_page_cost=1;

So how can we make the above query faster? Well, if it’s determined to do a sequential scan, we can simply give it fewer rows! In this query, we don’t need the creative column. So what if we removed it and rolled all the creative scores up into brands?

create table brand_domain(
start_date date not null,
brand_id integer not null,
domain_id integer not null,
score float not null
);
insert into brand_domain(start_date, brand_id, domain_id, score)
select
start_date,
brand_id,
domain_id,
sum(score)
from
brand_domain_creative
group by
start_date, brand_id, domain_id
order by
start_date, brand_id, domain_id;
select brand_id, count(distinct domain_id)
from brand_domain
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
brand_id | count
----------+-------
1 | 1000
2 | 1000
3 | 1000
4 | 1000
5 | 1000
6 | 1000
7 | 1000
8 | 1000
9 | 1000
0 | 1000
(10 rows)
Time: 10877.772 ms

The contents of this table are ~3x smaller! So it makes sense that our query time went down by that much.

And what happens if we add an index?

create index on brand_domain(brand_id, start_date, domain_id);
explain (analyze, buffers, verbose) select brand_id, count(distinct domain_id)
from brand_domain
where start_date >= '2017-01-01'
and start_date <= '2017-01-31'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1434911.01..1434911.04 rows=10 width=12) (actual time=8470.640..8470.642 rows=10 loops=1)
Output: brand_id, (count(DISTINCT domain_id))
Buffers: shared hit=10288 read=142735
-> Sort (cost=1434911.01..1434911.85 rows=334 width=12) (actual time=8470.639..8470.641 rows=10 loops=1)
Output: brand_id, (count(DISTINCT domain_id))
Sort Key: (count(DISTINCT brand_domain.domain_id)) DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=10288 read=142735
-> GroupAggregate (cost=0.56..1434903.80 rows=334 width=12) (actual time=25.915..8469.810 rows=334 loops=1)
Output: brand_id, count(DISTINCT domain_id)
Group Key: brand_domain.brand_id
Buffers: shared hit=10288 read=142735
-> Index Only Scan using brand_domain_brand_id_start_date_domain_id_idx on public.brand_domain (cost=0.56..1383494.52 rows=10281188 width=8) (actual time=1.181..6603.681 rows=10354000 loops=1)
Output: brand_id, start_date, domain_id
Index Cond: ((brand_domain.start_date >= '2017-01-01'::date) AND (brand_domain.start_date <= '2017-01-31'::date))
Heap Fetches: 10354000
Buffers: shared hit=10288 read=142735
Planning time: 0.112 ms
Execution time: 8470.682 ms
(19 rows)
Time: 8472.057 ms

Well, now we get our 10M-row index-only scan :).

But here’s one more optimization we can do! We reduced brand_domain_creative to brand_domain, and in our business logic we frequently do month-long queries. What happens if we roll start_date up to the nearest month?

create table monthly_brand_domain(
start_date date not null,
brand_id integer not null,
domain_id integer not null,
score float not null
);
create unique index on monthly_brand_domain(start_date, brand_id, domain_id);
insert into monthly_brand_domain(
start_date, brand_id, domain_id, score
) select
date_trunc('month', start_date)::date,
brand_id,
domain_id,
sum(score)
from
brand_domain
group by
date_trunc('month', start_date)::date, brand_id, domain_id
order by
date_trunc('month', start_date)::date, brand_id, domain_id;

Now, this query turns into:

select brand_id, count(distinct domain_id)
from monthly_brand_domain
where start_date = '2017-01-01'
group by brand_id
order by count(distinct domain_id) desc
limit 10;
brand_id | count
----------+-------
1 | 1000
2 | 1000
3 | 1000
4 | 1000
5 | 1000
6 | 1000
7 | 1000
8 | 1000
9 | 1000
0 | 1000
(10 rows)
Time: 93.158 ms

Woosh! The moral of the story here is, you can learn to be a database whisperer, but normally the simplest approach (pre-compute as much as you can) is the right answer.

However, every neat trick in CS comes at a cost and pre-aggregation is no exception. Specifically:

  • Disk space

  • Compute time

  • Tables drifting out of sync (one way to patch that up is sketched right after this list)
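
For the last one, here’s a sketch of one way to deal with drift (not necessarily how Moat Pro does it): periodically rebuild the rollup for just the time range that changed.

-- Hypothetical maintenance job: rebuild January's monthly rollup after late data arrives
begin;
delete from monthly_brand_domain where start_date = '2017-01-01';
insert into monthly_brand_domain(start_date, brand_id, domain_id, score)
select date_trunc('month', start_date)::date, brand_id, domain_id, sum(score)
from brand_domain
where start_date >= '2017-01-01' and start_date <= '2017-01-31'
group by 1, 2, 3;
commit;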

One last thing: what if you have a query where you want to see the brands with the most domains from ‘2017-01-25’ to ‘2017-02-28’? In this case, the optimal query involves getting the daily rows from brand_domain and the monthly rows from monthly_brand_domain:

select s.brand_id, count(distinct s.domain_id)
from (
select brand_id, domain_id
from brand_domain
where start_date >= '2017-01-25'
and start_date <= '2017-01-31'
union all
select brand_id, domain_id
from monthly_brand_domain
where start_date = '2017-02-01'
) s
group by s.brand_id
order by count(distinct s.domain_id) desc
limit 10;
brand_id | count
----------+-------
0 | 1000
1 | 1000
2 | 1000
3 | 1000
4 | 1000
5 | 1000
6 | 1000
7 | 1000
8 | 1000
9 | 1000
(10 rows)
Time: 2283.934 ms

Pretty freakin’ fast. In Moat Pro, we have every cut imaginable, from brand to monthly_creative_hostname to alltime_tag_brand_creative. We have a query engine that chooses the right cuts and aggregates their results together.

An Introduction to Linear Regression

In this post, we will be using a lot of Python. All of the code can be found here.

We’ll be using data from R2D3, which contains information about New York City and San Francisco real estate.

Regression

One of the most important fields within data science, regression is about describing data. Simply put, we will try to draw the best line possible through our data.

First, an example. Let’s plot some New York City apartments’ prices by square foot:


Plot of NY Apartments

# c1_plot_ny.py
import csv
import matplotlib.pyplot as plt


def get_data():
    with open('ny_sf_data.csv', 'rb') as csvfile:
        reader = csv.DictReader(csvfile)
        rows = [r for r in reader]
    ny_data = [
        (int(p['sqft']), int(p['price']))
        for p in rows
        if p['in_sf'] == '0'
    ]
    return ny_data


def plot_data(data):
    sqft, price = zip(*data)
    plt.plot(sqft, price, 'ro')
    plt.xlabel('sqft')
    plt.ylabel('price')
    plt.show()


if __name__ == '__main__':
    data = get_data()
    plot_data(data)

As you can see, there’s a bit of bunching up in the lower left corner because we have a bunch of outliers. Let’s say any apartment over $2M or 2000 square feet is an outlier (although I think this is our dataset showing its age…).


Plot of NY Apartments Without Outliers

# c2_plot_ny_wo_outliers.py
from c1_plot_ny import get_data, plot_data


def get_data_wo_outliers():
    ny_data = get_data()
    ny_data_wo_outliers = [
        (s, p) for s, p in ny_data
        if s < 2000 and p < 2000000
    ]
    return ny_data_wo_outliers


if __name__ == '__main__':
    data = get_data_wo_outliers()
    plot_data(data)

Okay, so we can notice a couple things about this data:

  • There’s a minimum number of square feet and price. No one has an apartment smaller than 250 square feet and no one has a price lower than 300K. This makes sense.

  • The data seems to go in an up-and-right direction, so we could more or less draw a line through it naively if we tried.

  • As we go along the axes, the data spreads. There must be other variables at work.

Drawing our line

Let’s try to draw the best straight line we can through the data. This is called linear regression. So our line is going to have the familiar-looking formula:

$$
\begin{align}
h_\theta(x) = \theta_0 + \theta_1 x
\end{align}
$$

Where $h_\theta(x)$ is our hypothesis $h$ of the price given a square footage of $x$. Our goal is to find our $\theta$ s, which define the slope and y-intercept of our line.

So how do we define the best line? Through a cost function, which measures how far off a line is! The line with the least cost is the best line.

Okay, so how do we choose a cost function? Well, there are a lot of different cost functions out there, but least squares error is perhaps the most common:

$$
\begin{align}
J(\theta) = {1 \over 2} \sum_{i=0}^n (h_\theta(x_i) - y_i)^2
\end{align}
$$

Essentially, for each data point $(x_i, y_i)$, we’re simply taking the difference between $y_{i}$ and $h_\theta(x_i)$ and squaring it. Then we’re summing them all together and halving it. Makes sense.

Gradient Descent

Okay, so now the hard part. We have data and a cost function, so now we need a process for reducing cost. To do this, we’ll use gradient descent.

Gradient descent works by constantly updating each $\theta_j$ in the direction that will diminish $J(\theta)$. Or:

$$
\begin{align}
\theta_j := \theta_j - \alpha {\delta \over \delta \theta_j} J(\theta)
\end{align}
$$

Where $\alpha$ is essentially “how far” down the slope you want to go at every step.

Thus, with a little bit of math, we can find the derivative of $J(\theta)$ with respect to $\theta_j$ for an individual data point:

$$
\begin{align}
{\delta \over \delta \theta_j} J(\theta) &= {\delta \over \delta \theta_j} {1 \over 2} (h_\theta(x) - y)^2 \cr
&= 2 \cdot {1 \over 2} \cdot (h_\theta(x) - y) \cdot {\delta \over \delta \theta_j} (h_\theta(x) - y) \cr
&= (h_\theta(x) - y) \cdot {\delta \over \delta \theta_j} (h_\theta(x) - y)
\end{align}
$$

Now things begin to diverge for our $\theta$s:

$$
\begin{align}
{\delta \over \delta \theta_0} J(\theta) &= (h_\theta(x) - y) \cdot {\delta \over \delta \theta_0} (\theta_0 + \theta_1 x - y) \cr
&= (h_\theta(x) - y)
\end{align}
$$

$$
\begin{align}
{\delta \over \delta \theta_1} J(\theta) &= (h_\theta(x) - y) \cdot {\delta \over \delta \theta_1} (\theta_0 + \theta_1 x - y) \cr
&= (h_\theta(x) - y) \cdot x
\end{align}
$$

So now we can apply our derivatives to the original algorithm…

$$
\begin{align}
\theta_0 := \theta_0 + \alpha (y - h_\theta(x)) \cr
\theta_1 := \theta_1 + \alpha (y - h_\theta(x)) \cdot x
\end{align}
$$

However, this only applies to one data point. How do we apply this to multiple data points?

Batch Gradient Descent

The idea behind batch gradient descent is simple: go through your batch and accumulate the derivatives ${\delta \over \delta \theta_j} J(\theta)$ for every data point. Then take the average:

$$
\begin{align}
\theta_0 := \theta_0 + {\alpha \over m} \sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)})) \cr
\theta_1 := \theta_1 + {\alpha \over m} \sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)})) \cdot x^{(i)}
\end{align}
$$

This means that you only change your theta at the end of the batch:

# c3_batch_gradient_descent.py
import sys
import matplotlib.pyplot as plt
from c2_plot_ny_wo_outliers import get_data_wo_outliers

# Try playing with the learning rates!
THETA_0_LEARNING_RATE = 0.1
THETA_1_LEARNING_RATE = 0.1**6


def batch_gradient_descent(data, theta_0, theta_1):
    # End up with:
    # Iteration 316, error: 65662214541.7, thetas: 80241.5458922 1144.09527519
    N = len(data)
    error = 0
    theta_0_sum = 0
    theta_1_sum = 0
    for x, y in data:
        hypothesis = estimate_price(theta_0, theta_1, x)
        loss = y - hypothesis
        error += 0.5 * (loss) ** 2
        theta_0_sum += loss
        theta_1_sum += loss * x
    theta_0 += (THETA_0_LEARNING_RATE * theta_0_sum) / N
    theta_1 += (THETA_1_LEARNING_RATE * theta_1_sum) / N
    error /= N
    return error, theta_0, theta_1


def linear_regression(data, gradient_descent_func):
    theta_0 = 0
    theta_1 = data[0][1] / data[0][0]
    last_error = sys.maxint
    error = 1
    i = 0
    while (abs(error - last_error) / error) > (0.1 ** 7):
        last_error = error
        error, theta_0, theta_1 = gradient_descent_func(
            data, theta_0, theta_1
        )
        i += 1
        print 'Iteration {}, error: {}, thetas: {} {}'.format(i, error, theta_0, theta_1)
    return theta_0, theta_1


def estimate_price(theta_0, theta_1, x):
    return theta_0 + theta_1 * x


def plot_data_w_line(data, theta_0, theta_1):
    sqft, price = zip(*data)
    max_price, max_sqft = max(price), max(sqft)
    estimated_price = [estimate_price(theta_0, theta_1, x) for x in sqft]
    plt.plot(sqft, price, 'ro')
    plt.plot(sqft, estimated_price, 'b-')
    plt.axis([0, max_sqft, 0, max_price])
    plt.xlabel('sqft')
    plt.ylabel('price')
    plt.show()


if __name__ == '__main__':
    data = get_data_wo_outliers()
    theta_0, theta_1 = linear_regression(data, batch_gradient_descent)
    plot_data_w_line(data, theta_0, theta_1)

This results in:


Batch Gradient Descent

Pretty good! The code completes in 316 iterations with an error of 65662214541.7, and estimates $\theta_0$ to be 80241.5458922 and $\theta_1$ to be 1144.09527519.

Stochastic Gradient Descent

Another way to train our model is to apply our updates at each data point individually. This is stochastic gradient descent:

$$
\begin{align}
\theta_0 := \theta_0 + \alpha (y - h_\theta(x)) \cr
\theta_1 := \theta_1 + \alpha (y - h_\theta(x)) \cdot x
\end{align}
$$

It has one big benefit: we don’t have to go through the entire batch. This is hugely important for problems with large, large datasets. You’ll see it used frequently in deep learning.

# c4_stochastic_gradient_descent.py
import random
import sys
import matplotlib.pyplot as plt
from c2_plot_ny_wo_outliers import get_data_wo_outliers
from c3_batch_gradient_descent import estimate_price, plot_data_w_line

# Try playing with the learning rates!
THETA_0_LEARNING_RATE = 0.1
THETA_1_LEARNING_RATE = 0.1**10


def stochastic_gradient_descent(data, theta_0, theta_1):
    # End up with:
    # Iteration 8075, error: 1.01475127545e+13, thetas: 103991.20381 1141.75031008
    error = 0
    for x, y in data:
        hypothesis = estimate_price(theta_0, theta_1, x)
        loss = y - hypothesis
        error += 0.5 * (loss)**2
        theta_0 = theta_0 + THETA_0_LEARNING_RATE * (y - hypothesis)
        theta_1 = theta_1 + THETA_1_LEARNING_RATE * (y - hypothesis) * x
    return error, theta_0, theta_1


def linear_regression(data, gradient_descent_func):
    theta_0 = 0
    theta_1 = data[0][1] / data[0][0]
    last_error = sys.maxint
    error = 1
    i = 0
    while (abs(error - last_error) / error) > (0.1**5):
        last_error = error
        random.shuffle(data)  # Try removing the randomness and see what happens :)
        error, theta_0, theta_1 = gradient_descent_func(
            data, theta_0, theta_1
        )
        i += 1
        print 'Iteration {}, error: {}, thetas: {} {}'.format(i, error, theta_0, theta_1)
    return theta_0, theta_1


if __name__ == '__main__':
    data = get_data_wo_outliers()
    theta_0, theta_1 = linear_regression(data, stochastic_gradient_descent)
    plot_data_w_line(data, theta_0, theta_1)

It’s worth noting that this took longer than batch descent since our dataset is so small. Also, we had to loosen the convergence threshold since the error oscillates around a lot more. We would oscillate a lot less if it weren’t for the randomness!

There are in-betweens like mini-batch gradient descent, where you randomly split your large dataset into small batches. This is hugely useful if you’re doing regression across multiple machines!
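
Here’s a rough sketch of what one pass could look like, reusing the helpers from c3_batch_gradient_descent.py above; the file name and BATCH_SIZE are my own inventions, not part of the original code:

# c_minibatch_sketch.py (hypothetical file, not part of the original series)
import random
from c2_plot_ny_wo_outliers import get_data_wo_outliers
from c3_batch_gradient_descent import (
    batch_gradient_descent, linear_regression, plot_data_w_line
)

BATCH_SIZE = 32  # Arbitrary; tune it along with the learning rates


def minibatch_gradient_descent(data, theta_0, theta_1):
    # One epoch: shuffle, then take one gradient step per batch of BATCH_SIZE points
    random.shuffle(data)
    error = 0
    for start in range(0, len(data), BATCH_SIZE):
        batch = data[start:start + BATCH_SIZE]
        batch_error, theta_0, theta_1 = batch_gradient_descent(batch, theta_0, theta_1)
        error += batch_error * len(batch)
    return error / len(data), theta_0, theta_1


if __name__ == '__main__':
    data = get_data_wo_outliers()
    # linear_regression from c3 happily drives whatever step function we hand it
    theta_0, theta_1 = linear_regression(data, minibatch_gradient_descent)
    plot_data_w_line(data, theta_0, theta_1)

Because each step only sees BATCH_SIZE points, you get many cheap updates per pass over the data, which is the property that matters at deep-learning scale.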

Multivariate Linear Regression

Okay, now things start to get fun. At the moment, we’re dealing with one input dimension (AKA “Simple” Linear Regression), which is great for getting started, but most datasets have more than one dimension. We can generalize our algorithm using linear algebra.

First, let’s say that we have $n$ dimensions. Thus, when we treat every input $x$ as a vector, we get:

$$
\overrightarrow{x}^{(i)} = \begin{bmatrix}
x^{(i)}_1 \cr
x^{(i)}_2 \cr
\vdots \cr
x^{(i)}_n \cr
\end{bmatrix}
$$

We can generalize our $m$ number of training vectors to be:

$$
X = \begin{bmatrix}
— (\overrightarrow{x}^{(1)})^T — \cr
— (\overrightarrow{x}^{(2)})^T — \cr
\vdots \cr
— (\overrightarrow{x}^{(m)})^T — \cr
\end{bmatrix}
$$

And our answers, $y$, can be a vector as well:

$$
\overrightarrow{y} = \begin{bmatrix}
y^{(1)} \cr
y^{(2)} \cr
\vdots \cr
y^{(m)} \cr
\end{bmatrix}
$$

And our parameters as well:

$$
\overrightarrow{\theta} = \begin{bmatrix}
\theta_{1} \cr
\theta_{2} \cr
\vdots \cr
\theta_{n} \cr
\end{bmatrix}
$$

We’re dropping the separate intercept (“error”) parameter $\theta_0$ here. If you really want it, you can simply add a “1” column to each data point; that accomplishes the same thing and allows us to simplify our math and code a bit.

So our hypothesis for an individual data point looks like:

$$
\begin{align}
h_\theta(\overrightarrow{x}^{(i)}) = (\overrightarrow{x}^{(i)})^T \cdot \overrightarrow{\theta}
\end{align}
$$

So going back to our cost function, which we can put in matrix form:

$$
\begin{align}
J(\theta) &= {1 \over 2} \sum_{i=0}^n (h_\theta(\overrightarrow{x}^{(i)}) - y^{(i)})^2 \cr
&= {1 \over 2} (h_\theta(X) - \overrightarrow{y})^T(h_\theta(X) - \overrightarrow{y}) \cr
&= {1 \over 2} (X \cdot \overrightarrow{\theta} - \overrightarrow{y})^T(X \cdot \overrightarrow{\theta} - \overrightarrow{y})
\end{align}
$$

So if we wanted to find the derivative of $J(\theta)$ now (aka $\nabla_{\theta}J(\theta)$), we’d have to do some funky math. If you want to read how it can be derived, I recommend reading page 11 of Andrew Ng’s lecture notes on Linear Regression.

Skipping to the answer, we get:

$$
\begin{align}
\nabla_{\theta}J(\theta) &= X^T(X \cdot \theta - \overrightarrow{y})
\end{align}
$$

And thus, we can create steps for our $\overrightarrow{\theta}$:

$$
\begin{align}
\overrightarrow{\theta} := \overrightarrow{\theta} + \alpha X^T(\overrightarrow{y} - (X \cdot \theta))
\end{align}
$$

This kind of looks like our step for $\theta_1$ not too long ago!

$$
\begin{align}
\theta_1 := \theta_1 + \alpha (y - h_\theta(x)) x
\end{align}
$$

Programming this in numpy

# c5_stochastic_gradient_descent.py
import random
import sys
import matplotlib.pyplot as plt
import numpy as np
from c2_plot_ny_wo_outliers import get_data_wo_outliers
from c3_batch_gradient_descent import estimate_price, plot_data_w_line

LEARNING_RATE = 0.1**6


def batch_gradient_descent(x, y, theta):
    # End up with:
    # Iteration 8, error: 9.643796593e+12, theta: [ 2.35848572 1227.94035952]
    x_trans = x.transpose()
    M = len(x)
    hypothesis = np.sum(x * theta, axis=1)
    loss = hypothesis - y
    theta -= LEARNING_RATE * (np.sum(x_trans * loss, axis=1)) / M
    error = np.sum(0.5 * loss.transpose() * loss)
    return error, theta


def linear_regression(x, y, gradient_descent_func):
    theta = np.ones(np.shape(x[0]))
    last_error = sys.maxint
    error = 1
    i = 0
    while (abs(error - last_error) / error) > (0.1 ** 7):
        last_error = error
        error, theta = gradient_descent_func(
            x, y, theta
        )
        i += 1
        print 'Iteration {}, error: {}, theta: {}'.format(i, error, theta)
    return theta


def data_in_np(data):
    sqft, price = zip(*data)
    # Adding a 1 to replicate our \theta_0
    sqft = np.array([np.array([1, s]) for s in sqft])
    price = np.array(price)
    return sqft, price


if __name__ == '__main__':
    data = get_data_wo_outliers()
    sqft, price = data_in_np(data)
    theta = linear_regression(sqft, price, batch_gradient_descent)
    theta_0, theta_1 = theta
    plot_data_w_line(data, theta_0, theta_1)

We get a slightly different answer here than in our old batch gradient descent code because the y-intercept now has a different (much smaller) learning rate. If we gave it enough repetitions, it would eventually get near the same area.

Of course, there are plenty of high level ML libraries to explore that do this stuff for you! However, it’s fun to understand what’s happening under the hood.

Things to worry about

Multivariable linear regression with gradient descent can have a lot of complications. For example, local minima:


Example of a local minima. Source: Creative Commons

As you can see, gradient descent might accidentally think the minimum is in the saddle there. There’s a ton of interesting papers about this. Training multiple times with different starting parameters is one way around this.

Overfitting is also an issue:


Overfitting. Source: Creative Commons

You can avoid this by splitting your data into a large “training” set and a large “testing” set. That’s standard procedure in most data science problems.
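
A minimal sketch of such a split on our apartment data (the 80/20 ratio and the file name are my own, not from the original code):

# c_train_test_split_sketch.py (hypothetical)
import random
from c2_plot_ny_wo_outliers import get_data_wo_outliers

data = get_data_wo_outliers()
random.shuffle(data)
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]
# Fit your thetas on `train` only, then compute the cost J on `test`;
# a test error much worse than the training error is the signature of overfitting.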

One last method: using the derivative

As many of you know, if you set the derivative of a curve to 0, the result is either a local maximum or a local minimum. Since our least-squares cost function is convex (it curves upward with no upper limit), the point where the derivative is 0 is its global minimum.

So we can make our derivative $\nabla_{\theta}J(\theta)$ zero:

$$
\begin{align}
\nabla_{\theta}J(\theta) &= X^T(X\theta - \overrightarrow{y}) \cr
0 &= X^T(X\theta - \overrightarrow{y}) \cr
X^TX\theta &= X^T\overrightarrow{y} \cr
\theta &= (X^TX)^{-1}X^T\overrightarrow{y}
\end{align}
$$

And we can throw it in our code:

# c6_derivative_method.py
import random
import sys
import matplotlib.pyplot as plt
import numpy as np
from c2_plot_ny_wo_outliers import get_data_wo_outliers
from c3_batch_gradient_descent import estimate_price, plot_data_w_line

LEARNING_RATE = 0.1**6


def linear_regression(x, y):
    # End up with 82528.3036581 1141.67386409 :)
    x_trans = x.transpose()
    return np.dot(np.linalg.inv(np.dot(x_trans, x)), np.dot(x_trans, y))


def data_in_np(data):
    sqft, price = zip(*data)
    sqft = np.array([np.array([1, s]) for s in sqft])
    price = np.array(price)
    return sqft, price


if __name__ == '__main__':
    data = get_data_wo_outliers()
    sqft, price = data_in_np(data)
    theta = linear_regression(sqft, price)
    theta_0, theta_1 = theta
    print theta_0, theta_1
    plot_data_w_line(data, theta_0, theta_1)

And it totally works! You may ask, “but why did we learn all about the gradient descent stuff then?” Well, you’ll need it for things that aren’t as straightforward as linear regression, like deep neural networks.

To learn more:

Andrew Ng’s notes on Linear Regression.

ML @ Berkeley’s Machine Learning Crash Course.

How to Docker For Great Good

What is Docker?

Docker is a container system. It allows you to run code in a predefined environment that will Run Anywhere™.

So how is it different from a virtual machine? To start a VM, you allocate resources (X bytes of memory, Y CPU cores, etc.) and these allocations are absolute. That is, if the VM only needs half its allocated memory or just a few CPU cycles, you can’t reclaim or reassign those resources dynamically.

That creates a lot of waste! It means that your services always use their maximum number of resources. In addition, you have the overhead of emulating their operating system and their hardware.

For the most part, as developers, we really want only a few things:

  • Make sure processes can’t affect the host operating system. We want our containers to be a jail.

  • Make sure processes can’t affect one another. So give me isolated memory addresses and file systems.

  • Give me hard memory/CPU limits, so a process only uses what it needs, up to a cap, and can’t starve other processes.

That’s containers in a nutshell. Essentially, Docker provides this by using:

  • cgroups: A Linux kernel feature that isolates and limits the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.

  • apparmor: A Linux security module that allows you to restrict per-program network access, raw socket access or file path access.

  • aufs (Another Union File System): Imagine a file system that works off of diffs. So every single change is just a diff layered on top of an existing file system. It allows you to “fork” other containers very easily.

  • Many more cool Linux modules and features

This is why Docker originally could only run on Linux. Only recently can you run it on OSX/Windows without VirtualBox!

Downsides:

  • Difficult to get it working natively on host OSs

  • Security issues

  • Limited OS choice. (Most official Docker images are Debian/Ubuntu based; many are moving to Alpine Linux.)

Actually using Docker

Okay, so first you need to install Docker. This installs the Docker daemon which controls all of the containers running on your computer.

Assuming your Docker daemon is running, you can now pull the base Ubuntu image:

$ docker pull ubuntu
Using default tag: latest
latest: Pulling from library/ubuntu
b3e1c725a85f: Pull complete
4daad8bdde31: Pull complete
63fe8c0068a8: Pull complete
4a70713c436f: Pull complete
bd842a2105a8: Pull complete
Digest: sha256:7a64bc9c8843b0a8c8b8a7e4715b7615e4e1b0d8ca3c7e7a76ec8250899c397a

Images are snapshots of a filesystem. You can push/pull images from a central repository. Docker, being a private company, made its own servers (Docker Hub) the default repo to pull from. You’ll learn how to push/pull from private container repos later.

Now, we can use the base Ubuntu image to run a command:

$ docker run ubuntu echo 'hello'
hello

That was really fast (slightly longer than a second for me)! So what happened?

  • The Docker CLI parsed our command and realized that we wanted to run “echo ‘hello’” on the Ubuntu image. It passed that information to the Docker daemon.

  • The Docker daemon started a process with all of the voodoo magic that isolates it.

  • It made sure that the process had access to a file system that we pulled (the Ubuntu image).

  • That process ran our echo 'hello' command

We can run any other bash command! For example, we could use ls to explore the filesystem.

$ docker run ubuntu ls
bin
boot
dev
etc
home
lib
lib64
media
mnt
opt
proc
root
run
sbin
srv
sys
tmp
usr
var

Neat, right? So it feels like an actual Linux virtual machine!
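
Unlike a VM, though, the hard resource limits we talked about earlier aren’t allocated up front; they’re just optional flags on docker run. A quick sketch (the numbers are arbitrary):

# Cap memory at 256MB and give the container a below-default share of CPU time
$ docker run --rm -m 256m --cpu-shares 512 ubuntu echo 'hello'
hello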

What if we want to actually use bash within the container? We can use the ‘-i’ (interactive, keeps STDIN open) and ‘-t’ (pseudo-tty) options:

$ docker run -it ubuntu bash
root@71c819b308a6:/# echo 'hello'
hello
root@71c819b308a6:/# exit

Pulling a prebuilt image and advanced options

Now we can have some fun. First, let’s pull the official Docker image for Redis:

$ docker pull redis
Using default tag: latest
latest: Pulling from library/redis
5040bd298390: Pull complete
996f41e871db: Pull complete
a40484248761: Pull complete
a97af2bf2ee7: Pull complete
010c454d55e5: Pull complete
142d4cb3dc08: Pull complete
6666ac0e527e: Pull complete
Digest: sha256:a027a470aa2b9b41cc2539847a97b8a14794ebd0a4c7c5d64e390df6bde56c73
Status: Downloaded newer image for redis

Now we can run Redis:

$ docker run -p 6379:6379 -d redis
ce59c7fa1ed003366d39b93137d3713f286017dd6c264b12fe987b805dc4d067
$ redis-cli
127.0.0.1:6379> set 1 1
OK
127.0.0.1:6379> get 1
"1"

You can see our currently running containers:

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ce59c7fa1ed0 redis "docker-entrypoint.sh" 7 seconds ago Up 5 seconds 0.0.0.0:6379->6379/tcp sad_sinoussi

And you can stop a running container:

$ docker stop ce59c7fa1ed003366d39b93137d3713f286017dd6c264b12fe987b805dc4d067
# You can also use `docker stop sad_sinoussi`, the randomly generated name.

It’s worth mentioning that Docker keeps old containers around! For example:

$ # We can also do docker start ce59c7fa1ed003366d39b93137d3713f286017dd6c264b12fe987b805dc4d067
$ docker start sad_sinoussi

This means two things:

  1. Docker takes up more and more space. Use docker rm $(docker ps -a -q).

  2. If you assign a name to a container during docker run, you might hit a name conflict the next time. So use --rm.

Making our own Docker container

Okay, now things can start to get fun. Let’s say we want to make a Docker container that runs a little flask app. First, let’s make a directory called test-app and make ourselves a little app:

1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/python -u
# app.py
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0')

Next, let’s make a Dockerfile. A Dockerfile defines how an image gets built, using a small set of instructions defined by Docker:

1
2
3
4
5
6
7
8
9
10
# Dockerfile
FROM python
RUN pip install flask
# Sets the working directory for any RUN/CMD/COPY/ADD instructions
WORKDIR /code/
ADD ./app.py /code/
CMD ["python", "./app.py"]

It’s worth mentioning that we’re using the official “Python” image as our base. This installs pip and other goodies for us.

So now we can build our actual Docker image:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
$ docker build -t test-app .
Sending build context to Docker daemon 3.072 kB
Step 1 : FROM python
latest: Pulling from library/python
5040bd298390: Already exists
fce5728aad85: Pull complete
76610ec20bf5: Pull complete
52f3db4b5710: Pull complete
45b2a7e03e44: Pull complete
75ef15b2048b: Pull complete
e41da2f0bac3: Pull complete
Digest: sha256:cba517218b4342514e000557e6e9100018f980cda866420ff61bfa9628ced1dc
Status: Downloaded newer image for python:latest
---> 775dae9b960e
Step 2 : RUN pip install flask
---> Running in 6d8dae25e191
Collecting flask
Downloading Flask-0.12-py2.py3-none-any.whl (82kB)
Collecting click>=2.0 (from flask)
Downloading click-6.7-py2.py3-none-any.whl (71kB)
Collecting itsdangerous>=0.21 (from flask)
Downloading itsdangerous-0.24.tar.gz (46kB)
Collecting Jinja2>=2.4 (from flask)
Downloading Jinja2-2.9.4-py2.py3-none-any.whl (274kB)
Collecting Werkzeug>=0.7 (from flask)
Downloading Werkzeug-0.11.15-py2.py3-none-any.whl (307kB)
Collecting MarkupSafe>=0.23 (from Jinja2>=2.4->flask)
Downloading MarkupSafe-0.23.tar.gz
Installing collected packages: click, itsdangerous, MarkupSafe, Jinja2, Werkzeug, flask
Running setup.py install for itsdangerous: started
Running setup.py install for itsdangerous: finished with status 'done'
Running setup.py install for MarkupSafe: started
Running setup.py install for MarkupSafe: finished with status 'done'
Successfully installed Jinja2-2.9.4 MarkupSafe-0.23 Werkzeug-0.11.15 click-6.7 flask-0.12 itsdangerous-0.24
---> 2882709cbb5b
Removing intermediate container 6d8dae25e191
Step 3 : WORKDIR /code/
---> Running in da733427a997
---> 2b0774947e31
Removing intermediate container da733427a997
Step 4 : ADD ./app.py /code/
---> a44e95d734c5
Removing intermediate container 4bb2afe5864a
Step 5 : CMD python ./app.py
---> Running in 5f12142bf674
---> 7ae77de3b7b9
Removing intermediate container 5f12142bf674
Successfully built 7ae77de3b7b9

You’ll notice that after each instruction, it makes an intermediate container. Remember: Docker uses a layered filesystem (AUFS). As you customize your image, each instruction’s changes are layered on top of this base image.
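If you’re curious, you can inspect those layers with docker history (your IDs and sizes will differ):

$ docker history test-app
# Shows one row per layer: the CMD, ADD, RUN and FROM steps above,
# each with its own ID and size.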

Also, we built with the -t option. This “tags” our image so we can reference it more easily. We can see it here:

1
2
3
$ docker images test-app
REPOSITORY TAG IMAGE ID CREATED SIZE
test-app latest 7ae77de3b7b9 5 minutes ago 697.4 MB

Now we can run it:

1
2
3
4
5
6
7
$ docker run --rm -p 5000:5000 test-app
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger pin code: 100-366-731
172.17.0.1 - - [20/Jan/2017 17:26:27] "GET / HTTP/1.1" 200 -
172.17.0.1 - - [20/Jan/2017 17:26:27] "GET /favicon.ico HTTP/1.1" 404 -

To make a change, simply run docker build again! But that’s an annoying dev cycle…

Okay, let’s say we want our Docker container to mount app.py straight from our host file system:

1
$ docker run --rm -v `pwd`/app.py:/code/app.py -p 5000:5000 test-app

Ka-pow! Volumes are layered on top of the container’s filesystem, which means we override the copy of app.py that was baked into the image.

Now let’s try to make our Flask app talk to Redis. First let’s run a redis container:

1
2
3
4
5
$ docker run --name my-redis -d redis
55a0fb5de6136d3cb4fa772be0a5376a5a33b39d75ed7d08dc158b38ae55165c
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
55a0fb5de613 redis "docker-entrypoint.sh" 5 seconds ago Up 4 seconds 6379/tcp my-redis

Cool, now let’s change app.py to talk to Redis. First, add the redis package to the pip install in our Dockerfile:

1
2
3
4
5
6
7
8
9
# Dockerfile
FROM python
RUN pip install flask redis
WORKDIR /code/
ADD ./app.py /code/
CMD ["python", "./app.py"]

Then we build it again:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
$ docker build -t test-app .
Sending build context to Docker daemon 3.072 kB
Step 1 : FROM python
---> 775dae9b960e
Step 2 : RUN pip install flask redis
---> Running in 248d0b09139f
Collecting flask
Downloading Flask-0.12-py2.py3-none-any.whl (82kB)
Collecting redis
Downloading redis-2.10.5-py2.py3-none-any.whl (60kB)
Collecting click>=2.0 (from flask)
Downloading click-6.7-py2.py3-none-any.whl (71kB)
Collecting Jinja2>=2.4 (from flask)
Downloading Jinja2-2.9.4-py2.py3-none-any.whl (274kB)
Collecting Werkzeug>=0.7 (from flask)
Downloading Werkzeug-0.11.15-py2.py3-none-any.whl (307kB)
Collecting itsdangerous>=0.21 (from flask)
Downloading itsdangerous-0.24.tar.gz (46kB)
Collecting MarkupSafe>=0.23 (from Jinja2>=2.4->flask)
Downloading MarkupSafe-0.23.tar.gz
Installing collected packages: click, MarkupSafe, Jinja2, Werkzeug, itsdangerous, flask, redis
Running setup.py install for MarkupSafe: started
Running setup.py install for MarkupSafe: finished with status 'done'
Running setup.py install for itsdangerous: started
Running setup.py install for itsdangerous: finished with status 'done'
Successfully installed Jinja2-2.9.4 MarkupSafe-0.23 Werkzeug-0.11.15 click-6.7 flask-0.12 itsdangerous-0.24 redis-2.10.5
---> d91f2e311d7e
Removing intermediate container 248d0b09139f
Step 3 : WORKDIR /code/
---> Running in 0b54d809edc6
---> 2a10bad9b3a6
Removing intermediate container 0b54d809edc6
Step 4 : ADD ./app.py /code/
---> 0328f86aa26c
Removing intermediate container 016781d53f5e
Step 5 : CMD python ./app.py
---> Running in f2bb51177fd8
---> 9ed492fd28ab
Removing intermediate container f2bb51177fd8
Successfully built 9ed492fd28ab

Then we run our Docker container while linking to the existing redis container:

1
$ docker run --rm -v `pwd`/app.py:/code/app.py -p 5000:5000 --link my-redis:redis test-app

Now we can change our app.py to talk to redis:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#!/usr/bin/python -u
# app.py
from flask import Flask
import redis

app = Flask(__name__)

@app.route("/")
def hello():
    r = redis.StrictRedis(host='redis', port=6379, db=0)
    num = r.get('count')
    num = int(num) + 1 if num else 1
    r.set('count', num)
    return "Hello {}!".format(num)

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0')
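
With both containers up, you should be able to hit the app from the host and watch the counter climb. Roughly (this is what I'd expect from the code above, not captured output):

$ curl http://localhost:5000/
Hello 1!
$ curl http://localhost:5000/
Hello 2!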

docker-compose

Some people realized that all of this Docker stuff could be made simpler, so they made a Python library called “fig” to do it. It was so successful that it became part of Docker as “Docker Compose”.

Essentially, it allows you to run several Docker containers at once:

1
2
3
4
5
6
7
8
9
10
11
12
13
# docker-compose.yml
web:
  build: .
  command: python app.py
  volumes:
    - ./app.py:/code/app.py
  links:
    - redis
  ports:
    - "5000:5000"
redis:
  image: redis

Now we can run docker-compose up and everything will be running.
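A few docker-compose subcommands I find handy (standard docker-compose CLI; the service name web comes from the file above):

$ docker-compose up -d      # run everything in the background
$ docker-compose logs web   # show one service's logs
$ docker-compose down       # stop and remove the containers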

Docker in production

Docker containers are powerful for development, but they’re a really powerful idea for deployment as well, for a couple of reasons:

  1. Allows you to make micro-services super easily

  2. Very easy clustering (docker-swarm, ECS, Kubernetes)

  3. Easy ops: blue-green deployment and rollbacks are easy

  4. Autoscaling

There’s still no “set” way of doing things, so here’s an example of a task definition in ECS:

{
  "family": "logstash-production",
  "taskRoleArn": "arn:aws:iam::317260457025:role/pro-logstash-task",
  "networkMode": "bridge",
  "containerDefinitions": [
    {
      "name": "logstash",
      "image": "317260457025.dkr.ecr.us-east-1.amazonaws.com/search/logstash:1.0.1",
      "cpu": 512,
      "memory": 1000,
      "essential": true,
      "command": ["-f", "/src/logstash.production.conf"],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "logstash-production",
          "awslogs-region": "us-east-1"
        }
      }
    }
  ],
  "placementConstraints": [],
  "volumes": []
}

And the workflow is simply (sketched as a script after the list):

  • Build your image: docker build -t $ECR_URL/<my_role_name>:VERSION .

  • Push your image to ECR: docker push $ECR_URL/<my_role_name>:VERSION

  • Update your task definition: aws ecs register-task-definition --cli-input-json file://production.json

  • Tell your AWS cluster to use the new task definition (either in the UI or the CLI).
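
Strung together, a deploy loop looks roughly like this. This is a sketch: $ECR_URL, the image/role names, the version, the cluster and service names are all placeholders you'd swap for your own.

#!/bin/bash
# Hypothetical deploy script; adjust every name to your setup.
VERSION=1.0.2
docker build -t $ECR_URL/logstash:$VERSION .
docker push $ECR_URL/logstash:$VERSION
aws ecs register-task-definition --cli-input-json file://production.json
# The CLI version of the last bullet: point the service at the new revision.
aws ecs update-service --cluster my-cluster --service logstash \
    --task-definition logstash-production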

It comes with a couple headaches:

  • Making security definitions and IAM roles

  • Making your actual cluster instances (different IAM role for this!)

  • Why on earth aren’t there CNAMEs for ECR URLs?

Helpful Python Tricks

Advanced Data Structures

Odds are, you know about Python lists and dicts. But there are a couple of other really useful data structures out there.

Tuples

Most of you have heard of this. Tuples are immutable collections - once you make them, you can’t change them.

1
2
a = (1,2,3)
a[0] = 4
1
TypeError: 'tuple' object does not support item assignment

Why are tuples useful? Two reasons:

  1. Immutability is good sometimes (it lets you use tuples as dict keys, for example - see the sketch after this list)

  2. It saves space.
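
A quick sketch of the dict-key point (my example, not from the original):

# Tuples can be dict keys because they're immutable (and hashable); lists can't.
visits = {}
visits[('espn.com', '2017-01-20')] = 42
print visits[('espn.com', '2017-01-20')]    # 42
# visits[['espn.com', '2017-01-20']] = 42   # TypeError: unhashable type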

Sets

Sets are kind of like dictionaries without values. They work the same way under the hood (a hash table), but the keys don’t map to anything.

They’re very useful for keeping track of unique things.

1
2
3
4
5
a = set([1, 2, 1])
print a
b = {2, 3} # shorthand for set initialization
print a & b
print a | b
1
2
3
set([1,2])
set([2])
set([1,2,3])

You can also use frozenset, which is immutable like a tuple.

collections

The collections library has a bunch of useful ones (quick sketches after the list):

  • namedtuple() - make a tuple with predefined named fields. Great for things like SQLAlchemy rows.
  • deque - a double-ended queue with fast appends and pops on both ends
  • Counter - counts things. Kind of like a dict whose values are counts.
  • OrderedDict - a dict that remembers insertion order
  • defaultdict - a dict where you define the default value for missing keys
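
A few quick sketches (my examples, not from the original):

from collections import namedtuple, Counter, defaultdict

Domain = namedtuple('Domain', ['id', 'url'])
d = Domain(1, 'espn.com')
print d.url                        # espn.com

print Counter(['a', 'b', 'a'])     # Counter({'a': 2, 'b': 1})

groups = defaultdict(list)
groups['sports'].append('espn.com')  # no KeyError for missing keys
print groups['sports']               # ['espn.com']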

Generators

They’re pretty great:

1
2
3
4
5
6
7
8
9
10
11
12
def fibonacci():
    a = 0
    b = 1
    while True:
        yield b
        b = a + b
        a = b - a

for num in fibonacci():
    if num > 5:
        break
    print(num)
1
2
3
4
5
1
1
2
3
5

Generators pick up where they left off each time they’re called. When the generator function actually returns, it raises StopIteration and the generator is done.

They’re good because they don’t hold everything in memory at once. That’s why people recommend xrange over range in Python 2.
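
For example, a generator expression computes values one at a time instead of building the whole list first (a small sketch of my own):

# Builds a million-element list in memory, then sums it:
print sum([x * x for x in xrange(1000000)])
# Computes the same sum one value at a time, with constant memory:
print sum(x * x for x in xrange(1000000))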

Variable swapping and unpacking

This is cool:

1
2
a, b = 1, 2
b, a = a, b

This is cooler:

1
a, b, c = [1,2,3]

This is coolest:

1
2
for i, number in enumerate(range(10)):
    print i, number

Neato chained comparisons

1
(1 < 2 and 2 < 3) == (1 < 2 < 3)

Comprehensions

List/dict/tuple/set comprehensions are amazing. Want to make a simple list?

1
2
3
4
print [x * 2 for x in xrange(5)] # list
print (x * 2 for x in xrange(5)) # generator
print {x * 2 for x in xrange(5)} # set
print {x: x * 2 for x in xrange(5)} # dict
1
2
3
4
[0, 2, 4, 6, 8]
<generator object <genexpr> at 0x10df2dcd0>
set([0, 8, 2, 4, 6])
{0: 0, 1: 2, 2: 4, 3: 6, 4: 8}

Easy. And you can add filters:

1
[x for x in xrange(10) if x % 2 == 0] # Even numbers

And do double loops!

1
[x for x in range(5) for y in range(5)]

Neat-o gang. It saves a lot of LOC. I use it frequently in API calls to generate JSON blobs.

1
2
3
4
5
6
7
8
q = session.query(Creatives)
return [
    {
        'md5': c.md5,
        'url': generate_cloudfront_url(c),
        'brand_id': c.brand().brand_id
    } for c in q
]

See? Now they don’t look so scary.

Common Python gotchas

Credit for most of this section goes to the Python Guide.

Mutable default arguments

This is a classic Python interview question by pedantic people. What do you think will happen when I run this?

1
2
3
4
5
6
7
def a(b, c=[]):
    c.append(b)
    return c

print a(1)
print a(2)
print a(3)

Huh?

1
2
3
[1]
[1,2]
[1,2,3]

Not what you expected, right? Remember that default argument values are evaluated once, when the def statement runs, not on every call. So every call shares the same list.

Instead of that, you should always do:

1
2
3
4
5
def a(b, c=None):
    if c is None:
        c = []
    c.append(b)
    return c

Don’t use mutable objects as default values (dicts, lists, etc). You can use strings and ints because they aren’t mutable.

Closures are weird

Here, we make lambdas in the comprehension:

1
2
3
4
def create_multipliers():
    return [lambda x: i * x for i in range(5)]

print [multiplier(2) for multiplier in create_multipliers()]

You’d expect the result to be 0,2,4,6,8 but you actually get 8,8,8,8,8! Why? Because closures look up i when they’re called, not when they’re created, and by then the loop has finished with i equal to 4, so every lambda computes 4 * 2 = 8.

After all, you’d expect this:

1
2
3
4
5
a = 1
def print_a():
    print a
a = 2
print_a()

To print 2, correct?

Instead of this, use default values:

1
2
def create_multipliers():
    return [lambda x, i=i: i * x for i in range(5)]
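
Another common workaround (my addition, not from the original) is functools.partial, which binds i at creation time:

import functools
import operator

def create_multipliers():
    # partial(mul, i) is a callable that computes i * x when called with x
    return [functools.partial(operator.mul, i) for i in range(5)]

print [multiplier(2) for multiplier in create_multipliers()]   # [0, 2, 4, 6, 8]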

Class inheritance is weird

This one comes from Martin Chikillian.

1
2
3
4
5
6
7
class A(object):
    x = 1

class B(A):
    pass

class C(A):
    pass

print A.x, B.x, C.x
1
1 1 1

Sure.

1
2
B.x = 2
print A.x, B.x, C.x
1
1 2 1

K.

1
2
A.x = 3
print A.x, B.x, C.x
1
3 2 3

Wat?

When you reference an attribute, Python first looks it up on the object (or class) itself; if it isn’t there, it walks up the parent classes. The order of that walk is called the “Method Resolution Order” (MRO). B.x = 2 gave B its own x, but C never got one, so C.x still comes from A, and changing A.x changes C.x too.
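
You can see the lookup order yourself, and watch the fallback happen (my sketch, continuing the classes above):

print C.__mro__        # the lookup order: C, then A, then object
del B.x                # remove B's own x...
print A.x, B.x, C.x    # 3 3 3 -- B falls back to A.x again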

Local/global variables are weird

1
2
3
4
a = 0
def b():
    a += 1
b()
1
UnboundLocalError: local variable 'a' referenced before assignment

The moment you assign to a variable anywhere inside a function, Python treats it as local to that function. So a += 1 is really a = a + 1, and the a on the right-hand side hasn’t been assigned locally yet.

To do this properly:

1
2
3
4
5
a = 0
def b():
    global a
    a += 1
b()

is vs ==

Remember: is will only be true if two things are the same object. == will call the object’s __eq__ method, which you can override:

1
2
3
4
class EverythingIsTrue(object):
    def __eq__(self, other):
        return True

print EverythingIsTrue() == False # Will print True

There are a ton of other magic methods like __lt__, __le__, __add__ that back other operators like <, <=, +.
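
For example, a tiny money-ish class that overrides __add__ and __lt__ (my sketch):

class Money(object):
    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):        # called for m1 + m2
        return Money(self.cents + other.cents)

    def __lt__(self, other):         # called for m1 < m2
        return self.cents < other.cents

print (Money(100) + Money(50)).cents   # 150
print Money(100) < Money(200)          # True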

IPython is great

It remembers your history, does tab completion, gives information about objects (append a ? to any name), and has magic pasting.

It even has magic functions like %timeit:

1
2
%timeit range(1000)
100000 loops, best of 3: 7.76 us per loop

Or %run, which runs external scripts.

Context Managers

Context managers are cool if you need to make sure something gets cleaned up:

1
2
3
4
5
6
7
8
9
class MyContextManager(object):
    def __enter__(self):
        self.b = 2
        return self

    def __exit__(self, type, value, traceback):
        self.b = None

with MyContextManager() as mc:
    print mc.b
print mc.b
1
2
2
None

You can do other cool things with contextlib:

1
2
3
4
5
6
7
8
9
10
from contextlib import contextmanager

@contextmanager
def tag(name):
    print "<%s>" % name
    yield
    print "</%s>" % name

with tag('Lol'):
    print 'hi'
1
2
3
<Lol>
hi
</Lol>

List slicing and dicing

Python’s slice notation is the bomb. Get the last element by using -1: mylist[-1].

You can also get slices this way: mylist[0:5] or mylist[:5] or mylist[5:].

You can even define the “jumps”: range(10)[::2] gets even numbers. A good trick to reverse a list is range(10)[::-1] (although you should probably use reversed() instead!).
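
Putting those together (my examples):

mylist = range(10)        # [0, 1, ..., 9]
print mylist[-1]          # 9
print mylist[:5]          # [0, 1, 2, 3, 4]
print mylist[5:]          # [5, 6, 7, 8, 9]
print mylist[::2]         # [0, 2, 4, 6, 8]
print mylist[::-1]        # [9, 8, ..., 0]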

args and kwargs and splat

* is called the “splat” operator. Use it for making functions that take a variable number of arguments:

1
2
3
4
5
6
def a(b, c=None, *args, **kwargs):
    print('b', b)
    print('c', c)
    print('my args', args)
    print('my kwargs', kwargs)

a('pikachu', 1, 2, d=3, e=4)

Or use it for the opposite:

1
2
3
def b(a, b, c=None, d=None):
    print 'here'

b(*[1,2], **{'c': 1, 'd': 2})

itertools.chain

I love itertools. Very handy library. Combine iterables:

1
2
3
4
import itertools
list1 = [1,2]
list2 = [3,4]
list(itertools.chain(list1, list2)) == [1,2,3,4]

Use groupby (note that it only groups consecutive items, so sort by your key first if the data isn’t already ordered):

1
2
3
4
5
6
7
8
9
10
a = [[1,2], [1,3], [1,4], [2,5]]
[(k, len(list(rows))) for k, rows in itertools.groupby(a, key=lambda l: l[0])]

pokemon_lineups = [
    ['nidoran', 'nidrino', 'nidoking'],
    ['nidoran', 'nidrina', 'nidoqueen'],
    ['magikarp', 'gyarados']
]
for base_pokemon, evolution_chain in itertools.groupby(pokemon_lineups, key=lambda l: l[0]):
    print 'base_pokemon {} has {} chains'.format(base_pokemon, len(list(evolution_chain)))

Combinations and permutations (great for tests):

1
2
3
4
5
6
7
8
9
10
11
12
import itertools
import random

my_lineup = ['pikachu', 'bulbasaur', 'squirtle', 'charizard', 'jigglypuff', 'gyarados']

def chance_against_gary(ordered_lineup):
    return random.randint(0, 100)

max(
    [
        (l, chance_against_gary(l))
        for l in itertools.permutations(my_lineup, 6)
    ],
    key=lambda t: t[1]
)

zip

Handy for combining lists for iterating:

1
2
3
4
pokemon = ['charizard', 'bulbasaur', 'pikachu', 'lickitung']
awesomeness = [10, 4, 8, 1]
for p, a in zip(pokemon, awesomeness):
    print '{} is this awesome: {}'.format(p, a)

1
2
3
4
charizard is this awesome: 10
bulbasaur is this awesome: 4
pikachu is this awesome: 8
lickitung is this awesome: 1

Decorators are great

Wrap your functions! Here’s a decorator that makes sure a view is only usable if the user is logged in:

1
2
3
4
5
6
7
8
9
10
def must_be_logged_in(view):
    def wrapper(request):
        if not request.user:
            raise HTTPForbidden
        return view(request)
    return wrapper

@must_be_logged_in
def my_view(request):
    pass
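
One gotcha worth knowing (my addition, not from the original): the wrapper replaces the view’s __name__ and docstring. functools.wraps fixes that:

import functools

def must_be_logged_in(view):
    @functools.wraps(view)           # keeps view.__name__, __doc__, etc.
    def wrapper(request):
        if not request.user:
            # HTTPForbidden comes from your web framework, as in the original example
            raise HTTPForbidden
        return view(request)
    return wrapper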

Classmethods and staticmethods

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
class A(object):
    def my_method(self):
        # Plain method: called on an instance, gets that instance as `self`
        print self

    @classmethod
    def class_method(cls):
        # Gets the class itself as `cls`, whether called on the class or an instance
        print cls

    @staticmethod
    def static_method():
        # Gets neither the instance nor the class; just a function that lives on A
        print 'static'

a = A()
a.my_method()
A.class_method()
a.class_method()
A.static_method()

Best Python -m tools:

There are a lot of handy tools out there. -m runs a module as a script. Here’s a good list. Some highlights:

SimpleHTTPServer

Handy for making a quick HTTPServer:
python -m SimpleHTTPServer 8000

json.tool

Handy for pretty printing:
echo '{"greeting": "hello", "name": "world"}' | python -m json.tool

Gzip

python -m gzip [file] # compress
python -m gzip -d [file] # decompress

antigravity

python -m antigravity

this

python -m this

Hello World

I’ve started and abandoned many blogs in the past. Over time, most of them fizzled out because I thought that I didn’t have anything worth writing. If it wasn’t original, I reasoned, it wouldn’t be worth my time to write or worth the time of others to read.

Turns out, that’s a crappy approach to writing because you never get better that way. Quantity generates quality in the long run.

So, I’ve created this blog with a completely different goal in mind: write as often as possible, even if I’m writing about something pretty stupid. I think there’ll be other benefits:

  1. Hopefully, I’ll learn to become more succinct in real life. I ramble a lot when speaking and it doesn’t make me the best communicator.

  2. In learning to write more clearly, maybe I’ll learn to think more clearly. It’s possible that good communication is just a symptom of a well-organized mind.

  3. Having a written record of my thoughts and opinions sounds like it’ll be nice in a few years [1].

  4. Writing teaches empathy. The best writers make sure there is no ambiguity by placing themselves in the position of the reader. Understanding others is a skill that makes you more successful in, I believe, almost every field imaginable.

Let’s see how far this goes!

[1] As a kid, I used to hate being photographed. It seemed forced to me. However, as I’ve grown older, I often wish I had more pictures.