# Postgres 102 - Explains and Advanced Features

In this tutorial, we’re going to cover some good-to-know Postgres features and tricks for query optimization. The only prerequisite is basic SQL and having Docker/Postgres installed.

## Setting up our environment

First things first, let’s make some mock data for us to play around with. We’re going to make a table of domains (e.g. espn.com) and domain categories (e.g. sports publishers):
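
Something like this (column names and types are illustrative, not the exact originals):

```sql
CREATE TABLE domain (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE category (
    id   serial PRIMARY KEY,
    name text NOT NULL
);
```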

And we’re going to insert some data:
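
For instance, using generate_series (the row counts here are my own choice):

```sql
-- 1000 fake domains and 100 fake categories
INSERT INTO domain (name)
SELECT 'domain' || i || '.com' FROM generate_series(0, 999) AS i;

INSERT INTO category (name)
SELECT 'category ' || i FROM generate_series(0, 99) AS i;
```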

Sidenote
generate_series is a wonderful little function that does exactly what you think it does. It’s fantastic for making mock data.

Next, let’s make a table that connects the two together:
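
A simple join table does the trick (names are illustrative):

```sql
CREATE TABLE domain_category (
    domain_id   integer REFERENCES domain (id),
    category_id integer REFERENCES category (id),
    PRIMARY KEY (domain_id, category_id)
);
```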

Now it’s very easy to get a list of domains in any category:
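
For example (picking an arbitrary category id):

```sql
SELECT d.name
FROM domain d
JOIN domain_category dc ON dc.domain_id = d.id
WHERE dc.category_id = 42;
```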

Now let’s make things a little more complicated. Let’s add parent categories so we can get a categories topology. Imagine a “Sports” category that can contain “Football” and “Basketball” categories. To do this, we need a table that defines the parent-child relationships for categories:
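
Something along these lines (the table name is my own):

```sql
CREATE TABLE category_parent (
    category_id integer REFERENCES category (id),
    parent_id   integer REFERENCES category (id),
    PRIMARY KEY (category_id, parent_id)
);
```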

Okay, so far so good. How can we get a list of domains in parent categories? One option is to do a simple join:
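
Roughly, assuming the category_parent table sketched above:

```sql
SELECT d.name
FROM category_parent cp
JOIN domain_category dc ON dc.category_id = cp.category_id
JOIN domain d ON d.id = dc.domain_id
WHERE cp.parent_id = 42;
```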

However, what if we want the domains of a category and we don’t know whether it’s a parent category or not? One possible solution is to do both queries and union the results together:
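
Something like:

```sql
-- domains attached to the category directly...
SELECT d.name
FROM domain_category dc
JOIN domain d ON d.id = dc.domain_id
WHERE dc.category_id = 42

UNION

-- ...plus domains attached to its children
SELECT d.name
FROM category_parent cp
JOIN domain_category dc ON dc.category_id = cp.category_id
JOIN domain d ON d.id = dc.domain_id
WHERE cp.parent_id = 42;
```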

But A) this is super ugly and B) it stops working if our parent categories get parent categories of their own. In other words, we are only solving a graph that’s two layers deep. A better solution is to use something called a “recursive common table expression”. But first, you should understand a normal “common table expression” (CTE):
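
Here’s a small example of the shape (the CTE name is arbitrary):

```sql
WITH child_categories AS (
    SELECT category_id FROM category_parent WHERE parent_id = 42
)
SELECT d.name
FROM child_categories cc
JOIN domain_category dc ON dc.category_id = cc.category_id
JOIN domain d ON d.id = dc.domain_id;
```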

The with syntax here is the CTE. It’s a useful tool to cache subqueries and often, I find them much cleaner than actual subqueries because you can give them names!

A recursive CTE is slightly different because it allows a CTE to iterate on itself:
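
A sketch of what that looks like for our category graph (names are illustrative):

```sql
WITH RECURSIVE category_tree AS (
    -- start with the category we care about
    SELECT id AS category_id FROM category WHERE id = 42
    UNION
    -- then keep adding children of anything already in the tree
    SELECT cp.category_id
    FROM category_parent cp
    JOIN category_tree ct ON cp.parent_id = ct.category_id
)
SELECT d.name
FROM category_tree ct
JOIN domain_category dc ON dc.category_id = ct.category_id
JOIN domain d ON d.id = dc.domain_id;
```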

Now our join will work no matter how many layers we have!

However, it’s quite a bit of work to write this out every time. If we were using an ORM, we’d be reading a lot of documentation to get this syntax down. To avoid this, we can write a view:
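
Roughly like this (view and column names are my own):

```sql
CREATE VIEW category_domains AS
WITH RECURSIVE category_tree AS (
    SELECT id AS root_id, id AS category_id FROM category
    UNION
    SELECT ct.root_id, cp.category_id
    FROM category_parent cp
    JOIN category_tree ct ON cp.parent_id = ct.category_id
)
SELECT ct.root_id AS category_id, dc.domain_id
FROM category_tree ct
JOIN domain_category dc ON dc.category_id = ct.category_id;
```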

The database will pretend this is a table and even join other tables to it! However, be very careful with views because as you add filters and joins, the query planner may be very confused, as we’ll see later.

If the data in your view doesn’t change very often, one common tool is a materialized view. Mviews, as they’re commonly called, allow you to cache the results of the view and only refresh them manually:
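
For example, assuming the view above:

```sql
CREATE MATERIALIZED VIEW category_domains_mat AS
SELECT category_id, domain_id FROM category_domains;

REFRESH MATERIALIZED VIEW category_domains_mat;
```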

Keep in mind though: refresh materialized view will block reads. If you add a unique index to your mview (as you should), you can use refresh materialized view concurrently, which will refresh your mview without blocking reads.
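
Something like:

```sql
CREATE UNIQUE INDEX ON category_domains_mat (category_id, domain_id);

REFRESH MATERIALIZED VIEW CONCURRENTLY category_domains_mat;
```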

## Foreign data wrappers

When you have a lot of data, it’s common to split your tables between multiple databases. To simulate this, let’s create another docker instance. This time, we’ll add a “link” so the second docker instance can network to the first.
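
Roughly like this (container names are illustrative):

```bash
docker run -d --name pg1 postgres
docker run -d --name pg2 --link pg1:pg1 postgres
```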

We’ll be using a Postgres extension called postgres_fdw that allows you to communicate with other Postgres instances. There are a lot of cool Postgres extensions out there: they range from new data types to foreign data wrappers to entirely new storage engines and index types.
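
The setup looks something like this (host, database, and credentials are placeholders):

```sql
CREATE EXTENSION postgres_fdw;

CREATE SERVER pg1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'pg1', port '5432', dbname 'postgres');

CREATE USER MAPPING FOR CURRENT_USER SERVER pg1
    OPTIONS (user 'postgres', password 'postgres');

-- pull in the remote tables as foreign tables
IMPORT FOREIGN SCHEMA public FROM SERVER pg1 INTO public;
```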

There are also foreign data wrappers to MySQL, Redis, Dynamo, you name it.

## Optimizing queries

Okay, now the fun stuff :). Let’s create a brand_domain_creative table. We use a table shaped more or less like this in Moat Pro; it tells us the estimated impressions (score) a creative had on a certain day on a certain domain.
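
Something like this (the exact types are my own guess):

```sql
CREATE TABLE brand_domain_creative (
    start_date  date    NOT NULL,
    domain_id   integer NOT NULL,
    brand_id    integer NOT NULL,
    creative_id integer NOT NULL,
    score       bigint  NOT NULL
);
```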

Neat! Next we’re going to fill it with ~60M rows of simulated data. This may take a short while.
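
One way to do it with generate_series. The exact distribution below (60 days × 1000 domains × 334 brands × 3 creatives ≈ 60M rows) is an assumption that lines up with the numbers used later in the post:

```sql
INSERT INTO brand_domain_creative
SELECT '2017-01-01'::date + day,
       domain_id,
       brand_id,
       creative_id,
       (random() * 1000)::int
FROM generate_series(0, 59)  AS day,
     generate_series(0, 999) AS domain_id,
     generate_series(0, 333) AS brand_id,
     generate_series(0, 2)   AS creative_id;
```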

Now we can do queries like “what are the top 10 brands in January for domain 0?”
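
For example:

```sql
SELECT brand_id, sum(score) AS total_score
FROM brand_domain_creative
WHERE domain_id = 0
  AND start_date BETWEEN '2017-01-01' AND '2017-01-31'
GROUP BY brand_id
ORDER BY total_score DESC
LIMIT 10;
```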

Yikes! Never mind. That’s taking way too long. Ctrl+C to interrupt the query and get out of there.

To see what happened, we can use the explain query. It’ll output the database’s execution plan for running the query:
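
Just prefix the query:

```sql
EXPLAIN
SELECT brand_id, sum(score) AS total_score
FROM brand_domain_creative
WHERE domain_id = 0
  AND start_date BETWEEN '2017-01-01' AND '2017-01-31'
GROUP BY brand_id
ORDER BY total_score DESC
LIMIT 10;
```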

You’ll notice that each step has a cost estimate. Postgres keeps statistics about your tables so it can estimate how long each step will take and intelligently choose the optimal strategy. In this case, the statistics are a wee bit off: it thinks the domain will have ~1500 rows in this date range when in actuality it’s around 31k. We can tell it to re-analyze the table’s contents via an analyze command:
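
That’s simply:

```sql
ANALYZE brand_domain_creative;
```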

Now our explain looks something like this:

Even with more accurate statistics, the database doesn’t have any other option. To execute this query, it needs to go through every single row in the date range and aggregate it. How can we give it a shortcut? An index, of course!

What is an index? Essentially, think of it as the index at the back of a very large book. If you want to find every page that mentions the word aardvark, it’s much faster to look it up in the book’s index than to read every page.

By default, Postgres indices are b-trees under the hood because they’re very versatile. However, you can choose other index types if you know what you’re doing.
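
The index for this section covers the columns discussed below:

```sql
CREATE INDEX ON brand_domain_creative (start_date, domain_id, brand_id, creative_id);
```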

Building this index took a long time because the computer had to go through every single data point. When you think about it, two and a half minutes to organize 60M data points sounds pretty great. Dang, computers are cool.

Sidenote
Why are they called B-trees? Rudolf Bayer and Ed McCreight invented them while working at Boeing, so the “B” could stand for “Boeing”, “Bayer”, or even “balanced”. McCreight says they couldn’t decide between the options. They couldn’t name it Boeing without lawyers getting involved, so the company missed out on some great free advertising.

Now let’s try this query again!

That was significantly faster. And it’ll be faster if you run it again because of caching:

Caching? How does caching work? Well, the operating system’s page cache plays the biggest role here, but Postgres also has shared_buffers, which cache recently used:

• Table data (heap pages)
• Index pages

(Query plans get cached too, but that happens per connection for prepared statements rather than in shared_buffers.)

Keeping all of this consistent while writes are coming in is some serious voodoo magic, so if you see a Postgres contributor, buy them a beer.

Let’s see how our query was faster via an explain analyze. explain analyze is like an explain but it also runs the query to give you more information. verbose and buffers give you debug information about each step.
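
That’s the same query with a few options added:

```sql
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT brand_id, sum(score) AS total_score
FROM brand_domain_creative
WHERE domain_id = 0
  AND start_date BETWEEN '2017-01-01' AND '2017-01-31'
GROUP BY brand_id
ORDER BY total_score DESC
LIMIT 10;
```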

Let’s interpret this explain from inside out.

• “Bitmap index scan”: our index is large enough to take up several blocks. Because of the way B-Trees work, we can build a map that tells us which blocks contain index entries matching our conditions. The resulting array of booleans is our bitmap. We then use this bitmap to read the relevant index blocks and collect all the entries that match our conditions. This took 1.27s.

• “Bitmap heap scan”: armed with those entries, we create a bitmap of heap blocks to read and then we read them. This took almost no time at 0.16s and resulted in 31k rows.

• “Sort”: Looks like Postgres is sorting the rows with quicksort to make it easier to…

• “GroupAggregate”: Group the rows together by brand_id and sum the scores (334 resulting rows).

• “Sort”: Sort our grouped rows based on sum(score) DESC using top-N heapsort

• “Limit”: trim our results down to just the top 10.

Sidenote
Quicksort is in-place, so it makes sense they chose that for 31k rows. Top-N heapsort is a sort where you only keep the Top-N, which is significantly less complex. It only makes sense if you do a limit after your sort.

Can we do better? Sure! Seems like the slow part is getting stuff from the index. We have to read 118k buffers here and only 31k buffers to actually get the data (gee, I’m starting to suspect our buffers are exactly 10k rows each).

Why does the index need to read so many blocks? Well, it’s because of the shape of our index. Our index looks like this: (start_date, domain_id, brand_id, creative_id). This means if you laid our index entries out in order, they would look like this:
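
Conceptually, something like this (values are illustrative):

```
('2017-01-01', 0, 0, 0)   -- domain 0: ~1k entries for this day
('2017-01-01', 0, 0, 1)
...
('2017-01-01', 1, 0, 0)   -- then ~1M entries for the other 999 domains
('2017-01-01', 1, 0, 1)
...
('2017-01-02', 0, 0, 0)   -- before domain 0 shows up again
```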

So in every 1M index entries, only 1k of them are relevant to our query at hand. Thus, we can assume that we have to read a lot of different index blocks to collect our heap.

What happens if we make a new index organized by (domain_id, start_date)? Then our index blocks are significantly closer together and our b-tree doesn’t have to make keys for creative_id/brand_id.
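
That’s a one-liner:

```sql
CREATE INDEX ON brand_domain_creative (domain_id, start_date);
```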

Great Neptune’s trident, that was fast! Let’s see how things changed:

As expected, we got our rows out significantly faster (144ms). Interestingly, the DB switched from a GroupAggregate to a HashAggregate even though the only step that should have been affected was the index scan. Databases are mysterious beasts. In this case, it bought us 2ms. Huzzah!

Sidenote
Another common reason for slow Bitmap Index Scans is a lack of vacuuming. By default, Postgres keeps old versions of rows around for MVCC (multiversion concurrency control) and they can remain in your index as well. Vacuum frequently, kids.

## Performance tuning

Let’s try another query: what are brands that have appeared on the most domains?
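
Something like (the limit is arbitrary):

```sql
SELECT brand_id, count(DISTINCT domain_id) AS domains
FROM brand_domain_creative
WHERE start_date BETWEEN '2017-01-01' AND '2017-01-31'
GROUP BY brand_id
ORDER BY domains DESC
LIMIT 10;
```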

Yikes. This is going to be slow again. Our indices can only eliminate half of our dataset. What can we do?

One solution is to have a (start_date, brand, domain) index. Maybe this way, Postgres doesn’t need the actual rows to perform the query:
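
That index is:

```sql
CREATE INDEX ON brand_domain_creative (start_date, brand_id, domain_id);
```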

Programmer uses index! It’s not very effective!

Whoa, wtf? Why is it doing a sequential scan on the rows? Even an analyze doesn’t change this. If it used the index, it would go through 31 (days) * 334 (brands) * 1000 (domains) ≈ 10.4M index entries. That’s roughly 6 times fewer than going through ~60M rows!

Well, the difference is that index disk access is random whereas the sequential scan is, well, sequential. The optimizer gives random page reads a cost of 4 by default (random_page_cost), versus 1 for sequential page reads. And keep in mind that reading one index block involves reading a number of other index blocks along the way, because that’s how b-trees work.

But wait! My computer has an SSD! Shouldn’t they be weighted the same? Well, you can tune that in your Postgres config by changing random_page_cost to 1 :). Or, you can do it temporarily in your session:
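
In a session:

```sql
SET random_page_cost = 1;
```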

Wtf Rodrigo? You lied! It’s still doing a sequential scan!

Well, as a hack, if you set seq_page_cost to 9999, you can see what its index plan would look like:
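
Like so, then re-run the explain:

```sql
SET seq_page_cost = 9999;
```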

Huh, so the database doesn’t have a way to do an index scan and a GroupAggregate at the same time! So it’s forced to index scan all ~31M entries! Maybe there’s a good reason for that - database programming is hard because there are a ton of corner cases.

If you want to research it, pull requests are welcome ;).

## Pre-aggregation

First, let’s set our seq_page_cost back to normal:
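
That’s:

```sql
SET seq_page_cost = 1;  -- 1 is the default
```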

So how can we make the above query faster? Well, if it’s determined to do a sequential scan, we can simply give it fewer rows! In this query, we don’t need the creative column. So what if we removed it and rolled up all the creative scores into brands?
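
A rough sketch of the rollup (the brand_domain name comes from later in the post; the rest is my own):

```sql
CREATE TABLE brand_domain AS
SELECT start_date, domain_id, brand_id, sum(score) AS score
FROM brand_domain_creative
GROUP BY start_date, domain_id, brand_id;
```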

The contents of this table are ~3x smaller! So it makes sense that our query time went down by that much.

And what happens if we add an index?
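
For example (the column order here is an assumption):

```sql
CREATE INDEX ON brand_domain (start_date, brand_id, domain_id);
```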

Well, now we get our 10M-row method :).

But here’s one more optimization we can do! We reduced brand_domain_creative to brand_domain, and in our business logic we frequently run month-long queries. What happens if we roll up start_date to the nearest month?
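
Something like this (again, everything beyond the monthly_brand name is my own):

```sql
CREATE TABLE monthly_brand AS
SELECT date_trunc('month', start_date)::date AS start_date,
       domain_id,
       brand_id,
       sum(score) AS score
FROM brand_domain
GROUP BY 1, domain_id, brand_id;
```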

Now, this query turns into:
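
Roughly:

```sql
SELECT brand_id, count(DISTINCT domain_id) AS domains
FROM monthly_brand
WHERE start_date = '2017-01-01'
GROUP BY brand_id
ORDER BY domains DESC
LIMIT 10;
```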

Whoosh! The moral of the story here is: you can learn to be a database whisperer, but usually the simplest approach (pre-compute as much as you can) is the right answer.

However, every neat trick in CS comes at a cost and pre-aggregation is no exception. Specifically:

• Disk space

• Compute time

• Out of sync tables

One last thing: what if you have a query where you want to see the brands with the most domains from ‘2017-01-25’ to ‘2017-02-28’? In this case, the optimal query involves getting the daily rows from brand_domain and the monthly rows from monthly_brand:
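
A sketch of that query (the exact way the date range is split is illustrative):

```sql
SELECT brand_id, count(DISTINCT domain_id) AS domains
FROM (
    -- daily rows for the partial month
    SELECT brand_id, domain_id
    FROM brand_domain
    WHERE start_date BETWEEN '2017-01-25' AND '2017-01-31'
    UNION ALL
    -- pre-aggregated rows for the full month
    SELECT brand_id, domain_id
    FROM monthly_brand
    WHERE start_date = '2017-02-01'
) AS combined
GROUP BY brand_id
ORDER BY domains DESC
LIMIT 10;
```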

Pretty freakin’ fast. In Moat Pro, we have every cut imaginable, from brand to monthly_creative_hostname to alltime_tag_brand_creative. We have a query engine that chooses the right cuts and aggregates their results together.

# An Introduction to Linear Regression

In this post, we will be using a lot of Python. All of the code can be found here.

We’ll be using data from R2D3, which contains information about New York City and San Francisco real estate.

## Regression

One of the most important fields within data science, regression is about describing data. Simply put, we will try to draw the best line possible through our data.

First, an example. Let’s plot some New York City apartments’ prices by square foot:
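
The plotting throughout this post is ordinary matplotlib; a minimal sketch (the CSV path and column names are assumptions about the R2D3 data, not the original code):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names: price, sqft, in_sf (0 = New York City)
df = pd.read_csv("housing.csv")
ny = df[df.in_sf == 0]

plt.scatter(ny.sqft, ny.price)
plt.xlabel("square feet")
plt.ylabel("price")
plt.show()
```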

As you can see, there’s a bit of bunching up in the lower left corner because we have a bunch of outliers. Let’s say any apartment over 2M or 2000 square feet is an outlier (although I think this is our dataset showing its age…). Okay, so we can notice a couple of things about this data:

• There’s a minimum number of square feet and a minimum price. No one has an apartment smaller than 250 square feet and no one has a price lower than 300K. This makes sense.

• The data trends up and to the right, so we could more or less draw a line through it naively if we tried.

• As we go along the axes, the data spreads out. There must be other variables at work.

### Drawing our line

Let’s try to draw the best straight line we can through the data. This is called linear regression. So our line is going to have the familiar-looking formula:

\begin{align}
h_\theta(x) = \theta_0 + \theta_1 x
\end{align}

where $h_\theta(x)$ is our hypothesis $h$ of the price given a square footage of $x$. Our goal is to find our $\theta$s, which define the slope and y-intercept of our line.

So how do we define the best line? Through a cost function, which measures how far off a line is! The line with the least cost is the best line. Okay, so how do we choose a cost function? Well, there are a lot of different cost functions out there, but least squares error is perhaps the most common:

\begin{align}
J(\theta) = {1 \over 2} \sum_{i=0}^n (h_\theta(x_i) - y_i)^2
\end{align}

Essentially, for each data point $(x_i, y_i)$, we’re simply taking the difference between $y_i$ and $h_\theta(x_i)$ and squaring it. Then we’re summing them all together and halving it. Makes sense.

### Gradient Descent

Okay, so now the hard part. We have data and a cost function, so now we need a process for reducing the cost. To do this, we’ll use gradient descent. Gradient descent works by constantly updating each $\theta_j$ in the direction that will decrease $J(\theta)$. Or:

\begin{align}
\theta_j := \theta_j - \alpha {\partial \over \partial \theta_j} J(\theta)
\end{align}

where $\alpha$ is essentially “how far” down the slope you want to go at every step. Thus, with a little bit of math, we can find the derivative of $J(\theta)$ with respect to $\theta_j$ for an individual data point:

\begin{align}
{\partial \over \partial \theta_j} J(\theta) &= {\partial \over \partial \theta_j} {1 \over 2} (h_\theta(x) - y)^2 \\
&= 2 \cdot {1 \over 2} \cdot (h_\theta(x) - y) \cdot {\partial \over \partial \theta_j} (h_\theta(x) - y) \\
&= (h_\theta(x) - y) \cdot {\partial \over \partial \theta_j} (h_\theta(x) - y)
\end{align}

Now things begin to diverge for our $\theta$s:

\begin{align}
{\partial \over \partial \theta_0} J(\theta) &= (h_\theta(x) - y) \cdot {\partial \over \partial \theta_0} (\theta_0 + \theta_1 x - y) \\
&= (h_\theta(x) - y)
\end{align}

\begin{align}
{\partial \over \partial \theta_1} J(\theta) &= (h_\theta(x) - y) \cdot {\partial \over \partial \theta_1} (\theta_0 + \theta_1 x - y) \\
&= (h_\theta(x) - y) \cdot x
\end{align}

So now we can apply our derivatives to the original algorithm…

\begin{align}
\theta_0 &:= \theta_0 + \alpha (y - h_\theta(x)) \\
\theta_1 &:= \theta_1 + \alpha (y - h_\theta(x)) \cdot x
\end{align}

However, this only applies to one data point. How do we apply this to multiple data points?

### Batch Gradient Descent

The idea behind batch gradient descent is simple: go through your batch, collect the per-point derivatives of $J(\theta)$, and combine them into a single update:

\begin{align}
\theta_0 &:= \theta_0 + \alpha \sum_{i=0}^m (y_i - h_\theta(x_i)) \\
\theta_1 &:= \theta_1 + \alpha \sum_{i=0}^m (y_i - h_\theta(x_i)) \cdot x_i
\end{align}

This means that you only change your thetas at the end of each batch. The results are pretty good! The code completes in 316 iterations, with an error of 65662214541.7, and estimates $\theta_0$ to be 80241.5458922 and $\theta_1$ to be 1144.09527519.

### Stochastic Gradient Descent

Another way to train our data is to apply our changes after each data point individually. This is stochastic gradient descent:

\begin{align}
\theta_0 &:= \theta_0 + \alpha (y - h_\theta(x)) \\
\theta_1 &:= \theta_1 + \alpha (y - h_\theta(x)) \cdot x
\end{align}

It has one big benefit: we don’t have to go through the entire batch. This is hugely important for problems with large, large datasets. You’ll see it used frequently in deep learning.

It’s worth noting that this took longer than batch descent since our dataset is so small. Also, we had to loosen up our definition of the “right answer” since we oscillate around the minimum a lot more. We would oscillate a lot less if it weren’t for the randomness!

There are in-betweens like mini-batch gradient descent, where you randomly split your large dataset into small batches. This is hugely useful if you’re doing regression across multiple machines!
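
Here’s a minimal sketch of the two scalar update loops described above (the learning rate, iteration count, and stopping logic are my own assumptions, not the original post’s code):

```python
import random

def batch_gradient_descent(points, alpha=1e-7, iterations=500):
    """Batch GD for h(x) = theta0 + theta1 * x. `points` is a list of (x, y)."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Sum the per-point updates, then apply them once at the end of the batch.
        d0 = sum(y - (theta0 + theta1 * x) for x, y in points)
        d1 = sum((y - (theta0 + theta1 * x)) * x for x, y in points)
        theta0 += alpha * d0
        theta1 += alpha * d1
    return theta0, theta1

def stochastic_gradient_descent(points, alpha=1e-7, iterations=500):
    """Update the thetas after every individual point instead of once per batch."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        random.shuffle(points)
        for x, y in points:
            error = y - (theta0 + theta1 * x)
            theta0 += alpha * error
            theta1 += alpha * error * x
    return theta0, theta1
```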
### Multivariate Linear Regression

Okay, now things start to get fun. At the moment, we’re dealing with one input dimension (AKA “simple” linear regression), which is great for getting started, but most datasets have more than one dimension. We can generalize our algorithm using linear algebra. First, let’s say that we have $n$ dimensions. When we treat every input $x$ as a vector, we get:

$$\overrightarrow{x}^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ x^{(i)}_2 \\ \vdots \\ x^{(i)}_n \end{bmatrix}$$

We can stack our $m$ training vectors into a matrix:

$$X = \begin{bmatrix} — (\overrightarrow{x}^{(1)})^T — \\ — (\overrightarrow{x}^{(2)})^T — \\ \vdots \\ — (\overrightarrow{x}^{(m)})^T — \end{bmatrix}$$

And our answers, $y$, can be a vector as well:

$$\overrightarrow{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

And our parameters too:

$$\overrightarrow{\theta} = \begin{bmatrix} \theta_{1} \\ \theta_{2} \\ \vdots \\ \theta_{n} \end{bmatrix}$$

Note that we’re losing our intercept parameter $\theta_0$. If you really want it, you can simply add a “1” column to each data point; that accomplishes the same thing and allows us to simplify our math and code a bit. So our hypothesis for an individual data point looks like:

\begin{align}
h_\theta(\overrightarrow{x}^{(i)}) = (\overrightarrow{x}^{(i)})^T \cdot \overrightarrow{\theta}
\end{align}

Going back to our cost function, we can now put it in matrix form:

\begin{align}
J(\theta) &= {1 \over 2} \sum_{i=1}^m (h_\theta(\overrightarrow{x}^{(i)}) - y^{(i)})^2 \\
&= {1 \over 2} (h_\theta(X) - \overrightarrow{y})^T(h_\theta(X) - \overrightarrow{y}) \\
&= {1 \over 2} (X \cdot \overrightarrow{\theta} - \overrightarrow{y})^T(X \cdot \overrightarrow{\theta} - \overrightarrow{y})
\end{align}

So if we wanted to find the derivative of $J(\theta)$ now (aka $\nabla_{\theta}J(\theta)$), we’d have to do some funky math. If you want to see how it can be derived, I recommend reading page 11 of Andrew Ng’s lecture notes on Linear Regression. Skipping to the answer, we get:

\begin{align}
\nabla_{\theta}J(\theta) = X^T(X \cdot \overrightarrow{\theta} - \overrightarrow{y})
\end{align}

And thus, we can write the update step for our $\overrightarrow{\theta}$:

\begin{align}
\overrightarrow{\theta} := \overrightarrow{\theta} + \alpha X^T(\overrightarrow{y} - X \cdot \overrightarrow{\theta})
\end{align}

This looks a lot like our step for $\theta_1$ from not too long ago!

\begin{align}
\theta_1 := \theta_1 + \alpha (y - h_\theta(x)) x
\end{align}

### Programming this in numpy

We get a slightly different answer here than in our old batch gradient descent code because our y-intercept has a different learning rate. If we gave it enough repetitions, it would eventually get near the same area. Of course, there are plenty of high-level ML libraries that do this stuff for you! However, it’s fun to understand what’s happening under the hood.

### Things to worry about

Multivariable linear regression with gradient descent can have a lot of complications. For example, local minima: gradient descent might accidentally think the minimum is in a saddle point. There’s a ton of interesting papers about this. Training multiple times with different starting parameters is one way around it.

Overfitting is also an issue. You can avoid it by splitting your data into a large “training” set and a large “testing” set. That’s standard procedure in most data science problems.

### One last method: using the derivative

As many of you know, if you set the derivative of a curve to 0, you’ll find either a local maximum or a local minimum. Since our cost function does not have an upper limit, we know that the point where the derivative is 0 is a minimum. So we can set our derivative $\nabla_{\theta}J(\theta)$ to zero:

\begin{align}
\nabla_{\theta}J(\theta) &= X^T(X\theta - \overrightarrow{y}) \\
0 &= X^T(X\theta - \overrightarrow{y}) \\
X^TX\theta &= X^T\overrightarrow{y} \\
\theta &= (X^TX)^{-1}X^T\overrightarrow{y}
\end{align}

And we can throw that in our code, and it totally works! You may ask, “but why did we learn all about the gradient descent stuff then?” Well, you’ll need it for things that aren’t as straightforward as linear regression, like deep neural networks.
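
For reference, a tiny numpy sketch of the two approaches above (the vectorized gradient step and the closed-form normal equation); the names are mine, not the original post’s:

```python
import numpy as np

def gradient_step(X, y, theta, alpha):
    """One vectorized step: theta := theta + alpha * X^T (y - X theta)."""
    return theta + alpha * X.T.dot(y - X.dot(theta))

def normal_equation(X, y):
    """Closed-form solution: theta = (X^T X)^-1 X^T y."""
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))

# X should include a column of ones if you want an intercept term.
```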
### To learn more

• Andrew Ng’s notes on Linear Regression.

• ML @ Berkeley’s Machine Learning Crash Course.

# How to Docker For Great Good

## What is Docker?

Docker is a container system. It allows you to run code in a predefined environment that will Run Anywhere™.

So how is it different from a virtual machine? To start a VM, you allocate resources (X bytes of memory, Y CPU cores, etc.) and these allocations are absolute. That is, if the VM only needs half its allocated memory or just a few CPU cycles, you can’t remove/add them dynamically. That creates a lot of waste! It means that your services always use their maximum number of resources. In addition, you have the overhead of emulating their operating system and their hardware.

For the most part, as developers, we really want only a few things:

• Make sure processes can’t affect the host operating system. We want our containers to be a jail.

• Make sure processes can’t affect one another. So give me isolated memory addresses and file systems.

• Give me hard memory/CPU limits, so a process only uses what it needs up to a certain limit and doesn’t affect other processes.

That’s containers in a nutshell. Essentially, Docker provides this by using:

• cgroups: a Linux kernel feature that isolates and limits the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.

• apparmor: a Linux security module that allows you to restrict per-program network access, raw socket access, or file path access.

• aufs (Another Union File System): imagine a file system that works off of diffs. Every single change is just a diff layered on top of an existing file system. It allows you to “fork” other containers very easily.

• Many more cool Linux modules and features.

This is why Docker originally could only run on Ubuntu. Only recently can you run it on OSX/Windows without VirtualBox!

Downsides:

• Difficult to get it working natively on host OSs

• Security issues

• Limited OS choice. (All official Docker images use the Ubuntu distro. Soon they’ll use Alpine Linux.)

## Actually using Docker

Okay, so first you need to install Docker. This installs the Docker daemon, which controls all of the containers running on your computer. Assuming your Docker daemon is running, you can now pull the base Ubuntu image (the commands for this whole section are collected in a sketch at the end of it).

Images are snapshots of a filesystem. You can push/pull images from a central repository. Docker, being a private company, made its own servers the default repo for pulling. You’ll learn how to push/pull from private container repos later.

Now we can use the base Ubuntu image to run a command like echo 'hello'. That was really fast (slightly longer than a second for me)! So what happened?

• The Docker CLI parsed our command and realized that we wanted to run echo 'hello' on the Ubuntu image. It passed that information to the Docker daemon.

• The Docker daemon started a process with all of the voodoo magic that isolates it.

• It made sure that the process had access to a file system that we pulled (the Ubuntu image).

• That process ran our echo 'hello' command.

We can run any other bash command! For example, we could use ls to explore the filesystem. Neat, right? So it feels like an actual Linux virtual machine! What if we want to actually use bash within the container? We can use the -i (interactive, keeps STDIN open) and -t (pseudo-tty) options.

## Pulling a prebuilt image and advanced options

Now we can have some fun. First, let’s pull the official Docker image for Redis. Then we can run Redis, list our currently running containers with docker ps, and stop a running container with docker stop. It’s worth mentioning that Docker keeps old containers around (docker ps -a will show them)! This means two things:

1. Docker takes up more and more space. Use docker rm $(docker ps -a -q).

2. If you assign a name to docker containers during run, you might see a name conflict. So use --rm
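
For reference, the commands used throughout this section look roughly like this (container names are my own):

```bash
docker pull ubuntu                      # pull the base image
docker run ubuntu echo 'hello'          # run a one-off command
docker run -i -t ubuntu bash            # get an interactive shell
docker pull redis                       # pull the official Redis image
docker run -d --name my-redis redis     # run Redis in the background
docker ps                               # list running containers
docker stop my-redis                    # stop it
docker ps -a                            # old containers stick around...
docker rm $(docker ps -a -q)            # ...so clean them up
docker run --rm ubuntu echo 'hello'     # or have them removed automatically
```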

## Making our own Docker container

Okay, now things can start to get fun. Let’s say we want to make a Docker container that runs a little flask app. First, let’s make a directory called test-app and make ourselves a little app:
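
Something like this minimal Flask app (the route and port are my own choices):

```python
# test-app/app.py
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
    return 'Hello from Docker!'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```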

Next, let’s make a Dockerfile. Dockerfiles are files that define a container. They use some set commands defined by Docker:
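
A minimal sketch (the base image tag and paths are assumptions):

```dockerfile
FROM python:2.7
RUN pip install flask
COPY app.py /src/app.py
WORKDIR /src
CMD ["python", "app.py"]
```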

It’s worth mentioning that we’re using the official “Python” image as our base. This installs pip and other goodies for us.

So now we can build our actual Docker image:
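
For example:

```bash
docker build -t test-app .
```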

You’ll notice that after each command, it makes an intermediate container. Remember: Docker uses AUFS. As you customize your container’s file system, your changes are layered on top of this base image.

Also, we ran with a -t option. This “tags” our image so we can reference it more easily. We can see it here:
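
With:

```bash
docker images
```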

Now we can run it:
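
For example, assuming the app listens on port 5000:

```bash
docker run -p 5000:5000 test-app
```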

To make a change, simply run docker build again! But that’s an annoying dev cycle…

Okay, let’s say we want our Docker container to link to our host file system:
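
Using a volume, something like:

```bash
docker run -p 5000:5000 -v $(pwd):/src test-app
```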

Ka-pow! Volumes are layered on top of a container, which means we overwrite the previous version of app.py.

Now let’s try to make our Flask app talk to Redis. First let’s run a redis container:
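
For example:

```bash
docker run -d --name redis redis
```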

Cool, now let’s change app.py to talk to Redis. First, add a pip install for the redis client to our Dockerfile:
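
e.g. in the Dockerfile:

```dockerfile
RUN pip install flask redis
```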

Then we build it again:

Then we run our Docker container while linking to the existing redis container:
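
Something like:

```bash
docker run -p 5000:5000 --link redis:redis test-app
```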

Now we can change our app.py to talk to redis:
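
A sketch of the change (the key name and response format are my own):

```python
import redis

# "redis" resolves to the linked container's hostname
r = redis.StrictRedis(host='redis', port=6379)

@app.route('/')
def hello():
    count = r.incr('hits')
    return 'Hello! This page has been seen %s times.' % count
```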

## docker-compose

Some people realized that all of this Docker stuff could be made simpler, so they made a Python library called “fig” to do it. It was so successful that it became part of Docker as “Docker Compose”.

Essentially, it allows you to run several Docker containers at once:
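
A minimal docker-compose.yml for our app might look like this (service names are my own):

```yaml
version: '2'
services:
  web:
    build: .
    ports:
      - "5000:5000"
    volumes:
      - .:/src
    links:
      - redis
  redis:
    image: redis
```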

Now we can run docker-compose up and everything will be running.

## Docker in production

Docker containers are powerful for development, but they’re a really powerful idea for deployment as well for a couple reasons:

1. Allows you to make micro-services super easily

2. Very easy clustering (docker-swarm, ECS, Kubernetes)

3. Easy ops: blue-green deployment and rollbacks are easy

4. Autoscaling

There’s still no “set” way of doing things, so here’s an example of a task definition in ECS:

```json
{
  "family": "logstash-production",
  "networkMode": "bridge",
  "containerDefinitions": [
    {
      "name": "logstash",
      "image": "317260457025.dkr.ecr.us-east-1.amazonaws.com/search/logstash:1.0.1",
      "cpu": 512,
      "memory": 1000,
      "essential": true,
      "command": ["-f", "/src/logstash.production.conf"],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "logstash-production",
          "awslogs-region": "us-east-1"
        }
      }
    }
  ],
  "placementConstraints": [],
  "volumes": []
}
```

And the workflow is simply:

• Build your image: docker build -t $ECR_URL/<my_role_name>:VERSION .

• Push your image to ECR: docker push $ECR_URL/<my_role_name>:VERSION

• Update your task definition: aws ecs register-task-definition --cli-input-json file://production.json

• Tell your AWS cluster to use the new task definition (either in the UI or the CLI).

It comes with a couple headaches:

• Making security definitions and IAM roles

• Making your actual cluster instances (different IAM role for this!)

• Why on earth aren’t there CNAMEs for ECR urls?

Odds are, you know about Python lists and dicts. But there are a couple of other really useful data structures out there.

### Tuples

Most of you have heard of this. Tuples are immutable collections - once you make them, you can’t change them.

Why are tuples useful? Two reasons:

1. Immutability is good sometimes

2. It saves space.

### Sets

Sets are kind of like dictionaries without values. They work the same way under the hood (with a hashmap) but keys don’t store anything.

It’s very useful to keep track of unique things that happen.

You can also use frozenset, which is immutable like a tuple.
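
For example:

```python
seen = set()
for domain in ['espn.com', 'cnn.com', 'espn.com']:
    seen.add(domain)

print(len(seen))        # 2 -- duplicates are ignored
frozenset(seen)         # immutable version, usable as a dict key
```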

### collections

The collections library has a bunch of useful ones:

• namedtuple() - make a tuple with predefined named fields. Great for things like SQLAlchemy rows.
• deque - a double-ended queue with fast appends and pops on both ends
• Counter - counts things. Kind of like a set for counting.
• OrderedDict - dict that remembers order
• defaultdict - you can define a default thing to get in your dictionary
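
A few of them in action:

```python
from collections import namedtuple, Counter, defaultdict

Row = namedtuple('Row', ['domain', 'score'])
row = Row('espn.com', 42)               # access fields as row.domain, row.score

Counter('mississippi').most_common(2)   # the two most common letters with counts

groups = defaultdict(list)
groups['sports'].append('espn.com')     # missing keys default to an empty list
```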

### Generators

They’re pretty great:
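
A tiny example:

```python
def countdown(n):
    while n > 0:
        yield n          # pause here and hand back a value
        n -= 1

for i in countdown(3):
    print(i)             # 3, 2, 1
```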

Generators resume from where they left off when they’re called again. If the generator function actually returns, it raises StopIteration and the generator is done.

They’re good because they don’t require much memory. That’s why they recommend you use xrange instead of range.

This is cool:

This is cooler:

This is coolest:

## Comprehensions

List/dict/tuple/set comprehensions are amazing. Want to make a simple list?
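
For example:

```python
squares = [x * x for x in range(10)]
```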

Easy. And you can add filters:
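
e.g.:

```python
even_squares = [x * x for x in range(10) if x % 2 == 0]
```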

And do double loops!
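
e.g.:

```python
pairs = [(x, y) for x in range(3) for y in range(3)]
```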

Neat-o gang. It saves a lot of LOC. I use it frequently in API calls to generate JSON blobs.

See? Now they don’t look so scary.

## Common Python gotchas

Credit for most of this section goes to the Python Guide.

### Mutable default arguments

This is a classic Python interview question by pedantic people. What do you think will happen when I run this?
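
The classic example looks something like this:

```python
def append_to(element, to=[]):
    to.append(element)
    return to

print(append_to(1))   # [1]
print(append_to(2))   # [1, 2]  -- surprise!
```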

Huh?

Not what you expected, right? Remember that default argument values are evaluated once, when the function is defined, not every time it’s called. So the same list sticks around between calls.

Instead of that, you should always do:
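
i.e.:

```python
def append_to(element, to=None):
    if to is None:
        to = []          # a fresh list on every call
    to.append(element)
    return to
```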

Don’t use mutable objects as default values (dicts, lists, etc). You can use strings and ints because they aren’t mutable.

### Closures are weird

Here, we make lambdas in the comprehension:
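
The snippet in question looks something like this:

```python
funcs = [lambda: i for i in range(0, 10, 2)]
print([f() for f in funcs])   # [8, 8, 8, 8, 8], not [0, 2, 4, 6, 8]
```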

You’d expect the result to be 0,2,4,6,8 but you actually get 8,8,8,8,8! Why? Because each lambda looks up i when it’s called, not when it’s defined, and by then we’ve left the loop with i as 8.

After all, you’d expect this:

To print 2, correct?

Instead of this, use default values:
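
i.e.:

```python
funcs = [lambda i=i: i for i in range(0, 10, 2)]
print([f() for f in funcs])   # [0, 2, 4, 6, 8] -- i is captured as a default value
```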

### Classes inheritance is weird

This one comes from Martin Chikillian.

Sure.

K.

Wat?
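
For reference, the classic example goes something like this:

```python
class A(object):
    x = 1

class B(A):
    pass

class C(A):
    pass

print(A.x, B.x, C.x)   # 1 1 1  -- sure
B.x = 2
print(A.x, B.x, C.x)   # 1 2 1  -- k
A.x = 3
print(A.x, B.x, C.x)   # 3 2 3  -- wat? C.x changed too!
```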

When you reference an object’s attribute, it’ll first try to get it from its own definition. After that, it’ll go up and get it from its parents. This is called “Method Resolution Order (MRO)”.

### Local/global variables are weird

The moment you assign something, it creates it in the local scope. So a += 1 turns into a = a + 1 and it doesn’t know what the second a is.
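
For example:

```python
a = 10

def increment():
    a += 1        # UnboundLocalError: the assignment makes `a` local to the function
    return a

increment()
```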

To do this properly:
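
Use the global keyword:

```python
a = 10

def increment():
    global a
    a += 1
    return a
```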

### is vs ==

Remember: is will only be true if two things are the same object. == will call the object’s __eq__ method, which you can override:
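
For example:

```python
class Money(object):
    def __init__(self, cents):
        self.cents = cents

    def __eq__(self, other):
        return self.cents == other.cents

Money(100) == Money(100)   # True  -- __eq__ compares values
Money(100) is Money(100)   # False -- two different objects
```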

There are a ton of other magic methods like __lt__, __le__, and __add__ that affect other operators like <, <=, and +.

## iPython is great

It remembers your history, does tab completion, gives information about objects, and has magic pasting.

It even has magic functions like %timeit:
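
For example:

```python
%timeit sum(range(1000))   # runs the statement many times and reports the timing
```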

Or %run, which runs external scripts.

## Context Managers

Context managers are cool if you need to make sure something gets cleaned up:
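
The classic example is a file handle:

```python
with open('numbers.txt', 'w') as f:   # the file is closed even if an exception is raised
    f.write('42\n')
```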

You can do other cool things with contextlib:
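
For example, @contextmanager lets you write one as a generator:

```python
from contextlib import contextmanager

@contextmanager
def tag(name):
    print('<%s>' % name)
    yield                   # the body of the with-block runs here
    print('</%s>' % name)

with tag('h1'):
    print('hello')          # prints <h1>, hello, </h1>
```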

## List slicing and dicing

Python’s slice notation is the bomb. Get the last element by using -1: mylist[-1].

You can also get slices this way: mylist[0:5] or mylist[:5] or mylist[5:].

You can even define the “jumps”: range(10)[::2] gets the even numbers. A good trick to reverse a list is mylist[::-1] (although you should probably use reversed() instead!).

## args and kwargs and splat

* is called the “splat” operator. Use it for making functions that take multiple things:
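
For example:

```python
def log_all(*args, **kwargs):
    print(args)      # a tuple of positional arguments
    print(kwargs)    # a dict of keyword arguments

log_all(1, 2, name='world')
```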

Or use it for the opposite:
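
i.e. unpacking:

```python
args = [1, 2]
kwargs = {'name': 'world'}
log_all(*args, **kwargs)   # same as log_all(1, 2, name='world')
```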

## itertools.chain

I love itertools. Very handy library. Combine iterables:
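
For example:

```python
from itertools import chain

list(chain([1, 2], (3, 4), 'ab'))   # [1, 2, 3, 4, 'a', 'b']
```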

Use group by:
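
For example (remember that groupby only groups consecutive items, so sort by the key first):

```python
from itertools import groupby

rows = [('sports', 'espn.com'), ('sports', 'nba.com'), ('news', 'cnn.com')]
for category, group in groupby(rows, key=lambda r: r[0]):
    print(category, [domain for _, domain in group])
```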

Combinations and permutations (great for tests):
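
For example:

```python
from itertools import combinations, permutations

list(combinations('abc', 2))   # [('a', 'b'), ('a', 'c'), ('b', 'c')]
list(permutations('ab'))       # [('a', 'b'), ('b', 'a')]
```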

## zip

Handy for combining lists for iterating:
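
For example:

```python
names = ['espn.com', 'cnn.com']
scores = [10, 20]
for name, score in zip(names, scores):
    print(name, score)
```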

## Decorators are great

Wrap your functions! Here’s a decorator that makes sure a view is only useable if the user is logged in:
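
A sketch (the is_logged_in check and redirect helper are hypothetical, not from a real framework):

```python
from functools import wraps

def login_required(view):
    @wraps(view)
    def wrapper(*args, **kwargs):
        if not is_logged_in():        # hypothetical session check
            return redirect('/login')  # hypothetical redirect helper
        return view(*args, **kwargs)
    return wrapper

@login_required
def dashboard():
    return 'secret stuff'
```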

## Best Python -m tools:

A lot of handy tools ship with Python. -m runs a module as a script. Here’s a good list. Some highlights:

### SimpleHTTPServer

Handy for making a quick HTTPServer:
python -m SimpleHTTPServer 8000

### json.tool

Handy for pretty printing:
echo '{"greeting": "hello", "name": "world"}' | python -m json.tool

### gzip

python -m gzip [file] # compress
python -m gzip -d [file] # decompress

### antigravity

python -m antigravity

### this

python -m this

# Hello World

I’ve started and abandoned many blogs in the past. Over time, most of them fizzled out because I thought that I didn’t have anything worth writing. If it wasn’t original, I reasoned, it wouldn’t be worth my time to write or worth the time of others to read.

Turns out, that’s a crappy approach to writing because you never get better that way. Quantity generates quality in the long run.

So, I’ve created this blog with a completely different goal in mind: write as often as possible, even if I’m writing about something pretty stupid. I think there’ll be other benefits:

1. Hopefully, I’ll learn to become more succinct in real life. I ramble a lot when speaking and it doesn’t make me the best communicator.

2. In learning to write more clearly, maybe I’ll learn to think more clearly. It’s possible that good communication is just a symptom of a well-organized mind.

3. Having a written record of my thoughts and opinions sounds like it’ll be nice to look back on in a few years [1].

4. Writing teaches empathy. The best writers make sure there is no ambiguity by placing themselves in the position of the reader. Understanding others is a skill that makes you more successful in, I believe, almost every field imaginable.

Let’s see how far this goes!

[1] As a kid, I used to hate being photographed. It seemed forced to me. However, as I’ve grown older, I often wish I had more pictures.