In this post, we will be using a lot of Python. All of the code can be found here.
We’ll be using data from R2D3, which contains real estate information about New York City and San Francisco.
Regression
One of the most important fields within data science, regression is about describing data. Simply put, we will try to draw the best line possible through our data.
First, an example. Let’s plot some New York City apartments’ prices by square foot:


As you can see, there’s a bit of bunching up in the lower-left corner because we have a bunch of outliers. Let’s say any apartment over $2M or 2000 square feet is an outlier (although, I think this is our dataset showing its age…).


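The original plots aren’t reproduced here, but a minimal sketch of the filtering and plotting might look like this. The inline DataFrame stands in for the real R2D3 CSV, and the `sqft`/`price` column names are assumptions, not the dataset’s actual schema:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Stand-in for the R2D3 data; in practice you'd pd.read_csv the real file.
df = pd.DataFrame({
    "sqft": [500, 900, 1500, 2500, 800],
    "price": [450_000, 700_000, 1_200_000, 3_000_000, 600_000],
})

# Drop the outliers: anything over $2M or 2000 square feet.
df = df[(df["price"] <= 2_000_000) & (df["sqft"] <= 2000)]

ax = df.plot.scatter(x="sqft", y="price")
ax.set_xlabel("Square feet")
ax.set_ylabel("Price ($)")
plt.savefig("apartments.png")
```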
Okay, so we can notice a couple things about this data:
There’s a minimum number of square feet and price. No one has an apartment smaller than 250 square feet and no one has a price lower than 300K. This makes sense.
The data seems to trend upward, so we could more or less draw a line through it naively if we tried.
As we go along the axes, the data spreads. There must be other variables at work.
Drawing our line
Let’s try to draw the best straight line we can through the data. This is called linear regression. So our line is going to have the familiar-looking formula:
$$
\begin{align}
h_\theta(x) = \theta_0 + \theta_1 x
\end{align}
$$
Where $h_\theta(x)$ is our hypothesis $h$ of the price given a square footage of $x$. Our goal is to find our $\theta$ s, which define the slope and y-intercept of our line.
So how do we define the best line? Through a cost function, which measures how off a line is! The line with the least cost is the best line.
Okay, so how do we choose a cost function? Well, there are a lot of different cost functions out there, but least squares error is perhaps the most common:
$$
\begin{align}
J(\theta) = {1 \over 2} \sum_{i=0}^n (h_\theta(x_i) - y_i)^2
\end{align}
$$
Essentially, for each data point $(x_i, y_i)$, we’re simply taking the difference between $y_{i}$ and $h_\theta(x_i)$ and squaring it. Then we’re summing them all together and halving it. Makes sense.
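The hypothesis and cost above can be sketched in plain Python (the function names here are my own, not from the post’s linked code):

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, points):
    """J(theta): half the sum of squared errors over (x, y) pairs."""
    return 0.5 * sum((hypothesis(theta0, theta1, x) - y) ** 2
                     for x, y in points)
```

A line that passes exactly through the data has zero cost: `cost(0, 1, [(1, 1), (2, 2)])` returns `0.0`.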
Gradient Descent
Okay, so now the hard part. We have data and a cost function, so now we need a process for reducing cost. To do this, we’ll use gradient descent.
Gradient descent works by repeatedly updating each $\theta_j$ in the direction that will diminish $J(\theta)$. Or:
$$
\begin{align}
\theta_j := \theta_j - \alpha {\delta \over \delta \theta_j} J(\theta)
\end{align}
$$
Where $\alpha$ is essentially “how far” down the slope you want to go at every step.
Thus, with a little bit of math, we can find the derivative of our $\theta_j$ with respect to $J(\theta)$ for an individual data point:
$$
\begin{align}
{\delta \over \delta \theta_j} J(\theta) &= {\delta \over \delta \theta_j} {1 \over 2} (h_\theta(x) - y)^2 \cr
&= 2 \cdot {1 \over 2} \cdot (h_\theta(x) - y) \cdot {\delta \over \delta \theta_j} (h_\theta(x) - y) \cr
&= (h_\theta(x) - y) \cdot {\delta \over \delta \theta_j} (h_\theta(x) - y)
\end{align}
$$
Now things begin to diverge for our $\theta$s:
$$
\begin{align}
{\delta \over \delta \theta_0} J(\theta) &= (h_\theta(x) - y) \cdot {\delta \over \delta \theta_0} (\theta_0 + \theta_1 x - y) \cr
&= (h_\theta(x) - y)
\end{align}
$$
$$
\begin{align}
{\delta \over \delta \theta_1} J(\theta) &= (h_\theta(x) - y) \cdot {\delta \over \delta \theta_1} (\theta_0 + \theta_1 x - y) \cr
&= (h_\theta(x) - y) \cdot x
\end{align}
$$
So now we can apply our derivatives to the original algorithm…
$$
\begin{align}
\theta_0 := \theta_0 + \alpha (y - h_\theta(x)) \cr
\theta_1 := \theta_1 + \alpha (y - h_\theta(x)) \cdot x
\end{align}
$$
However, this only applies to one data point. How do we apply this to multiple data points?
Batch Gradient Descent
The idea behind batch gradient descent is simple: go through your batch and compute ${\delta \over \delta \theta_j} J(\theta)$ at every data point. Then sum them up (any $1 \over m$ averaging factor can be folded into $\alpha$):
$$
\begin{align}
\theta_0 := \theta_0 + \alpha \sum_{i=0}^m (y_i - h_\theta(x_i)) \cr
\theta_1 := \theta_1 + \alpha \sum_{i=0}^m (y_i - h_\theta(x_i)) \cdot x_i
\end{align}
$$
This means that you only change your theta at the end of the batch:


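The linked code isn’t shown inline, but a minimal batch version might look like this (the learning rate, iteration count, and the $1 \over m$ averaging are my assumptions, not the post’s exact settings):

```python
def batch_gradient_descent(points, alpha=0.1, iterations=1000):
    """Fit theta0 + theta1 * x, updating theta once per full pass."""
    theta0, theta1 = 0.0, 0.0
    m = len(points)
    for _ in range(iterations):
        # Accumulate the gradient over the whole batch first...
        grad0 = sum(y - (theta0 + theta1 * x) for x, y in points)
        grad1 = sum((y - (theta0 + theta1 * x)) * x for x, y in points)
        # ...then apply a single averaged update.
        theta0 += alpha * grad0 / m
        theta1 += alpha * grad1 / m
    return theta0, theta1
```

On toy data that lies exactly on $y = 2x$, this recovers a slope near 2 and an intercept near 0.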
This results in:
Pretty good! The code converges in 316 iterations, with an error of 65662214541.7, and estimates $\theta_0$ to be 80241.5458922 and $\theta_1$ to be 1144.09527519.
Stochastic Gradient Descent
Another way to train our data is to apply our changes to each data point individually. This is stochastic gradient descent:
$$
\begin{align}
\theta_0 := \theta_0 + \alpha (y - h_\theta(x)) \cr
\theta_1 := \theta_1 + \alpha (y - h_\theta(x)) \cdot x
\end{align}
$$
It has one big benefit: we don’t have to go through the entire batch. This is hugely important for problems with large, large datasets. You’ll see it used frequently in deep learning.


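A sketch of the stochastic version (the per-epoch shuffling and the learning rate are my own choices, not necessarily the post’s):

```python
import random

def stochastic_gradient_descent(points, alpha=0.01, epochs=100, seed=0):
    """Update theta after every individual data point."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the points in a random order each epoch
        for x, y in data:
            error = y - (theta0 + theta1 * x)
            theta0 += alpha * error
            theta1 += alpha * error * x
    return theta0, theta1
```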
It’s worth noting that this took longer than batch descent since our dataset is so small. Also, we had to loosen our convergence criterion, since the error oscillates a lot more. We would oscillate a lot less if it weren’t for the randomness!
There are in-betweens like mini-batch gradient descent, where you randomly split your large dataset into small batches. This is hugely useful if you’re doing regression across multiple machines!
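A mini-batch sketch sits between the two (the batch size and learning rate here are arbitrary choices for illustration):

```python
import random

def minibatch_gradient_descent(points, alpha=0.05, epochs=500,
                               batch_size=2, seed=0):
    """Shuffle the data each epoch, then update theta once per small batch."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad0 = sum(y - (theta0 + theta1 * x) for x, y in batch)
            grad1 = sum((y - (theta0 + theta1 * x)) * x for x, y in batch)
            theta0 += alpha * grad0 / len(batch)
            theta1 += alpha * grad1 / len(batch)
    return theta0, theta1
```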
Multivariate Linear Regression
Okay, now things start to get fun. At the moment, we’re dealing with one input dimension (AKA “Simple” Linear Regression), which is great for getting started, but most datasets have more than one dimension. We can generalize our algorithm using linear algebra.
First, let’s say that we have $n$ dimensions. Thus, when we treat every input $x$ as a vector, we get:
$$
\overrightarrow{x}^{(i)} = \begin{bmatrix}
x^{(i)}_1 \cr
x^{(i)}_2 \cr
\vdots \cr
x^{(i)}_n \cr
\end{bmatrix}
$$
We can stack our $m$ training vectors into a matrix:
$$
X = \begin{bmatrix}
— (\overrightarrow{x}^{(1)})^T — \cr
— (\overrightarrow{x}^{(2)})^T — \cr
\vdots \cr
— (\overrightarrow{x}^{(m)})^T — \cr
\end{bmatrix}
$$
And our answers, $y$, can be a vector as well:
$$
\overrightarrow{y} = \begin{bmatrix}
y^{(1)} \cr
y^{(2)} \cr
\vdots \cr
y^{(m)} \cr
\end{bmatrix}
$$
And our parameters as well:
$$
\overrightarrow{\theta} = \begin{bmatrix}
\theta_{1} \cr
\theta_{2} \cr
\vdots \cr
\theta_{n} \cr
\end{bmatrix}
$$
Note that we’re losing our intercept parameter $\theta_0$ here. If you really want it, you can simply add a “1” column to each data point; that accomplishes the same thing and allows us to simplify our math and code a bit.
So our hypothesis for an individual data point looks like:
$$
\begin{align}
h_\theta(\overrightarrow{x}^{(i)}) = (\overrightarrow{x}^{(i)})^T \cdot \overrightarrow{\theta}
\end{align}
$$
So going back to our cost function, which we can put in matrix form:
$$
\begin{align}
J(\theta) &= {1 \over 2} \sum_{i=0}^n (h_\theta(\overrightarrow{x}^{(i)}) - y^{(i)})^2 \cr
&= {1 \over 2} (h_\theta(X) - \overrightarrow{y})^T(h_\theta(X) - \overrightarrow{y}) \cr
&= {1 \over 2} (X \cdot \overrightarrow{\theta} - \overrightarrow{y})^T(X \cdot \overrightarrow{\theta} - \overrightarrow{y})
\end{align}
$$
So if we wanted to find the derivative of $J(\theta)$ now (aka $\nabla_{\theta}J(\theta)$), we’d have to do some funky math. If you want to read how it can be derived, I recommend reading page 11 of Andrew Ng’s lecture notes on Linear Regression.
Skipping to the answer, we get:
$$
\begin{align}
\nabla_{\theta}J(\theta) &= X^T(X \cdot \theta - \overrightarrow{y})
\end{align}
$$
And thus, we can create steps for our $\overrightarrow{\theta}$:
$$
\begin{align}
\overrightarrow{\theta} := \overrightarrow{\theta} + \alpha X^T(\overrightarrow{y} - (X \cdot \theta))
\end{align}
$$
This kind of looks like our step for $\theta_1$ not too long ago!
$$
\begin{align}
\theta_1 := \theta_1 + \alpha (y - h_\theta(x)) x
\end{align}
$$
Programming this in numpy


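The notebook itself isn’t reproduced here, but a vectorized sketch of the update $\overrightarrow{\theta} := \overrightarrow{\theta} + \alpha X^T(\overrightarrow{y} - X\overrightarrow{\theta})$ might look like this. The toy data, the ones column for the intercept, and the $1 \over m$ scaling are my assumptions:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """theta += alpha/m * X^T (y - X theta), repeated."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iterations):
        theta += alpha / m * X.T @ (y - X @ theta)
    return theta

# A "1" column makes theta[0] the y-intercept, as described above.
sqft = np.array([1.0, 2.0, 3.0, 4.0])
price = np.array([3.0, 5.0, 7.0, 9.0])  # lies exactly on price = 1 + 2 * sqft
X = np.column_stack([np.ones_like(sqft), sqft])
theta = gradient_descent(X, price)
```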
We get a slightly different answer here than in our old batch gradient descent code because our y-intercept effectively has a different learning rate. If we gave it enough repetitions, it would eventually get near the same area.
Of course, there are plenty of high level ML libraries to explore that do this stuff for you! However, it’s fun to understand what’s happening under the hood.
Things to worry about
Multivariate regression with gradient descent can have a lot of complications. For example, local minima:
As you can see, gradient descent might accidentally settle into that local dip and think it’s the minimum. (Plain linear regression’s least squares cost is convex, so this particular problem bites with more complex models.) There are tons of interesting papers about this. Training multiple times with different starting parameters is one way around it.
Overfitting is also an issue:
You can detect this by splitting your data into a “training” set and a held-out “testing” set. That’s standard procedure in most data science problems.
One last method: using the derivative
As many of you know, if you set the derivative of a curve to 0, you’ll find either a local maximum or a local minimum. Since our error function is convex with no upper limit, the point where the derivative is 0 must be its minimum.
So we can make our derivative $\nabla_{\theta}J(\theta)$ zero:
$$
\begin{align}
\nabla_{\theta}J(\theta) &= X^T(X\theta - \overrightarrow{y}) \cr
0 &= X^T(X\theta - \overrightarrow{y}) \cr
X^TX\theta &= X^T\overrightarrow{y} \cr
\theta &= (X^TX)^{-1}X^T\overrightarrow{y}
\end{align}
$$
And we can throw it in our code:


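The linked code isn’t shown here, but the normal equation is nearly a one-liner in numpy. Solving the linear system with `np.linalg.solve` is better-conditioned than forming the inverse explicitly; the toy data below is mine, not the post’s:

```python
import numpy as np

sqft = np.array([1.0, 2.0, 3.0, 4.0])
price = np.array([3.0, 5.0, 7.0, 9.0])  # lies exactly on price = 1 + 2 * sqft
X = np.column_stack([np.ones_like(sqft), sqft])

# theta = (X^T X)^(-1) X^T y, solved as a linear system instead of inverting.
theta = np.linalg.solve(X.T @ X, X.T @ price)
```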
And it totally works! You may ask “but why did we learn all about the gradient descent stuff then?” Well, you’ll need it for things that aren’t as straightforward as linear regression, like deep neural networks.
To learn more:
Andrew Ng’s notes on Linear Regression.
ML @ Berkeley’s Machine Learning Crash Course.