Linear Regression#
Linear regression is one of the most important and commonly used methods in all of ML. It goes by several names, most commonly linear regression or ordinary least squares (OLS). Linear regression is valuable in particular because it is both deterministic and highly interpretable. Deterministic in this context means that linear regression does not depend on any hyperparameters or random initialization: given the same set of training data, the same model parameters will always be learned. Interpretable in this context means that one can understand the relationships in the data captured by the model by examining the model's trained parameters. For some of the models we will discuss later, especially convolutional neural networks (Lesson 4.2), it is very difficult to deduce how the model works just by examining its parameters.
Linear regression models are among the simplest possible models; they are much more likely to exhibit bias (underfitting) than to exhibit variance (overfitting). For this reason, linear regression is a good method to use as a baseline when comparing to other models. If a complex model is performing no better than linear regression on a task, then that model could probably be a lot simpler.
In many cases, linear regression is the only tool that one needs for good scientific deduction. Even in many cases when it is not the best tool, it’s often still a good place to start.
What does Linear Regression do?#
Linear regression takes as input a set of feature vectors and a single target vector. Its goal is to explain the target vector as a weighted sum of the feature vectors.
A lot of variables that are measured in research can be explained as the sum of a few quantities, often with some relative weighting. For example, we might suspect that a child’s height will in general be well predicted by some combination of their biological parents’ heights. Our model essentially says that \(h = w_\mbox{M} h_\mbox{M} + w_\mbox{F} h_\mbox{F}\), where \(h\) is the predicted adult height of the child, \(h_\mbox{M}\) and \(h_\mbox{F}\) are the heights of the mother and father, respectively, and \(w_\mbox{M}\) and \(w_\mbox{F}\) are the weights. If we were to carefully measure the heights of many biological parents, then much later the heights of their adult children, all at the same age, we would have many observations of \(h\), \(h_\mbox{M}\), and \(h_\mbox{F}\), and we could use linear regression to find the weights \(w_\mbox{M}\) and \(w_\mbox{F}\).
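If we had collected such measurements, the fit itself would take only a few lines of code. Here is a minimal sketch with made-up numbers (the heights below are purely illustrative; note also that, unlike the equation above, scikit-learn fits an intercept term by default):
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical parent heights (in cm); each row is one family: [mother, father].
parent_heights = np.array([[160.0, 175.0],
                           [165.0, 180.0],
                           [170.0, 172.0],
                           [158.0, 185.0],
                           [172.0, 178.0]])
# Made-up adult heights of the corresponding children.
child_heights = np.array([169.0, 174.0, 171.0, 173.0, 176.0])

# Fit the weighted-sum model; scikit-learn also fits an intercept by default.
height_model = LinearRegression().fit(parent_heights, child_heights)
print("w_M, w_F:", height_model.coef_)
print("intercept:", height_model.intercept_)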
Advantages and Limitations#
Linear regression is one of the simplest and most powerful supervised learning tools. As a low-complexity model, it is much more likely to exhibit bias (where it is too inflexible to model a dataset well) than variance (where it is so flexible that it fits noise in the dataset).
Limitations#
Linear regression can only capture linear relationships between variables, i.e., relationships of the form \(\boldsymbol{y} = m \boldsymbol{x} + b\).
Because linear regression minimizes the sum of squared errors, it is sensitive to outliers. A single outlier can dramatically change the results of a linear regression, as the short sketch below illustrates.
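A small illustration of this sensitivity, using made-up data: we fit a line to points that lie exactly on \(y = 2x + 1\), then refit after corrupting a single point.
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data lying exactly on the line y = 2x + 1.
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * x.ravel() + 1

clean_fit = LinearRegression().fit(x, y)

# Corrupt a single observation with a large outlier and refit.
y_outlier = y.copy()
y_outlier[-1] += 50
outlier_fit = LinearRegression().fit(x, y_outlier)

print("slope without outlier:", clean_fit.coef_[0])    # ~2.0
print("slope with one outlier:", outlier_fit.coef_[0])  # noticeably different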
Advantages#
Linear regression is simple and easy to interpret.
Linear regression is computationally very efficient.
The algorithm for linear regression is deterministic, meaning that it is not affected by randomness and will produce the same output for identical inputs.
Linear Regression Algorithm#
Inputs#
A feature matrix. Typically feature matrices are organized like spreadsheets, with features (or dimensions) represented by the columns and distinct observations represented by the rows. There can be any number of features and any number of observations, and duplicate rows are allowed.
A target vector. The values, one per row of the feature matrix, that the model is learning to predict from the rows of the feature matrix.
Note
The order of the rows of the feature matrix and target vector doesn’t matter in linear regression as long as they are matched to each other. In other words, features[i] must always be associated with targets[i] (for any valid i), but as long as this remains true, you can jointly reorder their rows without changing the result of the regression.
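As a quick illustration (using a small synthetic feature matrix and target vector rather than the housing data we will load below), jointly permuting the rows leaves the fitted coefficients unchanged:
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature matrix (20 rows, 3 features) and matching target vector.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 3))
targets = features @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

# Jointly permute the rows of both arrays.
perm = rng.permutation(len(targets))

fit_original = LinearRegression().fit(features, targets)
fit_permuted = LinearRegression().fit(features[perm], targets[perm])

# The learned coefficients are identical (up to floating-point error).
print(np.allclose(fit_original.coef_, fit_permuted.coef_))  # True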
Algorithm#
The algorithm for linear regression is called linear least squares. We won't discuss the algorithm in detail in this lesson, but, in brief, one can demonstrate using linear algebra that the optimal coefficients are computed by a few matrix operations on the feature matrix input. To be precise, the coefficients are equal to
\[\boldsymbol{w} = \left(\mathbf{F}^\top \mathbf{F}\right)^{-1} \mathbf{F}^\top \boldsymbol{t}\]
where \(\mathbf{F}\) is the feature matrix and \(\boldsymbol{t}\) is the target vector. The algorithm calculates this result.
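Although we will let scikit-learn do this work for us below, the computation itself is short. Here is a minimal NumPy sketch with a hypothetical feature matrix F and target vector t; a column of ones is appended so that an intercept is estimated along with the weights.
import numpy as np

# Hypothetical inputs: a feature matrix F (100 rows, 3 features) and target t.
rng = np.random.default_rng(0)
F = rng.normal(size=(100, 3))
t = F @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept b is estimated as an extra weight.
F1 = np.hstack([F, np.ones((len(F), 1))])

# Solve the normal equation (F1^T F1) w = F1^T t for w; this is equivalent to
# the pseudo-inverse formula above but more numerically stable.
w = np.linalg.solve(F1.T @ F1, F1.T @ t)
print("coefficients:", w[:-1])
print("intercept:", w[-1])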
Outputs#
A set of coefficients, one per feature; if, as in the CA Housing Dataset, our feature matrix \(\mathbf{F}\) contains eight features, each of which is a column (i.e., \(\mathbf{F} = \left(\boldsymbol{f_1} \; \boldsymbol{f_2} \; ... \; \boldsymbol{f_8}\right)\)), then the coefficients will be a NumPy vector \(\left(w_1, w_2, ..., w_8\right)\) where each \(w\) value is a real number.
A real-valued intercept, \(b\).
The model’s prediction of the target data is equal to \(w_1 \boldsymbol{f_1} + w_2 \boldsymbol{f_2} + ... + w_n \boldsymbol{f_n} + b\).
Example: the California Housing Dataset#
Let’s work through an example of linear regression using the California Housing Dataset. We can start by loading in the dataset itself, using Scikit-learn.
from sklearn import datasets
# We use scikit-learn to download and return the CA housing dataset:
ca_housing_dataset = datasets.fetch_california_housing()
# Extract the actual data rows and the feature names:
ca_housing_featdata = ca_housing_dataset['data']
ca_housing_featnames = ca_housing_dataset['feature_names']
# We also extract the "target" data, since we are using supervised learning:
ca_housing_targdata = ca_housing_dataset['target']
ca_housing_targnames = ca_housing_dataset['target_names']
It's good practice in general to use cross-validation (or at least a held-out test set) when training and evaluating models, so we'll go ahead and split our data into training and test datasets.
import numpy as np
# Randomly select 75% of the rows to be in the training dataset.
all_rows = np.arange(ca_housing_featdata.shape[0])
n_train = int(round(len(all_rows) * 0.75))
n_test = len(all_rows) - n_train
train_rows = np.random.choice(all_rows, n_train, replace=False)
test_rows = np.setdiff1d(all_rows, train_rows)
# Extract these rows into separate matrices:
train_featdata = ca_housing_featdata[train_rows]
train_targdata = ca_housing_targdata[train_rows]
test_featdata = ca_housing_featdata[test_rows]
test_targdata = ca_housing_targdata[test_rows]
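As an aside, scikit-learn provides a helper, train_test_split (in sklearn.model_selection), that performs the same kind of random split in a single call:
from sklearn.model_selection import train_test_split

# Roughly equivalent to the manual split above; random_state just makes the
# split reproducible from run to run.
(train_featdata, test_featdata,
 train_targdata, test_targdata) = train_test_split(
    ca_housing_featdata, ca_housing_targdata,
    train_size=0.75, random_state=0)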
We can now employ the LinearRegression class of the Scikit-learn library in order to run the linear regression itself.
# Import the LinearRegression class from scikit-learn:
from sklearn.linear_model import LinearRegression
# Create the linear regression manager object:
linreg = LinearRegression()
# Because linear regression is supervised, we need to fit it using both a
# matrix of input data (one feature per column, one observation per row) AND
# a vector of "correct" outputs for it to learn. For us, the input data (used
# to make the predictions) is the CA housing features while the output data
# (gold-standard outputs that go with each input) is the CA housing targets
# (the median house value in each region).
# We train using the training data.
linreg.fit(train_featdata, train_targdata)
LinearRegression()
What data does the LinearRegression object provide?#
After fitting the linear regression object (linreg) using our training data, the object will contain some relevant data about the regression, just like the PCA object in Lesson 1.2. To understand these data, let's first review the model that the LinearRegression class has fit.
Linear regression attempts to model some value \(y\) (in our case, the median house value in a particular geographical region) as a weighted sum of input variables. For example, if we had three input variables (\(x_1\), the median age of a house in a geographical region; \(x_2\), the median number of bedrooms; and \(x_3\), the median square footage), the linear regression model would find the values of \(w_1\), \(w_2\), \(w_3\), and \(b\) in the following equation that make the left and right sides of the equation as similar as possible:
\[y \approx w_1\,x_1 + w_2\,x_2 + w_3\,x_3 + b\]
In the example we trained above, we actually have 8 input variables, one per column of the train_featdata matrix (to see what these columns represent, you can evaluate print(ca_housing_featnames)). We have one output variable, the median housing price in the region. Each row of the matrix represents a different observation (a different geographical region, in our case), and the model minimizes the sum of squared errors across all observations in the training dataset. The squared error for a single row of the dataset is just \(\left(y - (w_1\,x_1 + w_2\,x_2 + w_3\,x_3 + b)\right)^2\).
The main results of fitting the linear model are these values: \(w_1\), \(w_2\), etc. are called the coefficients, and \(b\) is called the intercept. They are stored as attributes of the linreg object:
linreg.intercept_. The intercept of the fit, i.e., the parameter \(b\).
linreg.coef_. The coefficients, or weights, found by the regression, i.e., \(w_1\), \(w_2\), etc. There is one coefficient for each column of the input matrix.
If we want to calculate the prediction of the linear regression model for a particular row of the test dataset, we can do so by taking the dot product of the coefficients and the row's feature values, then adding the intercept. This works because the matrix row is essentially a vector of the inputs \((x_1\;x_2\;...\;x_n)\) while linreg.coef_ is a vector of coefficients \((w_1\;w_2\;...\;w_n)\), so their dot product is \(w_1\,x_1 + w_2\,x_2 + ... + w_n\,x_n\).
# Extract a row from the test data:
row_index = 10
test_featrow = test_featdata[row_index]
test_targrow = test_targdata[row_index]
# Calculate its prediction:
pred = np.dot(linreg.coef_, test_featrow) + linreg.intercept_
# Print the prediction and the true output.
print("Prediction:", pred)
print("Gold Output:", test_targrow)
Prediction: 2.4349379827100535
Gold Output: 1.914
Alternatively, the LinearRegression class includes a method called predict that can be used to calculate the prediction for a single row or for an entire matrix.
# Use the prediction method for the row we extracted; the predict method
# expects a matrix, so we reshape this into a 1x8 matrix first.
test_featrow_mtx = test_featrow[None, :]
pred_mtx = linreg.predict(test_featrow_mtx)
# predict also returns a vector of results, one per row of the input
# matrix (which only had 1 row in this case) so we extract that row.
pred = pred_mtx[0]
# Print the prediction and the true output.
print("Prediction:", pred)
print("Gold Output:", test_targrow)
Prediction: 2.4349379827100535
Gold Output: 1.914
Note that because the coefficients are matched to each of the input features, we can tell, just by looking at the coefficients, how important each input feature was to the model.
for (name, coef) in zip(ca_housing_featnames, linreg.coef_):
print(f"{name:10s}: {coef}")
MedInc : 0.4403951748754435
HouseAge : 0.009350678479977947
AveRooms : -0.1110151139641256
AveBedrms : 0.6178853783051885
Population: -7.851555589797818e-06
AveOccup : -0.003397017521630609
Latitude : -0.42505051822238943
Longitude : -0.44063604626370345
At first glance, it looks like the most important feature for predicting the median housing price in a region is the average number of bedrooms of houses in that region. After some consideration, however, one might wonder: is the average number of bedrooms really more than 60 times more valuable for predicting the price of a house than the average age of a region's houses? The answer is a little complicated, because interpreting the coefficients as measures of the raw importance of their features to the model is imprecise.
Keep in mind that the model tries to make the sum \(w_1\,x_1 + w_2\,x_2 + ... + w_n\,x_n\) equal the target output value (where the \(w\) values are the coefficients and the \(x\) values are the features). If we ran the regression twice, but the second time we scaled all of the features to be \(x_1/2\), \(x_2/2\), ..., \(x_n/2\), then we would expect the resulting coefficients to be \(2w_1\), \(2w_2\), ..., \(2w_n\) in order to preserve the same predictions, and this is exactly what linear regression returns if you do this. Similarly, if we were to halve only one feature, we would expect only its coefficient to be doubled on the second run of the regression.
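We can check this claim directly: halving a single column of the training features (column 0, MedInc) should double that column's coefficient and leave the others untouched. A quick sketch reusing linreg, train_featdata, and train_targdata from above:
# Halve a single feature (column 0, MedInc) and refit.
train_featdata_scaled = train_featdata.copy()
train_featdata_scaled[:, 0] /= 2

linreg_scaled = LinearRegression().fit(train_featdata_scaled, train_targdata)

# The first coefficient doubles; the others are unchanged.
print(linreg_scaled.coef_[0] / linreg.coef_[0])               # ~2.0
print(np.allclose(linreg_scaled.coef_[1:], linreg.coef_[1:]))  # True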
What all of this means is that a coefficient reflects both how important a feature is to the model and the scale of the feature itself. For this reason, it is sometimes desirable to normalize all of your input features by subtracting each feature's mean and dividing by its standard deviation. This puts all of your coefficients on the same scale, which matters especially if you are comparing otherwise incomparable columns, such as one representing the median housing age (in units of years) and one representing the average number of bedrooms (in units of bedrooms).
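Scikit-learn's StandardScaler (from sklearn.preprocessing) performs exactly this per-column standardization; a minimal sketch (in the next cell we will do the equivalent normalization by hand with NumPy):
from sklearn.preprocessing import StandardScaler

# Subtract each column's mean and divide by each column's standard deviation.
scaler = StandardScaler()
train_featdata_std = scaler.fit_transform(train_featdata)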
Let’s rerun our model using normalized features to make sure we understand the contributions of each feature to the model.
linreg_norm = LinearRegression()
# Subtract each feature's mean and divide by each feature's standard deviation:
train_featdata_norm = train_featdata - np.mean(train_featdata, 0)
train_featdata_norm /= np.std(train_featdata_norm, 0)
linreg_norm.fit(train_featdata_norm, train_targdata)
for (name, coef) in zip(ca_housing_featnames, linreg_norm.coef_):
print(f"{name:10s}: {coef}")
MedInc : 173.9539157469371
HouseAge : 3.6934717482836517
AveRooms : -43.85042090121405
AveBedrms : 244.0616681810014
Population: -0.0031013256217757856
AveOccup : -1.341805118359223
Latitude : -167.89285226833294
Longitude : -174.0490587538757
A few things to notice about the linear regression results:
The population of a region appears to contribute almost nothing to the model!
There are several negative coefficients. If a coefficient is negative, its feature negatively covaries with the target output, so the model subtracts the coefficient times the feature value from the prediction. When determining which feature is the most important to the model, however, one should compare the absolute values of the coefficients, because negative contributions to the model are just as important as positive contributions.
Latitude and longitude both have strong negative contributions to the model, meaning they are negatively correlated with median housing price. If you aren’t familiar with California’s geography, this may seem unintuitive, but many of California’s most expensive neighborhoods are along its southwest coast where San Francisco, Los Angeles, and San Diego can be found. Southwest California is lower in both longitude and latitude than the rest of the state.
It may seem unintuitive that the average number of rooms in a house negatively contributes to the model prediction, but if two houses are the same size, the house with more rooms is often less valuable. This dataset doesn’t include the average area of a house, but the number of bedrooms likely correlates with house size better than the number of rooms.
Evaluating how well our model did#
The LinearRegression object includes a method called score that measures how well our trained model predicts a given set of targets from the corresponding features. We could, for example, use this method to evaluate the ability of our model to explain the training dataset itself.
linreg.score(train_featdata, train_targdata)
0.6084709787611451
This tells us that our score is about 0.61 (yours may be slightly different when you run it, due to the randomization we used to split the training and test datasets). What does this value mean?
For linear regression, the score method returns a value called the coefficient of determination, which is often written as \(R^2\). The coefficient of determination is always a real number less than or equal to 1 that approximately represents the fraction of the variance in a dataset that can be explained by a particular fitted model. A value of 1 means that the model explains the dataset perfectly. A value of 0 means that the model captures none of the dataset's variance (e.g., a model that always predicts the mean of the targets), and a negative value means that the model's predictions are even less accurate than always predicting the mean.
\(R^2\) is generally a good machine learning metric because it tells us how well our model is performing on a normalized scale, meaning that \(R^2\) can usually be compared across models and datasets.
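To make the definition concrete, we can compute \(R^2\) by hand from the model's predictions and confirm that it matches the score value above:
# R^2 = 1 - (sum of squared residuals) / (total sum of squares).
preds = linreg.predict(train_featdata)
ss_res = np.sum((train_targdata - preds) ** 2)
ss_tot = np.sum((train_targdata - np.mean(train_targdata)) ** 2)
print(1 - ss_res / ss_tot)                           # computed by hand
print(linreg.score(train_featdata, train_targdata))  # same number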
An \(R^2\) value of 0.61 means that the model's predictions of median house prices account for about 61% of the variance in our training subset of the California Housing Dataset. A more important metric, however, is how well the model does when predicting the test dataset, since the model cannot be overfit to data it never saw during training.
linreg.score(test_featdata, test_targdata)
0.5976552445270319
The \(R^2\) score for the test dataset is also about 0.60; this indicates not only that our linear model is modestly accurate at modeling the data, but also that there was minimal overfitting.