Support Vector Machines#
Support vector machines (SVMs) are among the most studied and best-understood methods in machine learning. Like linear regression and random forests, they can be used for classification, for regression, or even for (unsupervised) outlier detection (see the OneClassSVM class in Scikit-learn for more information), but in this section of the lesson, we’ll focus on using them as classifiers.
SVMs are very good at certain kinds of problems, especially when the number of features (dimensions/columns) is very large but the number of observations (points/rows) is low.
How do SVMs work?#
SVMs find a hyperplane (or a set of hyperplanes if the input data contain more than 2 classes) that best separates the data into the distinct target classes (i.e., one class on one side of the plane and the other class on the other side).
Note
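A “hyperplane” is the generalization of a plane to any number of dimensions: a flat surface with one fewer dimension than the space that contains it. In a 1D space, a hyperplane is a point; in 2D, it’s a line; in 3D, it’s an ordinary plane, and beyond 3D we just call it a “hyperplane”.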
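In symbols, a hyperplane is the set of points \(x\) satisfying \(w \cdot x + b = 0\), and a linear classifier assigns a class according to which side of the hyperplane a point falls on. The sketch below illustrates this on a toy two-class dataset generated with `make_blobs` (an assumption for illustration, not part of this lesson's data): it reads \(w\) and \(b\) from a fitted linear `SVC` and checks that the sign of \(w \cdot x + b\) reproduces the model's predictions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data (made up for illustration): two clusters of 2D points,
# labeled 0 and 1.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel='linear').fit(X, y)

# For a linear kernel, the fitted hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# Signed score of each point relative to the hyperplane.
manual_scores = X @ w + b
# Points with a positive score are assigned to the second class (label 1).
manual_pred = (manual_scores > 0).astype(int)

print(np.allclose(manual_scores, clf.decision_function(X)))
print(np.array_equal(manual_pred, clf.predict(X)))
```

Note that `coef_` is only available for the linear kernel; with other kernels the separating surface is not a hyperplane in the original feature space.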
Finding the hyperplane that separates classes can be performed efficiently and deterministically, which is a large part of why SVMs are successful. However, a hyperplane alone is very limiting: in many datasets, classes are not separated by a plane but rather by a complex surface, and, generally speaking, when a hyperplane can separate classes, simpler linear methods also tend to work well. Because of these limitations, SVMs use what’s known as a “kernel trick”, in which they implicitly map the input data into a higher-dimensional feature space and then find a separating hyperplane in that transformed space. A hyperplane in the transformed space can correspond to a much more complex and relevant surface in the original, untransformed feature space.
Fully understanding the kernel trick requires a deep dive into the relevant math, which we won’t do in this course. The important thing to understand is that SVMs excel at separating classes by hyperplanes, and they can use the kernel trick in conjunction with these hyperplanes to effectively divide classes using non-linear surfaces instead of just (linear) hyperplanes.
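As a rough illustration of this idea, the sketch below uses a toy dataset from `make_circles` (an assumption for illustration, not part of this lesson's data) in which one class forms a small ring inside the other. It compares a linear SVM on the raw features, a linear SVM after an explicit hand-made mapping into 3D, and an RBF-kernel SVM on the raw features. The explicit mapping is only a stand-in to build intuition; a kernel achieves a similar effect without ever constructing the extra features.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data (made up for illustration): a small ring of points nested inside
# a larger ring, so no straight line separates the two classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# 1. A linear SVM on the raw 2D features.
acc_linear = SVC(kernel='linear').fit(X, y).score(X, y)

# 2. A linear SVM after explicitly adding a third feature, x^2 + y^2.
#    In this 3D space the two rings sit at different heights, so a flat
#    plane can separate them.
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
acc_lifted = SVC(kernel='linear').fit(X_lifted, y).score(X_lifted, y)

# 3. An RBF-kernel SVM on the raw 2D features; the kernel plays the role
#    of the feature mapping without constructing it explicitly.
acc_rbf = SVC(kernel='rbf').fit(X, y).score(X, y)

print(f"linear SVM, raw features:    {acc_linear:.2f}")
print(f"linear SVM, lifted features: {acc_lifted:.2f}")
print(f"RBF SVM, raw features:       {acc_rbf:.2f}")
```

The accuracies here are measured on the training points themselves, which is enough to show whether each model can represent the boundary at all but says nothing about generalization.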
Limitations and Advantages of SVMs#
Advantages#
SVMs handle very high-dimensional datasets well. Some datasets, such as a dataset of high-resolution images, inherently have high dimensionality. High-resolution images have 1 dimension per image pixel and per channel, so, for example, a \(640 \times 480\) RGB image contains \(640 \times 480 \times 3 = 921,600\) dimensions (the 3 comes from the red/green/blue channels of the image) because each channel of each pixel contains a distinct independent value. Each of these values is a unique feature, and each image is an observation in the dataset.
SVMs also tend to handle sparse data well. Sparse data occurs when there are many features but most of them have null values (e.g., in numerical data, most of the features are 0). Text data is often represented sparsely, for example by assigning each column of a matrix a different word or concept, then encoding a block of text as a row by marking a 1 in the appropriate column for any word present in the block and leaving the remaining columns equal to 0.
Note
Sparse data in AI/ML creates many kinds of special circumstances and problems and is its own subdomain of data science research. This course doesn’t discuss sparse data in detail; a (highly technical) theoretical overview of sparse data methods can be found in this online book. A more practical introduction to sparse data can be found in the scipy documentation on sparse arrays.
SVMs tend to be robust to overfitting. With good cross-validation, overfitting can often be avoided, but SVMs tend to be more resistant to overfitting than many other ML methods.
Limitations#
SVMs can’t handle non-numerical data. Non-numerical data are very common, especially in demographic and survey data. An example of a non-numeric feature is a person’s ABO blood type, which is one of a few categories (A, B, AB, or O). For a dataset with non-numerical data, one would typically use a method called one-hot encoding to create a sparse numerical representation that works with SVMs. We won’t discuss one-hot encoding in this lesson, but Scikit-learn includes a built-in OneHotEncoder class to make this conversion automatically.
SVMs are sensitive to the scales of the input features; for this reason it’s almost always important to normalize a dataset’s features for use with SVMs. This normalization is important but is also the source of a lot of bugs: if the training data are normalized by subtracting the mean and dividing by the standard deviation, then the test data (and any data subsequently used with the SVM) must be normalized exactly the same way, meaning that the mean and standard deviation of the training dataset must be used to normalize any data to which the model is applied after training.
SVMs can be hard to interpret; although the idea of separating points with different classes by a hyperplane is fairly intuitive, how this works in high dimensions and when a kernel transformation is present can be very unintuitive.
What kernel to use and whether to use one are important hyperparameters of an SVM. Whether an SVM will work well or not can depend heavily on these choices. If the classes can be separated linearly, then a kernel is not necessary, but for nonlinear data, the correct choice can be very important.
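Two of the preprocessing steps mentioned in this list can be sketched briefly. The example below uses made-up data (a handful of blood types and two random numeric features on very different scales) to show `OneHotEncoder` converting categories to numbers and `StandardScaler` learning its mean and standard deviation from the training data only, then reusing them on the test data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# --- One-hot encoding a categorical feature (made-up blood types) ---
blood_types = np.array([['A'], ['B'], ['AB'], ['O'], ['A']])
encoder = OneHotEncoder()
# The encoder returns a sparse matrix; convert to dense for display.
blood_onehot = encoder.fit_transform(blood_types).toarray()
print(encoder.categories_[0])  # the category behind each column
print(blood_onehot)            # exactly one 1 per row

# --- Normalizing with statistics from the training data only ---
rng = np.random.default_rng(0)
# Made-up features on very different scales (one in the tens, one in
# the thousands).
train = rng.normal(loc=[50.0, 2000.0], scale=[5.0, 300.0], size=(80, 2))
test = rng.normal(loc=[50.0, 2000.0], scale=[5.0, 300.0], size=(20, 2))

scaler = StandardScaler().fit(train)  # learn mean/std from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuse the *training* mean/std

# The test data are normalized with the training statistics, not their own.
expected = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, expected))
```

Calling `fit` (or `fit_transform`) again on the test data would silently use different statistics, which is the kind of bug described above.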
Example: The Diabetes Dataset#
For SVMs, we’ll use a different dataset than we’ve used in the previous lessons. This dataset is called the “Diabetes Dataset” and can also be loaded using Scikit-learn. It is structured very similarly to the California Housing Dataset, with a few features and a single target. In this case, the rows of the dataset correspond to observations of individual patients while the features each represent some health data such as BMI. The target is a number greater than 0 that quantifies the progression of the patient’s diabetes.
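import sklearn as skl
# Import the datasets submodule explicitly so that skl.datasets is available.
import sklearn.datasets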
# We use scikit-learn to download and return the diabetes dataset:
diabetes_dataset = skl.datasets.load_diabetes()
# Extract the actual data rows and the feature names:
diabetes_featdata = diabetes_dataset['data']
diabetes_featnames = diabetes_dataset['feature_names']
# We also extract the "target" data, since we are using supervised learning.
# In this dataset, the target is a quantitative measure of disease
# progression.
diabetes_targdata = diabetes_dataset['target']
# We can also print the dataset description:
print(diabetes_dataset['DESCR'])
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
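The scaling note in the description above can be checked directly. This short sketch reloads the data and confirms that each feature column has a mean of approximately 0 and a sum of squares of approximately 1.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes()['data']

# Each column should have mean ~0 and a sum of squares of ~1.
means_are_zero = np.allclose(X.mean(axis=0), 0.0)
sumsq_are_one = np.allclose((X ** 2).sum(axis=0), 1.0)
print(means_are_zero, sumsq_are_one)
```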
As in the previous sections, we’ll split the dataset into train and test subdatasets.
import numpy as np
# We set a specific random seed to make sure that this notebook runs the same way each time.
np.random.seed(0)
# Randomly select 75% of the rows to be in the training dataset.
all_rows = np.arange(diabetes_featdata.shape[0])
n_train = int(round(len(all_rows) * 0.75))
n_test = len(all_rows) - n_train
train_rows = np.random.choice(all_rows, n_train, replace=False)
test_rows = np.setdiff1d(all_rows, train_rows)
# Extract these rows into separate matrices:
train_featdata = diabetes_featdata[train_rows]
train_targdata = diabetes_targdata[train_rows]
test_featdata = diabetes_featdata[test_rows]
test_targdata = diabetes_targdata[test_rows]
For the Diabetes Dataset, we are going to perform a classification problem; accordingly, we need to convert the target data into classes. Currently, the target data contains a number between 0 and about 350 representing a quantification of disease progression. We can divide these values into three classes as follows:
Class 0: minimal disease progression (target <= 100).
Class 1: moderate disease progression (100 < target <= 250).
Class 2: advanced disease progression (250 < target).
train_targclass = np.ones(len(train_targdata))
train_targclass[train_targdata <= 100] = 0
train_targclass[train_targdata > 250] = 2
test_targclass = np.ones(len(test_targdata))
test_targclass[test_targdata <= 100] = 0
test_targclass[test_targdata > 250] = 2
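The same binning can be written more compactly with `np.digitize`. The sketch below applies both approaches to a few made-up target values, including the two boundary values, to show that they agree.

```python
import numpy as np

# A few made-up target values, including the two boundary values.
targets = np.array([25.0, 100.0, 100.5, 250.0, 251.0, 346.0])

# The rule used above, written with explicit masks.
by_mask = np.ones(len(targets))
by_mask[targets <= 100] = 0
by_mask[targets > 250] = 2

# With right=True, each value x gets the index i such that
# bins[i-1] < x <= bins[i], which matches the class definitions.
by_digitize = np.digitize(targets, bins=[100, 250], right=True)

print(by_mask.astype(int))
print(by_digitize)
```

One difference is that `np.digitize` returns integer class labels, whereas the mask-based version starts from `np.ones` and therefore stores the labels as floats; `SVC` accepts either.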
We can now allocate an SVM management object. The classification SVM that we’ll use is called SVC; the equivalent version that performs regression is SVR.
from sklearn.svm import SVC
# Create the SVM object; we provide it with a hyperparameter specifying the
# kind of kernel it should use.
svm = SVC(kernel='linear')
# Normally, we would want to normalize each of the features prior to training
# the SVM. We don't need to do this, however, because the Diabetes Dataset has
# already been normalized!
# Train the svm using our training data.
svm.fit(train_featdata, train_targclass)
SVC(kernel='linear')
Notice that the SVM training is very quick, especially compared to the random forest. This is one of the great features of SVMs, though their speed varies substantially with the dataset: for datasets with many observations, SVMs can be very slow, but for datasets with many features, SVMs are typically fast.
Now that we’ve trained the SVC, we can see how well it performs. Because this is a classification method and not a regression method, the score is different. Instead of the \(R^2\) (explained variance) of the model predictions, SVC.score calculates the fraction of correctly predicted target classes.
svm.score(test_featdata, test_targclass)
0.5181818181818182
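The model appears to have obtained an accuracy of only about 52%. This is well above the ~33% expected from guessing uniformly at random among three classes, but it’s hardly amazing (and if the classes are not equally common, always predicting the most common class could score higher than 33%). Maybe our performance could be improved by trying a different kernel, such as the 'rbf' kernel. RBF stands for “radial basis function” and is a popular kernel for SVMs.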
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(train_featdata, train_targclass)
svm_rbf.score(test_featdata, test_targclass)
0.7636363636363637
This SVM achieved an accuracy of over 76%, substantially better! This is an example where the hyperparameter kernel has a very large effect on the model. Unfortunately, figuring out the best set of hyperparameters for any given model is often a matter of experimentation and careful testing.
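To put accuracy scores like these in context, it can help to compute the accuracy by hand and to compare it against a trivial baseline. The sketch below rebuilds the split and classes from the cells above, checks that the fraction of matching predictions equals what `score` reports, and then evaluates a `DummyClassifier` that always predicts the most common training class. The baseline is an addition for illustration and is not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

# Rebuild the data and the train/test split as in the cells above.
data = load_diabetes()
feat, targ = data['data'], data['target']
np.random.seed(0)
all_rows = np.arange(feat.shape[0])
n_train = int(round(len(all_rows) * 0.75))
train_rows = np.random.choice(all_rows, n_train, replace=False)
test_rows = np.setdiff1d(all_rows, train_rows)

# Bin the targets into classes 0, 1, and 2 (same thresholds as above).
targclass = np.digitize(targ, bins=[100, 250], right=True)
train_X, train_y = feat[train_rows], targclass[train_rows]
test_X, test_y = feat[test_rows], targclass[test_rows]

svm_rbf = SVC(kernel='rbf').fit(train_X, train_y)

# Accuracy by hand: the fraction of predictions that match the true class.
manual_acc = np.mean(svm_rbf.predict(test_X) == test_y)
print(manual_acc, svm_rbf.score(test_X, test_y))

# Baseline: always predict the most common class in the training data.
baseline = DummyClassifier(strategy='most_frequent').fit(train_X, train_y)
baseline_acc = baseline.score(test_X, test_y)
print("baseline accuracy:", baseline_acc)
```

A model is only doing useful work to the extent that it beats this kind of baseline, so the gap between the two numbers is often more informative than the raw accuracy.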
Hyperparameter fiddling and the need for validation subdatasets.#
In the previous section, we did something that is commonplace when handling and analyzing data but which is nonetheless problematic. What we did was try out a model, then, when it didn’t work as well as we had hoped, we tried a different hyperparameter using the same training and test datasets. Why do you think this is a problem?
Recall from our discussion of cross-validation at the beginning of this lesson that cross-validation protects against overfitting, in which the model is fit to its training data so well that it models the noise in the dataset. To avoid this, we update (train) the model only using the training data but we evaluate (test) the model only using the test data. This split prevents the model from fitting noise because it can only fit noise in the training dataset and never the test dataset.
However, if we train a model, look at its performance on the test dataset, then go back and retrain the model using a different hyperparameter (in search of a better score), we are essentially performing a kind of high-level optimization that uses the test dataset for training. We have updated a hyperparameter based on the test data performance, which can potentially result in a set of hyperparameters that induce the model to fit noise in the test dataset.
In this particular case, a single change of a single hyperparameter is unlikely to bias our estimate of the model’s accuracy very much, but the scenario explained above is a real problem that should be avoided. The typical approach for avoiding this is to use a nested split of the data into a training subdataset, a test subdataset, and a validation subdataset. The training and test datasets are used as described above, but the validation dataset is used only after a final model has been trained and no further decisions about hyperparameters are to be made. Ideally, the validation dataset should not really be examined prior to its use to validate a model. Note that many other sources swap these two names, calling the subset used for tuning the “validation” set and the final held-out subset the “test” set.
There is no single rule about how large each of the subdatasets should be, but a good rule of thumb is to use 70% of the observations in the training dataset and 15% in each of the test and validation datasets.
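A minimal sketch of this workflow, using the naming from this lesson (hyperparameters are chosen on the test subdataset and the validation subdataset is held back until the end), might look as follows. The use of `train_test_split`, the list of kernels tried, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_diabetes()
X = data['data']
# Bin the targets into classes 0, 1, and 2 (same thresholds as above).
y = np.digitize(data['target'], bins=[100, 250], right=True)

# First set aside 30% of the rows, then split that portion in half,
# giving roughly 70% train, 15% test, and 15% validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_test, X_valid, y_test, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)

# Choose the kernel using the test subdataset only.
test_scores = {}
for kernel in ['linear', 'poly', 'rbf']:
    model = SVC(kernel=kernel).fit(X_train, y_train)
    test_scores[kernel] = model.score(X_test, y_test)
best_kernel = max(test_scores, key=test_scores.get)

# Only after the choice is final do we touch the validation subdataset.
final_model = SVC(kernel=best_kernel).fit(X_train, y_train)
valid_acc = final_model.score(X_valid, y_valid)
print(test_scores)
print("chosen kernel:", best_kernel)
print("validation accuracy:", valid_acc)
```

Because the validation rows played no part in choosing the kernel, the final number is an honest estimate of how the chosen model performs on new data, even if it turns out lower than the best test score.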