What is Unsupervised Learning?#
If a method is an “unsupervised learning” method, it can be used to deduce facts about a dataset without being shown any examples of those facts. A “supervised learning” method, in contrast, studies a set of example facts in order to learn how to predict the facts appropriate for unseen data. For example, suppose you make a set of detailed measurements of the nutritional and health profiles of several rabbits over a number of years. To analyze these data, you could start by attempting to cluster the rabbits into groups with similar nutritional profiles. Such an analysis would be an unsupervised learning approach because it assigns a class to each rabbit without ever seeing examples of correctly classified rabbits. After performing this clustering, you might then try to understand how nutrition and health are related in the dataset by sorting the rabbits according to the overall quality of their nutritional profiles and assessing which characteristics of a rabbit’s nutritional profile lead to improvements in its health. Such an assessment would be a supervised learning approach because it learns the association between nutrition and health from examples (the rabbits in the dataset).
Common types of unsupervised learning include clustering, as in the example above, as well as dimensionality reduction, outlier detection, and kernel density estimation.
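Each of these tasks has a corresponding family of estimators in scikit-learn. The sketch below is illustrative only: the class names are real scikit-learn APIs, but the tiny random dataset is made up for demonstration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity, LocalOutlierFactor

# A small, made-up 2-D dataset for illustration:
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))

# Clustering: assign each point to one of two groups.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)

# Dimensionality reduction: project each point onto a single axis.
projected = PCA(n_components=1).fit_transform(points)

# Outlier detection: flag points far from their neighbors (-1 marks outliers).
outlier_flags = LocalOutlierFactor().fit_predict(points)

# Kernel density estimation: estimate the log-density of the data at each point.
log_density = KernelDensity().fit(points).score_samples(points)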
Why use Unsupervised Learning?#
Frequently in research we do not have the luxury of correct examples from which to train a supervised machine learning algorithm. In such cases, we attempt to deduce as much as possible from the data itself: how do the dimensions of the data vary, what patterns of variability appear, and can the data be clustered into classes based on simple notions of distance?
Unsupervised learning is often a useful starting point for more detailed analyses because several unsupervised methods excel at simplifying data. For example, in a high-dimensional dataset like a survey with hundreds of questions, it can be very hard to find patterns and similarities in responses by hand, but a dimensionality reduction technique makes it easy to visualize groups of respondents whose answers covary similarly. Once observed, such a relationship can then be tested more formally with a model.
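As a rough sketch of this idea (the survey data below is entirely made up), dimensionality reduction can compress hundreds of answer columns into two coordinates per respondent, which can then be plotted and inspected for groups:

import numpy as np
from sklearn.decomposition import PCA

# A made-up survey: 300 respondents answering 100 questions on a 1-5 scale.
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(300, 100)).astype(float)

# Reduce the 100 answer columns to 2 coordinates per respondent:
coords = PCA(n_components=2).fit_transform(responses)
# coords has shape (300, 2): easy to plot and inspect for groups of respondents.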
The California Housing Dataset#
To learn about any machine learning or artificial intelligence method, it’s always useful to have an example dataset, and for these lessons we will use the California Housing Dataset, which can be loaded by the Scikit-learn library. This dataset contains observations about the makeup and locations of households across California based on the 1990 U.S. Census.
# First, we need to import scikit-learn's datasets module (importing the
# top-level sklearn package alone does not load its submodules):
from sklearn import datasets
# Next, we use scikit-learn to download and return the CA housing dataset:
ca_housing_dataset = datasets.fetch_california_housing()
# Extract the actual data rows and the feature names:
ca_housing_featdata = ca_housing_dataset['data']
ca_housing_featnames = ca_housing_dataset['feature_names']
# We also extract the "target" data:
ca_housing_targdata = ca_housing_dataset['target']
ca_housing_targnames = ca_housing_dataset['target_names']
# Print the description of the dataset:
print(ca_housing_dataset['DESCR'])
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.
.. rubric:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33:291-297, 1997.
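As an aside, recent versions of scikit-learn (0.23 and later) can return the same dataset as pandas objects directly via the as_frame option; this is an alternative to the manual DataFrame construction we do below.

# Alternative (requires scikit-learn >= 0.23): fetch the dataset as pandas objects.
ca_housing_frame = datasets.fetch_california_housing(as_frame=True)
print(type(ca_housing_frame['data']))    # a pandas DataFrame of the 8 features
print(type(ca_housing_frame['target']))  # a pandas Series of the target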
Dataset Features#
The data from the CA housing dataset is represented as a NumPy array whose rows represent census block groups and whose columns represent house-related features (dimensions) of each block group, such as the median age of a block group’s houses and the average number of bedrooms per household. See the dataset description above for more information.
A quick inspection confirms that there are the same number of feature names in the dataset as there are columns in the data matrix.
print(f'Feature names: {len(ca_housing_featnames)}')
print(f'Data shape: {ca_housing_featdata.shape}')
Feature names: 8
Data shape: (20640, 8)
We can use Pandas to organize the feature matrix and feature names into a DataFrame.
# To organize the dataset into a dataframe, we use Pandas:
import pandas as pd
feat_df = pd.DataFrame(ca_housing_featdata, columns=ca_housing_featnames)
# Display the dataframe:
feat_df
|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |
| 20639 | 2.3886 | 16.0 | 5.254717 | 1.162264 | 1387.0 | 2.616981 | 39.37 | -121.24 |
20640 rows × 8 columns
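With the features in a DataFrame, a convenient next step (a quick sanity check, not part of the analysis proper) is to look at summary statistics for each column:

# Per-column summary statistics (count, mean, std, min, quartiles, max):
feat_df.describe()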
Target Data#
The CA housing dataset contains a “target” variable representing the median house value of each census block group, expressed in units of $100,000. Although we won’t be using this target value to train our methods (because we will be studying unsupervised methods in this lesson), we will want to look at it periodically to see whether our unsupervised methods have deduced anything useful.
We can keep track of the target data by making it into its own DataFrame.
targ_df = pd.DataFrame({ca_housing_targnames[0]: ca_housing_targdata})
# Display the dataframe:
targ_df
|  | MedHouseVal |
|---|---|
| 0 | 4.526 |
| 1 | 3.585 |
| 2 | 3.521 |
| 3 | 3.413 |
| 4 | 3.422 |
| ... | ... |
| 20635 | 0.781 |
| 20636 | 0.771 |
| 20637 | 0.923 |
| 20638 | 0.847 |
| 20639 | 0.894 |
20640 rows × 1 columns
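Since the target is expressed in units of $100,000, we can, as another quick sanity check, look at its range and center:

# The target's minimum, median, and maximum, in units of $100,000:
print(targ_df['MedHouseVal'].min(),
      targ_df['MedHouseVal'].median(),
      targ_df['MedHouseVal'].max())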
Lesson Goals#
In this lesson we will deduce what we can about the California Housing dataset and attempt to make predictions about the dataset’s target variable. Because this lesson is on unsupervised learning, we won’t be using the target data to train any of the methods in this lesson, but we will see whether we can learn about the target variable nonetheless.
By the end of this lesson, you should be comfortable with the following concepts:

- The k-means algorithm, which we will use first to put rows of the CA housing dataset into clusters based on their similarity.
- Principal Component Analysis (PCA), which we will use next to visualize and examine the covariance of the dataset.
- Locally Linear Embedding (LLE), which we will compare against the methods above to see some of the ways they can fail.
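As a small preview (a sketch assuming the standard scikit-learn APIs rather than the exact code used later in the lesson), both k-means and PCA operate on a standardized version of the feature matrix:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Standardize the features so each column has mean 0 and variance 1;
# k-means and PCA are both sensitive to the scale of the input columns.
feat_scaled = StandardScaler().fit_transform(feat_df)

# Cluster the block groups into 4 groups (the choice of 4 is arbitrary here):
clusters = KMeans(n_clusters=4, n_init=10).fit_predict(feat_scaled)

# Project the 8 standardized features down to 2 components:
components = PCA(n_components=2).fit_transform(feat_scaled)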