What is Unsupervised Learning?#
If a method is an “unsupervised learning” method, it can be used to deduce facts about a dataset without being shown any examples of those facts. A “supervised learning” method, in contrast, studies a set of example facts in order to learn how to predict the facts appropriate for unseen data. For example, suppose you make a set of detailed measurements of the nutritional and health profiles of several rabbits over a number of years. To analyze these data, you could start by attempting to cluster the rabbits into groups with similar nutritional profiles. Such an analysis would be an unsupervised learning approach because it assigns a class to each rabbit without ever seeing examples of correctly classified rabbits. After performing this clustering, you might then try to understand how nutrition and health are related in the dataset by sorting the rabbits according to the overall quality of their nutritional profiles and assessing which characteristics of a rabbit’s nutritional profile lead to improvements in its health. Such an assessment would be a supervised learning approach because it learns the association between nutrition and health from examples (the rabbits in the dataset).
Common types of unsupervised learning include clustering, as in the example above, as well as dimensionality reduction, outlier detection, and kernel density estimation.
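Each of these tasks has a corresponding family of estimators in scikit-learn. The sketch below is illustrative only: the class names are real scikit-learn APIs, but the tiny random dataset is made up for demonstration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity, LocalOutlierFactor

# A small, made-up 2-D dataset for illustration:
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))

# Clustering: assign each point to one of two groups.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)

# Dimensionality reduction: project each point onto a single axis.
projected = PCA(n_components=1).fit_transform(points)

# Outlier detection: flag points far from their neighbors (-1 marks outliers).
outlier_flags = LocalOutlierFactor().fit_predict(points)

# Kernel density estimation: estimate the log-density of the data at each point.
log_density = KernelDensity().fit(points).score_samples(points)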
Why use Unsupervised Learning?#
Frequently in research we do not have the luxury of correct examples from which to train a supervised machine learning algorithm. In such cases, we attempt to deduce as much as possible from the data itself: how do the dimensions of the data vary, what patterns of variability appear, and can the data be clustered into classes based on simple notions of distance?
Unsupervised learning is often a useful starting point for more detailed analyses because several unsupervised methods excel at simplifying data. For example, in a high-dimensional dataset like a survey with hundreds of questions, it can be very hard to find patterns and similarities in responses by hand, but a dimensionality reduction technique makes it easy to visualize groups of respondents whose answers covary similarly. Once observed, such a relationship can then be tested more formally with a model.
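As a rough sketch of this idea (the survey data below is entirely made up), dimensionality reduction can compress hundreds of answer columns into two coordinates per respondent, which can then be plotted and inspected for groups:

import numpy as np
from sklearn.decomposition import PCA

# A made-up survey: 300 respondents answering 100 questions on a 1-5 scale.
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(300, 100)).astype(float)

# Reduce the 100 answer columns to 2 coordinates per respondent:
coords = PCA(n_components=2).fit_transform(responses)
# coords has shape (300, 2): easy to plot and inspect for groups of respondents.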
The California Housing Dataset#
To learn about any machine learning or artificial intelligence method, it’s always useful to have an example dataset, and for these lessons we will use the California Housing Dataset, which can be loaded by the Scikit-learn library. This dataset contains observations about the makeup and locations of households across California based on the 1990 U.S. Census.
# First, we need to import scikit-learn's datasets module (importing the
# top-level sklearn package alone does not load its submodules):
from sklearn import datasets
# Next, we use scikit-learn to download and return the CA housing dataset:
ca_housing_dataset = datasets.fetch_california_housing()
# Extract the actual data rows and the feature names:
ca_housing_featdata = ca_housing_dataset['data']
ca_housing_featnames = ca_housing_dataset['feature_names']
# We also extract the "target" data:
ca_housing_targdata = ca_housing_dataset['target']
ca_housing_targnames = ca_housing_dataset['target_names']
# Print the description of the dataset:
print(ca_housing_dataset['DESCR'])
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.
.. rubric:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33:291-297, 1997.
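As an aside, recent versions of scikit-learn (0.23 and later) can return the same dataset as pandas objects directly via the as_frame option; this is an alternative to the manual DataFrame construction we do below.

# Alternative (requires scikit-learn >= 0.23): fetch the dataset as pandas objects.
ca_housing_frame = datasets.fetch_california_housing(as_frame=True)
print(type(ca_housing_frame['data']))    # a pandas DataFrame of the 8 features
print(type(ca_housing_frame['target']))  # a pandas Series of the target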
Dataset Features#
The data from the CA housing dataset is represented as a NumPy array whose rows represent census block groups and whose columns represent house-related features (dimensions) of each block group, such as the median age of a block group’s houses and the average number of bedrooms per household. See the dataset description above for more information.
A quick inspection confirms that there are the same number of feature names in the dataset as there are columns in the data matrix.
print(f'Feature names: {len(ca_housing_featnames)}')
print(f'Data shape: {ca_housing_featdata.shape}')
Feature names: 8
Data shape: (20640, 8)
We can use Pandas to organize the feature matrix and feature names into a DataFrame.
# To organize the dataset into a dataframe, we use Pandas:
import pandas as pd
feat_df = pd.DataFrame(ca_housing_featdata, columns=ca_housing_featnames)
# Display the dataframe:
feat_df
|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |
| 20639 | 2.3886 | 16.0 | 5.254717 | 1.162264 | 1387.0 | 2.616981 | 39.37 | -121.24 |
20640 rows × 8 columns
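With the features in a DataFrame, a convenient next step (a quick sanity check, not part of the analysis proper) is to look at summary statistics for each column:

# Per-column summary statistics (count, mean, std, min, quartiles, max):
feat_df.describe()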
Target Data#
The CA housing dataset contains a “target” variable representing the median house value of each census block group, expressed in units of $100,000. Although we won’t be using this target value to train our methods (because we will be studying unsupervised methods in this lesson), we will want to look at it periodically to see whether our unsupervised methods have deduced anything useful.
We can keep track of the target data by making it into its own DataFrame.
targ_df = pd.DataFrame({ca_housing_targnames[0]: ca_housing_targdata})
# Display the dataframe:
targ_df
|  | MedHouseVal |
|---|---|
| 0 | 4.526 |
| 1 | 3.585 |
| 2 | 3.521 |
| 3 | 3.413 |
| 4 | 3.422 |
| ... | ... |
| 20635 | 0.781 |
| 20636 | 0.771 |
| 20637 | 0.923 |
| 20638 | 0.847 |
| 20639 | 0.894 |
20640 rows × 1 columns
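Since the target is expressed in units of $100,000, we can, as another quick sanity check, look at its range and center:

# The target's minimum, median, and maximum, in units of $100,000:
print(targ_df['MedHouseVal'].min(),
      targ_df['MedHouseVal'].median(),
      targ_df['MedHouseVal'].max())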
Lesson Goals#
In this lesson we will deduce what we can about the California Housing dataset and attempt to make predictions about the dataset’s target variable. Because this lesson is on unsupervised learning, we won’t be using the target data to train any of the methods in this lesson, but we will see whether we can learn about the target variable nonetheless.
By the end of this lesson, you should be comfortable with the following concepts:

- The k-means algorithm, which we will use first to put rows of the CA housing dataset into clusters based on their similarity.
- Principal Component Analysis (PCA), which we will use next to visualize and examine the covariance of the dataset.
- Locally Linear Embedding (LLE), which we will compare against the methods above to see some of the ways they can fail.
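As a small preview (a sketch assuming the standard scikit-learn APIs rather than the exact code used later in the lesson), both k-means and PCA operate on a standardized version of the feature matrix:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Standardize the features so each column has mean 0 and variance 1;
# k-means and PCA are both sensitive to the scale of the input columns.
feat_scaled = StandardScaler().fit_transform(feat_df)

# Cluster the block groups into 4 groups (the choice of 4 is arbitrary here):
clusters = KMeans(n_clusters=4, n_init=10).fit_predict(feat_scaled)

# Project the 8 standardized features down to 2 components:
components = PCA(n_components=2).fit_transform(feat_scaled)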