# What is Unsupervised Learning?

An "unsupervised learning" method deduces facts about a dataset without being shown any examples of those facts. A "supervised learning" method, in contrast, studies a set of example facts in order to learn how to predict the appropriate facts for unseen data.

For example, suppose you make detailed measurements of the nutritional and health profiles of several rabbits over a number of years. To analyze these data, you could start by clustering the rabbits into groups with similar nutritional profiles. This would be an unsupervised learning approach because it assigns a class to each rabbit without ever seeing an example of a correctly classified rabbit. After clustering, you might try to understand how nutrition and health are related by ranking the rabbits by the overall quality of their nutritional profiles and assessing which characteristics of a rabbit's nutritional profile are associated with improvements in its health. This would be a supervised learning approach because it learns the association between nutrition and health from examples: each rabbit in the dataset pairs a nutritional profile with a known health outcome.
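
The distinction can be seen in how the two kinds of method are fitted. The sketch below is only an illustration on synthetic, made-up "rabbit" data (the feature count, the coefficients, and the choice of k-means and linear regression are all assumptions, not part of the example above): the unsupervised method receives only the features, while the supervised method also receives the known answers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not real measurements): 100 hypothetical rabbits,
# each with 3 made-up nutritional features, plus a made-up health score.
rng = np.random.default_rng(0)
nutrition = rng.normal(size=(100, 3))
health = nutrition @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=100)

# Unsupervised: the method sees only the features and forms its own groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(nutrition)

# Supervised: the method is also given the known answers (health) to learn from.
regression = LinearRegression()
regression.fit(nutrition, health)
predicted_health = regression.predict(nutrition)

print(cluster_labels[:10])
print(predicted_health[:3])
```

Note that `fit_predict` takes a single argument, whereas `fit` for the regression takes both the features and the examples of correct output.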

Common types of unsupervised learning include [clustering](https://en.wikipedia.org/wiki/Cluster_analysis), as in the example above, as well as [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction), [outlier detection](https://en.wikipedia.org/wiki/Anomaly_detection), and [kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation).
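
As a rough illustration of the last two items in this list, the sketch below runs an outlier detector and a kernel density estimate on synthetic points. The data are made up, and `IsolationForest` is just one of several outlier detectors in scikit-learn, chosen here as an assumption for the example; the bandwidth of 0.5 is likewise arbitrary.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import KernelDensity

# Synthetic data: 200 points near the origin plus one planted far-away point.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])

# Outlier detection: lower scores mean "more anomalous".
forest = IsolationForest(random_state=0).fit(points)
anomaly_scores = forest.score_samples(points)
most_anomalous = int(np.argmin(anomaly_scores))

# Kernel density estimation: score_samples returns the log of the
# estimated density at each query point.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(points)
log_density = kde.score_samples(np.array([[0.0, 0.0], [8.0, 8.0]]))

print(most_anomalous)
print(log_density)
```

Neither method is told which point was planted; both work from the shape of the data alone.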

## Why use Unsupervised Learning?

Frequently in research we do not have the luxury of correct examples from which to train a supervised machine learning algorithm. In such cases, we attempt to deduce as much as possible from the data itself. We can ask how the dimensions of the data vary, what patterns of variability appear, and whether the data can be clustered into classes based on simple notions of distance.

Unsupervised learning is often a useful starting point for more detailed analyses because several unsupervised methods are good at simplifying data. For example, in a high-dimensional dataset like a survey with hundreds of questions, it can be very hard to find patterns and similarities in responses by hand, but a dimensionality reduction technique can make groups of respondents with similar answer patterns visible in a simple plot. Once observed, such a relationship can then be tested more formally with a model.
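
To make the survey example concrete, here is a minimal sketch using PCA on a synthetic survey. All of the numbers are made up, and the two hidden respondent groups are an assumption built into the fake data; the point is only that 200 columns can be reduced to 2 coordinates per respondent without telling the method which group anyone belongs to.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic survey (made-up numbers): 300 respondents, 200 questions.
# Two hypothetical groups of respondents answer around different typical values.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=300)          # hidden group of each respondent
typical_answers = rng.normal(size=(2, 200))   # one typical answer pattern per group
answers = typical_answers[group] + rng.normal(scale=0.5, size=(300, 200))

# Reduce the 200 question columns to 2 coordinates per respondent.
# The hidden `group` array is never shown to PCA.
pca = PCA(n_components=2)
coords = pca.fit_transform(answers)

print(coords.shape)
print(pca.explained_variance_ratio_)
```

A scatter plot of the two columns of `coords` would be the natural next step for spotting groups by eye.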

## The California Housing Dataset

To learn about any machine learning or artificial intelligence method, it's always useful to have an example dataset, and for these lessons we will use the [California Housing Dataset](https://www.spatial-statistics.com/pace_manuscripts/spletters_ms_dir/statistics_prob_lets/html/ms_sp_lets1.html), which can be loaded with the [scikit-learn library](https://scikit-learn.org/stable/). This dataset contains observations about the makeup and locations of households across California based on the 1990 U.S. Census.

In [None]:
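# First, we need to import scikit-learn. Importing the datasets submodule
# explicitly makes sure that skl.datasets is available regardless of the
# installed scikit-learn version:
import sklearn as skl
import sklearn.datasets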

# Next, we use scikit-learn to download and return the CA housing dataset:
ca_housing_dataset = skl.datasets.fetch_california_housing()

# Extract the actual data rows and the feature names:
ca_housing_featdata = ca_housing_dataset['data']
ca_housing_featnames = ca_housing_dataset['feature_names']

# We also extract the "target" data:
ca_housing_targdata = ca_housing_dataset['target']
ca_housing_targnames = ca_housing_dataset['target_names']

# Print the description of the dataset:
print(ca_housing_dataset['DESCR'])

### Dataset Features

The `data` from the CA housing dataset is a NumPy array in which each row represents a census block group and each column represents a house-related feature (dimension) of that block group, such as the median age of its houses or the average number of bedrooms per household. See the dataset description above for more information.

A quick inspection confirms that there are the same number of feature names in the dataset as there are columns in the data matrix.

In [None]:
print(f'Feature names: {len(ca_housing_featnames)}')
print(f'Data shape: {ca_housing_featdata.shape}')

We can use Pandas to organize the feature matrix and feature names into a `DataFrame`.

In [None]:
# To organize the dataset into a dataframe, we use Pandas:
import pandas as pd

feat_df = pd.DataFrame(ca_housing_featdata, columns=ca_housing_featnames)

# Display the dataframe:
feat_df

### Target Data

The CA housing dataset contains a "target" variable representing the median house value in each census block group, expressed in units of $100,000. Although we won't be using this target value in our methods (because we will be studying unsupervised methods in this lesson), we will want to look at it periodically to see if our unsupervised methods have deduced anything useful.

We can keep track of the target data by making it into its own DataFrame.

In [None]:
targ_df = pd.DataFrame({ca_housing_targnames[0]: ca_housing_targdata})

# Display the dataframe:
targ_df
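
One way to "look at" the target without training on it is to run an unsupervised method on the features alone and only afterwards summarize the target within each group it found. The sketch below shows this idea on synthetic stand-in data; `mean_target_by_cluster` is a hypothetical helper written for this illustration (it is not part of scikit-learn or Pandas), and the use of k-means with two clusters is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def mean_target_by_cluster(features, target, n_clusters=2, seed=0):
    """Cluster `features` without the target, then average the target per cluster.

    Hypothetical helper for illustration; not part of scikit-learn or Pandas.
    """
    # The target is NOT passed to fit_predict: the clustering stays unsupervised.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(features)
    # Only afterwards do we consult the target to see what the clusters captured.
    return pd.Series(np.asarray(target)).groupby(labels).mean()

# Synthetic stand-in data (made up): two well-separated blobs whose targets differ.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                      rng.normal(6.0, 1.0, size=(50, 2))])
target = np.concatenate([rng.normal(1.0, 0.1, size=50),
                         rng.normal(3.0, 0.1, size=50)])

cluster_means = mean_target_by_cluster(features, target)
print(cluster_means)
```

With the variables defined in this notebook, a call might look like `mean_target_by_cluster(ca_housing_featdata, ca_housing_targdata, n_clusters=5)`, where the choice of 5 clusters is arbitrary. Because k-means relies on distances, features measured on large scales (such as population) would dominate the result unless the columns are rescaled first.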

## Lesson Goals

In this lesson we will deduce what we can about the California Housing dataset and attempt to make predictions about the dataset's target variable. Because this lesson is on unsupervised learning, we won't use the target data to train any of the methods, but we will see whether we can learn about the target variable nonetheless.

By the end of this lesson, you should be comfortable with the following concepts:
* The k-means algorithm, which we'll use to put rows of the CA housing dataset into clusters based on their similarity.
* Principal Component Analysis (PCA), which we'll use to visualize and examine the covariance of the dataset.
* Locally Linear Embedding (LLE), which we'll compare against k-means and PCA to see some of the ways those methods can fail.
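
As a small preview of the last comparison, the sketch below applies PCA and LLE to scikit-learn's synthetic "swiss roll", a flat sheet rolled up in three dimensions. This is not the CA housing data, and the parameter values (500 samples, 12 neighbors) are assumptions picked for the illustration; how well LLE unrolls the sheet depends on them.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic 3-D "swiss roll": a 2-D sheet rolled up in three dimensions.
points, position_along_roll = make_swiss_roll(n_samples=500, random_state=0)

# PCA is linear: it can only project the roll onto a flat plane.
pca_coords = PCA(n_components=2).fit_transform(points)

# LLE is nonlinear: it tries to preserve each point's local neighborhood.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
lle_coords = lle.fit_transform(points)

print(pca_coords.shape, lle_coords.shape)
```

Coloring a scatter plot of each result by `position_along_roll` is one way to see how differently the two methods treat the curved structure.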

## Additional Resources

* [Unsupervised Learning at Wikipedia](https://en.wikipedia.org/wiki/Unsupervised_learning)
* [Unsupervised Learning algorithms in Scikit-Learn](https://scikit-learn.org/stable/unsupervised_learning.html)