# Random Forests

Random forests are one of the most flexible and broadly useful AI/ML methods available.
They're sometimes considered the "Swiss army knife" of supervised AI/ML methods because they can handle many kinds of data and are often accurate even when other methods struggle.
In this section, we'll look at how they work and how to apply them to our example dataset.

## How do random forests work?

Random forests are built from many instances of another kind of machine learning algorithm called a **decision tree**. The idea behind a decision tree is that a dataset can be split into two subsets based on some feature, then each of those subsets can be split again on a (possibly different) feature, and so on, down to some depth. After enough splits, the subsets are small, and a simple prediction can be made for the rows in each one (for regression, typically the mean of the target values in that subset). For example, a decision tree modeling the CA Housing Dataset might initially split the data on the number of bedrooms, with all rows with fewer than 3 bedrooms going into one subset and all other rows going into another. Each split is typically chosen to reduce the prediction error as much as possible, often with some randomization (such as considering only a random subset of the features), and trees can be generated very quickly. (In this section, we'll use random forests for regression via the `sklearn.ensemble.RandomForestRegressor` class; a similar class, [`sklearn.ensemble.RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), can be used for classification.)
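
To make the bedroom example concrete, here is a minimal sketch of a single split. The data values are made up for illustration (they are not taken from the CA Housing Dataset), and the sketch assumes the simplest possible model for each subset: the mean of its target values.

```python
import numpy as np

# Toy data (hypothetical values): one feature (number of bedrooms)
# and one target (price, in arbitrary units).
bedrooms = np.array([1, 2, 2, 3, 4, 5])
price = np.array([1.0, 1.2, 1.4, 2.0, 2.6, 3.0])

# Split the rows on the rule "bedrooms < 3".
left = bedrooms < 3
right = ~left

# Each subset predicts the mean target value of the rows it contains.
left_prediction = price[left].mean()
right_prediction = price[right].mean()

def predict(n_bedrooms):
    """Predict a price using only the single split above."""
    return left_prediction if n_bedrooms < 3 else right_prediction

print(predict(2), predict(4))
```

A full decision tree repeats this step inside each subset, so that every leaf ends up with its own prediction.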

One downside of decision trees is that they are prone to overfitting. Random forests address this by training many decision trees, each on a different randomly drawn sample of the dataset (usually a bootstrap sample, in which rows are drawn with replacement), and averaging their predictions to form the final model. Although each decision tree is likely to be overfit, the trees are very unlikely to be overfit in the same way, so their average will usually be less overfit than any one decision tree.
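
The averaging idea can be sketched by hand. The example below is an illustration under assumptions, not how `RandomForestRegressor` is implemented internally: it uses synthetic data (a noisy sine curve), draws each tree's training rows with replacement (a bootstrap sample), and averages the predictions of 25 trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy sine curve.
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# Train each tree on a bootstrap sample (rows drawn with replacement).
tree_predictions = []
for seed in range(25):
    rows = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor(random_state=seed)
    tree.fit(X_train[rows], y_train[rows])
    tree_predictions.append(tree.predict(X_test))
tree_predictions = np.array(tree_predictions)

# The ensemble prediction is the average over all trees.
ensemble_prediction = tree_predictions.mean(axis=0)

# Compare mean squared errors on the held-out rows.
tree_mses = ((tree_predictions - y_test) ** 2).mean(axis=1)
ensemble_mse = ((ensemble_prediction - y_test) ** 2).mean()
print("mean single-tree MSE:", tree_mses.mean())
print("averaged-ensemble MSE:", ensemble_mse)
```

The squared error of an average can never exceed the average of the individual squared errors, so the ensemble is at least as good as a typical single tree; how much better depends on how different the trees are from one another.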

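Random forests do not use cross validation internally. Instead, each tree is typically trained on a random sample of the rows, and the rows left out of a given tree's sample (the "out-of-bag" rows) can be used to estimate how well the model generalizes. Even so, you should still use cross validation yourself when training a random forest.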
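
As a rough sketch of what validating a random forest yourself might look like, the example below runs 5-fold cross validation with Scikit-learn's `cross_val_score` on synthetic data (the data and the hyperparameter values are assumptions chosen for illustration). It also prints the out-of-bag estimate that Scikit-learn computes when `oob_score=True` is passed, which scores each row using only the trees that did not see it during training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration).
X = rng.uniform(-1, 1, size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)

forest = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)

# 5-fold cross validation: each fold is held out once and scored (R^2)
# by a model trained on the other four folds.
cv_scores = cross_val_score(forest, X, y, cv=5)
print("cross-validated R^2 per fold:", cv_scores)

# Out-of-bag estimate: fit once on all rows, then read the score that was
# computed from the rows each tree did not see.
forest.fit(X, y)
print("out-of-bag R^2:", forest.oob_score_)
```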

### Limitations and Advantages of Random Forests

**Limitations**
* Random forests are very poor at extrapolation.
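* A random forest has no coefficients that tell you whether an input parameter is positively or negatively related to the output. Scikit-learn does report a `feature_importances_` attribute for a trained forest, but it describes only the relative magnitude of each parameter's influence, not its direction. (If you want to know if a parameter is negatively or positively related to the outputs, then linear regression is a better choice.)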
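
The extrapolation limitation is easy to demonstrate. In the sketch below (synthetic data, chosen for illustration), the forest is trained on the line `y = 2x` for `x` between 0 and 10 and then asked about `x = 20`. Because every tree predicts an average of training targets, the prediction cannot exceed the largest target seen in training, even though the true value would be 40.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data: y = 2x for x in [0, 10].
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train[:, 0]

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# One point inside the training range and one far outside it.
X_new = np.array([[5.0], [20.0]])
predictions = forest.predict(X_new)

print("prediction at x=5: ", predictions[0])
print("prediction at x=20:", predictions[1])
print("largest training target:", y_train.max())
```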

**Advantages**
* Random forests tend to be highly accurate for many kinds of data.
* Random forests tend to be very robust to outliers.
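* Random forests can handle missing data well, although this depends on the implementation: some versions of Scikit-learn require missing values to be filled in before training.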

## Example: the California Housing Dataset

We'll use a random forest to try to predict the median housing prices in the CA Housing Dataset. We can start by loading the dataset as usual.

In [None]:
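import sklearn as skl
# Submodules such as sklearn.datasets are not guaranteed to be loaded by
# `import sklearn` alone, so we import the one we need explicitly:
import sklearn.datasets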

# We use scikit-learn to download and return the CA housing dataset:
ca_housing_dataset = skl.datasets.fetch_california_housing()

# Extract the actual data rows and the feature names:
ca_housing_featdata = ca_housing_dataset['data']
ca_housing_featnames = ca_housing_dataset['feature_names']

# We also extract the "target" data, since we are using supervised learning:
ca_housing_targdata = ca_housing_dataset['target']
ca_housing_targnames = ca_housing_dataset['target_names']

As in the previous section, we'll split the dataset into train and test subdatasets.

In [None]:
import numpy as np
# We set a specific random seed to make sure that this notebook runs the same way each time.
np.random.seed(0)

# Randomly select 75% of the rows to be in the training dataset.
all_rows = np.arange(ca_housing_featdata.shape[0])
n_train = int(round(len(all_rows) * 0.75))
n_test = len(all_rows) - n_train
train_rows = np.random.choice(all_rows, n_train, replace=False)
test_rows = np.setdiff1d(all_rows, train_rows)

# Extract these rows into separate matrices:
train_featdata = ca_housing_featdata[train_rows]
train_targdata = ca_housing_targdata[train_rows]
test_featdata = ca_housing_featdata[test_rows]
test_targdata = ca_housing_targdata[test_rows]
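
For comparison, Scikit-learn's `train_test_split` function performs the same kind of random split in one call. The sketch below uses small stand-in arrays rather than the housing data (the names `features` and `targets` are hypothetical), and it will not select the same rows as the NumPy code above because the two draw their random numbers differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays (hypothetical): 20 rows with 2 features each.
features = np.arange(40).reshape(20, 2)
targets = np.arange(20)

# Randomly place 75% of the rows in the training set and the rest in the
# test set; random_state makes the split repeatable.
train_feat, test_feat, train_targ, test_targ = train_test_split(
    features, targets, train_size=0.75, random_state=0)

print(train_feat.shape, test_feat.shape)
```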

Next, we can create the random forest object. We'll give it a few *hyperparameters*: `max_depth`, which sets the maximum depth of each decision tree (the largest number of successive splits between the root and any leaf), and `random_state`, which ensures that the randomized choices made by the algorithm are repeatable. We'll use `random_state=0` here, but you can try different random states to see how the results vary across runs.

In [None]:
from sklearn.ensemble import RandomForestRegressor

randforest = RandomForestRegressor(
    max_depth=6,
    random_state=0)

# Next, we train the random forest with our data.
randforest.fit(train_featdata, train_targdata)

There are a variety of other hyperparameters that can be given to the `RandomForestRegressor`, such as the number of estimators to make and average together (`n_estimators`), that are outside the scope of this lesson. See the [Scikit-learn documentation on random forests](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) for more information on these options.
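
As a small illustration of how these hyperparameters shape the trained model, the sketch below fits a forest on synthetic data (an assumption for illustration) and then checks that the number of trees matches `n_estimators` and that no tree is deeper than `max_depth`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration).
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

forest = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)
forest.fit(X, y)

# One fitted decision tree per estimator.
print("number of trees:", len(forest.estimators_))

# No tree may be deeper than max_depth.
depths = [tree.get_depth() for tree in forest.estimators_]
print("deepest tree:", max(depths))
```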

For now, let's see how well the model performs on our test dataset. As with the `LinearRegression` type, we can use the `score` method to obtain the coefficient of determination for the model and the test data.

In [None]:
randforest.score(test_featdata, test_targdata)

The random forest appears to explain about 70–71% of the variance in the test dataset; that's somewhat better than the linear regression model we saw earlier.
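
The coefficient of determination returned by `score` can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on synthetic data (an assumption for illustration; it does not reproduce the housing result above).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration).
X = rng.uniform(-1, 1, size=(400, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=400)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

forest = RandomForestRegressor(max_depth=6, random_state=0)
forest.fit(X_train, y_train)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
predicted = forest.predict(X_test)
ss_res = np.sum((y_test - predicted) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("manual R^2:  ", r_squared)
print("score method:", forest.score(X_test, y_test))
```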

### What kind of data does the `RandomForestRegressor` provide?

As we've seen with other ML tools in Scikit-learn, the `RandomForestRegressor` provides us with some data about the trained model. In the `LinearRegression` type, these data were the coefficients associated with each feature in the regression. For random forests, these data are the individual decision tree estimators. We can see the list of estimators by examining the `estimators_` member variable of the `randforest` object we created.

In [None]:
randforest.estimators_

As we can see, the `estimators_` variable is a list of `DecisionTreeRegressor` objects. This makes sense, since random forests are just collections of decision trees. Let's take a closer look at one of the trees in our forest. The Scikit-learn library includes a function `plot_tree` in the `sklearn.tree` subpackage that can be used to visualize a decision tree as part of a `matplotlib` figure.

In [None]:
import matplotlib.pyplot as plt
# Explicitly import the submodule that provides plot_tree:
import sklearn.tree

# We'll look at the first tree:
tree = randforest.estimators_[0]

# Make a figure; we have to make the figure quite large in order for all of
# the text and all the nodes in the tree to be visible!
(fig,ax) = plt.subplots(1, 1, figsize=(24,12), dpi=72*8)

# Plot the tree:
skl.tree.plot_tree(tree, ax=ax)

plt.show()

The tree has very small text in its cells, so you may need to open the image in a new browser tab and zoom in in order to read it. Essentially, each node in the tree details the condition for splitting the data. In the root node of the tree, for example, the data are split according to the rule `x[0] <= 5.715`; the `x[0]` here indicates the first feature used in the training (the median income in a region of CA for our dataset).
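
If the plot is hard to read, the same rule can be read directly from the fitted tree. The sketch below is an illustration on synthetic data with made-up feature names (`feat_a`, `feat_b`, `feat_c` are hypothetical); it uses the `tree_` attribute of a fitted `DecisionTreeRegressor`, whose `feature` and `threshold` arrays hold the split rule of each node, with node 0 being the root.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["feat_a", "feat_b", "feat_c"]  # hypothetical names

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the target depends
# mostly on the first feature.
X = rng.uniform(0, 10, size=(300, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)
tree = forest.estimators_[0]

# Node 0 is the root; feature[0] is the index of the feature it splits on
# and threshold[0] is the split value.
root_feature = tree.tree_.feature[0]
root_threshold = tree.tree_.threshold[0]
print(f"root rule: {feature_names[root_feature]} <= {root_threshold:.3f}")
```

The `plot_tree` function also accepts a `feature_names` argument, which replaces labels such as `x[0]` with the given names in the figure.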

One of the nice features of decision trees and random forests is that the trees themselves can be examined and understood: just by looking through the nodes of this tree, we can get a general sense of how the algorithm has decided to calculate a prediction. The Scikit-learn library additionally includes a number of utilities and tutorials related to decision trees and how to evaluate and examine them. In particular, see the Scikit-learn examples on [understanding the decision tree structure](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html) and on [decision tree regression](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html).
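
Because the individual trees are available, we can also confirm how they are combined. The sketch below (synthetic data, chosen for illustration) averages the predictions of every tree in `estimators_` by hand and compares the result with the forest's own `predict` method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration).
X = rng.uniform(-1, 1, size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=15, max_depth=5, random_state=0)
forest.fit(X, y)

# Predict a few new rows with every tree, then average across trees.
X_new = rng.uniform(-1, 1, size=(5, 3))
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("matches forest.predict:",
      np.allclose(manual_average, forest.predict(X_new)))
```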