Feed-Forward Neural Networks#

All of the neural networks we will discuss in this course (including convolutional neural networks and the networks diagrammed in the introduction) are feed-forward neural networks, meaning that information always flows forward from one layer to the next, through to the output layer. There are other kinds of neural networks, such as recurrent neural networks, which allow information from later (more efferent) layers to be fed back into earlier (more afferent) layers. These kinds of networks typically excel at modeling time-series data, but they can take longer to train because training must simulate some notion of time for the model.

This section will introduce the concepts of feed-forward neural networks by example in PyTorch. We will start by building a simple feed-forward PyTorch model and then experiment with various components of neural networks that we can add to it, evaluating its performance on the MNIST dataset as we go.

The MNIST Dataset#

For our dataset, we’ll use the MNIST image dataset that was introduced in the previous section. We’ll use the same code to load the dataset here as we did there.

from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from pathlib import Path

train_dset = MNIST(Path.home(), download=True, train=True, transform=ToTensor())
test_dset = MNIST(Path.home(), download=True, train=False, transform=ToTensor())

Building a Simple Linear Neural Network#

To start with, let’s build a very simple neural network with no activation function. This will make it functionally no different from a linear transformation, but it’s a perfectly good place to start!
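
In fact, it’s worth seeing why such a network is nothing more than a linear transformation: the composition of two linear layers with no activation function between them is itself a linear transformation. If \(W_1, b_1\) and \(W_2, b_2\) are the weights and biases of the two layers, then

\[
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2),
\]

which is equivalent to a single linear layer with weights \(W_2 W_1\) and bias \(W_2 b_1 + b_2\).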

We’ll build the model using PyTorch’s Module class, constructing its internal components from the classes defined in PyTorch’s torch.nn subpackage.

import torch

class NNModel(torch.nn.Module):
    "A simple neural network model for the MNIST dataset."
    # The image shape for MNIST images is 28x28 pixels; this is the number of
    # features in the MNIST inputs.
    def __init__(self,
                 input_shape=(28, 28),  # The shape of the MNIST images.
                 hidden_features=1024,  # The number of hidden layer neurons.
                 output_features=10):   # The number of output features.
        # Always start Modules with calls to the superclass.
        super().__init__()
        # The next layer will be a simple linear transformation; i.e., an
        # all-to-all set of connections.
        # nn.Linear expects a plain integer number of input features.
        input_features = int(torch.prod(torch.tensor(input_shape)))
        self.input_to_hidden = torch.nn.Linear(
            input_features,
            hidden_features)
        # Then we transform from hidden features to output features.
        self.hidden_to_output = torch.nn.Linear(
            hidden_features,
            output_features)
    def forward(self, inputs):
        # Keep in mind that inputs has shape (N, C, H, W) where H and W are
        # the image height and width in pixels, N is the batch size, and C is
        # the number of channels, which is always 1 for MNIST because they are
        # grayscale images.
        # First, we want to flatten the image matrix dimensions into a single
        # dimension vector, which is how the linear neural network layers
        # expect their inputs.
        inputs = torch.reshape(inputs, (inputs.shape[0], inputs.shape[1], -1))
        # Then we transform from input to hidden layer.
        hidden = self.input_to_hidden(inputs)
        # Then from hidden to output layer.
        output = self.hidden_to_output(hidden)
        # At this point, the output is size (N, 1, 10): i.e., batch-size (N),
        # then channels (1) then output categories (10). There's no real
        # reason to keep the channels dimension, though; the output values,
        # one per digit, are essentially channels.
        output = output[:, 0, :]
        return output
    # We can add a function for predicting the precise digit from the model
    # outputs.
    def probabilities(self, inputs):
        """Returns a 10-dimensional vector of probabilities that a particular
        input represents each of the 10 digits.

        This model's outputs are a 10-element tensor in which each of the 10
        dimensions represents one digit; the dimension with the highest value
        indicates the model's predicted digit. This function runs the model on
        an input and translates the model's output into a confidence
        (probability) that the image represents each of the possible digits.
        """
        outputs = self(inputs)
        # Keep in mind there will be a batch dimension for inputs and outputs.
        # We want to use a sigmoid function to convert these numbers into
        # probabilities.
        probs = torch.sigmoid(outputs)
        # We should also normalize the probabilities so that they sum to 1
        # across the 10 digits for each sample in the batch.
        probs = probs / torch.sum(probs, dim=-1, keepdim=True)
        return probs
    def predict(self, inputs):
        """Returns the integer digit prediction for the given input tensor.

        This model's outputs are a 10-element tensor in which each of the 10
        dimensions represents one digit; the dimension with the highest value
        indicates the model's predicted digit. This function runs the model on
        an input and translates the model's output into a digit.
        """
        outputs = self(inputs)
        # Keep in mind there will be a batch dimension for inputs and outputs.
        digits = torch.argmax(outputs, dim=-1)
        return digits.to(torch.uint8)

model = NNModel()
model
NNModel(
  (input_to_hidden): Linear(in_features=784, out_features=1024, bias=True)
  (hidden_to_output): Linear(in_features=1024, out_features=10, bias=True)
)

Notice that the number of output channels in the network is 10. This is because we will train the network to produce 10 numbers, one for each digit (0–9), and we will understand the model’s prediction of the digit represented in an image to be the index of the output with the largest value.

In other words, the model will take an image as input and will produce numbers like output = [12.31, -432.50, 1462.74, -755.75, -20501.06, -9610.48, 837.09, -2768.50, 1409.91, -7366.86]. In that output, the largest number is 1462.74, which is output[2], so we take 2 to be the digit that the model predicted.
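
As a minimal sketch of that last step (using the made-up output values above), torch.argmax performs exactly this index selection:

import torch

# A hypothetical model output for one image: 10 values, one per digit.
output = torch.tensor([12.31, -432.50, 1462.74, -755.75, -20501.06,
                       -9610.48, 837.09, -2768.50, 1409.91, -7366.86])

# The predicted digit is the index of the largest value.
print(int(torch.argmax(output)))  # 2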

Note

Some of the terminology regarding model inputs, such as the difference between channels, features, and dimensions, can be somewhat confusing; it can be helpful to remember that, especially in PyTorch, many of the conventions are organized around convolutional neural networks that operate on images.

  • The input dimensions consist of every individual number that is part of a sample input. For example, if the input to a model consists of RGB images (3 channels) that are \(64 \times 64\) pixels in size, then the number of input dimensions is \(3 \times 64 \times 64 = 12,288\). (A short sketch of these shapes appears after this list.)

  • Traditional RGB images have 3 or 4 channels (red, green, blue, and sometimes alpha, which represents transparency), and, in the context of an image, the channels, along with the pixel locations, provide an additional level of structure. In other words, knowing that an input dimension, such as the R channel of the pixel in the 10th row and the 5th column, shares its channel with another dimension tells us something about how the two dimensions are related. Similarly, knowing that two dimensions share the same pixel location but have different channels tells us something important about their relationship. Fundamentally, however, channels are not the same as dimensions; they are a layer of organization that lies on top of the dimensions.

  • The word feature is used to mean something like dimension or something like channel, depending on context. In most cases, a feature is equivalent to a dimension. However, with convolutional neural networks (CNNs), the convolutional kernels produce output channels that do not correspond to the original input image’s channels; these convolved channels are often called feature maps. We will discuss feature maps in the next lesson on CNNs.
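
To make the distinction between dimensions and channels concrete, here is a minimal sketch using a random stand-in tensor (not real image data):

import torch

# A batch of 8 hypothetical RGB images, each 64x64 pixels: shape (N, C, H, W).
batch = torch.rand(8, 3, 64, 64)
print(batch.shape)  # torch.Size([8, 3, 64, 64])

# Flattening each image into a single vector reveals the total number of input
# dimensions per sample: 3 * 64 * 64 = 12,288.
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)  # torch.Size([8, 12288])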

Training the Neural Network#

Training our neural network is going to look substantially similar to training the nonlinear model in the previous lesson. There are a few important differences, however:

  • Instead of the SGD (stochastic gradient descent) optimizer, we’ll use Adam, which is a similar gradient-descent-based optimization strategy that is known to work well with neural networks. Adam’s interface is almost the same as SGD’s.

  • Instead of the BCELoss or the MSELoss, we’ll use what’s called the cross-entropy loss: CrossEntropyLoss. This loss function is a generalized version of the BCELoss (Binary Cross Entropy Loss) for training problems with multiple classes instead of just two. It is already implemented by PyTorch and is known to work well for evaluating the match of categorical outputs, such as the prediction of a discrete digit in this case, to a category label. The details of how this loss function works are beyond the scope of this course, but, in brief, the cross-entropy loss is low when the model outputs a high value in the channel matching the target and low values in all other channels, and it grows the further the output is from that pattern. More information can be found in the PyTorch documentation for CrossEntropyLoss. (A small example is sketched just after this list.)
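
To get a feel for the cross-entropy loss, here is a minimal sketch with made-up numbers (the outputs and targets below are hypothetical, not from our model):

import torch

loss_fn = torch.nn.CrossEntropyLoss()

# Two identical made-up model outputs (10 raw scores, one per digit) paired
# with two different target digits.
outputs = torch.tensor([[0.1, 0.2, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                        [0.1, 0.2, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
targets = torch.tensor([2, 7])

# The first output puts its high value in channel 2 and its target is 2, so
# its loss is small; the second output's target is 7, so its loss is large.
print(float(loss_fn(outputs[:1], targets[:1])))  # roughly 0.06
print(float(loss_fn(outputs[1:], targets[1:])))  # roughly 5.06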

# Hyperparameters:
n_epochs = 5       # 1 epoch: you show all your training data to your model once
lr = 0.001         # We use a fairly low learning rate.
batch_size = 1000  # How many images in one training batch.

# Make the model:
model = NNModel()

# Make the optimizer:
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# Declare our loss function:
loss_fn = torch.nn.CrossEntropyLoss(reduction='sum')

# Make the dataloaders:
train_dloader = torch.utils.data.DataLoader(train_dset, batch_size=batch_size, shuffle=True)
test_dloader = torch.utils.data.DataLoader(test_dset, batch_size=batch_size, shuffle=True)

# Now we start the optimization loop:
for epoch_num in range(n_epochs):
    # Put the model in train mode:
    model.train()
    # In each epoch, we go through each training sample once; the dataloader
    # gives these to us in batches:
    total_train_loss = 0
    for (inputs, targets) in train_dloader:
        # We're starting a new step, so we reset the gradients.
        optimizer.zero_grad()
        # Calculate the model prediction for these inputs.
        preds = model(inputs)
        # Calculate the loss between the prediction and the actual outputs.
        train_loss = loss_fn(preds, targets)
        # Have PyTorch backward-propagate the gradients.
        train_loss.backward()
        # Have the optimizer take a step:
        optimizer.step()
        # Add up the total training loss:
        total_train_loss = total_train_loss + train_loss
    mean_train_loss = (total_train_loss / len(train_dset)).detach()
    # Now that we've finished training, put the model back in evaluation mode.
    model.eval()
    # Evaluate the model using the test data.
    total_test_loss = 0
    for (inputs, targets) in test_dloader:
        preds = model(inputs)
        test_loss = loss_fn(preds, targets)
        total_test_loss = total_test_loss + test_loss
    mean_test_loss = (total_test_loss / len(test_dset)).detach()
    # Print something about this step:
    print(f"Epoch {epoch_num:2d}:"
          f"  train loss={mean_train_loss:6.3f};"
          f"  test loss={mean_test_loss:6.3f}")
# After the optimizer has run, print out what it's found:
print("Final result:")
print(f"  train loss = ", float(mean_train_loss))
print(f"   test loss = ", float(mean_test_loss))
Epoch  0:  train loss= 0.558;  test loss= 0.306
Epoch  1:  train loss= 0.297;  test loss= 0.274
Epoch  2:  train loss= 0.279;  test loss= 0.272
Epoch  3:  train loss= 0.274;  test loss= 0.253
Epoch  4:  train loss= 0.264;  test loss= 0.231
Final result:
  train loss =  0.2644980251789093
   test loss =  0.2310202419757843

Evaluating the Model Accuracy#

To evaluate our model, we have an overall loss value from the test dataset, but on its own this number doesn’t mean anything very specific. Let’s try to get a sense of our model’s performance by looking at some specific examples from the test dataset.

correct = 0
total = 0
for k in range(10):
    (samp_im, samp_targ) = test_dset[k]
    pred = model.predict(samp_im[None, ...])
    print(f'Image {k}: {int(pred)} ({samp_targ})')
    total += 1
    correct += (int(pred) == samp_targ)
print(f"Accuracy for this subset: {correct * 100 / total}%")
Image 0: 7 (7)
Image 1: 2 (2)
Image 2: 1 (1)
Image 3: 0 (0)
Image 4: 4 (4)
Image 5: 1 (1)
Image 6: 4 (4)
Image 7: 9 (9)
Image 8: 6 (5)
Image 9: 9 (9)
Accuracy for this subset: 90.0%

Hopefully it’s clear that, despite being a simple linear model, our network is doing quite well at this classification task!
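
If we want a more complete picture than ten hand-picked examples, we can run the same comparison over the entire test set. Here is a minimal sketch that reuses the test_dloader defined above (the exact accuracy will vary from training run to training run):

correct = 0
total = 0
with torch.no_grad():
    for (inputs, targets) in test_dloader:
        preds = model.predict(inputs)
        correct += int((preds == targets).sum())
        total += len(targets)
print(f"Test-set accuracy: {correct * 100 / total:.1f}%")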

Adding Activation Functions to Our Network#

Let’s make our neural network a little more interesting. We can add activation functions to introduce nonlinearities into the model, potentially allowing it to capture more complex relationships. We’ll use a network very similar to the one before, but with a couple more layers.

class ActivatedNNModel(torch.nn.Module):
    "A simple neural network model for the MNIST dataset with activation."
    def __init__(self,
                 input_shape=(28, 28),  # The shape of the MNIST images.
                 hidden_features=1024,  # The number of hidden layer neurons.
                 output_features=10):   # The number of output features.
        super().__init__()
        input_features = int(torch.prod(torch.tensor(input_shape)))
        # Instead of 1 hidden layer, we'll have 2 layers, each with a ReLU
        # operator immediately after them.
        self.input_to_hidden1 = torch.nn.Linear(
            input_features,
            hidden_features)
        self.relu1 = torch.nn.ReLU()
        self.hidden1_to_hidden2 = torch.nn.Linear(
            hidden_features,
            hidden_features)
        self.relu2 = torch.nn.ReLU()
        self.hidden2_to_output = torch.nn.Linear(
            hidden_features,
            output_features)
    def forward(self, inputs):
        inputs = torch.reshape(inputs, (inputs.shape[0], inputs.shape[1], -1))
        hidden1 = self.input_to_hidden1(inputs)
        hidden1 = self.relu1(hidden1)
        hidden2 = self.hidden1_to_hidden2(hidden1)
        hidden2 = self.relu2(hidden2)
        output = self.hidden2_to_output(hidden2)
        output = output[:, 0, :]
        return output
    def predict(self, inputs):
        """Returns the integer digit prediction for the given input tensor.

        This model's outputs are a 10-element tensor in which each of the 10
        dimensions represents one digit; the dimension with the highest value
        indicates the model's predicted digit. This function runs the model on
        an input and translates the model's output into a digit.
        """
        outputs = self(inputs)
        # Keep in mind there will be a batch dimension for inputs and outputs.
        digits = torch.argmax(outputs, dim=-1)
        return digits.to(torch.uint8)

model = ActivatedNNModel()
model
ActivatedNNModel(
  (input_to_hidden1): Linear(in_features=784, out_features=1024, bias=True)
  (relu1): ReLU()
  (hidden1_to_hidden2): Linear(in_features=1024, out_features=1024, bias=True)
  (relu2): ReLU()
  (hidden2_to_output): Linear(in_features=1024, out_features=10, bias=True)
)

Okay, we’ve made a new model; let’s repeat our training!

# Hyperparameters:
n_epochs = 5       # 1 epoch: you show all your training data to your model once
lr = 0.001         # We use a fairly low learning rate.
batch_size = 1000  # How many images in one training batch.

# Make the model:
model = ActivatedNNModel()

# Make the optimizer:
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# Declare our loss function:
loss_fn = torch.nn.CrossEntropyLoss(reduction='sum')

# Make the dataloaders:
train_dloader = torch.utils.data.DataLoader(train_dset, batch_size=batch_size, shuffle=True)
test_dloader = torch.utils.data.DataLoader(test_dset, batch_size=batch_size, shuffle=True)

# Now we start the optimization loop:
for epoch_num in range(n_epochs):
    # Put the model in train mode:
    model.train()
    # In each epoch, we go through each training sample once; the dataloader
    # gives these to us in batches:
    total_train_loss = 0
    for (inputs, targets) in train_dloader:
        # We're starting a new step, so we reset the gradients.
        optimizer.zero_grad()
        # Calculate the model prediction for these inputs.
        preds = model(inputs)
        # Calculate the loss between the prediction and the actual outputs.
        train_loss = loss_fn(preds, targets)
        # Have PyTorch backward-propagate the gradients.
        train_loss.backward()
        # Have the optimizer take a step:
        optimizer.step()
        # Add up the total training loss:
        total_train_loss = total_train_loss + train_loss
    mean_train_loss = (total_train_loss / len(train_dset)).detach()
    # Now that we've finished training, put the model back in evaluation mode.
    model.eval()
    # Evaluate the model using the test data.
    total_test_loss = 0
    for (inputs, targets) in test_dloader:
        preds = model(inputs)
        test_loss = loss_fn(preds, targets)
        total_test_loss = total_test_loss + test_loss
    mean_test_loss = (total_test_loss / len(test_dset)).detach()
    # Print something about this step:
    print(f"Epoch {epoch_num:2d}:"
          f"  train loss={mean_train_loss:6.3f};"
          f"  test loss={mean_test_loss:6.3f}")
# After the optimizer has run, print out what it's found:
print("Final result:")
print(f"  train loss = ", float(mean_train_loss))
print(f"   test loss = ", float(mean_test_loss))
Epoch  0:  train loss= 0.502;  test loss= 0.293
Epoch  1:  train loss= 0.177;  test loss= 0.131
Epoch  2:  train loss= 0.113;  test loss= 0.078
Epoch  3:  train loss= 0.078;  test loss= 0.101
Epoch  4:  train loss= 0.059;  test loss= 0.061
Final result:
  train loss =  0.059296317398548126
   test loss =  0.06140265613794327

Clearly this network performs much better! To some extent this is expected, because the network has a more complex internal structure with more trainable parameters. However, the ReLU operators contain no parameters themselves; they simply apply the same elementwise operation to every feature.
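
To make that point concrete, here is a minimal sketch showing that a ReLU layer has no trainable parameters and simply clamps negative values to zero, elementwise:

import torch

relu = torch.nn.ReLU()

# A ReLU layer has no trainable parameters at all.
print(list(relu.parameters()))  # []

# It simply replaces negative values with zero, elementwise.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])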

This demonstrates the importance of simple operators that introduce nonlinearities into the model structure for the model to exploit. These kinds of operators will continue to be important in the next section.