7.2 Neural Networks from Scratch

In this assignment, you will implement a neural network from scratch using tensors and autograd to solve a regression task. By building neural networks without relying on high-level abstractions, you’ll gain a deeper understanding of their inner workings and the computational processes that drive them.

Learning objectives:

  • Understand how neural networks extend linear regression with hidden layers and nonlinear activation functions
  • Implement the forward pass, loss calculation, backpropagation, and gradient descent updates using torch tensors and autograd
  • Train and evaluate a neural network on a regression task without relying on high-level abstractions

Neural Networks: A Conceptual Foundation

From Linear Regression to Neural Networks

Neural networks can be viewed as an extension of linear regression. In linear regression, we model the relationship between predictors (X) and a dependent variable (Y) using the equation:

\(Y = XW + b\)

Where:

  • X is the input data (with n observations of p predictors)
  • W is a vector of weights (coefficients to estimate)
  • b is the bias term (y-intercept to estimate)
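
Before adding layers, it may help to see this linear model written as tensor operations. The following is a minimal sketch using the torch package (introduced in the implementation below); the shapes are assumptions chosen only for illustration.

library(torch)

# Purely illustrative shapes (assumption): 100 observations, 3 predictors
X <- torch_randn(100, 3)   # n x p matrix of predictors
W <- torch_randn(3, 1)     # p x 1 vector of weights
b <- torch_zeros(1)        # bias (y-intercept)

# The linear model Y = XW + b written as tensor operations
Y <- X$mm(W) + b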

Neural networks expand this concept by introducing multiple layers of transformations, where each layer performs linear operations followed by a simple nonlinear transformation.

The Architecture of Neural Networks

A neural network consists of:

  1. Input layer: the predictors X
  2. Hidden layers: intermediate layers, each formed by linear combinations of its inputs (the p predictors, for the first hidden layer) followed by a simple nonlinear function such as ReLU: \(g(z) = \max(0, z)\)
  3. Output layer: a linear combination of the hidden-layer values, with coefficients \(\beta\), that produces the final prediction

What makes neural networks powerful is their ability to model complex nonlinear relationships. This capability comes from activation functions applied after linear transformations in each hidden layer.

The most common activation function in modern neural networks is the Rectified Linear Unit (ReLU):

\(\text{ReLU}(z) = \max(0, z)\)

ReLU introduces nonlinearity in hidden layers by returning 0 for negative inputs and the input value for positive inputs. This simple operation enables neural networks to learn complex patterns that linear models cannot capture.
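
As a quick illustration (a minimal sketch, assuming the torch package used throughout this assignment), applying $relu() to a tensor zeroes out the negative entries and leaves the positive ones unchanged:

library(torch)

z <- torch_tensor(c(-2, -0.5, 0, 1, 3))
z$relu()  # elementwise result: 0, 0, 0, 1, 3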

The Learning Process

Neural networks learn through an iterative gradient descent process:

  1. Forward Propagation: Input data passes through the network to generate predictions
  2. Loss Calculation: The difference between predictions and actual values is measured
  3. Backpropagation: Gradients of the loss with respect to weights are computed
  4. Parameter Update: Weights are adjusted to minimize the loss using gradient descent

This process repeats until the model achieves satisfactory performance or a stopping criterion is met.
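
Concretely, the parameter update in step 4 is plain gradient descent: each weight takes a small step against its gradient, \(w \leftarrow w - \eta \, \frac{\partial L}{\partial w}\), where \(\eta\) is the learning rate. This is exactly the update the training loop below implements with learning_rate and the $grad field.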

Implementation

1. Data Generation

Start by generating synthetic data for our regression task. For simplicity, the relationship between X and Y is linear with added noise: \(Y = 0.2 X_1 - 1.3 X_2 + 0.5 X_3 + u\).

library(torch)

# Set seed for reproducibility
torch_manual_seed(42)

# Generate data
n_samples <- 100
n_features <- 3

# Input features
X <- torch_randn(n_samples, n_features)

# True coefficients
true_coefficients <- c(0.2, -1.3, 0.5)

# Generate target with some noise
y <- X$matmul(torch_tensor(true_coefficients))$unsqueeze(2) + 0.1 * torch_randn(n_samples, 1)
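
Before defining the model, it can be worth sanity-checking the tensor shapes (an optional step, not required by the assignment):

# Optional: confirm tensor dimensions
X$shape  # 100 x 3 -> n_samples observations of n_features predictors
y$shape  # 100 x 1 -> one target value per observation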

2. Model Definition

Now we’ll define our neural network with one hidden layer:

# Dimensions
d_in <- n_features  # Input dimension: 3 predictors
d_hidden <- 12      # Hidden layer dimension: 12 hidden layer nodes
d_out <- 1          # Output dimension: predict 1 dependent variable y

# Initialize weights and biases at random starting positions
# requires_grad = TRUE to track gradients
# Weights: input layer -> hidden layer 
# (hidden layers are linear combinations of predictors)
w1 <- torch_randn(d_in, d_hidden, requires_grad = TRUE)

# Weights: hidden layer -> output layer (betas in linear regression)
w2 <- torch_randn(d_hidden, d_out, requires_grad = TRUE)

# Initialize biases at 0 (y-intercepts)
b1 <- torch_zeros(1, d_hidden, requires_grad = TRUE)
b2 <- torch_zeros(1, d_out, requires_grad = TRUE)
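
Putting the pieces together, the forward pass of this one-hidden-layer network is \(\hat{Y} = \text{ReLU}(X W_1 + b_1)\, W_2 + b_2\): the hidden layer applies ReLU to a linear transformation of the inputs, and the output layer is a linear combination of the resulting hidden activations.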


3. Training

Define the hyperparameters, then run the training loop. Each epoch performs a forward pass, computes the mean squared error loss, backpropagates, and updates the weights with gradient descent.

# Hyperparameters
learning_rate <- 1e-4 # gradient descent step size
n_epochs <- 1000 # number of iterations to perform

# Training loop
for (epoch in 1:n_epochs) {
  
  ### -------- Forward pass --------
  # Hidden layer: linear transformation followed by ReLU activation
  hidden <- X$mm(w1)$add(b1)$relu()
  
  # Output layer: linear transformation of the hidden activations
  y_pred <- hidden$mm(w2)$add(b2)
  
  ### -------- Compute loss -------- 
  loss <- (y_pred - y)$pow(2)$mean()
  if (epoch %% 100 == 0)
    cat("Epoch: ", epoch, "   Loss: ", loss$item(), "\n")
  
  ### -------- Backpropagation --------
  # compute gradient of loss w.r.t. all tensors with
  # requires_grad = TRUE
  loss$backward()
  
  ### -------- Update weights -------- 
  # Wrap in with_no_grad() because this is a part we don't 
  # want to record for automatic gradient computation
  with_no_grad({
    w1 <- w1$sub_(learning_rate * w1$grad)
    w2 <- w2$sub_(learning_rate * w2$grad)
    b1 <- b1$sub_(learning_rate * b1$grad)
    b2 <- b2$sub_(learning_rate * b2$grad)  
    
    # Zero gradients after every pass, as they'd
    # accumulate otherwise
    w1$grad$zero_()
    w2$grad$zero_()
    b1$grad$zero_()
    b2$grad$zero_()  
  })
}
Epoch:  100    Loss:  5.98762 
Epoch:  200    Loss:  4.383363 
Epoch:  300    Loss:  3.365293 
Epoch:  400    Loss:  2.685752 
Epoch:  500    Loss:  2.211841 
Epoch:  600    Loss:  1.867638 
Epoch:  700    Loss:  1.608478 
Epoch:  800    Loss:  1.408716 
Epoch:  900    Loss:  1.251673 
Epoch:  1000    Loss:  1.126104 

4. Evaluation

After training, we can evaluate our model on test data:

# Generate test data
X_test <- torch_randn(20, n_features)
y_test <- X_test$matmul(torch_tensor(true_coefficients))$unsqueeze(2) + 0.1 * torch_randn(20, 1)

# Evaluate model on test data
with_no_grad({
  hidden_test <- X_test$mm(w1)$add(b1)$relu()
  y_pred_test <- hidden_test$mm(w2)$add(b2)
  test_loss <- (y_pred_test - y_test)$pow(2)$mean()
})

cat("Test Loss:", test_loss$item(), "\n")
Test Loss: 1.376233 
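
For a visual check of the fit (an optional sketch, not part of the assignment), you can convert the tensors back to plain R objects with as_array() and plot predicted against observed test values:

# Convert tensors to R vectors for plotting
y_hat <- as.numeric(as_array(y_pred_test))
y_obs <- as.numeric(as_array(y_test))

plot(y_obs, y_hat,
     xlab = "Observed y (test)", ylab = "Predicted y (test)")
abline(0, 1, lty = 2)  # points near this line indicate accurate predictions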

Exercises

Question 1) Create a function neural_network that takes data X, y, and the number of hidden-layer nodes, and estimates the model as above.

Question 2) Data Generation

Create a synthetic dataset with nonlinear relationships between the predictors and the target variable. Generate data where \(x_1\) and \(x_2\) are uniformly distributed between 0 and 5, and \(y = 5 \sin(x_1) + x_2^2 + u\), where \(u\) is random noise. Visualize the data to understand its structure.

Question 3) Linear Model Baseline

Implement a linear regression model using lm and train it on your synthetic dataset. Report the mean squared error. Why does this model struggle with the nonlinear relationship?

Question 4) Use neural_network to fit a neural network on the data from question 2 with 12 hidden nodes. Compare the mean squared error to the linear regression from question 3.

Question 5) Hidden Layer Exploration

Modify the neural network from question 4 to have different numbers of units in the hidden layer (8, 16, 32, 64, 128). Train each model and compare their performance.

Question 6) Learning Rate Analysis

Experiment with different learning rates (1e-2, 1e-3, 1e-4, 1e-5) and observe how they affect convergence. What happens when the learning rate is too high? Too low?