7.2 Neural Networks from Scratch
In this assignment, you will implement a neural network from scratch using tensors and autograd to solve a regression task. By building neural networks without relying on high-level abstractions, you’ll gain a deeper understanding of their inner workings and the computational processes that drive them.
Learning objectives:
- Understand the fundamental components of neural networks: weights, biases, layers, and activation functions
- Use autograd for automatic differentiation
- Apply a neural network to solve a regression task
Neural Networks: A Conceptual Foundation
From Linear Regression to Neural Networks
Neural networks can be viewed as an extension of linear regression. In linear regression, we model the relationship between predictors (X) and a dependent variable (Y) using the equation:
\(Y = XW + b\)
Where:
- X is the input data (with n observations of p predictors)
- W is a vector of weights (coefficients to estimate)
- b is the bias term (y-intercept to estimate)
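To make the notation concrete, here is a minimal sketch of this linear model written with torch tensors; the dimensions and values are made up purely for illustration:
library(torch)

# Illustration only: 5 observations of 2 predictors
X_demo <- torch_randn(5, 2)

# A 2 x 1 weight vector and a scalar bias (arbitrary values)
W_demo <- torch_tensor(c(1.5, -0.7))$unsqueeze(2)
b_demo <- torch_tensor(0.3)

# Y = XW + b
Y_demo <- X_demo$matmul(W_demo) + b_demo
Y_demo   # a 5 x 1 tensor of predictions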
Neural networks expand this concept by introducing multiple layers of transformations, where each layer performs linear operations followed by a simple nonlinear transformation.
The Architecture of Neural Networks
A neural network consists of:
- Input layer: the predictors X
- Hidden layers: intermediate layers whose units are linear combinations of the previous layer's outputs (for the first hidden layer, the p predictors), passed through a simple nonlinear function such as ReLU: \(g(z) = \max(0, z)\)
- Output layer: a linear combination of the hidden-layer outputs, with coefficients \(\beta\), that produces the final prediction
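Putting these pieces together, a network with a single hidden layer (the architecture implemented below) computes \(H = g(XW_1 + b_1)\) and \(\hat{Y} = HW_2 + b_2\), where \(W_1, b_1\) are the hidden-layer weights and biases, \(W_2, b_2\) those of the output layer, and \(g\) is the activation function applied elementwise. This notation mirrors the w1, b1, w2, b2 tensors used in the implementation.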
What makes neural networks powerful is their ability to model complex nonlinear relationships. This capability comes from activation functions applied after linear transformations in each hidden layer.
The most common activation function in modern neural networks is the Rectified Linear Unit (ReLU):
\(\text{ReLU}(z) = \max(0, z)\)
ReLU introduces nonlinearity in hidden layers by returning 0 for negative inputs and the input value for positive inputs. This simple operation enables neural networks to learn complex patterns that linear models cannot capture.
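For example, applied elementwise to a small tensor (arbitrary values):
library(torch)

z <- torch_tensor(c(-2, -0.5, 0, 1.5, 3))
z$relu()   # 0, 0, 0, 1.5, 3: negatives are clipped to zero, positives pass through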
The Learning Process
Neural networks learn through an iterative gradient descent process:
- Forward Propagation: Input data passes through the network to generate predictions
- Loss Calculation: The difference between predictions and actual values is measured
- Backpropagation: Gradients of the loss with respect to weights are computed
- Parameter Update: Weights are adjusted to minimize the loss using gradient descent
This process repeats until the model achieves satisfactory performance or a stopping criterion is met.
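Steps 3 and 4 are what autograd automates: torch records the operations performed in the forward pass and, when backward() is called, fills in the $grad field of every tensor created with requires_grad = TRUE. A minimal sketch with a made-up scalar example:
library(torch)

# f(w) = (3w - 1)^2, differentiated at w = 2
w <- torch_tensor(2, requires_grad = TRUE)
loss <- (w * 3 - 1)$pow(2)
loss$backward()

w$grad   # d/dw (3w - 1)^2 = 2 * (3w - 1) * 3 = 30 at w = 2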
Implementation
1. Data Generation
Start by generating synthetic data for our regression task. For simplicity, the relationship between X and Y is linear with added noise: \(Y = 0.2 X_1 - 1.3 X_2 + 0.5 X_3 + u\).
library(torch)

# Set seed for reproducibility
torch_manual_seed(42)

# Generate data
n_samples <- 100
n_features <- 3

# Input features
X <- torch_randn(n_samples, n_features)

# True coefficients
true_coefficients <- c(0.2, -1.3, 0.5)

# Generate target with some noise
y <- X$matmul(torch_tensor(true_coefficients))$unsqueeze(2) + 0.1 * torch_randn(n_samples, 1)
2. Model Definition
Now we’ll define our neural network with one hidden layer:
# Dimensions
d_in <- n_features   # Input dimension: 3 predictors
d_hidden <- 12       # Hidden layer dimension: 12 hidden layer nodes
d_out <- 1           # Output dimension: predict 1 dependent variable y

# Initialize weights and biases at random starting positions
# requires_grad = TRUE to track gradients

# Weights: input layer -> hidden layer
# (hidden layer units are linear combinations of the predictors)
w1 <- torch_randn(d_in, d_hidden, requires_grad = TRUE)

# Weights: hidden layer -> output layer (betas in linear regression)
w2 <- torch_randn(d_hidden, d_out, requires_grad = TRUE)

# Initialize biases at 0 (y-intercepts)
b1 <- torch_zeros(1, d_hidden, requires_grad = TRUE)
b2 <- torch_zeros(1, d_out, requires_grad = TRUE)
3. Training
We now train the network with gradient descent, repeating the forward pass, loss calculation, backpropagation, and parameter update for a fixed number of epochs.
# Hyperparameters
learning_rate <- 1e-4   # gradient descent step size
n_epochs <- 1000        # number of iterations to perform
# Training loop
for (epoch in 1:n_epochs) {

  ### -------- Forward pass --------
  # Hidden layer: linear transformation followed by ReLU activation
  hidden <- X$mm(w1)$add(b1)$relu()

  # Output layer: linear transformation of the hidden activations
  y_pred <- hidden$mm(w2)$add(b2)

  ### -------- Compute loss --------
  loss <- (y_pred - y)$pow(2)$mean()
  if (epoch %% 100 == 0)
    cat("Epoch: ", epoch, " Loss: ", loss$item(), "\n")

  ### -------- Backpropagation --------
  # compute gradient of loss w.r.t. all tensors with
  # requires_grad = TRUE
  loss$backward()

  ### -------- Update weights --------
  # Wrap in with_no_grad() because this is a part we don't
  # want to record for automatic gradient computation
  with_no_grad({
    w1 <- w1$sub_(learning_rate * w1$grad)
    w2 <- w2$sub_(learning_rate * w2$grad)
    b1 <- b1$sub_(learning_rate * b1$grad)
    b2 <- b2$sub_(learning_rate * b2$grad)

    # Zero gradients after every pass, as they'd
    # accumulate otherwise
    w1$grad$zero_()
    w2$grad$zero_()
    b1$grad$zero_()
    b2$grad$zero_()
  })
}
Epoch: 100 Loss: 5.98762
Epoch: 200 Loss: 4.383363
Epoch: 300 Loss: 3.365293
Epoch: 400 Loss: 2.685752
Epoch: 500 Loss: 2.211841
Epoch: 600 Loss: 1.867638
Epoch: 700 Loss: 1.608478
Epoch: 800 Loss: 1.408716
Epoch: 900 Loss: 1.251673
Epoch: 1000 Loss: 1.126104
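As an optional sanity check before moving to held-out data, one can compare the network's fitted values on the training data with the observed y, reusing the tensors defined above:
# Fitted values on the training data (no gradient tracking needed)
with_no_grad({
  y_fit <- X$mm(w1)$add(b1)$relu()$mm(w2)$add(b2)
})

# Correlation between fitted and observed values
cor(as.numeric(as.array(y_fit)), as.numeric(as.array(y)))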
4. Evaluation
After training, we can evaluate our model on test data:
# Generate test data
X_test <- torch_randn(20, n_features)
y_test <- X_test$matmul(torch_tensor(true_coefficients))$unsqueeze(2) + 0.1 * torch_randn(20, 1)

# Evaluate model on test data
with_no_grad({
  hidden_test <- X_test$mm(w1)$add(b1)$relu()
  y_pred_test <- hidden_test$mm(w2)$add(b2)
  test_loss <- (y_pred_test - y_test)$pow(2)$mean()
})

cat("Test Loss:", test_loss$item(), "\n")
Test Loss: 1.376233
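We can also inspect a few individual test predictions next to the actual values, reusing y_pred_test and y_test from the block above:
# A few test-set predictions next to the observed values
head(data.frame(
  predicted = as.numeric(as.array(y_pred_test)),
  actual    = as.numeric(as.array(y_test))
))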
Exercises
Question 2) Data Generation
Create a synthetic dataset with nonlinear relationships between predictors and the target variable. Generate data where x1 and x2 are random uniform between 0 and 5, and y = 5 sin(x1) + x2^2 + noise. Visualize the data to understand its structure.
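One possible way to generate and inspect such a dataset (a sketch only; the sample size and noise level below are arbitrary choices):
set.seed(42)
n <- 200
x1 <- runif(n, min = 0, max = 5)
x2 <- runif(n, min = 0, max = 5)
y  <- 5 * sin(x1) + x2^2 + rnorm(n, sd = 0.5)

# Visualize the marginal relationships
par(mfrow = c(1, 2))
plot(x1, y, main = "y vs x1")
plot(x2, y, main = "y vs x2")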
Question 3) Linear Model Baseline
Implement a linear regression model using lm and train it on your synthetic dataset. Report the mean squared error. Why does this model struggle with the nonlinear relationship?
Question 6) Learning Rate Analysis
Experiment with different learning rates (1e-2, 1e-3, 1e-4, 1e-5) and observe how they affect convergence. What happens when the learning rate is too high? Too low?