7.1 Tensors, Autograd, and Gradient Descent

In this assignment, you will learn about tensors, automatic differentiation, and how to use these tools to implement gradient descent for function minimization. These concepts form the foundation of modern deep learning frameworks.

By the end of this assignment, you should be able to:

  • create and manipulate tensors (vectors, matrices, and higher-dimensional arrays) in torch
  • compute gradients automatically with autograd
  • implement gradient descent to minimize a function

Part 1: Tensors

# Install torch (once), then load it
install.packages("torch")
library(torch)

Tensors are the fundamental data structures in deep learning. They are generalizations of vectors and matrices to potentially higher dimensions.

Vectors

Vectors are 1D tensors. In torch, we can create vectors using torch_tensor().

The dollar sign ($) is used to access a tensor's properties (like shape and dtype) and its methods (functions that operate on the tensor).

# Create a vector
x <- torch_tensor(1:5)
print(x)
torch_tensor
 1
 2
 3
 4
 5
[ CPULongType{5} ]
# Use the dollar sign to access the shape of the tensor:
print(x$shape)
[1] 5
# Use the dollar sign to access methods like adding and multiplying.
# This adds 10 to each element:
x$add(10)
torch_tensor
 11
 12
 13
 14
 15
[ CPUFloatType{5} ]

Exercise 1

Create a vector of 10 numbers. Print its shape and then use dollar sign syntax to access the method to subtract 1 from each element of the tensor.

Matrices

2D tensors can be created by passing matrix() into torch_tensor():

# Create a 2x3 matrix
A <- torch_tensor(matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE))
print(A)
torch_tensor
 1  2  3
 4  5  6
[ CPULongType{2,3} ]
print(A$shape)
[1] 2 3
A$add(1)
torch_tensor
 2  3  4
 5  6  7
[ CPUFloatType{2,3} ]

Exercise 2

Create a 3x3 identity matrix using torch, access its shape, and then call its method to multiply each element by the scalar 3.

Arrays (Higher-Dimensional Tensors)

An array is like a matrix, but it can have more than 2 dimensions. When we were studying Markov Decision Processes like the bus engine replacement problem or the lemon tree problem, we could have put several probability transition matrices in an array like this:

x <- array(
  c(
    diag(nrow = 3),
    diag(nrow = 3)
  ),
  dim = c(3, 3, 2)
)

print(x)
, , 1

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

, , 2

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Notice that when you make x a tensor, torch gives you a different view of the array, as if you were looking at it from the side instead of straight on. Why does it do this? R prints a 3D array as a series of 2D slices along the last dimension (the , , 1 and , , 2 blocks above), while torch prints slices along the first dimension ((1,.,.), (2,.,.), and so on), a convention it inherits from PyTorch. torch_tensor() doesn't change the data, just how the data is displayed.

torch_tensor(x)
torch_tensor
(1,.,.) = 
  1  1
  0  0
  0  0

(2,.,.) = 
  0  0
  1  1
  0  0

(3,.,.) = 
  0  0
  0  0
  1  1
[ CPUFloatType{3,3,2} ]
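To verify the claim that torch_tensor() changes only the display and not the data, you can round-trip the array through a tensor with as_array() — a quick sketch, assuming torch is loaded:

```r
library(torch)

# The same 3x3x2 array of two identity matrices as above
x <- array(c(diag(nrow = 3), diag(nrow = 3)), dim = c(3, 3, 2))

# Round-trip: R array -> torch tensor -> R array
x_back <- as_array(torch_tensor(x))

# Values and dimensions are unchanged; only the printing differed
all.equal(x, x_back)  # TRUE
dim(x_back)           # 3 3 2
```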

Exercise 3

Create a 2x2x2 tensor filled with ones using array(). Verify it returns the same thing as torch_ones(c(2, 2, 2)).

In-place Operations

Torch provides operations that modify tensors in-place; by convention, their names end with an underscore:

x <- torch_tensor(c(1, 2, 3))
print(x)
torch_tensor
 1
 2
 3
[ CPUFloatType{3} ]
# Add 5 to every element of x, modifying the tensor in-place
x$add_(5)
torch_tensor
 6
 7
 8
[ CPUFloatType{3} ]
print(x)
torch_tensor
 6
 7
 8
[ CPUFloatType{3} ]
# Set all elements of x to zero in-place
x$zero_()
torch_tensor
 0
 0
 0
[ CPUFloatType{3} ]
print(x)
torch_tensor
 0
 0
 0
[ CPUFloatType{3} ]

Exercise 4

Create a 3x3 tensor of numbers. Multiply all values by 10 in-place. Then set all elements to 0 in-place.

Part 2: Autograd

Automatic differentiation is a key feature of modern deep learning frameworks. It allows for the automatic computation of gradients, which are essential for optimization algorithms like gradient descent.

Tracking Gradients

Exercise 5

Let \(f(x) = x^2 + 3x + 1\). Find \(f(2)\). Then find \(f'(x)\), and evaluate it at \(x = 2\).

autograd

We can use tensors and autograd to perform the same task as Exercise 5.

First, tell torch to track operations on the tensor:

# Create a tensor with gradient tracking
x <- torch_tensor(2.0, requires_grad = TRUE)
print(x)
torch_tensor
 2
[ CPUFloatType{1} ][ requires_grad = TRUE ]
# Create another tensor based on x
y <- x^2 + 3*x + 1
print(y)
torch_tensor
 11
[ CPUFloatType{1} ][ grad_fn = <AddBackward1> ]
# Compute gradients. Calling y$backward() makes torch compute the
# gradient (derivative) of y with respect to every tensor that was
# used to compute y and has requires_grad = TRUE. It's "backward"
# because torch works backward from y to find the derivative with
# respect to x.
y$backward()

# Access the gradient of y with respect to x
print(x$grad)
torch_tensor
 7
[ CPUFloatType{1} ]

Exercise 6

Create a tensor x with value 3.0 and require gradients. Define \(y = 5 x^3 - 2 x^2 + 4 x - 7\). Compute the gradient of y with respect to x and verify it matches the expected analytical derivative (\(15 x^2 - 4x + 4\)).

Gradient Computation with Multiple Inputs

We can also compute gradients with respect to multiple inputs:

x <- torch_tensor(1.0, requires_grad = TRUE)
y <- torch_tensor(2.0, requires_grad = TRUE)

z <- x^2 + y^3
z$backward()

print(x$grad)  # dz/dx = 2*x = 2*1 = 2
torch_tensor
 2
[ CPUFloatType{1} ]
print(y$grad)  # dz/dy = 3*y^2 = 3*2^2 = 12
torch_tensor
 12
[ CPUFloatType{1} ]
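One behavior worth knowing before Part 3: backward() accumulates gradients into $grad rather than replacing them, which is why iterative code must reset gradients between steps (that is what grad$zero_() does in the gradient descent loop). A minimal sketch, assuming torch is loaded:

```r
library(torch)

x <- torch_tensor(1.0, requires_grad = TRUE)

# First backward pass: d(x^2)/dx at x = 1 is 2
(x^2)$backward()
print(x$grad)  # 2

# A second backward pass ACCUMULATES into x$grad: 2 + 2 = 4
(x^2)$backward()
print(x$grad)  # 4

# Reset the gradient before the next pass
x$grad$zero_()
print(x$grad)  # 0
```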

Part 3: Gradient Descent

Gradient descent is an optimization algorithm that uses gradients to find the minimum of a function. Here’s how it works:

  • Give it a starting position
  • Find the gradient at that position
  • Move in the opposite direction, taking a step that’s some fraction of the gradient (this is the “learning rate”)

As you approach a local minimum, the slope (gradient) approaches 0, so you take smaller and smaller steps. The algorithm is good at finding a local minimum; to be confident that the minimum is also global, you need something extra, such as knowing the function is convex or restarting the algorithm from several different initial positions.
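One more detail before the implementation: the with_no_grad() wrapper around the update step is not optional, because torch refuses to apply an in-place operation to a leaf tensor that requires gradients while autograd is tracking. A small illustration of the pitfall (a sketch, assuming torch is loaded and assuming torch's usual leaf-tensor behavior):

```r
library(torch)

x <- torch_tensor(5.0, requires_grad = TRUE)

# In-place update while autograd is tracking: this raises an error
res <- try(x$sub_(0.1), silent = TRUE)
print(class(res))  # "try-error"

# Inside with_no_grad(), the same update is allowed
with_no_grad({
  x$sub_(0.1)
})
print(x)  # 4.9, still with requires_grad = TRUE
```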

# Define a simple function: f(x) = x^2
f <- function(x) x^2

# Implement gradient descent
gradient_descent <- function(f, initial_x, learning_rate = 0.1, num_iterations = 100) {
  x <- torch_tensor(initial_x, requires_grad = TRUE)
  
  for (i in 1:num_iterations) {
    # Forward pass
    y <- f(x)
    
    # Backward pass (compute gradients)
    y$backward()
    
    # Update x
    with_no_grad({
      x$sub_(learning_rate * x$grad)
      x$grad$zero_()  # Reset gradients
    })
    
    # Print progress
    if (i %% 10 == 0) {
      cat("Iteration:", i, "x:", x$item(), "f(x):", f(x)$item(), "\n")
    }
  }
  
  return(x$item())
}

# Run gradient descent
min_x <- gradient_descent(f, initial_x = 5.0)
Iteration: 10 x: 0.5368708 f(x): 0.2882303 
Iteration: 20 x: 0.05764607 f(x): 0.00332307 
Iteration: 30 x: 0.0061897 f(x): 3.831239e-05 
Iteration: 40 x: 0.000664614 f(x): 4.417118e-07 
Iteration: 50 x: 7.13624e-05 f(x): 5.092592e-09 
Iteration: 60 x: 7.66248e-06 f(x): 5.87136e-11 
Iteration: 70 x: 8.227526e-07 f(x): 6.769218e-13 
Iteration: 80 x: 8.834239e-08 f(x): 7.804378e-15 
Iteration: 90 x: 9.485691e-09 f(x): 8.997834e-17 
Iteration: 100 x: 1.018518e-09 f(x): 1.03738e-18 
cat("Minimum found at x =", min_x, "\n")
Minimum found at x = 1.018518e-09 
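For this particular function, the run above can be checked by hand: each gradient step replaces x with x - learning_rate * 2x = (1 - 2 * learning_rate) * x, so the iterates shrink only when |1 - 2 * learning_rate| < 1. A quick base-R check of this closed form (no torch needed):

```r
# For f(x) = x^2, one gradient step multiplies x by (1 - 2 * lr)
step_factor <- function(lr) 1 - 2 * lr

# lr = 0.1: factor 0.8, so |x| shrinks every step (converges)
abs(step_factor(0.1))  # 0.8

# lr = 1.1: factor -1.2, so x flips sign and grows (diverges)
abs(step_factor(1.1))  # 1.2

# 100 steps from x = 5 with lr = 0.1 reproduces the final value above
5 * step_factor(0.1)^100  # ~1.018518e-09
```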

Exercise 7

Implement gradient descent to find the minimum of the function f(x) = x^4 - 3*x^3 + 2. Start from x = 3.0, use a learning rate of 0.01, and run for 200 iterations.

Exercise 8

Implement gradient descent to find the minimum of the function f(x, y) = (x - 2)^2 + (y - 3)^2. Start from (0, 0), use a learning rate of 0.1, and run for 50 iterations. Verify that the solution converges to the expected minimum at (2, 3).

Exercise 9

Perform Multivariable Gradient Descent:

# # Define a function of two variables: f(x, y) = x^2 + 2*y^2
# f <- function(tensor) {
#   x <- tensor[1]
#   y <- tensor[2]
#   return(___)
# }
# 
# # Implement multivariable gradient descent
# multi_gradient_descent <- function(f, initial_values, learning_rate = 0.1, num_iterations = 100) {
#   
#   tensor <- torch_tensor(initial_values, requires_grad = TRUE)
# 
#   for (___) {
#     # Forward pass
#     y <- ___
#     
#     # Backward pass (compute gradients)
#     ___
#     
#     # Update variables
#     with_no_grad({
#       ___
#       tensor$grad$zero_()  # Reset gradients
#     })
#     
#     # Print progress
#     if (i %% 10 == 0) {
#       cat("Iteration:", i, "values:", as.numeric(tensor$detach()), "f(x,y):", f(tensor)$item(), "\n")
#     }
#   }
#   
#   return(as.numeric(tensor$detach()))
# }
# 
# # Run multivariable gradient descent
# initial_values <- c(3.0, 4.0)
# min_values <- multi_gradient_descent(f, initial_values)
# cat("Minimum found at", min_values, "\n")