Gradient#

Torch provides tools for calculating gradients automatically. This page is dedicated to that feature.

import torch

Intro#

To compute derivatives in torch, you need to create a torch.tensor with requires_grad=True so that torch tracks the gradient of any function this tensor is involved in.

Then define a function that depends on the tensor under consideration and call its backward method - it computes the partial derivatives with respect to every tensor the function depends on.

After that, the derivative values are stored in the grad attribute of the tensor under consideration.
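As a minimal illustration of this workflow, consider the scalar function \(y = x^2\), whose derivative at \(x=3\) is \(2x=6\) (the variable names here are just for illustration):

x = torch.tensor(3.0, requires_grad=True)
y = x**2
y.backward()
x.grad
tensor(6.)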

Example 1#

Let’s say we have the function:

\[y(\omega)=\sum_i^n 3\omega_i.\]

And we need to find the derivative of the function with respect to the variables \(\omega_i, i\in\overline{1,n}\). Let’s do it by hand first:

\[\frac{dy}{d \omega_i} = \sum_j^n\frac{d3\omega_j}{d \omega_i} = \sum_j^n3\frac{d\omega_j}{d \omega_i}.\]

Here we use the fact that:

\[\begin{split}\frac{d \omega_i}{d \omega_j} = \begin{cases} 0 , i\neq j; \\ 1 , i=j.\end{cases}\end{split}\]

We get:

\[\frac{dy}{d \omega_i} = 3.\]

The implementation of this example in Torch is listed in the cell below:

n = 5
w = torch.rand(n, requires_grad=True)
y = torch.sum(w*3)
y.backward()
w.grad
tensor([3., 3., 3., 3., 3.])
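Note that grad accumulates across backward calls: if we build and differentiate the same function again, the new derivatives are added to the existing values. A small sketch continuing with the same w:

y = torch.sum(w*3)
y.backward()
w.grad
tensor([6., 6., 6., 6., 6.])

Call w.grad.zero_() to reset the accumulated values; we’ll rely on this later in the gradient descent section.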

Example 2#

Now we have a slightly more complicated function:

\[y(\omega, \gamma)=\sum_i^n \omega_i \gamma_i.\]

The derivatives with respect to \(\omega_i\) and \(\gamma_i, i \in \overline{1,n}\) are, respectively:

\[\frac{dy}{d \omega_i} = \sum_j^n\frac{d\omega_j\gamma_j}{d \omega_i} = \sum_j^n\gamma_j\frac{d\omega_j}{d \omega_i}=\gamma_i;\]
\[\frac{dy}{d \gamma_i} = \sum_j^n\frac{d\omega_j\gamma_j}{d \gamma_i} = \sum_j^n\omega_j\frac{d\gamma_j}{d \gamma_i}=\omega_i.\]

The implementation in torch for this case is shown in the following cell. The main purpose of this example is to show that when a derivative contains a variable, that variable’s value is substituted into the expression. So in the example:

  • \(\omega=(1,2,3,4)\) - so the derivatives with respect to the \(\gamma_i\) take these values;

  • likewise \(\gamma=(5,6,7,8)\) - the derivatives with respect to the \(\omega_i\) take these values.

w = torch.tensor([1, 2, 3, 4], dtype=torch.float, requires_grad=True)
g = torch.tensor([5, 6, 7, 8], dtype=torch.float, requires_grad=True)

y = w @ g
y.backward()
print("omega gradient value")
print(w.grad)
print("gamma gradient value")
print(g.grad)
omega gradient value
tensor([5., 6., 7., 8.])
gamma gradient value
tensor([1., 2., 3., 4.])

Ignore gradients#

Sometimes it’s useful to avoid computing gradients for tensors with requires_grad=True. To achieve this, you can use the torch.no_grad() context manager.


Consider a simple example: we’ll use a tensor with requires_grad=True, which is defined in the following cell:

w = torch.rand(n, requires_grad=True)

If we apply any transformation, the backward method will work correctly, and we can view the gradients as expected.

y = torch.sum(w)
y.backward()
w.grad
tensor([1., 1., 1., 1., 1.])

However, if we wrap the operation in a torch.no_grad() context, calling the backward method will result in an error.

with torch.no_grad():
    y = torch.sum(w)

try: y.backward() 
except Exception as e: print(e)
element 0 of tensors does not require grad and does not have a grad_fn
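The reason is that the operation wasn’t tracked: the y produced inside the context has no grad_fn and doesn’t require gradients, which we can verify directly:

print(y.requires_grad)
False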

Non-leaf grad#

If a tensor is not a leaf in the PyTorch computation graph (meaning it is the result of operations on other tensors), PyTorch will not populate its grad attribute when backward is called.


Consider an example with leaf tensors A and B, and a non-leaf tensor temp, which is the result of matrix multiplication between A and B.

A = torch.rand([3,2], dtype=torch.float, requires_grad=True)
B = torch.rand([2,1], dtype=torch.float, requires_grad=True)

temp = A @ B
temp.sum().backward()

Attempting to access the gradient of the temp tensor will result in a warning.

temp.grad
/tmp/ipykernel_9368/463209734.py:1: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:489.)
  temp.grad

However, the gradients of the leaf tensors involved in the operations will be computed correctly.

print(A.grad)
print(B.grad)
tensor([[0.4047, 0.6939],
        [0.4047, 0.6939],
        [0.4047, 0.6939]])
tensor([[1.1646],
        [1.4113]])
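These values follow from matrix calculus: for \(L=\sum_{ij}(AB)_{ij}\),

\[\frac{\partial L}{\partial A} = \mathbf{1}B^T; \qquad \frac{\partial L}{\partial B} = A^T\mathbf{1}.\]

So every row of A.grad equals \(B^T\), and B.grad holds the column sums of A, which matches the output above.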

According to the instructions in the warning, we need to call the retain_grad method on non-leaf tensors if we want to retrieve their gradients after calling backward. The following cell demonstrates this with our example.

temp = A @ B
temp.retain_grad()
temp.sum().backward()
temp.grad
tensor([[1.],
        [1.],
        [1.]])

Gradient descent#

Torch’s ability to calculate derivatives is extremely useful for gradient descent. Here is a simple example of a gradient descent implementation using torch derivatives.

Sample generation:

n_features = 2
n_objects = 300

w_true = torch.randn(n_features)
X = (torch.rand(n_objects, n_features) - 0.5) * 5
Y = X @ w_true + torch.randn(n_objects) / 2

The implementation of the algorithm is shown in the following cell.

At each iteration:

  • The predictions for the current weights are computed;

  • The MSE for the current prediction is calculated;

  • The gradient of the MSE with respect to the weights is computed by the line MSE.backward();

  • A step is taken in weight space in the direction of the negative gradient. We wrap this operation in torch.no_grad() so that the update itself isn’t tracked for gradient computation.

step_size = 1e-2
w = torch.rand(X.shape[1], requires_grad=True)
for i in range(5000):
    # predictions for the current weights
    y_pred = torch.matmul(X, w)
    # MSE for the current prediction
    MSE = torch.mean((y_pred - Y)**2)
    # gradient of MSE with respect to w
    MSE.backward()

    # gradient step, excluded from autograd tracking
    with torch.no_grad():
        w -= w.grad * step_size

    # reset accumulated gradients before the next iteration
    w.grad.zero_()

Let’s compare the real coefficients with the result of the approximation - they are pretty close.

print("Real coefs:", w_true.tolist())
print("Approximation coefs:", w.tolist())
Real coefs: [-0.1252637803554535, 0.5651034116744995]
Approximation coefs: [-0.1096344143152237, 0.5543152689933777]
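The same loop can also be written with torch’s built-in optimizer, which takes care of the no_grad update and the gradient reset. A minimal sketch using torch.optim.SGD (w_sgd and optimizer are illustrative names, starting from fresh random weights):

w_sgd = torch.rand(X.shape[1], requires_grad=True)
optimizer = torch.optim.SGD([w_sgd], lr=step_size)
for i in range(5000):
    MSE = torch.mean((X @ w_sgd - Y)**2)
    optimizer.zero_grad()  # reset accumulated gradients
    MSE.backward()         # compute d(MSE)/d(w_sgd)
    optimizer.step()       # w_sgd -= lr * w_sgd.grad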