Gradient#
Torch provides tools for computing gradients automatically. This page covers that feature.
import torch
Intro#
To compute a derivative in Torch, create a torch.tensor with requires_grad=True so that Torch tracks every operation the tensor takes part in.
Then define a scalar function that depends on that tensor and call the backward method on its result; this computes the partial derivatives with respect to each tracked tensor the function depends on.
After that step, the derivative values are available in the grad field of the tensor.
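As a minimal sketch of these three steps, assuming a single scalar input and the illustrative function y = x^2 (neither is taken from the examples on this page):

```python
import torch

# Step 1: a leaf tensor that autograd should track
x = torch.tensor(3.0, requires_grad=True)

# Step 2: a scalar function of x, here y = x^2 (illustrative choice)
y = x ** 2
y.backward()  # computes dy/dx and stores it in x.grad

# Step 3: read the derivative from the grad field
print(x.grad)  # dy/dx = 2 * x, so tensor(6.)
```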
Example 1#
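Let’s say we have the function:
\[ y = \sum_{i=1}^{n} 3\omega_i \]
And we need to find the derivatives of the function with respect to the variables \(\omega_i, i\in\overline{1,n}\). Let’s do it by hand first:
\[ \frac{\partial y}{\partial \omega_i} = \sum_{j=1}^{n} 3\frac{\partial \omega_j}{\partial \omega_i} \]
Taking into account that:
\[ \frac{\partial \omega_j}{\partial \omega_i} = \begin{cases} 1, & j = i \\ 0, & j \neq i \end{cases} \]
we get:
\[ \frac{\partial y}{\partial \omega_i} = 3 \]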
The implementation of this example in Torch is listed in the cell below:
n = 5
w = torch.rand(n, requires_grad=True)
y = torch.sum(w*3)
y.backward()
w.grad
tensor([3., 3., 3., 3., 3.])
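If you would rather receive the gradient as a return value than read it from the grad field, torch.autograd.grad is one option. The sketch below reuses the function from this example; the name dy_dw is introduced here only for illustration.

```python
import torch

n = 5
w = torch.rand(n, requires_grad=True)
y = torch.sum(w * 3)

# torch.autograd.grad returns a tuple with one gradient per input
# and does not write anything into w.grad
(dy_dw,) = torch.autograd.grad(y, w)

print(dy_dw)   # every component equals 3
print(w.grad)  # None, because backward() was never called
```

This form can be handy when you do not want gradients to accumulate in the tensor itself.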
Example 2#
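Now we have a slightly more complicated function:
\[ y = \sum_{i=1}^{n} \omega_i\gamma_i \]
So the derivatives with respect to \(\omega_i\) and \(\gamma_i, i \in \overline{1,n}\) are, respectively:
\[ \frac{\partial y}{\partial \omega_i} = \gamma_i, \qquad \frac{\partial y}{\partial \gamma_i} = \omega_i \]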
The Torch implementation for this case is shown in the following cell. The main purpose of this example is to show that if a derivative contains a variable, the current value of that variable is substituted into the expression. So in the example:
\(\omega=(1,2,3,4)\), so the derivatives with respect to \(\gamma_i\) take these values;
likewise \(\gamma=(5,6,7,8)\), so the derivatives with respect to \(\omega_i\) take these values.
w = torch.tensor([1, 2, 3, 4], dtype=torch.float, requires_grad=True)
g = torch.tensor([5, 6, 7, 8], dtype=torch.float, requires_grad=True)
y = w @ g
y.backward()
print("omega gradient value")
print(w.grad)
print("gamma gradient value")
print(g.grad)
omega gradient value
tensor([5., 6., 7., 8.])
gamma gradient value
tensor([1., 2., 3., 4.])
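One way to sanity-check a gradient like this is a central finite-difference approximation. The sketch below is only an illustration: the helper f, the step eps, and the switch to float64 (to keep rounding error small) are assumptions made here, not part of the example above.

```python
import torch

w = torch.tensor([1, 2, 3, 4], dtype=torch.float64, requires_grad=True)
g = torch.tensor([5, 6, 7, 8], dtype=torch.float64, requires_grad=True)

# Gradient from autograd
y = w @ g
y.backward()

def f(w_val, g_val):
    """Hypothetical helper: evaluate the dot product as a Python float."""
    return (w_val @ g_val).item()

eps = 1e-4  # illustrative step size
fd_grad_w = torch.zeros(4, dtype=torch.float64)
with torch.no_grad():
    for i in range(4):
        shift = torch.zeros(4, dtype=torch.float64)
        shift[i] = eps
        # Central difference along the i-th coordinate of w
        fd_grad_w[i] = (f(w + shift, g) - f(w - shift, g)) / (2 * eps)

print(fd_grad_w)  # numerical estimate of dy/dw
print(w.grad)     # autograd result, equal to the values of g
```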
Ignore gradients#
Sometimes it’s useful to avoid computing gradients for tensors with requires_grad=True. To achieve this, you can use the torch.no_grad() context manager.
Consider a simple example: we’ll use a tensor with requires_grad=True, which is defined in the following cell:
w = torch.rand(n, requires_grad=True)
If we apply a transformation to it, the backward method of the result works as usual, and we can view the gradients as expected.
y = torch.sum(w)
y.backward()
w.grad
tensor([1., 1., 1., 1., 1.])
However, if we wrap the operation in a torch.no_grad() context, calling the backward method on the result raises an error.
with torch.no_grad():
    y = torch.sum(w)
try: y.backward()
except Exception as e: print(e)
element 0 of tensors does not require grad and does not have a grad_fn
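The error message refers to two attributes that can be inspected directly. The following sketch, with the illustrative names y_tracked and y_untracked, compares a result computed normally with one computed inside torch.no_grad():

```python
import torch

w = torch.rand(5, requires_grad=True)

# Computed normally: the result is attached to the computation graph
y_tracked = torch.sum(w)

# Computed under no_grad: the result is detached from the graph
with torch.no_grad():
    y_untracked = torch.sum(w)

print(y_tracked.requires_grad, y_tracked.grad_fn is not None)  # True True
print(y_untracked.requires_grad, y_untracked.grad_fn)          # False None
```

Because y_untracked has no grad_fn, there is nothing for backward to traverse, which is what the error reports.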
Non-leaf grad#
If a tensor is not a leaf in the PyTorch computation graph (meaning it is the result of operations on other tensors), PyTorch will not populate its grad field by default when backward is called.
Consider an example with leaf tensors A and B, and a non-leaf tensor temp, which is the result of matrix multiplication between A and B.
A = torch.rand([3,2], dtype=torch.float, requires_grad=True)
B = torch.rand([2,1], dtype=torch.float, requires_grad=True)
temp = A @ B
temp.sum().backward()
Attempting to access the gradient of the temp tensor results in a warning.
temp.grad
/tmp/ipykernel_9368/463209734.py:1: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:489.)
temp.grad
However, the gradients of the leaf tensors involved in the operations will be computed correctly.
print(A.grad)
print(B.grad)
tensor([[0.4047, 0.6939],
        [0.4047, 0.6939],
        [0.4047, 0.6939]])
tensor([[1.1646],
        [1.4113]])
As the Torch warning suggests, we need to call the retain_grad method on a non-leaf tensor if we want to retrieve its gradient after calling backward. The following cell demonstrates this with our example.
temp = A @ B
temp.retain_grad()
temp.sum().backward()
temp.grad
tensor([[1.],
        [1.],
        [1.]])
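Whether a tensor counts as a leaf can be checked with its is_leaf attribute. A short sketch, assuming the same shapes as in the example above:

```python
import torch

A = torch.rand([3, 2], dtype=torch.float, requires_grad=True)
B = torch.rand([2, 1], dtype=torch.float, requires_grad=True)
temp = A @ B

# Tensors created directly by the user are leaves;
# the result of an operation on them is not
print(A.is_leaf, B.is_leaf, temp.is_leaf)  # True True False

# A non-leaf tensor records the operation that produced it
print(temp.grad_fn)
```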
Gradient descent#
Torch’s ability to calculate derivatives is extremely useful for gradient descent. Here is a simple example of a gradient descent implementation using Torch derivatives.
Sample generation:
n_features = 2
n_objects = 300
w_true = torch.randn(n_features)
X = (torch.rand(n_objects, n_features) - 0.5) * 5
Y = X @ w_true + torch.randn(n_objects) / 2
The implementation of the algorithm is shown in the following cell.
At each iteration:
The predictions for the current weights are computed;
The MSE for the current prediction is calculated;
The gradient of the MSE with respect to the weights is taken with the line MSE.backward();
In the weight space, a step is made in the direction of the antigradient. We need to wrap this operation in torch.no_grad() so that the update itself is not tracked for gradient calculations.
step_size = 1e-2
w = torch.rand(X.shape[1], requires_grad=True)
for i in range(5000):
    y_pred = torch.matmul(X, w)
    MSE = torch.mean((y_pred - Y)**2)
    MSE.backward()
    with torch.no_grad():
        w -= w.grad * step_size
    w.grad.zero_()
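The loop ends each iteration with w.grad.zero_() because backward adds to the grad field instead of overwriting it. A small sketch of that behaviour, using an illustrative tensor of ones and the hypothetical names first and second:

```python
import torch

w = torch.ones(3, requires_grad=True)

# First backward pass: gradient of sum(w) is 1 for every component
torch.sum(w).backward()
first = w.grad.clone()

# Second backward pass without resetting: the new gradient is added on top
torch.sum(w).backward()
second = w.grad.clone()

print(first)   # tensor([1., 1., 1.])
print(second)  # tensor([2., 2., 2.])

# Resetting in place restores a clean state for the next iteration
w.grad.zero_()
print(w.grad)  # tensor([0., 0., 0.])
```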
Now let’s compare the true coefficients with the approximation: they are pretty close.
print("Real coefs:", w_true.tolist())
print("Approximation coefs:", w.tolist())
Real coefs: [-0.1252637803554535, 0.5651034116744995]
Approximation coefs: [-0.1096344143152237, 0.5543152689933777]
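The same update can also be delegated to an optimizer. The sketch below swaps the manual step for torch.optim.SGD, which is a different way of writing the loop and not the implementation used above; the fixed seed is an assumption added here so that the run is reproducible.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed, not part of the original example

# Same synthetic data setup as above
n_features, n_objects = 2, 300
w_true = torch.randn(n_features)
X = (torch.rand(n_objects, n_features) - 0.5) * 5
Y = X @ w_true + torch.randn(n_objects) / 2

w = torch.rand(n_features, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=1e-2)

for _ in range(5000):
    optimizer.zero_grad()                # reset gradients from the previous step
    loss = torch.mean((X @ w - Y) ** 2)  # MSE for the current weights
    loss.backward()                      # populate w.grad
    optimizer.step()                     # w <- w - lr * w.grad

print("Real coefs:", w_true.tolist())
print("Approximation coefs:", w.tolist())
```

With plain SGD and no momentum, optimizer.step() performs the same update as the manual w -= w.grad * step_size, so the two versions should behave alike on the same data.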