Data primitives#
Torch implements its own approach to organizing data. It assumes two objects: a Dataset and a DataLoader. The Dataset holds the data and provides access to individual data points, while the DataLoader groups the data into mini-batches and exposes an iterable interface over them.
For the full description, visit the torch.utils.data tutorial.
import torch
import torch.utils.data as td
from torch.utils.data import DataLoader
Data set#
A dataset in Torch is a special type of object that prepares data and returns individual data units through indexing syntax. It has to implement the following methods:
__len__: returns the number of elements in the dataset.
__getitem__: implements the [] operator for dataset objects.
Check the corresponding page for more details.
The following cell shows a simple dataset that wraps a Python list as a Torch primitive.
class Example(td.Dataset):
    def __init__(self, data: list[int]) -> None:
        self.data = data

    def __len__(self) -> int:
        # number of elements in the dataset
        return len(self.data)

    def __getitem__(self, i: int) -> torch.Tensor:
        # return the i-th element as a torch tensor
        return torch.tensor(self.data[i])
For each index, you get the corresponding element, converted to a torch tensor.
data_set = Example([3, 2, 7, 3])
data_set[2]
tensor(7)
Data loader#
A DataLoader in PyTorch is an object that simplifies the process of splitting data into batches.
Find out more in the torch.utils.data.DataLoader section of the official documentation.
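For instance, here is a minimal sketch (reusing the Example dataset instance data_set defined above) that wraps it in a DataLoader and iterates over its batches:

loader = DataLoader(data_set, batch_size=2)

for batch in loader:
    # the default collation stacks the 0-d tensors of each batch, so this
    # is expected to print tensor([3, 2]) and then tensor([7, 3])
    print(batch)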
Drop incomplete batch#
The drop_last argument of torch.utils.data.DataLoader controls whether the final batch is dropped if it doesn't contain enough elements to form a full batch. With drop_last=True, any remaining samples that don't fit into a complete batch are skipped.
The following cell shows the tensor used as the base and defines a TensorDataset over it.
samples = 14
dimensionality = 5

input_tensor = (
    torch.arange(samples * dimensionality)
    .reshape(samples, dimensionality)
)
print(input_tensor)
dataset = torch.utils.data.TensorDataset(input_tensor)
tensor([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39],
[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59],
[60, 61, 62, 63, 64],
[65, 66, 67, 68, 69]])
Suppose we decide to use batch_size=4. Our 14 samples can't be split evenly into batches of 4. The following cell defines such a DataLoader with drop_last=True and prints all its batches.
data_loader = DataLoader(
dataset,
batch_size=4,
drop_last=True
)
for d in data_loader:
print(d)
[tensor([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])]
[tensor([[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]])]
[tensor([[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59]])]
The numbers from the last two samples (from 60 to 69) haven’t been printed because they didn’t form a complete batch, and thus were not included.
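For comparison, with drop_last=False (the default) the trailing samples are still returned as a smaller final batch. A minimal sketch on the same dataset:

data_loader = DataLoader(
    dataset,
    batch_size=4,
    drop_last=False
)

for d in data_loader:
    # TensorDataset yields one-tensor tuples, so d is a one-element list here;
    # the first three batches should have shape [4, 5] and the last one [2, 5]
    print(d[0].shape)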
Collate function#
You specify how entities from the dataset should be joined into batches by setting the collate_fn
argument of the DataLoader
.
The collate_fn
is a function that processes a list of tuples, where each tuple represents the outputs from the dataset—typically in the form (X, y)
. collate_fn
should return torch.Tensor
, but in some cases output can be different.
Consider an example where we build a DataLoader over the TensorDataset created in the following cell.
samples = 4
dimensionality = 5

input_tensor = (
    torch.arange(samples * dimensionality)
    .reshape(samples, dimensionality)
)
print(input_tensor)
dataset = torch.utils.data.TensorDataset(input_tensor)
tensor([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
Here is the function we will pass as the collate_fn argument. It prints its input, so we can check that the function receives exactly what we expect, and returns a stack of the input tensors.
def check_function(batch: list[tuple[torch.Tensor]]) -> torch.Tensor:
    print("I got:", batch)
    # each element of `batch` is a one-element tuple coming from TensorDataset,
    # so unzip the tuples and stack the first (and only) group of tensors
    return torch.stack(list(zip(*batch))[0])
Here is an example of its usage; everything works just as expected.
data_loader = DataLoader(
dataset=dataset,
collate_fn=check_function,
batch_size=2
)
for batch in data_loader:
print(batch)
I got: [(tensor([0, 1, 2, 3, 4]),), (tensor([5, 6, 7, 8, 9]),)]
tensor([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
I got: [(tensor([10, 11, 12, 13, 14]),), (tensor([15, 16, 17, 18, 19]),)]
tensor([[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
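A custom collate_fn is especially handy when the default stacking cannot work, for example with variable-length samples. The sketch below (the VarLen dataset and pad_collate function are hypothetical names introduced only for illustration) pads shorter sequences with torch.nn.utils.rnn.pad_sequence:

from torch.nn.utils.rnn import pad_sequence

class VarLen(td.Dataset):
    def __init__(self, sequences: list[list[int]]) -> None:
        # store each sample as a 1-d tensor; lengths may differ
        self.sequences = [torch.tensor(s) for s in sequences]

    def __len__(self) -> int:
        return len(self.sequences)

    def __getitem__(self, i: int) -> torch.Tensor:
        return self.sequences[i]

def pad_collate(batch: list[torch.Tensor]) -> torch.Tensor:
    # pad_sequence stacks 1-d tensors into a [batch, max_len] tensor,
    # filling the shorter ones with zeros
    return pad_sequence(batch, batch_first=True, padding_value=0)

var_loader = DataLoader(
    VarLen([[1, 2], [3, 4, 5], [6], [7, 8, 9, 10]]),
    batch_size=2,
    collate_fn=pad_collate,
)

for batch in var_loader:
    print(batch)

Without a custom collate_fn, the default collation would be expected to fail here, because tensors of different lengths can't be stacked directly.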