Data set#

A dataset in Torch is a special type of object that prepares data and returns individual data units through indexing syntax. It has to implement the __getitem__ and __len__ methods. The following cell imports the modules used in the examples below:

import torch
import torch.utils.data as td
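
For illustration, here is a minimal sketch of such a map-style dataset; the SquaresDataset name and its contents are invented for this example.

class SquaresDataset(td.Dataset):
    # A minimal map-style dataset: __len__ reports the number of items,
    # __getitem__ returns the item at the given index.
    def __init__(self, n: int):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return torch.tensor(idx) ** 2

squares = SquaresDataset(5)
len(squares), squares[3]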

Iterable datasets#

There is a special kind of dataset: the iterable dataset. Its defining property is that it is used as an iterator, so it has to implement a special __iter__ method.

Check the corresponding section of the PyTorch documentation.


The following cell defines such a dataset.

class MyDataset(td.IterableDataset):
    def __iter__(self):
        # Yield 20 one-element tensors: [0], [1], ..., [19]
        return iter(torch.arange(0, 20)[:, None])

As a result, you'll have an object that returns all instances one by one.

my_dataset = MyDataset()
list(my_dataset)
[tensor([0]),
 tensor([1]),
 tensor([2]),
 tensor([3]),
 tensor([4]),
 tensor([5]),
 tensor([6]),
 tensor([7]),
 tensor([8]),
 tensor([9]),
 tensor([10]),
 tensor([11]),
 tensor([12]),
 tensor([13]),
 tensor([14]),
 tensor([15]),
 tensor([16]),
 tensor([17]),
 tensor([18]),
 tensor([19])]

The following cell shows the result of applying a data loader to the dataset we created earlier.

data_loader = td.DataLoader(my_dataset, batch_size=3)
list(data_loader)
[tensor([[0],
         [1],
         [2]]),
 tensor([[3],
         [4],
         [5]]),
 tensor([[6],
         [7],
         [8]]),
 tensor([[ 9],
         [10],
         [11]]),
 tensor([[12],
         [13],
         [14]]),
 tensor([[15],
         [16],
         [17]]),
 tensor([[18],
         [19]])]

Train/test split#

To split data into train/test subsets, there is the PyTorch function torch.utils.data.random_split. It doesn't physically split the data, but rather returns objects whose indices refer to different elements of the original dataset.


The following cell generates the Torch dataset that will be used as an example and prints its length.

X = torch.arange(20).reshape((5, 4))
data_set = td.TensorDataset(X)
len(data_set)
5

Applying the torch.utils.data.random_split function to this object returns Subset objects that practically behave just like regular datasets.

The lengths argument specifies the proportions of the original dataset that will be included in the resulting subsets.

train, test = td.random_split(dataset=data_set, lengths=[0.8, 0.2])
type(train), type(test)
(torch.utils.data.dataset.Subset, torch.utils.data.dataset.Subset)

The following cell demonstrates that the sizes of the resulting subsets correspond to the values specified in the lengths argument.

len(train), len(test)
(4, 1)
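
Since a Subset only references the original data, you can inspect what it points to through its dataset and indices attributes. A quick check (the exact index values depend on the random split):

# The Subset keeps a reference to the original dataset and a list of indices
print(train.dataset is data_set)
print(train.indices)

The first line should print True, confirming that no data was copied.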

Seed#

The seed can be specified using the generator argument. Pass the corresponding torch generator there.


The following cell demonstrates how to fix the random seed using torch.Generator().manual_seed().

train, test = td.random_split(
    dataset=td.TensorDataset(X),
    lengths=(0.8, 0.2),
    generator=torch.Generator().manual_seed(20)
)
[t for t in train]
[(tensor([4, 5, 6, 7]),),
 (tensor([12, 13, 14, 15]),),
 (tensor([ 8,  9, 10, 11]),),
 (tensor([16, 17, 18, 19]),)]

Therefore, any run of this cell will produce the same result.
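
As a quick sanity check, here is a sketch along the lines of the cell above: two splits performed with generators seeded identically should select the same indices.

gen_a = torch.Generator().manual_seed(20)
gen_b = torch.Generator().manual_seed(20)
split_a, _ = td.random_split(td.TensorDataset(X), (0.8, 0.2), generator=gen_a)
split_b, _ = td.random_split(td.TensorDataset(X), (0.8, 0.2), generator=gen_b)
# Both splits should pick the same indices from the original dataset
print(split_a.indices == split_b.indices)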