torch.utils.data

class torch.utils.data.Dataset
An abstract class representing a Dataset.
All other datasets should subclass it. All subclasses should override __len__, which provides the size of the dataset, and __getitem__, which supports integer indexing in the range from 0 to len(self), exclusive.
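As a minimal sketch of the subclassing contract described above, the following defines a toy dataset. The class name SquaresDataset and its contents are hypothetical and chosen only for illustration.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical dataset: item i is the pair (i, i**2) as tensors."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Size of the dataset.
        return self.n

    def __getitem__(self, idx):
        # Integer indexing for idx in range(len(self)).
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        x = torch.tensor(float(idx))
        return x, x ** 2


squares = SquaresDataset(5)
x, y = squares[3]  # the pair (3.0, 9.0) as zero-dimensional tensors
```

Raising IndexError for out-of-range indices is a design choice in this sketch, not something the base class requires.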

class torch.utils.data.TensorDataset(data_tensor, target_tensor)
Dataset wrapping data and target tensors.
Each sample will be retrieved by indexing both tensors along the first dimension.
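Parameters:
- data_tensor (Tensor) – contains sample data.
- target_tensor (Tensor) – contains sample targets (labels).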
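A short sketch of how indexing along the first dimension plays out; the tensor shapes and values here are arbitrary assumptions made for the example.

```python
import torch
from torch.utils.data import TensorDataset

# 4 samples with 3 features each, plus one target per sample.
data = torch.arange(12, dtype=torch.float32).reshape(4, 3)
targets = torch.tensor([0, 1, 0, 1])

pairs = TensorDataset(data, targets)

# Indexing returns row 2 of each wrapped tensor.
sample, label = pairs[2]
```

Both tensors need the same size in their first dimension, since that dimension enumerates the samples.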

class torch.utils.data.ConcatDataset(datasets)
Dataset to concatenate multiple datasets. Useful for assembling existing datasets, including large-scale ones, because the concatenation is done on the fly.
Parameters: datasets (iterable) – List of datasets to be concatenated
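The following sketch concatenates two small in-memory datasets. The names first, second, and combined are illustrative only; any objects implementing the Dataset interface should work the same way.

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

first = TensorDataset(torch.zeros(3, 2), torch.zeros(3))
second = TensorDataset(torch.ones(2, 2), torch.ones(2))

combined = ConcatDataset([first, second])

# Indices 0..2 map to `first`, indices 3..4 map to `second`.
x, y = combined[3]
```

No data is copied at construction time; each lookup is forwarded to the underlying dataset that owns the index.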

class torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=<function default_collate>, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None)
Data loader. Combines a dataset and a sampler, and provides single- or multi-process iterators over the dataset.
Parameters:
- dataset (Dataset) – dataset from which to load the data.
- batch_size (int, optional) – how many samples per batch to load (default: 1).
- shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).
- sampler (Sampler, optional) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.
- batch_sampler (Sampler, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
- num_workers (int, optional) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
- collate_fn (callable, optional) – merges a list of samples to form a mini-batch.
- pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory before returning them.
- drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
- timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)
- worker_init_fn (callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
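To illustrate how batch_size, shuffle, and drop_last interact, here is a small sketch over a 10-sample dataset. The dataset contents are an arbitrary assumption, and num_workers is left at 0 so that everything runs in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(
    torch.arange(10, dtype=torch.float32).unsqueeze(1),  # 10 samples, 1 feature
    torch.arange(10),                                    # 10 targets
)

# 10 is not divisible by 4, so the last batch holds only 2 samples.
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=False)
batch_sizes = [x.size(0) for x, y in loader]

# With drop_last=True the incomplete final batch is discarded.
loader_drop = DataLoader(dataset, batch_size=4, drop_last=True)
```

Here len(loader) counts batches rather than samples, so the two loaders report different lengths.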
Note
By default, each worker will have its PyTorch seed set to base_seed + worker_id, where base_seed is a long generated by the main process using its RNG. You may use torch.initial_seed() to access this value in worker_init_fn, which can be used to set other seeds (e.g. NumPy) before data loading.
Warning
If the "spawn" start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function.
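One possible way to act on the note and warning above is sketched below. The function names seed_worker and make_loader are hypothetical. The sketch seeds Python's random module; NumPy could presumably be seeded the same way with numpy.random.seed, which is why the seed is reduced to 32 bits.

```python
import random

import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id):
    # Inside a worker, torch.initial_seed() reflects base_seed + worker_id.
    # Reduce it to 32 bits so it is also acceptable to libraries that
    # require a 32-bit seed.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    # Assumption: NumPy users would add numpy.random.seed(worker_seed) here.


def make_loader(dataset):
    # seed_worker is a module-level function, so it stays picklable even
    # when the "spawn" start method is used (a lambda would not be).
    return DataLoader(dataset, batch_size=2, num_workers=2,
                      worker_init_fn=seed_worker)


example_loader = make_loader(TensorDataset(torch.zeros(4, 1), torch.zeros(4)))
```

Constructing the loader does not start any worker processes; they are created when iteration begins, which is also when seed_worker would be called in each worker.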

class torch.utils.data.sampler.Sampler(data_source)
Base class for all Samplers.
Every Sampler subclass has to provide an __iter__ method, providing a way to iterate over indices of dataset elements, and a __len__ method that returns the length of the returned iterators.
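A minimal sketch of a custom sampler that satisfies the two requirements above. ReverseSampler is a hypothetical name used only for this example.

```python
from torch.utils.data.sampler import Sampler


class ReverseSampler(Sampler):
    """Hypothetical sampler: yields indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # Iterate over dataset indices in reverse order.
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        # Number of indices produced by one pass of __iter__.
        return len(self.data_source)


order = list(ReverseSampler(["a", "b", "c"]))
```

An instance like this could be passed as the sampler argument of a DataLoader, in which case shuffle must stay False.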

class torch.utils.data.sampler.SequentialSampler(data_source)
Samples elements sequentially, always in the same order.
Parameters: data_source (Dataset) – dataset to sample from

class torch.utils.data.sampler.RandomSampler(data_source)
Samples elements randomly, without replacement.
Parameters: data_source (Dataset) – dataset to sample from
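The difference between the sequential and random samplers can be seen by listing the indices each one yields. In this sketch a plain list stands in for a dataset, on the assumption that the samplers only need its length.

```python
from torch.utils.data.sampler import RandomSampler, SequentialSampler

items = list(range(6))  # stand-in for a dataset of 6 elements

# Always 0, 1, 2, 3, 4, 5.
sequential = [int(i) for i in SequentialSampler(items)]

# A permutation of the same indices; the order varies between runs.
shuffled = [int(i) for i in RandomSampler(items)]
```

Because sampling is without replacement, every index appears exactly once per pass in both cases.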

class torch.utils.data.sampler.SubsetRandomSampler(indices)
Samples elements randomly from a given list of indices, without replacement.
Parameters: indices (list) – a list of indices
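A common use of this sampler is splitting one dataset into training and validation parts. The sketch below assumes an 8/2 split of a 10-sample dataset; the index lists and variable names are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

dataset = TensorDataset(torch.arange(10).unsqueeze(1), torch.arange(10))

train_idx = list(range(0, 8))   # first 8 samples for training
val_idx = list(range(8, 10))    # last 2 samples for validation

train_loader = DataLoader(dataset, batch_size=4,
                          sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=4,
                        sampler=SubsetRandomSampler(val_idx))

# Collect the targets seen through each loader.
seen_train = sorted(int(t) for _, y in train_loader for t in y)
seen_val = sorted(int(t) for _, y in val_loader for t in y)
```

Each loader visits only its own indices, in a random order that changes on every pass.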

class torch.utils.data.sampler.WeightedRandomSampler(weights, num_samples, replacement=True)
Samples elements from [0,..,len(weights)-1] with given probabilities (weights).
Parameters:
- weights (list) – a list of weights, not necessarily summing to one
- num_samples (int) – number of samples to draw
- replacement (bool) – if True, samples are drawn with replacement. If not, they are drawn without replacement, which means that when a sample index is drawn for a row, it cannot be drawn again for that row.
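The sketch below draws indices with unnormalized weights to show how they shape the sample. The weight values are arbitrary assumptions, and the seed is fixed only to make the run repeatable.

```python
import torch
from torch.utils.data.sampler import WeightedRandomSampler

torch.manual_seed(0)

# Unnormalized weights: index 0 is never drawn, and index 2 is three
# times as likely as index 1.
weights = [0.0, 1.0, 3.0]

sampler = WeightedRandomSampler(weights, num_samples=1000, replacement=True)
drawn = [int(i) for i in sampler]

# How often each index was drawn.
counts = [drawn.count(i) for i in range(len(weights))]
```

With replacement enabled, num_samples may exceed len(weights), which is what makes this sampler usable for oversampling rare classes.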

class torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None)
Sampler that restricts data loading to a subset of the dataset.
It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such a case, each process can pass a DistributedSampler instance as a DataLoader sampler, and load a subset of the original dataset that is exclusive to it.
Note
Dataset is assumed to be of constant size.
Parameters:
- dataset – Dataset used for sampling.
- num_replicas (optional) – Number of processes participating in distributed training.
- rank (optional) – Rank of the current process within num_replicas.
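To show how the dataset is partitioned, this sketch builds one sampler per rank inside a single process. Passing num_replicas and rank explicitly is an assumption made so the example runs without an initialized process group; in real distributed training each process would create only the sampler for its own rank.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8).unsqueeze(1), torch.arange(8))

# Simulate two processes by building the sampler for rank 0 and rank 1.
shards = [
    [int(i) for i in DistributedSampler(dataset, num_replicas=2, rank=r)]
    for r in range(2)
]
```

Each rank receives half of the indices, and the two halves do not overlap because the dataset size is divisible by the number of replicas.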