An abstract class representing a Dataset.
All other datasets should subclass it. All subclasses should override
__len__, that provides the size of the dataset, and
__getitem__, supporting integer indexing in range from 0 to len(self) exclusive.
Dataset wrapping tensors.
Each sample will be retrieved by indexing tensors along the first dimension.
Parameters: *tensors (Tensor) – tensors that have the same size of the first dimension.
Dataset to concatenate multiple datasets. Purpose: useful to assemble different existing datasets, possibly large-scale datasets as the concatenation operation is done in an on-the-fly manner.
Parameters: datasets (sequence) – List of datasets to be concatenated
Subset of a dataset at specified indices.
- dataset (Dataset) – The whole Dataset
- indices (sequence) – Indices in the whole set selected for subset
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=<function default_collate>, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None)¶
Data loader. Combines a dataset and a sampler, and provides single- or multi-process iterators over the dataset.
- dataset (Dataset) – dataset from which to load the data.
- batch_size (int, optional) – how many samples per batch to load
- shuffle (bool, optional) – set to
Trueto have the data reshuffled at every epoch (default:
- sampler (Sampler, optional) – defines the strategy to draw samples from
the dataset. If specified,
shufflemust be False.
- batch_sampler (Sampler, optional) – like sampler, but returns a batch of
indices at a time. Mutually exclusive with
- num_workers (int, optional) – how many subprocesses to use for data
loading. 0 means that the data will be loaded in the main process.
- collate_fn (callable, optional) – merges a list of samples to form a mini-batch.
- pin_memory (bool, optional) – If
True, the data loader will copy tensors into CUDA pinned memory before returning them. If your data elements are a custom type, or your
collate_fnreturns a batch that is a custom type see the example below.
- drop_last (bool, optional) – set to
Trueto drop the last incomplete batch, if the dataset size is not divisible by the batch size. If
Falseand the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default:
- timeout (numeric, optional) – if positive, the timeout value for collecting a batch
from workers. Should always be non-negative. (default:
- worker_init_fn (callable, optional) – If not
None, this will be called on each worker subprocess with the worker id (an int in
[0, num_workers - 1]) as input, after seeding and before data loading. (default:
By default, each worker will have its PyTorch seed set to
base_seed + worker_id, where
base_seedis a long generated by main process using its RNG. However, seeds for other libraies may be duplicated upon initializing workers (w.g., NumPy), causing each worker to return identical random numbers. (See My data loader workers return identical random numbers section in FAQ.) You may use
torch.initial_seed()to access the PyTorch seed for each worker in
worker_init_fn, and use it to set other seeds before data loading.
spawnstart method is used,
worker_init_fncannot be an unpicklable object, e.g., a lambda function.
The default memory pinning logic only recognizes Tensors and maps and iterables containg Tensors. By default, if the pinning logic sees a batch that is a custom type (which will occur if you have a
collate_fnthat returns a custom batch type), or if each element of your batch is a custom type, the pinning logic will not recognize them, and it will return that batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data types, define a
pin_memorymethod on your custom type(s).
class SimpleCustomBatch: def __init__(self, data): transposed_data = list(zip(*data)) self.inp = torch.stack(transposed_data, 0) self.tgt = torch.stack(transposed_data, 0) def pin_memory(self): self.inp = self.inp.pin_memory() self.tgt = self.tgt.pin_memory() return self def collate_wrapper(batch): return SimpleCustomBatch(batch) inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5) tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5) dataset = TensorDataset(inps, tgts) loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper, pin_memory=True) for batch_ndx, sample in enumerate(loader): print(sample.inp.is_pinned()) print(sample.tgt.is_pinned())
Randomly split a dataset into non-overlapping new datasets of given lengths.
- dataset (Dataset) – Dataset to be split
- lengths (sequence) – lengths of splits to be produced
Base class for all Samplers.
Every Sampler subclass has to provide an __iter__ method, providing a way to iterate over indices of dataset elements, and a __len__ method that returns the length of the returned iterators.
Samples elements sequentially, always in the same order.
Parameters: data_source (Dataset) – dataset to sample from
RandomSampler(data_source, replacement=False, num_samples=None)¶
Samples elements randomly. If without replacement, then sample from a shuffled dataset. If with replacement, then user can specify
Samples elements randomly from a given list of indices, without replacement.
Parameters: indices (sequence) – a sequence of indices
WeightedRandomSampler(weights, num_samples, replacement=True)¶
Samples elements from [0,..,len(weights)-1] with given probabilities (weights).
- weights (sequence) – a sequence of weights, not necessary summing up to one
- num_samples (int) – number of samples to draw
- replacement (bool) – if
True, samples are drawn with replacement. If not, they are drawn without replacement, which means that when a sample index is drawn for a row, it cannot be drawn again for that row.
>>> list(WeightedRandomSampler([0.1, 0.9, 0.4, 0.7, 3.0, 0.6], 5, replacement=True)) [0, 0, 0, 1, 0] >>> list(WeightedRandomSampler([0.9, 0.4, 0.05, 0.2, 0.3, 0.1], 5, replacement=False)) [0, 1, 4, 3, 2]
BatchSampler(sampler, batch_size, drop_last)¶
Wraps another sampler to yield a mini-batch of indices.
>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False)) [[0, 1, 2], [3, 4, 5], [6, 7, 8], ] >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True)) [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
DistributedSampler(dataset, num_replicas=None, rank=None)¶
Sampler that restricts data loading to a subset of the dataset.
It is especially useful in conjunction with
torch.nn.parallel.DistributedDataParallel. In such case, each process can pass a DistributedSampler instance as a DataLoader sampler, and load a subset of the original dataset that is exclusive to it.
Dataset is assumed to be of constant size.
- dataset – Dataset used for sampling.
- num_replicas (optional) – Number of processes participating in distributed training.
- rank (optional) – Rank of the current process within num_replicas.