BucketBatcher

class torchdata.datapipes.iter.BucketBatcher(datapipe: IterDataPipe[T_co], batch_size: int, drop_last: bool = False, batch_num: int = 100, bucket_num: int = 1, sort_key: Optional[Callable] = None, use_in_batch_shuffle: bool = True)

Creates mini-batches of data from sorted buckets (functional name: bucketbatch). An outer dimension is added: each batch has batch_size elements if drop_last is set to True; otherwise the last batch may contain length % batch_size elements.

The purpose of this DataPipe is to batch samples that are similar under the given sorting function. In the text domain, for example, it can batch samples with similar numbers of tokens to minimize padding and increase throughput.

Parameters:
  • datapipe – Iterable DataPipe being batched

  • batch_size – The size of each batch

  • drop_last – Option to drop the last batch if it’s not full

  • batch_num – Number of batches within a bucket (i.e. bucket_size = batch_size * batch_num)

  • bucket_num – Number of buckets that make up a pool for shuffling (i.e. pool_size = bucket_size * bucket_num)

  • sort_key – Callable to sort a bucket (list)

  • use_in_batch_shuffle – if True, do in-batch shuffle; if False, buffer shuffle
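
The pool sizing implied by batch_num and bucket_num can be worked through with small illustrative values (the numbers below are assumptions, not defaults):

```python
# Pool sizing: bucket_size = batch_size * batch_num,
# pool_size = bucket_size * bucket_num. Values chosen for illustration.
batch_size, batch_num, bucket_num = 3, 2, 2
bucket_size = batch_size * batch_num   # 6 samples sorted together per bucket
pool_size = bucket_size * bucket_num   # 12 samples buffered before shuffling
print(bucket_size, pool_size)          # 6 12
```

Larger batch_num sorts more samples together (better length grouping); larger bucket_num shuffles over a bigger pool (better randomness).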

Example

>>> from torchdata.datapipes.iter import IterableWrapper
>>> source_dp = IterableWrapper(range(10))
>>> batch_dp = source_dp.bucketbatch(batch_size=3, drop_last=True)
>>> list(batch_dp)
[[5, 6, 7], [9, 0, 1], [4, 3, 2]]
>>> def sort_bucket(bucket):
...     return sorted(bucket)
>>> batch_dp = source_dp.bucketbatch(
...     batch_size=3, drop_last=True, batch_num=100,
...     bucket_num=1, use_in_batch_shuffle=False, sort_key=sort_bucket
... )
>>> list(batch_dp)
[[3, 4, 5], [6, 7, 8], [0, 1, 2]]
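
To see why sorting a bucket helps with variable-length text, the pooling/sorting/batching logic can be sketched in plain Python. This is a simplified sketch of the behavior described above, not the actual BucketBatcher implementation; the function name bucket_batch and the seeded shuffle are assumptions for illustration:

```python
import random

def bucket_batch(samples, batch_size, batch_num=100, bucket_num=1,
                 sort_key=None, drop_last=False, seed=0):
    """Simplified sketch of BucketBatcher's pool/sort/batch logic.

    Buffers pool_size = batch_size * batch_num * bucket_num samples,
    sorts each bucket with sort_key, slices buckets into batches, then
    shuffles the batch order within the pool (buffer shuffle).
    """
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    bucket_size = batch_size * batch_num
    pool_size = bucket_size * bucket_num
    batches = []
    for start in range(0, len(samples), pool_size):
        pool = samples[start:start + pool_size]
        pool_batches = []
        for b in range(0, len(pool), bucket_size):
            bucket = pool[b:b + bucket_size]
            if sort_key is not None:
                bucket = sort_key(bucket)
            pool_batches.extend(bucket[i:i + batch_size]
                                for i in range(0, len(bucket), batch_size))
        rng.shuffle(pool_batches)  # shuffle batch order within the pool
        batches.extend(pool_batches)
    if drop_last:
        batches = [b for b in batches if len(b) == batch_size]
    return batches

# Token sequences of varying length: sorting by length groups
# similar-length samples into the same batch, minimizing padding.
sentences = [["w"] * n for n in [5, 2, 9, 2, 7, 5, 9, 1, 3]]
for batch in bucket_batch(sentences, batch_size=3,
                          sort_key=lambda b: sorted(b, key=len)):
    print([len(s) for s in batch])
```

Each printed batch contains sequences of similar length (e.g. lengths 1, 2, 2 together and 7, 9, 9 together), while the order of the batches themselves is shuffled.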
