DataFrameMaker

class torchdata.datapipes.iter.DataFrameMaker(source_dp: IterDataPipe[T_co], dataframe_size: int = 1000, dtype=None, dtype_generator=None, columns: Optional[List[str]] = None, device: str = '')

Takes rows of data, batches a number of them together and creates TorchArrow DataFrames (functional name: dataframe).

Note

There is a trade-off between the number of rows in each DataFrame and memory usage: larger DataFrames mean fewer batches but more rows held in memory at once. Choose dataframe_size carefully.

Parameters:
  • source_dp – IterDataPipe containing rows of data

  • dataframe_size – number of rows of data within each DataFrame (default: 1000); the page size can be an option

  • dtype – the TorchArrow dtype for the DataFrame, given as a torcharrow.dtypes.DType

  • dtype_generator – function with no input arguments that returns a torcharrow.dtypes.DType; it overrides dtype if both are given. This is useful when the desired dtype is not serializable.

  • columns – List of str that specifies the column names of the DataFrame

  • device – specify the device on which the DataFrame will be stored
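
To make the effect of dataframe_size concrete, the following is a minimal pure-Python sketch of the row-batching step only. It does not build TorchArrow DataFrames and is not the library's implementation; batch_rows is a hypothetical helper, and the sketch assumes the final, possibly smaller, batch is kept rather than dropped.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def batch_rows(rows: Iterable[Tuple], dataframe_size: int = 1000) -> Iterator[List[Tuple]]:
    """Yield lists of at most `dataframe_size` rows (hypothetical helper)."""
    if dataframe_size <= 0:
        raise ValueError("dataframe_size must be positive")
    it = iter(rows)
    while True:
        # Pull up to `dataframe_size` rows; only this batch is held in memory.
        batch = list(islice(it, dataframe_size))
        if not batch:
            return
        yield batch


# Seven single-column rows, grouped three at a time.
source_data = [(i,) for i in range(7)]
batches = list(batch_rows(source_data, dataframe_size=3))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [3, 3, 1]
```

Each batch here stands in for the rows that would go into one DataFrame, so a larger dataframe_size means fewer, larger batches and more rows resident in memory at a time.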

Example

>>> from torchdata.datapipes.iter import IterableWrapper
>>> import torcharrow.dtypes as dt
>>> source_data = [(i,) for i in range(3)]
>>> source_dp = IterableWrapper(source_data)
>>> DTYPE = dt.Struct([dt.Field("Values", dt.int32)])
>>> df_dp = source_dp.dataframe(dtype=DTYPE)
>>> list(df_dp)[0]
  index    Values
-------  --------
      0         0
      1         1
      2         2
dtype: Struct([Field('Values', int32)]), count: 3, null_count: 0
