ParquetDataFrameLoader

class torchdata.datapipes.iter.ParquetDataFrameLoader(source_dp: IterDataPipe[str], dtype=None, columns: Optional[List[str]] = None, device: str = '', use_threads: bool = False)

Takes paths to Parquet files and returns a TorchArrow DataFrame for each row group within each Parquet file (functional name: load_parquet_as_df).

Parameters:
  • source_dp – source DataPipe containing paths to the Parquet files

  • columns – list of str specifying the column names of the DataFrame

  • use_threads – if True, the Parquet reader will perform multi-threaded column reads

  • dtype – the TorchArrow dtype for the DataFrame, given as a torcharrow.dtypes.DType

  • device – the device on which the DataFrame will be stored

Example

>>> from torchdata.datapipes.iter import FileLister
>>> import torcharrow.dtypes as dt
>>> DTYPE = dt.Struct([dt.Field("Values", dt.int32)])
>>> source_dp = FileLister(".", masks="df*.parquet")
>>> parquet_df_dp = source_dp.load_parquet_as_df(dtype=DTYPE)
>>> list(parquet_df_dp)[0]
  index    Values
-------  --------
      0         0
      1         1
      2         2
dtype: Struct([Field('Values', int32)]), count: 3, null_count: 0
