ParquetDataFrameLoader
- class torchdata.datapipes.iter.ParquetDataFrameLoader(source_dp: IterDataPipe[str], dtype=None, columns: Optional[List[str]] = None, device: str = '', use_threads: bool = False)
Takes in paths to Parquet files and returns a TorchArrow DataFrame for each row group within a Parquet file (functional name: load_parquet_as_df).
Parameters:
source_dp – source DataPipe containing paths to the Parquet files
columns – List of str that specifies the column names of the DataFrame
use_threads – if True, the Parquet reader will perform multi-threaded column reads
dtype – specifies the TorchArrow dtype for the DataFrame; use torcharrow.dtypes.DType
device – specify the device on which the DataFrame will be stored
Example
>>> from torchdata.datapipes.iter import FileLister
>>> import torcharrow.dtypes as dt
>>> DTYPE = dt.Struct([dt.Field("Values", dt.int32)])
>>> source_dp = FileLister(".", masks="df*.parquet")
>>> parquet_df_dp = source_dp.load_parquet_as_df(dtype=DTYPE)
>>> list(parquet_df_dp)[0]
  index    Values
-------  --------
      0         0
      1         1
      2         2
dtype: Struct([Field('Values', int32)]), count: 3, null_count: 0