Attention
June 2024 Status Update: Removing DataPipes and DataLoader V2
We are re-focusing the torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader. We do not plan on continuing development or maintaining the [DataPipes] and [DataLoaderV2] solutions, and they will be removed from the torchdata repo. We’ll also be revisiting the DataPipes references in pytorch/pytorch. In release torchdata==0.8.0 (July 2024) they will be marked as deprecated, and in 0.9.0 (Oct 2024) they will be deleted. Existing users are advised to pin to torchdata==0.8.0 or an older version until they are able to migrate away. Subsequent releases will not include DataPipes or DataLoaderV2. Please reach out if you have suggestions or comments (please use this issue for feedback).
Rows2Columnar
class torchdata.datapipes.iter.Rows2Columnar(source_datapipe: IterDataPipe[List[Union[Dict, List]]], column_names: Optional[List[str]] = None)
Accepts an input DataPipe with batches of data, processes one batch at a time, and yields a Dict for each batch, with column_names as keys and lists of the corresponding values from each row as values (functional name: rows2columnar). Within the input DataPipe, each row within a batch must be either a Dict or a List.
Note
If column_names is not given and each row is a Dict, the keys of that Dict will be used as column names.
Parameters:
source_datapipe – a DataPipe where each item is a batch. Within each batch there are rows, and each row is a List or Dict
column_names – if each element in a batch is a Dict, column_names acts as a filter for matching keys; otherwise, these are used as keys for the generated Dict of each batch
Example
>>> # Each element in a batch is a `Dict`
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp = IterableWrapper([[{'a': 1}, {'b': 2, 'a': 1}], [{'a': 1, 'b': 200}, {'b': 2, 'c': 3, 'a': 100}]])
>>> row2col_dp = dp.rows2columnar()
>>> list(row2col_dp)
[defaultdict(<class 'list'>, {'a': [1, 1], 'b': [2]}),
 defaultdict(<class 'list'>, {'a': [1, 100], 'b': [200, 2], 'c': [3]})]
>>> row2col_dp = dp.rows2columnar(column_names=['a'])
>>> list(row2col_dp)
[defaultdict(<class 'list'>, {'a': [1, 1]}),
 defaultdict(<class 'list'>, {'a': [1, 100]})]
>>> # Each element in a batch is a `List`
>>> dp = IterableWrapper([[[0, 1, 2, 3], [4, 5, 6, 7]]])
>>> row2col_dp = dp.rows2columnar(column_names=["1st_in_batch", "2nd_in_batch", "3rd_in_batch", "4th_in_batch"])
>>> list(row2col_dp)
[defaultdict(<class 'list'>, {'1st_in_batch': [0, 4], '2nd_in_batch': [1, 5], '3rd_in_batch': [2, 6], '4th_in_batch': [3, 7]})]
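Since DataPipes are scheduled for removal, the same row-to-column transformation can be approximated in plain Python without torchdata. The following is a minimal sketch, not the library's implementation: rows_to_columnar is a hypothetical helper name, and its error behavior (KeyError for a missing requested key, IndexError when a List row has more values than column_names) is an assumption of this sketch rather than documented behavior. It reproduces the outputs of the example above.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Optional, Union

Row = Union[Dict[str, Any], List[Any]]


def rows_to_columnar(
    batches: Iterable[List[Row]],
    column_names: Optional[List[str]] = None,
) -> Iterator[Dict[str, List[Any]]]:
    """Yield one column-oriented dict per batch.

    Hypothetical stand-in for the rows2columnar DataPipe; not a torchdata API.
    """
    names = list(column_names or [])
    for batch in batches:
        columnar: Dict[str, List[Any]] = defaultdict(list)
        for row in batch:
            if isinstance(row, dict):
                # Dict rows: column_names (if given) selects which keys to keep;
                # otherwise every key present in the row becomes a column.
                keys = names if names else list(row.keys())
                for key in keys:
                    # Assumption of this sketch: a requested key that is
                    # missing from the row raises KeyError.
                    columnar[key].append(row[key])
            else:
                # List rows: column_names supplies the key for each position.
                # Assumption of this sketch: IndexError if the row is longer
                # than column_names.
                for i, value in enumerate(row):
                    columnar[names[i]].append(value)
        yield columnar


# Usage with the same inputs as the example above.
dict_batches = [
    [{'a': 1}, {'b': 2, 'a': 1}],
    [{'a': 1, 'b': 200}, {'b': 2, 'c': 3, 'a': 100}],
]
all_columns = list(rows_to_columnar(dict_batches))
only_a = list(rows_to_columnar(dict_batches, column_names=['a']))

list_batches = [[[0, 1, 2, 3], [4, 5, 6, 7]]]
positional = list(
    rows_to_columnar(list_batches, column_names=["w", "x", "y", "z"])
)
```

Because the helper is a plain generator over any iterable of batches, it can wrap the output of a torch.utils.data.DataLoader or any other batch source; whether that fits a given pipeline depends on how batches are produced there.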