torcharrow.DataFrame.map
- DataFrame.map(arg: ty.Union[ty.Dict, ty.Callable], na_action: ty.Literal['ignore', None] = None, dtype: ty.Optional[dt.DType] = None, columns: ty.Optional[ty.List[str]] = None)
Maps rows according to input correspondence.
- Parameters:
arg (dict or callable) – If arg is a dict, input is mapped using this dict and non-mapped values become null. If arg is a callable, it is treated as a user-defined function (UDF) that is invoked on each element of the input. Callables must be global functions or methods on class instances; lambdas are not supported.
na_action ('ignore' or None, default None) – If your UDF returns null for null input, selecting 'ignore' is an efficiency improvement where map avoids calling your UDF on null values. If None, map always calls the UDF.
dtype (DType, default None) – Used to force the output type. Required if the result type differs from the item type.
columns (list of column names, default None) – Determines which columns to provide to the mapping dict or UDF.
Examples
>>> import torcharrow as ta
>>> ta.column([1,2,None,4]).map({1:111})
0  111
1  None
2  None
3  None
dtype: Int64(nullable=True), length: 4, null_count: 3
Using a defaultdict to provide a missing value:
>>> from collections import defaultdict
>>> ta.column([1,2,None,4]).map(defaultdict(lambda: -1, {1:111}))
0  111
1  -1
2  -1
3  -1
dtype: Int64(nullable=True), length: 4, null_count: 0
Using a user-supplied Python function:
>>> def add_ten(num):
...     return num + 10
...
>>> ta.column([1,2,None,4]).map(add_ten, na_action='ignore')
0  11
1  12
2  None
3  14
dtype: Int64(nullable=True), length: 4, null_count: 1
Note that .map(add_ten, na_action=None) in the example above would fail with a type error, since add_ten is not defined for None/null. To pass nulls to a UDF, the UDF must handle them explicitly:
>>> def add_ten_or_0(num):
...     return 0 if num is None else num + 10
...
>>> ta.column([1,2,None,4]).map(add_ten_or_0, na_action=None)
0  11
1  12
2  0
3  14
dtype: Int64(nullable=True), length: 4, null_count: 0
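The dictionary and null-handling rules shown so far can be modelled in plain Python. The sketch below is an illustrative model only, not torcharrow's implementation: `model_map` is a hypothetical helper that works on a Python list with `None` standing in for null, and it assumes that `na_action` matters only for callables, since that is the only case the parameter description covers.

```python
from collections import defaultdict


def model_map(values, arg, na_action=None):
    """Plain-Python model of element-wise map (hypothetical helper)."""
    if na_action not in ("ignore", None):
        raise ValueError("na_action must be 'ignore' or None")
    out = []
    for v in values:
        if callable(arg):
            if v is None and na_action == "ignore":
                # Skip the UDF for nulls and propagate the null.
                out.append(None)
            else:
                # na_action=None: the UDF is always called, even on nulls.
                out.append(arg(v))
        else:
            # Dict lookup: unmapped values become null. A defaultdict
            # supplies its default value instead of raising KeyError.
            try:
                out.append(arg[v])
            except KeyError:
                out.append(None)
    return out


def add_ten(num):
    return num + 10


print(model_map([1, 2, None, 4], {1: 111}))
print(model_map([1, 2, None, 4], defaultdict(lambda: -1, {1: 111})))
print(model_map([1, 2, None, 4], add_ten, na_action="ignore"))
```

In this model, calling `model_map([1, 2, None, 4], add_ten)` without `na_action="ignore"` raises a `TypeError` at the null element, which mirrors the failure described above.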
Mapping to different types requires a dtype parameter:
>>> import torcharrow.dtypes as dt
>>> ta.column([1,2,None,4]).map(str, dtype=dt.string)
0  '1'
1  '2'
2  'None'
3  '4'
dtype: string, length: 4, null_count: 0
When mapping over a DataFrame, the UDF gets the whole row as a tuple:
>>> def add_unary(tup):
...     return tup[0]+tup[1]
...
>>> ta.dataframe({'a': [1,2,3], 'b': [1,2,3]}).map(add_unary, dtype=dt.int64)
0  2
1  4
2  6
dtype: int64, length: 3, null_count: 0
Multi-parameter UDFs:
>>> def add_binary(a,b):
...     return a + b
...
>>> ta.dataframe({'a': [1,2,3], 'b': ['a', 'b', 'c'], 'c':[1,2,3]}).map(add_binary, columns=['a','c'], dtype=dt.int64)
0  2
1  4
2  6
dtype: int64, length: 3, null_count: 0
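The difference between the whole-row call and the multi-parameter call can be sketched without torcharrow. This is a simplified model under stated assumptions: `model_row_map` is a hypothetical helper, the table is a plain dict of equal-length lists, and the behaviour for a single selected column is a guess that the examples here do not confirm.

```python
def model_row_map(table, udf, columns=None):
    """Plain-Python model of row-wise map over a dict of columns."""
    n_rows = len(next(iter(table.values()), []))
    out = []
    for i in range(n_rows):
        if columns is None:
            # No selection: the UDF receives the whole row as one tuple.
            out.append(udf(tuple(col[i] for col in table.values())))
        else:
            # Selection: each selected column becomes one positional argument.
            out.append(udf(*(table[name][i] for name in columns)))
    return out


def add_unary(tup):
    return tup[0] + tup[1]


def add_binary(a, b):
    return a + b


print(model_row_map({"a": [1, 2, 3], "b": [1, 2, 3]}, add_unary))
print(model_row_map(
    {"a": [1, 2, 3], "b": ["a", "b", "c"], "c": [1, 2, 3]},
    add_binary,
    columns=["a", "c"],
))
```

Both calls print `[2, 4, 6]`. Selecting columns keeps the string column `b` out of the UDF, so `add_binary` only ever sees integers.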
Multi-return UDFs - functions that return more than one column can be specified by returning a DataFrame (also known as struct column); providing the return dtype is mandatory:
>>> ta.dataframe({'a': [17, 29, 30], 'b': [3,5,11]}).map(
...     divmod,
...     columns=['a','b'],
...     dtype=dt.Struct([dt.Field('quotient', dt.int64), dt.Field('remainder', dt.int64)]))
  index    quotient    remainder
-------  ----------  -----------
      0           5            2
      1           5            4
      2           2            8
dtype: Struct([Field('quotient', int64), Field('remainder', int64)]), count: 3, null_count: 0
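To make the struct result concrete, the following sketch shows how tuple-returning UDF results line up with named fields. It is a plain-Python illustration, not the library's code path: `model_struct_map` is a hypothetical helper, and the field names play the role of the `dt.Field` names in the struct dtype.

```python
def model_struct_map(left, right, udf, field_names):
    """Apply a two-argument UDF row by row and split the tuple results
    into one list per named field (hypothetical helper)."""
    results = [udf(x, y) for x, y in zip(left, right)]
    return {
        name: [row[i] for row in results]
        for i, name in enumerate(field_names)
    }


struct = model_struct_map(
    [17, 29, 30], [3, 5, 11], divmod, ["quotient", "remainder"]
)
print(struct)  # {'quotient': [5, 5, 2], 'remainder': [2, 4, 8]}
```

Each position in the returned tuple maps to one field, which is why the return dtype has to name and type every field up front.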
UDFs with state can be written by capturing the state in a (data)class and using one of its methods as the delegate:
>>> def fib(n):
...     if n == 0:
...         return 0
...     elif n == 1 or n == 2:
...         return 1
...     else:
...         return fib(n-1) + fib(n-2)
...
>>> from dataclasses import dataclass
>>> @dataclass
... class State:
...     state: int
...     def __post_init__(self):
...         self.state = fib(self.state)
...     def add_fib(self, x):
...         return self.state+x
...
>>> m = State(10)
>>> ta.column([1,2,3]).map(m.add_fib)
0  56
1  57
2  58
dtype: int64, length: 3, null_count: 0