Skip to content

Pyarrow¶

tensorflow does not support arrow backend¶

data types¶

https://pandas.pydata.org/docs/user_guide/pyarrow.html

string[pyarrow]: this is equivalent to pd.StringDtype('pyarrow') that can return NumPy-backed nullable types but slow
pd.ArrowDtype(pa.string()): will return ArrowDtype much faster
bool[pyarrow]
int64[pyarrow]
uint8[pyarrow]
uint64[pyarrow]
float32[pyarrow]
time64[us][pyarrow]
timestamp[s][pyarrow]? should use pd.ArrowDtype(pa.timestamp('s'))

pd.StringDtype('pyarrow'):

This allows pandas to utilize PyArrow's memory-efficient string representation for the data
Additionally, it can return NumPy-backed nullable types, meaning it can handle missing values efficiently using NumPy arrays
string[pyarrow]: This is a shortcut alias for pd.StringDtype('pyarrow')

pd.ArrowDtype(pa.string()):

It achieves similar memory efficiency as the other options
it returns ArrowDtype objects instead of NumPy-backed nullable types, soit might be less efficient for handling missing values compared to pd.StringDtype('pyarrow')

convert backend arrow/numpy¶

currently not possible using global setting

data = {'c1': [3, 2, 1, 0], 'c2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
d2 = df.convert_dtypes(dtype_backend='pyarrow')
d3 = d2.convert_dtypes(dtype_backend='numpy_nullable')
print(df.dtypes)
print(d2.dtypes)
print(d3.dtypes)

Notice that

df string type is object while

d3 string type is string[python], preserves pd.NA

some_series.astype(str)              # object
some_series.astype('string')         # string[python]
some_series.astype(pd.StringDtype()) # string[python]

`pyarrow table` to `pandas[pyarrow]`¶

import pyarrow.csv as pv
d = pv.read_csv('data.csv').to_pandas(types_mapper=pd.ArrowDtype)

pyarrow backend issues¶

mod and divmod not implmented: https://github.com/pandas-dev/pandas/pull/56694/files

import pandas as pd
d = pd.DataFrame({'x': [1,2,3]}, dtype='int64[pyarrow]')
(d['x'] + 2).mod(2) + 1