Pyarrow¶
tensorflow does not support arrow backend¶
data types¶
https://pandas.pydata.org/docs/user_guide/pyarrow.html
string[pyarrow]: equivalent to pd.StringDtype('pyarrow'); operations on it can return NumPy-backed nullable types, and it can be slower
pd.ArrowDtype(pa.string()): operations on it return ArrowDtype, and it can be much faster
bool[pyarrow]
int64[pyarrow]
uint8[pyarrow]
uint64[pyarrow]
float32[pyarrow]
time64[us][pyarrow]
timestamp[s][pyarrow]; for types that take parameters, prefer constructing the dtype explicitly, e.g.
pd.ArrowDtype(pa.timestamp('s'))
pd.StringDtype('pyarrow'):
This allows pandas to use PyArrow's memory-efficient string representation for the data.
Additionally, operations on it can return NumPy-backed nullable types (for example boolean or Int64), which represent missing values as pd.NA.
string[pyarrow]: This is a shortcut alias for pd.StringDtype('pyarrow').
pd.ArrowDtype(pa.string()):
It achieves similar memory efficiency to the other options.
Operations on it return ArrowDtype results instead of NumPy-backed nullable types, so the two are not interchangeable even though they generally behave similarly.
convert backend arrow/numpy¶
Currently the backend cannot be switched with a global setting; convert each DataFrame explicitly:
import pandas as pd

data = {'c1': [3, 2, 1, 0], 'c2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
d2 = df.convert_dtypes(dtype_backend='pyarrow')
d3 = d2.convert_dtypes(dtype_backend='numpy_nullable')
print(df.dtypes)
print(d2.dtypes)
print(d3.dtypes)
Notice that the string column in df has dtype object, while in d3 it has dtype string[python], which preserves pd.NA.
pyarrow table to pandas[pyarrow]¶
pyarrow backend issues¶
mod and divmod not implemented: https://github.com/pandas-dev/pandas/pull/56694/files