Categorical¶
https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a
benefit¶
reduce memory usage
runtime performance optimization
library integrations
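A rough way to check the memory benefit, assuming a made-up column with only a few distinct values repeated many times (the data and variable names here are illustrative, not from the article):

```python
import pandas as pd

# Hypothetical data: three distinct strings repeated 100,000 times each.
fruit_str = pd.Series(['apple', 'banana', 'cherry'] * 100_000)
fruit_cat = fruit_str.astype('category')

# deep=True measures the string data itself, not just the references to it.
str_bytes = fruit_str.memory_usage(deep=True)
cat_bytes = fruit_cat.memory_usage(deep=True)

print(f"string column:   {str_bytes:,} bytes")
print(f"category column: {cat_bytes:,} bytes")
```

The saving shrinks as the number of distinct values approaches the number of rows, so it is worth measuring on real data before converting.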
convert¶
fruit_cat = df['fruit'].astype('category')
my_cat = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_cat_2 = pd.Categorical.from_codes(codes, categories, ordered=True)
my_cat_2 = my_cat_2.as_ordered()  # returns a new, ordered Categorical
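A self-contained sketch of from_codes, since codes and categories are not defined in the lines above. The values below are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical inputs: each code is an index into `categories`; -1 marks a missing value.
categories = ['low', 'mid', 'high']
codes = [0, 2, 1, 0, -1]

sizes = pd.Categorical.from_codes(codes, categories, ordered=True)

# as_unordered() / as_ordered() return new objects; the original is left unchanged.
unordered = sizes.as_unordered()
reordered = unordered.as_ordered()

print(sizes)
print(unordered.ordered, reordered.ordered)
```

With ordered=True the order of the categories list defines the ordering, so comparisons and min/max follow 'low' < 'mid' < 'high' rather than alphabetical order.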
method¶
as_ordered, as_unordered, rename_categories, set_categories, add_categories, remove_categories, remove_unused_categories, reorder_categories
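my_cat.codes  # a bare Categorical exposes these directly; on a Series use s.cat.codes
my_cat.categories  # on a Series: s.cat.categories
my_cat.set_categories(['a','b','c','d'])  # returns a new Categorical; values outside the list become NaN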
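# remove unobserved categories
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]  # filtering rows keeps every original category
cat_s3 = cat_s3.cat.remove_unused_categories()  # returns a new Series, so assign the result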
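The same steps as a runnable sketch, assuming a small made-up cat_s (the notes above do not define it):

```python
import pandas as pd

# Hypothetical Series with four categories.
cat_s = pd.Series(['a', 'b', 'c', 'd', 'a'], dtype='category')

# Filtering rows does not shrink the category list.
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
print(list(cat_s3.cat.categories))

# remove_unused_categories() returns a new Series with only the observed categories.
trimmed = cat_s3.cat.remove_unused_categories()
print(list(trimmed.cat.categories))
```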
operating on categorical columns¶
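Operating on a category column can be much faster than on the equivalent string column, because the work can be done once per category instead of once per row.
df['str'].str.upper()
df['cat'].str.upper()  # works, but the result is a string column again
df['cat'].cat.rename_categories(str.upper)  # even faster, and the result is still category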
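A small sketch of the dtype difference between the two approaches, assuming a made-up 'cat' column:

```python
import pandas as pd

# Hypothetical frame with a single categorical column.
df = pd.DataFrame({'cat': pd.Categorical(['foo', 'bar', 'foo', 'baz'])})

# The .str accessor returns plain strings, so the categorical dtype is lost.
via_str = df['cat'].str.upper()

# rename_categories applies the function to each category and keeps the dtype.
via_cat = df['cat'].cat.rename_categories(str.upper)

print(via_str.dtype)
print(via_cat.dtype)
```

Both results hold the same values; only the second one is still categorical.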
df['cat'].dtype.categories holds the unique category values, so when no suitable cat method exists you can operate on those values directly.
merge¶
Merging DataFrames can turn category columns back into string columns.
merge(str, cat) => str
merge(cat, cat) => str when the two columns have different categories
df.astype({'cat': df2['cat'].dtype}).merge(df2, on='cat') => cat (both sides now share one dtype)
import numpy as np
import pandas as pd
d1 = pd.DataFrame({
'id': [5, 6],
'value': pd.Categorical(['b', 'c']),
})
d2 = pd.DataFrame({
'id': [5, 3, 6],
'value': pd.Categorical(['a', 'b', 'c']),
})
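df = pd.merge(d1, d2, on='id')  # cat: value_x and value_y are not join keys and keep their dtype
df = pd.merge(d1, d2)  # str: joins on id and value, and the two value columns have different categories
df = pd.concat([d1, d2])  # str: the two value columns have different categories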
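Applying the astype trick from above to d1 and d2 might look like the sketch below. It assumes d2's categories are a superset of d1's values; any value missing from the target categories would become NaN. The names d1_aligned, merged and stacked are illustrative.

```python
import pandas as pd

d1 = pd.DataFrame({'id': [5, 6], 'value': pd.Categorical(['b', 'c'])})
d2 = pd.DataFrame({'id': [5, 3, 6], 'value': pd.Categorical(['a', 'b', 'c'])})

# Cast d1's column to d2's dtype so both sides share exactly the same categories.
d1_aligned = d1.astype({'value': d2['value'].dtype})

merged = pd.merge(d1_aligned, d2)       # joins on both 'id' and 'value'
stacked = pd.concat([d1_aligned, d2])   # stacks the rows of both frames

print(merged['value'].dtype)
print(stacked['value'].dtype)
```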
groupby¶
When grouping on a categorical column, pandas (in versions where observed defaults to False) creates a group for every category in the dtype, even for categories that never appear in the data.
Pass observed=True to group only on the values that are actually present: df.groupby('cat', observed=True)['val'].mean()
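A minimal sketch of the difference, assuming a made-up frame in which category 'c' is declared but never used. observed is passed explicitly in both calls so the result does not depend on the installed version's default:

```python
import pandas as pd

# Hypothetical data: 'c' is a valid category but has no rows.
df = pd.DataFrame({
    'cat': pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']),
    'val': [1.0, 2.0, 3.0],
})

# observed=False: one group per category, including the empty 'c' (its mean is NaN).
all_groups = df.groupby('cat', observed=False)['val'].mean()

# observed=True: only categories that occur in the data.
seen_groups = df.groupby('cat', observed=True)['val'].mean()

print(all_groups)
print(seen_groups)
```

When grouping on several categorical columns at once, observed=False produces the full cross product of categories, which is where the extra groups tend to become expensive.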