Skip to content

Categorical

https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a

benefit

  • reduce memory usage

  • runtime performance optimization

  • library integrations

convert

fruit_cat = df['fruit'].astype('category')
my_cat = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_cat_2 = pd.Categorical.from_codes(codes, categories, ordered=True)
my_cat_2 = my_cats_2.as_ordered() #change to ordered

method

as_ordered, as_unordered, rename_categories, set_categories, add_categories, remove_categories, remove_unused_categories, reorder_categories

my_cat.cat.codes
my_cat.cat.categories
my_cat.cat.set_categories(['a','b','c','d'])
#remove unobserved categories
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3.cat.remove_unused_categories()

operating on categorical columns

category can be much faster

df['str'].str.upper()
df['cat'].str.upper() #but became string type again
df['cat'].cat.rename_categories(str.upper) #even faster and still cat type
df['cat'].dtype.categories contains the unique categorical values thus can work on these values directly if there are no appropriate cat functions.

merge

merge dfs can lead category columns becoming string type

  • merge(str, cat) => str

  • merge(cat, cat) => str

  • df.astype({'cat': df2['cat'].dtype}).merge(df2, on='cat') => cat

import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    'id': [5, 6],
    'value': pd.Categorical(['b', 'c']),
})
d2 = pd.DataFrame({
    'id': [5, 3, 6],
    'value': pd.Categorical(['a', 'b', 'c']),
})

df = pd.merge(d1, d2, on='id') # cat
df = pd.merge(d1, d2)          # str
df = pd.concat([d1, d2])       # str

groupby

When group on a categorical datatype, by default it will group on every value in the datatype even if it isn't present in the data itself.

Using observed=True to solve the issue: df.groupby('cat', observed=True)['val'].mean()