Parquet¶
read_parquet vs scan_parquet¶
read_parquet will load all data into RAM and cannot apply any optimization at the scan level. scan_parquet is recommended when dealing with larger files.
Read parquet with filters¶
https://github.com/pola-rs/polars/issues/3964
generate random df
read parquet using duckdb
read parquet using polars - a string cannot be used for the datetime filter (polars would cast the column to string); a datetime object must be used
Performance benchmark (parquet with index 40 MB)
file size is similar to the version without category
both the string and category filter columns are in the index
duckdb performance is not sensitive to index or category
pandas has to read all index columns, so it becomes slow
category is the winner
Performance benchmark (parquet without index 40 MB)
file size is similar to the version without category
best to save the parquet file without an index, with category columns, and read it with pandas.
polars
import polars as pl
from datetime import datetime
df = (
    pl
    .scan_parquet('df.parquet')
    .filter(pl.col('dt') >= datetime(2020, 4, 1))
    .filter(pl.col('val').is_not_null())
    .select(['c1', 'c2', 'c4'])
).collect().to_pandas()
print(df.shape)
duckdb