When working with a polars LazyFrame, at some point you may need to read its properties, such as the column names. Be aware that reading properties from a LazyFrame can be expensive, because polars has to resolve the schema first. For more details, see the discussion here: https://github.com/pola-rs/polars/issues/16328

Here are some of the LazyFrame properties affected:

  • LazyFrame.columns
  • LazyFrame.dtypes
  • LazyFrame.schema
  • LazyFrame.width

For example, when you use LazyFrame.columns you get a warning:

PerformanceWarning: Determining the column names of a LazyFrame requires
resolving its schema, which is a potentially expensive operation.
Use `LazyFrame.collect_schema().names()` to get the column names without
this warning.
  d.lazy().columns

However, following the warning's suggestion and switching to the alternative method only silences the warning. The alternative operation is still expensive.

Let's test this with a code example:

import polars as pl

# Build a wide frame: 10,000 columns with one row each.
da = {f'v{i}': [i] for i in range(10000)}
df = pl.DataFrame(da)
lf = df.lazy()

_ = df.columns                  # 1.39 ms ± 45 μs without warning
_ = lf.columns                  # 25.3 ms ± 822 μs with warning
_ = lf.collect_schema().names() # 24.5 ms ± 765 μs without warning

So, where possible, get these properties from the DataFrame rather than from the LazyFrame.

