<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://seanslma.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://seanslma.github.io/" rel="alternate" type="text/html" /><updated>2026-03-02T12:35:53+00:00</updated><id>https://seanslma.github.io/feed.xml</id><title type="html">Learning Faster Python Fast</title><subtitle>Sean&apos;s Blogs</subtitle><author><name>Sean Ma</name></author><entry><title type="html">Polars null operations</title><link href="https://seanslma.github.io/polars-null-operations/" rel="alternate" type="text/html" title="Polars null operations" /><published>2025-10-27T00:00:00+00:00</published><updated>2025-10-27T00:00:00+00:00</updated><id>https://seanslma.github.io/polars-null-operations</id><content type="html" xml:base="https://seanslma.github.io/polars-null-operations/"><![CDATA[<p>In Polars, if any of the columns involved in an operation contains <code class="language-plaintext highlighter-rouge">null</code>, the result of the operation will also be <code class="language-plaintext highlighter-rouge">null</code> for that row. This behavior is consistent with SQL’s <code class="language-plaintext highlighter-rouge">NULL propagation</code> principle.</p>

<p>Let’s create a simple example to demonstrate that:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>

<span class="c1"># Create a DataFrame with some null values
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
    <span class="s">'x'</span><span class="p">:</span> <span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">],</span>  <span class="c1"># 'None' represents null in Polars
</span>    <span class="s">'y'</span><span class="p">:</span> <span class="p">[</span><span class="s">'foo'</span><span class="p">,</span> <span class="s">'bar'</span><span class="p">,</span> <span class="bp">None</span><span class="p">]</span>
<span class="p">})</span>

<span class="c1"># Apply the expression to concatenate 'x' + '_' + 'y'
</span><span class="n">df_result</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'x'</span><span class="p">)</span> <span class="o">+</span> <span class="n">pl</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'_'</span><span class="p">)</span> <span class="o">+</span> <span class="n">pl</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'y'</span><span class="p">)).</span><span class="n">alias</span><span class="p">(</span><span class="s">'v'</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Show the result
</span><span class="k">print</span><span class="p">(</span><span class="n">df_result</span><span class="p">)</span>
</code></pre></div></div>

<p>The output is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shape: (3, 3)
┌──────┬──────┬───────┐
│ x    ┆ y    ┆ v     │
│ ---  ┆ ---  ┆ ---   │
│ str  ┆ str  ┆ str   │
╞══════╪══════╪═══════╡
│ null ┆ foo  ┆ null  │
│ a    ┆ bar  ┆ a_bar │
│ b    ┆ null ┆ null  │
└──────┴──────┴───────┘
</code></pre></div></div>

<p>And here is the simple fix – fill nulls with an empty string first:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Apply the expression to concatenate 'x' + '_' + 'y'
</span><span class="n">df_result</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="p">(</span>
      <span class="n">pl</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'x'</span><span class="p">).</span><span class="n">fill_null</span><span class="p">(</span><span class="s">''</span><span class="p">)</span>
      <span class="o">+</span> <span class="n">pl</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'_'</span><span class="p">)</span>
      <span class="o">+</span> <span class="n">pl</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'y'</span><span class="p">).</span><span class="n">fill_null</span><span class="p">(</span><span class="s">''</span><span class="p">)</span>
    <span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">'v'</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Now the output is correct:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shape: (3, 3)
┌──────┬──────┬───────┐
│ x    ┆ y    ┆ v     │
│ ---  ┆ ---  ┆ ---   │
│ str  ┆ str  ┆ str   │
╞══════╪══════╪═══════╡
│ null ┆ foo  ┆ _foo  │
│ a    ┆ bar  ┆ a_bar │
│ b    ┆ null ┆ b_    │
└──────┴──────┴───────┘
</code></pre></div></div>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Polars" /><category term="NULL" /><summary type="html"><![CDATA[In Polars, if any of the columns involved in an operation contains null, the result of the operation will also be null for that row. This behavior is consistent with SQL’s NULL propagation principle.]]></summary></entry><entry><title type="html">Polars LazyFrame properties are expensive operations</title><link href="https://seanslma.github.io/polars-lazyframe-property/" rel="alternate" type="text/html" title="Polars LazyFrame properties are expensive operations" /><published>2025-09-05T00:00:00+00:00</published><updated>2025-09-05T00:00:00+00:00</updated><id>https://seanslma.github.io/polars-lazyframe-property</id><content type="html" xml:base="https://seanslma.github.io/polars-lazyframe-property/"><![CDATA[<p>When using polars LazyFrame, at some point you might need to get the properties such as the column names of the LazyFrame.
Be aware that getting LazyFrame properties is expensive. For more details, check the discussion here: https://github.com/pola-rs/polars/issues/16328</p>

<p>Here is a list of some of the LazyFrame properties:</p>
<ul>
  <li>LazyFrame.columns</li>
  <li>LazyFrame.dtypes</li>
  <li>LazyFrame.schema</li>
  <li>LazyFrame.width</li>
</ul>

<p>For example, when you use <code class="language-plaintext highlighter-rouge">LazyFrame.columns</code> you will get a warning:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PerformanceWarning: Determining the column names of a LazyFrame requires
resolving its schema, which is a potentially expensive operation.
Use `LazyFrame.collect_schema().names()` to get the column names without
this warning.
  d.lazy().columns
</code></pre></div></div>
<p>However, if you follow the warning’s suggestion and use the alternative method, you only avoid the warning – the alternative operation is just as expensive.</p>

<p>Let’s test it with a code example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="n">da</span> <span class="o">=</span> <span class="p">{</span><span class="sa">f</span><span class="s">'v</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">:[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">)}</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">da</span><span class="p">)</span>
<span class="n">lf</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">lazy</span><span class="p">()</span>

<span class="n">_</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span>                  <span class="c1"># 1.39 ms ± 45 μs without warning
</span><span class="n">_</span> <span class="o">=</span> <span class="n">lf</span><span class="p">.</span><span class="n">columns</span>                  <span class="c1"># 25.3 ms ± 822 μs with warning
</span><span class="n">_</span> <span class="o">=</span> <span class="n">lf</span><span class="p">.</span><span class="n">collect_schema</span><span class="p">().</span><span class="n">names</span><span class="p">()</span> <span class="c1"># 24.5 ms ± 765 μs without warning
</span></code></pre></div></div>
<p>So if possible we should get the properties from the DataFrame not from the LazyFrame.</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Polars" /><category term="LazyFrame" /><summary type="html"><![CDATA[When using polars LazyFrame, at some point you might need to get the properties such as the column names of the LazyFrame. We must be aware that getting LazyFrame properties is expensive. For more details check the discussions here: https://github.com/pola-rs/polars/issues/16328]]></summary></entry><entry><title type="html">Improve VarianceThreshold performance in ML feature selection</title><link href="https://seanslma.github.io/feature-selection-variance-perf/" rel="alternate" type="text/html" title="Improve VarianceThreshold performance in ML feature selection" /><published>2025-08-18T00:00:00+00:00</published><updated>2025-08-18T00:00:00+00:00</updated><id>https://seanslma.github.io/feature-selection-variance-perf</id><content type="html" xml:base="https://seanslma.github.io/feature-selection-variance-perf/"><![CDATA[<p>For Machine Learning feature selection, one of the basic and efficient methods is using feature’s variance to drop features that are almost constant – these features will not provide any useful information for the target prediction.</p>

<h2 id="scikit-learn-implementation-is-slow">scikit-learn implementation is slow</h2>
<p>The <code class="language-plaintext highlighter-rouge">VarianceThreshold</code> class implemented in <code class="language-plaintext highlighter-rouge">scikit-learn</code> is super slow. Here is an example showing how to use it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_selection</span> <span class="kn">import</span> <span class="n">VarianceThreshold</span>
<span class="k">def</span> <span class="nf">sklearn_variance_threshold</span><span class="p">(</span>
    <span class="n">X</span><span class="p">:</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">threshold</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">selector</span> <span class="o">=</span> <span class="n">VarianceThreshold</span><span class="p">(</span><span class="n">threshold</span><span class="o">=</span><span class="n">threshold</span><span class="p">)</span>
    <span class="n">_</span> <span class="o">=</span> <span class="n">selector</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">selector</span><span class="p">.</span><span class="n">get_support</span><span class="p">()]</span>
    <span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>

<h2 id="polars-implementation-is-much-faster">polars implementation is much faster</h2>
<p><code class="language-plaintext highlighter-rouge">polars</code> is a <code class="language-plaintext highlighter-rouge">pandas</code>-equivalent data processing package implemented in <code class="language-plaintext highlighter-rouge">Rust</code>. <code class="language-plaintext highlighter-rouge">Rust</code> is popular for its performance and other great features such as parallelization and memory management.</p>

<p>Here is an implementation using <code class="language-plaintext highlighter-rouge">polars</code> that is about 20x faster.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="k">def</span> <span class="nf">polars_variance_threshold</span><span class="p">(</span>
    <span class="n">X</span><span class="p">:</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">threshold</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">stats</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">select</span><span class="p">([</span>
        <span class="n">pl</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">col</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="n">col</span><span class="p">)</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">X</span><span class="p">.</span><span class="n">columns</span>
    <span class="p">])</span>
    <span class="n">variances</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>  <span class="c1"># get variances as a list
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">select</span><span class="p">([</span>
        <span class="n">col</span> <span class="k">for</span> <span class="n">col</span><span class="p">,</span> <span class="n">var</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">columns</span><span class="p">,</span> <span class="n">variances</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">var</span> <span class="o">&gt;</span> <span class="n">threshold</span>
    <span class="p">])</span>
    <span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>

<h2 id="test-it">Test it</h2>
<p>To compare the performance of the two different implementations we can again use the method I created to generate dummy data for testing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="c1"># create dataset
</span><span class="n">df_pandas</span> <span class="o">=</span> <span class="n">create_dummy_df</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">df_pandas</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="n">exclude</span><span class="p">(</span><span class="s">'target'</span><span class="p">))</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'target'</span><span class="p">]</span>

<span class="n">X_sklearn</span> <span class="o">=</span> <span class="n">sklearn_variance_threshold</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">X_polars</span> <span class="o">=</span> <span class="n">polars_variance_threshold</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">X_sklearn</span><span class="p">.</span><span class="n">equals</span><span class="p">(</span><span class="n">X_polars</span><span class="p">)</span>
</code></pre></div></div>]]></content><author><name>Sean Ma</name></author><category term="ML" /><category term="Feature Selection" /><summary type="html"><![CDATA[For Machine Learning feature selection, one of the basic and efficient methods is using feature’s variance to drop features that are almost constant – these features will not provide any useful information for the target prediction.]]></summary></entry><entry><title type="html">Reduce a python app run time from two hours to 20 seconds</title><link href="https://seanslma.github.io/python-groupby-perf/" rel="alternate" type="text/html" title="Reduce a python app run time from two hours to 20 seconds" /><published>2025-07-11T00:00:00+00:00</published><updated>2025-07-11T00:00:00+00:00</updated><id>https://seanslma.github.io/python-groupby-perf</id><content type="html" xml:base="https://seanslma.github.io/python-groupby-perf/"><![CDATA[<p>Pandas <code class="language-plaintext highlighter-rouge">df.groupby.apply</code> is too slow for two Dataframes.</p>

<p>We have a Python app that was too slow. It took about two hours to extract product forecast data from a database and merge it with actual records. After some refactoring and optimization, I managed to reduce the run time to less than 20 seconds.</p>

<p>Assume we have some products, each with daily sales revenue – the actual records. We also have daily forecast revenue. The task is to merge the actual and forecast data together.</p>

<p>When there are missing records in the actual data, we consider the actual revenue from that day onward unreliable, so it should be replaced with forecast data.</p>

<p>To finish this task, for each product, we need to first find the last consecutive date in the actual data and then get the forecast data after that date so we can merge the actual and forecast data together.</p>
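<p>For a single product, the merge step can be sketched like this (the data, the column names, and the <code class="language-plaintext highlighter-rouge">last_date</code> computation are simplified stand-ins for illustration):</p>

```python
import pandas as pd

actual = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=3, freq='D'),
    'daily_revenue': [1.0, 2.0, 3.0],
})
forecast = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=6, freq='D'),
    'daily_revenue': [9.0] * 6,
})

# Keep actual rows up to the last consecutive date, append forecast rows after it
last_date = actual['date'].max()  # stand-in for the last-consecutive-date logic
merged = pd.concat(
    [actual, forecast[forecast['date'] > last_date]],
    ignore_index=True,
)
```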

<h2 id="dummy-data-for-testing">Dummy data for testing</h2>
<p>The performance of the different implementations has been tested using some dummy data. I created the dummy data with a function <code class="language-plaintext highlighter-rouge">gen_rand_df</code> that is described in <a href="https://medium.com/@sean.lma/how-to-create-dummy-pandas-dataframes-for-testing-cf03c52878e3">my previous post</a>. I also used a function <code class="language-plaintext highlighter-rouge">explode_date_range</code> from <a href="https://python.plainenglish.io/how-to-explode-date-ranges-in-a-pandas-dataframe-30x-faster-cb76519c7acf">another post of mine</a> to explode date ranges.</p>

<p>Firstly, we create some product info with <code class="language-plaintext highlighter-rouge">product_id</code>, <code class="language-plaintext highlighter-rouge">start_date</code> and <code class="language-plaintext highlighter-rouge">end_date</code> for the actual sales records and expand the date ranges to daily records.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nrow</span> <span class="o">=</span> <span class="mi">5000</span>
<span class="n">d1</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="n">nrow</span><span class="p">,</span>
    <span class="n">str_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="s">'product_id'</span><span class="p">,</span>
        <span class="s">'str_len'</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
        <span class="s">'str_cnt'</span><span class="p">:</span> <span class="n">nrow</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="n">ts_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'start_date'</span><span class="p">,</span> <span class="s">'end_date'</span><span class="p">],</span>
        <span class="s">'start_date'</span><span class="p">:</span> <span class="s">'2025-01-01'</span><span class="p">,</span>
        <span class="s">'end_date'</span><span class="p">:</span> <span class="s">'2035-01-01'</span><span class="p">,</span>
        <span class="s">'freq'</span><span class="p">:</span> <span class="s">'D'</span><span class="p">,</span>
        <span class="s">'random'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df1</span> <span class="o">=</span> <span class="n">explode_date_range</span><span class="p">(</span>
    <span class="n">df</span><span class="o">=</span><span class="n">d1</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'start_date &lt; end_date'</span><span class="p">).</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'product_id'</span><span class="p">]),</span>
    <span class="n">start_date_col</span><span class="o">=</span><span class="s">'start_date'</span><span class="p">,</span>
    <span class="n">end_date_col</span><span class="o">=</span><span class="s">'end_date'</span><span class="p">,</span>
    <span class="n">freq</span><span class="o">=</span><span class="s">'D'</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Secondly, we create some sales forecast info for products that have actual sales data.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d2</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="n">nrow</span><span class="p">,</span>
    <span class="n">str_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="s">'product_id'</span><span class="p">,</span>
        <span class="s">'col_strs'</span><span class="p">:</span> <span class="n">d1</span><span class="p">[</span><span class="s">'product_id'</span><span class="p">].</span><span class="n">unique</span><span class="p">(),</span>
    <span class="p">},</span>
    <span class="n">ts_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'start_date'</span><span class="p">,</span> <span class="s">'end_date'</span><span class="p">],</span>
        <span class="s">'start_date'</span><span class="p">:</span> <span class="s">'2025-01-01'</span><span class="p">,</span>
        <span class="s">'end_date'</span><span class="p">:</span> <span class="s">'2035-01-01'</span><span class="p">,</span>
        <span class="s">'freq'</span><span class="p">:</span> <span class="s">'D'</span><span class="p">,</span>
        <span class="s">'random'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">explode_date_range</span><span class="p">(</span>
    <span class="n">df</span><span class="o">=</span><span class="n">d2</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'start_date &lt; end_date'</span><span class="p">).</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'product_id'</span><span class="p">]),</span>
    <span class="n">start_date_col</span><span class="o">=</span><span class="s">'start_date'</span><span class="p">,</span>
    <span class="n">end_date_col</span><span class="o">=</span><span class="s">'end_date'</span><span class="p">,</span>
    <span class="n">freq</span><span class="o">=</span><span class="s">'D'</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Then, we create some dummy product sales revenue for both the actual and forecast records.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d3</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="nb">max</span><span class="p">(</span><span class="n">df1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">df2</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
    <span class="n">float_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'daily_revenue1'</span><span class="p">,</span> <span class="s">'daily_revenue2'</span><span class="p">],</span>
        <span class="s">'low'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
        <span class="s">'high'</span><span class="p">:</span> <span class="mf">1e3</span><span class="p">,</span>
        <span class="s">'missing_pct'</span><span class="p">:</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
    <span class="p">},</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Finally, we add the sales revenue to the actual and forecast data.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_actual</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df1</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">daily_revenue</span><span class="o">=</span><span class="n">d3</span><span class="p">[</span><span class="s">'daily_revenue1'</span><span class="p">].</span><span class="n">values</span><span class="p">[:</span><span class="n">df1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]])</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">'product_id'</span><span class="p">,</span> <span class="s">'date'</span><span class="p">])</span>
<span class="p">)</span>
<span class="n">df_forecast</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df2</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">daily_revenue</span><span class="o">=</span><span class="n">d3</span><span class="p">[</span><span class="s">'daily_revenue2'</span><span class="p">].</span><span class="n">values</span><span class="p">[:</span><span class="n">df2</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]])</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">'product_id'</span><span class="p">,</span> <span class="s">'date'</span><span class="p">])</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here are the first few lines of the actual sales data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                      daily_revenue
product_id date
P3hLcLj43u 2025-01-26    128.570203
           2025-01-27    499.277862
           2025-01-28    601.498358
</code></pre></div></div>

<h2 id="getting-last-consecutive-date">Getting last consecutive date</h2>
<p>The function used to get the last consecutive date from a date series has been implemented as follows:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_last_consecutive_date</span><span class="p">(</span><span class="n">dates</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">np</span><span class="p">.</span><span class="n">datetime64</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
    <span class="c1"># Empty input
</span>    <span class="k">if</span> <span class="n">dates</span><span class="p">.</span><span class="n">empty</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>

    <span class="n">dates</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span>

    <span class="c1"># Only one unique element in the list
</span>    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">dates</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

    <span class="n">diffs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diff</span><span class="p">(</span><span class="n">dates</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'timedelta64[D]'</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
    <span class="n">last_consecutive_day_index</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">diffs</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">last_consecutive_day_index</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">dates</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># all dates are consecutive
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">dates</span><span class="p">[</span><span class="n">last_consecutive_day_index</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
</code></pre></div></div>

<h2 id="using-pandas-dfgroupbyapply">Using Pandas <code class="language-plaintext highlighter-rouge">df.groupby.apply</code></h2>
<p>As we have to perform the same task for each group of products, naturally we can use Pandas <code class="language-plaintext highlighter-rouge">df.groupby.apply</code>. But this function generally only works on a single DataFrame, while here we have two; one option is to pass the second DataFrame as an extra parameter.</p>
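<p>A minimal sketch of that pattern on toy data (the names here are made up for illustration): extra positional arguments to <code class="language-plaintext highlighter-rouge">.apply()</code> are forwarded to the function.</p>

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})

# The second positional argument to .apply() is passed through to the function
def add_offset(group: pd.DataFrame, offset: int) -> pd.DataFrame:
    return group.assign(x=group['x'] + offset)

out = df.groupby('g', group_keys=False).apply(add_offset, 10)
```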

<p>Here is the implementation:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">keep_records_after_consecutive_dates_v1</span><span class="p">(</span>
    <span class="n">df_forecast</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">df_actual</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">product_id</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">names</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)]</span>
    <span class="n">dates</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'product_id == @product_id &amp; daily_revenue.notna()'</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="s">'date'</span><span class="p">)</span>
    <span class="c1"># Get last consecutive date and filter df_forecast
</span>    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">last_consecutive_date</span> <span class="o">=</span> <span class="n">get_last_consecutive_date</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span>
        <span class="n">df_forecast</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'date &gt; @last_consecutive_date'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df_forecast</span>

<span class="n">df_v1</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df_forecast</span>
    <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">,</span> <span class="n">group_keys</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">keep_records_after_consecutive_dates_v1</span><span class="p">,</span> <span class="n">df_actual</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>The run time is <strong>327 seconds</strong>.</p>

<h2 id="avoiding-repeated-query-and-filtering">Avoiding repeated query and filtering</h2>
<p>Looking at the previous implementation, we can see that the actual sales DataFrame is queried and filtered repeatedly, once per product. That is likely what slows the process down.</p>

<p>Now we do the query for all products and group the product records in advance. Hopefully this will make it much faster.</p>
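<p>The idea in miniature (hypothetical toy data; <code class="language-plaintext highlighter-rouge">dropna</code> stands in for the revenue filter): filter and group once up front, then look up each group cheaply by key instead of re-querying the full DataFrame.</p>

```python
import pandas as pd

df_sales = pd.DataFrame({
    'product_id': ['a', 'a', 'b'],
    'daily_revenue': [1.0, None, 3.0],
})

# Filter and group once, up front
grp = df_sales.dropna(subset=['daily_revenue']).groupby('product_id')

# Cheap per-product lookups afterwards
has_a = 'a' in grp.groups     # membership check without re-filtering
rows_a = grp.get_group('a')   # rows for one product, no repeated query
```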

<p>Here is the updated version:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">keep_records_after_consecutive_dates_v2</span><span class="p">(</span>
    <span class="n">df_forecast</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">df_actual</span><span class="p">:</span> <span class="n">DataFrameGroupBy</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">product_id</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">names</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)]</span>
    <span class="k">if</span> <span class="n">product_id</span> <span class="ow">in</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">groups</span><span class="p">:</span>
        <span class="n">dates</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">get_group</span><span class="p">(</span><span class="n">product_id</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="s">'date'</span><span class="p">)</span>
        <span class="c1"># Get last consecutive date and filter df_forecast
</span>        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="n">last_consecutive_date</span> <span class="o">=</span> <span class="n">get_last_consecutive_date</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span>
            <span class="n">df_forecast</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'date &gt; @last_consecutive_date'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df_forecast</span>

<span class="n">grp_actual</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'daily_revenue.notna()'</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)</span>
<span class="n">df_v2</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df_forecast</span>
    <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">,</span> <span class="n">group_keys</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">keep_records_after_consecutive_dates_v2</span><span class="p">,</span> <span class="n">grp_actual</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Now the run time is <strong>17.6 seconds</strong> — that’s about <strong>18x</strong> faster.</p>

<h2 id="using-a-python-for-loop">Using a Python for-loop</h2>
<p>The <code class="language-plaintext highlighter-rouge">.apply()</code> method often has some overhead compared to a pure Python for-loop, so we now replace <code class="language-plaintext highlighter-rouge">.apply()</code> with a for-loop. At the same time we can remove the index parsing, since each product group is now fetched by its key directly.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">keep_records_after_consecutive_dates_v3</span><span class="p">(</span>
    <span class="n">df_forecast</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">df_actual</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">dates</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="s">'date'</span><span class="p">)</span>
    <span class="c1"># Get last consecutive date and filter df_forecast
</span>    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">last_consecutive_date</span> <span class="o">=</span> <span class="n">get_last_consecutive_date</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span>
        <span class="n">df_forecast</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'date &gt; @last_consecutive_date'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df_forecast</span>

<span class="n">dfs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">grp_actual</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'daily_revenue.notna()'</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)</span>
<span class="n">grp_forecast</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)</span>
<span class="n">product_ids</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">product_id</span> <span class="ow">in</span> <span class="n">product_ids</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">product_id</span> <span class="ow">in</span> <span class="n">grp_actual</span><span class="p">.</span><span class="n">groups</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">keep_records_after_consecutive_dates_v3</span><span class="p">(</span>
            <span class="n">grp_forecast</span><span class="p">.</span><span class="n">get_group</span><span class="p">(</span><span class="n">product_id</span><span class="p">),</span>
            <span class="n">grp_actual</span><span class="p">.</span><span class="n">get_group</span><span class="p">(</span><span class="n">product_id</span><span class="p">),</span>
        <span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">grp_forecast</span><span class="p">.</span><span class="n">get_group</span><span class="p">(</span><span class="n">product_id</span><span class="p">)</span>
    <span class="n">dfs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">df_v3</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">dfs</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p>The run time is <strong>9.8 seconds</strong> — that’s about <strong>1.8x</strong> faster than version #2.</p>

<h2 id="vectorized-process-without-for-loop">Vectorized process without for-loop</h2>
<p>It’s obvious that we can vectorize the calculation of the last consecutive date for all products. Pandas <code class="language-plaintext highlighter-rouge">groupby().apply()</code> on a Series can be very efficient, as it often operates on NumPy arrays internally.</p>
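<p>For instance (toy data, with a simple last-value lambda standing in for <code class="language-plaintext highlighter-rouge">get_last_consecutive_date</code>), applying a scalar-returning function to a grouped Series yields one value per product, ready to be joined back onto another DataFrame:</p>

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['a', 'a', 'b'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05']),
})

# One scalar per group -> a Series indexed by product_id
last_dates = df.groupby('product_id')['date'].apply(lambda s: s.iloc[-1])
```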

<p>We can also avoid the for-loop by using vectorized join and filtering operations. By doing that we don’t need to join small DataFrames for all products using <code class="language-plaintext highlighter-rouge">pd.concat</code>.</p>

<p>The final optimized version is shown below:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">keep_records_after_consecutive_dates_v4</span><span class="p">(</span>
    <span class="n">df_forecast</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">df_actual</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="c1"># Get last consecutive date for each product
</span>    <span class="n">last_consecutive_dates</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">df_actual</span>
        <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'daily_revenue.notna()'</span><span class="p">)</span>
        <span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="s">'date'</span><span class="p">)</span>
        <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)[</span><span class="s">'date'</span><span class="p">]</span>
        <span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">get_last_consecutive_date</span><span class="p">)</span>
        <span class="p">.</span><span class="n">to_frame</span><span class="p">()</span>
        <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'date'</span><span class="p">:</span> <span class="s">'last_consecutive_date'</span><span class="p">})</span>
    <span class="p">)</span>
    <span class="c1"># Filter df_forecast
</span>    <span class="n">df_forecast</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">df_forecast</span>
        <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">last_consecutive_dates</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'product_id'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">)</span>
        <span class="p">.</span><span class="n">fillna</span><span class="p">({</span><span class="s">'last_consecutive_date'</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Timestamp</span><span class="p">.</span><span class="nb">min</span><span class="p">})</span>
        <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'date &gt; last_consecutive_date'</span><span class="p">)</span>
        <span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="s">'last_consecutive_date'</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">df_forecast</span>

<span class="n">df_v4</span> <span class="o">=</span> <span class="n">keep_records_after_consecutive_dates_v4</span><span class="p">(</span><span class="n">df_forecast</span><span class="p">,</span> <span class="n">df_actual</span><span class="p">)</span>
</code></pre></div></div>
<p>The final run time is <strong>2.4 seconds</strong> — that’s about <strong>4x</strong> faster than version #3 and about <strong>130x</strong> faster than the original version #1 (<strong>327 seconds</strong>).</p>

<h2 id="summary">Summary</h2>
<p>By avoiding repeated query and filtering operations and vectorizing others, I made a Python process run 130x faster. I also applied similar optimizations to extracting forecast data from a database. Ultimately, the application’s total execution time was reduced from over two hours to less than 20 seconds.</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Pandas" /><category term="Performance" /><summary type="html"><![CDATA[Pandas df.groupby.apply is too slow for two DataFrames.]]></summary></entry><entry><title type="html">Using Gurobi Python matrix API to reduce problem creation time</title><link href="https://seanslma.github.io/gurobi-matrix-api/" rel="alternate" type="text/html" title="Using Gurobi Python matrix API to reduce problem creation time" /><published>2025-03-17T00:00:00+00:00</published><updated>2025-03-17T00:00:00+00:00</updated><id>https://seanslma.github.io/gurobi-matrix-api</id><content type="html" xml:base="https://seanslma.github.io/gurobi-matrix-api/"><![CDATA[<p>When solving optimization problems in Python, many people choose the popular <code class="language-plaintext highlighter-rouge">pyomo</code> package. However, pyomo is known for its slow performance and some other issues. That’s why <code class="language-plaintext highlighter-rouge">gurobipy</code> has become an attractive alternative.</p>

<p>We know that <code class="language-plaintext highlighter-rouge">gurobipy</code> is much faster at creating problems and makes it easier to interact directly with the Gurobi solver. But did you know that gurobipy also provides a Python matrix API? Here I will demonstrate some matrix API features that will make your code run even faster and be easier to maintain.</p>

<h2 id="how-to-use-gurobipy">How to use gurobipy</h2>
<p>An optimization problem basically contains an objective, some variables and some constraints. So creating a problem means creating decision variables, setting the variable coefficients in the objective, and adding constraints on the variables.</p>

<p>Here I will use a basic example to show the whole process. Assume we have a varying demand over one year at 5-minute intervals. The demand should be met by two generators (g1: integer output with price 2; g2: continuous output with price 4). If there is not enough generation to meet the demand, a penalty proportional to the demand is incurred.</p>

<p>For the example below, the problem creation time is about 8.8 seconds.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># gurobi version: v12.0.0rc1
</span><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.sparse</span> <span class="k">as</span> <span class="n">sp</span>
<span class="kn">import</span> <span class="nn">gurobipy</span> <span class="k">as</span> <span class="n">gp</span>
<span class="kn">from</span> <span class="nn">gurobipy</span> <span class="kn">import</span> <span class="n">GRB</span>

<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>

<span class="c1"># Create a model
</span><span class="n">model</span> <span class="o">=</span> <span class="n">gp</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">'gurobi_test'</span><span class="p">)</span>

<span class="c1"># Number of periods (1 year with 5min intervals)
</span><span class="n">n_period</span> <span class="o">=</span> <span class="mi">365</span> <span class="o">*</span> <span class="mi">288</span>
<span class="n">periods</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">n_period</span><span class="p">))</span>
<span class="n">gen1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">n_period</span><span class="p">)</span>
<span class="n">gen2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">n_period</span><span class="p">)</span>
<span class="n">demand</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">n_period</span><span class="p">)</span>
<span class="n">penalty</span> <span class="o">=</span> <span class="mf">3.5</span> <span class="o">*</span> <span class="n">demand</span>

<span class="c1"># Record time before creating the problem
</span><span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>

<span class="c1"># Create variables
</span><span class="n">v_g1</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">i</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">addVar</span><span class="p">(</span><span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">INTEGER</span><span class="p">,</span> <span class="n">lb</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ub</span><span class="o">=</span><span class="n">gen1</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">}</span>
<span class="n">v_g2</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">i</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">addVar</span><span class="p">(</span><span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">CONTINUOUS</span><span class="p">,</span> <span class="n">lb</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ub</span><span class="o">=</span><span class="n">gen2</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">}</span>
<span class="n">v_on</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">i</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">addVar</span><span class="p">(</span><span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">BINARY</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">}</span>

<span class="c1"># Set objective
</span><span class="n">obj_expr</span> <span class="o">=</span> <span class="n">gp</span><span class="p">.</span><span class="n">quicksum</span><span class="p">(</span>
    <span class="mi">2</span> <span class="o">*</span> <span class="n">v_g1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">v_g2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">penalty</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">v_on</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="n">obj_expr</span><span class="p">,</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>

<span class="c1"># Add constraints
</span><span class="n">r1</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">i</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span>
        <span class="n">v_g1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">v_g2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">demand</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">v_on</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">demand</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">}</span>

<span class="c1"># Update model and print time for creating the problem
</span><span class="n">model</span><span class="p">.</span><span class="n">update</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Time for creating problem: </span><span class="si">{</span><span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s"> seconds'</span><span class="p">)</span>

<span class="c1"># Write problem to file in LP format
</span><span class="n">model</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'c:/test/gurobi_test.lp'</span><span class="p">)</span>

<span class="c1"># Optimize the model
</span><span class="n">model</span><span class="p">.</span><span class="n">optimize</span><span class="p">()</span>

<span class="c1"># Check if the optimization was successful
</span><span class="k">if</span> <span class="n">model</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="n">GRB</span><span class="p">.</span><span class="n">OPTIMAL</span><span class="p">:</span>
    <span class="n">model</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'c:/test/gurobi_test.sol'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Model is optimal.'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Objective value: </span><span class="si">{</span><span class="n">model</span><span class="p">.</span><span class="n">objVal</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">model</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="n">GRB</span><span class="p">.</span><span class="n">INFEASIBLE</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Model is infeasible.'</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">model</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="n">GRB</span><span class="p">.</span><span class="n">UNBOUNDED</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Model is unbounded.'</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Optimization ended with status </span><span class="si">{</span><span class="n">model</span><span class="p">.</span><span class="n">status</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="how-to-use-gurobi-python-matrix-api">How to use Gurobi Python matrix API</h2>
<p>The Gurobi matrix API has a function called <code class="language-plaintext highlighter-rouge">model.addMVar</code>, which takes a <code class="language-plaintext highlighter-rouge">shape</code> parameter. We can use it to add many variables of the same type at once, and then use matrix operations similar to the ones in <code class="language-plaintext highlighter-rouge">numpy</code> to build the objective terms and constraints.</p>

<p>Note that variable bounds and constraint right-hand sides can all be supplied as numpy arrays. Variable bounds can also be updated with a numpy array, and variable solutions can likewise be extracted into a numpy array.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span>
<span class="c1"># Create variables
</span><span class="n">v_g1</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="n">n_period</span><span class="p">,</span> <span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">INTEGER</span><span class="p">,</span> <span class="n">lb</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ub</span><span class="o">=</span><span class="n">gen1</span><span class="p">)</span>
<span class="n">v_g2</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="n">n_period</span><span class="p">,</span> <span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">CONTINUOUS</span><span class="p">,</span> <span class="n">lb</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ub</span><span class="o">=</span><span class="n">gen2</span><span class="p">)</span>
<span class="n">v_on</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="n">n_period</span><span class="p">,</span> <span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">BINARY</span><span class="p">)</span>

<span class="c1"># Set objective
</span><span class="n">obj_expr</span> <span class="o">=</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">v_g1</span> <span class="o">+</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">v_g2</span> <span class="o">+</span> <span class="n">penalty</span> <span class="o">*</span> <span class="n">v_on</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="n">obj_expr</span><span class="p">,</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>

<span class="c1"># Add constraints
</span><span class="n">r1</span> <span class="o">=</span>  <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span>
    <span class="n">v_g1</span> <span class="o">+</span> <span class="n">v_g2</span> <span class="o">+</span> <span class="n">demand</span> <span class="o">*</span> <span class="n">v_on</span> <span class="o">==</span> <span class="n">demand</span>
<span class="p">)</span>
<span class="p">...</span>
</code></pre></div></div>
<p>With the matrix API, problem creation now takes about 1.6 seconds - roughly 5x to 6x faster. At the same time, the code is shorter and cleaner.</p>

<p>If the MVar elements have different coefficients in the constraints, we can use the overloaded matrix-multiplication operator <code class="language-plaintext highlighter-rouge">@</code>:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># c[0]:  x[0] + y[0] &gt;= 10
# c[1]: 2x[1] + y[1] &gt;= 11
</span><span class="n">x</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'x'</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'y'</span><span class="p">)</span>
<span class="n">M</span> <span class="o">=</span> <span class="n">sp</span><span class="p">.</span><span class="n">diags</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">])</span> <span class="c1"># a sparse diagonal matrix
</span><span class="n">c</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span><span class="n">M</span><span class="o">@</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span> <span class="o">&gt;=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">10</span><span class="p">,</span> <span class="mi">11</span><span class="p">]),</span> <span class="n">name</span><span class="o">=</span><span class="s">'c'</span><span class="p">)</span>
</code></pre></div></div>
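<p>As a quick sanity check outside Gurobi, the same sparse diagonal matrix applied to a concrete numpy vector scales each element by its own coefficient (a minimal sketch; <code class="language-plaintext highlighter-rouge">x</code> here is a plain array, not an MVar):</p>

```python
import numpy as np
import scipy.sparse as sp

# The same diagonal matrix used in the constraint above
M = sp.diags([1, 2])          # [[1, 0], [0, 2]]
x = np.array([3.0, 4.0])

# Row i of M @ x is x[i] scaled by its own coefficient
print(M @ x)                  # [3. 8.]
```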

<p>If we need to add a constraint only using one element of the MVar, this can be easily done:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'x'</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addVar</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'y'</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span> <span class="o">&gt;=</span> <span class="mi">99</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="example-of-2d-mvars">Example of 2D MVars</h2>
<p>Assume we have some constraints like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c[0]: x[0,0] + x[0,1] + x[0,2] &gt;= 11
c[1]: x[1,0] + x[1,1] + x[1,2] &gt;= 11
</code></pre></div></div>

<p>These constraints can be added concisely with a 2D MVar:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s">'x'</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">11</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'c'</span><span class="p">)</span>

<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="mi">10</span> <span class="o">*</span> <span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>
</code></pre></div></div>

<p>By default, <code class="language-plaintext highlighter-rouge">MVar.sum()</code> adds up all elements across every axis. With <code class="language-plaintext highlighter-rouge">axis=1</code> it sums along each row, producing one expression per row.</p>
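<p>The <code class="language-plaintext highlighter-rouge">axis</code> semantics mirror numpy’s, which we can confirm with an ordinary array (a small sketch using numpy only):</p>

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

print(a.sum())                   # 15 - every element, all axes
print(a.sum(axis=1))             # [ 3 12] - one total per row
```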

<p>If each row has its own objective coefficient, we can set the objective like this:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coeffs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">])</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="n">coeffs</span> <span class="o">@</span> <span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>
</code></pre></div></div>

<p>If every element of the MVar has its own objective coefficient, we can still use a matrix operation to build the objective:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coeffs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">]])</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">((</span><span class="n">coeffs</span> <span class="o">*</span> <span class="n">x</span><span class="p">).</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>
</code></pre></div></div>
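<p>Both weighted forms reduce to ordinary numpy algebra, which we can verify with a concrete stand-in array for the MVar:</p>

```python
import numpy as np

a = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0]])          # stand-in values for x

# Per-row coefficients: coeffs @ x.sum(axis=1)
row_coeffs = np.array([1, 2])
print(row_coeffs @ a.sum(axis=1))        # 1*3 + 2*6 = 15.0

# Per-element coefficients: (coeffs * x).sum()
elem_coeffs = np.array([[1, 2, 3], [4, 5, 6]])
print((elem_coeffs * a).sum())           # 6 + 30 = 36.0
```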

<h2 id="example-of-shifted-mvars">Example of shifted MVars</h2>
<p>In many cases we need to add constraints related to the difference of variables between two consecutive time points. This can also be done nicely with MVars.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># x[0] - x[-1] &gt;= 0  # x[-1] = 9 is the initial value of x
# x[1] - x[0]  &gt;= 1
# x[2] - x[1]  &gt;= 2
</span><span class="n">S</span> <span class="o">=</span> <span class="n">sp</span><span class="p">.</span><span class="n">diags</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">3</span> <span class="o">-</span> <span class="mi">1</span><span class="p">),</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="nb">format</span><span class="o">=</span><span class="s">'csr'</span><span class="p">)</span>
<span class="c1"># S = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
</span><span class="n">x0</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">9</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">])</span>
<span class="n">rhs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">])</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">S</span><span class="o">@</span><span class="n">x</span> <span class="o">-</span> <span class="n">x0</span> <span class="o">&gt;=</span> <span class="n">rhs</span><span class="p">)</span>
</code></pre></div></div>
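<p>Before handing the shift matrix to Gurobi, we can sanity-check it with plain numpy/scipy - here <code class="language-plaintext highlighter-rouge">x</code> is a concrete candidate solution rather than an MVar:</p>

```python
import numpy as np
import scipy.sparse as sp

n = 3
S = sp.diags(np.ones(n - 1), -1, shape=(n, n), format='csr')
x0 = np.array([9, 0, 0])
rhs = np.array([0, 1, 2])

x = np.array([10.0, 12.0, 15.0])   # a candidate solution
lhs = x - S @ x - x0               # [x[0]-9, x[1]-x[0], x[2]-x[1]]
print(lhs)                         # [1. 2. 3.]
print(all(lhs >= rhs))             # True - constraints satisfied
```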

<h2 id="changing-constraint-coefficients">Changing constraint coefficients</h2>
<p>If we need to update some constraints, for example, changing some variable coefficients, we can use the method <code class="language-plaintext highlighter-rouge">model.chgCoeff</code>:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">chgCoeff</span><span class="p">(</span><span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mf">10.0</span><span class="p">)</span>
</code></pre></div></div>

<p>Unlike some of the other language APIs, the Python API has no <code class="language-plaintext highlighter-rouge">model.chgCoeffs()</code> method, so we can only change one constraint coefficient at a time. Hopefully <code class="language-plaintext highlighter-rouge">model.chgCoeffs()</code> will be added in the future.</p>
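<p>In the meantime, a tiny wrapper can emulate a batch update by looping over <code class="language-plaintext highlighter-rouge">model.chgCoeff</code> (a sketch; the helper name <code class="language-plaintext highlighter-rouge">chg_coeffs</code> is our own, not part of gurobipy):</p>

```python
def chg_coeffs(model, constrs, variables, values):
    # Change one coefficient at a time - the only option in the
    # Python API - but behind a batch-style interface.
    for con, var, val in zip(constrs, variables, values):
        model.chgCoeff(con, var, val)
```

<p>As with other model edits in gurobipy, the changes take effect at the next <code class="language-plaintext highlighter-rouge">model.update()</code> or <code class="language-plaintext highlighter-rouge">model.optimize()</code> call.</p>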

<h2 id="updating-objective">Updating Objective</h2>
<p>Consider a project built in an object-oriented style, where ideally each object would contribute its own objective terms. However, the Python API has no method for adding objective terms to a model incrementally.</p>

<p>There are two workarounds:</p>
<ul>
  <li>get the objective using <code class="language-plaintext highlighter-rouge">getObjective()</code> then add additional terms and set the objective again using <code class="language-plaintext highlighter-rouge">setObjective()</code>, or</li>
  <li>add objective terms from different objects together and then set the objective using <code class="language-plaintext highlighter-rouge">setObjective()</code></li>
</ul>

<p>Here is an example showing how to do it using the second option:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">obj_expr</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">obj_expr</span> <span class="o">+=</span> <span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">obj_expr</span> <span class="o">+=</span> <span class="p">(</span><span class="n">coeffs</span> <span class="o">*</span> <span class="n">y</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="n">obj_expr</span><span class="p">,</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>
</code></pre></div></div>

<p>Note that setting variable and constraint names increases problem creation time. Thus it’s best to do so only for debugging purposes - you can use a flag to enable or disable variable and constraint name setting.</p>
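<p>One way to implement such a flag (a sketch; <code class="language-plaintext highlighter-rouge">make_name</code> is a hypothetical helper, not a gurobipy function):</p>

```python
def make_name(prefix, index, debug=False):
    # With debug=False we return an empty string, letting Gurobi
    # fall back to its cheap default names.
    return f'{prefix}[{index}]' if debug else ''

# e.g. model.addVar(name=make_name('x', i, debug=DEBUG))
```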

<p>More details about the Gurobi Python matrix API can be found in the manual: https://docs.gurobi.com/projects/optimizer/en/current/index.html</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Optimization" /><category term="Gurobi" /><summary type="html"><![CDATA[When doing optimizations using Python, many people choose the popular pyomo package to create optimization problems. However, pyomo is known for its slow performance and some other issues. That’s why gurobipy becomes a more attractive alternative.]]></summary></entry><entry><title type="html">Which orchestration tool is better: Airflow, Prefect, Argo Workflows, or Temporal?</title><link href="https://seanslma.github.io/orchestration-tool/" rel="alternate" type="text/html" title="Which orchestration tool is better: Airflow, Prefect, Argo Workflows, or Temporal?" /><published>2025-03-05T00:00:00+00:00</published><updated>2025-03-05T00:00:00+00:00</updated><id>https://seanslma.github.io/orchestration-tool</id><content type="html" xml:base="https://seanslma.github.io/orchestration-tool/"><![CDATA[<p>Nowadays there are many tools for task orchestration. Some popular ones include Airflow, Prefect, Argo Workflows, and Temporal. Now the question is which tool should I use in my team?</p>

<p>Here I will briefly list the features of the four task orchestration tools. Hopefully this will help you decide which tool is best for your task scheduling.</p>

<h2 id="airflow">Airflow</h2>
<p>Airflow is a popular task scheduling tool. The data workflows in Airflow are defined using Python.</p>

<p>It has a large user base but also has some limitations as it was created earlier than other orchestration tools:</p>
<ul>
  <li>DAGs are not parameterized - you can’t pass parameters into your workflows</li>
  <li>DAGs are static - they can’t automatically create new steps at runtime as needed</li>
  <li>We have to package the entire workflow into one container</li>
</ul>

<h2 id="prefect">Prefect</h2>
<p>Prefect was created to overcome some of the limitations in Airflow, with a strong emphasis on ease of use and deployment, especially for complex DAGs.</p>

<p>Some features of Prefect:</p>
<ul>
  <li>Workflows are defined in Python, parameterized and dynamic</li>
  <li>Can run each step in a container, but the Docker image must be registered with the workflow in Prefect</li>
  <li>Uses state management abstractions that allow for easy retries and failure handling within data workflows</li>
  <li>Has built-in integrations with popular data engineering tools and platforms, such as Dask, DBT, and various cloud services</li>
</ul>

<h2 id="argo-workflows">Argo Workflows</h2>
<p>Argo Workflows is a container-native workflow engine for orchestrating jobs on Kubernetes. It naturally addresses the deployment issue in Airflow and Prefect.</p>
<ul>
  <li>Workflows are defined in YAML</li>
  <li>Every step in a workflow runs in its own container</li>
  <li>Relies on Kubernetes for state management</li>
  <li>Can only run on Kubernetes clusters</li>
</ul>

<h2 id="temporal">Temporal</h2>
<p>While Airflow, Prefect and Argo Workflows focus primarily on data workflow orchestration, Temporal is a more general-purpose workflow tool.</p>
<ul>
  <li>Workflows are defined in the same programming language as the tasks themselves</li>
  <li>Provides robust features for state management, retries, and long-running processes</li>
  <li>Requires more investment in learning - has a steep learning curve</li>
</ul>

<h2 id="summary">Summary</h2>
<p>As we can see, each tool has its own use cases. We need to select the tool that is most suitable for our work.</p>
<ul>
  <li>If our tasks are about data processing, we should probably consider Airflow, Prefect or Argo Workflows.</li>
  <li>If our tasks already run on a Kubernetes cluster, Argo Workflows might be the choice for ease of deployment.</li>
  <li>If our tasks are more general and require high robustness and reliability, Temporal is likely the better fit.</li>
  <li>If our workflows are complex, a tool that uses a programming language instead of YAML files might be more suitable.</li>
  <li>If our workflows are simple and we do not want to invest too much time in learning, an easy-to-use tool might be the best choice.</li>
</ul>]]></content><author><name>Sean Ma</name></author><category term="DevOps" /><category term="Orchestration" /><summary type="html"><![CDATA[Nowadays there are many tools for task orchestration. Some popular ones include Airflow, Prefect, Argo Workflows, and Temporal. Now the question is which tool should I use in my team?]]></summary></entry><entry><title type="html">Make python loops 5x to 10x faster using numba</title><link href="https://seanslma.github.io/numba-perf/" rel="alternate" type="text/html" title="Make python loops 5x to 10x faster using numba" /><published>2024-11-26T00:00:00+00:00</published><updated>2024-11-26T00:00:00+00:00</updated><id>https://seanslma.github.io/numba-perf</id><content type="html" xml:base="https://seanslma.github.io/numba-perf/"><![CDATA[<p>Numba is a just-in-time (JIT) compiler for python that translates python code into highly optimized machine code at runtime. It can significantly improve the performance of numerical computations by enabling high-performance execution of functions, particularly those that make heavy use of numpy arrays.</p>

<p>Here we will first briefly explain key features of numba and when to use it, and then provide an example demonstrating how to accelerate code performance by leveraging various numba features. If you are already familiar with numba, go directly to the third section about the demonstration.</p>

<h2 id="key-features-of-numba">Key features of numba</h2>
<ul>
  <li><strong>JIT compilation</strong>: Numba compiles python functions into machine code, allowing for efficient code generation tailored to specific hardware and data types.</li>
  <li><strong>Numerical acceleration</strong>: Numba is particularly well-suited for numerical computations involving arrays and mathematical operations. It can often achieve performance comparable to compiled languages like C or Fortran.</li>
  <li><strong>Compatibility with numpy</strong>: Numba integrates seamlessly with numpy, accelerating numpy functions and operations.</li>
  <li><strong>Parallel computing</strong>: Numba supports parallel execution on multi-core CPUs and GPUs, enabling us to leverage the power of parallel hardware to speed up computations.</li>
  <li><strong>Custom UDFs</strong>: We can create custom user-defined functions (UDFs) in numba and use them within our python code. These UDFs can be compiled and optimized for performance.</li>
</ul>

<h2 id="when-to-use-and-to-avoid-numba">When to use and to avoid numba</h2>
<p>Numba is particularly well-suited for numerical computations involving arrays and mathematical operations. Here are some specific cases where we should consider using numba:</p>
<ul>
  <li><strong>Array operations</strong>: If our code heavily involves operations on numpy arrays, such as element-wise arithmetic, matrix multiplication, or reductions, numba can significantly accelerate these computations.</li>
  <li><strong>Mathematical functions</strong>: Numba can optimize calls to mathematical functions like <code class="language-plaintext highlighter-rouge">sin</code>, <code class="language-plaintext highlighter-rouge">cos</code>, <code class="language-plaintext highlighter-rouge">exp</code>, and <code class="language-plaintext highlighter-rouge">log</code>, providing a performance boost compared to their python counterparts.</li>
  <li><strong>Custom functions</strong>: If we have custom functions that perform numerical calculations, numba can compile them into machine code for improved efficiency.</li>
  <li><strong>Loops</strong>: Numba can often optimize loops that iterate over arrays or perform numerical calculations within the loop body.</li>
</ul>

<p>However, not all python code can be optimized using numba and thus improve the performance. There are some limitations to consider before using numba:</p>
<ul>
  <li><strong>I/O bound operations</strong>: Numba will not help much with operations that are I/O bound, such as reading/writing files or network operations.</li>
  <li><strong>Dynamic python features</strong>: If our code relies heavily on python’s dynamic features (like modifying functions at runtime), numba may not be suitable, as it works best with statically typed, straightforward code.</li>
  <li><strong>Non-numerical code</strong>: For code that does not involve numerical calculations or array manipulations, other optimization techniques may be more appropriate.</li>
  <li><strong>Numba can introduce overhead</strong>: If we are working with small datasets or functions that run very quickly, the overhead of JIT compilation might outweigh the performance benefits.</li>
</ul>

<p>To determine whether numba is appropriate for our use case, we can:</p>
<ul>
  <li><strong>Profile our code:</strong> Use profiling tools to identify the bottlenecks in our code and see if they involve numerical computations.</li>
  <li><strong>Try numba and measure the performance:</strong> Experiment with numba and compare the performance of our code with and without numba.</li>
  <li><strong>Consider the trade-offs:</strong> Weigh the potential performance benefits against the overhead and limitations of using numba.</li>
</ul>

<p>Overall, if we have numerical or scientific computations that need to be optimized, numba is a powerful tool that can lead to significant performance improvements with minimal code changes.</p>

<h2 id="data-for-testing-demonstration">Data for the demonstration</h2>
<p>Let’s create a 2D numpy array filled with randomly generated data. Each row represents a scenario, and we will calculate the distance between every pair of scenarios.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">11</span><span class="p">)</span>
<span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="initial-version">Initial version</h2>
<p>We calculate the distance between two scenarios using the 1-norm, which measures the sum of the absolute differences between corresponding elements.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">calculate_distances1</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">n</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">dist_arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
            <span class="n">v</span> <span class="o">=</span> <span class="mf">0.0</span>
            <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
                 <span class="n">v</span> <span class="o">+=</span> <span class="nb">abs</span><span class="p">(</span><span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">k</span><span class="p">]</span> <span class="o">-</span> <span class="n">arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">k</span><span class="p">])</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
    <span class="k">return</span> <span class="n">dist_arr</span>
<span class="c1"># 2.68 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>The run time is about 2.68 seconds for 100 scenarios. Since the pairwise loop scales quadratically with the number of scenarios, 1000 scenarios would take roughly 100 times longer - about 268 seconds. That is too slow and we must improve the performance.</p>

<h2 id="using-numpy-function">Using numpy function</h2>
<p>Here we update the code to calculate the 1-norm using the numpy function <code class="language-plaintext highlighter-rouge">np.linalg.norm()</code>.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">calculate_distances2</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">dist_arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">arr</span><span class="p">[</span><span class="n">j</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">dist_arr</span>
<span class="c1"># 40.7 ms ± 1.50 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>Now the run time is 40.7 ms - about <code class="language-plaintext highlighter-rouge">65x</code> faster! As the numpy function is implemented in C, it is no surprise that the performance improved significantly.</p>

<h2 id="using-numbanjit">Using numba.njit</h2>
<p>Can we improve the performance further? Yes, with numba we definitely can.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">numba</span> <span class="kn">import</span> <span class="n">njit</span>
<span class="o">@</span><span class="n">njit</span><span class="p">(</span><span class="n">cache</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_distances3</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="p">...</span>
<span class="c1"># 10.9 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>Simply applying numba to the numpy function already yields a <code class="language-plaintext highlighter-rouge">4x</code> performance improvement.</p>

<p>The numba <code class="language-plaintext highlighter-rouge">njit</code> decorator is used to compile the python function to optimized machine code in nopython mode. We can also use the <code class="language-plaintext highlighter-rouge">jit</code> decorator, which allows the function to fall back to the original python implementation if numba cannot compile it.</p>

<p>When we set <code class="language-plaintext highlighter-rouge">cache=True</code>, numba stores the compiled function in a cache on disk. So the next time we execute the script, it can load the precompiled function, avoiding the overhead of recompilation.</p>

<h2 id="using-numbanjit-with-data-types">Using numba.njit with data types</h2>
<p>Can we do better? Yes, by adding a numba data type signature.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">njit</span><span class="p">(</span><span class="s">'float64[:,::1](float64[:,::1])'</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_distances4</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="p">...</span>
<span class="c1"># 10.5 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>We explicitly set the data types of the input parameters and the output. In this case, there is only a minor performance improvement, most likely because numba can already infer the data types without an explicit signature. More details about numba data type signatures can be found in the numba documentation (see the References section).</p>

<p>In general, by specifying data types, numba can generate more efficient machine code. Knowing the exact types allows it to optimize the generated code for those types, leading to faster execution and better memory management.</p>

<h2 id="replacing-numpy-function-with-a-python-loop">Replacing numpy function with a python loop</h2>
<p>As numba is good at loops, here we replace the numpy function with a <code class="language-plaintext highlighter-rouge">python loop</code> to further boost performance.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">njit</span><span class="p">(</span><span class="s">'float64[:,::1](float64[:,::1])'</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_distances5</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">n</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">dist_arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
            <span class="n">v</span> <span class="o">=</span> <span class="mf">0.0</span>
            <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
                 <span class="n">v</span> <span class="o">+=</span> <span class="nb">abs</span><span class="p">(</span><span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">k</span><span class="p">]</span> <span class="o">-</span> <span class="n">arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">k</span><span class="p">])</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
    <span class="k">return</span> <span class="n">dist_arr</span>
<span class="c1"># 8.20 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>Numba is indeed good for loops. There is a <code class="language-plaintext highlighter-rouge">1.2x</code> performance improvement now, and it’s about <code class="language-plaintext highlighter-rouge">5x</code> faster than the numpy version.</p>

<h2 id="using-numbanjit-parallel-mode">Using numba.njit parallel mode</h2>
<p>Modern computers often have multiple cores. By leveraging parallel computing, we can significantly reduce execution time.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">numba</span> <span class="kn">import</span> <span class="n">njit</span><span class="p">,</span> <span class="n">prange</span>
<span class="o">@</span><span class="n">njit</span><span class="p">(</span><span class="s">'float64[:,::1](float64[:,::1])'</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">parallel</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">nogil</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_distances6</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">n</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">dist_arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">prange</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
            <span class="n">v</span> <span class="o">=</span> <span class="mf">0.0</span>
            <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
                 <span class="n">v</span> <span class="o">+=</span> <span class="nb">abs</span><span class="p">(</span><span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">k</span><span class="p">]</span> <span class="o">-</span> <span class="n">arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">k</span><span class="p">])</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
    <span class="k">return</span> <span class="n">dist_arr</span>
<span class="c1"># 3.68 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>Here we update the code to use numba <code class="language-plaintext highlighter-rouge">parallel mode</code> with the help of the <code class="language-plaintext highlighter-rouge">prange</code> function.</p>

<p>By setting <code class="language-plaintext highlighter-rouge">parallel=True</code>, numba’s JIT compiler will analyze the function’s code and automatically identify opportunities for parallelization, especially within loops. However, using <code class="language-plaintext highlighter-rouge">prange</code> provides more explicit control over parallelization and can be more effective in certain cases.</p>

<p>Finally, the run time is 3.68 ms (on 4 CPU cores). That is about <code class="language-plaintext highlighter-rouge">10x</code> faster than the numpy function version without numba.njit (40.7 ms), and about <code class="language-plaintext highlighter-rouge">700x</code> faster than the raw python code (2.68 seconds).</p>

<h2 id="references">References</h2>
<ul>
  <li><a href="https://numba.pydata.org/numba-doc/dev/reference/types.html">Numba data type signature</a></li>
  <li><a href="https://stackoverflow.com/questions/66205186/python-signature-with-numba">Numba data type signature caveats</a></li>
  <li><a href="https://pythonspeed.com/articles/slow-numba">Optimizing python loops using numba</a></li>
</ul>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Numba" /><category term="Performance" /><summary type="html"><![CDATA[Numba is a just-in-time (JIT) compiler for python that translates python code into highly optimized machine code at runtime. It can significantly improve the performance of numerical computations by enabling high-performance execution of functions, particularly those that make heavy use of numpy arrays.]]></summary></entry><entry><title type="html">The pandas function pd.read_sql returns an empty DataFrame without correct data types</title><link href="https://seanslma.github.io/read-sql-data-type/" rel="alternate" type="text/html" title="The pandas function pd.read_sql returns an empty DataFrame without correct data types" /><published>2024-09-06T00:00:00+00:00</published><updated>2024-09-06T00:00:00+00:00</updated><id>https://seanslma.github.io/read-sql-data-type</id><content type="html" xml:base="https://seanslma.github.io/read-sql-data-type/"><![CDATA[<p>Here we provide a solution to an issue you might run into.</p>

<p>When querying data from databases such as MS SQL Server via the Driver <code class="language-plaintext highlighter-rouge">pyodbc</code>, we can conveniently get the data as a pandas DataFrame by using <code class="language-plaintext highlighter-rouge">pd.read_sql</code>. Generally, the driver provides information about the column names, data types, and other metadata associated with the result set. However, when the query result set is empty, the data type information is not available and pandas returns an <code class="language-plaintext highlighter-rouge">empty</code> DataFrame with all column types as <code class="language-plaintext highlighter-rouge">object</code>.</p>

<p>An empty DataFrame with wrong data types can cause issues in your Python code. If you do not check whether the returned DataFrame is empty, your code can crash in many cases, such as when extracting the year from a datetime column or aggregating float columns. Here we explain how to get the data type information in this situation when using <code class="language-plaintext highlighter-rouge">sqlalchemy</code> to create the query.</p>
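<p>A minimal demonstration of the problem, using plain empty DataFrames (no database needed):</p>

```python
import pandas as pd

# An empty frame read back without dtype information: the column is object
df = pd.DataFrame({'start_date': pd.Series([], dtype='object')})
try:
    df['start_date'].dt.year  # .dt only works on datetime-like columns
except AttributeError as e:
    print('crashed:', e)

# The same empty frame with the correct dtype works fine
df = pd.DataFrame({'start_date': pd.Series([], dtype='datetime64[ns]')})
print(df['start_date'].dt.year.tolist())  # []
```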

<h2 id="get-the-data-types-from-the-query-statement">Get the data types from the query statement</h2>
<p>There are multiple approaches to get the data types, but each of them is only available in certain situations.</p>

<h3 id="option-1-resultcursordescription">Option 1: <code class="language-plaintext highlighter-rouge">result.cursor.description</code></h3>
<p>The first approach is using the <code class="language-plaintext highlighter-rouge">.description</code> attribute of the database <code class="language-plaintext highlighter-rouge">cursor</code> object:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sqlalchemy</span>

<span class="n">conn_string</span> <span class="o">=</span> <span class="sa">f</span><span class="s">'mssql+pyodbc://</span><span class="si">{</span><span class="n">username</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">pwd</span><span class="si">}</span><span class="s">@</span><span class="si">{</span><span class="n">server_name</span><span class="si">}</span><span class="s">'</span>
<span class="n">conn_string</span> <span class="o">+=</span> <span class="sa">f</span><span class="s">'/</span><span class="si">{</span><span class="n">database_name</span><span class="si">}</span><span class="s">?driver=ODBC+Driver+17+for+SQL+Server'</span>
<span class="n">engine</span> <span class="o">=</span> <span class="n">sqlalchemy</span><span class="p">.</span><span class="n">create_engine</span><span class="p">(</span><span class="n">conn_string</span><span class="p">)</span>
<span class="n">connection</span> <span class="o">=</span> <span class="n">engine</span><span class="p">.</span><span class="n">connect</span><span class="p">()</span>

<span class="n">query</span> <span class="o">=</span> <span class="s">'SELECT ID, Name, Price, StartDate FROM sales.Product;'</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">connection</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">cursor</span><span class="p">.</span><span class="n">description</span><span class="p">)</span>
</code></pre></div></div>

<p>The output will be something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(
    ('ID', &lt;class 'int'&gt;, None, 10, 10, 0, False),
    ('Name', &lt;class 'str'&gt;, None, 50, 50, 0, False),
    ('Price', &lt;class 'decimal.Decimal'&gt;, None, 19, 19, 4, False),
    ('StartDate', &lt;class 'datetime.datetime'&gt;, None, 23, 23, 3, False),
)
</code></pre></div></div>
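<p>The second element of each tuple in the description is the Python type of the column, so a column-to-type mapping can be built from it. A sketch using a hard-coded description tuple in place of a live cursor:</p>

```python
import datetime
import decimal

# Hypothetical description tuple, shaped like pyodbc's cursor.description:
# (name, type_code, display_size, internal_size, precision, scale, null_ok)
description = (
    ('ID', int, None, 10, 10, 0, False),
    ('Name', str, None, 50, 50, 0, False),
    ('Price', decimal.Decimal, None, 19, 19, 4, False),
    ('StartDate', datetime.datetime, None, 23, 23, 3, False),
)

# Map each column name to its Python type
col_types = {name: typ for name, typ, *_ in description}
print(col_types['Price'])  # <class 'decimal.Decimal'>
```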

<h3 id="option-2-querystatementselected_columns">Option 2: <code class="language-plaintext highlighter-rouge">query.statement.selected_columns</code></h3>
<p>If the first approach does not work and you use <code class="language-plaintext highlighter-rouge">sqlalchemy</code> to create the query, you should still be able to get the data types.</p>

<p>Assume we defined the <code class="language-plaintext highlighter-rouge">sales.Product</code> Table as:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sqlalchemy.orm</span> <span class="kn">import</span> <span class="n">declarative_base</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span><span class="p">,</span> <span class="n">Column</span><span class="p">,</span> <span class="n">DateTime</span><span class="p">,</span> <span class="n">DECIMAL</span><span class="p">,</span> <span class="n">Integer</span><span class="p">,</span> <span class="n">Unicode</span>

<span class="c1"># Base class that holds the metadata about the tables
</span><span class="n">Base</span> <span class="o">=</span> <span class="n">declarative_base</span><span class="p">()</span>

<span class="c1"># A declarative class for Table `sales.Product` by inheriting from the Base class
</span><span class="k">class</span> <span class="nc">Product</span><span class="p">(</span><span class="n">Base</span><span class="p">):</span>
    <span class="n">__tablename__</span> <span class="o">=</span> <span class="s">'Product'</span>
    <span class="n">__table_args__</span> <span class="o">=</span> <span class="p">{</span><span class="s">'schema'</span><span class="p">:</span> <span class="s">'sales'</span><span class="p">}</span>

    <span class="n">ID</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">Integer</span><span class="p">,</span> <span class="n">primary_key</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">Name</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">Unicode</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="n">nullable</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">Price</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">DECIMAL</span><span class="p">(</span><span class="mi">19</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
    <span class="n">StartDate</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">DateTime</span><span class="p">)</span>


<span class="c1"># Create a SQLAlchemy engine
</span><span class="n">engine</span> <span class="o">=</span> <span class="n">create_engine</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">database_connection_url</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>

<span class="c1"># Create tables in the database
</span><span class="n">Base</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">create_all</span><span class="p">(</span><span class="n">engine</span><span class="p">)</span>
</code></pre></div></div>

<p>And we created the query in this way:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">sql</span><span class="p">,</span> <span class="n">types</span>
<span class="kn">from</span> <span class="nn">sqlalchemy.orm</span> <span class="kn">import</span> <span class="n">Session</span>

<span class="n">session</span> <span class="o">=</span> <span class="n">Session</span><span class="p">(</span><span class="n">bind</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">sp</span> <span class="o">=</span> <span class="n">Product</span>
<span class="n">query</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">query</span><span class="p">(</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">ID</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'id'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">Name</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'name'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">Price</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'price'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">StartDate</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'start_date'</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Finally we can get the data types from the query:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">sqlalchemy</span><span class="p">.</span><span class="n">orm</span><span class="p">.</span><span class="n">query</span><span class="p">.</span><span class="n">Query</span><span class="p">):</span>
    <span class="n">dtype</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">c</span><span class="p">.</span><span class="n">name</span><span class="p">:</span> <span class="n">c</span><span class="p">.</span><span class="nb">type</span><span class="p">.</span><span class="n">__class__</span><span class="p">.</span><span class="n">__name__</span>
        <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">query</span><span class="p">.</span><span class="n">statement</span><span class="p">.</span><span class="n">selected_columns</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">dtype</code> is a dictionary with column names as keys and data types as values:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dtype</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'id'</span><span class="p">:</span> <span class="s">'Integer'</span><span class="p">,</span>
    <span class="s">'name'</span><span class="p">:</span> <span class="s">'Unicode'</span><span class="p">,</span>
    <span class="s">'price'</span><span class="p">:</span> <span class="s">'DECIMAL'</span><span class="p">,</span>
    <span class="s">'start_date'</span><span class="p">:</span> <span class="s">'DateTime'</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note that the items in <code class="language-plaintext highlighter-rouge">query.statement.selected_columns</code> can have different types, such as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- &lt;class 'sqlalchemy.sql.elements.Label'&gt;
- &lt;class 'sqlalchemy.sql.elements.Cast'&gt;
- &lt;class 'sqlalchemy.sql.annotation.AnnotatedColumn'&gt;
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">Cast</code> class is from the <code class="language-plaintext highlighter-rouge">sql.func.cast</code> function:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">query</span><span class="p">(</span>
    <span class="n">sql</span><span class="p">.</span><span class="n">func</span><span class="p">.</span><span class="n">cast</span><span class="p">(</span><span class="n">sp</span><span class="p">.</span><span class="n">Price</span><span class="p">,</span> <span class="n">types</span><span class="p">.</span><span class="n">Float</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>However, the <code class="language-plaintext highlighter-rouge">Cast</code> class does not have the <code class="language-plaintext highlighter-rouge">name</code> property. To fix the issue we have to convert the <code class="language-plaintext highlighter-rouge">Cast</code> column to a <code class="language-plaintext highlighter-rouge">Label</code> column:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">query</span><span class="p">(</span>
    <span class="n">sql</span><span class="p">.</span><span class="n">func</span><span class="p">.</span><span class="n">cast</span><span class="p">(</span><span class="n">sp</span><span class="p">.</span><span class="n">Price</span><span class="p">,</span> <span class="n">types</span><span class="p">.</span><span class="n">Float</span><span class="p">).</span><span class="n">label</span><span class="p">(</span><span class="s">'price'</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="pass-the-data-type-information-to-pdread_sql">Pass the data type information to <code class="language-plaintext highlighter-rouge">pd.read_sql</code></h2>
<p>There is a parameter <code class="language-plaintext highlighter-rouge">dtype</code> in <code class="language-plaintext highlighter-rouge">pandas.read_sql(..., dtype=None)</code> that can be used to pass the data types for the query results.</p>

<p>Note that in the previous section the extracted data types are the types defined in <code class="language-plaintext highlighter-rouge">sqlalchemy</code>. We need to convert them to the types that can be used in <code class="language-plaintext highlighter-rouge">pandas</code>. Here we provide a mapping for most of the data types:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sa_to_pd_dtype</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'BigInteger'</span><span class="p">:</span> <span class="s">'int64'</span><span class="p">,</span>
    <span class="s">'BIT'</span><span class="p">:</span> <span class="s">'bool'</span><span class="p">,</span>
    <span class="s">'Boolean'</span><span class="p">:</span> <span class="s">'bool'</span><span class="p">,</span>
    <span class="s">'Date'</span><span class="p">:</span> <span class="s">'datetime64[ns]'</span><span class="p">,</span>
    <span class="s">'DateTime'</span><span class="p">:</span> <span class="s">'datetime64[ns]'</span><span class="p">,</span>
    <span class="s">'DECIMAL'</span><span class="p">:</span> <span class="s">'float'</span><span class="p">,</span>
    <span class="s">'Enum'</span><span class="p">:</span> <span class="s">'category'</span><span class="p">,</span>
    <span class="s">'Float'</span><span class="p">:</span> <span class="s">'float'</span><span class="p">,</span>
    <span class="s">'Integer'</span><span class="p">:</span> <span class="s">'int64'</span><span class="p">,</span>
    <span class="s">'Interval'</span><span class="p">:</span> <span class="s">'timedelta64'</span><span class="p">,</span>
    <span class="s">'LargeBinary'</span><span class="p">:</span> <span class="s">'str'</span><span class="p">,</span>
    <span class="s">'Numeric'</span><span class="p">:</span> <span class="s">'float'</span><span class="p">,</span>
    <span class="s">'SmallInteger'</span><span class="p">:</span> <span class="s">'int16'</span><span class="p">,</span>
    <span class="s">'String'</span><span class="p">:</span> <span class="s">'str'</span><span class="p">,</span>
    <span class="s">'Time'</span><span class="p">:</span> <span class="s">'datetime64[ns]'</span><span class="p">,</span>
    <span class="s">'TIMESTAMP'</span><span class="p">:</span> <span class="s">'datetime64[ns]'</span><span class="p">,</span>
    <span class="s">'Unicode'</span><span class="p">:</span> <span class="s">'str'</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And we set the data types when extracting the data using <code class="language-plaintext highlighter-rouge">pd.read_sql</code>:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dtype</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">col</span><span class="p">:</span> <span class="n">sa_to_pd_dtype</span><span class="p">[</span><span class="n">typ</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">col</span><span class="p">,</span> <span class="n">typ</span> <span class="ow">in</span> <span class="n">dtype</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">typ</span> <span class="ow">in</span> <span class="n">sa_to_pd_dtype</span>
<span class="p">}</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql</span><span class="p">(...,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span><span class="p">)</span>
</code></pre></div></div>
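<p>What the <code class="language-plaintext highlighter-rouge">dtype</code> parameter achieves for an empty result set can be sketched without a database, by building an empty DataFrame with the mapped pandas types:</p>

```python
import pandas as pd

# Hypothetical mapped dtypes, as produced by the sa_to_pd_dtype lookup above
dtype = {'id': 'int64', 'price': 'float', 'start_date': 'datetime64[ns]'}

# An empty frame built with these dtypes keeps the correct column types,
# just like pd.read_sql(..., dtype=dtype) does for an empty query result
df = pd.DataFrame({col: pd.Series([], dtype=typ) for col, typ in dtype.items()})
print(df.dtypes.astype(str).to_dict())
```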

<h2 id="why-did-i-get-nulltype-for-some-data-columns">Why did I get <code class="language-plaintext highlighter-rouge">NullType</code> for some data columns?</h2>
<p>Assume the previous query has been changed to:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">query</span><span class="p">(</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">ID</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'id'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">Name</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'name'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">Price</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'price'</span><span class="p">),</span>
    <span class="n">sql</span><span class="p">.</span><span class="n">func</span><span class="p">.</span><span class="n">dateadd</span><span class="p">(</span>
        <span class="n">sql</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="s">'day'</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">sp</span><span class="p">.</span><span class="n">StartDate</span>
    <span class="p">).</span><span class="n">label</span><span class="p">(</span><span class="s">'actual_start_date'</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>
<p>In this case, the data type for the column <code class="language-plaintext highlighter-rouge">actual_start_date</code> will be <code class="language-plaintext highlighter-rouge">NullType</code> instead of <code class="language-plaintext highlighter-rouge">DateTime</code>.</p>

<p>By digging into the <code class="language-plaintext highlighter-rouge">sqlalchemy</code> documentation, we find that this is caused by <code class="language-plaintext highlighter-rouge">sql.func.dateadd</code>: for functions that SQLAlchemy does not recognize, the return type defaults to <code class="language-plaintext highlighter-rouge">NullType</code>. Other functions, such as <code class="language-plaintext highlighter-rouge">sql.func.rtrim</code>, <code class="language-plaintext highlighter-rouge">sql.func.replace</code>, <code class="language-plaintext highlighter-rouge">sql.func.year</code>, <code class="language-plaintext highlighter-rouge">sql.func.avg</code> and <code class="language-plaintext highlighter-rouge">sql.func.round</code>, can also lead to <code class="language-plaintext highlighter-rouge">NullType</code>.</p>

<p>To fix the issue, we need to pass the return type directly to the function via the <code class="language-plaintext highlighter-rouge">type_</code> parameter:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sql</span><span class="p">.</span><span class="n">func</span><span class="p">.</span><span class="n">dateadd</span><span class="p">(</span>
    <span class="n">sql</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="s">'day'</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">sp</span><span class="p">.</span><span class="n">StartDate</span><span class="p">,</span> <span class="n">type_</span><span class="o">=</span><span class="n">types</span><span class="p">.</span><span class="n">DateTime</span>
<span class="p">).</span><span class="n">label</span><span class="p">(</span><span class="s">'actual_start_date'</span><span class="p">)</span>
</code></pre></div></div>
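<p>This default-to-<code class="language-plaintext highlighter-rouge">NullType</code> behavior can be checked directly on the expression objects, without running a query. Below is a minimal sketch; the standalone <code class="language-plaintext highlighter-rouge">Column</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">sp.StartDate</code>:</p>

```python
from sqlalchemy import Column, DateTime, text, types
from sqlalchemy.sql import func

# Hypothetical standalone column standing in for sp.StartDate
start_date = Column('StartDate', DateTime)

# dateadd is not a function SQLAlchemy knows, so its type defaults to NullType
expr = func.dateadd(text('day'), 1, start_date)
print(type(expr.type).__name__)  # NullType

# Passing type_ explicitly gives the expression the intended return type
typed_expr = func.dateadd(text('day'), 1, start_date, type_=types.DateTime)
print(type(typed_expr.type).__name__)  # DateTime
```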

<h2 id="reference">Reference</h2>

<ul>
  <li><a href="https://stackoverflow.com/questions/64761911/sqlalchemy-accessing-column-types-from-query-results">SQLAlchemy accessing column types from query results</a></li>
  <li><a href="https://stackoverflow.com/questions/2258072/sqlalchemy-getting-column-data-types-of-query-results">SQLAlchemy getting column data types of query results</a></li>
  <li><a href="https://docs.sqlalchemy.org/en/gerrit/3941/core/functions.html">SQL and Generic Functions</a></li>
</ul>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="SQL" /><category term="Data Type" /><category term="Pandas" /><summary type="html"><![CDATA[We provide a solution to an issue you might encounter.]]></summary></entry><entry><title type="html">Read CSV files 10x to 40x faster using pyarrow and polars</title><link href="https://seanslma.github.io/read-csv-perf/" rel="alternate" type="text/html" title="Read CSV files 10x to 40x faster using pyarrow and polars" /><published>2024-06-13T00:00:00+00:00</published><updated>2024-06-13T00:00:00+00:00</updated><id>https://seanslma.github.io/read-csv-perf</id><content type="html" xml:base="https://seanslma.github.io/read-csv-perf/"><![CDATA[<p>CSV (comma-separated values) files are widely used across many areas. They can be easily exported from almost any programming language, and loaded into any text editor and many other applications. Their main disadvantages, however, are that CSV files are usually larger than files in other formats and that they are slow to load into memory.</p>

<p>Here we compare different options for reading CSV files using the <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">polars</code> and <code class="language-plaintext highlighter-rouge">pyarrow</code> Python packages. We test the loading performance on CSV files, each with a different data type. Based on the test results, we can determine which option to use when we need to read CSV files faster.</p>

<h2 id="creating-test-data">Creating test data</h2>
<p>CSV files with three data types, <code class="language-plaintext highlighter-rouge">string</code>, <code class="language-plaintext highlighter-rouge">float</code>, and <code class="language-plaintext highlighter-rouge">datetime</code>, were used to test file-reading performance. All the test CSV files were created using the scripts in <a href="https://medium.com/@sean.lma/how-to-create-dummy-pandas-dataframes-for-testing-cf03c52878e3">my previous post</a>; each CSV file has 10 million rows, three columns of the same data type, and a size of about 500 MB.</p>

<p>The <code class="language-plaintext highlighter-rouge">string</code> type CSV file was created with:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_str</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="mi">10000000</span><span class="p">,</span>
    <span class="n">str_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'c1'</span><span class="p">,</span> <span class="s">'c2'</span><span class="p">,</span> <span class="s">'c3'</span><span class="p">],</span>
        <span class="s">'str_len'</span><span class="p">:</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">15</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">50</span><span class="p">)],</span>
        <span class="s">'str_count'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1000</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">100</span><span class="p">],</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df_str</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">float</code> type CSV file was created with:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_flt</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="mi">10000000</span><span class="p">,</span>
    <span class="n">float_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'c1'</span><span class="p">,</span> <span class="s">'c2'</span><span class="p">,</span> <span class="s">'c3'</span><span class="p">],</span>
        <span class="s">'low'</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">100</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
        <span class="s">'high'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mf">1e5</span><span class="p">],</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df_flt</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">datetime</code> type CSV file was created with:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_dts</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="mi">10000000</span><span class="p">,</span>
    <span class="n">ts_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'c1'</span><span class="p">,</span> <span class="s">'c2'</span><span class="p">,</span> <span class="s">'c3'</span><span class="p">],</span>
        <span class="s">'start_date'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2020-01-01'</span><span class="p">,</span> <span class="s">'2021-01-01'</span><span class="p">,</span> <span class="s">'2022-01-01'</span><span class="p">],</span>
        <span class="s">'end_date'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2021-01-01'</span><span class="p">,</span> <span class="s">'2022-01-01'</span><span class="p">,</span> <span class="s">'2023-01-01'</span><span class="p">],</span>
        <span class="s">'freq'</span><span class="p">:</span> <span class="s">'s'</span><span class="p">,</span>
        <span class="s">'random'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df_dts</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
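<p><code class="language-plaintext highlighter-rouge">gen_rand_df</code> is defined in the previous post linked above. To give a rough idea of what it does, a minimal stand-in for the <code class="language-plaintext highlighter-rouge">float</code> case might look like the sketch below; the function name and parameter layout here mirror the call above but are otherwise hypothetical:</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def gen_float_df(nrow, names, low, high):
    # One uniformly distributed float column per name, within [low, high)
    return pd.DataFrame({
        name: rng.uniform(lo, hi, nrow)
        for name, lo, hi in zip(names, low, high)
    })

# A tiny version of the float test data (the real files use nrow=10_000_000)
df_flt = gen_float_df(4, ['c1', 'c2', 'c3'], [0, -100, 0], [1, 100, 1e5])
print(df_flt)
```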

<h2 id="reading-csv-files-using-pandas">Reading CSV files using <code class="language-plaintext highlighter-rouge">pandas</code></h2>
<p>In pandas, three parsers are available for reading CSV files (<code class="language-plaintext highlighter-rouge">python</code>, <code class="language-plaintext highlighter-rouge">c</code>, and <code class="language-plaintext highlighter-rouge">pyarrow</code>); the parser is selected via the <code class="language-plaintext highlighter-rouge">engine</code> parameter. There are also two backends (<code class="language-plaintext highlighter-rouge">dtype_backend</code>: <code class="language-plaintext highlighter-rouge">numpy_nullable</code> and <code class="language-plaintext highlighter-rouge">pyarrow</code>) for storing the data. We will check the performance of combinations of the different parsers and backends.</p>

<p>The data types passed to the functions are a dictionary like this: <code class="language-plaintext highlighter-rouge">dtype = {'c1': type, 'c2': type, 'c3': type}</code>.</p>
<ul>
  <li>For <code class="language-plaintext highlighter-rouge">string</code> values the type is <code class="language-plaintext highlighter-rouge">str</code>. Two pyarrow-based string data types are also available: <code class="language-plaintext highlighter-rouge">pd.ArrowDtype(pa.string())</code> (dtype_pa) and <code class="language-plaintext highlighter-rouge">string[pyarrow]</code> (dtype_pa_str2); the latter is pandas&rsquo; nullable string dtype with pyarrow storage.</li>
  <li>For <code class="language-plaintext highlighter-rouge">float</code> values the type is <code class="language-plaintext highlighter-rouge">float</code> and <code class="language-plaintext highlighter-rouge">float64[pyarrow]</code>, for <code class="language-plaintext highlighter-rouge">numpy_nullable</code> and <code class="language-plaintext highlighter-rouge">pyarrow</code> backends respectively.</li>
  <li>For <code class="language-plaintext highlighter-rouge">datetime</code> values the types are <code class="language-plaintext highlighter-rouge">datetime64[s]</code> and <code class="language-plaintext highlighter-rouge">pd.ArrowDtype(pa.timestamp('s'))</code>. Notice that, when using pandas datetime data types such as <code class="language-plaintext highlighter-rouge">datetime64[s]</code>, the datetime columns must be passed to the function separately, while with the <code class="language-plaintext highlighter-rouge">pyarrow</code> data types all columns can be passed to the function in the same format.</li>
</ul>

<p>The following options are tested:</p>
<ul>
  <li>c + numpy_nullable + dtype_str + astype
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_str</span>
<span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">dtype</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>c + numpy_nullable + dtype</p>

    <p>For <code class="language-plaintext highlighter-rouge">string/float</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span>
<span class="p">)</span>
</code></pre></div>    </div>
    <p>For <code class="language-plaintext highlighter-rouge">datetime</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span>
    <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s">'c1'</span><span class="p">,</span><span class="s">'c2'</span><span class="p">,</span><span class="s">'c3'</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>c + pyarrow + dtype</p>

    <p>For <code class="language-plaintext highlighter-rouge">string/float</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span>
<span class="p">)</span>
</code></pre></div>    </div>
    <p>For <code class="language-plaintext highlighter-rouge">datetime</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span>
    <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s">'c1'</span><span class="p">,</span><span class="s">'c2'</span><span class="p">,</span><span class="s">'c3'</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>c + pyarrow + dtype_pa
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_pa</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>pyarrow + numpy_nullable + dtype</p>

    <p>For <code class="language-plaintext highlighter-rouge">string/float</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span>
<span class="p">)</span>
</code></pre></div>    </div>
    <p>For <code class="language-plaintext highlighter-rouge">datetime</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span>
    <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s">'c1'</span><span class="p">,</span><span class="s">'c2'</span><span class="p">,</span><span class="s">'c3'</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow + dtype
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow + string[pyarrow]
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_pa_str2</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow + dtype_pa
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_pa</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow + dtype_pa + to numpy_nullable
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_pa</span>
<span class="p">).</span><span class="n">convert_dtypes</span><span class="p">(</span><span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
</ul>

<p>The performance results for these options are as follows:</p>
<div class="scroll">
  <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                                         str    float  datetime performance_order_for_float
c       + numpy_nullable + dtype_str + astype            3.93s  18.2s  18.5s    10
c       + numpy_nullable + dtype                         3.88s  3.29s  15.4s     6
c       + pyarrow        + dtype                         3.27s  3.55s  16.6s     7
c       + pyarrow        + dtype_pa                      5.17s  16.8s  53.2s     9
pyarrow + numpy_nullable + dtype                         3.50s  0.54s  1.15s     4
pyarrow + pyarrow        + dtype                         7.62s  0.50s  1.67s     3
pyarrow + pyarrow        + string[pyarrow]               4.05s  15.8s  11.1s     8
pyarrow + pyarrow        + dtype_pa                      0.39s  0.48s  0.44s     2
pyarrow + pyarrow        + dtype_pa + to numpy_nullable  2.74s  2.68s  1.64s     5
pyarrow + pyarrow                                        0.48s  0.47s  0.37s     1
</code></pre></div>  </div>
</div>

<p>Based on the test results, we can conclude that:</p>
<ul>
  <li>We can get the best performance when using <code class="language-plaintext highlighter-rouge">pyarrow</code> for the parser, backend and dtype (<code class="language-plaintext highlighter-rouge">pyarrow + pyarrow + dtype_pa</code>).</li>
  <li>The <code class="language-plaintext highlighter-rouge">pyarrow + pyarrow + dtype_pa</code> option is about 10x, 7x, and 35x faster than the default option (<code class="language-plaintext highlighter-rouge">c + numpy_nullable + dtype</code>) for <code class="language-plaintext highlighter-rouge">string</code>, <code class="language-plaintext highlighter-rouge">float</code> and <code class="language-plaintext highlighter-rouge">datetime</code>, respectively.</li>
  <li>Compared to the <code class="language-plaintext highlighter-rouge">c</code> parser, the <code class="language-plaintext highlighter-rouge">pyarrow</code> parser is a little faster for <code class="language-plaintext highlighter-rouge">string</code>, 6x faster for <code class="language-plaintext highlighter-rouge">float</code>, and 10-14x faster for <code class="language-plaintext highlighter-rouge">datetime</code>.</li>
  <li>Using the <code class="language-plaintext highlighter-rouge">pyarrow</code> backend with the <code class="language-plaintext highlighter-rouge">c</code> parser brings no performance improvement; when the <code class="language-plaintext highlighter-rouge">pyarrow</code> dtypes are also used, performance is much worse.</li>
  <li>The <code class="language-plaintext highlighter-rouge">pd.ArrowDtype(pa.string())</code> string data type is about 10x faster than the <code class="language-plaintext highlighter-rouge">string[pyarrow]</code> string data type.</li>
  <li>The <code class="language-plaintext highlighter-rouge">pyarrow</code> parser can automatically determine the data types without any performance loss; this is especially useful when you do not know the data types in the CSV files.</li>
</ul>

<p>We should understand that the <code class="language-plaintext highlighter-rouge">pyarrow</code> parser works in parallel (multi-threaded) mode while the <code class="language-plaintext highlighter-rouge">c</code> parser is single-threaded. Also, converting data from the <code class="language-plaintext highlighter-rouge">numpy_nullable</code> backend to the <code class="language-plaintext highlighter-rouge">pyarrow</code> backend, or vice versa, can be time-consuming.</p>

<h2 id="reading-csv-files-using-polars">Reading CSV files using <code class="language-plaintext highlighter-rouge">polars</code></h2>
<p>The <code class="language-plaintext highlighter-rouge">polars</code> package is relatively new, but it has become popular recently thanks to its speed (vectorized execution) and its memory efficiency (it is built on <code class="language-plaintext highlighter-rouge">arrow</code>). It also offers a clean, concise API and supports lazy evaluation for handling large datasets.</p>

<p>The data types passed to the <code class="language-plaintext highlighter-rouge">polars</code> functions are a dictionary like this: <code class="language-plaintext highlighter-rouge">dtypes = {'c1': dtype, 'c2': dtype, 'c3': dtype}</code>.</p>
<ul>
  <li>For <code class="language-plaintext highlighter-rouge">string</code> values the dtype is <code class="language-plaintext highlighter-rouge">pl.Utf8</code>.</li>
  <li>For <code class="language-plaintext highlighter-rouge">float</code> values the dtype is <code class="language-plaintext highlighter-rouge">pl.Float64</code>.</li>
  <li>For <code class="language-plaintext highlighter-rouge">datetime</code> values the dtype is <code class="language-plaintext highlighter-rouge">pl.Datetime</code>.</li>
</ul>

<p>The following options are tested:</p>
<ul>
  <li>default: without providing the dtypes parameter. Note that if a <code class="language-plaintext highlighter-rouge">float</code> column contains empty values, its type may be inferred as <code class="language-plaintext highlighter-rouge">string</code>, which is less robust than the type inference in <code class="language-plaintext highlighter-rouge">pyarrow.csv</code>.
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="n">pl</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>eager: the default mode, any operations are executed immediately
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>lazy: operations are not executed until you explicitly call the <code class="language-plaintext highlighter-rouge">collect()</code> method
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">).</span><span class="n">collect</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li>streaming: processes the data in batches instead of loading everything at once, which is useful for datasets that might exceed available memory
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">).</span><span class="n">collect</span><span class="p">(</span><span class="n">streaming</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>sql api eager: interact with data using familiar SQL syntax
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">SQLContext</span><span class="p">(</span>
    <span class="n">data</span><span class="o">=</span><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">)</span>
<span class="p">).</span><span class="n">execute</span><span class="p">(</span><span class="s">'select * from data'</span><span class="p">,</span> <span class="n">eager</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>sql api eager + to pandas
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">SQLContext</span><span class="p">(</span>
    <span class="n">data</span><span class="o">=</span><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">)</span>
<span class="p">).</span><span class="n">execute</span><span class="p">(</span>
    <span class="s">'select * from data'</span><span class="p">,</span> <span class="n">eager</span><span class="o">=</span><span class="bp">True</span>
<span class="p">).</span><span class="n">to_pandas</span><span class="p">(</span><span class="n">use_pyarrow_extension_array</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>sql api eager + to pandas pyarrow
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">SQLContext</span><span class="p">(</span>
    <span class="n">data</span><span class="o">=</span><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">)</span>
<span class="p">).</span><span class="n">execute</span><span class="p">(</span>
    <span class="s">'select * from data'</span><span class="p">,</span> <span class="n">eager</span><span class="o">=</span><span class="bp">True</span>
<span class="p">).</span><span class="n">to_pandas</span><span class="p">(</span><span class="n">use_pyarrow_extension_array</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
</ul>

<p>The tested performance results are as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                          str    float  datetime
default                                   0.52s  0.38s  0.37s
eager                                     0.46s  0.40s  0.39s
lazy                                      0.45s  0.38s  0.41s
streaming                                 0.42s  0.40s  0.42s
sql api eager                             0.46s  0.38s  0.40s
sql api eager + to pandas                 1.59s  0.47s  0.48s
sql api eager + to pandas pyarrow         0.99s  0.43s  0.45s
</code></pre></div></div>

<p>The results clearly show that:</p>
<ul>
  <li>The performance is quite consistent for all the options using <code class="language-plaintext highlighter-rouge">polars</code>.</li>
  <li>The <code class="language-plaintext highlighter-rouge">polars</code> CSV reading has a similar performance compared to <code class="language-plaintext highlighter-rouge">pandas</code> with <code class="language-plaintext highlighter-rouge">pyarrow</code>.</li>
  <li>If we need a <code class="language-plaintext highlighter-rouge">numpy_nullable</code> pandas DataFrame, <code class="language-plaintext highlighter-rouge">polars</code> can still be a better option.</li>
</ul>

<h2 id="reading-csv-files-using-pyarrowcsv">Reading CSV files using <code class="language-plaintext highlighter-rouge">pyarrow.csv</code></h2>
<p>The <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> module is the part of the <code class="language-plaintext highlighter-rouge">pyarrow</code> library dedicated to reading and writing CSV files. It processes CSV data efficiently and offers some great features, such as inferring data types during reading.</p>

<p>Here we test the performance of the <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> module with three data types in the format <code class="language-plaintext highlighter-rouge">convert_options = pv.ConvertOptions(column_types={'c1': dtype, 'c2': dtype, 'c3': dtype})</code>.</p>
<ul>
  <li>For <code class="language-plaintext highlighter-rouge">string</code> values the dtype is <code class="language-plaintext highlighter-rouge">pa.string()</code>.</li>
  <li>For <code class="language-plaintext highlighter-rouge">float</code> values the dtype is <code class="language-plaintext highlighter-rouge">pa.float64()</code>.</li>
  <li>For <code class="language-plaintext highlighter-rouge">datetime</code> values the dtype is <code class="language-plaintext highlighter-rouge">pa.timestamp('s')</code>.</li>
</ul>

<p>The following options are tested and compared:</p>
<ul>
  <li>default
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyarrow.csv</span> <span class="k">as</span> <span class="n">pv</span>
<span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>default + to pandas
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">).</span><span class="n">to_pandas</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li>default + to pandas pyarrow
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">).</span><span class="n">to_pandas</span><span class="p">(</span><span class="n">types_mapper</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">ArrowDtype</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>dtype
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">convert_options</span><span class="o">=</span><span class="n">convert_options</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>dtype + to pandas
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">convert_options</span><span class="o">=</span><span class="n">convert_options</span><span class="p">).</span><span class="n">to_pandas</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li>dtype + to pandas pyarrow
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">convert_options</span><span class="o">=</span><span class="n">convert_options</span><span class="p">).</span><span class="n">to_pandas</span><span class="p">(</span><span class="n">types_mapper</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">ArrowDtype</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
</ul>

<p>The performance results for the previous options are shown here:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                               str    float  datetime
default                        0.39s  0.44s  0.38s
default + to pandas            1.07s  0.45s  0.42s
default + to pandas pyarrow    0.48s  0.43s  0.33s
dtype                          0.39s  0.40s  0.36s
dtype   + to pandas            0.99s  0.45s  0.41s
dtype   + to pandas pyarrow    0.39s  0.42s  0.37s
</code></pre></div></div>

<p>From these results we can conclude that:</p>
<ul>
  <li>The <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> module has a similar performance compared to <code class="language-plaintext highlighter-rouge">polars</code>.</li>
  <li>If we need to load CSV files into a <code class="language-plaintext highlighter-rouge">pandas</code> DataFrame, <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> is the fastest option.</li>
</ul>

<h2 id="best-options-from-pandas-polars-and-pyarrow">Best options from <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">polars</code>, and <code class="language-plaintext highlighter-rouge">pyarrow</code></h2>
<p>It is no surprise that all options using <code class="language-plaintext highlighter-rouge">arrow</code> to store data have similar performance for reading CSV files; <code class="language-plaintext highlighter-rouge">polars</code> also uses <code class="language-plaintext highlighter-rouge">arrow</code> to hold the data in memory. The <code class="language-plaintext highlighter-rouge">arrow</code> format is not just faster because it parallelizes reading; it is also more memory efficient.</p>

<p>The <code class="language-plaintext highlighter-rouge">polars</code> package is relatively new compared to <code class="language-plaintext highlighter-rouge">pandas</code>. It has some great new features but might not yet have all the functions we need, so it is entirely up to us to decide which package to use. If we use <code class="language-plaintext highlighter-rouge">polars</code> to do all our data manipulations, I would suggest sticking to <code class="language-plaintext highlighter-rouge">polars</code> for reading CSV files as well.</p>

<p>If <code class="language-plaintext highlighter-rouge">pandas</code> is still our preference, to load CSV files efficiently, we should use the <code class="language-plaintext highlighter-rouge">pyarrow</code> parser, backend, and dtype, or <code class="language-plaintext highlighter-rouge">pyarrow.csv</code>, to improve the performance further. If we also need to use the <code class="language-plaintext highlighter-rouge">numpy_nullable</code> backend, it is best to read CSV files using <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> and then convert the backend to <code class="language-plaintext highlighter-rouge">numpy_nullable</code>.</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="IO" /><category term="Performance" /><summary type="html"><![CDATA[CSV (comma-separated values) files have been widely used in different areas. They can be easily exported from almost all programming languages. They can also be loaded into all text editors and many other applications. However, the main disadvantage is that CSV files are usually larger than files with other formats and it is slow to load them into memory.]]></summary></entry><entry><title type="html">Explode date ranges in a pandas DataFrame 30x faster</title><link href="https://seanslma.github.io/explode-date-range/" rel="alternate" type="text/html" title="Explode date ranges in a pandas DataFrame 30x faster" /><published>2024-04-14T00:00:00+00:00</published><updated>2024-04-14T00:00:00+00:00</updated><id>https://seanslma.github.io/explode-date-range</id><content type="html" xml:base="https://seanslma.github.io/explode-date-range/"><![CDATA[<p>During data analysis, it is very common that we need to convert data from a lower time resolution, such as quarterly or monthly, to a higher one, such as half-hourly.</p>

<p>We can do the conversion easily in Python using pandas. However, we know that the pandas <code class="language-plaintext highlighter-rouge">df.explode</code> function is very slow. Here I will show how we can make this process <strong>30x</strong> faster without using another Python package.</p>

<h2 id="pandas-dataframe-for-testing">Pandas DataFrame for testing</h2>
<p>For testing the code performance, I used the <code class="language-plaintext highlighter-rouge">gen_rand_df</code> function in <a href="https://medium.com/@sean.lma/how-to-create-dummy-pandas-dataframes-for-testing-cf03c52878e3">my previous post</a> to create a dummy pandas DataFrame:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
    <span class="n">str_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'id'</span><span class="p">,</span> <span class="s">'category'</span><span class="p">],</span>
        <span class="s">'str_len'</span><span class="p">:</span> <span class="p">[</span><span class="mi">8</span><span class="p">,</span> <span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">20</span><span class="p">)],</span>
        <span class="s">'str_count'</span><span class="p">:</span> <span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">30</span><span class="p">],</span>
    <span class="p">},</span>
    <span class="n">ts_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'start_date'</span><span class="p">,</span> <span class="s">'end_date'</span><span class="p">],</span>
        <span class="s">'start_date'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2020-01-01'</span><span class="p">,</span> <span class="s">'2023-01-01'</span><span class="p">],</span>
        <span class="s">'end_date'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2023-01-01'</span><span class="p">,</span> <span class="s">'2025-01-01'</span><span class="p">],</span>
        <span class="s">'freq'</span><span class="p">:</span> <span class="s">'MS'</span><span class="p">,</span>
        <span class="s">'random'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="n">float_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'low'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">'high'</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span>
        <span class="s">'missing_pct'</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Here are the first two of the 100 rows in the created DataFrame:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         id    category start_date   end_date        f1         f2
0  8v5KSoKX       jIMki 2020-01-01 2023-07-01  35.20661  76.041564
1  ihXEKLSb  bws6TOEr06 2020-05-01 2023-02-01       NaN  26.725758
</code></pre></div></div>

<h2 id="initial-solution-from-chatgpt-and-google-gemini">Initial solution from ChatGPT and Google Gemini</h2>
<p>We need to explode the date range (from start_date to end_date) of each row in the DataFrame into half-hourly timestamps while keeping all other columns.</p>

<p>To do that, I got solutions from ChatGPT and Google Gemini after a few iterations (they are basically the same):</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">],</span> <span class="n">freq</span><span class="o">=</span><span class="s">'30min'</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span>
<span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">explode</span><span class="p">(</span><span class="s">'ts'</span><span class="p">)</span>
</code></pre></div></div>
<p>And the first two rows of the result DataFrame are:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         id category start_date   end_date        f1         f2                  ts
0  8v5KSoKX    jIMki 2020-01-01 2023-07-01  35.20661  76.041564 2020-01-01 00:00:00
0  8v5KSoKX    jIMki 2020-01-01 2023-07-01  35.20661  76.041564 2020-01-01 00:30:00
</code></pre></div></div>

<p>The solution works but it is very slow. The time for creating the <code class="language-plaintext highlighter-rouge">ts</code> column is <code class="language-plaintext highlighter-rouge">703 ms ± 6.57 ms</code> and exploding is <code class="language-plaintext highlighter-rouge">691 ms ± 7.47 ms</code>, for a DataFrame with only 100 rows.</p>

<p>I tried different prompts to get a faster solution from the AI applications but failed; the solutions were either wrong or raised errors. My suggestion would be to use AI applications only for ideas or a draft solution. The best solution can only be created by a person with some knowledge of the domain.</p>

<h2 id="using-a-for-loop-instead-of-the-dfapply-function">Using a <code class="language-plaintext highlighter-rouge">for-loop</code> instead of the <code class="language-plaintext highlighter-rouge">df.apply</code> function</h2>
<p>We know that <code class="language-plaintext highlighter-rouge">df.apply</code> is slow, so I will replace it with a <code class="language-plaintext highlighter-rouge">for-loop</code>.
There are a few ways to iterate over the DataFrame rows. Let us compare them:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 351 µs ± 57.8 µs
</span><span class="k">for</span> <span class="p">(</span><span class="n">_</span><span class="p">,</span> <span class="n">row</span><span class="p">)</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span> <span class="k">pass</span>
<span class="c1"># 271 µs ± 65.7 µs
</span><span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">to_records</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span> <span class="k">pass</span>
<span class="c1"># 26.5 µs ± 12.9 µs
</span><span class="k">for</span> <span class="n">start</span><span class="p">,</span> <span class="n">end</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">]):</span> <span class="k">pass</span>
<span class="c1"># 10.4 µs ± 2.55 µs
</span><span class="k">for</span> <span class="n">start</span><span class="p">,</span> <span class="n">end</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">].</span><span class="n">values</span><span class="p">):</span> <span class="k">pass</span>
</code></pre></div></div>
<p>The last version is <strong>34x</strong> faster than <code class="language-plaintext highlighter-rouge">df.iterrows()</code>. The improvement will be even larger for a DataFrame with many more rows.</p>

<p>The improved version for creating the <code class="language-plaintext highlighter-rouge">ts</code> column is:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s">'30min'</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">start</span><span class="p">,</span> <span class="n">end</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
<span class="p">]</span>
</code></pre></div></div>
<p>Now the time for the improved version is <code class="language-plaintext highlighter-rouge">684 ms ± 12.6 ms</code>; it is still too slow.</p>

<h2 id="implementing-a-custom-dfexplode-function">Implementing a custom <code class="language-plaintext highlighter-rouge">df.explode</code> function</h2>
<p>It seems there is not much more we can do to speed up creating the <code class="language-plaintext highlighter-rouge">ts</code> column.</p>

<p>Now let us focus on the <code class="language-plaintext highlighter-rouge">df.explode</code> part. We will implement our own version to explode the lists in the <code class="language-plaintext highlighter-rouge">ts</code> column.</p>

<p>We know that the <code class="language-plaintext highlighter-rouge">df.reindex</code> function can be used to resample rows of a DataFrame based on a provided new index. Here we will use this function to implement a new <code class="language-plaintext highlighter-rouge">explode</code> function.</p>
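<p>As a quick illustration with a made-up DataFrame, passing an index with repeated labels to <code class="language-plaintext highlighter-rouge">df.reindex</code> duplicates the corresponding rows:</p>

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]})

# Repeating index labels duplicates those rows in the result
out = df.reindex([0, 0, 1, 2, 2])
print(out['a'].tolist())  # [10, 10, 20, 30, 30]
```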

<p>First we can create the new index and <code class="language-plaintext highlighter-rouge">ts</code> column, using <code class="language-plaintext highlighter-rouge">pd.concat</code> to merge the DataFrames created from each row:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df</span>
    <span class="p">.</span><span class="n">get</span><span class="p">([</span><span class="s">'ts'</span><span class="p">])</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'i'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">dt</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'i'</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">'ts'</span><span class="p">:</span> <span class="n">ts</span><span class="p">})</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">ts</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">d</span><span class="p">[</span><span class="s">'i'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span> <span class="n">d</span><span class="p">[</span><span class="s">'ts'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
<span class="p">]).</span><span class="n">set_index</span><span class="p">(</span><span class="s">'i'</span><span class="p">).</span><span class="n">rename_axis</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>Then we use the <code class="language-plaintext highlighter-rouge">df.reindex</code> function to sample the other columns in the original DataFrame and add the exploded <code class="language-plaintext highlighter-rouge">ts</code> column:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="s">'ts'</span><span class="p">).</span><span class="n">reindex</span><span class="p">(</span><span class="n">dt</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">]</span> <span class="o">=</span> <span class="n">dt</span><span class="p">.</span><span class="n">ts</span>
</code></pre></div></div>

<p>Putting the two parts together, here is the custom <code class="language-plaintext highlighter-rouge">explode</code> function:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">explode_df_column</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
    <span class="n">dt</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
        <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'i'</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">'ts'</span><span class="p">:</span> <span class="n">ts</span><span class="p">})</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">ts</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
    <span class="p">]).</span><span class="n">set_index</span><span class="p">(</span><span class="s">'i'</span><span class="p">).</span><span class="n">rename_axis</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="s">'ts'</span><span class="p">).</span><span class="n">reindex</span><span class="p">(</span><span class="n">dt</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
    <span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">]</span> <span class="o">=</span> <span class="n">dt</span><span class="p">.</span><span class="n">ts</span>
    <span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>

<p>The time for this function is <code class="language-plaintext highlighter-rouge">22 ms ± 285 µs</code>, <strong>30x</strong> faster compared to the <code class="language-plaintext highlighter-rouge">df.explode</code> function that has a time of <code class="language-plaintext highlighter-rouge">691 ms ± 7.47 ms</code>.</p>
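<p>To sanity-check the custom function on a tiny made-up DataFrame, we can compare its output against <code class="language-plaintext highlighter-rouge">df.explode</code>:</p>

```python
import pandas as pd

# Same custom explode as above, where the 'ts' column holds list-like values
def explode_df_column(df):
    dt = pd.concat([
        pd.DataFrame({'i': i, 'ts': ts})
        for (i, ts) in enumerate(df['ts'].values)
    ]).set_index('i').rename_axis(None, axis=0)
    df = df.drop(columns='ts').reindex(dt.index)
    df['ts'] = dt.ts
    return df

df = pd.DataFrame({'id': ['a', 'b'], 'ts': [[1, 2], [3]]})
fast = explode_df_column(df)
slow = df.explode('ts')
print(fast['id'].tolist() == slow['id'].tolist())  # True
print(fast['ts'].tolist() == slow['ts'].tolist())  # True
```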

<h2 id="creating-a-ts-column-with-value-of-lists-not-required">Creating a <code class="language-plaintext highlighter-rouge">ts</code> column of lists is not required</h2>
<p>For our use case, creating an intermediate column holding a list of timestamps for each row is not required. We can merge this step into the creation of the <code class="language-plaintext highlighter-rouge">dt</code> DataFrame.</p>

<p>Here is the final solution based on this idea:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a DataFrame with new index and the 30min ts column
</span><span class="n">dt</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'i'</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">'ts'</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s">'30min'</span><span class="p">)})</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">]))</span>
<span class="p">]).</span><span class="n">set_index</span><span class="p">(</span><span class="s">'i'</span><span class="p">).</span><span class="n">rename_axis</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="c1"># Resample original df based on new index and add the exploded ts column
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">dt</span><span class="p">.</span><span class="n">index</span><span class="p">).</span><span class="n">assign</span><span class="p">(</span><span class="n">ts</span><span class="o">=</span><span class="n">dt</span><span class="p">.</span><span class="n">ts</span><span class="p">)</span>
</code></pre></div></div>

<p>Great! The time for this solution is <code class="language-plaintext highlighter-rouge">49.8 ms ± 932 µs</code>, about <strong>30x</strong> faster than the initial solution that has a time of <code class="language-plaintext highlighter-rouge">1.394s</code> (<code class="language-plaintext highlighter-rouge">703 ms ± 6.57 ms</code> + <code class="language-plaintext highlighter-rouge">691 ms ± 7.47 ms</code>).</p>

<p>We can wrap the method into a function and add parameters, for example to clip the date ranges to a min/max datetime or to keep the original DataFrame index. I will leave the remaining work to you.</p>
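<p>A possible sketch of such a wrapper is shown below; the function name and the <code class="language-plaintext highlighter-rouge">ts_min</code>/<code class="language-plaintext highlighter-rouge">ts_max</code> parameters are my own additions, not part of the benchmarks above:</p>

```python
import pandas as pd

def explode_date_ranges(df, start_col='start_date', end_col='end_date',
                        freq='30min', ts_min=None, ts_max=None):
    # Hypothetical wrapper around the reindex-based approach above.
    # ts_min/ts_max (pd.Timestamp) optionally clip each range before expansion.
    d = df.reset_index(drop=True)
    dt = pd.concat([
        pd.DataFrame({'i': i, 'ts': pd.date_range(
            start if ts_min is None else max(start, ts_min),
            end if ts_max is None else min(end, ts_max),
            freq=freq,
        )})
        for i, (start, end) in enumerate(zip(d[start_col], d[end_col]))
    ]).set_index('i').rename_axis(None, axis=0)
    return d.reindex(dt.index).assign(ts=dt.ts)

df = pd.DataFrame({
    'id': ['a'],
    'start_date': pd.to_datetime(['2024-01-01 00:00']),
    'end_date': pd.to_datetime(['2024-01-01 01:00']),
})
out = explode_date_ranges(df)
print(len(out))  # 3 rows: 00:00, 00:30, 01:00
```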

<p>In summary, we made the method for exploding datetime ranges in a DataFrame into a new timestamp column, while copying the other columns, about <strong>30x</strong> faster. Along the way, we also created a custom function that is about <strong>30x</strong> faster than the pandas <code class="language-plaintext highlighter-rouge">df.explode</code> function.</p>

<h2 id="polars-version">Polars version</h2>
<p>What about using <code class="language-plaintext highlighter-rouge">polars</code>? Is it faster?</p>

<p>This Stack Overflow question might be helpful:
<a href="https://stackoverflow.com/questions/73161185/repeating-a-date-in-polars-and-exploding-it">Repeating a date in polars and exploding it</a>.</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Pandas" /><category term="Performance" /><summary type="html"><![CDATA[During data analysis, it is very common that we need to convert data from a lower time resolution, such as quarterly or monthly, to a higher one, such as half-hourly.]]></summary></entry></feed>