How I reduced a Python app run time from two hours to 20 seconds?¶
Pandas df.groupby.apply is too slow for two Dataframes
We have a Python app that was too slow. It took about two hours to extract product forecast data from a database and to merge it with actual records. After some refactorization and optimization, I managed to reduce the run time to less than 20 seconds.
Assume we have some products and each has some daily sales revenue, the actual records. We also have daily forecast revenue. The task is to merge the actual and forecast data together.
Once there are missing records in the actual data, we believe that the actual revenue from that day are not reliable and they should be replaced with forecast data.
To finish this task, for each product, we need to first find the last consecutive date in the actual data and then get the forecast data after that date so we can merge the actual and forecast data together.
Dummy data for testing¶
The performance of different implementations has been tested using some dummy data. I created dummy data using a function gen_rand_df that is described in my previous post. I also used a function explode_date_range in my another post to explode date ranges.
Firstly, we create some product info with product_id, start_date and end_date for the actual sales records and expand the date ranges to daily records.
nrow = 5000
d1 = gen_rand_df(
nrow=nrow,
str_cols={
'count': 1,
'name': 'product_id',
'str_len': 10,
'str_cnt': nrow,
},
ts_cols={
'count': 2,
'name': ['start_date', 'end_date'],
'start_date': '2025-01-01',
'end_date': '2035-01-01',
'freq': 'D',
'random': True,
},
)
df1 = explode_date_range(
df=d1.query('start_date < end_date').drop_duplicates(subset=['product_id']),
start_date_col='start_date',
end_date_col='end_date',
freq='D',
)
Secondly, we create some sales forecast info for products that has actual sales data.
d2 = gen_rand_df(
nrow=nrow,
str_cols={
'count': 1,
'name': 'product_id',
'col_strs': d1['product_id'].unique(),
},
ts_cols={
'count': 2,
'name': ['start_date', 'end_date'],
'start_date': '2025-01-01',
'end_date': '2035-01-01',
'freq': 'D',
'random': True,
},
)
df2 = explode_date_range(
df=d2.query('start_date < end_date').drop_duplicates(subset=['product_id']),
start_date_col='start_date',
end_date_col='end_date',
freq='D',
)
Then, we create some dummy product sales revenue for both the actual and forecast records.
d3 = gen_rand_df(
nrow=max(df1.shape[0], df2.shape[0]),
float_cols={
'count': 2,
'name': ['daily_revenue1', 'daily_revenue2'],
'low': 0,
'high': 1e3,
'missing_pct': [0.1, 0],
},
)
Finally, we add the sales revenue to the actual and forecast data.
df_actual = (
df1
.assign(daily_revenue=d3['daily_revenue1'].values[:df1.shape[0]])
.set_index(['product_id', 'daily_revenue'])
)
df_forecast = (
df2
.assign(daily_revenue=d3['daily_revenue2'].values[:df2.shape[0]])
.set_index(['product_id', 'daily_revenue'])
)
Here are the first few lines of the actual sales data:
daily_revenue
product_id date
P3hLcLj43u 2025-01-26 128.570203
2025-01-27 499.277862
2025-01-28 601.498358
Getting last consecutive date¶
The function used to get the last consecutive date from a date series has been implemented as follows:
def get_last_consecutive_date(dates: pd.Series) -> np.datetime64 | None:
# Empty input
if dates.empty:
return None
dates = np.unique(dates)
# Only one unique element in the list
if len(dates) == 1:
return dates[0]
diffs = np.diff(dates).astype('timedelta64[D]').astype(int)
last_consecutive_day_index = np.where(diffs > 1)[0]
if len(last_consecutive_day_index) == 0:
return dates[-1] # all dates are consecutive
else:
return dates[last_consecutive_day_index[0]]
Using Pandas df.groupby.apply¶
As we have to perform the same task for each group of products, naturally we can use Pandas df.groupby.apply. But this function generally only works for one DataFrame. Here we have two DataFrames and one option is passing the second DataFrame as a parameter.
Here is the implementation:
def keep_records_after_consecutive_dates_v1(
df_forecast: pd.DataFrame,
df_actual: pd.DataFrame,
) -> pd.DataFrame:
product_id = df_forecast.index.values[0][df_forecast.index.names.index('product_id')]
dates = df_actual.query('product_id == @product_id & daily_revenue.notna()').index.unique('date')
# Get last consecutive date and filter df_forecast
if len(dates) > 0:
last_consecutive_date = get_last_consecutive_date(dates)
df_forecast = df_forecast.query('date > @last_consecutive_date')
return df_forecast
df_v1 = (
df_forecast
.groupby('product_id', group_keys=False)
.apply(keep_records_after_consecutive_dates_v1, df_actual)
)
Avoiding repeated query and filtering¶
By checking the previous implementation, we can observe that we have a repeated query and filtering for each product on the actual sales DataFrame. That is likely to slow down the process.
Now we do the query for all products and group the product records in advance. Hopefully this will make it much faster.
Here is the updated version:
def keep_records_after_consecutive_dates_v2(
df_forecast: pd.DataFrame,
df_actual: DataFrameGroupBy,
) -> pd.DataFrame:
product_id = df_forecast.index.values[0][df_forecast.index.names.index('product_id')]
if product_id in df_actual.groups:
dates = df_actual.get_group(product_id).index.unique('date')
# Get last consecutive date and filter df_forecast
if len(dates) > 0:
last_consecutive_date = get_last_consecutive_datex(dates)
df_forecast = df_forecast.query('date > @last_consecutive_date')
return df_forecast
grp_actual = df_actual.query('daily_revenue.notna()').groupby('product_id')
df_v2 = (
df_forecast
.groupby('product_id', group_keys=False)
.apply(keep_records_after_consecutive_dates_v2, grp_actual)
)
Using a Python for-loop¶
The .apply() often has some overhead compared to a pure Python for-loop. We now replace the .apply() with a for-loop. At the same time we can remove the index parsing for all product groups.
def keep_records_after_consecutive_dates_v3(
df_forecast: pd.DataFrame,
df_actual: pd.DataFrame,
) -> pd.DataFrame:
dates = df_actual.index.unique('date')
# Get last consecutive date and filter df_forecast
if len(dates) > 0:
last_consecutive_date = get_last_consecutive_datex(dates)
df_forecast = df_forecast.query('date > @last_consecutive_date')
return df_forecast
dfs = []
grp_actual = df_actual.query('daily_revenue.notna()').groupby('product_id')
grp_forecast = df_forecast.groupby('product_id')
product_ids = df_forecast.index.unique('product_id')
for product_id in product_ids:
if product_id in grp_actual.groups:
df = keep_records_after_consecutive_dates_v3(
grp_forecast.get_group(product_id),
grp_actual.get_group(product_id),
)
else:
df = grp_forecast.get_group(product_id)
dfs.append(df)
df_v3 = pd.concat(dfs, axis=0)
Vectorized process without for-loop¶
It's obvious that we can vectorize the calculation of the last consecutive date for all products. Pandas groupby().apply() on a Series can be very efficient, as it often operates on NumPy arrays internally.
We can also avoid the for-loop by using vectorized join and filtering operations. By doing that we don't need to join small DataFrames for all products using pd.concat.
The final optimized version is showing as follows:
def keep_records_after_consecutive_dates_v4(
df_forecast: pd.DataFrame,
df_actual: pd.DataFrame,
) -> pd.DataFrame:
# Get last consecutive date for each product
last_consecutive_dates = (
df_actual
.query('daily_revenue.notna()')
.reset_index('date')
.groupby('product_id')['date']
.apply(get_last_consecutive_date)
.to_frame()
.rename(columns={'date': 'last_consecutive_date'})
)
# Filter df_forecast
df_forecast = (
df_forecast
.join(last_consecutive_dates, on='product_id', how='left')
.fillna({'last_consecutive_date': pd.Timestamp.min})
.query('date > last_consecutive_date')
.drop(columns='last_consecutive_date')
)
return df_forecast
df_v4 = keep_records_after_consecutive_dates_v4(df_forecast, df_actual)
summary¶
By avoiding repeated query and filtering operations and vectorization of some other operations, I successfully made a Python process running 130x faster. I also did some similar optimization for extracting forecast data from a database. Ultimately the application total execution time was reduced from over two hours to less than 20 seconds.
Want to know more tips about coding and other things, please visit: https://github.com/seanslma/maki