<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://seanslma.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://seanslma.github.io/" rel="alternate" type="text/html" /><updated>2026-03-02T12:35:53+00:00</updated><id>https://seanslma.github.io/feed.xml</id><title type="html">Learning Faster Python Fast</title><subtitle>Sean&apos;s Blogs</subtitle><author><name>Sean Ma</name></author><entry><title type="html">Polars null operations</title><link href="https://seanslma.github.io/polars-null-operations/" rel="alternate" type="text/html" title="Polars null operations" /><published>2025-10-27T00:00:00+00:00</published><updated>2025-10-27T00:00:00+00:00</updated><id>https://seanslma.github.io/polars-null-operations</id><content type="html" xml:base="https://seanslma.github.io/polars-null-operations/"><![CDATA[<p>In Polars, if any of the columns involved in an operation contains <code class="language-plaintext highlighter-rouge">null</code>, the result of the operation will also be <code class="language-plaintext highlighter-rouge">null</code> for that row. This behavior is consistent with SQL’s <code class="language-plaintext highlighter-rouge">NULL propagation</code> principle.</p>

<p>Let’s create a simple example to demonstrate that:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>

<span class="c1"># Create a DataFrame with some null values
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
    <span class="s">'x'</span><span class="p">:</span> <span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">],</span>  <span class="c1"># 'None' represents null in Polars
</span>    <span class="s">'y'</span><span class="p">:</span> <span class="p">[</span><span class="s">'foo'</span><span class="p">,</span> <span class="s">'bar'</span><span class="p">,</span> <span class="bp">None</span><span class="p">]</span>
<span class="p">})</span>

<span class="c1"># Apply the expression to concatenate 'x' + '_' + 'y'
</span><span class="n">df_result</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'x'</span><span class="p">)</span> <span class="o">+</span> <span class="n">pl</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'_'</span><span class="p">)</span> <span class="o">+</span> <span class="n">pl</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'y'</span><span class="p">)).</span><span class="n">alias</span><span class="p">(</span><span class="s">'v'</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Show the result
</span><span class="k">print</span><span class="p">(</span><span class="n">df_result</span><span class="p">)</span>
</code></pre></div></div>

<p>The output is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shape: (3, 3)
┌──────┬──────┬───────┐
│ x    ┆ y    ┆ v     │
│ ---  ┆ ---  ┆ ---   │
│ str  ┆ str  ┆ str   │
╞══════╪══════╪═══════╡
│ null ┆ foo  ┆ null  │
│ a    ┆ bar  ┆ a_bar │
│ b    ┆ null ┆ null  │
└──────┴──────┴───────┘
</code></pre></div></div>

<p>And here is the simple fix – fill nulls with an empty string first:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Apply the expression to concatenate 'x' + '_' + 'y'
</span><span class="n">df_result</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">with_columns</span><span class="p">(</span>
    <span class="p">(</span>
      <span class="n">pl</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'x'</span><span class="p">).</span><span class="n">fill_null</span><span class="p">(</span><span class="s">''</span><span class="p">)</span>
      <span class="o">+</span> <span class="n">pl</span><span class="p">.</span><span class="n">lit</span><span class="p">(</span><span class="s">'_'</span><span class="p">)</span>
      <span class="o">+</span> <span class="n">pl</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'y'</span><span class="p">).</span><span class="n">fill_null</span><span class="p">(</span><span class="s">''</span><span class="p">)</span>
    <span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">'v'</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Now the output is correct:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>shape: (3, 3)
┌──────┬──────┬───────┐
│ x    ┆ y    ┆ v     │
│ ---  ┆ ---  ┆ ---   │
│ str  ┆ str  ┆ str   │
╞══════╪══════╪═══════╡
│ null ┆ foo  ┆ _foo  │
│ a    ┆ bar  ┆ a_bar │
│ b    ┆ null ┆ b_    │
└──────┴──────┴───────┘
</code></pre></div></div>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Polars" /><category term="NULL" /><summary type="html"><![CDATA[In Polars, if any of the columns involved in an operation contains null, the result of the operation will also be null for that row. This behavior is consistent with SQL’s NULL propagation principle.]]></summary></entry><entry><title type="html">Polars LazyFrame properties are expensive operations</title><link href="https://seanslma.github.io/polars-lazyframe-property/" rel="alternate" type="text/html" title="Polars LazyFrame properties are expensive operations" /><published>2025-09-05T00:00:00+00:00</published><updated>2025-09-05T00:00:00+00:00</updated><id>https://seanslma.github.io/polars-lazyframe-property</id><content type="html" xml:base="https://seanslma.github.io/polars-lazyframe-property/"><![CDATA[<p>When using polars LazyFrame, at some point you might need to get the properties such as the column names of the LazyFrame.
Be aware that getting LazyFrame properties is expensive. For more details, check the discussion here: https://github.com/pola-rs/polars/issues/16328</p>

<p>Here is a list of some of the LazyFrame properties:</p>
<ul>
  <li>LazyFrame.columns</li>
  <li>LazyFrame.dtypes</li>
  <li>LazyFrame.schema</li>
  <li>LazyFrame.width</li>
</ul>

<p>For example, when you use <code class="language-plaintext highlighter-rouge">LazyFrame.columns</code> you will get a warning:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PerformanceWarning: Determining the column names of a LazyFrame requires
resolving its schema, which is a potentially expensive operation.
Use `LazyFrame.collect_schema().names()` to get the column names without
this warning.
  d.lazy().columns
</code></pre></div></div>
<p>However, if you follow the warning’s suggestion and use the alternative method, you only avoid the warning – the alternative operation is just as expensive.</p>

<p>Let’s test it with a code example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="n">da</span> <span class="o">=</span> <span class="p">{</span><span class="sa">f</span><span class="s">'v</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">'</span><span class="p">:[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">)}</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">da</span><span class="p">)</span>
<span class="n">lf</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">lazy</span><span class="p">()</span>

<span class="n">_</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span>                  <span class="c1"># 1.39 ms ± 45 μs without warning
</span><span class="n">_</span> <span class="o">=</span> <span class="n">lf</span><span class="p">.</span><span class="n">columns</span>                  <span class="c1"># 25.3 ms ± 822 μs with warning
</span><span class="n">_</span> <span class="o">=</span> <span class="n">lf</span><span class="p">.</span><span class="n">collect_schema</span><span class="p">().</span><span class="n">names</span><span class="p">()</span> <span class="c1"># 24.5 ms ± 765 μs without warning
</span></code></pre></div></div>
<p>So if possible we should get the properties from the DataFrame not from the LazyFrame.</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Polars" /><category term="LazyFrame" /><summary type="html"><![CDATA[When using polars LazyFrame, at some point you might need to get the properties such as the column names of the LazyFrame. We must be aware that getting LazyFrame properties is expensive. For more details check the discussions here: https://github.com/pola-rs/polars/issues/16328]]></summary></entry><entry><title type="html">Improve VarianceThreshold performance in ML feature selection</title><link href="https://seanslma.github.io/feature-selection-variance-perf/" rel="alternate" type="text/html" title="Improve VarianceThreshold performance in ML feature selection" /><published>2025-08-18T00:00:00+00:00</published><updated>2025-08-18T00:00:00+00:00</updated><id>https://seanslma.github.io/feature-selection-variance-perf</id><content type="html" xml:base="https://seanslma.github.io/feature-selection-variance-perf/"><![CDATA[<p>For Machine Learning feature selection, one of the basic and efficient methods is using feature’s variance to drop features that are almost constant – these features will not provide any useful information for the target prediction.</p>

<h2 id="scikit-learn-implementation-is-slow">scikit-learn implementation is slow</h2>
<p>The <code class="language-plaintext highlighter-rouge">VarianceThreshold</code> class implemented in <code class="language-plaintext highlighter-rouge">scikit-learn</code> is super slow. Here is an example showing how to use it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_selection</span> <span class="kn">import</span> <span class="n">VarianceThreshold</span>
<span class="k">def</span> <span class="nf">sklearn_variance_threshold</span><span class="p">(</span>
    <span class="n">X</span><span class="p">:</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">threshold</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">selector</span> <span class="o">=</span> <span class="n">VarianceThreshold</span><span class="p">(</span><span class="n">threshold</span><span class="o">=</span><span class="n">threshold</span><span class="p">)</span>
    <span class="n">_</span> <span class="o">=</span> <span class="n">selector</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">selector</span><span class="p">.</span><span class="n">get_support</span><span class="p">()]</span>
    <span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>

<h2 id="polars-implementation-is-much-faster">polars implementation is much faster</h2>
<p><code class="language-plaintext highlighter-rouge">polars</code> is a <code class="language-plaintext highlighter-rouge">pandas</code>-equivalent data processing package implemented in <code class="language-plaintext highlighter-rouge">Rust</code>. <code class="language-plaintext highlighter-rouge">Rust</code> is popular for its performance and other great features such as parallelization and memory management.</p>

<p>Here is an implementation using <code class="language-plaintext highlighter-rouge">polars</code> that is about 20x faster.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="k">def</span> <span class="nf">polars_variance_threshold</span><span class="p">(</span>
    <span class="n">X</span><span class="p">:</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">threshold</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pl</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">stats</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">select</span><span class="p">([</span>
        <span class="n">pl</span><span class="p">.</span><span class="n">var</span><span class="p">(</span><span class="n">col</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="n">col</span><span class="p">)</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">X</span><span class="p">.</span><span class="n">columns</span>
    <span class="p">])</span>
    <span class="n">variances</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">row</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>  <span class="c1"># get variances as a list
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">select</span><span class="p">([</span>
        <span class="n">col</span> <span class="k">for</span> <span class="n">col</span><span class="p">,</span> <span class="n">var</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">columns</span><span class="p">,</span> <span class="n">variances</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">var</span> <span class="o">&gt;</span> <span class="n">threshold</span>
    <span class="p">])</span>
    <span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>

<h2 id="test-it">Test it</h2>
<p>To compare the performance of the two different implementations we can again use the method I created to generate dummy data for testing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="c1"># create dataset
</span><span class="n">df_pandas</span> <span class="o">=</span> <span class="n">create_dummy_df</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">df_pandas</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="n">exclude</span><span class="p">(</span><span class="s">'target'</span><span class="p">))</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'target'</span><span class="p">]</span>

<span class="n">X_sklearn</span> <span class="o">=</span> <span class="n">sklearn_variance_threshold</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">X_polars</span> <span class="o">=</span> <span class="n">polars_variance_threshold</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">X_sklearn</span><span class="p">.</span><span class="n">equals</span><span class="p">(</span><span class="n">X_polars</span><span class="p">)</span>
</code></pre></div></div>]]></content><author><name>Sean Ma</name></author><category term="ML" /><category term="Feature Selection" /><summary type="html"><![CDATA[For Machine Learning feature selection, one of the basic and efficient methods is using feature’s variance to drop features that are almost constant – these features will not provide any useful information for the target prediction.]]></summary></entry><entry><title type="html">Reduce a python app run time from two hours to 20 seconds</title><link href="https://seanslma.github.io/python-groupby-perf/" rel="alternate" type="text/html" title="Reduce a python app run time from two hours to 20 seconds" /><published>2025-07-11T00:00:00+00:00</published><updated>2025-07-11T00:00:00+00:00</updated><id>https://seanslma.github.io/python-groupby-perf</id><content type="html" xml:base="https://seanslma.github.io/python-groupby-perf/"><![CDATA[<p>Pandas <code class="language-plaintext highlighter-rouge">df.groupby.apply</code> is too slow for two Dataframes.</p>

<p>We have a Python app that was too slow. It took about two hours to extract product forecast data from a database and merge it with actual records. After some refactoring and optimization, I managed to reduce the run time to less than 20 seconds.</p>

<p>Assume we have some products, each with daily sales revenue – the actual records. We also have daily forecast revenue. The task is to merge the actual and forecast data together.</p>

<p>When there are missing records in the actual data, we consider the actual revenue from that day onward unreliable, so it should be replaced with forecast data.</p>

<p>To finish this task, for each product, we need to first find the last consecutive date in the actual data and then get the forecast data after that date so we can merge the actual and forecast data together.</p>
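<p>For a single product, the merge step can be sketched like this (the data, the column names, and the <code class="language-plaintext highlighter-rouge">last_date</code> computation are simplified stand-ins for illustration):</p>

```python
import pandas as pd

actual = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=3, freq='D'),
    'daily_revenue': [1.0, 2.0, 3.0],
})
forecast = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=6, freq='D'),
    'daily_revenue': [9.0] * 6,
})

# Keep actual rows up to the last consecutive date, append forecast rows after it
last_date = actual['date'].max()  # stand-in for the last-consecutive-date logic
merged = pd.concat(
    [actual, forecast[forecast['date'] > last_date]],
    ignore_index=True,
)
```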

<h2 id="dummy-data-for-testing">Dummy data for testing</h2>
<p>The performance of the different implementations has been tested using some dummy data. I created the dummy data with a function <code class="language-plaintext highlighter-rouge">gen_rand_df</code> that is described in <a href="https://medium.com/@sean.lma/how-to-create-dummy-pandas-dataframes-for-testing-cf03c52878e3">my previous post</a>. I also used a function <code class="language-plaintext highlighter-rouge">explode_date_range</code> from <a href="https://python.plainenglish.io/how-to-explode-date-ranges-in-a-pandas-dataframe-30x-faster-cb76519c7acf">another post of mine</a> to explode date ranges.</p>

<p>Firstly, we create some product info with <code class="language-plaintext highlighter-rouge">product_id</code>, <code class="language-plaintext highlighter-rouge">start_date</code> and <code class="language-plaintext highlighter-rouge">end_date</code> for the actual sales records and expand the date ranges to daily records.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nrow</span> <span class="o">=</span> <span class="mi">5000</span>
<span class="n">d1</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="n">nrow</span><span class="p">,</span>
    <span class="n">str_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="s">'product_id'</span><span class="p">,</span>
        <span class="s">'str_len'</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
        <span class="s">'str_cnt'</span><span class="p">:</span> <span class="n">nrow</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="n">ts_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'start_date'</span><span class="p">,</span> <span class="s">'end_date'</span><span class="p">],</span>
        <span class="s">'start_date'</span><span class="p">:</span> <span class="s">'2025-01-01'</span><span class="p">,</span>
        <span class="s">'end_date'</span><span class="p">:</span> <span class="s">'2035-01-01'</span><span class="p">,</span>
        <span class="s">'freq'</span><span class="p">:</span> <span class="s">'D'</span><span class="p">,</span>
        <span class="s">'random'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df1</span> <span class="o">=</span> <span class="n">explode_date_range</span><span class="p">(</span>
    <span class="n">df</span><span class="o">=</span><span class="n">d1</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'start_date &lt; end_date'</span><span class="p">).</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'product_id'</span><span class="p">]),</span>
    <span class="n">start_date_col</span><span class="o">=</span><span class="s">'start_date'</span><span class="p">,</span>
    <span class="n">end_date_col</span><span class="o">=</span><span class="s">'end_date'</span><span class="p">,</span>
    <span class="n">freq</span><span class="o">=</span><span class="s">'D'</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Secondly, we create some sales forecast info for products that have actual sales data.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d2</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="n">nrow</span><span class="p">,</span>
    <span class="n">str_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="s">'product_id'</span><span class="p">,</span>
        <span class="s">'col_strs'</span><span class="p">:</span> <span class="n">d1</span><span class="p">[</span><span class="s">'product_id'</span><span class="p">].</span><span class="n">unique</span><span class="p">(),</span>
    <span class="p">},</span>
    <span class="n">ts_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'start_date'</span><span class="p">,</span> <span class="s">'end_date'</span><span class="p">],</span>
        <span class="s">'start_date'</span><span class="p">:</span> <span class="s">'2025-01-01'</span><span class="p">,</span>
        <span class="s">'end_date'</span><span class="p">:</span> <span class="s">'2035-01-01'</span><span class="p">,</span>
        <span class="s">'freq'</span><span class="p">:</span> <span class="s">'D'</span><span class="p">,</span>
        <span class="s">'random'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">explode_date_range</span><span class="p">(</span>
    <span class="n">df</span><span class="o">=</span><span class="n">d2</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'start_date &lt; end_date'</span><span class="p">).</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'product_id'</span><span class="p">]),</span>
    <span class="n">start_date_col</span><span class="o">=</span><span class="s">'start_date'</span><span class="p">,</span>
    <span class="n">end_date_col</span><span class="o">=</span><span class="s">'end_date'</span><span class="p">,</span>
    <span class="n">freq</span><span class="o">=</span><span class="s">'D'</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Then, we create some dummy product sales revenue for both the actual and forecast records.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d3</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="nb">max</span><span class="p">(</span><span class="n">df1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">df2</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
    <span class="n">float_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'daily_revenue1'</span><span class="p">,</span> <span class="s">'daily_revenue2'</span><span class="p">],</span>
        <span class="s">'low'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
        <span class="s">'high'</span><span class="p">:</span> <span class="mf">1e3</span><span class="p">,</span>
        <span class="s">'missing_pct'</span><span class="p">:</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
    <span class="p">},</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Finally, we add the sales revenue to the actual and forecast data.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_actual</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df1</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">daily_revenue</span><span class="o">=</span><span class="n">d3</span><span class="p">[</span><span class="s">'daily_revenue1'</span><span class="p">].</span><span class="n">values</span><span class="p">[:</span><span class="n">df1</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]])</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">'product_id'</span><span class="p">,</span> <span class="s">'date'</span><span class="p">])</span>
<span class="p">)</span>
<span class="n">df_forecast</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df2</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">daily_revenue</span><span class="o">=</span><span class="n">d3</span><span class="p">[</span><span class="s">'daily_revenue2'</span><span class="p">].</span><span class="n">values</span><span class="p">[:</span><span class="n">df2</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]])</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">'product_id'</span><span class="p">,</span> <span class="s">'date'</span><span class="p">])</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here are the first few lines of the actual sales data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                      daily_revenue
product_id date
P3hLcLj43u 2025-01-26    128.570203
           2025-01-27    499.277862
           2025-01-28    601.498358
</code></pre></div></div>

<h2 id="getting-last-consecutive-date">Getting last consecutive date</h2>
<p>The function used to get the last consecutive date from a date series has been implemented as follows:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_last_consecutive_date</span><span class="p">(</span><span class="n">dates</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">np</span><span class="p">.</span><span class="n">datetime64</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
    <span class="c1"># Empty input
</span>    <span class="k">if</span> <span class="n">dates</span><span class="p">.</span><span class="n">empty</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>

    <span class="n">dates</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span>

    <span class="c1"># Only one unique element in the list
</span>    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">dates</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

    <span class="n">diffs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diff</span><span class="p">(</span><span class="n">dates</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'timedelta64[D]'</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
    <span class="n">last_consecutive_day_index</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">diffs</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">last_consecutive_day_index</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">dates</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># all dates are consecutive
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">dates</span><span class="p">[</span><span class="n">last_consecutive_day_index</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
</code></pre></div></div>

<h2 id="using-pandas-dfgroupbyapply">Using Pandas <code class="language-plaintext highlighter-rouge">df.groupby.apply</code></h2>
<p>As we have to perform the same task for each group of products, naturally we can use Pandas <code class="language-plaintext highlighter-rouge">df.groupby.apply</code>. But this function generally only works on a single DataFrame, while here we have two; one option is to pass the second DataFrame as an extra parameter.</p>
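<p>A minimal sketch of that pattern on toy data (the names here are made up for illustration): extra positional arguments to <code class="language-plaintext highlighter-rouge">.apply()</code> are forwarded to the function.</p>

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})

# The second positional argument to .apply() is passed through to the function
def add_offset(group: pd.DataFrame, offset: int) -> pd.DataFrame:
    return group.assign(x=group['x'] + offset)

out = df.groupby('g', group_keys=False).apply(add_offset, 10)
```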

<p>Here is the implementation:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">keep_records_after_consecutive_dates_v1</span><span class="p">(</span>
    <span class="n">df_forecast</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">df_actual</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">product_id</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">names</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)]</span>
    <span class="n">dates</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'product_id == @product_id &amp; daily_revenue.notna()'</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="s">'date'</span><span class="p">)</span>
    <span class="c1"># Get last consecutive date and filter df_forecast
</span>    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">last_consecutive_date</span> <span class="o">=</span> <span class="n">get_last_consecutive_date</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span>
        <span class="n">df_forecast</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'date &gt; @last_consecutive_date'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df_forecast</span>

<span class="n">df_v1</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df_forecast</span>
    <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">,</span> <span class="n">group_keys</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">keep_records_after_consecutive_dates_v1</span><span class="p">,</span> <span class="n">df_actual</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>The run time is <strong>327 seconds</strong>.</p>

<h2 id="avoiding-repeated-query-and-filtering">Avoiding repeated query and filtering</h2>
<p>Looking at the previous implementation, we can see that the actual sales DataFrame is queried and filtered repeatedly, once per product. That is likely what slows the process down.</p>

<p>Now we do the query for all products and group the product records in advance. Hopefully this will make it much faster.</p>
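<p>The idea in miniature (hypothetical toy data; <code class="language-plaintext highlighter-rouge">dropna</code> stands in for the revenue filter): filter and group once up front, then look up each group cheaply by key instead of re-querying the full DataFrame.</p>

```python
import pandas as pd

df_sales = pd.DataFrame({
    'product_id': ['a', 'a', 'b'],
    'daily_revenue': [1.0, None, 3.0],
})

# Filter and group once, up front
grp = df_sales.dropna(subset=['daily_revenue']).groupby('product_id')

# Cheap per-product lookups afterwards
has_a = 'a' in grp.groups     # membership check without re-filtering
rows_a = grp.get_group('a')   # rows for one product, no repeated query
```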

<p>Here is the updated version:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">keep_records_after_consecutive_dates_v2</span><span class="p">(</span>
    <span class="n">df_forecast</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">df_actual</span><span class="p">:</span> <span class="n">DataFrameGroupBy</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">product_id</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">names</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)]</span>
    <span class="k">if</span> <span class="n">product_id</span> <span class="ow">in</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">groups</span><span class="p">:</span>
        <span class="n">dates</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">get_group</span><span class="p">(</span><span class="n">product_id</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="s">'date'</span><span class="p">)</span>
        <span class="c1"># Get last consecutive date and filter df_forecast
</span>        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="n">last_consecutive_date</span> <span class="o">=</span> <span class="n">get_last_consecutive_date</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span>
            <span class="n">df_forecast</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'date &gt; @last_consecutive_date'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df_forecast</span>

<span class="n">grp_actual</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'daily_revenue.notna()'</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)</span>
<span class="n">df_v2</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df_forecast</span>
    <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">,</span> <span class="n">group_keys</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">keep_records_after_consecutive_dates_v2</span><span class="p">,</span> <span class="n">grp_actual</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Now the run time is <strong>17.6 seconds</strong> — that’s about <strong>18x</strong> faster.</p>

<h2 id="using-a-python-for-loop">Using a Python for-loop</h2>
<p>The <code class="language-plaintext highlighter-rouge">.apply()</code> method often has some overhead compared to a pure Python for-loop, so we now replace <code class="language-plaintext highlighter-rouge">.apply()</code> with a for-loop. At the same time we can remove the index parsing, since each product group is now fetched by its key directly.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">keep_records_after_consecutive_dates_v3</span><span class="p">(</span>
    <span class="n">df_forecast</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">df_actual</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="n">dates</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="s">'date'</span><span class="p">)</span>
    <span class="c1"># Get last consecutive date and filter df_forecast
</span>    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">last_consecutive_date</span> <span class="o">=</span> <span class="n">get_last_consecutive_date</span><span class="p">(</span><span class="n">dates</span><span class="p">)</span>
        <span class="n">df_forecast</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'date &gt; @last_consecutive_date'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df_forecast</span>

<span class="n">dfs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">grp_actual</span> <span class="o">=</span> <span class="n">df_actual</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'daily_revenue.notna()'</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)</span>
<span class="n">grp_forecast</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)</span>
<span class="n">product_ids</span> <span class="o">=</span> <span class="n">df_forecast</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">product_id</span> <span class="ow">in</span> <span class="n">product_ids</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">product_id</span> <span class="ow">in</span> <span class="n">grp_actual</span><span class="p">.</span><span class="n">groups</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">keep_records_after_consecutive_dates_v3</span><span class="p">(</span>
            <span class="n">grp_forecast</span><span class="p">.</span><span class="n">get_group</span><span class="p">(</span><span class="n">product_id</span><span class="p">),</span>
            <span class="n">grp_actual</span><span class="p">.</span><span class="n">get_group</span><span class="p">(</span><span class="n">product_id</span><span class="p">),</span>
        <span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">grp_forecast</span><span class="p">.</span><span class="n">get_group</span><span class="p">(</span><span class="n">product_id</span><span class="p">)</span>
    <span class="n">dfs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">df_v3</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">dfs</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p>The run time is <strong>9.8 seconds</strong> — that’s about <strong>1.8x</strong> faster than version #2.</p>

<h2 id="vectorized-process-without-for-loop">Vectorized process without for-loop</h2>
<p>It’s obvious that we can vectorize the calculation of the last consecutive date for all products. Pandas <code class="language-plaintext highlighter-rouge">groupby().apply()</code> on a Series can be very efficient, as it often operates on NumPy arrays internally.</p>
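<p>For instance (toy data, with a simple last-value lambda standing in for <code class="language-plaintext highlighter-rouge">get_last_consecutive_date</code>), applying a scalar-returning function to a grouped Series yields one value per product, ready to be joined back onto another DataFrame:</p>

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': ['a', 'a', 'b'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05']),
})

# One scalar per group -> a Series indexed by product_id
last_dates = df.groupby('product_id')['date'].apply(lambda s: s.iloc[-1])
```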

<p>We can also avoid the for-loop by using vectorized join and filtering operations. By doing that we don’t need to join small DataFrames for all products using <code class="language-plaintext highlighter-rouge">pd.concat</code>.</p>

<p>The final optimized version is shown below:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">keep_records_after_consecutive_dates_v4</span><span class="p">(</span>
    <span class="n">df_forecast</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">df_actual</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    <span class="c1"># Get last consecutive date for each product
</span>    <span class="n">last_consecutive_dates</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">df_actual</span>
        <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'daily_revenue.notna()'</span><span class="p">)</span>
        <span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="s">'date'</span><span class="p">)</span>
        <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'product_id'</span><span class="p">)[</span><span class="s">'date'</span><span class="p">]</span>
        <span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">get_last_consecutive_date</span><span class="p">)</span>
        <span class="p">.</span><span class="n">to_frame</span><span class="p">()</span>
        <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'date'</span><span class="p">:</span> <span class="s">'last_consecutive_date'</span><span class="p">})</span>
    <span class="p">)</span>
    <span class="c1"># Filter df_forecast
</span>    <span class="n">df_forecast</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">df_forecast</span>
        <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">last_consecutive_dates</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">'product_id'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">)</span>
        <span class="p">.</span><span class="n">fillna</span><span class="p">({</span><span class="s">'last_consecutive_date'</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Timestamp</span><span class="p">.</span><span class="nb">min</span><span class="p">})</span>
        <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'date &gt; last_consecutive_date'</span><span class="p">)</span>
        <span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="s">'last_consecutive_date'</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">df_forecast</span>

<span class="n">df_v4</span> <span class="o">=</span> <span class="n">keep_records_after_consecutive_dates_v4</span><span class="p">(</span><span class="n">df_forecast</span><span class="p">,</span> <span class="n">df_actual</span><span class="p">)</span>
</code></pre></div></div>
<p>The final run time is <strong>2.4 seconds</strong> — that’s about <strong>4x</strong> faster than version #3 and about <strong>130x</strong> faster than the original version #1 (<strong>327 seconds</strong>).</p>

<h2 id="summary">Summary</h2>
<p>By avoiding repeated query and filtering operations and vectorizing others, I made a Python process run 130x faster. I also applied similar optimizations to extracting forecast data from a database. Ultimately, the application’s total execution time was reduced from over two hours to less than 20 seconds.</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Pandas" /><category term="Performance" /><summary type="html"><![CDATA[Pandas df.groupby.apply is too slow for two DataFrames.]]></summary></entry><entry><title type="html">Using Gurobi Python matrix API to reduce problem creation time</title><link href="https://seanslma.github.io/gurobi-matrix-api/" rel="alternate" type="text/html" title="Using Gurobi Python matrix API to reduce problem creation time" /><published>2025-03-17T00:00:00+00:00</published><updated>2025-03-17T00:00:00+00:00</updated><id>https://seanslma.github.io/gurobi-matrix-api</id><content type="html" xml:base="https://seanslma.github.io/gurobi-matrix-api/"><![CDATA[<p>When solving optimization problems in Python, many people choose the popular <code class="language-plaintext highlighter-rouge">pyomo</code> package. However, pyomo is known for its slow performance and some other issues. That’s why <code class="language-plaintext highlighter-rouge">gurobipy</code> has become an attractive alternative.</p>

<p>We know that <code class="language-plaintext highlighter-rouge">gurobipy</code> is much faster at creating problems and makes it easier to interact directly with the Gurobi solver. But did you know that gurobipy also provides a Python matrix API? Here I will demonstrate some matrix API features that will make your code run even faster and be easier to maintain.</p>

<h2 id="how-to-use-gurobipy">How to use gurobipy</h2>
<p>An optimization problem basically contains an objective, some variables and some constraints. So creating a problem means creating decision variables, setting the variable coefficients in the objective, and adding constraints on the variables.</p>

<p>Here I will use a basic example to show the whole process. Assume we have a varying demand over one year at 5-minute intervals. The demand should be met by two generators (g1: integer output with price 2; g2: continuous output with price 4). If there is not enough generation to meet the demand, a penalty proportional to the demand is incurred.</p>

<p>For the example below, the problem creation time is about 8.8 seconds.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># gurobi version: v12.0.0rc1
</span><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.sparse</span> <span class="k">as</span> <span class="n">sp</span>
<span class="kn">import</span> <span class="nn">gurobipy</span> <span class="k">as</span> <span class="n">gp</span>
<span class="kn">from</span> <span class="nn">gurobipy</span> <span class="kn">import</span> <span class="n">GRB</span>

<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>

<span class="c1"># Create a model
</span><span class="n">model</span> <span class="o">=</span> <span class="n">gp</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="s">'gurobi_test'</span><span class="p">)</span>

<span class="c1"># Number of periods (1 year with 5min intervals)
</span><span class="n">n_period</span> <span class="o">=</span> <span class="mi">365</span> <span class="o">*</span> <span class="mi">288</span>
<span class="n">periods</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">n_period</span><span class="p">))</span>
<span class="n">gen1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">n_period</span><span class="p">)</span>
<span class="n">gen2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">n_period</span><span class="p">)</span>
<span class="n">demand</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">n_period</span><span class="p">)</span>
<span class="n">penalty</span> <span class="o">=</span> <span class="mf">3.5</span> <span class="o">*</span> <span class="n">demand</span>

<span class="c1"># Record time before creating the problem
</span><span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>

<span class="c1"># Create variables
</span><span class="n">v_g1</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">i</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">addVar</span><span class="p">(</span><span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">INTEGER</span><span class="p">,</span> <span class="n">lb</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ub</span><span class="o">=</span><span class="n">gen1</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">}</span>
<span class="n">v_g2</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">i</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">addVar</span><span class="p">(</span><span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">CONTINUOUS</span><span class="p">,</span> <span class="n">lb</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ub</span><span class="o">=</span><span class="n">gen2</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">}</span>
<span class="n">v_on</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">i</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">addVar</span><span class="p">(</span><span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">BINARY</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">}</span>

<span class="c1"># Set objective
</span><span class="n">obj_expr</span> <span class="o">=</span> <span class="n">gp</span><span class="p">.</span><span class="n">quicksum</span><span class="p">(</span>
    <span class="mi">2</span> <span class="o">*</span> <span class="n">v_g1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">v_g2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">penalty</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">v_on</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="n">obj_expr</span><span class="p">,</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>

<span class="c1"># Add constraints
</span><span class="n">r1</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">i</span><span class="p">:</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span>
        <span class="n">v_g1</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">v_g2</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">demand</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">v_on</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">demand</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">periods</span>
<span class="p">}</span>

<span class="c1"># Update model and print time for creating the problem
</span><span class="n">model</span><span class="p">.</span><span class="n">update</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Time for creating problem: </span><span class="si">{</span><span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s"> seconds'</span><span class="p">)</span>

<span class="c1"># Write problem to file in LP format
</span><span class="n">model</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'c:/test/gurobi_test.lp'</span><span class="p">)</span>

<span class="c1"># Optimize the model
</span><span class="n">model</span><span class="p">.</span><span class="n">optimize</span><span class="p">()</span>

<span class="c1"># Check if the optimization was successful
</span><span class="k">if</span> <span class="n">model</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="n">GRB</span><span class="p">.</span><span class="n">OPTIMAL</span><span class="p">:</span>
    <span class="n">model</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'c:/test/gurobi_test.sol'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Model is optimal.'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Objective value: </span><span class="si">{</span><span class="n">model</span><span class="p">.</span><span class="n">objVal</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">model</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="n">GRB</span><span class="p">.</span><span class="n">INFEASIBLE</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Model is infeasible.'</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">model</span><span class="p">.</span><span class="n">status</span> <span class="o">==</span> <span class="n">GRB</span><span class="p">.</span><span class="n">UNBOUNDED</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Model is unbounded.'</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Optimization ended with status </span><span class="si">{</span><span class="n">model</span><span class="p">.</span><span class="n">status</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="how-to-use-gurobi-python-matrix-api">How to use Gurobi Python matrix API</h2>
<p>The Gurobi matrix API has a function called <code class="language-plaintext highlighter-rouge">model.addMVar</code>, which takes a <code class="language-plaintext highlighter-rouge">shape</code> parameter. We can use it to add many variables of the same type at once, and then use matrix operations similar to the ones in <code class="language-plaintext highlighter-rouge">numpy</code> to build the objective terms and constraints.</p>

<p>Note that variable bounds and constraint right-hand sides can all be supplied as numpy arrays. Variable bounds can also be updated with a numpy array, and variable solutions can likewise be extracted into a numpy array.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span>
<span class="c1"># Create variables
</span><span class="n">v_g1</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="n">n_period</span><span class="p">,</span> <span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">INTEGER</span><span class="p">,</span> <span class="n">lb</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ub</span><span class="o">=</span><span class="n">gen1</span><span class="p">)</span>
<span class="n">v_g2</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="n">n_period</span><span class="p">,</span> <span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">CONTINUOUS</span><span class="p">,</span> <span class="n">lb</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ub</span><span class="o">=</span><span class="n">gen2</span><span class="p">)</span>
<span class="n">v_on</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="n">n_period</span><span class="p">,</span> <span class="n">vtype</span><span class="o">=</span><span class="n">GRB</span><span class="p">.</span><span class="n">BINARY</span><span class="p">)</span>

<span class="c1"># Set objective
</span><span class="n">obj_expr</span> <span class="o">=</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">v_g1</span> <span class="o">+</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">v_g2</span> <span class="o">+</span> <span class="n">penalty</span> <span class="o">*</span> <span class="n">v_on</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="n">obj_expr</span><span class="p">,</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>

<span class="c1"># Add constraints
</span><span class="n">r1</span> <span class="o">=</span>  <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span>
    <span class="n">v_g1</span> <span class="o">+</span> <span class="n">v_g2</span> <span class="o">+</span> <span class="n">demand</span> <span class="o">*</span> <span class="n">v_on</span> <span class="o">==</span> <span class="n">demand</span>
<span class="p">)</span>
<span class="p">...</span>
</code></pre></div></div>
<p>With the matrix API, problem creation now takes about 1.6 seconds - roughly 5x to 6x faster. At the same time, the code is shorter and cleaner.</p>

<p>If the MVar elements have different coefficients in the constraints, we can use the overloaded matrix-multiplication operator <code class="language-plaintext highlighter-rouge">@</code>:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># c[0]:  x[0] + y[0] &gt;= 10
# c[1]: 2x[1] + y[1] &gt;= 11
</span><span class="n">x</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'x'</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'y'</span><span class="p">)</span>
<span class="n">M</span> <span class="o">=</span> <span class="n">sp</span><span class="p">.</span><span class="n">diags</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">])</span> <span class="c1"># a sparse diagonal matrix
</span><span class="n">c</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span><span class="n">M</span><span class="o">@</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span> <span class="o">&gt;=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">10</span><span class="p">,</span> <span class="mi">11</span><span class="p">]),</span> <span class="n">name</span><span class="o">=</span><span class="s">'c'</span><span class="p">)</span>
</code></pre></div></div>
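<p>As a quick sanity check outside Gurobi, the same sparse diagonal matrix applied to a concrete numpy vector scales each element by its own coefficient (a minimal sketch; <code class="language-plaintext highlighter-rouge">x</code> here is a plain array, not an MVar):</p>

```python
import numpy as np
import scipy.sparse as sp

# The same diagonal matrix used in the constraint above
M = sp.diags([1, 2])          # [[1, 0], [0, 2]]
x = np.array([3.0, 4.0])

# Row i of M @ x is x[i] scaled by its own coefficient
print(M @ x)                  # [3. 8.]
```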

<p>If we need to add a constraint only using one element of the MVar, this can be easily done:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'x'</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addVar</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'y'</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span> <span class="o">&gt;=</span> <span class="mi">99</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="example-of-2d-mvars">Example of 2D MVars</h2>
<p>Assume we have some constraints like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c[0]: x[0,0] + x[0,1] + x[0,2] &gt;= 11
c[1]: x[1,0] + x[1,1] + x[1,2] &gt;= 11
</code></pre></div></div>

<p>These constraints can be added concisely with a 2D MVar:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addMVar</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s">'x'</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">11</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'c'</span><span class="p">)</span>

<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="mi">10</span> <span class="o">*</span> <span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>
</code></pre></div></div>

<p>By default, <code class="language-plaintext highlighter-rouge">MVar.sum()</code> adds up all elements across every axis. With <code class="language-plaintext highlighter-rouge">axis=1</code> it sums along each row, producing one expression per row.</p>
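<p>The <code class="language-plaintext highlighter-rouge">axis</code> semantics mirror numpy’s, which we can confirm with an ordinary array (a small sketch using numpy only):</p>

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

print(a.sum())                   # 15 - every element, all axes
print(a.sum(axis=1))             # [ 3 12] - one total per row
```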

<p>If each row has its own objective coefficient, we can set the objective like this:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coeffs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">])</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="n">coeffs</span> <span class="o">@</span> <span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>
</code></pre></div></div>

<p>If every element of the MVar has its own objective coefficient, we can still use a matrix operation to build the objective:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coeffs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">]])</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">((</span><span class="n">coeffs</span> <span class="o">*</span> <span class="n">x</span><span class="p">).</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>
</code></pre></div></div>
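<p>Both weighted forms reduce to ordinary numpy algebra, which we can verify with a concrete stand-in array for the MVar:</p>

```python
import numpy as np

a = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0]])          # stand-in values for x

# Per-row coefficients: coeffs @ x.sum(axis=1)
row_coeffs = np.array([1, 2])
print(row_coeffs @ a.sum(axis=1))        # 1*3 + 2*6 = 15.0

# Per-element coefficients: (coeffs * x).sum()
elem_coeffs = np.array([[1, 2, 3], [4, 5, 6]])
print((elem_coeffs * a).sum())           # 6 + 30 = 36.0
```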

<h2 id="example-of-shifted-mvars">Example of shifted MVars</h2>
<p>In many cases we need to add constraints related to the difference of variables between two consecutive time points. This can also be done nicely with MVars.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># x[0] - x[-1] &gt;= 0  # x[-1] = 9 is the initial value of x
# x[1] - x[0]  &gt;= 1
# x[2] - x[1]  &gt;= 2
</span><span class="n">S</span> <span class="o">=</span> <span class="n">sp</span><span class="p">.</span><span class="n">diags</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">3</span> <span class="o">-</span> <span class="mi">1</span><span class="p">),</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="nb">format</span><span class="o">=</span><span class="s">'csr'</span><span class="p">)</span>
<span class="c1"># S = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
</span><span class="n">x0</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">9</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">])</span>
<span class="n">rhs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">])</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">addConstr</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">S</span><span class="o">@</span><span class="n">x</span> <span class="o">-</span> <span class="n">x0</span> <span class="o">&gt;=</span> <span class="n">rhs</span><span class="p">)</span>
</code></pre></div></div>
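<p>Before handing the shift matrix to Gurobi, we can sanity-check it with plain numpy/scipy - here <code class="language-plaintext highlighter-rouge">x</code> is a concrete candidate solution rather than an MVar:</p>

```python
import numpy as np
import scipy.sparse as sp

n = 3
S = sp.diags(np.ones(n - 1), -1, shape=(n, n), format='csr')
x0 = np.array([9, 0, 0])
rhs = np.array([0, 1, 2])

x = np.array([10.0, 12.0, 15.0])   # a candidate solution
lhs = x - S @ x - x0               # [x[0]-9, x[1]-x[0], x[2]-x[1]]
print(lhs)                         # [1. 2. 3.]
print(all(lhs >= rhs))             # True - constraints satisfied
```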

<h2 id="changing-constraint-coefficients">Changing constraint coefficients</h2>
<p>If we need to update some constraints, for example, changing some variable coefficients, we can use the method <code class="language-plaintext highlighter-rouge">model.chgCoeff</code>:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">chgCoeff</span><span class="p">(</span><span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mf">10.0</span><span class="p">)</span>
</code></pre></div></div>

<p>Unlike some of the other language APIs, the Python API has no <code class="language-plaintext highlighter-rouge">model.chgCoeffs()</code> method, so we can only change one constraint coefficient at a time. Hopefully <code class="language-plaintext highlighter-rouge">model.chgCoeffs()</code> will be added in the future.</p>
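<p>In the meantime, a tiny wrapper can emulate a batch update by looping over <code class="language-plaintext highlighter-rouge">model.chgCoeff</code> (a sketch; the helper name <code class="language-plaintext highlighter-rouge">chg_coeffs</code> is our own, not part of gurobipy):</p>

```python
def chg_coeffs(model, constrs, variables, values):
    # Change one coefficient at a time - the only option in the
    # Python API - but behind a batch-style interface.
    for con, var, val in zip(constrs, variables, values):
        model.chgCoeff(con, var, val)
```

<p>As with other model edits in gurobipy, the changes take effect at the next <code class="language-plaintext highlighter-rouge">model.update()</code> or <code class="language-plaintext highlighter-rouge">model.optimize()</code> call.</p>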

<h2 id="updating-objective">Updating Objective</h2>
<p>Consider a project built in an object-oriented style, where ideally each object would contribute its own objective terms. However, the Python API has no method for adding objective terms to a model incrementally.</p>

<p>There are two workarounds:</p>
<ul>
  <li>get the objective using <code class="language-plaintext highlighter-rouge">getObjective()</code> then add additional terms and set the objective again using <code class="language-plaintext highlighter-rouge">setObjective()</code>, or</li>
  <li>add objective terms from different objects together and then set the objective using <code class="language-plaintext highlighter-rouge">setObjective()</code></li>
</ul>

<p>Here is an example showing how to do it using the second option:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">obj_expr</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">obj_expr</span> <span class="o">+=</span> <span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">obj_expr</span> <span class="o">+=</span> <span class="p">(</span><span class="n">coeffs</span> <span class="o">*</span> <span class="n">y</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">setObjective</span><span class="p">(</span><span class="n">obj_expr</span><span class="p">,</span> <span class="n">GRB</span><span class="p">.</span><span class="n">MINIMIZE</span><span class="p">)</span>
</code></pre></div></div>

<p>Note that setting variable and constraint names increases problem creation time. Thus it’s best to do so only for debugging purposes - you can use a flag to enable or disable variable and constraint name setting.</p>
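<p>One way to implement such a flag (a sketch; <code class="language-plaintext highlighter-rouge">make_name</code> is a hypothetical helper, not a gurobipy function):</p>

```python
def make_name(prefix, index, debug=False):
    # With debug=False we return an empty string, letting Gurobi
    # fall back to its cheap default names.
    return f'{prefix}[{index}]' if debug else ''

# e.g. model.addVar(name=make_name('x', i, debug=DEBUG))
```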

<p>More details about the Gurobi Python matrix API can be found in the manual: https://docs.gurobi.com/projects/optimizer/en/current/index.html</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Optimization" /><category term="Gurobi" /><summary type="html"><![CDATA[When doing optimizations using Python, many people choose the popular pyomo package to create optimization problems. However, pyomo is known for its slow performance and some other issues. That’s why gurobipy becomes a more attractive alternative.]]></summary></entry><entry><title type="html">Which orchestration tool is better: Airflow, Prefect, Argo Workflows, or Temporal?</title><link href="https://seanslma.github.io/orchestration-tool/" rel="alternate" type="text/html" title="Which orchestration tool is better: Airflow, Prefect, Argo Workflows, or Temporal?" /><published>2025-03-05T00:00:00+00:00</published><updated>2025-03-05T00:00:00+00:00</updated><id>https://seanslma.github.io/orchestration-tool</id><content type="html" xml:base="https://seanslma.github.io/orchestration-tool/"><![CDATA[<p>Nowadays there are many tools for task orchestration. Some popular ones include Airflow, Prefect, Argo Workflows, and Temporal. Now the question is which tool should I use in my team?</p>

<p>Here I will briefly list the features of the four task orchestration tools. Hopefully this will help you decide which tool is best for your task scheduling.</p>

<h2 id="airflow">Airflow</h2>
<p>Airflow is a popular task scheduling tool. The data workflows in Airflow are defined using Python.</p>

<p>It has a large user base but also has some limitations as it was created earlier than other orchestration tools:</p>
<ul>
  <li>DAGs are not parameterized - you can’t pass parameters into your workflows</li>
  <li>DAGs are static - they can’t automatically create new steps at runtime as needed</li>
  <li>We have to package the entire workflow into one container</li>
</ul>

<h2 id="prefect">Prefect</h2>
<p>Prefect was created to overcome some of the limitations in Airflow, with a strong emphasis on ease of use and deployment, especially for complex DAGs.</p>

<p>Some features of Prefect:</p>
<ul>
  <li>Workflows are defined in Python, parameterized and dynamic</li>
  <li>Can run each step in a container, but the Docker image must be registered with the workflow in Prefect</li>
  <li>Uses state management abstractions that allow for easy retries and failure handling within data workflows</li>
  <li>Has built-in integrations with popular data engineering tools and platforms, such as Dask, DBT, and various cloud services</li>
</ul>

<h2 id="argo-workflows">Argo Workflows</h2>
<p>Argo Workflows is a container-native workflow engine for orchestrating jobs on Kubernetes. It naturally addresses the deployment issue in Airflow and Prefect.</p>
<ul>
  <li>Workflows are defined in YAML</li>
  <li>Every step in a workflow runs in its own container</li>
  <li>Relies on Kubernetes for state management</li>
  <li>Can only run on Kubernetes clusters</li>
</ul>

<h2 id="temporal">Temporal</h2>
<p>While Airflow, Prefect and Argo Workflows focus primarily on data workflow orchestration, Temporal is a more general-purpose workflow tool.</p>
<ul>
  <li>Workflows are defined in the same programming language as the tasks themselves</li>
  <li>Provides robust features for state management, retries, and long-running processes</li>
  <li>Requires more investment in learning - has a steep learning curve</li>
</ul>

<h2 id="summary">Summary</h2>
<p>As we can see, each tool has its own use cases. We need to select the tool that is most suitable for our work.</p>
<ul>
  <li>If our tasks are about data processing, we should probably consider Airflow, Prefect or Argo Workflows.</li>
  <li>If our tasks already run on a Kubernetes cluster, Argo Workflows might be the choice for ease of deployment.</li>
  <li>If our tasks are more general and require high robustness and reliability, Temporal is likely the better fit.</li>
  <li>If our workflows are complex, a tool that uses a programming language instead of YAML files might be more suitable.</li>
  <li>If our workflows are simple and we do not want to invest too much time in learning, an easy-to-use tool might be the best choice.</li>
</ul>]]></content><author><name>Sean Ma</name></author><category term="DevOps" /><category term="Orchestration" /><summary type="html"><![CDATA[Nowadays there are many tools for task orchestration. Some popular ones include Airflow, Prefect, Argo Workflows, and Temporal. Now the question is which tool should I use in my team?]]></summary></entry><entry><title type="html">Make python loops 5x to 10x faster using numba</title><link href="https://seanslma.github.io/numba-perf/" rel="alternate" type="text/html" title="Make python loops 5x to 10x faster using numba" /><published>2024-11-26T00:00:00+00:00</published><updated>2024-11-26T00:00:00+00:00</updated><id>https://seanslma.github.io/numba-perf</id><content type="html" xml:base="https://seanslma.github.io/numba-perf/"><![CDATA[<p>Numba is a just-in-time (JIT) compiler for python that translates python code into highly optimized machine code at runtime. It can significantly improve the performance of numerical computations by enabling high-performance execution of functions, particularly those that make heavy use of numpy arrays.</p>

<p>Here we will first briefly explain key features of numba and when to use it, and then provide an example demonstrating how to accelerate code performance by leveraging various numba features. If you are already familiar with numba, go directly to the third section about the demonstration.</p>

<h2 id="key-features-of-numba">Key features of numba</h2>
<ul>
  <li><strong>JIT compilation</strong>: Numba compiles python functions into machine code, allowing for efficient code generation tailored to specific hardware and data types.</li>
  <li><strong>Numerical acceleration</strong>: Numba is particularly well-suited for numerical computations involving arrays and mathematical operations. It can often achieve performance comparable to compiled languages like C or Fortran.</li>
  <li><strong>Compatibility with numpy</strong>: Numba integrates seamlessly with numpy, accelerating numpy functions and operations.</li>
  <li><strong>Parallel computing</strong>: Numba supports parallel execution on multi-core CPUs and GPUs, enabling us to leverage the power of parallel hardware to speed up computations.</li>
  <li><strong>Custom UDFs</strong>: We can create custom user-defined functions (UDFs) in numba and use them within our python code. These UDFs can be compiled and optimized for performance.</li>
</ul>

<h2 id="when-to-use-and-to-avoid-numba">When to use and to avoid numba</h2>
<p>Numba is particularly well-suited for numerical computations involving arrays and mathematical operations. Here are some specific cases where we should consider using numba:</p>
<ul>
  <li><strong>Array operations</strong>: If our code heavily involves operations on numpy arrays, such as element-wise arithmetic, matrix multiplication, or reductions, numba can significantly accelerate these computations.</li>
  <li><strong>Mathematical functions</strong>: Numba can optimize calls to mathematical functions like <code class="language-plaintext highlighter-rouge">sin</code>, <code class="language-plaintext highlighter-rouge">cos</code>, <code class="language-plaintext highlighter-rouge">exp</code>, and <code class="language-plaintext highlighter-rouge">log</code>, providing a performance boost compared to their python counterparts.</li>
  <li><strong>Custom functions</strong>: If we have custom functions that perform numerical calculations, numba can compile them into machine code for improved efficiency.</li>
  <li><strong>Loops</strong>: Numba can often optimize loops that iterate over arrays or perform numerical calculations within the loop body.</li>
</ul>

<p>However, not all python code can be optimized using numba and thus improve the performance. There are some limitations to consider before using numba:</p>
<ul>
  <li><strong>I/O bound operations</strong>: Numba will not help much with operations that are I/O bound, such as reading/writing files or network operations.</li>
  <li><strong>Dynamic python features</strong>: If our code relies heavily on python’s dynamic features (like modifying functions at runtime), numba may not be suitable, as it works best with statically typed, straightforward code.</li>
  <li><strong>Non-numerical code</strong>: For code that does not involve numerical calculations or array manipulations, other optimization techniques may be more appropriate.</li>
  <li><strong>Numba can introduce overhead</strong>: If we are working with small datasets or functions that run very quickly, the overhead of JIT compilation might outweigh the performance benefits.</li>
</ul>

<p>To determine whether numba is appropriate for our use case, we can:</p>
<ul>
  <li><strong>Profile our code:</strong> Use profiling tools to identify the bottlenecks in our code and see if they involve numerical computations.</li>
  <li><strong>Try numba and measure the performance:</strong> Experiment with numba and compare the performance of our code with and without numba.</li>
  <li><strong>Consider the trade-offs:</strong> Weigh the potential performance benefits against the overhead and limitations of using numba.</li>
</ul>

<p>Overall, if we have numerical or scientific computations that need to be optimized, numba is a powerful tool that can lead to significant performance improvements with minimal code changes.</p>

<h2 id="data-for-testing-demonstration">Data for the demonstration</h2>
<p>Let’s create a 2D numpy array filled with randomly generated data. Each row represents a scenario, and we will calculate the distance between every pair of scenarios.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">11</span><span class="p">)</span>
<span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="initial-version">Initial version</h2>
<p>We calculate the distance between two scenarios using the 1-norm, which measures the sum of the absolute differences between corresponding elements.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">calculate_distances1</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">n</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">dist_arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
            <span class="n">v</span> <span class="o">=</span> <span class="mf">0.0</span>
            <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
                 <span class="n">v</span> <span class="o">+=</span> <span class="nb">abs</span><span class="p">(</span><span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">k</span><span class="p">]</span> <span class="o">-</span> <span class="n">arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">k</span><span class="p">])</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
    <span class="k">return</span> <span class="n">dist_arr</span>
<span class="c1"># 2.68 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>The run time is about 2.68 seconds for 100 scenarios. Since the pairwise loop scales quadratically with the number of scenarios, 1000 scenarios would take roughly 100 times longer - about 268 seconds. That is too slow and we must improve the performance.</p>

<h2 id="using-numpy-function">Using numpy function</h2>
<p>Here we update the code to calculate the 1-norm using the numpy function <code class="language-plaintext highlighter-rouge">np.linalg.norm()</code>.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">calculate_distances2</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">dist_arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">arr</span><span class="p">[</span><span class="n">j</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">dist_arr</span>
<span class="c1"># 40.7 ms ± 1.50 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>Now the run time is 40.7 ms - about <code class="language-plaintext highlighter-rouge">65x</code> faster! As the numpy function is implemented in C, it is no surprise that the performance improved significantly.</p>

<h2 id="using-numbanjit">Using numba.njit</h2>
<p>Can we improve the performance further? Yes, with numba we definitely can.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">numba</span> <span class="kn">import</span> <span class="n">njit</span>
<span class="o">@</span><span class="n">njit</span><span class="p">(</span><span class="n">cache</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_distances3</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="p">...</span>
<span class="c1"># 10.9 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>Simply applying numba to the numpy function already yields a <code class="language-plaintext highlighter-rouge">4x</code> performance improvement.</p>

<p>The numba <code class="language-plaintext highlighter-rouge">njit</code> decorator is used to compile the python function to optimized machine code in nopython mode. We can also use the <code class="language-plaintext highlighter-rouge">jit</code> decorator, which allows the function to fall back to the original python implementation if numba cannot compile it.</p>

<p>When we set <code class="language-plaintext highlighter-rouge">cache=True</code>, numba stores the compiled function in a cache on disk. So the next time we execute the script, it can load the precompiled function, avoiding the overhead of recompilation.</p>

<h2 id="using-numbanjit-with-data-types">Using numba.njit with data types</h2>
<p>Can we do better? Yes, by adding a numba data type signature.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">njit</span><span class="p">(</span><span class="s">'float64[:,::1](float64[:,::1])'</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_distances4</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="p">...</span>
<span class="c1"># 10.5 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>We explicitly set the data types of the input parameters and the output. In this case, there is only a minor performance improvement, most likely because numba can already infer the data types without an explicit signature. More details about numba data type signatures can be found in the numba documentation (see the References section).</p>

<p>In general, by specifying data types, numba can generate more efficient machine code. Knowing the exact types allows it to optimize the generated code for those types, leading to faster execution and better memory management.</p>

<h2 id="replacing-numpy-function-with-a-python-loop">Replacing numpy function with a python loop</h2>
<p>As numba is good at loops, here we replace the numpy function with a <code class="language-plaintext highlighter-rouge">python loop</code> to further boost performance.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">njit</span><span class="p">(</span><span class="s">'float64[:,::1](float64[:,::1])'</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_distances5</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">n</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">dist_arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
            <span class="n">v</span> <span class="o">=</span> <span class="mf">0.0</span>
            <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
                 <span class="n">v</span> <span class="o">+=</span> <span class="nb">abs</span><span class="p">(</span><span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">k</span><span class="p">]</span> <span class="o">-</span> <span class="n">arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">k</span><span class="p">])</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
    <span class="k">return</span> <span class="n">dist_arr</span>
<span class="c1"># 8.20 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>Numba is indeed good for loops. There is a <code class="language-plaintext highlighter-rouge">1.2x</code> performance improvement now, and it’s about <code class="language-plaintext highlighter-rouge">5x</code> faster than the numpy version.</p>

<h2 id="using-numbanjit-parallel-mode">Using numba.njit parallel mode</h2>
<p>Modern computers often have multiple cores. By leveraging parallel computing, we can significantly reduce execution time.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">numba</span> <span class="kn">import</span> <span class="n">njit</span><span class="p">,</span> <span class="n">prange</span>
<span class="o">@</span><span class="n">njit</span><span class="p">(</span><span class="s">'float64[:,::1](float64[:,::1])'</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">parallel</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">nogil</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">calculate_distances6</span><span class="p">(</span><span class="n">arr</span><span class="p">):</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">n</span> <span class="o">=</span> <span class="n">arr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">dist_arr</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">prange</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
            <span class="n">v</span> <span class="o">=</span> <span class="mf">0.0</span>
            <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
                 <span class="n">v</span> <span class="o">+=</span> <span class="nb">abs</span><span class="p">(</span><span class="n">arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">k</span><span class="p">]</span> <span class="o">-</span> <span class="n">arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">k</span><span class="p">])</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
            <span class="n">dist_arr</span><span class="p">[</span><span class="n">j</span><span class="p">,</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span>
    <span class="k">return</span> <span class="n">dist_arr</span>
<span class="c1"># 3.68 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</span></code></pre></div></div>
<p>Here we update the code to use numba <code class="language-plaintext highlighter-rouge">parallel mode</code> with the help of the <code class="language-plaintext highlighter-rouge">prange</code> function.</p>

<p>By setting <code class="language-plaintext highlighter-rouge">parallel=True</code>, numba’s JIT compiler will analyze the function’s code and automatically identify opportunities for parallelization, especially within loops. However, using <code class="language-plaintext highlighter-rouge">prange</code> provides more explicit control over parallelization and can be more effective in certain cases.</p>

<p>Finally, the run time is 3.68 ms (on 4 CPU cores). That is about <code class="language-plaintext highlighter-rouge">10x</code> faster than the numpy function version without numba.njit (40.7 ms), and about <code class="language-plaintext highlighter-rouge">700x</code> faster than the raw python code (2.68 seconds).</p>

<h2 id="references">References</h2>
<ul>
  <li><a href="https://numba.pydata.org/numba-doc/dev/reference/types.html">Numba data type signature</a></li>
  <li><a href="https://stackoverflow.com/questions/66205186/python-signature-with-numba">Numba data type signature caveats</a></li>
  <li><a href="https://pythonspeed.com/articles/slow-numba">Optimizing python loops using numba</a></li>
</ul>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Numba" /><category term="Performance" /><summary type="html"><![CDATA[Numba is a just-in-time (JIT) compiler for python that translates python code into highly optimized machine code at runtime. It can significantly improve the performance of numerical computations by enabling high-performance execution of functions, particularly those that make heavy use of numpy arrays.]]></summary></entry><entry><title type="html">The pandas function pd.read_sql returns an empty DataFrame without correct data types</title><link href="https://seanslma.github.io/read-sql-data-type/" rel="alternate" type="text/html" title="The pandas function pd.read_sql returns an empty DataFrame without correct data types" /><published>2024-09-06T00:00:00+00:00</published><updated>2024-09-06T00:00:00+00:00</updated><id>https://seanslma.github.io/read-sql-data-type</id><content type="html" xml:base="https://seanslma.github.io/read-sql-data-type/"><![CDATA[<p>Here we provide a solution to an issue you might run into.</p>

<p>When querying data from databases such as MS SQL Server via the Driver <code class="language-plaintext highlighter-rouge">pyodbc</code>, we can conveniently get the data as a pandas DataFrame by using <code class="language-plaintext highlighter-rouge">pd.read_sql</code>. Generally, the driver provides information about the column names, data types, and other metadata associated with the result set. However, when the query result set is empty, the data type information is not available and pandas returns an <code class="language-plaintext highlighter-rouge">empty</code> DataFrame with all column types as <code class="language-plaintext highlighter-rouge">object</code>.</p>

<p>An empty DataFrame with wrong data types can cause issues in your Python code. If you do not check whether the returned DataFrame is empty, your code can crash in many cases, such as when extracting the year from a datetime column or aggregating float columns. Here we explain how to get the data type information in this situation when using <code class="language-plaintext highlighter-rouge">sqlalchemy</code> to create the query.</p>
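<p>A minimal demonstration of the problem, using plain empty DataFrames (no database needed):</p>

```python
import pandas as pd

# An empty frame read back without dtype information: the column is object
df = pd.DataFrame({'start_date': pd.Series([], dtype='object')})
try:
    df['start_date'].dt.year  # .dt only works on datetime-like columns
except AttributeError as e:
    print('crashed:', e)

# The same empty frame with the correct dtype works fine
df = pd.DataFrame({'start_date': pd.Series([], dtype='datetime64[ns]')})
print(df['start_date'].dt.year.tolist())  # []
```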

<h2 id="get-the-data-types-from-the-query-statement">Get the data types from the query statement</h2>
<p>There are multiple approaches to get the data types, but each of them is only available in certain situations.</p>

<h3 id="option-1-resultcursordescription">Option 1: <code class="language-plaintext highlighter-rouge">result.cursor.description</code></h3>
<p>The first approach is using the <code class="language-plaintext highlighter-rouge">.description</code> attribute of the database <code class="language-plaintext highlighter-rouge">cursor</code> object:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sqlalchemy</span>

<span class="n">conn_string</span> <span class="o">=</span> <span class="sa">f</span><span class="s">'mssql+pyodbc://</span><span class="si">{</span><span class="n">username</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">pwd</span><span class="si">}</span><span class="s">@</span><span class="si">{</span><span class="n">server_name</span><span class="si">}</span><span class="s">'</span>
<span class="n">conn_string</span> <span class="o">+=</span> <span class="sa">f</span><span class="s">'/</span><span class="si">{</span><span class="n">database_name</span><span class="si">}</span><span class="s">?driver=ODBC+Driver+17+for+SQL+Server'</span>
<span class="n">engine</span> <span class="o">=</span> <span class="n">sqlalchemy</span><span class="p">.</span><span class="n">create_engine</span><span class="p">(</span><span class="n">conn_string</span><span class="p">)</span>
<span class="n">connection</span> <span class="o">=</span> <span class="n">engine</span><span class="p">.</span><span class="n">connect</span><span class="p">()</span>

<span class="n">query</span> <span class="o">=</span> <span class="s">'SELECT ID, Name, Price, StartDate FROM sales.Product;'</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">connection</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">cursor</span><span class="p">.</span><span class="n">description</span><span class="p">)</span>
</code></pre></div></div>

<p>The output will be something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(
    ('ID', &lt;class 'int'&gt;, None, 10, 10, 0, False),
    ('Name', &lt;class 'str'&gt;, None, 50, 50, 0, False),
    ('Price', &lt;class 'decimal.Decimal'&gt;, None, 19, 19, 4, False),
    ('StartDate', &lt;class 'datetime.datetime'&gt;, None, 23, 23, 3, False),
)
</code></pre></div></div>
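<p>The second element of each tuple in the description is the Python type of the column, so a column-to-type mapping can be built from it. A sketch using a hard-coded description tuple in place of a live cursor:</p>

```python
import datetime
import decimal

# Hypothetical description tuple, shaped like pyodbc's cursor.description:
# (name, type_code, display_size, internal_size, precision, scale, null_ok)
description = (
    ('ID', int, None, 10, 10, 0, False),
    ('Name', str, None, 50, 50, 0, False),
    ('Price', decimal.Decimal, None, 19, 19, 4, False),
    ('StartDate', datetime.datetime, None, 23, 23, 3, False),
)

# Map each column name to its Python type
col_types = {name: typ for name, typ, *_ in description}
print(col_types['Price'])  # <class 'decimal.Decimal'>
```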

<h3 id="option-2-querystatementselected_columns">Option 2: <code class="language-plaintext highlighter-rouge">query.statement.selected_columns</code></h3>
<p>If the first approach does not work and you use <code class="language-plaintext highlighter-rouge">sqlalchemy</code> to create the query, you should still be able to get the data types.</p>

<p>Assume we defined the <code class="language-plaintext highlighter-rouge">sales.Product</code> Table as:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sqlalchemy.orm</span> <span class="kn">import</span> <span class="n">declarative_base</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">create_engine</span><span class="p">,</span> <span class="n">Column</span><span class="p">,</span> <span class="n">DateTime</span><span class="p">,</span> <span class="n">DECIMAL</span><span class="p">,</span> <span class="n">Integer</span><span class="p">,</span> <span class="n">Unicode</span>

<span class="c1"># Base class that holds the metadata about the tables
</span><span class="n">Base</span> <span class="o">=</span> <span class="n">declarative_base</span><span class="p">()</span>

<span class="c1"># A declarative class for Table `sales.Product` by inheriting from the Base class
</span><span class="k">class</span> <span class="nc">Product</span><span class="p">(</span><span class="n">Base</span><span class="p">):</span>
    <span class="n">__tablename__</span> <span class="o">=</span> <span class="s">'Product'</span>
    <span class="n">__table_args__</span> <span class="o">=</span> <span class="p">{</span><span class="s">'schema'</span><span class="p">:</span> <span class="s">'sales'</span><span class="p">}</span>

    <span class="n">ID</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">Integer</span><span class="p">,</span> <span class="n">primary_key</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">Name</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">Unicode</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="n">nullable</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">Price</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">DECIMAL</span><span class="p">(</span><span class="mi">19</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
    <span class="n">StartDate</span> <span class="o">=</span> <span class="n">Column</span><span class="p">(</span><span class="n">DateTime</span><span class="p">)</span>


<span class="c1"># Create a SQLAlchemy engine
</span><span class="n">engine</span> <span class="o">=</span> <span class="n">create_engine</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">database_connection_url</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>

<span class="c1"># Create tables in the database
</span><span class="n">Base</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">create_all</span><span class="p">(</span><span class="n">engine</span><span class="p">)</span>
</code></pre></div></div>

<p>And we created the query in this way:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">sql</span><span class="p">,</span> <span class="n">types</span>
<span class="kn">from</span> <span class="nn">sqlalchemy.orm</span> <span class="kn">import</span> <span class="n">Session</span>

<span class="n">session</span> <span class="o">=</span> <span class="n">Session</span><span class="p">(</span><span class="n">bind</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
<span class="n">sp</span> <span class="o">=</span> <span class="n">Product</span>
<span class="n">query</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">query</span><span class="p">(</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">ID</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'id'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">Name</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'name'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">Price</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'price'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">StartDate</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'start_date'</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Finally we can get the data types from the query:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">sqlalchemy</span><span class="p">.</span><span class="n">orm</span><span class="p">.</span><span class="n">query</span><span class="p">.</span><span class="n">Query</span><span class="p">):</span>
    <span class="n">dtype</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">c</span><span class="p">.</span><span class="n">name</span><span class="p">:</span> <span class="n">c</span><span class="p">.</span><span class="nb">type</span><span class="p">.</span><span class="n">__class__</span><span class="p">.</span><span class="n">__name__</span>
        <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">query</span><span class="p">.</span><span class="n">statement</span><span class="p">.</span><span class="n">selected_columns</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">dtype</code> is a dictionary with column names as keys and data types as values:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dtype</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'id'</span><span class="p">:</span> <span class="s">'Integer'</span><span class="p">,</span>
    <span class="s">'name'</span><span class="p">:</span> <span class="s">'Unicode'</span><span class="p">,</span>
    <span class="s">'price'</span><span class="p">:</span> <span class="s">'DECIMAL'</span><span class="p">,</span>
    <span class="s">'start_date'</span><span class="p">:</span> <span class="s">'DateTime'</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note that the items in <code class="language-plaintext highlighter-rouge">query.statement.selected_columns</code> can have different types, such as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- &lt;class 'sqlalchemy.sql.elements.Label'&gt;
- &lt;class 'sqlalchemy.sql.elements.Cast'&gt;
- &lt;class 'sqlalchemy.sql.annotation.AnnotatedColumn'&gt;
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">Cast</code> class is from the <code class="language-plaintext highlighter-rouge">sql.func.cast</code> function:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">query</span><span class="p">(</span>
    <span class="n">sql</span><span class="p">.</span><span class="n">func</span><span class="p">.</span><span class="n">cast</span><span class="p">(</span><span class="n">sp</span><span class="p">.</span><span class="n">Price</span><span class="p">,</span> <span class="n">types</span><span class="p">.</span><span class="n">Float</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>However, the <code class="language-plaintext highlighter-rouge">Cast</code> class does not have the <code class="language-plaintext highlighter-rouge">name</code> property. To fix the issue we have to convert the <code class="language-plaintext highlighter-rouge">Cast</code> column to a <code class="language-plaintext highlighter-rouge">Label</code> column:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">query</span><span class="p">(</span>
    <span class="n">sql</span><span class="p">.</span><span class="n">func</span><span class="p">.</span><span class="n">cast</span><span class="p">(</span><span class="n">sp</span><span class="p">.</span><span class="n">Price</span><span class="p">,</span> <span class="n">types</span><span class="p">.</span><span class="n">Float</span><span class="p">).</span><span class="n">label</span><span class="p">(</span><span class="s">'price'</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="pass-the-data-type-information-to-pdread_sql">Pass the data type information to <code class="language-plaintext highlighter-rouge">pd.read_sql</code></h2>
<p>There is a parameter <code class="language-plaintext highlighter-rouge">dtype</code> in <code class="language-plaintext highlighter-rouge">pandas.read_sql(..., dtype=None)</code> that can be used to pass the data types for the query results.</p>

<p>Note that in the previous section the extracted data types are the types defined in <code class="language-plaintext highlighter-rouge">sqlalchemy</code>. We need to convert them to the types that can be used in <code class="language-plaintext highlighter-rouge">pandas</code>. Here we provide a mapping for most of the data types:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sa_to_pd_dtype</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">'BigInteger'</span><span class="p">:</span> <span class="s">'int64'</span><span class="p">,</span>
    <span class="s">'BIT'</span><span class="p">:</span> <span class="s">'bool'</span><span class="p">,</span>
    <span class="s">'Boolean'</span><span class="p">:</span> <span class="s">'bool'</span><span class="p">,</span>
    <span class="s">'Date'</span><span class="p">:</span> <span class="s">'datetime64[ns]'</span><span class="p">,</span>
    <span class="s">'DateTime'</span><span class="p">:</span> <span class="s">'datetime64[ns]'</span><span class="p">,</span>
    <span class="s">'DECIMAL'</span><span class="p">:</span> <span class="s">'float'</span><span class="p">,</span>
    <span class="s">'Enum'</span><span class="p">:</span> <span class="s">'category'</span><span class="p">,</span>
    <span class="s">'Float'</span><span class="p">:</span> <span class="s">'float'</span><span class="p">,</span>
    <span class="s">'Integer'</span><span class="p">:</span> <span class="s">'int64'</span><span class="p">,</span>
    <span class="s">'Interval'</span><span class="p">:</span> <span class="s">'timedelta64'</span><span class="p">,</span>
    <span class="s">'LargeBinary'</span><span class="p">:</span> <span class="s">'str'</span><span class="p">,</span>
    <span class="s">'Numeric'</span><span class="p">:</span> <span class="s">'float'</span><span class="p">,</span>
    <span class="s">'SmallInteger'</span><span class="p">:</span> <span class="s">'int16'</span><span class="p">,</span>
    <span class="s">'String'</span><span class="p">:</span> <span class="s">'str'</span><span class="p">,</span>
    <span class="s">'Time'</span><span class="p">:</span> <span class="s">'datetime64[ns]'</span><span class="p">,</span>
    <span class="s">'TIMESTAMP'</span><span class="p">:</span> <span class="s">'datetime64[ns]'</span><span class="p">,</span>
    <span class="s">'Unicode'</span><span class="p">:</span> <span class="s">'str'</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And we set the data types when extracting the data using <code class="language-plaintext highlighter-rouge">pd.read_sql</code>:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dtype</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">col</span><span class="p">:</span> <span class="n">sa_to_pd_dtype</span><span class="p">[</span><span class="n">typ</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">col</span><span class="p">,</span> <span class="n">typ</span> <span class="ow">in</span> <span class="n">dtype</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">typ</span> <span class="ow">in</span> <span class="n">sa_to_pd_dtype</span>
<span class="p">}</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql</span><span class="p">(...,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span><span class="p">)</span>
</code></pre></div></div>
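<p>What the <code class="language-plaintext highlighter-rouge">dtype</code> parameter achieves for an empty result set can be sketched without a database, by building an empty DataFrame with the mapped pandas types:</p>

```python
import pandas as pd

# Hypothetical mapped dtypes, as produced by the sa_to_pd_dtype lookup above
dtype = {'id': 'int64', 'price': 'float', 'start_date': 'datetime64[ns]'}

# An empty frame built with these dtypes keeps the correct column types,
# just like pd.read_sql(..., dtype=dtype) does for an empty query result
df = pd.DataFrame({col: pd.Series([], dtype=typ) for col, typ in dtype.items()})
print(df.dtypes.astype(str).to_dict())
```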

<h2 id="why-did-i-get-nulltype-for-some-data-columns">Why did I get <code class="language-plaintext highlighter-rouge">NullType</code> for some data columns?</h2>
<p>Assume the previous query has been changed to:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">query</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">query</span><span class="p">(</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">ID</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'id'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">Name</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'name'</span><span class="p">),</span>
    <span class="n">sp</span><span class="p">.</span><span class="n">Price</span><span class="p">.</span><span class="n">label</span><span class="p">(</span><span class="s">'price'</span><span class="p">),</span>
    <span class="n">sql</span><span class="p">.</span><span class="n">func</span><span class="p">.</span><span class="n">dateadd</span><span class="p">(</span>
        <span class="n">sql</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="s">'day'</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">sp</span><span class="p">.</span><span class="n">StartDate</span>
    <span class="p">).</span><span class="n">label</span><span class="p">(</span><span class="s">'actual_start_date'</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>
<p>In this case, the data type for the column <code class="language-plaintext highlighter-rouge">actual_start_date</code> will be <code class="language-plaintext highlighter-rouge">NullType</code> instead of <code class="language-plaintext highlighter-rouge">DateTime</code>.</p>

<p>By digging into the <code class="language-plaintext highlighter-rouge">sqlalchemy</code> documentation, we find that this is caused by <code class="language-plaintext highlighter-rouge">sql.func.dateadd</code>: for functions that SQLAlchemy does not recognize, the return type defaults to <code class="language-plaintext highlighter-rouge">NullType</code>. Other functions, such as <code class="language-plaintext highlighter-rouge">sql.func.rtrim</code>, <code class="language-plaintext highlighter-rouge">sql.func.replace</code>, <code class="language-plaintext highlighter-rouge">sql.func.year</code>, <code class="language-plaintext highlighter-rouge">sql.func.avg</code> and <code class="language-plaintext highlighter-rouge">sql.func.round</code>, can also lead to <code class="language-plaintext highlighter-rouge">NullType</code>.</p>

<p>To fix the issue, we need to pass the return type directly to the function via the <code class="language-plaintext highlighter-rouge">type_</code> parameter:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sql</span><span class="p">.</span><span class="n">func</span><span class="p">.</span><span class="n">dateadd</span><span class="p">(</span>
    <span class="n">sql</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="s">'day'</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">sp</span><span class="p">.</span><span class="n">StartDate</span><span class="p">,</span> <span class="n">type_</span><span class="o">=</span><span class="n">types</span><span class="p">.</span><span class="n">DateTime</span>
<span class="p">).</span><span class="n">label</span><span class="p">(</span><span class="s">'actual_start_date'</span><span class="p">)</span>
</code></pre></div></div>
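<p>This default-to-<code class="language-plaintext highlighter-rouge">NullType</code> behavior can be checked directly on the expression objects, without running a query. Below is a minimal sketch; the standalone <code class="language-plaintext highlighter-rouge">Column</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">sp.StartDate</code>:</p>

```python
from sqlalchemy import Column, DateTime, text, types
from sqlalchemy.sql import func

# Hypothetical standalone column standing in for sp.StartDate
start_date = Column('StartDate', DateTime)

# dateadd is not a function SQLAlchemy knows, so its type defaults to NullType
expr = func.dateadd(text('day'), 1, start_date)
print(type(expr.type).__name__)  # NullType

# Passing type_ explicitly gives the expression the intended return type
typed_expr = func.dateadd(text('day'), 1, start_date, type_=types.DateTime)
print(type(typed_expr.type).__name__)  # DateTime
```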

<h2 id="reference">Reference</h2>

<ul>
  <li><a href="https://stackoverflow.com/questions/64761911/sqlalchemy-accessing-column-types-from-query-results">SQLAlchemy accessing column types from query results</a></li>
  <li><a href="https://stackoverflow.com/questions/2258072/sqlalchemy-getting-column-data-types-of-query-results">SQLAlchemy getting column data types of query results</a></li>
  <li><a href="https://docs.sqlalchemy.org/en/gerrit/3941/core/functions.html">SQL and Generic Functions</a></li>
</ul>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="SQL" /><category term="Data Type" /><category term="Pandas" /><summary type="html"><![CDATA[We provide a solution to an issue you might encounter.]]></summary></entry><entry><title type="html">Read CSV files 10x to 40x faster using pyarrow and polars</title><link href="https://seanslma.github.io/read-csv-perf/" rel="alternate" type="text/html" title="Read CSV files 10x to 40x faster using pyarrow and polars" /><published>2024-06-13T00:00:00+00:00</published><updated>2024-06-13T00:00:00+00:00</updated><id>https://seanslma.github.io/read-csv-perf</id><content type="html" xml:base="https://seanslma.github.io/read-csv-perf/"><![CDATA[<p>CSV (comma-separated values) files are widely used across many areas. They can be easily exported from almost any programming language, and loaded into any text editor and many other applications. Their main disadvantages, however, are that CSV files are usually larger than files in other formats and that they are slow to load into memory.</p>

<p>Here we compare different options for reading CSV files using the <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">polars</code> and <code class="language-plaintext highlighter-rouge">pyarrow</code> Python packages. We test the loading performance on CSV files, each with a different data type. Based on the test results, we can determine which option to use when we need to read CSV files faster.</p>

<h2 id="creating-test-data">Creating test data</h2>
<p>CSV files with three data types, <code class="language-plaintext highlighter-rouge">string</code>, <code class="language-plaintext highlighter-rouge">float</code>, and <code class="language-plaintext highlighter-rouge">datetime</code>, were used to test file-reading performance. All the test CSV files were created using the scripts in <a href="https://medium.com/@sean.lma/how-to-create-dummy-pandas-dataframes-for-testing-cf03c52878e3">my previous post</a>; each CSV file has 10 million rows, three columns of the same data type, and a size of about 500 MB.</p>

<p>The <code class="language-plaintext highlighter-rouge">string</code> type CSV file was created with:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_str</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="mi">10000000</span><span class="p">,</span>
    <span class="n">str_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'c1'</span><span class="p">,</span> <span class="s">'c2'</span><span class="p">,</span> <span class="s">'c3'</span><span class="p">],</span>
        <span class="s">'str_len'</span><span class="p">:</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">15</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">50</span><span class="p">)],</span>
        <span class="s">'str_count'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1000</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">100</span><span class="p">],</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df_str</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">float</code> type CSV file was created with:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_flt</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="mi">10000000</span><span class="p">,</span>
    <span class="n">float_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'c1'</span><span class="p">,</span> <span class="s">'c2'</span><span class="p">,</span> <span class="s">'c3'</span><span class="p">],</span>
        <span class="s">'low'</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">100</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
        <span class="s">'high'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mf">1e5</span><span class="p">],</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df_flt</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">datetime</code> type CSV file was created with:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df_dts</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="mi">10000000</span><span class="p">,</span>
    <span class="n">ts_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'c1'</span><span class="p">,</span> <span class="s">'c2'</span><span class="p">,</span> <span class="s">'c3'</span><span class="p">],</span>
        <span class="s">'start_date'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2020-01-01'</span><span class="p">,</span> <span class="s">'2021-01-01'</span><span class="p">,</span> <span class="s">'2022-01-01'</span><span class="p">],</span>
        <span class="s">'end_date'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2021-01-01'</span><span class="p">,</span> <span class="s">'2022-01-01'</span><span class="p">,</span> <span class="s">'2023-01-01'</span><span class="p">],</span>
        <span class="s">'freq'</span><span class="p">:</span> <span class="s">'s'</span><span class="p">,</span>
        <span class="s">'random'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">)</span>
<span class="n">df_dts</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
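<p><code class="language-plaintext highlighter-rouge">gen_rand_df</code> is defined in the previous post linked above. To give a rough idea of what it does, a minimal stand-in for the <code class="language-plaintext highlighter-rouge">float</code> case might look like the sketch below; the function name and parameter layout here mirror the call above but are otherwise hypothetical:</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def gen_float_df(nrow, names, low, high):
    # One uniformly distributed float column per name, within [low, high)
    return pd.DataFrame({
        name: rng.uniform(lo, hi, nrow)
        for name, lo, hi in zip(names, low, high)
    })

# A tiny version of the float test data (the real files use nrow=10_000_000)
df_flt = gen_float_df(4, ['c1', 'c2', 'c3'], [0, -100, 0], [1, 100, 1e5])
print(df_flt)
```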

<h2 id="reading-csv-files-using-pandas">Reading CSV files using <code class="language-plaintext highlighter-rouge">pandas</code></h2>
<p>In pandas, three parsers are available for reading CSV files (<code class="language-plaintext highlighter-rouge">python</code>, <code class="language-plaintext highlighter-rouge">c</code>, and <code class="language-plaintext highlighter-rouge">pyarrow</code>); the parser is selected via the <code class="language-plaintext highlighter-rouge">engine</code> parameter. There are also two backends (<code class="language-plaintext highlighter-rouge">dtype_backend</code>: <code class="language-plaintext highlighter-rouge">numpy_nullable</code> and <code class="language-plaintext highlighter-rouge">pyarrow</code>) for storing the data. We will check the performance of combinations of the different parsers and backends.</p>

<p>The data types passed to the functions are a dictionary like this: <code class="language-plaintext highlighter-rouge">dtype = {'c1': type, 'c2': type, 'c3': type}</code>.</p>
<ul>
  <li>For <code class="language-plaintext highlighter-rouge">string</code> values the type is <code class="language-plaintext highlighter-rouge">str</code>. Two pyarrow-based string data types are also available: <code class="language-plaintext highlighter-rouge">pd.ArrowDtype(pa.string())</code> (dtype_pa) and <code class="language-plaintext highlighter-rouge">string[pyarrow]</code> (dtype_pa_str2); the latter is pandas&rsquo; nullable string dtype with pyarrow storage.</li>
  <li>For <code class="language-plaintext highlighter-rouge">float</code> values the type is <code class="language-plaintext highlighter-rouge">float</code> and <code class="language-plaintext highlighter-rouge">float64[pyarrow]</code>, for <code class="language-plaintext highlighter-rouge">numpy_nullable</code> and <code class="language-plaintext highlighter-rouge">pyarrow</code> backends respectively.</li>
  <li>For <code class="language-plaintext highlighter-rouge">datetime</code> values the types are <code class="language-plaintext highlighter-rouge">datetime64[s]</code> and <code class="language-plaintext highlighter-rouge">pd.ArrowDtype(pa.timestamp('s'))</code>. Notice that, when using pandas datetime data types such as <code class="language-plaintext highlighter-rouge">datetime64[s]</code>, the datetime columns must be passed to the function separately, while with the <code class="language-plaintext highlighter-rouge">pyarrow</code> data types all columns can be passed to the function in the same format.</li>
</ul>

<p>The following options are tested:</p>
<ul>
  <li>c + numpy_nullable + dtype_str + astype
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_str</span>
<span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">dtype</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>c + numpy_nullable + dtype</p>

    <p>For <code class="language-plaintext highlighter-rouge">string/float</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span>
<span class="p">)</span>
</code></pre></div>    </div>
    <p>For <code class="language-plaintext highlighter-rouge">datetime</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span>
    <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s">'c1'</span><span class="p">,</span><span class="s">'c2'</span><span class="p">,</span><span class="s">'c3'</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>c + pyarrow + dtype</p>

    <p>For <code class="language-plaintext highlighter-rouge">string/float</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span>
<span class="p">)</span>
</code></pre></div>    </div>
    <p>For <code class="language-plaintext highlighter-rouge">datetime</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span>
    <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s">'c1'</span><span class="p">,</span><span class="s">'c2'</span><span class="p">,</span><span class="s">'c3'</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>c + pyarrow + dtype_pa
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_pa</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>pyarrow + numpy_nullable + dtype</p>

    <p>For <code class="language-plaintext highlighter-rouge">string/float</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span>
<span class="p">)</span>
</code></pre></div>    </div>
    <p>For <code class="language-plaintext highlighter-rouge">datetime</code>:</p>
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">,</span>
    <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s">'c1'</span><span class="p">,</span><span class="s">'c2'</span><span class="p">,</span><span class="s">'c3'</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow + dtype
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow + string[pyarrow]
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_pa_str2</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow + dtype_pa
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_pa</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow + dtype_pa + to numpy_nullable
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">dtype_pa</span>
<span class="p">).</span><span class="n">convert_dtypes</span><span class="p">(</span><span class="n">dtype_backend</span><span class="o">=</span><span class="s">'numpy_nullable'</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>pyarrow + pyarrow
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="nb">file</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'pyarrow'</span><span class="p">,</span> <span class="n">dtype_backend</span><span class="o">=</span><span class="s">'pyarrow'</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
</ul>

<p>The performance results for these options are as follows:</p>
<div class="scroll">
  <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                                         str    float  datetime performance_order_for_float
c       + numpy_nullable + dtype_str + astype            3.93s  18.2s  18.5s    10
c       + numpy_nullable + dtype                         3.88s  3.29s  15.4s     6
c       + pyarrow        + dtype                         3.27s  3.55s  16.6s     7
c       + pyarrow        + dtype_pa                      5.17s  16.8s  53.2s     9
pyarrow + numpy_nullable + dtype                         3.50s  0.54s  1.15s     4
pyarrow + pyarrow        + dtype                         7.62s  0.50s  1.67s     3
pyarrow + pyarrow        + string[pyarrow]               4.05s  15.8s  11.1s     8
pyarrow + pyarrow        + dtype_pa                      0.39s  0.48s  0.44s     2
pyarrow + pyarrow        + dtype_pa + to numpy_nullable  2.74s  2.68s  1.64s     5
pyarrow + pyarrow                                        0.48s  0.47s  0.37s     1
</code></pre></div>  </div>
</div>

<p>Based on the test results, we can conclude that:</p>
<ul>
  <li>We can get the best performance when using <code class="language-plaintext highlighter-rouge">pyarrow</code> for the parser, backend and dtype (<code class="language-plaintext highlighter-rouge">pyarrow + pyarrow + dtype_pa</code>).</li>
  <li>The <code class="language-plaintext highlighter-rouge">pyarrow + pyarrow + dtype_pa</code> option is about 10x, 7x, and 35x faster than the default option (<code class="language-plaintext highlighter-rouge">c + numpy_nullable + dtype</code>) for <code class="language-plaintext highlighter-rouge">string</code>, <code class="language-plaintext highlighter-rouge">float</code> and <code class="language-plaintext highlighter-rouge">datetime</code>, respectively.</li>
  <li>Compared to the <code class="language-plaintext highlighter-rouge">c</code> parser, the <code class="language-plaintext highlighter-rouge">pyarrow</code> parser is a little faster for <code class="language-plaintext highlighter-rouge">string</code>, 6x faster for <code class="language-plaintext highlighter-rouge">float</code>, and 10-14x faster for <code class="language-plaintext highlighter-rouge">datetime</code>.</li>
  <li>Using the <code class="language-plaintext highlighter-rouge">pyarrow</code> backend with the <code class="language-plaintext highlighter-rouge">c</code> parser brings no performance improvement; when the <code class="language-plaintext highlighter-rouge">pyarrow</code> dtypes are also used, performance is much worse.</li>
  <li>The <code class="language-plaintext highlighter-rouge">pd.ArrowDtype(pa.string())</code> string data type is about 10x faster than the <code class="language-plaintext highlighter-rouge">string[pyarrow]</code> string data type.</li>
  <li>The <code class="language-plaintext highlighter-rouge">pyarrow</code> parser can automatically determine the data types without any performance loss; this is especially useful when you do not know the data types in the CSV files.</li>
</ul>

<p>We should understand that the <code class="language-plaintext highlighter-rouge">pyarrow</code> parser works in parallel (multi-threaded) mode while the <code class="language-plaintext highlighter-rouge">c</code> parser is single-threaded. Also, converting data from the <code class="language-plaintext highlighter-rouge">numpy_nullable</code> backend to the <code class="language-plaintext highlighter-rouge">pyarrow</code> backend, or vice versa, can be time-consuming.</p>

<h2 id="reading-csv-files-using-polars">Reading CSV files using <code class="language-plaintext highlighter-rouge">polars</code></h2>
<p>The <code class="language-plaintext highlighter-rouge">polars</code> package is relatively new, but it has become popular recently thanks to its speed (vectorized execution) and its memory efficiency (it is built on <code class="language-plaintext highlighter-rouge">arrow</code>). It also offers a clean, concise API and supports lazy evaluation for handling large datasets.</p>

<p>The data types passed to the <code class="language-plaintext highlighter-rouge">polars</code> functions are a dictionary like this: <code class="language-plaintext highlighter-rouge">dtypes = {'c1': dtype, 'c2': dtype, 'c3': dtype}</code>.</p>
<ul>
  <li>For <code class="language-plaintext highlighter-rouge">string</code> values the dtype is <code class="language-plaintext highlighter-rouge">pl.Utf8</code>.</li>
  <li>For <code class="language-plaintext highlighter-rouge">float</code> values the dtype is <code class="language-plaintext highlighter-rouge">pl.Float64</code>.</li>
  <li>For <code class="language-plaintext highlighter-rouge">datetime</code> values the dtype is <code class="language-plaintext highlighter-rouge">pl.Datetime</code>.</li>
</ul>

<p>The following options are tested:</p>
<ul>
  <li>default: without providing the dtypes parameter. Note that if a <code class="language-plaintext highlighter-rouge">float</code> column contains empty values, its type may be inferred as <code class="language-plaintext highlighter-rouge">string</code>, which is less robust than the type inference in <code class="language-plaintext highlighter-rouge">pyarrow.csv</code>.
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="n">pl</span>
<span class="n">pl</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>eager: the default mode, any operations are executed immediately
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>lazy: operations are not executed until you explicitly call the <code class="language-plaintext highlighter-rouge">collect()</code> method
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">).</span><span class="n">collect</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li>streaming: processes the data in batches instead of loading everything at once, which is useful for datasets that might exceed available memory
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">).</span><span class="n">collect</span><span class="p">(</span><span class="n">streaming</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>sql api eager: interact with data using familiar SQL syntax
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">SQLContext</span><span class="p">(</span>
    <span class="n">data</span><span class="o">=</span><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">)</span>
<span class="p">).</span><span class="n">execute</span><span class="p">(</span><span class="s">'select * from data'</span><span class="p">,</span> <span class="n">eager</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>sql api eager + to pandas
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">SQLContext</span><span class="p">(</span>
    <span class="n">data</span><span class="o">=</span><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">)</span>
<span class="p">).</span><span class="n">execute</span><span class="p">(</span>
    <span class="s">'select * from data'</span><span class="p">,</span> <span class="n">eager</span><span class="o">=</span><span class="bp">True</span>
<span class="p">).</span><span class="n">to_pandas</span><span class="p">(</span><span class="n">use_pyarrow_extension_array</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>sql api eager + to pandas pyarrow
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pl</span><span class="p">.</span><span class="n">SQLContext</span><span class="p">(</span>
    <span class="n">data</span><span class="o">=</span><span class="n">pl</span><span class="p">.</span><span class="n">scan_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">dtypes</span><span class="o">=</span><span class="n">dtypes</span><span class="p">)</span>
<span class="p">).</span><span class="n">execute</span><span class="p">(</span>
    <span class="s">'select * from data'</span><span class="p">,</span> <span class="n">eager</span><span class="o">=</span><span class="bp">True</span>
<span class="p">).</span><span class="n">to_pandas</span><span class="p">(</span><span class="n">use_pyarrow_extension_array</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
</ul>

<p>The tested performance results are as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                          str    float  datetime
default                                   0.52s  0.38s  0.37s
eager                                     0.46s  0.40s  0.39s
lazy                                      0.45s  0.38s  0.41s
streaming                                 0.42s  0.40s  0.42s
sql api eager                             0.46s  0.38s  0.40s
sql api eager + to pandas                 1.59s  0.47s  0.48s
sql api eager + to pandas pyarrow         0.99s  0.43s  0.45s
</code></pre></div></div>

<p>The results clearly show that:</p>
<ul>
  <li>The performance is quite consistent for all the options using <code class="language-plaintext highlighter-rouge">polars</code>.</li>
  <li>The <code class="language-plaintext highlighter-rouge">polars</code> CSV reading has a similar performance compared to <code class="language-plaintext highlighter-rouge">pandas</code> with <code class="language-plaintext highlighter-rouge">pyarrow</code>.</li>
  <li>If we need a <code class="language-plaintext highlighter-rouge">numpy_nullable</code> pandas DataFrame, <code class="language-plaintext highlighter-rouge">polars</code> can still be a better option.</li>
</ul>

<h2 id="reading-csv-files-using-pyarrowcsv">Reading CSV files using <code class="language-plaintext highlighter-rouge">pyarrow.csv</code></h2>
<p>The <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> module is the part of the <code class="language-plaintext highlighter-rouge">pyarrow</code> library dedicated to reading and writing CSV files. It processes CSV data efficiently and offers some great features, such as inferring data types during reading.</p>

<p>Here we test the performance of the <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> module with three data types in the format <code class="language-plaintext highlighter-rouge">convert_options = pv.ConvertOptions(column_types={'c1': dtype, 'c2': dtype, 'c3': dtype})</code>.</p>
<ul>
  <li>For <code class="language-plaintext highlighter-rouge">string</code> values the dtype is <code class="language-plaintext highlighter-rouge">pa.string()</code>.</li>
  <li>For <code class="language-plaintext highlighter-rouge">float</code> values the dtype is <code class="language-plaintext highlighter-rouge">pa.float64()</code>.</li>
  <li>For <code class="language-plaintext highlighter-rouge">datetime</code> values the dtype is <code class="language-plaintext highlighter-rouge">pa.timestamp('s')</code>.</li>
</ul>

<p>The following options are tested and compared:</p>
<ul>
  <li>default
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyarrow.csv</span> <span class="k">as</span> <span class="n">pv</span>
<span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>default + to pandas
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">).</span><span class="n">to_pandas</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li>default + to pandas pyarrow
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">).</span><span class="n">to_pandas</span><span class="p">(</span><span class="n">types_mapper</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">ArrowDtype</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>dtype
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">convert_options</span><span class="o">=</span><span class="n">convert_options</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>dtype + to pandas
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">convert_options</span><span class="o">=</span><span class="n">convert_options</span><span class="p">).</span><span class="n">to_pandas</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li>dtype + to pandas pyarrow
    <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pv</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="nb">file</span><span class="p">,</span> <span class="n">convert_options</span><span class="o">=</span><span class="n">convert_options</span><span class="p">).</span><span class="n">to_pandas</span><span class="p">(</span><span class="n">types_mapper</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">ArrowDtype</span><span class="p">)</span>
</code></pre></div>    </div>
  </li>
</ul>

<p>The performance results for the previous options are shown here:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                               str    float  datetime
default                        0.39s  0.44s  0.38s
default + to pandas            1.07s  0.45s  0.42s
default + to pandas pyarrow    0.48s  0.43s  0.33s
dtype                          0.39s  0.40s  0.36s
dtype   + to pandas            0.99s  0.45s  0.41s
dtype   + to pandas pyarrow    0.39s  0.42s  0.37s
</code></pre></div></div>

<p>From these results we can conclude that:</p>
<ul>
  <li>The <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> module has a similar performance compared to <code class="language-plaintext highlighter-rouge">polars</code>.</li>
  <li>If we need to load CSV files into a <code class="language-plaintext highlighter-rouge">pandas</code> DataFrame, <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> is the fastest option.</li>
</ul>

<h2 id="best-options-from-pandas-polars-and-pyarrow">Best options from <code class="language-plaintext highlighter-rouge">pandas</code>, <code class="language-plaintext highlighter-rouge">polars</code>, and <code class="language-plaintext highlighter-rouge">pyarrow</code></h2>
<p>It is no surprise that all options using <code class="language-plaintext highlighter-rouge">arrow</code> to store data have similar performance for reading CSV files; <code class="language-plaintext highlighter-rouge">polars</code> also uses <code class="language-plaintext highlighter-rouge">arrow</code> to hold the data in memory. The <code class="language-plaintext highlighter-rouge">arrow</code> format is not just faster because it parallelizes reading; it is also more memory efficient.</p>

<p>The <code class="language-plaintext highlighter-rouge">polars</code> package is relatively new compared to <code class="language-plaintext highlighter-rouge">pandas</code>. It has some great new features but might not yet have all the functions we need, so it is entirely up to us to decide which package to use. If we use <code class="language-plaintext highlighter-rouge">polars</code> to do all our data manipulations, I would suggest sticking to <code class="language-plaintext highlighter-rouge">polars</code> for reading CSV files as well.</p>

<p>If <code class="language-plaintext highlighter-rouge">pandas</code> is still our preference, to load CSV files efficiently, we should use the <code class="language-plaintext highlighter-rouge">pyarrow</code> parser, backend, and dtype, or <code class="language-plaintext highlighter-rouge">pyarrow.csv</code>, to improve the performance further. If we also need to use the <code class="language-plaintext highlighter-rouge">numpy_nullable</code> backend, it is best to read CSV files using <code class="language-plaintext highlighter-rouge">pyarrow.csv</code> and then convert the backend to <code class="language-plaintext highlighter-rouge">numpy_nullable</code>.</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="IO" /><category term="Performance" /><summary type="html"><![CDATA[CSV (comma-separated values) files have been widely used in different areas. They can be easily exported from almost all programming languages. They can also be loaded into all text editors and many other applications. However, the main disadvantage is that CSV files are usually larger than files with other formats and it is slow to load them into memory.]]></summary></entry><entry><title type="html">Explode date ranges in a pandas DataFrame 30x faster</title><link href="https://seanslma.github.io/explode-date-range/" rel="alternate" type="text/html" title="Explode date ranges in a pandas DataFrame 30x faster" /><published>2024-04-14T00:00:00+00:00</published><updated>2024-04-14T00:00:00+00:00</updated><id>https://seanslma.github.io/explode-date-range</id><content type="html" xml:base="https://seanslma.github.io/explode-date-range/"><![CDATA[<p>During data analysis, it is very common that we need to convert data from a lower time resolution, such as quarterly or monthly, to a higher one, such as half-hourly.</p>

<p>We can do the conversion easily in Python using pandas. However, we know that the pandas <code class="language-plaintext highlighter-rouge">df.explode</code> function is very slow. Here I will show how we can make this process <strong>30x</strong> faster without using another Python package.</p>

<h2 id="pandas-dataframe-for-testing">Pandas DataFrame for testing</h2>
<p>For testing the code performance, I used the <code class="language-plaintext highlighter-rouge">gen_rand_df</code> function in <a href="https://medium.com/@sean.lma/how-to-create-dummy-pandas-dataframes-for-testing-cf03c52878e3">my previous post</a> to create a dummy pandas DataFrame:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">gen_rand_df</span><span class="p">(</span>
    <span class="n">nrow</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
    <span class="n">str_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'id'</span><span class="p">,</span> <span class="s">'category'</span><span class="p">],</span>
        <span class="s">'str_len'</span><span class="p">:</span> <span class="p">[</span><span class="mi">8</span><span class="p">,</span> <span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">20</span><span class="p">)],</span>
        <span class="s">'str_count'</span><span class="p">:</span> <span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">30</span><span class="p">],</span>
    <span class="p">},</span>
    <span class="n">ts_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'start_date'</span><span class="p">,</span> <span class="s">'end_date'</span><span class="p">],</span>
        <span class="s">'start_date'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2020-01-01'</span><span class="p">,</span> <span class="s">'2023-01-01'</span><span class="p">],</span>
        <span class="s">'end_date'</span><span class="p">:</span> <span class="p">[</span><span class="s">'2023-01-01'</span><span class="p">,</span> <span class="s">'2025-01-01'</span><span class="p">],</span>
        <span class="s">'freq'</span><span class="p">:</span> <span class="s">'MS'</span><span class="p">,</span>
        <span class="s">'random'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="n">float_cols</span><span class="o">=</span><span class="p">{</span>
        <span class="s">'count'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="s">'low'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">'high'</span><span class="p">:</span> <span class="mf">100.0</span><span class="p">,</span>
        <span class="s">'missing_pct'</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>
    <span class="p">},</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Here are the first two of the 100 rows in the created DataFrame:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         id    category start_date   end_date        f1         f2
0  8v5KSoKX       jIMki 2020-01-01 2023-07-01  35.20661  76.041564
1  ihXEKLSb  bws6TOEr06 2020-05-01 2023-02-01       NaN  26.725758
</code></pre></div></div>

<h2 id="initial-solution-from-chatgpt-and-google-gemini">Initial solution from ChatGPT and Google Gemini</h2>
<p>We need to explode the date range (from start_date to end_date) of each row in the DataFrame into half-hourly timestamps while keeping all other columns.</p>

<p>To do that, I got solutions from ChatGPT and Google Gemini after a few iterations (they are basically the same):</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">],</span> <span class="n">freq</span><span class="o">=</span><span class="s">'30min'</span><span class="p">),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span>
<span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">explode</span><span class="p">(</span><span class="s">'ts'</span><span class="p">)</span>
</code></pre></div></div>
<p>And the first two rows of the result DataFrame are:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         id category start_date   end_date        f1         f2                  ts
0  8v5KSoKX    jIMki 2020-01-01 2023-07-01  35.20661  76.041564 2020-01-01 00:00:00
0  8v5KSoKX    jIMki 2020-01-01 2023-07-01  35.20661  76.041564 2020-01-01 00:30:00
</code></pre></div></div>

<p>The solution works but it is very slow. The time for creating the <code class="language-plaintext highlighter-rouge">ts</code> column is <code class="language-plaintext highlighter-rouge">703 ms ± 6.57 ms</code> and exploding is <code class="language-plaintext highlighter-rouge">691 ms ± 7.47 ms</code>, for a DataFrame with only 100 rows.</p>

<p>I tried different prompts to get a faster solution from the AI applications but failed; the solutions were either wrong or raised errors. My suggestion would be to use AI applications only for ideas or a draft solution. The best solution can only be created by a person with some knowledge of the domain.</p>

<h2 id="using-a-for-loop-instead-of-the-dfapply-function">Using a <code class="language-plaintext highlighter-rouge">for-loop</code> instead of the <code class="language-plaintext highlighter-rouge">df.apply</code> function</h2>
<p>We know that <code class="language-plaintext highlighter-rouge">df.apply</code> is slow, so I will replace it with a <code class="language-plaintext highlighter-rouge">for-loop</code>.
There are a few ways to iterate over the DataFrame rows. Let us compare them:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 351 µs ± 57.8 µs
</span><span class="k">for</span> <span class="p">(</span><span class="n">_</span><span class="p">,</span> <span class="n">row</span><span class="p">)</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span> <span class="k">pass</span>
<span class="c1"># 271 µs ± 65.7 µs
</span><span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">to_records</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span> <span class="k">pass</span>
<span class="c1"># 26.5 µs ± 12.9 µs
</span><span class="k">for</span> <span class="n">start</span><span class="p">,</span> <span class="n">end</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">]):</span> <span class="k">pass</span>
<span class="c1"># 10.4 µs ± 2.55 µs
</span><span class="k">for</span> <span class="n">start</span><span class="p">,</span> <span class="n">end</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">].</span><span class="n">values</span><span class="p">):</span> <span class="k">pass</span>
</code></pre></div></div>
<p>The last version is <strong>34x</strong> faster than <code class="language-plaintext highlighter-rouge">df.iterrows()</code>. The improvement will be even larger for a DataFrame with many more rows.</p>

<p>The improved version for creating the <code class="language-plaintext highlighter-rouge">ts</code> column is:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s">'30min'</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">start</span><span class="p">,</span> <span class="n">end</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
<span class="p">]</span>
</code></pre></div></div>
<p>Now the time for the improved version is <code class="language-plaintext highlighter-rouge">684 ms ± 12.6 ms</code>; it is still too slow.</p>

<h2 id="implementing-a-custom-dfexplode-function">Implementing a custom <code class="language-plaintext highlighter-rouge">df.explode</code> function</h2>
<p>It seems there is not much more we can do to speed up creating the <code class="language-plaintext highlighter-rouge">ts</code> column.</p>

<p>Now let us focus on the <code class="language-plaintext highlighter-rouge">df.explode</code> part. We will implement our own version to explode the lists in the <code class="language-plaintext highlighter-rouge">ts</code> column.</p>

<p>We know that the <code class="language-plaintext highlighter-rouge">df.reindex</code> function can be used to resample rows of a DataFrame based on a provided new index. Here we will use this function to implement a new <code class="language-plaintext highlighter-rouge">explode</code> function.</p>
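<p>As a quick illustration with a made-up DataFrame, passing an index with repeated labels to <code class="language-plaintext highlighter-rouge">df.reindex</code> duplicates the corresponding rows:</p>

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]})

# Repeating index labels duplicates those rows in the result
out = df.reindex([0, 0, 1, 2, 2])
print(out['a'].tolist())  # [10, 10, 20, 30, 30]
```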

<p>First we can create the new index and <code class="language-plaintext highlighter-rouge">ts</code> column, using <code class="language-plaintext highlighter-rouge">pd.concat</code> to merge the DataFrames created from each row:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">df</span>
    <span class="p">.</span><span class="n">get</span><span class="p">([</span><span class="s">'ts'</span><span class="p">])</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'i'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">dt</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'i'</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">'ts'</span><span class="p">:</span> <span class="n">ts</span><span class="p">})</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">ts</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">d</span><span class="p">[</span><span class="s">'i'</span><span class="p">].</span><span class="n">values</span><span class="p">,</span> <span class="n">d</span><span class="p">[</span><span class="s">'ts'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
<span class="p">]).</span><span class="n">set_index</span><span class="p">(</span><span class="s">'i'</span><span class="p">).</span><span class="n">rename_axis</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<p>Then we use the <code class="language-plaintext highlighter-rouge">df.reindex</code> function to sample the other columns in the original DataFrame and add the exploded <code class="language-plaintext highlighter-rouge">ts</code> column:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="s">'ts'</span><span class="p">).</span><span class="n">reindex</span><span class="p">(</span><span class="n">dt</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">]</span> <span class="o">=</span> <span class="n">dt</span><span class="p">.</span><span class="n">ts</span>
</code></pre></div></div>

<p>Putting the two parts together, here is the custom <code class="language-plaintext highlighter-rouge">explode</code> function:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">explode_df_column</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
    <span class="n">dt</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
        <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'i'</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">'ts'</span><span class="p">:</span> <span class="n">ts</span><span class="p">})</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">ts</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
    <span class="p">]).</span><span class="n">set_index</span><span class="p">(</span><span class="s">'i'</span><span class="p">).</span><span class="n">rename_axis</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="s">'ts'</span><span class="p">).</span><span class="n">reindex</span><span class="p">(</span><span class="n">dt</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
    <span class="n">df</span><span class="p">[</span><span class="s">'ts'</span><span class="p">]</span> <span class="o">=</span> <span class="n">dt</span><span class="p">.</span><span class="n">ts</span>
    <span class="k">return</span> <span class="n">df</span>
</code></pre></div></div>

<p>The time for this function is <code class="language-plaintext highlighter-rouge">22 ms ± 285 µs</code>, <strong>30x</strong> faster compared to the <code class="language-plaintext highlighter-rouge">df.explode</code> function that has a time of <code class="language-plaintext highlighter-rouge">691 ms ± 7.47 ms</code>.</p>
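<p>To sanity-check the custom function on a tiny made-up DataFrame, we can compare its output against <code class="language-plaintext highlighter-rouge">df.explode</code>:</p>

```python
import pandas as pd

# Same custom explode as above, where the 'ts' column holds list-like values
def explode_df_column(df):
    dt = pd.concat([
        pd.DataFrame({'i': i, 'ts': ts})
        for (i, ts) in enumerate(df['ts'].values)
    ]).set_index('i').rename_axis(None, axis=0)
    df = df.drop(columns='ts').reindex(dt.index)
    df['ts'] = dt.ts
    return df

df = pd.DataFrame({'id': ['a', 'b'], 'ts': [[1, 2], [3]]})
fast = explode_df_column(df)
slow = df.explode('ts')
print(fast['id'].tolist() == slow['id'].tolist())  # True
print(fast['ts'].tolist() == slow['ts'].tolist())  # True
```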

<h2 id="creating-a-ts-column-with-value-of-lists-not-required">Creating a <code class="language-plaintext highlighter-rouge">ts</code> column of lists is not required</h2>
<p>For our use case, creating an intermediate column holding a list of timestamps for each row is not required. We can merge this step into the creation of the <code class="language-plaintext highlighter-rouge">dt</code> DataFrame.</p>

<p>Here is the final solution based on this idea:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a DataFrame with new index and the 30min ts column
</span><span class="n">dt</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'i'</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">'ts'</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s">'30min'</span><span class="p">)})</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'start_date'</span><span class="p">],</span> <span class="n">df</span><span class="p">[</span><span class="s">'end_date'</span><span class="p">]))</span>
<span class="p">]).</span><span class="n">set_index</span><span class="p">(</span><span class="s">'i'</span><span class="p">).</span><span class="n">rename_axis</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="c1"># Resample original df based on new index and add the exploded ts column
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">dt</span><span class="p">.</span><span class="n">index</span><span class="p">).</span><span class="n">assign</span><span class="p">(</span><span class="n">ts</span><span class="o">=</span><span class="n">dt</span><span class="p">.</span><span class="n">ts</span><span class="p">)</span>
</code></pre></div></div>

<p>Great! The time for this solution is <code class="language-plaintext highlighter-rouge">49.8 ms ± 932 µs</code>, about <strong>30x</strong> faster than the initial solution that has a time of <code class="language-plaintext highlighter-rouge">1.394s</code> (<code class="language-plaintext highlighter-rouge">703 ms ± 6.57 ms</code> + <code class="language-plaintext highlighter-rouge">691 ms ± 7.47 ms</code>).</p>

<p>We can wrap the method into a function and add parameters, for example to clip the date ranges to a min/max datetime or to keep the original DataFrame index. I will leave the remaining work to you.</p>
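<p>A possible sketch of such a wrapper is shown below; the function name and the <code class="language-plaintext highlighter-rouge">ts_min</code>/<code class="language-plaintext highlighter-rouge">ts_max</code> parameters are my own additions, not part of the benchmarks above:</p>

```python
import pandas as pd

def explode_date_ranges(df, start_col='start_date', end_col='end_date',
                        freq='30min', ts_min=None, ts_max=None):
    # Hypothetical wrapper around the reindex-based approach above.
    # ts_min/ts_max (pd.Timestamp) optionally clip each range before expansion.
    d = df.reset_index(drop=True)
    dt = pd.concat([
        pd.DataFrame({'i': i, 'ts': pd.date_range(
            start if ts_min is None else max(start, ts_min),
            end if ts_max is None else min(end, ts_max),
            freq=freq,
        )})
        for i, (start, end) in enumerate(zip(d[start_col], d[end_col]))
    ]).set_index('i').rename_axis(None, axis=0)
    return d.reindex(dt.index).assign(ts=dt.ts)

df = pd.DataFrame({
    'id': ['a'],
    'start_date': pd.to_datetime(['2024-01-01 00:00']),
    'end_date': pd.to_datetime(['2024-01-01 01:00']),
})
out = explode_date_ranges(df)
print(len(out))  # 3 rows: 00:00, 00:30, 01:00
```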

<p>In summary, we made the method for exploding datetime ranges in a DataFrame into a new timestamp column, while copying the other columns, about <strong>30x</strong> faster. Along the way, we also created a custom function that is about <strong>30x</strong> faster than the pandas <code class="language-plaintext highlighter-rouge">df.explode</code> function.</p>

<h2 id="polars-version">Polars version</h2>
<p>What about using <code class="language-plaintext highlighter-rouge">polars</code>? Is it faster?</p>

<p>This Stack Overflow question might be helpful:
<a href="https://stackoverflow.com/questions/73161185/repeating-a-date-in-polars-and-exploding-it">Repeating a date in polars and exploding it</a>.</p>]]></content><author><name>Sean Ma</name></author><category term="Python" /><category term="Pandas" /><category term="Performance" /><summary type="html"><![CDATA[During data analysis, it is very common that we need to convert data from a lower time resolution, such as quarterly or monthly, to a higher one, such as half-hourly.]]></summary></entry></feed>