Azure Blob Storage¶

https://learn.microsoft.com/en-us/azure/storage/files/storage-python-how-to-use-file-storage?tabs=python

fsspec¶

If the path ends with /, only folders are returned.

adlfs¶

Not correct: performance using AzureBlobFileSystem

When reading Parquet files from Azure Blob Storage using pd.read_parquet with the engine set to pyarrow,
the performance can **become suboptimal** when reading in parallel due to the limitations of the
AzureBlobFileSystem and the way PyArrow handles parallel reading.

The main reason for this performance issue is that the `AzureBlobFileSystem`, which is used to interact with
Azure Blob Storage, **does not natively support parallel reads efficiently**. This limitation can lead to
slower performance when multiple threads or processes are trying to read Parquet files in parallel from
the same container in Azure Blob Storage.

create file system¶

from adlfs.spec import AzureBlobFileSystem
from azure.identity.aio import DefaultAzureCredential
def get_file_system(
  account_name: str,
  credential: object=None,
) -> AzureBlobFileSystem:
    # Disable messages from azure.identity.aio
    logging.getLogger('azure.identity.aio').setLevel(logging.ERROR)
    credential = credential or DefaultAzureCredential()
    return AzureBlobFileSystem(
        account_name=account_name,
        credential=credential,
    )

`fs.glob` vs `fs.ls`¶

fs.glob('container-name/xyz-*.parquet') is supper slow - will scan all files in this container
fs.ls(path='container-name', prefix='xyz-') is much faster - will only list folders (not subfolders) and files in the path
but fs.ls will crash if the folder not exist

adlfs parallel performance¶

pd.read_parquet is 2-3x faster than dd.read_parquet.

adlfs glob Do not create the file system for each parallel call - can be slow.

def read_parquet_file( *, fs: AzureBlobFileSystem, path: str, columns: list[str], filters: list[tuple] = None, class=p>) -> pd.DataFrame: """ The `path` should be in this format `az://<blob-name>/folder/file-name` class=sd>    """ # Occasionally it will fail to retrieve a token retries = 3     # Number of retries retry_delay = 2 # Seconds to wait between retries while retries > 0: try: with fs.open(path) as f: df = pd.read_parquet( path=f, columns=columns, filters=filters, engine='pyarrow' ) break except ClientAuthenticationError: retries -= 1 if retries > 0: time.sleep(retry_delay) else: raise except Exception: raise return df class=k>def read_parquet_files( paths: list[str], columns: list[str], filters: list[tuple] = None, use_cache: bool = False, class=p>) -> list[pd.DataFrame]: class=w>    """ class=sd>    Read multiple parquet files in parallel into a list of pd.DataFrame class=sd>    """ if not filters: filters = None # do not create the file system for each thread call, can be very slow fs = get_file_system() if len(paths) == 1: dfs = [ read_parquet_file(fs=fs, path=paths[0], columns=columns, filters=filters) ] else: n_jobs = min(cpu_count(), len(paths)) with ThreadPool(processes=n_jobs) as pool: dfs = pool.map( lambda path: read_parquet_file(fs=fs, path=path, columns=columns, filters=filters), paths, ) return dfs class=c1># can be 2-3x faster than dd.read_parquet class=n>dfs = read_parquet_files( paths=paths, columns=columns, filters=filters, class=p>) class=n>d1 = pd.concat(dfs, axis=0).reset_index().get(columns) class=c1># can 2-3x slower than pd.read_parquet class=n>d2 = dd.read_parquet( paths, index=False, columns=columns, filters=filters, engine='pyarrow', storage_options=storage_options, open_file_options=dict(precache_options={'method': 'parquet'}), schema=schema, class=p>).compute() wildcards performance¶

https://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html

performance

use subfolders such as year/month will siginificantly improve search performance
use wildcard characters in the middle will lead to full scan
search a few sub folders one by one will be much faster

files = fs.glob('dev-blob/2021/01/data*].parquet')
files = fs.glob('dev-blob/2021/01/data[0-9][0-9].parquet')

`azure-storage-blob¶`

can be 2-3x slower than adlfs
azure-storage-blob supports both sync and async versions.

When got this error

AttributeError: 'coroutine' object has no attribute 'token'
sys:1: RuntimeWarning: coroutine 'DefaultAzureCredential.get_token' was never awaited

 It's most likely incorrectly used the async version from azure.identity.aio import DefaultAzureCredential. read parquet blob to df performance¶
adlfs can be 1.5-2x faster than azure-storage-blob. But AzureBlobFileSystem in parallel can be really slow down if used incorrectly (do not create a separate fs for each parallel run - can be very slow).
from adlfs.spec import AzureBlobFileSystem
from azure.storage.blob import ContainerClient
from azure.identity import DefaultAzureCredential

def read_parquet_fast(path: str, columns: list[str]) -> pd.DataFrame:
    fs = AzureBlobFileSystem(
        account_name=account_name,
        credential=DefaultAzureCredential(),
    )
    with fs.open(path) as f:
        df = pd.read_parquet(path=f, columns=columns)
    return df

def read_parquet_slow(path: str, columns: list[str]) -> pd.DataFrame:
    path = path.split('/', 3)[-1]
    container_client = = ContainerClient(
        account_url=f'https://{account_name}.blob.core.windows.net',
        credential=DefaultAzureCredential(),
        container_name=container_name,
    )
    with io.BytesIO() as data_buffer:
        blob_client = container_client.get_blob_client(path)
        blob_client.download_blob().readinto(data_buffer)
        data_buffer.seek(0)
        df = pd.read_parquet(data_buffer, columns=columns, engine='pyarrow')
    return df
Azure Data Lake Storage (ADLS)¶
It's actually created on top of azure-storage-blob.
The azure-storage-file-datalake library is specifically designed to interact with Azure Data Lake Storage Gen2 (ADLS Gen2). ADLS Gen2 is an enterprise-grade distributed file system built on top of Azure Blob Storage, providing hierarchical namespace and capabilities for big data analytics.
When working with Azure Data Lake Storage Gen2, we should use "azure-storage-filedatalake" and for general-purpose object storage in Azure Blob Storage, should use "azure-blob-storage"

Azure Blob Storage¶

fsspec¶

adlfs¶

create file system¶

fs.glob vs fs.ls¶

adlfs parallel performance¶

azure-storage-blob¶

read parquet blob to df performance¶

Azure Data Lake Storage (ADLS)¶

`fs.glob` vs `fs.ls`¶

`azure-storage-blob¶`