Tool¶

Databricks, dbt, Kafka, SQL, Spark, and Python

Databricks¶

PySpark, Spark SQL, cluster management, query optimization, performance tuning

[Database] Source¶

MySQL
PostgreSQL
NoSQL: MongoDB

[Extraction] Batch and Streaming¶

Kafka (streaming storage)
Flink (streaming computation)
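
The split between streaming storage and streaming computation can be sketched in plain Python. This is a toy model under stated assumptions, not the Kafka or Flink API: `AppendOnlyLog` and `count_by_key` are hypothetical names invented for illustration. The log only stores records and hands them out by offset, and the computation is a separate function that consumes them.

```python
from collections import Counter


class AppendOnlyLog:
    """Toy stand-in for a streaming storage layer (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record at or after the given offset."""
        return self._records[offset:]


def count_by_key(records):
    """Toy stand-in for a streaming computation: count events per key."""
    return Counter(key for key, _value in records)


log = AppendOnlyLog()
for event in [("click", 1), ("view", 2), ("click", 3)]:
    log.append(event)

# A consumer that has already processed offset 0 resumes from offset 1.
counts = count_by_key(log.read_from(1))
```

Keeping storage and computation apart means a consumer can replay the log from any offset without the producer being involved.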

Scheduling and Orchestration¶

Apache Airflow
Argo (Kubernetes-native)
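
Orchestrators of this kind typically model a pipeline as a directed acyclic graph of tasks and run each task only after its dependencies have finished. The sketch below illustrates that ordering idea with Python's standard-library `graphlib` (Python 3.9+). It does not use the Airflow or Argo APIs, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
run_order = list(TopologicalSorter(dependencies).static_order())
```

A real scheduler adds retries, schedules, and parallel execution on top of this ordering.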

[Loading/Transformation] Processing and Transformation¶

Apache Hive: Hive is a distributed data warehouse system that runs on top of the Hadoop Distributed File System (HDFS) and is queried with a SQL-like language (HiveQL).
Apache Spark: Spark is a distributed data processing framework that extracts and processes large volumes of data, represented as RDDs (Resilient Distributed Datasets) or DataFrames, for analytical purposes.
dbt: dbt is an open-source data modeling and transformation tool, designed for analytics engineers and data analysts to transform data inside their data warehouse (usually a SQL-based warehouse such as Snowflake, BigQuery, or Redshift).
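
The in-warehouse transformation style that dbt is built around can be sketched without dbt itself. A dbt model is essentially a `SELECT` statement that gets materialized as a table or view. The example below imitates that idea using Python's built-in `sqlite3` as a stand-in for a warehouse. The table and column names (`raw_orders`, `customer_totals`) are hypothetical, and no dbt commands or configuration are shown.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO raw_orders VALUES (1, 'alice', 10.0), (2, 'bob', 5.0), (3, 'alice', 7.5);

    -- The "model": a SELECT materialized as a new table inside the database.
    CREATE TABLE customer_totals AS
    SELECT customer, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY customer;
    """
)

rows = conn.execute(
    "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
).fetchall()
```

The raw data never leaves the database: the transformation is pure SQL, which is what lets the warehouse engine do the heavy lifting.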

[Storage] Data Warehouse¶

Redshift: Redshift is a fully managed cloud data warehouse by Amazon.
BigQuery: BigQuery is a fully managed cloud data warehouse by Google.
Azure Synapse Analytics (Azure's data warehouse offering)
Delta Lake
Snowflake

[Application] Data Analysis¶

PowerBI
Looker (should avoid?): Looker is BI software that helps users visualize data.