Tool¶
Databricks, dbt, Kafka, SQL, Spark, and Python
Databricks¶
PySpark, Spark SQL, cluster management, query optimization, performance tuning
[Database] Source¶
MySQL
PostgreSQL
NoSQL: MongoDB
[Extraction] Batch and Streaming¶
Kafka (streaming storage)
Flink (streaming computation)
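The split between streaming storage and streaming computation can be illustrated without either system. The sketch below is plain Python and does not use the Kafka or Flink APIs: `AppendOnlyLog` and `tumbling_window_counts` are hypothetical names, with the log standing in for a single topic partition and the function standing in for a windowed count job.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AppendOnlyLog:
    """Toy stand-in for one topic partition: records are only appended
    and are addressed by their offset (position in the log)."""
    records: list = field(default_factory=list)

    def append(self, timestamp: int, key: str) -> int:
        self.records.append((timestamp, key))
        return len(self.records) - 1  # offset of the new record

    def read_from(self, offset: int) -> list:
        # Consumers track their own offset and can re-read older records.
        return self.records[offset:]


def tumbling_window_counts(events, window_size: int) -> dict:
    """Toy stand-in for a streaming job: count keys per fixed,
    non-overlapping time window."""
    windows = {}
    for timestamp, key in events:
        window_start = (timestamp // window_size) * window_size
        windows.setdefault(window_start, Counter())[key] += 1
    return windows


# Storage side: producers append events to the log.
log = AppendOnlyLog()
for ts, key in [(1, "click"), (4, "view"), (7, "click"), (12, "click")]:
    log.append(ts, key)

# Computation side: a job reads from offset 0 and aggregates per window.
counts = tumbling_window_counts(log.read_from(0), window_size=10)
print(counts)
```

Keeping the log separate from the job means the same stored events can be replayed into a different computation later, which is the main reason the two roles are usually handled by different tools.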
Scheduling and Orchestration¶
Apache Airflow
Argo (Kubernetes native)
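Both tools schedule tasks as a directed acyclic graph (DAG) and run each task only after its upstream tasks finish. A minimal sketch of that ordering idea, using only the Python standard library (`graphlib`, Python 3.9+) rather than the Airflow or Argo APIs; the task names and the `run_pipeline` helper are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load": {"transform"},
}


def run_pipeline(deps: dict) -> list:
    """Visit tasks in an order that respects their dependencies."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        # A real orchestrator would launch a job or container here.
        executed.append(task)
    return executed


order = run_pipeline(dependencies)
print(order)
```

The two extract tasks have no dependency on each other, so their relative order is not fixed; an orchestrator could run them in parallel.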
[Loading/Transformation] Processing and Transformation¶
Apache Hive
Hive is a distributed data warehouse system that runs on top of the Hadoop Distributed File System (HDFS).
Apache Spark
Spark is a distributed data processing framework that extracts and processes large volumes of data for analytical purposes. Its core abstraction is the RDD (Resilient Distributed Dataset).
dbt
dbt is an open-source data modeling and transformation tool, designed for analytics engineers and data analysts to transform data inside their data warehouse (usually a SQL-based warehouse such as Snowflake, BigQuery, or Redshift).
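The in-warehouse transformation style that dbt uses can be sketched as a SELECT statement whose result is materialized as a new table. The example below is only an illustration of that idea, not dbt itself: it uses an in-memory SQLite database as a stand-in for a warehouse, and the table names `raw_orders` and `customer_totals` are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO raw_orders VALUES
        (1, 'alice', 20.0),
        (2, 'bob', 15.0),
        (3, 'alice', 5.0);
""")

# The "model": a SELECT that defines the transformed dataset.
model_sql = """
    SELECT customer,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY customer
"""

# Materialize the model as a table inside the same database.
conn.execute(f"CREATE TABLE customer_totals AS {model_sql}")

rows = conn.execute(
    "SELECT customer, order_count, total_amount "
    "FROM customer_totals ORDER BY customer"
).fetchall()
print(rows)
```

The raw data never leaves the database; only the SQL definition lives outside it, which is what makes this approach easy to version-control and test.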
[Storage] Data Warehouse¶
Redshift
Redshift is a fully managed cloud warehouse by Amazon.
BigQuery
BigQuery is a fully managed cloud warehouse by Google.
Azure Warehouse
Delta Lake
Snowflake
[Application] Data Analysis¶
PowerBI
Looker (should avoid?)
Looker is BI software that helps users visualize data.