Tool

Databricks, dbt, Kafka, SQL, Spark, and Python

Databricks

PySpark, Spark SQL, cluster management, query optimization, performance tuning

[Database] Source

  • MySQL

  • PostgreSQL

  • NoSQL: MongoDB

[Extraction] Batch and Streaming

  • Kafka (streaming storage)

  • Flink (streaming computation)

Scheduling and Orchestration

  • Apache Airflow

  • Argo (Kubernetes-native)

[Loading/Transformation] Processing and Transformation

  • Apache Hive: Hive is a distributed data warehouse system built on Hadoop. It provides SQL-like querying (HiveQL) over data stored in the Hadoop Distributed File System (HDFS).

  • Apache Spark: Spark is a distributed data processing framework that helps extract and process large volumes of data for analytical purposes. Its core abstraction is the RDD (Resilient Distributed Dataset).

  • dbt: dbt is an open-source data modeling and transformation tool, designed for analytics engineers and data analysts to transform data inside their data warehouse (usually a SQL-based warehouse such as Snowflake, BigQuery, or Redshift).
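
The transform-inside-the-warehouse idea behind dbt can be sketched without dbt itself. The example below is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for a real warehouse, and the table names (raw_orders, customer_revenue), the sample rows, and the helper functions are all hypothetical. It shows the general pattern of writing a transformation as a plain SELECT and materializing its result as a new table, which is roughly what a table-materialized model does.

```python
import sqlite3

# Hypothetical raw data, standing in for a table already loaded into a warehouse.
RAW_ORDERS = [
    (1, "alice", 30.0, "completed"),
    (2, "bob", 12.5, "completed"),
    (3, "alice", 20.0, "cancelled"),
    (4, "alice", 5.0, "completed"),
]

# The "model": a plain SELECT, similar in spirit to what a model file would hold.
CUSTOMER_REVENUE_MODEL = """
    SELECT customer, SUM(amount) AS revenue, COUNT(*) AS order_count
    FROM raw_orders
    WHERE status = 'completed'
    GROUP BY customer
"""


def build_customer_revenue(conn):
    """Materialize the model as a table and return its rows sorted by customer."""
    conn.execute("DROP TABLE IF EXISTS customer_revenue")
    conn.execute("CREATE TABLE customer_revenue AS " + CUSTOMER_REVENUE_MODEL)
    return conn.execute(
        "SELECT customer, revenue, order_count "
        "FROM customer_revenue ORDER BY customer"
    ).fetchall()


def run_demo():
    """Load the raw rows into an in-memory database, then build the model."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_orders "
        "(id INTEGER, customer TEXT, amount REAL, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", RAW_ORDERS)
    rows = build_customer_revenue(conn)
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

Keeping the transformation as a single SELECT means the same logic could be rebuilt from the raw table at any time. A real dbt project adds dependency ordering, testing, and warehouse-specific materializations on top of this idea.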

[Storage] Data Warehouse

  • Redshift: Redshift is a fully managed cloud data warehouse by Amazon.

  • BigQuery: BigQuery is a fully managed cloud data warehouse by Google.

  • Azure Warehouse

  • Delta Lake

  • Snowflake

[Application] Data Analysis

  • Power BI

  • Looker (should avoid?): Looker is BI software that helps users visualize data.