etl and orchestration

learn batch vs incremental loading, data quality checks, file formats, transformations with pandas or spark, and workflow orchestration with airflow, luigi, or dagster.

key concepts

  • etl vs elt
  • airflow dag architecture
  • data flow: source → raw → staging → curated
  • schema validation and idempotency
  • retries, backoff, monitoring
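
the retries and backoff item above can be sketched in plain python. this is a minimal sketch, not how airflow implements retries: `backoff_delays` and `call_with_retries` are hypothetical helper names, and the base delay, factor, and cap are assumed values.

```python
import time


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Seconds to wait before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def call_with_retries(fn, retries=3, base=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait and retry. Re-raise after the last attempt.

    `sleep` is injectable so the waiting can be replaced in tests.
    """
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error to monitoring
            sleep(delays[attempt])
```

retrying is only safe when the task is idempotent, which is why the two concepts sit next to each other in the list. adding random jitter to each delay is a common refinement that is left out here.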

explanation practice

  • dag structure on whiteboard
  • data lineage diagram
  • incremental load visualization
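
before drawing the incremental load, it may help to see one in code. the sketch below uses the standard-library `sqlite3` module as a stand-in for both source and target; the `events` and `watermark` tables and the `updated_at` column are assumptions made for illustration.

```python
import sqlite3


def incremental_load(src, dst):
    """Copy rows changed since the stored watermark. Safe to re-run."""
    dst.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(id INTEGER PRIMARY KEY, value TEXT, updated_at INTEGER)"
    )
    dst.execute("CREATE TABLE IF NOT EXISTS watermark (k TEXT PRIMARY KEY, v INTEGER)")

    # Read the high-water mark left by the previous run (-1 means "load everything").
    row = dst.execute("SELECT v FROM watermark WHERE k = 'events'").fetchone()
    last = row[0] if row else -1

    rows = src.execute(
        "SELECT id, value, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last,),
    ).fetchall()

    with dst:  # one transaction: rows and watermark commit together
        # Upsert by primary key, so replaying the same rows changes nothing.
        dst.executemany(
            "INSERT OR REPLACE INTO events (id, value, updated_at) VALUES (?, ?, ?)",
            rows,
        )
        if rows:
            dst.execute(
                "INSERT OR REPLACE INTO watermark (k, v) VALUES ('events', ?)",
                (rows[-1][2],),
            )
    return len(rows)
```

a second run with no new source changes copies zero rows. the strict `>` comparison can miss rows that arrive later with the same `updated_at` as the watermark, so a real pipeline would need a policy for ties or late data.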

projects

1. weather pipeline

  • api ingestion → s3 → clean parquet → postgres
  • airflow orchestration + alerts
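
the task ordering for this project can be prototyped before touching airflow. the sketch below is not airflow code: it uses the standard-library `graphlib` module (python 3.9+) to show the dependency ordering an orchestrator resolves, with an alert hook on failure. the task names, `run_pipeline`, and `on_failure` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
WEATHER_DAG = {
    "fetch_api": set(),
    "write_raw_to_s3": {"fetch_api"},
    "clean_to_parquet": {"write_raw_to_s3"},
    "load_postgres": {"clean_to_parquet"},
}


def run_pipeline(dag, tasks, on_failure):
    """Run tasks in dependency order; on the first failure, alert and stop.

    Returns the names of the tasks that completed.
    """
    done = []
    for name in TopologicalSorter(dag).static_order():
        try:
            tasks[name]()
        except Exception as exc:
            on_failure(name, exc)  # e.g. send an email or chat message
            break  # downstream tasks must not run on bad input
        done.append(name)
    return done
```

in airflow the same shape would be expressed as operators with dependencies between them, and the alert would typically be a failure callback or notification setting rather than a hand-written `try`/`except`.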

2. cdc pipeline

  • postgres → debezium → kafka → spark → iceberg
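
the core of the last hop is applying change events to a target table. debezium change events carry an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) plus `before` and `after` row images. the sketch below applies such events to an in-memory dict instead of an iceberg table; `apply_change` and the sample rows are hypothetical, and the real merge would happen in spark.

```python
def apply_change(table, event, key="id"):
    """Apply one Debezium-style change event to `table` (a dict keyed by primary key)."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Inserts, snapshot reads, and updates are all upserts of the "after" image.
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # Deletes carry the old row in "before"; a missing key is not an error.
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


def replay(events, key="id"):
    """Build the table state by applying events in order."""
    table = {}
    for event in events:
        apply_change(table, event, key)
    return table
```

because every operation is an upsert or a tolerant delete, replaying the same ordered log after a crash produces the same final state. ordering per key still matters: applying an older update after a newer one would leave stale data.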

3. data quality library

  • python mini “great expectations”
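
a possible starting point for the library, assuming rows are plain dicts. the function names below are hypothetical and are not the great expectations api; each check returns a small result object listing the indexes of failing rows.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    name: str
    success: bool
    failed_rows: list = field(default_factory=list)


def expect_not_null(rows, column):
    """Every row must have a non-None value in `column`."""
    bad = [i for i, r in enumerate(rows) if r.get(column) is None]
    return CheckResult(f"not_null({column})", not bad, bad)


def expect_between(rows, column, low, high):
    """Non-null values must satisfy low <= value <= high (nulls are skipped)."""
    bad = [
        i for i, r in enumerate(rows)
        if r.get(column) is not None and not (low <= r[column] <= high)
    ]
    return CheckResult(f"between({column}, {low}, {high})", not bad, bad)


def expect_unique(rows, column):
    """Flag every row whose value in `column` was already seen earlier."""
    seen, bad = set(), []
    for i, r in enumerate(rows):
        value = r.get(column)
        if value in seen:
            bad.append(i)
        seen.add(value)
    return CheckResult(f"unique({column})", not bad, bad)
```

returning failing row indexes instead of raising lets a pipeline decide whether to quarantine rows, warn, or fail the run.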

4. data profiler tool

  • summarize stats, nulls, unique counts
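
a minimal sketch of the profiler, assuming rows are dicts and `None` marks a missing value. `profile` is a hypothetical name; numeric stats are only reported when every non-null value in a column is a number.

```python
import statistics


def profile(rows, columns):
    """Return per-column count, null count, unique count, and basic numeric stats."""
    report = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        present = [v for v in values if v is not None]
        stats = {
            "count": len(values),
            "nulls": len(values) - len(present),
            "unique": len(set(present)),
        }
        # bool is a subclass of int in Python, so exclude it explicitly.
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if numeric and len(numeric) == len(present):
            stats.update(
                min=min(numeric),
                max=max(numeric),
                mean=statistics.fmean(numeric),
            )
        report[col] = stats
    return report
```

the same summary on a pandas dataframe would lean on built-in methods; writing it by hand first makes it clearer what each number means.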