reasoning-based projects
🧩 level 1: foundational (core cs, python, sql, os, networking)
goal: build muscle in core cs and a fundamental understanding of data flow.
- heap-based ranking system (data structure fundamentals; see the sketch after this list)
- in-memory key-value store (hashmap + persistence simulation)
- tcp chat system (socket + concurrency + message delivery)
- python job scheduler (multiprocessing + process sync)
- normalization and er diagram case study (manual modeling)
- simple query planner visualizer (sql cost estimation demo)
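as a flavor of the heap-based ranking system, here is a minimal top-k sketch built on python's heapq; the (item, score) event shape is an illustrative assumption, not a fixed spec:

```python
import heapq

def top_k(scored_items, k):
    """keep the k highest-scoring items using a min-heap of size k."""
    heap = []  # min-heap: the smallest of the current top-k sits at heap[0]
    for item, score in scored_items:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))  # evict the current minimum
    return sorted(heap, reverse=True)  # highest score first

events = [("a", 5), ("b", 9), ("c", 1), ("d", 7), ("e", 3)]
print(top_k(events, 3))  # [(9, 'b'), (7, 'd'), (5, 'a')]
```

the min-heap keeps ranking at O(n log k) instead of sorting everything, which is the point of the exercise.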
⚙️ level 2: data modeling + dbms basics
goal: understand relational modeling, transactions, and schema design.
- postgres analytics schema (star/snowflake model)
- nosql vs relational replication comparison (mongodb vs postgres)
- mini db engine in python (sql parser + file i/o)
- query performance analyzer (indexing and joins; see the sketch below)
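a minimal sketch of the query performance analyzer idea using sqlite's explain query plan; the table, index, and data are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(10_000)],
)

def show_plan(sql):
    # explain query plan reports whether sqlite scans the table or uses an index
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(row)

query = "SELECT * FROM orders WHERE customer_id = 42"
show_plan(query)  # SCAN orders (full table scan)
conn.execute("CREATE INDEX idx_customer ON orders(customer_id)")
show_plan(query)  # SEARCH orders USING INDEX idx_customer
```

the same before/after comparison translates directly to postgres `EXPLAIN ANALYZE`.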
🔄 level 3: etl + batch pipeline
goal: understand batch processing, transformations, and orchestration.
- weather pipeline (api → s3 → parquet → postgres)
- data quality framework (mini great-expectations clone)
- data profiler tool (statistics + nulls + schema validation; see the sketch after this list)
- incremental etl with airflow (scheduler + retries)
- airflow dag monitoring dashboard (metrics extraction)
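a minimal sketch of the data profiler in plain python; the expected-schema dict is an assumed contract for demonstration:

```python
EXPECTED_SCHEMA = {"id": int, "city": str, "temp_c": float}  # assumed column contract

def profile(rows):
    """per-column null counts, type mismatches, and basic numeric stats."""
    report = {col: {"nulls": 0, "bad_type": 0, "values": []} for col in EXPECTED_SCHEMA}
    for row in rows:
        for col, expected_type in EXPECTED_SCHEMA.items():
            value = row.get(col)
            if value is None:
                report[col]["nulls"] += 1
            elif not isinstance(value, expected_type):
                report[col]["bad_type"] += 1
            else:
                report[col]["values"].append(value)
    for col, stats in report.items():
        values = stats.pop("values")
        stats["valid"] = len(values)
        if values and all(isinstance(v, (int, float)) for v in values):
            stats["min"], stats["max"] = min(values), max(values)
    return report

rows = [
    {"id": 1, "city": "pune", "temp_c": 31.2},
    {"id": 2, "city": None, "temp_c": "n/a"},  # one null, one type violation
]
print(profile(rows))
```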
🌐 level 4: api ingestion + external data
goal: learn data ingestion, scraping, api rate control, and websockets.
- crypto price ingestion (websocket → kafka → postgres)
- news/web scraper (fetch → parse → store → visualize)
- ingestion framework (retry/backoff logic, monitoring; see the sketch after this list)
- websocket stream visualizer (flow diagram and stats)
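a minimal sketch of the retry/backoff core of the ingestion framework; fetch_prices and its failure pattern are simulated assumptions:

```python
import random
import time

def retry_with_backoff(max_attempts=5, base_delay=0.5, max_delay=30.0):
    """retry a flaky callable with exponential backoff plus jitter."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        raise  # retries exhausted; surface the error
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    delay *= random.uniform(0.5, 1.5)  # jitter avoids thundering herds
                    print(f"attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"n": 0}

@retry_with_backoff(max_attempts=4, base_delay=0.1)
def fetch_prices():
    calls["n"] += 1
    if calls["n"] < 3:  # simulated flaky upstream: first two calls fail
        raise ConnectionError("upstream timeout")
    return {"btc_usd": 67250.0}

print(fetch_prices())  # succeeds on the third attempt
```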
☁️ level 5: cloud + containerization fundamentals
goal: deploy small pipelines using docker, ci/cd, and aws services.
- airflow on ec2 with s3/rds backend
- lambda-based etl automation (see the sketch after this list)
- spark job on emr reading parquet from s3
- dockerized data pipeline with docker-compose
- github actions ci/cd for etl pipelines
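a minimal sketch of the lambda-based etl automation idea, assuming an s3 put trigger; the region, destination bucket, and row-cleaning rule are illustrative assumptions:

```python
import csv
import io

import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # region is an assumption
CLEAN_BUCKET = "my-clean-bucket"                  # assumed destination bucket

def handler(event, context):
    """s3 put event -> read raw csv, drop rows with empty fields, write cleaned copy."""
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(body)))
    header, data = rows[0], rows[1:]
    cleaned = [r for r in data if all(field.strip() for field in r)]

    out = io.StringIO()
    csv.writer(out).writerows([header] + cleaned)
    s3.put_object(Bucket=CLEAN_BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))
    return {"input_rows": len(data), "kept_rows": len(cleaned)}
```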
💾 level 6: data warehousing + lakehouse design
goal: implement a real lakehouse with partitioning and a query layer.
- data lakehouse pipeline (kafka → iceberg → trino → superset)
- warehouse benchmarking (redshift vs bigquery vs trino)
- schema evolution demo (iceberg schema change handling)
- custom partitioning simulator (read-time comparison; see the sketch below)
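a minimal sketch of the partitioning simulator: the same aggregation over a flat event list versus a day-partitioned dict, timed side by side; the data shape is an illustrative assumption:

```python
import random
import time

# simulate a table of events as one flat log versus partitioned by day
EVENTS = [(f"2024-01-{d:02d}", random.random()) for d in range(1, 31) for _ in range(20_000)]
PARTITIONS = {}
for day, value in EVENTS:
    PARTITIONS.setdefault(day, []).append(value)

def full_scan(target_day):
    return sum(v for day, v in EVENTS if day == target_day)  # touches every row

def partition_pruned(target_day):
    return sum(PARTITIONS[target_day])  # touches only the matching partition

for fn in (full_scan, partition_pruned):
    start = time.perf_counter()
    total = fn("2024-01-15")
    print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s (sum={total:.1f})")
```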
🔥 level 7: big data processing and streaming
goal: handle distributed data with spark, kafka, and flink, including consistency models.
- spark batch analytics (millions of rows to parquet)
- kafka streaming analytics (trade data → spark stream → redis)
- flink fraud detector (windowed anomaly detection)
- hdfs simulator (namenode replication logic)
- kafka-lite broker (python simulation of partitions and offsets; see the sketch below)
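a minimal sketch of the kafka-lite broker: key-hashed partitions, append-only logs, and per-group committed offsets, all in memory:

```python
class KafkaLite:
    """toy broker: fixed partitions, append-only logs, per-group offsets."""

    def __init__(self, num_partitions=3):
        self.num_partitions = num_partitions
        self.logs = [[] for _ in range(num_partitions)]  # one log per partition
        self.offsets = {}                                # (group, partition) -> next offset

    def produce(self, key, value):
        partition = hash(key) % self.num_partitions      # key-based partitioning
        self.logs[partition].append(value)
        return partition, len(self.logs[partition]) - 1  # (partition, offset)

    def consume(self, group, partition, max_records=10):
        start = self.offsets.get((group, partition), 0)
        records = self.logs[partition][start:start + max_records]
        self.offsets[(group, partition)] = start + len(records)  # commit
        return records

broker = KafkaLite()
for i in range(5):
    broker.produce(key=f"user-{i % 2}", value=f"event-{i}")
for p in range(broker.num_partitions):
    print(p, broker.consume("analytics", p))
```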
🔧 level 8: performance and scalability optimization
goal: understand partitioning, caching, vectorization, and backpressure.
- spark tuning benchmark (shuffle vs broadcast join)
- query optimization analyzer (compare explain plans)
- vectorized dataframe benchmark (pandas vs polars vs spark; see the sketch after this list)
- kafka consumer lag visualizer (real-time lag chart)
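a minimal flavor of the vectorized dataframe benchmark, comparing a row-at-a-time python loop against pandas vectorization (polars and spark would slot into the same harness); the data size is arbitrary:

```python
import time

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({"a": np.random.rand(n), "b": np.random.rand(n)})

start = time.perf_counter()
slow = [row.a * row.b for row in df.itertuples()]  # row-at-a-time python loop
t_loop = time.perf_counter() - start

start = time.perf_counter()
fast = df["a"] * df["b"]                           # vectorized, runs in numpy
t_vec = time.perf_counter() - start

assert np.allclose(slow, fast)
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.4f}s  speedup ~{t_loop / t_vec:.0f}x")
```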
🎛️ level 9: orchestration + governance + monitoring
goal: build a full control layer for metrics, logging, lineage, and alerts.
- airflow monitoring dashboard (grafana + prometheus)
- etl log ingestion to elk stack (elasticsearch, logstash, kibana)
- prometheus alerting system with slack webhook
- pipeline health checker (freshness and row count validation; see the sketch after this list)
- data catalog / schema registry demo (metadata management)
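a minimal sketch of the pipeline health checker; the table metadata and thresholds are illustrative assumptions standing in for a warehouse's information schema:

```python
from datetime import datetime, timedelta, timezone

# assumed load metadata, e.g. pulled from a warehouse information schema
TABLES = {
    "orders":  {"last_loaded": datetime.now(timezone.utc) - timedelta(minutes=20), "row_count": 10_412},
    "refunds": {"last_loaded": datetime.now(timezone.utc) - timedelta(hours=7), "row_count": 0},
}
RULES = {"max_staleness": timedelta(hours=6), "min_rows": 1}  # assumed thresholds

def check_health(tables, rules):
    now = datetime.now(timezone.utc)
    alerts = []
    for name, meta in tables.items():
        if now - meta["last_loaded"] > rules["max_staleness"]:
            alerts.append(f"{name}: stale (last load {meta['last_loaded']:%H:%M} utc)")
        if meta["row_count"] < rules["min_rows"]:
            alerts.append(f"{name}: row count {meta['row_count']} below minimum")
    return alerts

for alert in check_health(TABLES, RULES):
    print("ALERT:", alert)
```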
💡 level 10: system design + hybrid architectures
goal: integrate batch + stream pipelines and event-driven microservices.
- hybrid data architecture (api → kafka → spark stream → iceberg)
- fault tolerance simulator (kafka consumer crash → auto replay; see the sketch after this list)
- event-driven microservice system (notification + accounting)
- data mesh simulation (domain-based data ownership)
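a minimal sketch of the fault tolerance simulator: a consumer commits offsets to a durable checkpoint, "crashes" mid-batch, and replays the uncommitted events on restart; plain python objects stand in for kafka and the checkpoint store:

```python
class ReplayableConsumer:
    """consumer that persists its committed offset so a restart replays from there."""

    def __init__(self, log, checkpoint):
        self.log = log                # shared append-only event log
        self.checkpoint = checkpoint  # dict standing in for durable storage
        self.position = checkpoint.get("offset", 0)

    def poll(self, n=3):
        batch = self.log[self.position:self.position + n]
        self.position += len(batch)
        return batch

    def commit(self):
        self.checkpoint["offset"] = self.position  # durable commit point

log = [f"event-{i}" for i in range(10)]
checkpoint = {}

consumer = ReplayableConsumer(log, checkpoint)
print("first run:", consumer.poll())   # reads events 0-2
consumer.commit()
print("uncommitted:", consumer.poll()) # reads events 3-5, then "crashes" before commit

consumer = ReplayableConsumer(log, checkpoint)  # restart from last durable checkpoint
print("after restart:", consumer.poll())        # replays events 3-5
```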
🤖 level 11: ml pipeline preparation
goal: integrate data engineering with ml feature systems.
- feature pipeline (postgres → cleaning → feast feature store)
- retraining dag (automated airflow-based ml retraining)
- feature drift detection (distributional shift alerting; see the sketch below)
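a minimal sketch of feature drift detection using the population stability index; the 0.2 alert threshold is a common rule of thumb, and the gaussian samples are simulated:

```python
import math
import random

def psi(expected, actual, bins=10):
    """population stability index between two numeric samples."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]  # bin boundaries

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # index = edges crossed
        return [max(c / len(sample), 1e-6) for c in counts]  # floor avoids log(0)

    p, q = fractions(actual), fractions(expected)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(5_000)]  # training distribution
live = [random.gauss(0.6, 1.2) for _ in range(5_000)]   # simulated drifted feature
score = psi(train, live)
print(f"psi={score:.3f}", "-> drift alert" if score > 0.2 else "-> stable")
```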
💰 level 12: decentralized + fintech engineering
goal: combine blockchain, streaming, and analytics into real fintech systems.
- blazpay transaction analytics (hyperledger + kafka + spark + postgres)
- bks mygold tokenized asset pipeline (solidity + iceberg analytics)
- vault reconciliation system (ledger vs vault validation)
- transaction intelligence (fraud/anomaly scoring)
- accounting microservice (double-entry with kafka replay; see the sketch after this list)
- rwa lakehouse (multi-asset data warehouse for on-chain data)
- oracle connector (price feed + external event integration)
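a minimal sketch of the accounting microservice's core invariant: double-entry postings rebuilt by replaying an event log (a plain list stands in for the kafka topic):

```python
from collections import defaultdict

def apply_entry(balances, entry):
    """post one double-entry transaction; debits must equal credits."""
    total_debit = sum(leg["amount"] for leg in entry["legs"] if leg["side"] == "debit")
    total_credit = sum(leg["amount"] for leg in entry["legs"] if leg["side"] == "credit")
    if total_debit != total_credit:
        raise ValueError(f"unbalanced entry {entry['id']}: {total_debit} != {total_credit}")
    for leg in entry["legs"]:
        sign = 1 if leg["side"] == "debit" else -1
        balances[leg["account"]] += sign * leg["amount"]

def replay(event_log):
    """rebuild account balances deterministically from the event log."""
    balances = defaultdict(int)
    for entry in event_log:
        apply_entry(balances, entry)
    return dict(balances)

log = [
    {"id": 1, "legs": [{"account": "cash", "side": "debit", "amount": 100},
                       {"account": "revenue", "side": "credit", "amount": 100}]},
    {"id": 2, "legs": [{"account": "expenses", "side": "debit", "amount": 40},
                       {"account": "cash", "side": "credit", "amount": 40}]},
]
print(replay(log))  # {'cash': 60, 'revenue': -100, 'expenses': 40}
```

because balances are derived purely from the log, crash recovery is just replaying from offset zero.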
🧾 level 13: documentation + communication
goal: communicate and visualize complex systems effectively.
- project documentation repo (markdown + mermaid diagrams)
- data contract generator (schema validation registry; see the sketch after this list)
- reproducible jupyter pipeline (papermill automation)
- whiteboard interview exercises (architecture explanation practice)
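a minimal sketch of the validation side of the data contract generator; the contract format (field name -> type, required flag) is an illustrative assumption:

```python
# assumed contract format: field name -> (type, required)
ORDERS_CONTRACT = {
    "order_id": (int, True),
    "currency": (str, True),
    "discount": (float, False),
}

def validate(record, contract):
    """return a list of contract violations for one record."""
    errors = []
    for field, (ftype, required) in contract.items():
        if field not in record:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"{field}: expected {ftype.__name__}, got {type(record[field]).__name__}")
    for field in set(record) - set(contract):
        errors.append(f"unexpected field: {field}")  # schema drift signal
    return errors

print(validate({"order_id": 7, "currency": "usd"}, ORDERS_CONTRACT))  # []
print(validate({"order_id": "7", "region": "eu"}, ORDERS_CONTRACT))   # 3 violations
```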
🧠 learning gradient (conceptual slope)
| phase | learning type | focus | example tools |
| --- | --- | --- | --- |
| level 1–2 | structural | cs + dbms | python, sql |
| level 3–5 | procedural | etl + orchestration | airflow, s3, postgres |
| level 6–8 | distributed | spark + kafka + lakehouse | spark, iceberg, trino |
| level 9–10 | architectural | observability + hybrid | grafana, kafka, airflow |
| level 11–12 | applied | ml + web3 fintech | feast, hyperledger |
| level 13 | communication | explanation + diagramming | excalidraw, mermaid |