
topics

build iteratively: start with mini-projects that connect foundational skills, then layer in scalability and observability. adapt the list to your own ideas; quantity builds momentum, and quality emerges from iteration. aim for 10+ projects, and document trade-offs and metrics in readmes for portfolio depth.

  • core programming & algorithms

    skills: python, data structures, algorithms, problem solving, leetcode patterns

    purpose: build efficient code for data manipulation and scalable solutions.

    target: solve 80+ problems covering arrays, trees, graphs, and dynamic programming; optimize a python script for o(n) time.
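
    sketch: a minimal o(n) hash-map pattern (the kind behind many array problems); the function name and inputs are illustrative, not part of the roadmap.

      # single pass with a value -> index map instead of a nested o(n^2) scan
      def two_sum(nums, target):
          seen = {}  # value -> index where it was seen
          for i, x in enumerate(nums):
              if target - x in seen:
                  return seen[target - x], i
              seen[x] = i
          return None

      assert two_sum([2, 7, 11, 15], 9) == (0, 1)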

  • foundations

    skills: sql, linux, git, data modeling, unix shells

    purpose: core tooling & thinking needed to manipulate, store, and query data.

    target: implement 5 normalized schemas, write 20 sql queries, and automate tasks with shell scripts.
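
    sketch: a tiny normalized schema plus a join query using python's built-in sqlite3; table and column names are made up for illustration.

      import sqlite3

      # two normalized tables linked by a foreign key, queried with a join
      con = sqlite3.connect(":memory:")
      con.executescript("""
          create table customers (id integer primary key, name text not null);
          create table orders (
              id integer primary key,
              customer_id integer not null references customers(id),
              amount real not null
          );
          insert into customers values (1, 'ada'), (2, 'grace');
          insert into orders values (1, 1, 30.0), (2, 1, 12.5), (3, 2, 99.9);
      """)
      rows = con.execute("""
          select c.name, count(*) as n_orders, sum(o.amount) as total
          from orders o join customers c on c.id = o.customer_id
          group by c.name order by total desc
      """).fetchall()
      print(rows)  # [('grace', 1, 99.9), ('ada', 2, 42.5)]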

  • cloud platforms

    skills: aws/gcp/azure, s3/bigquery, iam, serverless (lambda/cloud functions)

    purpose: leverage cloud services for scalable storage and compute without vendor lock-in.

    target: deploy an etl pipeline to one cloud using the free tier and query a 1tb dataset in under 10s.
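
    sketch: a minimal s3-triggered lambda handler, assuming an aws lambda python runtime with boto3 available; bucket and key come from the event, nothing here is a prescribed design.

      import json
      import boto3

      s3 = boto3.client("s3")

      def handler(event, context):
          # s3 put notification -> read the new object and log its size
          record = event["Records"][0]["s3"]
          bucket = record["bucket"]["name"]
          key = record["object"]["key"]
          body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
          print(json.dumps({"bucket": bucket, "key": key, "bytes": len(body)}))
          return {"statusCode": 200}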

  • ingestion & storage

    skills: apis, batch jobs, s3, object storage, parquet, partitioning

    purpose: reliably bring data into persistent storage with cost- and query-efficient layout.

    target: implement an idempotent api → s3 ingestion and demonstrate partition pruning on queries.
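
    sketch: one way to make an api → s3 load idempotent, assuming requests, pandas with a parquet engine, and boto3; the api url, bucket, and hive-style dt= key are placeholders.

      import io
      import boto3
      import pandas as pd
      import requests

      def ingest(run_date, api_url, bucket):
          # deterministic key per run date, so a re-run overwrites instead of duplicating
          records = requests.get(api_url, params={"date": run_date}, timeout=30).json()
          df = pd.DataFrame(records)       # assumes the api returns a list of records
          buf = io.BytesIO()
          df.to_parquet(buf, index=False)  # needs pyarrow (or fastparquet) installed
          key = f"raw/events/dt={run_date}/data.parquet"  # hive-style partition for pruning
          boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
          return key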

  • orchestration & automation

    skills: airflow, dagster, cron, kubernetes jobs, ci/cd

    purpose: schedule, monitor, and recover data workflows reproducibly.

    target: build a dag that retries, backfills, and exposes basic metrics, with a passing ci run.
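
    sketch: a minimal airflow 2.x dag with retries and catchup enabled so backfills work; the dag id, schedule, and task are illustrative only.

      from datetime import datetime, timedelta
      from airflow import DAG
      from airflow.operators.python import PythonOperator

      def extract(ds, **_):
          # ds is the logical date airflow injects, which is what makes backfills work
          print(f"extracting partition {ds}")

      with DAG(
          dag_id="daily_ingest",
          start_date=datetime(2024, 1, 1),
          schedule_interval="@daily",
          catchup=True,  # lets airflow dags backfill re-run past partitions
          default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
      ) as dag:
          PythonOperator(task_id="extract", python_callable=extract)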

  • transformation & modeling

    skills: dbt, sql modeling, dimension tables, slowly changing dimensions

    purpose: convert raw data into tested, documented analytics models.

    target: author dbt models with tests and pass dbt test in ci for a sample mart.
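
    sketch: not dbt itself, but a pandas stand-in for the same idea, deduplicating raw rows into a dimension and enforcing dbt-style unique/not_null tests; swap in real dbt models and schema tests for the actual target.

      import pandas as pd

      raw = pd.DataFrame({
          "customer_id": [1, 1, 2, 3],
          "email": ["a@x.io", "a@y.io", "b@x.io", None],
          "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15", "2024-01-20"]),
      })

      # keep the latest row per key, i.e. a deduplicated customer dimension
      dim = (raw.sort_values("updated_at")
                .drop_duplicates("customer_id", keep="last")
                .reset_index(drop=True))

      # the equivalent of dbt's unique + not_null schema tests on the key
      assert dim["customer_id"].is_unique
      assert dim["customer_id"].notna().all()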

  • big data processing

    skills: spark (batch/pyspark), hadoop basics, distributed computing

    purpose: handle large-scale batch transformations efficiently.

    target: process a 1gb dataset with spark sql joins and aggregations, optimizing for shuffle reduction.
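
    sketch: a pyspark batch job that broadcasts the small dimension to cut shuffle on the large fact table; paths and column names are placeholders.

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("batch-agg").getOrCreate()
      orders = spark.read.parquet("s3a://my-bucket/orders/")        # large fact table
      customers = spark.read.parquet("s3a://my-bucket/customers/")  # small dimension

      # broadcasting the small side avoids shuffling the large fact table for the join
      result = (orders
                .join(F.broadcast(customers), "customer_id")
                .groupBy("country")
                .agg(F.sum("amount").alias("revenue"),
                     F.countDistinct("customer_id").alias("buyers")))
      result.write.mode("overwrite").parquet("s3a://my-bucket/marts/revenue_by_country/")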

  • streaming & realtime

    skills: kafka, spark structured streaming, flink, stream joins, windowing

    purpose: process event streams with bounded latency and correct window semantics.

    target: implement a sliding-window aggregation with late-data handling and measure p95 latency.
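
    sketch: a spark structured streaming sliding window with a watermark for late data, assuming the kafka source package (spark-sql-kafka) is on the classpath; broker, topic, and durations are placeholders.

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("stream-agg").getOrCreate()
      events = (spark.readStream.format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "events")
                .load()
                .select(F.col("timestamp").alias("event_time"),
                        F.col("value").cast("string").alias("payload")))

      # 10-minute windows sliding every 5 minutes; the watermark bounds how late data may arrive
      counts = (events
                .withWatermark("event_time", "15 minutes")
                .groupBy(F.window("event_time", "10 minutes", "5 minutes"))
                .count())

      query = counts.writeStream.outputMode("update").format("console").start()
      query.awaitTermination()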

  • warehousing & lakehouse

    skills: redshift/bigquery/snowflake, iceberg/delta, trino/presto

    purpose: provide fast analytical queries over large datasets with acid/partitioning guarantees.

    target: benchmark a representative query and report latency and scanned bytes.
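
    sketch: a simple latency and scanned-bytes benchmark, assuming the google-cloud-bigquery client and configured credentials; the project, dataset, and query are placeholders.

      import time
      from google.cloud import bigquery

      client = bigquery.Client()  # assumes gcp credentials are configured
      sql = """
          select date(created_at) as day, count(*) as orders
          from `my_project.analytics.orders`   -- placeholder table
          where created_at >= '2024-01-01'
          group by day
      """
      start = time.perf_counter()
      job = client.query(sql)
      job.result()  # block until the query finishes
      elapsed = time.perf_counter() - start
      print(f"latency: {elapsed:.2f}s, scanned: {job.total_bytes_processed / 1e9:.2f} gb")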

  • observability & data quality

    skills: prometheus, grafana, great expectations, data contracts, lineage

    purpose: detect, alert, and explain data issues; ensure contractual guarantees.

    target: add 5 expectations/tests and a dashboard that alerts on failures.
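
    sketch: five expectation-style checks written as plain pandas asserts; in practice great expectations or a data contract would declare and report these, this just shows the shape.

      import pandas as pd

      orders = pd.DataFrame({
          "order_id": [1, 2, 3],
          "amount": [10.0, 25.5, 7.0],
          "status": ["new", "paid", "shipped"],
          "created_at": ["2024-03-01", "2024-03-02", "2024-03-02"],
      })

      # five expectation-style checks on the batch
      checks = {
          "order_id not null": orders["order_id"].notna().all(),
          "order_id unique": orders["order_id"].is_unique,
          "amount non-negative": (orders["amount"] >= 0).all(),
          "status in allowed set": orders["status"].isin(["new", "paid", "shipped"]).all(),
          "created_at parseable": pd.to_datetime(orders["created_at"], errors="coerce").notna().all(),
      }
      failures = [name for name, ok in checks.items() if not ok]
      if failures:
          raise ValueError(f"data quality checks failed: {failures}")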

  • data governance & security

    skills: rbac/iam, encryption, compliance (gdpr), data cataloging, ethics

    purpose: manage access, privacy, and ethical use of data at scale.

    target: implement role-based access for a pipeline and audit lineage for a dataset.
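
    sketch: a least-privilege, read-only iam policy scoped to one pipeline's s3 prefix, created via boto3; bucket, prefix, and policy name are placeholders, not a recommended production policy.

      import json
      import boto3

      # read-only access scoped to one pipeline's prefix
      policy = {
          "Version": "2012-10-17",
          "Statement": [
              {"Effect": "Allow", "Action": "s3:ListBucket",
               "Resource": "arn:aws:s3:::my-data-lake"},
              {"Effect": "Allow", "Action": "s3:GetObject",
               "Resource": "arn:aws:s3:::my-data-lake/raw/events/*"},
          ],
      }
      iam = boto3.client("iam")
      iam.create_policy(PolicyName="events-readonly",
                        PolicyDocument=json.dumps(policy))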

  • infra & scale

    skills: terraform, docker, k8s, cost optimization, resource management

    purpose: provision reproducible infra and control access/costs at scale.

    target: codify infra for one pipeline with terraform + a cost estimate and iam policy.

  • system design

    skills: architecture patterns, scalability, trade-offs, diagramming

    purpose: design robust, end-to-end data systems for real-world scenarios.

    target: diagram and document 10 designs (e.g., data lake, realtime pipeline) with bottlenecks addressed.

  • ml & feature engineering

    skills: feast, mlflow, offline-online parity, model monitoring

    purpose: produce stable features and serve models with consistency between training and serving.

    target: build a simple feature pipeline + online store and show offline-online parity checks.
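
    sketch: an offline-online parity check with the online store stubbed as a dict (standing in for e.g. feast + redis); the feature, entities, and tolerance are illustrative.

      import pandas as pd

      # offline: the feature computed from the event history (inline sample data)
      events = pd.DataFrame({"user_id": [101, 101, 102], "amount": [40.0, 44.0, 13.5]})
      offline = events.groupby("user_id")["amount"].mean()

      # online: whatever the serving store returns, stubbed here as a plain dict
      online = {101: 42.0, 102: 13.5}

      # parity check: training-time and serving-time values should agree
      for user_id, served in online.items():
          expected = offline.loc[user_id]
          assert abs(served - expected) < 1e-6, (
              f"parity drift for user {user_id}: offline={expected}, online={served}")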