appendix
for each project, include a short “what i would change at scale” note.
sql window-functions practice
- objective: compute running totals, moving averages, sessionization, churn, and retention from a transactions table.
- tech: postgres or sqlite.
- deliverable: 10 parameterized sql queries + sample dataset + expected results.
- acceptance criteria: queries run on the sample data and produce the documented outputs; each query includes a short explanation and a complexity note.
- metric: correctness vs expected rows (100%), and execution time for datasets of 10k / 100k rows.
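a minimal sketch of one of the ten queries (a per-user running total), runnable on sqlite 3.25+, the first release with window functions; the transactions(user_id, ts, amount) layout is an illustrative assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INT, ts TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "2024-01-01", 10.0), (1, "2024-01-02", 5.0), (2, "2024-01-01", 7.0)],
)

# running total per user, ordered by transaction time
running_total = """
SELECT user_id, ts, amount,
       SUM(amount) OVER (
           PARTITION BY user_id ORDER BY ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM transactions
ORDER BY user_id, ts
"""
for row in conn.execute(running_total):
    print(row)
```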
etl idempotency and schema drift test
- objective: implement an etl job that can safely re-run after upstream schema changes (column rename/add).
- tech: python (pandas), postgres, docker-compose.
- deliverable: etl script with schema-mapping layer, unit tests simulating drift, migration log.
- acceptance criteria: etl handles 3 simulated schema changes without data loss; tests pass.
- metric: percent of fields mapped automatically vs manually; test coverage >= 80%.
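a minimal sketch of the schema-mapping layer, assuming pandas and an explicit alias map; the column names are illustrative. unknown upstream columns are logged instead of silently dropped, which is what keeps re-runs safe under drift:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

# canonical schema the warehouse expects -> known upstream aliases
COLUMN_MAP = {
    "user_id": ["user_id", "uid", "userId"],
    "amount": ["amount", "amt"],
}

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame()
    for canonical, aliases in COLUMN_MAP.items():
        match = next((a for a in aliases if a in df.columns), None)
        if match is None:
            raise ValueError(f"required column {canonical!r} missing upstream")
        out[canonical] = df[match]
    # surface drift instead of swallowing it
    known = {a for aliases in COLUMN_MAP.values() for a in aliases}
    extra = set(df.columns) - known
    if extra:
        logging.info("unmapped upstream columns (schema drift?): %s", extra)
    return out
```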
airflow failure-mode injection & testing
- objective: create a dag and simulate 5 failure modes (downstream db outage, network timeout, partial data, task crash, retry storm) and add tests/guards.
- tech: apache airflow (docker), pytest, local postgres.
- deliverable: dag, test harness, failure injection scripts, documented mitigations.
- acceptance criteria: dag recovers or fails safely for each mode; alerts are triggered for non-recoverable states.
- metric: mean time to detect (mttd) for failures; % of automatic recoveries.
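a minimal sketch of one guard, bounded retries with exponential backoff plus an alert-on-final-failure callback, using the airflow 2.x operator api; the dag/task names and the notify() stub are assumptions:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify(context):
    # stand-in for a real alert (slack/pagerduty); context carries the failed task
    print(f"ALERT: task {context['task_instance'].task_id} failed permanently")

def flaky_extract():
    raise RuntimeError("simulated downstream db outage")

with DAG(
    dag_id="failure_injection_demo",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract",
        python_callable=flaky_extract,
        retries=3,
        retry_delay=timedelta(seconds=30),
        retry_exponential_backoff=True,   # damps retry storms
        on_failure_callback=notify,       # fires only once retries are exhausted
    )
```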
kafka partitioning and consumer-lag experiment
- objective: create topic with multiple partition strategies and measure consumer lag vs producer throughput.
- tech: kafka (docker), python producers/consumers.
- deliverable: experiment scripts, charts of throughput vs lag, recommended partition strategy.
- acceptance criteria: reproducible benchmark showing behavior for 1k, 10k, 100k msg/sec.
- metric: average lag, tail-latency (95th, 99th), throughput.
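a minimal sketch of the lag probe, assuming the kafka-python client and a topic named "events" on localhost:9092; per-partition lag is the log-end offset minus the consumer's current position:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="lag-experiment",
    enable_auto_commit=False,
)
partitions = [TopicPartition("events", p)
              for p in consumer.partitions_for_topic("events")]
consumer.assign(partitions)

end_offsets = consumer.end_offsets(partitions)  # log-end offset per partition
for tp in partitions:
    # position() resolves from the group's committed offset (or the reset policy)
    lag = end_offsets[tp] - consumer.position(tp)
    print(f"partition {tp.partition}: lag={lag}")
```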
realtime sliding-window aggregation
- objective: implement a streaming job that computes per-user sliding-window metrics (1m, 5m, 1h) with late data handling.
- tech: spark structured streaming or flink, kafka.
- deliverable: streaming job, synthetic event generator, tests for late/duplicate events.
- acceptance criteria: window correctness under 2-minute event delay; deduplication works.
- metric: accuracy (%) vs ground truth, processing latency median/p95.
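a minimal sketch of the core job in pyspark structured streaming: a 5-minute window sliding every minute, a 2-minute watermark for late data, and watermark-bounded deduplication; the topic name and json layout are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("sliding-windows").getOrCreate()

schema = StructType([StructField("user_id", StringType()),
                     StructField("event_time", TimestampType())])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events").load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

counts = (events
          .withWatermark("event_time", "2 minutes")      # late-data bound
          .dropDuplicates(["user_id", "event_time"])      # dedup within watermark
          .groupBy(window("event_time", "5 minutes", "1 minute"), "user_id")
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```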
cdc-to-lakehouse pipeline (mini)
- objective: capture inserts/updates/deletes from postgres and apply to iceberg/delta table with correct upserts.
- tech: debezium, kafka, consumer app (spark/pyspark) writing iceberg/delta to local fs or an s3 mock.
- deliverable: configs, consumer app, verification script comparing source vs analytic view.
- acceptance criteria: analytic table matches source after random sequence of ops including deletes.
- metric: data consistency rate, replication lag.
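a minimal sketch of applying one cdc micro-batch with a delta merge (an iceberg variant would use MERGE INTO sql instead), assuming the delta-spark package, an existing SparkSession `spark`, and debezium-style op codes (c/u = upsert, d = delete); the path and `id` key are assumptions:

```python
from delta.tables import DeltaTable

def apply_cdc_batch(micro_batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/tmp/lakehouse/users")
    (target.alias("t")
     .merge(micro_batch_df.alias("s"), "t.id = s.id")
     .whenMatchedDelete(condition="s.op = 'd'")          # tombstones delete
     .whenMatchedUpdateAll(condition="s.op != 'd'")      # updates upsert
     .whenNotMatchedInsertAll(condition="s.op != 'd'")   # inserts land fresh
     .execute())

# wired into the stream with foreachBatch:
# stream.writeStream.foreachBatch(apply_cdc_batch).start()
```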
dbt modular package & contract tests
- objective: write a dbt package for a domain with models + contract tests enforcing column types, nullability, uniqueness.
- tech: dbt core, postgres.
- deliverable: dbt project, contract tests, ci yaml for dbt run/test.
- acceptance criteria: all tests pass in ci; schema contract violation produces clear failure.
- metric: number of contract violations detected during simulated upstream changes.
data quality pipeline using great-expectations
- objective: integrate great-expectations into an etl pipeline and automate expectation validation & reporting.
- tech: python, great-expectations, cloud or local storage.
- deliverable: pipeline with expectations, failing dataset example, html reports.
- acceptance criteria: pipeline rejects data that violates one or more expectations and logs a clear error.
- metric: false positive/negative rates on synthetic anomalies.
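a minimal sketch of the validation gate using great-expectations' legacy pandas-dataset api (newer releases changed the interface, so treat this as illustrative); the batch is rejected if any expectation fails:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, None], "amount": [10.0, -5.0, 3.0]})
gdf = ge.from_pandas(df)   # legacy v0-style wrapper around the dataframe

results = [
    gdf.expect_column_values_to_not_be_null("user_id"),
    gdf.expect_column_values_to_be_between("amount", min_value=0),
]
failed = [r for r in results if not r.success]
if failed:
    raise ValueError(f"batch rejected: {len(failed)} expectation(s) failed")
```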
partitioning & compaction strategy for parquet datasets
- objective: design and benchmark partitioning scheme (by date/user/hash) and compaction frequency for parquet on s3/local.
- tech: python, parquet, pyarrow, localfs/s3mock.
- deliverable: benchmarking script, recommended strategy with cost/latency trade-offs.
- acceptance criteria: benchmark shows query time improvement and acceptable write overhead.
- metric: average query time, storage overhead, number of small files.
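a minimal sketch of the small-files measurement: write the same table under a coarse and a fine partition scheme and count the parquet files each produces; column names and cardinalities are assumptions:

```python
import pathlib
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"] * 1000,
    "user_id": [i % 300 for i in range(3000)],
    "amount": [1.0] * 3000,
})

for cols in (["date"], ["date", "user_id"]):   # coarse vs deliberately fine
    root = f"/tmp/bench_{'_'.join(cols)}"
    pq.write_to_dataset(table, root_path=root, partition_cols=cols)
    n_files = sum(1 for _ in pathlib.Path(root).rglob("*.parquet"))
    print(f"partition by {cols}: {n_files} files")
```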
schema evolution test suite
- objective: simulate upstream schema changes (field add/remove/type change) and demonstrate downstream handling (compatibility modes).
- tech: avro/parquet schemas, small consumer app, pytest.
- deliverable: test scenarios, migration scripts, rollback plan.
- acceptance criteria: downstream consumer can handle 3 schema versions; no silent data corruption.
- metric: number of breaking changes detected, recovery time.
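a minimal sketch of backward-compatible evolution with fastavro: v2 adds a defaulted field, so records written under v1 remain readable by a v2 consumer; the schemas are illustrative assumptions:

```python
import io
from fastavro import schemaless_reader, schemaless_writer

v1 = {"type": "record", "name": "User",
      "fields": [{"name": "id", "type": "long"}]}
v2 = {"type": "record", "name": "User",
      "fields": [{"name": "id", "type": "long"},
                 {"name": "country", "type": "string", "default": "unknown"}]}

buf = io.BytesIO()
schemaless_writer(buf, v1, {"id": 42})     # producer still on v1
buf.seek(0)
record = schemaless_reader(buf, v1, v2)    # consumer already on v2
assert record == {"id": 42, "country": "unknown"}   # default fills the gap
```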
data lineage proof-of-concept
- objective: implement lightweight lineage capture (task-level) and visualize lineage for a simple pipeline.
- tech: airflow/xcom or custom metadata emitter, neo4j or sqlite+simple ui.
- deliverable: lineage graph, query interface, examples linking raw -> transformed -> mart.
- acceptance criteria: lineage shows upstream fields for any mart column.
- metric: percent of assets with lineage coverage.
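a minimal sketch of the sqlite variant: lineage edges emitted per task plus a recursive query that walks upstream from any mart asset; the asset names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage (downstream TEXT, upstream TEXT)")
conn.executemany("INSERT INTO lineage VALUES (?, ?)", [
    ("mart.revenue", "staging.orders"),
    ("staging.orders", "raw.orders"),
])

# walk all transitive upstream assets of a given mart table
upstream_of = """
WITH RECURSIVE ancestors(asset) AS (
    SELECT upstream FROM lineage WHERE downstream = ?
    UNION
    SELECT l.upstream FROM lineage l JOIN ancestors a ON l.downstream = a.asset
)
SELECT asset FROM ancestors
"""
print(conn.execute(upstream_of, ("mart.revenue",)).fetchall())
# -> [('staging.orders',), ('raw.orders',)]
```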
cost-optimization experiment for cloud storage
- objective: compare storage/costs for different data retention policies and storage classes (hot/warm/cold).
- tech: a cost-calculator spreadsheet, or simulated costs over sample data sizes.
- deliverable: cost model, recommendations, scripts to lifecycle transition sample data.
- acceptance criteria: demonstrate >=30% cost savings with acceptable access latency trade-off.
- metric: estimated monthly cost, access latency increase.
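a minimal sketch of the cost model; the per-gb-month prices are placeholders, not real cloud rates, so substitute your provider's pricing:

```python
TIERS = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}  # $/GB-month, assumed

def monthly_cost(gb_by_tier: dict) -> float:
    return sum(TIERS[tier] * gb for tier, gb in gb_by_tier.items())

all_hot = monthly_cost({"hot": 10_000})
tiered = monthly_cost({"hot": 2_000, "warm": 3_000, "cold": 5_000})
print(f"all-hot ${all_hot:,.0f}/mo vs tiered ${tiered:,.0f}/mo "
      f"({1 - tiered / all_hot:.0%} saved)")
```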
lightweight data catalog with search
- objective: build a searchable catalog storing dataset metadata, owners, schema, freshness.
- tech: fastapi, sqlite/postgres, simple frontend or curl-able api.
- deliverable: api, sample dataset metadata, search examples.
- acceptance criteria: search returns results for 90% of common queries (name/owner/field).
- metric: search precision/recall on seed queries.
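a minimal sketch of the search endpoint, assuming fastapi with a sqlite metadata table and simple LIKE matching; the fields are illustrative:

```python
import sqlite3
from fastapi import FastAPI

app = FastAPI()
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE datasets (name TEXT, owner TEXT, fields TEXT)")
conn.execute("INSERT INTO datasets VALUES ('orders', 'data-team', 'id,amount,ts')")

@app.get("/search")
def search(q: str):
    rows = conn.execute(
        "SELECT name, owner, fields FROM datasets "
        "WHERE name LIKE :q OR owner LIKE :q OR fields LIKE :q",
        {"q": f"%{q}%"},
    ).fetchall()
    return [{"name": n, "owner": o, "fields": f} for n, o, f in rows]

# try: uvicorn catalog:app --reload ; curl 'localhost:8000/search?q=orders'
```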
privacy-preserving aggregation
- objective: design and implement differential privacy or noise injection for aggregated metrics.
- tech: python, numpy, dp libraries (or simple gaussian noise).
- deliverable: aggregation functions with privacy budget explanation and tests showing noise vs utility.
- acceptance criteria: aggregated metrics retain utility while satisfying the chosen epsilon.
- metric: mean absolute error introduced vs privacy parameter.
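a minimal sketch with laplace noise: for a counting query the sensitivity is 1, so the noise scale is 1/epsilon; sweeping epsilon is exactly the noise-vs-utility test the deliverable asks for:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    # laplace mechanism: noise scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

for eps in (0.1, 1.0, 10.0):
    errors = [abs(dp_count(1000, eps) - 1000) for _ in range(1000)]
    print(f"epsilon={eps}: mean abs error ~ {np.mean(errors):.1f}")
```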
ml feature-store mini (offline-online parity)
- objective: implement feature computation offline and online lookup with consistency checks.
- tech: spark/pandas for batch, redis/fastapi for online store.
- deliverable: feature registry, batch job, online api, reconciliation script.
- acceptance criteria: offline vs online feature mismatch rate < 1% on validation set.
- metric: mismatch rate, online lookup latency.
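a minimal sketch of the reconciliation script, assuming batch features in a pandas frame and online copies in redis hashes keyed per user; the key layout is an assumption:

```python
import pandas as pd
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
offline = pd.DataFrame({"user_id": ["u1", "u2"], "txn_count_7d": [3, 9]})

mismatches = 0
for row in offline.itertuples():
    online_val = r.hget(f"features:{row.user_id}", "txn_count_7d")
    if online_val is None or int(online_val) != row.txn_count_7d:
        mismatches += 1

rate = mismatches / len(offline)
print(f"offline/online mismatch rate: {rate:.2%}")  # acceptance: < 1%
```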
model monitoring for data drift
- objective: deploy a simple model and collect distributional metrics to detect drift; trigger retrain pipeline.
- tech: fastapi, prometheus metrics, python retrain job stub.
- deliverable: monitor dashboard snapshots, alerting rules, retrain trigger.
- acceptance criteria: drift detection triggers when feature ks-test p-value < 0.01.
- metric: detection latency, false positive rate.
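a minimal sketch of the drift check itself: a two-sample ks test between a training-time feature sample and a live window, firing at p < 0.01 per the acceptance criterion; trigger_retrain() is a stub:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 5000)   # training-time feature sample
live = rng.normal(0.5, 1, 5000)      # shifted production window

stat, p_value = ks_2samp(reference, live)
if p_value < 0.01:
    print(f"drift detected (ks={stat:.3f}, p={p_value:.2e}); triggering retrain")
    # trigger_retrain()
```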
end-to-end encrypted pipeline proof
- objective: implement encryption-at-rest and in-transit for data moving between services and verify keys handling.
- tech: tls for transport, a local kms mock, encrypted parquet files.
- deliverable: encrypted data flow, key rotation demo, verification scripts.
- acceptance criteria: data remains readable only with correct key; rotation works without data loss.
- metric: encryption/decryption throughput, key rotation downtime.
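a minimal sketch of at-rest encryption with key rotation using the cryptography package's fernet/multifernet; in the full project the keys would come from the kms mock rather than being generated inline:

```python
from cryptography.fernet import Fernet, InvalidToken, MultiFernet

old_key, new_key = Fernet.generate_key(), Fernet.generate_key()
old_f, new_f = Fernet(old_key), Fernet(new_key)

token = old_f.encrypt(b"sensitive row")   # written under the old key

# rotation: multifernet decrypts with any listed key, re-encrypts with the first
rotated = MultiFernet([new_f, old_f]).rotate(token)

assert new_f.decrypt(rotated) == b"sensitive row"   # readable with the new key
try:
    old_f.decrypt(rotated)
except InvalidToken:
    print("old key can no longer read rotated data: rotation verified")
```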
observability & alerting baseline for pipelines
- objective: instrument a pipeline with metrics (throughput, error rate, latency) and create alerting thresholds.
- tech: prometheus client, grafana dashboards, alertmanager stub.
- deliverable: dashboards, alert rules, runbook for alerts.
- acceptance criteria: alerts fire on injected anomalies and runbook resolves them.
- metric: alert precision, mean time to acknowledge (mtta) in a simulated test.
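a minimal sketch of the three baseline instruments with prometheus_client; metric names and the :8000 scrape port are assumptions:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

ROWS = Counter("pipeline_rows_total", "rows processed")
ERRORS = Counter("pipeline_errors_total", "rows failed")
LATENCY = Histogram("pipeline_batch_seconds", "batch processing latency")

start_http_server(8000)   # exposes /metrics for prometheus to scrape

while True:
    with LATENCY.time():
        time.sleep(random.uniform(0.05, 0.2))   # stand-in for real work
        ROWS.inc(100)
        if random.random() < 0.05:              # injected anomaly rate
            ERRORS.inc()
```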
data-contract enforcement via automated tests
- objective: create tests that enforce upstream contracts (field presence, types, semantics) before allowing dbt runs.
- tech: pytest, dbt, lightweight contract definitions (json/yaml).
- deliverable: contract test suite, sample failing/succeeding runs.
- acceptance criteria: pipeline blocks merges that violate contracts in ci.
- metric: number of contract violations prevented.
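a minimal sketch of one pre-run gate: a pytest that checks postgres's information_schema against a yaml contract before dbt is allowed to run; the contract layout and dsn are assumptions:

```python
import psycopg2
import yaml

CONTRACT = yaml.safe_load("""
table: raw.orders
columns:
  id: {type: bigint, nullable: false}
  amount: {type: numeric, nullable: true}
""")

def test_upstream_contract():
    conn = psycopg2.connect("dbname=analytics")
    schema, table = CONTRACT["table"].split(".")
    cur = conn.cursor()
    cur.execute(
        "SELECT column_name, data_type, is_nullable "
        "FROM information_schema.columns "
        "WHERE table_schema = %s AND table_name = %s", (schema, table))
    live = {name: (dtype, nullable == "YES") for name, dtype, nullable in cur}
    for col, spec in CONTRACT["columns"].items():
        assert col in live, f"contract violation: column {col} missing"
        dtype, nullable = live[col]
        assert dtype == spec["type"], f"{col}: expected {spec['type']}, got {dtype}"
        assert nullable == spec["nullable"], f"{col}: nullability drifted"
```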
multi-tenant dataset access controls
- objective: simulate multi-tenant data isolation with row-level security and audit logs.
- tech: postgres row-level security, simple auth layer.
- deliverable: rls policies, audit log viewer, test cases.
- acceptance criteria: a tenant cannot query another tenant's rows; the audit log records accesses.
- metric: policy enforcement correctness.
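a minimal sketch of the rls setup driven from python; the app.tenant_id setting name is an assumption, and the session must use a non-owner role because postgres does not apply rls to table owners by default:

```python
import psycopg2

DDL = """
ALTER TABLE events ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON events
    USING (tenant_id = current_setting('app.tenant_id'));
"""

conn = psycopg2.connect("dbname=multitenant")  # connect as a non-owner app role
with conn, conn.cursor() as cur:
    cur.execute(DDL)
    # simulate tenant a's session: only tenant a's rows are visible
    cur.execute("SET app.tenant_id = 'tenant_a'")
    cur.execute("SELECT count(*) FROM events")
    print("visible rows for tenant_a:", cur.fetchone()[0])
```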
streaming exactly-once semantics experiment
- objective: implement and demonstrate exactly-once processing (idempotent sinks or transactions).
- tech: kafka, spark structured streaming with checkpointing, transactional sink if available.
- deliverable: experiment demonstrating no duplicates after retries/failures.
- acceptance criteria: result set equals ground truth after repeated failures.
- metric: duplicate rate = 0 under test scenarios.
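a minimal sketch of the idempotent-sink half: replays of the same event id are absorbed by insert-or-ignore keyed on the event id (sqlite 3.24+ syntax here), which makes at-least-once delivery look exactly-once downstream:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sink (event_id TEXT PRIMARY KEY, amount REAL)")

def write(event_id: str, amount: float):
    # replays of the same event id are no-ops
    conn.execute(
        "INSERT INTO sink VALUES (?, ?) ON CONFLICT(event_id) DO NOTHING",
        (event_id, amount),
    )

for _ in range(3):   # simulated redelivery after failures
    write("evt-1", 10.0)

assert conn.execute("SELECT count(*) FROM sink").fetchone()[0] == 1  # no dupes
```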
backlog replay & catch-up strategy
- objective: design and test a catch-up plan for a lagging consumer to replay backlog without overwhelming sinks.
- tech: kafka, throttling consumer, batch sink.
- deliverable: replay script with ramping strategy, sink protection mechanism.
- acceptance criteria: catch-up finishes without exceeding sink capacity / alert thresholds.
- metric: peak sink load during replay vs safe capacity.
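a minimal sketch of the ramping strategy: start below the sink's safe capacity and step the rate up only while observed sink latency stays healthy; the thresholds and drain_batch() stub are assumptions:

```python
SAFE_RPS, STEP, MAX_RPS = 500, 250, 5000
LATENCY_BUDGET = 0.05   # seconds; above this the sink counts as stressed

def drain_batch(n: int) -> float:
    """stand-in for reading n backlog messages and writing them to the sink;
    returns the observed sink latency."""
    return 0.01

rate, backlog = SAFE_RPS, 100_000
while backlog > 0:
    batch = min(rate, backlog)
    latency = drain_batch(batch)
    backlog -= batch
    if latency < LATENCY_BUDGET:
        rate = min(rate + STEP, MAX_RPS)   # sink healthy: ramp up
    else:
        rate = max(SAFE_RPS, rate // 2)    # back off to protect the sink
    # a real loop would sleep here to hold `rate` per second
```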
terraform infra drift detection demo
- objective: provision simple infra (db, bucket) with terraform, make a manual change, detect the drift, and auto-restore or notify.
- tech: terraform, scripts, state checks.
- deliverable: drift detection script, remediation plan.
- acceptance criteria: drift detected within defined interval and remediation invoked or ticket created.
- metric: detection time, remediation success rate.
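a minimal sketch of the detection step: `terraform plan -detailed-exitcode` exits 0 when state matches, 2 when the plan is non-empty (drift); the remediation hooks are stubs:

```python
import subprocess

result = subprocess.run(
    ["terraform", "plan", "-detailed-exitcode", "-no-color"],
    capture_output=True, text=True,
)
if result.returncode == 2:
    print("drift detected:\n", result.stdout)
    # open_ticket(result.stdout), or `terraform apply` to restore
elif result.returncode == 0:
    print("no drift")
else:
    raise RuntimeError(result.stderr)
```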
data-mesh contract test between domains
- objective: simulate two domains publishing datasets with contract tests and cross-domain consumption test.
- tech: dbt packages per domain, contract tests, mock consumer.
- deliverable: domain contracts, consumer tests verifying contract adherence.
- acceptance criteria: consumer tests fail when publisher breaks contract; ci blocks merge.
- metric: contract violation detection rate.
reproducible research notebook + ci
- objective: convert an analysis notebook into reproducible pipeline with parameterized runs and ci validation.
- tech: papermill or nbconvert, github actions, docker.
- deliverable: parameterized notebook, ci workflow, sample report.
- acceptance criteria: notebook executes in ci and outputs expected artifacts.
- metric: ci run time, reproducibility across environments.
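a minimal sketch of the ci entry point with papermill: parameters are injected into the notebook and any failing cell fails the run; file names and parameters are assumptions:

```python
import pathlib
import papermill as pm

pathlib.Path("out").mkdir(exist_ok=True)
pm.execute_notebook(
    "analysis.ipynb",
    "out/analysis-2024-01.ipynb",
    parameters={"start_date": "2024-01-01", "end_date": "2024-01-31"},
)
```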
cold-start design for feature computation
- objective: design and implement strategies for computing features for new users with no history (imputation, defaults, cold-start models).
- tech: python, simple model, redis for feature cache.
- deliverable: fallback strategy code, evaluation on synthetic cold-start cohort.
- acceptance criteria: cold-start predictive performance within defined delta of warm-start baseline.
- metric: performance delta (e.g., auc difference).
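a minimal sketch of the fallback chain: per-user features from the redis cache when present, cohort-level defaults otherwise; key names and default values are assumptions:

```python
import json
import redis

r = redis.Redis(decode_responses=True)
COHORT_DEFAULTS = {"txn_count_7d": 0, "avg_amount": 23.5}  # e.g., global medians

def get_features(user_id: str) -> dict:
    raw = r.get(f"features:{user_id}")
    if raw is not None:
        return json.loads(raw)     # warm user: real history
    return dict(COHORT_DEFAULTS)   # cold start: imputed defaults
```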
api throttling & backpressure in ingestion layer
- objective: implement client-side throttling and server backpressure handling to protect downstream systems.
- tech: fastapi, rate-limiter, kafka or queuing layer.
- deliverable: throttling policies, load test showing protection.
- acceptance criteria: downstream error rate remains low under burst; graceful degradation.
- metric: error rate, request rejection rate, time-to-recover.
small-scale data-warehouse benchmarking
- objective: benchmark simple analytic queries across two storage formats (parquet vs iceberg/delta) and two engines (e.g., duckdb vs trino).
- tech: duckdb/trino, parquet/iceberg local.
- deliverable: benchmark scripts, latency/cost table, recommendation.
- acceptance criteria: documented trade-offs with reproducible numbers.
- metric: query latency, scan bytes, cpu time.
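a minimal sketch of one benchmark cell: duckdb over parquet with wall-clock timing (repeat per engine and format, then tabulate); the path and query are assumptions:

```python
import time
import duckdb

con = duckdb.connect()
query = "SELECT date, sum(amount) FROM 'data/*.parquet' GROUP BY date"

t0 = time.perf_counter()
con.execute(query).fetchall()
print(f"duckdb: {time.perf_counter() - t0:.3f}s")
```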
drift-tolerant feature computation (windowing & watermark design)
- objective: design feature computation tolerant to late-arriving events using watermarks and retention strategies.
- tech: spark structured streaming, kafka.
- deliverable: streaming feature job, tests injecting late events, documentation of watermark choices.
- acceptance criteria: feature values converge within window after allowed lateness; late events handled correctly.
- metric: percentage of late events absorbed, final feature stability.
small-scale governance & access audit playbook
- objective: create a documented playbook covering dataset onboarding, ownership, sensitivity classification, and access audit process.
- tech: markdown + yaml templates, sample audit queries.
- deliverable: playbook, sample onboarding pr template, audit checklist.
- acceptance criteria: playbook covers 100% of common cases; mock audit passes.
- metric: time to onboard a dataset (simulated), completeness score from checklist.