reasoning-based learning
heap-based ranking system (data structure fundamentals)
- heap data structures and operations (insert, extract-min/max)
- priority queues for ranking algorithms
- time and space complexity analysis (O(log n) operations)
- basic python implementation of heaps
- approach: synthesize a min-heap via `heapq` with tuple keys for multi-criteria ranking (`heapq` has no custom-comparator hook, so encode priority into the key); ingest a 10k-item dataset, heapify in O(n), then extract top-k with repeated pops, profiling sift-up/down latencies via `timeit` (sketch below).
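a minimal sketch of the tuple-key workaround on synthetic data; `rank_key` and `top_k` are illustrative names, not from the project spec:

```python
import heapq
import random

def rank_key(item):
    # multi-criteria: score first, then recency, both descending.
    # heapq is a min-heap with no comparator hook, so negate for "max" behavior.
    return (-item["score"], -item["recency"])

def top_k(items, k):
    # heapify is O(n); k pops cost O(k log n).
    heap = [(rank_key(it), i, it) for i, it in enumerate(items)]  # i breaks ties
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(min(k, len(heap)))]

items = [{"score": random.random(), "recency": random.randint(0, 9)} for _ in range(10_000)]
print(top_k(items, 5))
```

for one-shot rankings, `heapq.nlargest(k, items, key=...)` gives the same answer without managing the heap by hand; the manual version matters when you pop incrementally.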
in-memory key-value store (hashmap + persistence simulation)
- hash maps and collision resolution techniques
- crud operations in memory
- simulating persistence with file i/o
- error handling for key collisions and serialization
- approach: engineer an array-backed hashmap with linear probing for collisions; layer pickle-based serialization for dump/restore, stress-testing with 1m ops via `concurrent.futures`, guarding mutations with a lock for thread safety (probing sketch below).
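a minimal sketch of the probing core over a fixed-size slot array (resize and delete-tombstones omitted, so the table must stay under capacity); the class name is illustrative:

```python
class ProbingMap:
    _EMPTY = object()  # sentinel distinguishing "never used" from stored None

    def __init__(self, capacity=8):
        self._slots = [self._EMPTY] * capacity  # each occupied slot: (key, value)

    def _probe(self, key):
        # walk forward from hash(key) until we find the key or an empty slot;
        # no resize here, so the caller must keep the load factor below 1.
        i = hash(key) % len(self._slots)
        while self._slots[i] is not self._EMPTY and self._slots[i][0] != key:
            i = (i + 1) % len(self._slots)
        return i

    def put(self, key, value):
        self._slots[self._probe(key)] = (key, value)

    def get(self, key):
        slot = self._slots[self._probe(key)]
        if slot is self._EMPTY:
            raise KeyError(key)
        return slot[1]

m = ProbingMap()
m.put("a", 1); m.put("b", 2)
print(m.get("a"), m.get("b"))
```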
tcp chat system (socket + concurrency + message delivery)
- socket programming and tcp protocol basics
- threading or multiprocessing for concurrency
- message queuing and reliable delivery
- handling client-server communication errors
- approach: bootstrap a non-blocking server with `socket` plus `select`/epoll (the `selectors` module) for i/o multiplexing; enqueue msgs in `queue.Queue`, spawn threads per client for echo/reply, injecting artificial packet loss to tune nack retries (server sketch below).
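a minimal sketch of the multiplexed core using the stdlib `selectors` module (which picks epoll on linux); echo-only, with the per-client chat fan-out left out:

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on linux, kqueue on macos

def accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, read)

def read(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)  # echo back; a real chat server would fan out via queue.Queue
    else:
        sel.unregister(conn)  # empty read means the client disconnected
        conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 9009))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:
    for key, _ in sel.select():
        key.data(key.fileobj)  # key.data holds the callback registered above
```

production code would buffer writes and register for EVENT_WRITE; for a sketch, the blocking `sendall` on small echoes is acceptable.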
python job scheduler (multiprocessing + process sync)
- multiprocessing module for parallel execution
- synchronization primitives (locks, queues)
- cron-like scheduling logic
- resource management and deadlock avoidance
- approach: harness `multiprocessing.Pool` for fan-out jobs, gated by an `RLock` for shared state; parse cron strings into `heapq`-scheduled events, simulating orchestration with poison pills for graceful shutdowns (sketch below).
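a minimal sketch of the dispatch loop, with cron parsing stubbed out to fixed intervals; names are illustrative:

```python
import heapq
import multiprocessing as mp
import time

POISON = None  # poison pill: tells workers to exit

def worker(q):
    while (job := q.get()) is not POISON:
        print(f"running {job}")

def main():
    q = mp.Queue()
    procs = [mp.Process(target=worker, args=(q,)) for _ in range(2)]
    for p in procs:
        p.start()

    # (next_run, interval, name): stand-in for parsed cron entries
    events = [(time.time() + 1, 1.0, "job-a"), (time.time() + 2, 2.0, "job-b")]
    heapq.heapify(events)

    deadline = time.time() + 5  # run the simulation for 5 seconds
    while time.time() < deadline:
        next_run, interval, name = heapq.heappop(events)
        time.sleep(max(0, next_run - time.time()))
        q.put(name)
        heapq.heappush(events, (next_run + interval, interval, name))

    for _ in procs:
        q.put(POISON)  # graceful shutdown
    for p in procs:
        p.join()

if __name__ == "__main__":
    main()
```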
normalization and er diagram case study (manual modeling)
- database normalization rules (1nf to bcnf)
- entity-relationship (er) modeling
- identifying keys, relationships, and dependencies
- manual schema design for real-world cases
- approach: deconstruct an e-commerce domain into a crow’s-foot erd via plantuml; iterate normalization passes on sample data, quantifying redundancy reduction (e.g., from 3nf violations) with ad-hoc sql diffs.
simple query planner visualizer (sql cost estimation demo)
- sql query execution plans and cost models
- basic optimizer logic (index usage, join orders)
- visualization techniques (trees or graphs)
- estimating i/o and cpu costs
- approach: parse toy sql via antlr, mock a volcano-style optimizer with greedy join enumeration; render the ast/explain tree in graphviz, back-of-envelope costing via selectivity histograms for a 100-query benchmark.
postgres analytics schema (star/snowflake model)
- star and snowflake schema design for analytics
- dimensional modeling (facts, dimensions)
- postgresql ddl for tables and constraints
- query patterns for olap workloads
- approach: craft a star schema for sales metrics with surrogate keys and indexes tuned for bitmap scans; normalize the dimensions out into a snowflake via fks, load via `copy`, then olap-ify with window funcs on 1m rows for aggregation perf.
nosql replication comparison (mongodb vs postgres)
- replication models (master-slave vs. multi-master)
- consistency and availability trade-offs (cap theorem)
- setup and testing mongodb vs. postgres replication
- failure recovery scenarios
- approach: spin up replica sets in docker compose (mongo oplog tailing vs. postgres streaming wal); inject chaos monkey kills, measure rpo/rto with linearizability probes, dissecting quorum configs under partition sims.
mini db engine in python (sql parser + file i/o)
- sql parsing with libraries like sqlparse
- basic query execution engine
- file-based storage and indexing
- transaction simulation (acid basics)
- approach: tokenize sql via `sqlparse`, execute select/insert on lsm-tree stubs with b+-tree indexes in `shelve`; enforce wal for durability, replaying txns on crash to validate atomicity (parser sketch below).
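a minimal sketch of the parser front end using `sqlparse`'s actual api; the routing comment marks where a hypothetical execution engine would take over:

```python
import sqlparse

def dispatch(sql):
    stmt = sqlparse.parse(sql)[0]   # sqlparse returns one Statement per ';'
    kind = stmt.get_type()          # 'SELECT', 'INSERT', 'UNKNOWN', ...
    tokens = [t.value for t in stmt.flatten() if not t.is_whitespace]
    print(kind, tokens)
    # a real engine would route here: SELECT -> index scan, INSERT -> wal append + memtable

dispatch("SELECT name, qty FROM orders WHERE qty > 10;")
dispatch("INSERT INTO orders (name, qty) VALUES ('ore', 3);")
```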
query performance analyzer (indexing and joins)
- database indexing types (b-tree, hash)
- join algorithms (nested loop, hash join)
- explain plans and performance metrics
- optimization techniques for slow queries
- approach: instrument postgres with pg_stat_statements, force hash/nested joins on tpc-h subset; diff explain (analyze) outputs pre/post gin index, quantifying i/o via buffer hit ratios.
weather pipeline (api → s3 → parquet → postgres)
- api ingestion with requests library
- data serialization to parquet format
- s3 bucket operations (upload/download)
- etl loading into postgresql
- approach: poll the openweather api with `requests` + exponential backoff, schema-enforce via pandera, partition parquet writes to minio (s3 emulation); bulk-load to postgres via `psycopg2` copy, idempotent via upsert (backoff sketch below).
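a minimal sketch of the backoff loop; the url and params are placeholders for the real openweather endpoint and api key:

```python
import time
import requests

def fetch_with_backoff(url, params, max_tries=5):
    delay = 1.0
    for attempt in range(max_tries):
        try:
            resp = requests.get(url, params=params, timeout=10)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the last error
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, 8s

# placeholder endpoint; swap in the real openweather url and api key
# data = fetch_with_backoff("https://api.example.com/weather", {"q": "berlin"})
```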
data quality framework (mini great-expectations clone)
- data validation rules (schema, range checks)
- expectation suites and reporting
- integration with pandas for testing
- handling failures and alerts
- approach: abstract validators as composable funcs (e.g., greatex-style suites in yaml), run on df slices with `pytest`; aggregate failures to slack via webhook, profiling assertion overhead on 500k-row batches (validator sketch below).
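a minimal sketch of validators as closures composed into a suite; the expectation names echo great-expectations style but are hypothetical:

```python
import pandas as pd

def expect_no_nulls(col):
    def check(df):
        bad = df[col].isna().sum()
        return ("expect_no_nulls", col, bad == 0, f"{bad} nulls")
    return check

def expect_between(col, lo, hi):
    def check(df):
        bad = (~df[col].between(lo, hi)).sum()
        return ("expect_between", col, bad == 0, f"{bad} out of range")
    return check

def run_suite(df, suite):
    # each validator returns (rule, column, passed, detail); report and aggregate
    results = [check(df) for check in suite]
    for name, col, ok, detail in results:
        print(f"{'PASS' if ok else 'FAIL'} {name}({col}): {detail}")
    return all(ok for _, _, ok, _ in results)

df = pd.DataFrame({"qty": [1, 5, None], "price": [9.5, 120.0, 3.2]})
run_suite(df, [expect_no_nulls("qty"), expect_between("price", 0, 100)])
```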
data profiler tool (statistics + nulls + schema validation)
- statistical profiling (mean, variance, distributions)
- null and outlier detection
- schema inference and validation
- visualization of data profiles
- approach: leverage `pandas-profiling` hooks for quintile bins and ks tests on numerics; infer schema via `pandera`, flag mahalanobis outliers, export html reports with seaborn distplots.
incremental etl with airflow (scheduler + retries)
- airflow dags and operators
- incremental loading logic (timestamps, watermarks)
- retry mechanisms and error handling
- scheduling and dependency management
- approach: orchestrate a dag with `PostgresOperator` + `PythonOperator` for cdc via a max(ts) watermark; stash watermark state in xcom, exponential retries on transient errs, backfill via `airflow backfill` (watermark sketch below).
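a minimal sketch of the watermark step as the plain function a `PythonOperator` would call; the table, column, and dsn are placeholders:

```python
import psycopg2

def extract_increment(conn, last_watermark):
    # pull only rows newer than the stored watermark, then advance it
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload, updated_at FROM events "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_watermark,),
        )
        rows = cur.fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark  # persist new_watermark via xcom or a state table

# usage (placeholder dsn):
# conn = psycopg2.connect("dbname=etl user=etl")
# rows, wm = extract_increment(conn, last_watermark)
```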
airflow dag monitoring dashboard (metrics extraction)
- airflow metadata extraction
- metrics like run times and success rates
- simple dashboard with flask or streamlit
- logging and alerting basics
- approach: query airflow’s sqlite meta db via sqlalchemy, etl to influxdb; streamlit-ify with plotly gauges for sla breaches, trigger pagerduty on >5% failure rate.
crypto price ingestion (websocket → kafka → postgres)
- websocket connections for real-time data
- kafka producer/consumer basics
- streaming to batch conversion
- idempotent inserts into postgres
- approach: sub to the binance ws with `websocket-client`, serialize ticks as avro to a kafka topic; consumer batches 1s windows, upserts to timescaledb with dedup on (symbol, ts).
news/web scraper (fetch → parse → store → visualize)
- web scraping with beautifulsoup or scrapy
- html parsing and data extraction
- storage in a database or files
- basic visualization (matplotlib charts)
- approach: deploy a scrapy spider with xpath selectors for rss feeds, rate-limit via scrapy's autothrottle extension; parse to jsonl, sink to clickhouse, matplotlib sentiment timelines via vader scores.
ingestion framework (retry/backoff logic, monitoring)
- exponential backoff for retries
- monitoring with logging libraries
- configurable framework for multiple sources
- fault tolerance patterns
- approach: abstract `tenacity` decorators for circuit-breaker retries, log spans with `structlog`; yaml-config multi-source (api/db), prometheus-exported metrics for throughput/latency (retry sketch below).
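a minimal sketch wiring `tenacity`'s retry decorators to `structlog`; the source function and its failure are placeholders:

```python
import structlog
from tenacity import retry, stop_after_attempt, wait_exponential

log = structlog.get_logger()

# reraise=True surfaces the original exception instead of tenacity's RetryError
@retry(wait=wait_exponential(multiplier=1, max=10), stop=stop_after_attempt(3), reraise=True)
def pull_source(url):
    log.info("pull_attempt", url=url)
    raise ConnectionError("flaky upstream")  # placeholder failure to show retries

try:
    pull_source("https://api.example.com/data")
except ConnectionError:
    log.error("pull_failed_after_retries")
```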
websocket stream visualizer (flow diagram and stats)
- real-time data visualization
- flow diagrams with libraries like graphviz
- streaming stats (throughput, latency)
- handling disconnections gracefully
- approach: pipe ws feeds to dash app with websocket callbacks, graphviz for topology; kafka consumer for p99 latency histograms, auto-reconnect with heartbeat pings.
airflow on ec2 with s3/rds backend
- ec2 instance setup and security groups
- rds postgresql configuration
- airflow deployment with celery executor
- s3 integration for artifacts
- approach: terraform an ec2 t3.medium with ssm-managed airflow install; rds multi-az, celery workers with a flower dashboard for the pool; s3 logs via `airflow.cfg`, iam roles for cross-account access.
lambda-based etl automation
- aws lambda functions and triggers
- serverless etl patterns
- layer management for dependencies
- cost and cold-start optimization
- approach: author step functions orchestrator invoking lambda (python 3.12 runtime), s3 event triggers; bundle deps in layers, provisioned concurrency for <200ms cold starts, cloudwatch alarms on duration.
spark job on emr reading parquet from s3
- emr cluster provisioning
- spark sql for parquet processing
- s3 as data lake source
- job submission and monitoring
- approach: launch emr 6.10 with a spot fleet, `spark-submit` a pyspark script for udf-enriched joins on 100gb of parquet; ganglia metrics to s3, auto-terminate post-step via lifecycle.
dockerized data pipeline with docker-compose
- dockerfile creation for services
- docker compose for multi-container apps
- volume mounting and networking
- local testing of full pipelines
- approach: multi-stage dockerfile for airflow + postgres, compose.yaml with a bridge net and healthchecks; vol-mount /dags, `docker-compose up` for e2e smoke tests with pytest.
github actions ci/cd for etl pipelines
- github actions workflows
- ci/cd pipelines for testing/deploying
- secrets management and artifacts
- automated testing for data jobs
- approach: yaml matrix for py3.9-3.11, tox for lint/test; deploy to ecs via oidc, cache pip in artifacts, slack notify on main merges.
data lakehouse pipeline (kafka → iceberg → trino → superset)
- apache iceberg table formats
- trino for federated querying
- superset for bi dashboards
- end-to-end lakehouse architecture
- approach: kafka connect sink to iceberg via minio, trino catalog for schema-on-read; superset sql lab for cohort analysis, manifest audits for acid txns.
warehouse benchmarking (redshift vs bigquery vs trino)
- query performance metrics (tpc-ds benchmarks)
- cost analysis across services
- scalability testing
- migration considerations
- approach: port tpc-ds sf100 to each (redshift wlm, bq slots, trino connectors); run 99th percentile qps with hammerdb, tco via cur queries, diff compression ratios.
schema evolution demo (iceberg schema change handling)
- schema evolution in table formats
- handling adds/drops/renames
- time travel queries
- compatibility testing
- approach: evolve the iceberg schema with `spark.sql` alter statements, snapshot rollback via branches; compatibility-matrix tests (add col → query ok), audit metadata in parquet footers.
custom partitioning simulator (read-time comparison)
- partitioning strategies (hash, range)
- read performance simulation
- pruning logic
- data skew handling
- approach: simulate hive-style partitions in pandas on skewed synth data; benchmark predicate pushdown with `dask`, salt hash keys to mitigate hotspots, plot scan times via matplotlib (pruning sketch below).
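a minimal sketch of the pruning comparison in plain pandas (the `dask` version is the same idea on lazy partitions); data and key names are synthetic:

```python
import time
import numpy as np
import pandas as pd

N, PARTS = 1_000_000, 16
# zipf gives the skewed key distribution the project calls for
df = pd.DataFrame({"user": np.random.zipf(1.5, N) % 1000, "amount": np.random.rand(N)})

# hash-partition into PARTS buckets, like a directory-per-partition layout
df["part"] = df["user"].mod(PARTS)  # stand-in for hash(user) % PARTS
partitions = {p: g.drop(columns="part") for p, g in df.groupby("part")}

target = 42
t0 = time.perf_counter()
full = df[df["user"] == target]            # full scan: touches every row
t1 = time.perf_counter()
pruned = partitions[target % PARTS]
pruned = pruned[pruned["user"] == target]  # pruned: touches one bucket only
t2 = time.perf_counter()

print(f"full scan {t1 - t0:.4f}s vs pruned {t2 - t1:.4f}s, rows={len(pruned)}")
```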
spark batch analytics (millions of rows to parquet)
- spark dataframes for large-scale processing
- batch transformations and aggregations
- parquet optimization (compression, partitioning)
- handling out-of-memory errors
- approach: rdd-to-df pipeline with broadcast hints, zstd-compressed partitioned parquet; enable `spark.sql.adaptive.enabled` so aqe rebalances partitions and spark spills to disk instead of oom-ing, process 10m rows on local[*] (write sketch below).
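a minimal sketch of the write path on a local session; the source dataframe and output path are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.master("local[*]")
    .appName("batch-parquet")
    .config("spark.sql.adaptive.enabled", "true")  # let AQE rebalance partitions
    .getOrCreate()
)

# synthetic stand-in for the 10m-row source
df = spark.range(10_000_000).withColumn("bucket", (F.col("id") % 100).cast("int"))

agg = df.groupBy("bucket").agg(F.count("*").alias("n"), F.avg("id").alias("avg_id"))

(agg.write.mode("overwrite")
    .option("compression", "zstd")  # zstd codec for parquet (spark 3+)
    .partitionBy("bucket")
    .parquet("/tmp/batch_out"))     # placeholder output path

spark.stop()
```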
kafka streaming analytics (trade data → spark stream → redis)
- spark structured streaming
- kafka integration for input/output
- redis for caching results
- windowed aggregations
- approach: checkpointed sss micro-batches from kafka json, tumbling windows for vwap; sink aggs to redis sorted sets, fault-tolerate with exactly-once semantics.
flink fraud detector (windowed anomaly detection)
- apache flink for stream processing
- windowing and event-time semantics
- anomaly detection algorithms
- stateful computations
- approach: flink sql for session windows on tick data, z-score isolation forest udf; keyed state backend for per-user baselines, backpressure tuning via watermark alignment.
hdfs simulator (namenode replication logic)
- hdfs architecture (namenode, datanodes)
- block replication and fault tolerance
- name resolution and metadata management
- simulation of failures
- approach: python fs sim with `threading` for datanode heartbeats, in-mem fsimage for the nn; rack-aware replication (3x), inject dn failures to trigger under-replicated block scans (heartbeat sketch below).
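a minimal sketch of the heartbeat loop and the namenode's liveness sweep; block re-replication itself is the follow-on step, and all names are illustrative:

```python
import threading
import time

class NameNode:
    def __init__(self, timeout=3.0):
        self.last_seen = {}  # datanode id -> last heartbeat time
        self.timeout = timeout
        self.lock = threading.Lock()

    def heartbeat(self, dn_id):
        with self.lock:
            self.last_seen[dn_id] = time.time()

    def dead_nodes(self):
        # any dn silent past the timeout would trigger re-replication of its blocks
        now = time.time()
        with self.lock:
            return [d for d, t in self.last_seen.items() if now - t > self.timeout]

def datanode(nn, dn_id, stop):
    while not stop.is_set():
        nn.heartbeat(dn_id)
        time.sleep(1.0)

nn = NameNode()
stop = threading.Event()
threads = [threading.Thread(target=datanode, args=(nn, f"dn{i}", stop), daemon=True)
           for i in range(3)]
for t in threads:
    t.start()

time.sleep(2)
stop.set()        # "kill" all datanodes
time.sleep(4)
print("dead:", nn.dead_nodes())  # all three should now be past the timeout
```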
kafka-lite broker (python simulation of partitions and offsets)
- kafka partitioning and leader election
- offset management and consumer groups
- producer/consumer protocols
- durability guarantees
- approach: asyncio broker with pluggable partitioners (murmur2), leader election via raft-style stubs; offset ledger in rocksdb, simulate isr for acks=all durability (partitioner sketch below).
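a minimal sketch of the partition/offset bookkeeping, with an in-memory dict standing in for rocksdb and python's builtin `hash` standing in for murmur2:

```python
from collections import defaultdict

class MiniBroker:
    def __init__(self, partitions=3):
        self.partitions = partitions
        self.logs = defaultdict(list)      # partition -> append-only message log
        self.committed = defaultdict(int)  # (group, partition) -> next offset

    def produce(self, key, value):
        p = hash(key) % self.partitions    # stand-in for kafka's murmur2 partitioner
        self.logs[p].append(value)
        return p, len(self.logs[p]) - 1    # (partition, offset)

    def consume(self, group, partition, max_records=10):
        start = self.committed[(group, partition)]
        records = self.logs[partition][start:start + max_records]
        self.committed[(group, partition)] = start + len(records)  # auto-commit
        return records

b = MiniBroker()
for i in range(5):
    b.produce(f"user-{i}", f"msg-{i}")
print(b.consume("g1", 0), b.consume("g1", 0))  # second call resumes from committed offset
```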
spark tuning benchmark (shuffle vs broadcast join)
- spark shuffle optimizations
- broadcast joins for small tables
- benchmarking with different configs
- partition tuning
- approach: aqe-enabled joins on tpc-h, toggle `spark.sql.autoBroadcastJoinThreshold`; shuffle hash vs. sort-merge, profile spill via the spark ui, repartition/coalesce for skew.
query optimization analyzer (compare explain plans)
- explain analyze in sql engines
- cost-based optimizers
- plan comparison and rewriting
- index recommendations
- approach: hook postgres pg_hint_plan for forced rewrites, diff json explain trees; cbo stats vacuum-analyze, suggest covering indexes via pg_qualstats.
vectorized dataframe benchmark (pandas vs polars vs spark)
- vectorized operations in dataframes
- performance comparison across libraries
- memory usage profiling
- scalability for large datasets
- approach: microbench groupby/apply on a 1gb csv across engines, `memory_profiler` for peak rss; polars lazy eval vs. spark catalyst, extrapolate to tb-scale from the scaling curve (bench sketch below).
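a minimal sketch of the pandas/polars leg (spark omitted); assumes a recent polars where the method is `group_by` rather than the older `groupby`:

```python
import time
import numpy as np
import pandas as pd
import polars as pl

N = 5_000_000
keys = np.random.randint(0, 1000, N)
vals = np.random.rand(N)

pdf = pd.DataFrame({"k": keys, "v": vals})
t0 = time.perf_counter()
pdf.groupby("k")["v"].mean()
t_pandas = time.perf_counter() - t0

plf = pl.DataFrame({"k": keys, "v": vals})
t0 = time.perf_counter()
plf.group_by("k").agg(pl.col("v").mean())  # `groupby` on older polars versions
t_polars = time.perf_counter() - t0

print(f"pandas {t_pandas:.3f}s vs polars {t_polars:.3f}s")
```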
kafka consumer lag visualizer (real-time lag chart)
- consumer lag metrics
- real-time monitoring with kafka tools
- charting with plotly or similar
- alerting on high lag
- approach: pull lag from jmx or `kafka-consumer-groups --describe`, streamlit + plotly candlesticks; threshold alerts via a `kafkacat` cron, lag = high-watermark - committed offset.
airflow monitoring dashboard (grafana + prometheus)
- prometheus metrics scraping
- grafana dashboard creation
- airflow exporter integration
- custom queries and panels
- approach: airflow prometheus exporter on /metrics, grafana promql for dag run quantiles; loki for logs, templated vars for dynamic dag selection.
etl log ingestion to elk stack (elastic, logstash, kibana)
- logstash for parsing and filtering
- elasticsearch indexing
- kibana visualizations
- log aggregation patterns
- approach: logstash grok patterns for airflow json logs, ilm policy for rollover indices; kibana tsvb for error funnels, ingest via filebeat sidecar.
prometheus alerting system with slack webhook
- alertmanager configuration
- slack integration for notifications
- rule definitions for thresholds
- silencing and escalation
- approach: promql rules for cpu >80% over 5m, alertmanager routing to slack via webhook; grouping/inhibit for storm suppression, pagerduty escalation tree.
pipeline health checker (freshness and row count validation)
- data freshness monitoring
- row count and schema checks
- automated validation scripts
- integration with schedulers
- approach: airflow sensor for lag <1h, greatex for row/schema diffs; bash wrapper with `psql` row counts, fail-fast on drift (health-check sketch below).
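a minimal sketch of the freshness + row-count gate; table, column, and thresholds are placeholders:

```python
from datetime import datetime, timedelta, timezone
import psycopg2

def check_health(conn, table, ts_col, max_lag=timedelta(hours=1), min_rows=1):
    # table/ts_col come from trusted config, never user input (f-string sql)
    with conn.cursor() as cur:
        cur.execute(f"SELECT max({ts_col}), count(*) FROM {table}")
        latest, n_rows = cur.fetchone()
    assert latest is not None, f"{table} is empty"
    lag = datetime.now(timezone.utc) - latest  # assumes ts_col is timestamptz
    assert lag <= max_lag, f"{table} stale: last row {lag} ago"   # fail-fast on drift
    assert n_rows >= min_rows, f"{table} too small: {n_rows} rows"
    return {"lag": lag, "rows": n_rows}

# usage (placeholder dsn/table):
# conn = psycopg2.connect("dbname=etl")
# print(check_health(conn, "events", "updated_at"))
```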
data catalog / schema registry demo (metadata management)
- metadata storage and querying
- schema versioning
- catalog tools like amundsen basics
- lineage tracking
- approach: postgres meta tables for schema evo, amundsen frontend for search; openlineage hooks in airflow, graph viz via neo4j.
hybrid data architecture (api → kafka → spark stream → iceberg)
- hybrid batch/stream integration
- event sourcing patterns
- iceberg as unified sink
- end-to-end latency measurement
- approach: api gateway to kafka compacted topics, sss upsert to iceberg merge-on-read; trino for unified views, zipkin traces for e2e p95.
fault tolerance simulator (kafka consumer crash → auto replay)
- consumer group rebalancing
- offset commit strategies
- crash recovery and replay
- at-least-once semantics
- approach: dockerized consumer with sigkill hooks, `auto.offset.reset=earliest`; measure replay dupes via idempotency keys, tune `session.timeout.ms`.
event-driven microservice system (notification + accounting)
- microservices with event buses
- saga pattern for distributed transactions
- notification and accounting logic
- service orchestration
- approach: fastapi services pub/sub via kafka, axon-inspired sagas for compensating txns; accounting double-entry via event replay, istio for circuit breaking.
data mesh simulation (domain-based data ownership)
- domain-driven design for data
- federated governance
- self-service data products
- inter-domain discovery
- approach: ddd bounded contexts as iceberg domains, collibra-lite registry; self-serve via trino federation, enforce contracts with schema checks.
feature pipeline (postgres → cleaning → feast feature store)
- feature engineering and cleaning
- feast for online/offline stores
- point-in-time joins
- serving features to models
- approach: dbt for cleaning transforms, feast registry for entity tables; point-in-time-correct joins keyed on entity + event timestamp, redis/tiledb for low-lat serving (pit-join sketch below).
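a minimal sketch of a point-in-time-correct join with `pd.merge_asof`, which picks only feature values at or before each label timestamp (no future leakage); the data is synthetic:

```python
import pandas as pd

features = pd.DataFrame({
    "user": [1, 1, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02"]),
    "avg_spend": [10.0, 12.5, 7.0],
}).sort_values("ts")  # merge_asof requires both frames sorted on the key

labels = pd.DataFrame({
    "user": [1, 2],
    "ts": pd.to_datetime(["2024-01-02", "2024-01-04"]),
    "churned": [0, 1],
}).sort_values("ts")

# for each label row, take the latest feature row with feature.ts <= label.ts
pit = pd.merge_asof(labels, features, on="ts", by="user", direction="backward")
print(pit)  # user 1 gets avg_spend=10.0; the 01-03 value would leak the future
```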
retraining dag (automated airflow-based ml retraining)
- ml model retraining workflows
- airflow for ml ops
- triggering on data changes
- versioning models
- approach: airflow dag with kubeflow ops, trigger on s3 delta; mlflow track params/metrics, a/b routing via canary.
feature drift detection (distributional shift alerting)
- statistical tests for drift (ks test)
- monitoring feature distributions
- alerting on shifts
- integration with pipelines
- approach: evidently ai for psi/ks on online samples, airflow sensor for p-value <0.01; histogram drift viz, quarantine on alert (ks-test sketch below).
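a minimal sketch of the ks gate with `scipy.stats.ks_2samp`; the 0.01 threshold matches the sensor above, and the shifted sample is synthetic:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)  # training-time feature sample
online = rng.normal(0.5, 1.0, 10_000)     # shifted live sample

stat, p_value = ks_2samp(reference, online)
if p_value < 0.01:
    print(f"drift detected: ks={stat:.3f}, p={p_value:.2e} -> quarantine feature")
```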
blazpay transaction analytics (hyperledger + kafka + spark + postgres)
- blockchain integration (hyperledger fabric)
- transaction event streaming
- spark analytics on ledger data
- real-time p/l calculations
- approach: fabric chaincode events to kafka via connector, spark streaming for rolling pnl; postgres matview for eod recon, z3 for anomaly proofs.
bks mygold tokenized asset pipeline (solidity + iceberg analytics)
- solidity smart contracts for tokens
- on-chain data extraction
- iceberg for analytics tables
- asset valuation queries
- approach: hardhat-deploy erc-721, thegraph subgraph for events; iceberg upsert from subgraph sync, trino for nav calcs with oracle feeds.
vault reconciliation system (ledger vs vault validation)
- reconciliation algorithms
- ledger vs. vault matching
- error detection and resolution
- batch processing for audits
- approach: fuzzywuzzy for tx matching on hashed payloads, spark for diff joins; alert on >0.1% variance, replay logs for root-cause.
transaction intelligence (fraud/anomaly scoring)
- anomaly detection models
- scoring rules and thresholds
- integration with streaming data
- explainable ai basics
- approach: flink cep for rule-based + isolationforest, score via h2o; kafka stream enrich, shap for tx explainability.
accounting microservice (double-entry with kafka replay)
- double-entry bookkeeping logic
- event replay for consistency
- microservice api design
- audit trails
- approach: grpc accounting service with a kafka compacted ledger, t-accounts via event sourcing; replay from offset 0 on bootstrap, immutable audit via append-only log (replay sketch below).
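a minimal sketch of rebuilding t-account balances by replaying an append-only ledger from offset 0; the event shape is hypothetical:

```python
from collections import defaultdict

# compacted-ledger stand-in: each event is a balanced double entry
ledger = [
    {"debit": "cash", "credit": "equity", "amount": 100.0},
    {"debit": "inventory", "credit": "cash", "amount": 40.0},
]

def replay(events):
    balances = defaultdict(float)  # t-accounts: debits increase, credits decrease
    for e in events:
        balances[e["debit"]] += e["amount"]
        balances[e["credit"]] -= e["amount"]
    # double-entry invariant: the whole book must sum to zero
    assert abs(sum(balances.values())) < 1e-9, "ledger out of balance"
    return dict(balances)

print(replay(ledger))  # {'cash': 60.0, 'equity': -100.0, 'inventory': 40.0}
```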
rwa lakehouse (multi-asset data warehouse for on-chain data)
- real-world asset (rwa) data modeling
- multi-asset partitioning
- lakehouse queries for finance
- compliance reporting
- approach: iceberg partitioned by chain/asset, dbt for rwa dims; trino for sec-compliant queries, time-travel for audit trails.
oracle connector (price feed + external event integration)
- chainlink-style oracles
- external data feeds
- event triggering on chains
- secure data validation
- approach: solidity oracle contract with chainlink vrf, pyth feeds via ws; threshold sigs for tamper-proof, fabric msp for access control.
project documentation repo (markdown + mermaid diagrams)
- markdown for readmes and guides
- mermaid for architecture diagrams
- repo structure best practices
- version control for docs
- approach: mkdocs site gen from markdown, mermaid live editor embeds; github pages deploy, changelog via semantic-release.
data contract generator (schema validation registry)
- data contracts and slas
- schema generation and validation
- registry implementation
- consumer-provider agreements
- approach: avro/protobuf schema from pydantic models, confluent schema reg; enforce at ingress via kafka interceptors, breach alerts.
reproducible jupyter pipeline (papermill automation)
- jupyter notebooks for pipelines
- papermill for parameterization
- reproducibility with environments
- automation in workflows
- approach: papermill sweep of params over the nb, poetry for env pinning; exec via airflow `BashOperator`, nbconvert to html artifacts (sweep sketch below).
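a minimal sketch of the sweep with `papermill.execute_notebook`; notebook paths and the `region` parameter are placeholders:

```python
import papermill as pm

# sweep a parameter grid over a template notebook tagged with a `parameters` cell
for region in ["us", "eu", "apac"]:
    pm.execute_notebook(
        "template.ipynb",              # input notebook (placeholder path)
        f"out/report_{region}.ipynb",  # one executed copy per parameter set
        parameters={"region": region},
    )
```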
whiteboard interview exercises (architecture explanation practice)
- system design sketching
- verbal explanation techniques
- common de interview scenarios
- feedback loops for improvement
- approach: pramp mocks for “design uber’s lakehouse,” c4 model sketches; record loom vids, a/b peer feedback on clarity/conciseness.