big data and streaming

cover hdfs, mapreduce, spark, kafka, flink, distributed consistency, and the cap theorem.

key concepts

  • hdfs architecture (namenode, datanodes)
  • spark rdd/dataframe, catalyst optimizer
  • kafka partitions, brokers, offsets
  • flink checkpointing and windowing
  • cap theorem and distributed-system tradeoffs (consistency vs. availability under a network partition)
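
to make the windowing concept concrete, here is a minimal pure-python sketch of tumbling (fixed, non-overlapping) event-time windows. it is not flink's api: the function name, the (timestamp, key) event shape, and the sample data are assumptions made for illustration, and it ignores watermarks, late events, and checkpointing.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per key in fixed, non-overlapping event-time windows.

    events: iterable of (timestamp_seconds, key) pairs (hypothetical shape).
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Every timestamp in [start, start + window_size) maps to the same window.
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sample stream: (event time in seconds, key)
events = [(1, "a"), (4, "b"), (9, "a"), (10, "a"), (14, "b"), (21, "a")]
result = tumbling_window_counts(events, window_size=10)

for (start, key), n in sorted(result.items()):
    print(f"[{start}, {start + 10}) {key}: {n}")
```

a sliding window would assign each event to several overlapping windows instead of exactly one, which is a useful variation to try next.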

explanation practice

  • spark dag diagram
  • kafka partitioning and offset diagram
  • flink windowing illustration
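
one way to practice the kafka diagram is to generate it as text. the sketch below assigns keyed records to partitions and prints each partition as a row of [offset:key] cells. all names are hypothetical, and crc32 is only a stand-in for a stable hash (kafka's default partitioner uses murmur2 for keyed records).

```python
import zlib


def assign_partition(key, num_partitions):
    # Stable hash of the key, so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_partitions(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        p = assign_partition(key, num_partitions)
        # The offset of a record is simply its position in the partition's log.
        partitions[p].append((key, value))
    return partitions


def render(partitions):
    lines = []
    for p, log in enumerate(partitions):
        cells = " ".join(f"[{offset}:{key}]" for offset, (key, _) in enumerate(log))
        lines.append(f"partition {p} | {cells}".rstrip())
    return "\n".join(lines)


# Hypothetical records: (key, value)
records = [("alice", 10), ("bob", 7), ("alice", 3), ("carol", 9), ("bob", 1)]
partitions = build_partitions(records, num_partitions=3)
diagram = render(partitions)
print(diagram)
```

the point to explain from the output: offsets are per partition, and ordering is only guaranteed within a partition, not across the topic.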

projects

1. spark batch analytics

  • process millions of records with spark → write parquet → query with trino

2. kafka streaming analytics

  • trade stream → spark streaming → redis → grafana
  • detect anomalies in payment streams
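
for the anomaly-detection part, a rolling z-score is one simple starting point. the sketch below is plain python, not spark streaming; the function name, window size, threshold, and sample amounts are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(amounts, window=5, threshold=3.0):
    """Flag amounts that deviate strongly from the preceding `window` values.

    Returns a list of (index, amount) pairs for flagged payments.
    """
    recent = deque(maxlen=window)
    anomalies = []
    for i, amount in enumerate(amounts):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Skip the check when the window has zero variance.
            if sigma > 0 and abs(amount - mu) / sigma > threshold:
                anomalies.append((i, amount))
        # Note: flagged values still enter the window and inflate later statistics.
        recent.append(amount)
    return anomalies


# Hypothetical payment amounts with one obvious outlier
payments = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 500.0, 21.0, 20.0]
flagged = detect_anomalies(payments)
print(flagged)
```

a possible refinement is to keep flagged values out of the window, or to use a median-based score, so one outlier does not mask the next.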

4. build kafka clone (lite)

  • python socket/queue simulation
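
a possible starting point for the queue-based version, assuming a single topic and no networking: a queue stands in for the socket, each partition is an append-only list, and each consumer group tracks its own offset. the class and method names are hypothetical and do not mirror kafka's client api.

```python
import queue
import zlib


class MiniBroker:
    """Hypothetical in-memory broker: one topic, N append-only partition logs."""

    def __init__(self, num_partitions=2):
        self.inbox = queue.Queue()   # stands in for the network socket
        self.logs = [[] for _ in range(num_partitions)]
        self.committed = {}          # (group, partition) -> next offset to read

    def send(self, key, value):
        """Producer side: enqueue a record, as if writing to a socket."""
        self.inbox.put((key, value))

    def poll_inbox(self):
        """Broker side: drain the queue and append each record to a partition log."""
        while True:
            try:
                key, value = self.inbox.get_nowait()
            except queue.Empty:
                break
            partition = zlib.crc32(key.encode("utf-8")) % len(self.logs)
            self.logs[partition].append((key, value))

    def fetch(self, group, partition, max_records=10):
        """Consumer side: read from the group's offset, then advance it."""
        start = self.committed.get((group, partition), 0)
        batch = self.logs[partition][start:start + max_records]
        self.committed[(group, partition)] = start + len(batch)
        return [(start + i, record) for i, record in enumerate(batch)]


broker = MiniBroker(num_partitions=2)
for i in range(5):
    broker.send(f"user-{i % 2}", f"event-{i}")
broker.poll_inbox()

first = [r for p in range(2) for r in broker.fetch("group-a", p)]
again = [r for p in range(2) for r in broker.fetch("group-a", p)]
other = [r for p in range(2) for r in broker.fetch("group-b", p)]
print(len(first), len(again), len(other))
```

the second fetch for group-a returns nothing because its offsets already advanced, while group-b reads the full log independently. swapping the queue for a real socket server is the natural next step.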

5. hdfs simulator

  • replicate file blocks across local directories in python, one directory per simulated datanode
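
a minimal sketch of one way to build this, assuming each datanode is a local directory and the namenode is a dictionary of block locations. the class name, the tiny block size, and the round-robin placement are illustrative choices; real hdfs defaults to 128 mb blocks, replication factor 3, and rack-aware placement.

```python
import shutil
import tempfile
from pathlib import Path


class MiniHDFS:
    """Hypothetical simulator: each datanode is a local directory."""

    def __init__(self, root, num_datanodes=3, block_size=8, replication=2):
        self.block_size = block_size
        self.replication = replication
        self.datanodes = [Path(root) / f"datanode{i}" for i in range(num_datanodes)]
        for d in self.datanodes:
            d.mkdir(parents=True, exist_ok=True)
        # Namenode metadata: filename -> [(block_id, [datanode index, ...])]
        self.namespace = {}
        self._next = 0  # round-robin cursor for replica placement

    def put(self, name, data):
        blocks = []
        for n, start in enumerate(range(0, len(data), self.block_size)):
            block_id = f"{name}.blk{n}"
            targets = [(self._next + r) % len(self.datanodes)
                       for r in range(self.replication)]
            self._next += 1
            for t in targets:
                (self.datanodes[t] / block_id).write_bytes(
                    data[start:start + self.block_size])
            blocks.append((block_id, targets))
        self.namespace[name] = blocks

    def get(self, name):
        out = bytearray()
        for block_id, targets in self.namespace[name]:
            for t in targets:
                path = self.datanodes[t] / block_id
                if path.exists():  # fall back to the next replica if this one is gone
                    out += path.read_bytes()
                    break
            else:
                raise IOError(f"all replicas lost for {block_id}")
        return bytes(out)


root = tempfile.mkdtemp()
fs = MiniHDFS(root)
payload = b"hello distributed file systems"
fs.put("notes.txt", payload)

shutil.rmtree(fs.datanodes[0])  # simulate losing one datanode
recovered = fs.get("notes.txt")
print(recovered == payload)

shutil.rmtree(root)  # clean up the temporary directories
```

because every block has two replicas on different directories, the read still succeeds after one directory is deleted. a fuller simulator would also re-replicate under-replicated blocks once a datanode is detected as lost.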