← ...

performance and scalability

focus on partitioning, bucketing, caching, spark tuning, vectorized computation, and distributed join optimization.

key concepts

  • partitioning and bucketing
  • caching strategies
  • spark shuffle, broadcast join
  • vectorized computation (arrow, polars)
  • cap theorem tradeoffs

explanation practice

  • partition vs bucket diagram
  • spark shuffle visualization
  • vectorized vs row-wise computation

projects

1. spark tuning benchmark

  • test join strategies and shuffle performance

2. query optimization analyzer

  • explain plan comparison

3. vectorized dataframe benchmark

  • pandas vs polars vs spark

4. kafka consumer lag visualizer

  • monitor backpressure