project 4: the “billion row” optimization
scenario: a nightly reporting job takes 6 hours to run and you need to cut it to 20 minutes. the mission: process a massive dataset (generate 1 gb of dummy csv data) and optimize the write speed.
- tech: pyspark, parquet, snappy compression.
- challenge: understanding “skew” (some keys, like one huge city, carry far more rows than others) and “shuffling” (data moving between executors to satisfy a group-by).
- dev to prod (sketches for each step follow this list):
  - generate a massive csv with random sales data.
  - write a naive pyspark job to group by `city` and sum `sales`. time it.
  - prod requirement: optimize it using `partitionBy('city')`, `repartition()`, and converting to parquet. prove the speedup (e.g., “reduced time by 80%”).
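a minimal sketch of the data generator, assuming a simple schema (order_id, city, amount, ts) and a file named sales.csv; the row count is a rough guess at what lands near 1 gb, so adjust it:

```python
# sketch: generate roughly 1 gb of dummy sales data as csv.
# schema, file name, and row count are assumptions; tune TARGET_ROWS to hit the size you want.
import csv
import random
import uuid
from datetime import datetime, timedelta

CITIES = ["new york", "london", "tokyo", "berlin", "mumbai", "sao paulo"]
TARGET_ROWS = 15_000_000  # roughly 1 gb at ~70 bytes per row

with open("sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "city", "amount", "ts"])
    start = datetime(2024, 1, 1)
    for _ in range(TARGET_ROWS):
        writer.writerow([
            uuid.uuid4().hex,                    # unique order id
            random.choice(CITIES),               # weight this list if you want to create skew
            round(random.uniform(1, 500), 2),    # sale amount
            (start + timedelta(seconds=random.randint(0, 31_536_000))).isoformat(),  # timestamp in 2024
        ])
```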
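the naive job, sketched under the same assumptions (paths and column names match the generator above): read the raw csv, group by city, sum the amounts, write the report, and print the wall-clock time.

```python
# sketch of the naive job: csv in, group-by, csv out. every run re-parses the whole file.
import time
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("naive-sales-report").getOrCreate()

t0 = time.time()
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)   # inferSchema forces an extra pass over the data
report = sales.groupBy("city").agg(F.sum("amount").alias("total_sales"))
report.write.mode("overwrite").csv("report_naive")
print(f"naive job took {time.time() - t0:.1f}s")
```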
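and a sketch of the optimized path: convert the csv once to snappy-compressed parquet partitioned by city, then run the report against parquet. the partition count (64) and the output paths are assumptions; tune them for your cluster.

```python
# sketch of the optimized job: one-time conversion to partitioned parquet, then a cheap report.
import time
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("optimized-sales-report").getOrCreate()

t0 = time.time()
raw = spark.read.csv("sales.csv", header=True, inferSchema=True)
(raw.repartition(64)                     # even out partitions before the write
    .write.mode("overwrite")
    .option("compression", "snappy")     # snappy is spark's default parquet codec
    .partitionBy("city")                 # co-locate each city's rows on disk
    .parquet("sales_parquet"))
t1 = time.time()
print(f"one-time conversion took {t1 - t0:.1f}s")

# the nightly report now scans columnar, pre-partitioned data instead of raw csv
sales = spark.read.parquet("sales_parquet")
report = sales.groupBy("city").agg(F.sum("amount").alias("total_sales"))
report.coalesce(1).write.mode("overwrite").parquet("report_optimized")
print(f"optimized report took {time.time() - t1:.1f}s")
```

comparing the printed timings from the naive and optimized runs is what backs up a claim like “reduced time by 80%.”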