Project 6: The “GDPR” Deletion System
Scenario: A user exercises their “right to be forgotten.” You must delete their data from all Parquet files in the lake. The mission: efficiently find and delete a specific key from a petabyte-scale (simulated) lake.
- Tech: PySpark, Delta Lake `VACUUM`.
- Challenge: Doing this without rewriting the entire dataset.
- Dev to Prod:
  - Use Delta Lake’s `DELETE` command: `DeltaTable.forPath(...).delete("userid = '123'")`.
  - Run `VACUUM` to physically remove the old files.
  - Prod Requirement: Log an audit trail of exactly what was deleted and when.
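The delete, vacuum, and audit steps could be tied together roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `forget_user`, the JSON-lines audit file, and the record fields are all assumptions made for illustration, and the `userid` column is taken from the example predicate. The function accepts any object that behaves like a `delta.tables.DeltaTable` (`toDF`, `delete`, `vacuum`), so the control flow can be exercised without a Spark cluster.

```python
import json
import re
from datetime import datetime, timezone

# Conservative allow-list (an assumption) so the id can be embedded
# in a SQL predicate string without risk of injection.
_SAFE_ID = re.compile(r"^[A-Za-z0-9_-]+$")


def forget_user(delta_table, user_id, audit_log_path, retention_hours=None):
    """Delete one user's rows, vacuum the table, and append an audit record.

    `delta_table` is expected to behave like delta.tables.DeltaTable.
    The audit record layout is hypothetical.
    """
    if not _SAFE_ID.match(user_id):
        raise ValueError(f"refusing unsafe user id: {user_id!r}")

    condition = f"userid = '{user_id}'"

    # Count before deleting so the audit record can state how many rows matched.
    rows_matched = delta_table.toDF().filter(condition).count()

    # Logical delete: Delta rewrites only the data files that contain
    # matching rows and records the change in the transaction log.
    delta_table.delete(condition)

    # Physical delete: removes unreferenced files older than the retention
    # threshold (None means the table's default retention is used).
    delta_table.vacuum(retention_hours)

    record = {
        "user_id": user_id,
        "condition": condition,
        "rows_matched": rows_matched,
        "retention_hours": retention_hours,
        "deleted_at": datetime.now(timezone.utc).isoformat(),
    }
    # Append-only JSON lines file; one line per deletion request.
    with open(audit_log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Usage with a real table (requires pyspark and delta-spark; the path and
# the audit file name are placeholders):
# from delta.tables import DeltaTable
# table = DeltaTable.forPath(spark, "/path/to/table")
# forget_user(table, "123", "gdpr_audit.jsonl")
```

Two caveats on this sketch. The pre-delete count can drift from the true number of deleted rows if other writers touch the table concurrently; reading the operation metrics from `DeltaTable.history()` after the delete is one alternative. Also, `VACUUM` only removes files older than the retention threshold (7 days by default), so the old data files remain on disk until that window has passed.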