theory
1. sql (zero to advanced)
highest priority.
a. foundation (the “select” statement)
- basic syntax: select, from, where, distinct.
- filtering: operators (=, <>, >, <), between, in, like (wildcards %, _).
- null handling: understanding is null, is not null, and using coalesce() to handle missing values.
- sorting: order by (asc/desc), multi-column sorting.
- limiting: limit, offset (used for pagination).
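a minimal sketch of these basics using python's built-in sqlite3 module (the orders table, columns, and data are made up for illustration):

```python
import sqlite3

# in-memory database with a made-up orders table
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders (id integer, customer text, amount real, status text);
    insert into orders values
        (1, 'alice', 120.0, 'shipped'),
        (2, 'bob',   null,  'pending'),
        (3, 'carol', 80.0,  'shipped'),
        (4, 'dave',  200.0, 'cancelled');
""")

# filtering, like, null handling, sorting, and limiting in one query
rows = conn.execute("""
    select distinct customer,
           coalesce(amount, 0) as amount      -- treat missing amounts as 0
    from orders
    where status <> 'cancelled'
      and customer like '%a%'                 -- wildcard match
      and coalesce(amount, 0) between 50 and 150
    order by amount desc
    limit 2 offset 0                          -- pagination: first page of 2
""").fetchall()
print(rows)   # [('alice', 120.0), ('carol', 80.0)]
```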
b. aggregation & grouping
- aggregate functions: count(), sum(), avg(), min(), max().
- grouping: group by clause (grouping by one or multiple columns).
- filtering groups: having vs where (crucial interview question: where filters rows, having filters groups).
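a quick sketch of the where-vs-having distinction, again with sqlite3 and an illustrative sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sales (region text, product text, amount real);
    insert into sales values
        ('east', 'a', 100), ('east', 'b', 300),
        ('west', 'a', 50),  ('west', 'b', 40), ('west', 'c', 30);
""")

# where filters individual rows before grouping;
# having filters whole groups after aggregation
rows = conn.execute("""
    select region, count(*) as n_sales, sum(amount) as total
    from sales
    where amount > 20              -- row-level filter
    group by region
    having sum(amount) > 100       -- group-level filter
    order by total desc
""").fetchall()
print(rows)   # [('east', 2, 400.0), ('west', 3, 120.0)]
```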
c. joins (the “venn diagram” logic)
- inner join: matching rows only.
- left/right join: retaining rows from one side even if no match exists.
- full outer join: retaining all rows from both sides.
- cross join: cartesian product (every row matches every row).
- self join: joining a table to itself (e.g., “find employees who earn more than their managers”).
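the classic self-join question from above, sketched with sqlite3 (employee names and salaries are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employees (id integer, name text, salary integer, manager_id integer);
    insert into employees values
        (1, 'ann',  90000, null),    -- ann has no manager
        (2, 'bob',  95000, 1),       -- bob reports to ann and earns more
        (3, 'cara', 70000, 1);
""")

# self join: the same table appears twice under two aliases,
# e for the employee and m for that employee's manager
rows = conn.execute("""
    select e.name as employee, m.name as manager
    from employees e
    inner join employees m on e.manager_id = m.id
    where e.salary > m.salary
""").fetchall()
print(rows)   # [('bob', 'ann')]
```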
d. advanced & analytical (the “senior” skills)
- window functions: row_number() vs rank() vs dense_rank(); lead() and lag() (accessing previous/next rows without joining); running totals: sum(salary) over (order by date) (see the sketch after this list).
- set operations: union vs union all (removes duplicates vs keeps duplicates), intersect, except.
- ctes (common table expressions): using with clauses to clean up complex subqueries.
- date/string manipulation: date_trunc, datediff, concat, substring, trim.
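a small sketch of the window-function and cte items, using sqlite3 (window functions need sqlite 3.25+, which recent python builds bundle; the salaries table is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table salaries (dept text, name text, salary integer, pay_date text);
    insert into salaries values
        ('eng', 'ann',  90, '2024-01-01'),
        ('eng', 'bob',  90, '2024-02-01'),
        ('eng', 'cara', 80, '2024-03-01');
""")

rows = conn.execute("""
    with eng as (                       -- cte: name a subquery, reuse it below
        select * from salaries where dept = 'eng'
    )
    select name,
           rank()       over (order by salary desc) as rnk,
           dense_rank() over (order by salary desc) as drnk,
           lag(salary)  over (order by pay_date)    as prev_salary,
           sum(salary)  over (order by pay_date)    as running_total
    from eng
    order by pay_date
""").fetchall()
for row in rows:
    print(row)
# ('ann', 1, 1, None, 90)     -- rank and dense_rank tie at 1
# ('bob', 1, 1, 90, 180)      -- lag() sees ann's salary, running total grows
# ('cara', 3, 2, 90, 260)     -- rank skips to 3, dense_rank moves to 2
```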
2. data structures & algorithms (python)
focus on practical data manipulation, not academic theory. stick to “easy” and “medium” problems.
a. core structures
- arrays (lists): slicing, indexing, iteration.
- hashmaps (dictionaries): the most important structure. key-value lookups, counting frequencies.
- sets: removing duplicates, set operations (union/intersection).
- strings: reversing, splitting, parsing (crucial for data cleaning).
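a few lines showing the dictionary, set, and string moves above on a made-up event string:

```python
# counting frequencies with a dict, deduplicating with a set,
# and parsing a raw string: the bread and butter of data cleaning
raw = "2024-01-01,click;2024-01-01,click;2024-01-02,view"

events = [tuple(item.split(",")) for item in raw.split(";")]

counts = {}
for _, action in events:
    counts[action] = counts.get(action, 0) + 1   # frequency count
print(counts)                    # {'click': 2, 'view': 1}

unique_dates = {date for date, _ in events}      # set drops duplicates
print(sorted(unique_dates))      # ['2024-01-01', '2024-01-02']

print("engineer"[::-1])          # 'reenigne', reversing a string with a slice
```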
b. key algorithms
- two pointers: moving indices from both ends (e.g., palindrome check).
- sliding window: processing a subset of data (e.g., “max profit in a sub-array”).
- stack & queue: lifo vs fifo concepts.
- binary search: finding items in sorted data (o(log n)).
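minimal sketches of two pointers, sliding window, and binary search (function names and inputs are illustrative):

```python
def is_palindrome(s: str) -> bool:
    """two pointers: walk inward from both ends."""
    left, right = 0, len(s) - 1
    while left < right:
        if s[left] != s[right]:
            return False
        left += 1
        right -= 1
    return True


def max_window_sum(nums, k):
    """sliding window: reuse the previous sum instead of re-adding k items."""
    window = best = sum(nums[:k])
    for i in range(k, len(nums)):
        window += nums[i] - nums[i - k]   # slide right: add new item, drop old
        best = max(best, window)
    return best


def binary_search(nums, target):
    """binary search on sorted data, o(log n): return the index or -1."""
    lo, hi = 0, len(nums) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if nums[mid] == target:
            return mid
        if nums[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


print(is_palindrome("level"))                 # True
print(max_window_sum([1, 9, -1, 4, 3], 2))    # 10
print(binary_search([2, 5, 8, 12], 8))        # 2
```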
c. “good to know” (if you have time)
- linked lists: basic traversal and reversal.
- trees (bfs/dfs): basic traversal (bytedance asks this).
- recursion: base cases and self-calling functions.
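a short sketch of recursive dfs and queue-based bfs on a tiny binary tree (the nested-dict representation is chosen only for illustration):

```python
from collections import deque

# a tiny binary tree as nested dicts
tree = {
    "val": 1,
    "left": {"val": 2,
             "left": {"val": 4, "left": None, "right": None},
             "right": None},
    "right": {"val": 3, "left": None, "right": None},
}


def dfs(node):
    """depth-first traversal via recursion; the empty node is the base case."""
    if node is None:
        return []
    return [node["val"]] + dfs(node["left"]) + dfs(node["right"])


def bfs(root):
    """breadth-first traversal via a fifo queue: visit the tree level by level."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            continue
        order.append(node["val"])
        queue.append(node["left"])
        queue.append(node["right"])
    return order


print(dfs(tree))   # [1, 2, 4, 3]  (pre-order, goes deep first)
print(bfs(tree))   # [1, 2, 3, 4]  (level order, goes wide first)
```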
3. theory & system design (data engineering)
a. data warehousing concepts
- oltp vs olap: why we don’t run analytics on production databases.
- row vs columnar storage: why parquet/redshift is faster for analytics than postgres.
- dimensional modeling:
  - fact tables (events, transactions).
  - dimension tables (users, products, locations).
  - star schema vs snowflake schema.
- scd (slowly changing dimensions):
  - type 1: overwrite old data (no history).
  - type 2: add new row with flag/date (preserve history).
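a minimal sketch of the type 2 pattern with sqlite3; the dim_customer table and the valid_from/valid_to/is_current columns are illustrative, not a fixed standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dim_customer (
        customer_id integer, city text,
        valid_from text, valid_to text, is_current integer
    );
    insert into dim_customer values (42, 'berlin', '2023-01-01', null, 1);
""")


def apply_scd2(conn, customer_id, new_city, change_date):
    """type 2: close the current row, then insert a new current row."""
    conn.execute(
        "update dim_customer set valid_to = ?, is_current = 0 "
        "where customer_id = ? and is_current = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "insert into dim_customer values (?, ?, ?, null, 1)",
        (customer_id, new_city, change_date),
    )


apply_scd2(conn, 42, "munich", "2024-06-01")
for row in conn.execute("select * from dim_customer order by valid_from"):
    print(row)
# (42, 'berlin', '2023-01-01', '2024-06-01', 0)   history preserved
# (42, 'munich', '2024-06-01', None, 1)           current version
```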
b. big data architecture
- etl vs elt: transform then load (legacy) vs load then transform (modern).
- data lake vs data warehouse: s3 (raw files) vs redshift/snowflake (structured tables).
- cap theorem: consistency, availability, partition tolerance (you can only guarantee two of the three; during a partition the real trade-off is consistency vs availability).
- acid vs base: strong consistency (banks) vs eventual consistency (social media).
c. distributed systems
- sharding vs partitioning: how to split data across servers.
- horizontal vs vertical scaling: adding more machines vs adding more cpu/ram.
- idempotency: ensuring re-running a job doesn’t duplicate data (see the sketch after this list).
- batch vs streaming: daily jobs (airflow) vs real-time (kafka).
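to make the idempotency point concrete, here is one common pattern sketched with sqlite3: delete the target partition, then reload it, so a retry cannot double-count (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table daily_revenue (ds text, revenue real)")


def load_partition(conn, ds, rows):
    """idempotent batch load: wipe the target date, then write it.
    running this twice for the same ds leaves exactly one copy."""
    with conn:   # single transaction: delete + insert commit together
        conn.execute("delete from daily_revenue where ds = ?", (ds,))
        conn.executemany("insert into daily_revenue values (?, ?)", rows)


batch = [("2024-06-01", 100.0), ("2024-06-01", 250.0)]
load_partition(conn, "2024-06-01", batch)
load_partition(conn, "2024-06-01", batch)   # a retry does not duplicate data

print(conn.execute("select count(*), sum(revenue) from daily_revenue").fetchone())
# (2, 350.0)
```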
4. resume-specific justification topics
you must be able to speak about these “deep dives” for your specific resume.
- apache airflow: what is a dag? what are operators? how do you handle backfills?
- apache kafka: topics, partitions, producers, consumers, consumer groups, offsets.
- apache spark:
  - transformations vs actions: lazy evaluation.
  - shuffle: why moving data between nodes is expensive.
  - optimization: partitioning, caching, broadcast joins.
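a small pyspark sketch of lazy evaluation and a broadcast join (assumes pyspark is installed; the dataframes and column names are invented for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

orders = (spark.range(1_000_000)
               .withColumnRenamed("id", "order_id")
               .withColumn("customer_id", col("order_id") % 100))
customers = spark.createDataFrame(
    [(i, f"customer_{i}") for i in range(100)], ["customer_id", "name"]
)

# transformations: nothing runs yet, spark only builds a plan (lazy evaluation);
# broadcast() ships the small table to every node and avoids a shuffle
joined = orders.join(broadcast(customers), "customer_id")
recent = joined.filter(col("order_id") > 500_000)

# action: only now does spark actually execute the plan
print(recent.count())
```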
- ci/cd for data: how you used github actions to test sql/python code (linting).