ETL
🔹 definition
ETL (extract, transform, load) is a data integration process that collects, cleans, and organizes data from multiple sources into a single, consistent dataset, typically destined for a data warehouse, data lake, or other analytical system.
🔸 step 1: extract
Raw data is copied or exported from the source systems into a staging area (see the sketch after the list below).
typical data sources:
- SQL / NoSQL databases
- CRM or ERP systems
- JSON / XML files
- flat files (CSV, TXT)
- emails or application logs
- web pages or APIs
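
A minimal sketch of the extract step in Python, using only the standard library. The database file, the orders table, and the CSV path are hypothetical stand-ins for real source systems; the point is that raw records are copied into staging without being reshaped yet.

```python
# extract sketch: copy raw records from hypothetical sources into a staging area
import csv
import json
import sqlite3
from pathlib import Path

STAGING = Path("staging")
STAGING.mkdir(exist_ok=True)

def extract_sql(db_path: str) -> None:
    """Copy raw rows from a source SQL table into the staging area."""
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        rows = [dict(r) for r in conn.execute("SELECT * FROM orders")]
    (STAGING / "orders.json").write_text(json.dumps(rows))

def extract_csv(csv_path: str) -> None:
    """Copy a flat file (CSV) into the staging area unchanged."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    (STAGING / "customers.json").write_text(json.dumps(rows))

if __name__ == "__main__":
    extract_sql("source_crm.db")   # hypothetical CRM/ERP export
    extract_csv("customers.csv")   # hypothetical flat-file source
```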
🔸 step 2: transform
This step cleans and restructures the extracted data so it matches the target schema; a sketch follows the list below.
common operations:
- filter
- clean
- aggregate
- de-duplicate
- validate
- authenticate
- calculate metrics
- apply audits
- encrypt for security and compliance
- reformat or join tables to match the target schema
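
A minimal sketch of the transform step, applied to the staged records from the extract sketch above. The field names (order_id, email, amount) are assumptions for illustration; the code shows validation, de-duplication, cleaning, and a calculated metric.

```python
# transform sketch: validate, de-duplicate, clean, and reshape staged records
import json
from pathlib import Path

STAGING = Path("staging")

def transform(records: list[dict]) -> list[dict]:
    seen = set()
    cleaned = []
    for rec in records:
        # validate: drop rows missing the business key or the amount
        if not rec.get("order_id") or rec.get("amount") is None:
            continue
        # de-duplicate on the business key
        if rec["order_id"] in seen:
            continue
        seen.add(rec["order_id"])
        # clean, reformat, and calculate fields to match the target schema
        cleaned.append({
            "order_id": str(rec["order_id"]),
            "email": str(rec.get("email", "")).strip().lower(),
            "amount_usd": round(float(rec["amount"]), 2),
        })
    return cleaned

if __name__ == "__main__":
    raw = json.loads((STAGING / "orders.json").read_text())
    (STAGING / "orders_clean.json").write_text(json.dumps(transform(raw)))
```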
🔸 step 3: load
The final, transformed dataset is moved from the staging area into the target data warehouse or analytical system.
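
A minimal sketch of the load step. SQLite stands in for the target warehouse here, and the fact_orders table and its columns are hypothetical; in practice the insert would go to the warehouse's bulk-load interface.

```python
# load sketch: move transformed records from staging into the target table
import json
import sqlite3
from pathlib import Path

STAGING = Path("staging")

def load(db_path: str) -> None:
    """Insert the cleaned records into the target table."""
    records = json.loads((STAGING / "orders_clean.json").read_text())
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS fact_orders ("
            " order_id TEXT PRIMARY KEY, email TEXT, amount_usd REAL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO fact_orders VALUES (:order_id, :email, :amount_usd)",
            records,
        )

if __name__ == "__main__":
    load("warehouse.db")   # hypothetical warehouse file
```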
✅ when to use ETL
- best suited for smaller data volumes
- preferred when data must be transformed before it is loaded
- a good fit for batch-oriented, scheduled updates