etl

🔹 definition

etl (extract, transform, load) is a data integration process that collects, cleans, and organizes data from multiple sources into a single, consistent dataset — typically for a data warehouse, data lake, or analytical system.


🔸 step 1: extract

raw data is copied or exported from source systems into a staging area.

typical data sources:

  • sql / nosql servers
  • crm or erp systems
  • json / xml files
  • flat files (csv, txt)
  • emails or logs
  • web pages or apis
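
the extract step above can be sketched in a few lines of python. this is a minimal illustration, not a production connector: the csv source is a hypothetical in-memory export standing in for a real file or api response, and "staging" is just a list of raw rows.

```python
import csv
import io

# hypothetical source: a small csv export from a crm system
# (in practice this would be a file handle or an api response)
SOURCE_CSV = """id,name,amount
1,alice,100
2,bob,250
2,bob,250
3,,75
"""

def extract(source):
    """copy raw rows from a source into a staging list, unchanged."""
    reader = csv.DictReader(io.StringIO(source))
    # staging area: raw rows only — duplicates and bad values are
    # deliberately kept; cleaning belongs to the transform step
    return list(reader)

staged = extract(SOURCE_CSV)
print(len(staged))  # 4 raw rows, duplicates and blanks included
```

note that extract copies everything verbatim: keeping the staging data raw makes the transform step repeatable if the rules change later.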

🔸 step 2: transform

this step cleans and restructures the extracted data.

common operations:

  • filter
  • clean
  • aggregate
  • de-duplicate
  • validate
  • authenticate
  • calculate metrics
  • apply audits
  • encrypt for security and compliance
  • reformat or join tables to match schema
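
several of the operations above (validate, de-duplicate, reformat to match the target schema) can be combined into one pass over the staged rows. a minimal sketch, assuming staged rows shaped like the extract example; the field names and rules here are illustrative, not a standard:

```python
# hypothetical staged rows, as the extract step might leave them
staged = [
    {"id": "1", "name": "alice", "amount": "100"},
    {"id": "2", "name": "bob", "amount": "250"},
    {"id": "2", "name": "bob", "amount": "250"},  # duplicate
    {"id": "3", "name": "", "amount": "75"},      # fails validation
]

def transform(rows):
    """clean, validate, de-duplicate, and reformat staged rows."""
    seen = set()
    clean = []
    for row in rows:
        # validate: drop rows with a missing name
        if not row["name"]:
            continue
        # de-duplicate on the id column
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        # reformat: cast text fields to types the target schema expects
        clean.append({
            "id": int(row["id"]),
            "name": row["name"],
            "amount": float(row["amount"]),
        })
    return clean

clean = transform(staged)
print(len(clean))  # 2 rows survive: the duplicate and the blank are gone
```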

🔸 step 3: load

the final transformed dataset is moved from the staging area into a target warehouse or analytical system.
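
a minimal load sketch, using an in-memory sqlite database as a stand-in for a real warehouse (the table name and schema are hypothetical, matched to the transform example):

```python
import sqlite3

# transformed rows, ready for the target schema
rows = [(1, "alice", 100.0), (2, "bob", 250.0)]

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
)
# bulk insert — batch loads like this are the typical etl pattern
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
conn.commit()

# the data is now queryable by the analytical system
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 350.0
```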


✅ when etl works best

  • smaller data volumes
  • data that must be transformed before loading (e.g. for quality or compliance)
  • batch-oriented, scheduled updates