data engineering - ibm overview

source: ibm data engineering


🔹 what data engineers do

  • create and deploy algorithms, pipelines, and workflows that convert raw data → ready-to-use datasets
  • enable analysis and application of data regardless of source or format
  • follow data mesh principles — a decentralized architecture where data is organized by business domains (marketing, sales, services, etc.)

🔹 key use cases

  • data collection, storage, and management
  • real-time data analysis
  • machine learning integration

🔹 de and core datasets

  • data engineers build systems that turn massive raw data into core, usable datasets
  • focus on data as a product (daap) — different from data as a service (daas)
  • real-world examples:
    • retail and e-commerce data standardization
    • fraud detection pipelines
    • manufacturing analytics pipelines

⚙️ how data engineering works

data engineering designs and builds data pipelines that convert unstructured raw data into unified, high-quality datasets.

steps:

  1. data ingestion – movement of data from multiple sources into a single ecosystem
  2. data transformation – cleaning, correcting, normalizing data to ensure consistency and integrity
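the two steps above can be sketched in python (illustrative only — source names and field fixes are made-up assumptions, not from the ibm material):

```python
def ingest(*sources):
    """data ingestion: pull records from multiple sources into one ecosystem."""
    records = []
    for source in sources:
        records.extend(source)
    return records

def transform(records):
    """data transformation: clean, correct, and normalize for consistency."""
    cleaned = []
    seen = set()
    for rec in records:
        # normalize casing and strip whitespace
        name = rec.get("name", "").strip().lower()
        if not name:      # drop rows missing a required field
            continue
        if name in seen:  # remove duplicates to preserve integrity
            continue
        seen.add(name)
        cleaned.append({"name": name, "amount": round(float(rec["amount"]), 2)})
    return cleaned

# two hypothetical sources with inconsistent formatting of the same record
crm = [{"name": " Alex ", "amount": "10.5"}]
web = [{"name": "alex", "amount": "10.50"}, {"name": "Mia", "amount": "7"}]
dataset = transform(ingest(crm, web))
# dataset → [{'name': 'alex', 'amount': 10.5}, {'name': 'mia', 'amount': 7.0}]
```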

🧩 data normalization (detailed concept)

definition:
data normalization means organizing and transforming data to make it consistent, structured, and comparable.
it removes redundancy (duplicates) and ensures data integrity (accurate relationships between entities).


🔹 simple analogy

like cleaning and organizing your messy room:

  • books on shelves (grouped)
  • clothes in the cupboard
  • shoes on the rack
    → not deleting anything, just organizing logically for easier retrieval.

🔹 in data context

in databases (like sql):

  • normalization = splitting large tables into smaller ones and defining relationships (foreign keys)
  • it follows normal forms (1nf, 2nf, 3nf, etc.) — each removing specific redundancy or dependency type.

🔹 real-world example

before normalization:

| student_id | student_name | course | teacher | teacher_phone |
|---|---|---|---|---|
| 1 | alex | math | mr. raj | 9999991111 |
| 2 | mia | science | ms. neha | 9999992222 |
| 3 | sam | math | mr. raj | 9999991111 |

redundancy: teacher info repeats for each student.


after normalization:

students table

| student_id | student_name | course_id |
|---|---|---|
| 1 | alex | 101 |
| 2 | mia | 102 |
| 3 | sam | 101 |

courses table

| course_id | course_name | teacher_id |
|---|---|---|
| 101 | math | t1 |
| 102 | science | t2 |

teachers table

| teacher_id | teacher_name | teacher_phone |
|---|---|---|
| t1 | mr. raj | 9999991111 |
| t2 | ms. neha | 9999992222 |

→ cleaner, consistent, efficient for updates and queries.


🧠 da vs ds vs de

| role | focus | purpose |
|---|---|---|
| da (data analyst) | analyze large datasets | extract insights for present-day decisions |
| ds (data scientist) | build ml models | predict future outcomes |
| de (data engineer) | design pipelines & infra | deliver reliable, structured data for da/ds |

🧰 tools & pipeline formats

data engineers primarily use the following storage solutions, integration methods, and languages:

💾 data storage solutions

  • cloud computing services
  • relational databases
  • nosql databases
  • data warehouses
  • data lakes
  • data lakehouses

🔄 other data integration methods

  • change data capture (cdc): tracks and streams only changes (inserts, updates, deletes) from databases in real time.
  • data replication: creates exact copies of data across systems to ensure consistency and availability.
  • data visualization: converts datasets into graphical insights for easy pattern and trend detection.
  • stream data integration (sdi): continuously processes and merges live data streams from multiple real-time sources.
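a minimal sketch of the cdc idea above — diffing two snapshots and emitting only insert/update/delete events (real cdc tools read the database's transaction log; this toy version just illustrates what the change stream carries):

```python
def capture_changes(old, new):
    """compare two {key: row} snapshots and emit change events."""
    events = []
    for key, row in new.items():
        if key not in old:
            events.append(("insert", key, row))
        elif old[key] != row:
            events.append(("update", key, row))
    for key in old:
        if key not in new:
            events.append(("delete", key, old[key]))
    return events

before = {1: {"name": "alex"}, 2: {"name": "mia"}}
after  = {1: {"name": "alexa"}, 3: {"name": "sam"}}
events = capture_changes(before, after)
# events → [('update', 1, {'name': 'alexa'}), ('insert', 3, {'name': 'sam'}),
#           ('delete', 2, {'name': 'mia'})]
```

only the three changed rows travel downstream — unchanged rows generate no events, which is what makes cdc cheaper than full data replication.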

💻 core programming languages

  • sql → querying and data modeling
  • python → scripting, orchestration, and analytics
  • scala → big data and spark pipelines
  • go → building concurrent, high-performance data services
  • node.js → managing real-time, event-driven data integrations
  • java → enterprise-scale data processing