
Welcome
Database Management Systems 2


Course COMP 263
10:00 AM - 11:20 AM, Tuesday and Thursday
Location: Baun Hall 214

Week Meeting Dates (TR) Topic Notes Assignments
Week 1 Aug 26, Aug 28 Unstructured and Evolving Data 🔗 IndexedDB Lab 1, HW 1, Roster Verification
Week 2 Sept 2, Sept 4 Data Modeling and Engineering 🔗
Week 3 Sept 9, Sept 11 Data Quality and Standards 🔗 MongoDB Lab 2, HW 2, Project Part 1: Data Lake Queries
Week 4 Sept 16, Sept 18 Data Transformation and Processing 🔗
Week 5 Sept 23, Sept 25 Database Pipeline Architecture and Data Lake 🔗 Neo4j
Week 6 Sept 30, Oct 2 Database Pipeline Architecture and Data Warehouse 🔗 Lab 3, HW 3
Week 7 Oct 7, Oct 9 Database Pipeline Architecture and Performance / Security🔗 Redis
Week 8 Oct 14, Oct 16 Midterm Preparation Midterm: Oct 16th
Week 9 Oct 21, Oct 23 Database Pipeline Architecture and Data Dashboard 🔗 ClickHouse
Week 10 Oct 28, Oct 30 Database Sharding and Replication Cassandra Lab 4, HW 4, Project Part 2
Week 11 Nov 4, Nov 6 Database Migration from SQL to NoSQL 🔗
Week 12 Nov 11, Nov 13 Database Pipeline and Security 🔗 Faiss Lab 5, HW 5
Week 13 Nov 18, Nov 20 Course Review
Week 14 Nov 25, Nov 27 (Thanksgiving) Finals Preparation
Week 15 Dec 2, Dec 4 Project Presentation Final: Dec 4th from 10-11:20 AM in Baun Hall 214
Week 16 Dec 9, Dec 11 Finals Week Final: Dec 11th from 10-11:20 AM in Baun Hall 214

Today

  • What changed?
  • Why DMS 2?
  • Why IndexedDB?

Data Management Before 2000

  • Limited Internet Connectivity
  • Isolated Systems
  • Manual Data Entry
  • From Paper Tables to SQL Tables
  • Static Data Structure
  • Predefined Schemas Before Data Entry
Data Management Illustration

Early 2000s: Data Shift

  • Connectivity: Internet, NFC, BLE, Wi-Fi
  • Human Data: Real-time user data
  • Machine Data: Real-time IoT sensor data
  • Mobile Device: Human and Machine data
Data Shift Illustration

Connectivity and Automation 🙂

  • Healthcare (Scale, Glucose)
  • Agriculture
  • Automotive
  • City (Parking, Energy, Environment)
  • Retail (Stock, JITD)
  • Home (Meters, Laundry Machine)
  • Buildings & facilities
  • Industrial IoT (manufacturing)
  • Utilities (water/gas)
  • Logistics & supply chain
  • Wearables & medical devices
Connectivity and Automation
smart things

Connectivity and Automation 🤔

Connectivity and automation led to increased dependency on complex software systems (Sensors → Embedded IoT → Phone/App → Backend/Cloud) across industries.

Connectivity and Automation 😠

Software components are continuously updated independently of the other components that depend on them.

Updating Relational Schemas

A static, pre-defined data schema does not scale in large software systems

Example in Retail...

Point of Sale System

A point of sale SQLite database should flexibly handle scanned data, adapting to changes in data structure without schema updates.

Point of Sale System
(Image 1: Sunday, Image 2: Friday, Image 3: Monday)
flowchart LR

  %% Producers 
  subgraph PROD[Producers]
    PUMA[Puma SQL 🛢️]
    NIKE[Nike SQL 🛢️]
  end

  %% Backend 
  subgraph BE[Backend Service]
    API[Backend API]
  end

  %% Frontend
  subgraph FE[Frontend Phone App]
    APP[Mobile App]
    LDB[Local SQL DB 🛢️]
  end

  %% Data flows
  PUMA -->|JSON feed| API
  NIKE -->|JSON feed| API

  APP <--> |JSON requests/responses| API

  APP -->|SQL queries| LDB

  API -->|Sync JSON| APP
  APP -->|Send JSON| API
    
{
  "product": "t-shirt",
  "brand": "Puma",
  "size": "M",
  "price": 29.99,
  "stock": 120,
  "color": "orange"   ← added today
}
      
INSERT INTO products 
  (product, brand, size, price, stock, color)
VALUES 
  ('t-shirt', 'Puma', 'M', 29.99, 120, 'orange');
      
⚠️ Mismatch:
JSON has "color"
Table has no "color" column

ERROR: column "color"
does not exist
      

From SQL to NoSQL...

Carlo Strozzi "NoSQL" (1998)

  • No SQL: relational DB without SQL.
  • Unix style: text tables + pipes.
  • Portable: awk/sed/grep tooling.
  • Lightweight: small, script-driven RDBMS.
  • Still relational: algebra via scripts.
  • Focus: simplicity over complexity.
  • Against ORMs: avoided heavy abstraction.
  • Open format: plain text over binaries.
flowchart TB
    SQL[SQL Queries] --> RDBMS[(Relational DB)]
    S1[awk] --> RDBMS
    S2[sed] --> RDBMS
    S3[grep] --> RDBMS
      
Not Only SQL

Schema-less NoSQL – 2000s

  • Google Bigtable (2006): column-oriented
  • Amazon Dynamo (2007): key-value store
  • FB Cassandra (2008): Bigtable + Dynamo
  • 10gen MongoDB (2009): document store
  • W3C IndexedDB (2010): key-value store
  • Shift in meaning: No SQL
  • Use cases: scalability, flexibility, web & IoT
flowchart TB
  MQL[MQL MongoDB] --> DB[NoSQL Data Stores]
  GraphQL[GraphQL] --> DB
  JSIDB[JavaScript IndexedDB] --> DB
  Redis[Redis Commands] --> DB
  CQL[CQL Cassandra] --> DB
      

Why is Data Unstructured and Evolving?

Software → Hardware → Data
Hardware → Software → Data
Hardware → Data → Software

Today, Thursday Aug 28th

  1. Big Data Characteristics (5V)
  2. Browser and IndexedDB Architecture
  3. Lab Use Case Examples
  4. Homework 1 and Lab 1

Volume

  • Massive data size
  • Petabytes to exabytes
  • Scalability critical
Example | RDBMS Limitation
FB: ~4 PB/day | No horizontal scaling
IoT sensors: TB/hour | Storage bottlenecks

Velocity

  • Real-time streams
  • Low-latency ingest
  • Continuous updates
Example | RDBMS Limitation
NYSE: ms updates | No real-time support
Twitter: 6K tweets/sec | Slow insert rate

Variety

  • Structured & unstructured
  • Multiformat input
  • No fixed schema
Example | RDBMS Limitation
Logs, images, JSON | Strict schemas only
Emails, videos | No native support

Veracity

  • Uncertain accuracy
  • Missing/incomplete values
  • Low signal-to-noise
Example | RDBMS Limitation
User-entered data | Rejects nulls/invalid
Web scraping | Requires clean data

Value

  • Actionable insight
  • Advanced analytics
  • Complex queries
Example | RDBMS Limitation
ML pipelines | No parallel compute
BigQuery/Spark jobs | Slow joins & scans
  • Weather forecasting
    ✓ Velocity: because data changes rapidly (e.g., satellite, radar)
  • Traffic navigation
    ✓ Velocity: live GPS and sensor updates
  • Product recommendations
    ✓ Value: insights from user behavior
  • Social media analysis
    ✓ Variety: text, images, video, audio
  • Smart farm sensors
    ✓ Volume: constant data from many devices

Today, Thursday Aug 28th

  1. Big Data Characteristics (5V) ✓
  2. Browser and IndexedDB Architecture
  3. Lab Use Case Examples
  4. Homework 1 and Lab 1
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

IndexedDB Timeline

  • 2010: W3C introduces IndexedDB as a web standard for client-side storage
  • 2011: First browser implementations enable offline apps
  • 2014: IndexedDB 2.0 draft improves performance & API
  • 2016: Broad adoption across Chrome, Firefox, Edge, Safari
  • 2018: IndexedDB 2.0 becomes official W3C recommendation
  • 2020s: Continuous improvements & optimizations

IndexedDB vs SQL: Structure

IndexedDB (Browser)
Object Store: items
keyPath: id
indexes: by_name, by_price
Record (JSON-like):
{
  id: 1,
  name: "Chair",
  price: 49,
  stock: 10
}
⇄
SQL Database
Table: items
CREATE TABLE items (
  id    INTEGER PRIMARY KEY,
  name  TEXT,
  price REAL,
  stock INTEGER
);
Row:
(1, 'Chair', 49, 10)

CRUD: IndexedDB ↔ SQL

IndexedDB
// CREATE
store.add({id: 1, name: "Chair", price: 49, stock: 10});

// READ
store.get(1);

// UPDATE
store.put({id: 1, name: "Chair", price: 59, stock: 10});

// DELETE
store.delete(1);
↔
SQL
-- CREATE
INSERT INTO items VALUES (1, 'Chair', 49, 10);

-- READ
SELECT * FROM items WHERE id = 1;

-- UPDATE
UPDATE items SET price = 59 WHERE id = 1;

-- DELETE
DELETE FROM items WHERE id = 1;

IndexedDB Event Flow

1) const req = indexedDB.open("furnitureDB", 1);
(CREATE DATABASE furnitureDB)
2) req.onupgradeneeded = function (event) { }
(CREATE TABLE items ...)
3) req.onsuccess = function (event) { }
(CONNECT)
4) req.onerror = function (event) { }
(CONNECTION ERROR)
5) const tx = db.transaction("items", "readwrite");
(BEGIN TRANSACTION)
6) const store = tx.objectStore("items");
(SELECT FROM items)
7) request.onsuccess = function (event) { }
(QUERY OK)
8) request.onerror = function (event) { }
(QUERY ERROR)
9) tx.oncomplete = function (event) { }
(COMMIT)
10) tx.onerror = function (event) { }
(ROLLBACK)
11) tx.onabort = function (event) { }
(ROLLBACK)
12) db.onversionchange = function (event) { }
(ALTER DATABASE / SCHEMA MIGRATION)

CRUD (Object Store API)

store.add(object);
(INSERT INTO items VALUES (...))
store.get(key);
(SELECT * FROM items WHERE id = ?)
store.getAll();
(SELECT * FROM items)
store.openCursor();
(SELECT * FROM items ORDER BY id)
store.put(object);
(UPDATE items SET ... WHERE id = ?)
store.delete(key);
(DELETE FROM items WHERE id = ?)
store.clear();
(DELETE FROM items)
store.index("by_name").getAll(query);
(SELECT * FROM items WHERE name = ?)

IndexedDB Resources

GitHub Repository: SE4CPS/2025-COMP-263

API Documentation: MDN IndexedDB API

Coding

β–Ά OneCompiler.com
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

Browser Support for IndexedDB

  • Feature set: getAll(), store/index rename
  • Support: Chrome, Edge, Firefox, Opera
  • Safari: support on current macOS/iOS
  • Mobile: Android and iOS support
  • Takeaway: Safe to use in modern browsers
  • Documentation: caniuse.com
Browser support matrix for IndexedDB improvements

Why IndexedDB is I/O

  • Disk: data stored on device, not memory
  • Async API: returns Promises/events
  • Non-blocking: avoids freezing UI
  • Transactions: safe grouped reads/writes
  • Sandboxed: per-tab storage → still disk I/O
flowchart LR
  subgraph TAB["Browser Tab (JavaScript)"]
    A["async call β†’ Promise"]
    B["await (until resolve)"]
  end

  subgraph PROC["Browser Process"]
    IDB["IndexedDB backend"]
    STOR["Storage service"]
  end

  subgraph OS["Operating System"]
    FS["File System (disk I/O)"]
  end

  A --> IDB
  B -.-> A
  IDB --> STOR --> FS
  FS --> STOR --> IDB
  IDB -->|"completion event"| A

IndexedDB and I/O

  • Does the browser know if IndexedDB data will return from disk I/O?
    ➜ No. Until the request runs, the browser cannot know if a matching record exists.
  • Does the browser know when the IndexedDB read/write will complete?
    ➜ No. Disk latency, scheduling, and transactions make timing unpredictable.
  • What does a Promise from IndexedDB always return to JS?
    ➜ It always returns a handle (the Promise object) immediately, which will later settle (resolve/reject) with the result.
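
The answer above assumes a Promise-style API. Native IndexedDB actually hands back an IDBRequest that fires events; the sketch below (an illustration, not part of any lab starter code) shows how such a request can be wrapped in a Promise so the caller gets the handle immediately and the value later.

// Minimal sketch: wrap an IDBRequest in a Promise.
// The Promise (the handle) is returned right away; it settles
// only when the underlying disk I/O completes.
function promisify(request) {
  return new Promise((resolve, reject) => {
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Usage (assuming an open object store named "items"):
// const item = await promisify(store.get(1));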

Access Control: Browser, Tab, Sandbox

  • Browser process: master controller, manages memory, networking, disk I/O
  • Tab = renderer process: each tab runs in its own thread/process
  • Sandboxed: tabs are isolated for security → no direct memory sharing
  • IndexedDB: each tab/origin gets its own database namespace
  • Async I/O: tab makes request → browser process handles disk → result returned via promise
flowchart TB
  BPROC["Browser Process"]
  TAB1["Tab 1: Renderer
(sandboxed)"] TAB2["Tab 2: Renderer
(sandboxed)"] DB["IndexedDB Storage
(per-origin, on disk)"] BPROC --> TAB1 BPROC --> TAB2 TAB1 -->|async request| DB TAB2 -->|async request| DB

IndexedDB Access Control

  • Can one browser tab read another tab’s IndexedDB data?
    ➜ No. Each tab is sandboxed; access is restricted by the same-origin policy
  • What defines the database namespace in IndexedDB?
    ➜ The origin (protocol + host + port). Databases are scoped per origin, not per tab
  • How can multiple tabs of the same site coordinate IndexedDB access?
    ➜ Through transactions, locks, and optional cross-tab messaging (e.g., BroadcastChannel)
  • Is IndexedDB a single-user or multi-user database?
    ➜ Single-user, per-browser. It is local to one user, not shared across browsers, accounts, or devices
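
As a small illustration of the cross-tab messaging mentioned above, the sketch below uses BroadcastChannel; the channel name "items-sync" is made up for the example.

// Minimal sketch: two same-origin tabs coordinating IndexedDB access.
const channel = new BroadcastChannel("items-sync");

// Tab A: announce that it wrote new data to IndexedDB
channel.postMessage({ type: "items-updated", id: 1 });

// Tab B: re-read the record from IndexedDB when notified
channel.onmessage = (event) => {
  if (event.data.type === "items-updated") {
    console.log("Another tab updated item", event.data.id);
    // e.g., store.get(event.data.id) to refresh this tab's view
  }
};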
%%{init: {'flowchart': { 'htmlLabels': true, 'useMaxWidth': true, 'wrap': true, 'curve': 'linear', 'nodeSpacing': 40, 'rankSpacing': 60 }}}%%
flowchart TB

classDef box fill:#eef,stroke:#333,stroke-width:1px,color:#000;
classDef os  fill:#fffdcc,stroke:#333,stroke-width:1px,color:#000;

APP["Layer 7: Application
(apps: browser (IndexedDB))"]:::box OS["Operating System
(spans multiple layers)"]:::os US["Layers 5–6
User-space libraries & services
(session, TLS, encoding)"]:::box KRN["Layers 2–4
Kernel networking stack
(L2/L3/L4)"]:::box HW["Layer 1: Physical
(NIC, cable, radio)"]:::box APP --> OS OS --> US OS --> KRN US --> HW KRN --> HW %% Explicit widths so labels never clip style APP width:760px style OS width:760px style US width:360px style KRN width:360px style HW width:760px

Why await is Required with IndexedDB

  • JavaScript is async by default β†’ non-blocking, event loop driven
  • IndexedDB CRUD = I/O-bound β†’ returns a Promise
  • Dependency: Read after Write β†’ must wait until write commits
  • await order β†’ prevents stale/undefined data

// Async default: does not wait
store.put({id: 1, name: "Chair"});
let item = store.get(1);   // ❌ may run before put finishes

// Correct with await
await store.put({id: 1, name: "Chair"});
let item = await store.get(1); // ✅ read after write

IndexedDB and Performance

  • How does IndexedDB impact performance of a web app?
    ➜ IndexedDB operations are asynchronous I/O. They do not block the UI, but frequent dependent reads/writes or large payloads require waiting (await) and can add latency and increase memory/disk usage.
  • How can performance be improved?
    ➜ Use indexes for efficient queries, batch writes instead of many single puts, and cache in runtime memory (e.g., a local array of objects) when possible.
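
A small sketch of the batching advice above: group many writes into a single readwrite transaction. The itemsToSave array, the "items" store name, and the open db handle are assumptions taken from the earlier slides.

// Minimal sketch: one transaction for many writes instead of one per put.
const tx = db.transaction("items", "readwrite");
const store = tx.objectStore("items");

for (const item of itemsToSave) {   // itemsToSave: in-memory array (cache)
  store.put(item);                  // all puts queued in the same transaction
}

tx.oncomplete = () => console.log("Batch committed");
tx.onerror = () => console.error("Batch failed", tx.error);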

Questions

IndexedDB and ACID Properties

Property Meaning IndexedDB
Atomicity All operations in a transaction succeed or all fail. Yes: transactions are atomic.
Consistency Database moves from one valid state to another. Yes: enforced within a transaction.
Isolation Concurrent transactions do not interfere. Partial: isolation within one tab, but multiple tabs can interfere.
Durability Once committed, data persists even after crash/power loss. Yes: backed by disk storage.

ACID Properties: Questions

  • Atomicity: What happens if one operation in a transaction fails?
    ➜ The whole transaction is rolled back; nothing is committed.
  • Consistency: If a store has keyPath: "id", autoIncrement: true, what happens if you try to insert two records with the same id?
    ➜ The transaction fails: IndexedDB enforces uniqueness and prevents an invalid state.
  • Isolation: If two tabs write to the same object store at the same time, do their transactions interfere?
    ➜ Within one tab, operations are isolated; across tabs, race conditions may occur.
  • Durability: After committing a transaction, will the data still exist after closing the browser and reopening it?
    ➜ Yes, data is persisted to disk (until explicitly cleared).

What happens every time the page loads?

// Open DB
const req = indexedDB.open("furnitureDB", 1);

req.onupgradeneeded = e => {
  let db = e.target.result;
  db.createObjectStore("items", { keyPath: "id", autoIncrement: true });
};

req.onsuccess = e => {
  let db = e.target.result;
  let tx = db.transaction("items", "readwrite");
  let store = tx.objectStore("items");
  store.add({ name: "Chair" });   // Insert one record each page load
};

Avoiding Duplicate Inserts

  • Problem: autoIncrement inserts a new record each page load
  • Fix: first check if record exists, then update (put) or add

let tx = db.transaction("items", "readwrite");
let store = tx.objectStore("items");

let req = store.get(1); // Check first

req.onsuccess = e => {
  if (e.target.result) {
    store.put({ id: 1, name: "Chair" }); // Record exists β†’ update
  } else {
    store.add({ id: 1, name: "Chair" }); // Not found β†’ insert
  }
};

Questions

Why is it called "IndexedDB"?

Why "IndexedDB"?

  • IndexedDB is a key–value store
  • Every object store requires a primary key
  • The primary key is automatically indexed
  • Lookups by key are O(log n) (via B-tree)
  • It’s an indexed database by default

// Create store with primary key
let store = db.createObjectStore("items", { 
  keyPath: "id", autoIncrement: true 
});

// Primary key "id" is automatically indexed
store.add({ name: "Chair", price: 49 });

// Fast lookup by id
store.get(1);  // uses the implicit index

Database Indexing

  • Without index β†’ full scan: O(n)
  • With index (B-tree) β†’ search: O(log n)
  • Index = sorted structure (keys, pointers)
  • Fast read, slow create, update, delete
B-tree index illustration
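
Because the index is a sorted structure, it also supports range lookups. A minimal sketch, assuming an open db handle, the "items" store, and the "by_price" index from the structure slide above:

// Range query over a sorted index (B-tree): price between 10 and 50
const tx = db.transaction("items", "readonly");
const idx = tx.objectStore("items").index("by_price");

const range = IDBKeyRange.bound(10, 50);   // inclusive bounds
const req = idx.getAll(range);

req.onsuccess = () => console.log(req.result); // only records in the range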

IndexedDB and UNIQUE Index

  • UNIQUE index enforces no duplicates
  • Created when defining the object store
  • Lookup is fast: O(log n) using a B-tree
  • E.g., fields like email or username
  • Insert duplicate value β†’ transaction fails
const req = indexedDB.open("furnitureDB", 1);
req.onupgradeneeded = e => {
  let db = e.target.result;
  let store = db.createObjectStore("items", { 
    keyPath: "id" 
  });
  store.createIndex("by_name", "name", { 
    unique: true 
  });
};
let tx = db.transaction("items", "readonly");
let store = tx.objectStore("items");
let idx = store.index("by_name");
let req = idx.get("Chair");

Questions

Today, Thursday Aug 28th

  1. Big Data Characteristics (5V) ✓
  2. Browser and IndexedDB Architecture ✓
  3. Lab Use Case Examples
  4. Homework 1 and Lab 1

Use Case: IndexedDB as Cache

  • Cache API responses locally
  • Reduce repeated network calls
  • Enable faster page reloads
  • Improve perceived performance
IndexedDB as cache

Use Case: Offline-First Apps

  • Store user actions when offline
  • Sync changes back to server later
  • Essential for mobile and flaky networks
  • E.g., note-taking apps, messaging apps
Offline-first IndexedDB

Use Case: Large Local Datasets

  • MBs–GBs of structured data in-browser
  • Query efficiently using indexes (O(log n))
  • Keep UI responsive with async access
  • Examples: sensor readings, GIS maps
Large datasets IndexedDB

Use Case: Privacy & Access Control

  • Data is stored locally only
  • Access is restricted by the same-origin policy
  • One site can't read another site's IndexedDB
  • User can wipe data via browser settings
  • No server access unless the app explicitly syncs
Privacy and access control IndexedDB

Questions

Do you recommend IndexedDB?

  • Database must support strong aggregation (complex queries, joins)
  • Database must support unstructured data (JSON objects, nested fields)
  • Database must support multi-user access across devices
  • Database must work offline in the browser
  • Database must handle large local datasets efficiently

Today, Tuesday Sept 2nd

  1. From Tables to JSON Documents
  2. JSON Modeling Structures
  3. Design Principles
  4. Lab Use Case Examples
  5. Roster Verification
  6. Homework 1 and Lab 1
JSON Format: Frontend (JS Object) ⇄ Backend (OOP Classes) ⇄ Database (SQL Tables)

Database

Data Modeling in Tables

E. F. Codd
  • 1970s: E.F. Codd and Relational Model
  • ER Diagrams for conceptual design
  • Tables with rows & columns
  • Relationships via Foreign Keys
  • Normalization for consistency
erDiagram
  CUSTOMER ||--o{ ORDER : places
  PRODUCT ||--o{ ORDER : included_in

  CUSTOMER {
    int id
    string name
    string email
  }
  PRODUCT {
    int id
    string name
    float price
  }
  ORDER {
    int id
    date order_date
    int customer_id
    int product_id
    int quantity
  }
      

Normalization (Step 1)

  • Denormalized: Multiple values in one field
  • Normal Form 1 (1NF):
    – All attributes are atomic (no lists)
    – Each row identified by a primary key
    – No repeating groups
OrderID (PK) | Customer | Products
101 | Alice | Chair, Table
102 | Bob | Sofa

In 1NF, Products column would be split so each row has a single product.

SQL and Table Structures

  • SQL (Structured Query Language)
  • Designed to query data in tables
  • Supports filtering, joining, grouping
  • Declarative: state what, not how
  • ⚠ Risk: Missing knowledge of SQL and table structure is a critical gap

SELECT * FROM Customer;

SELECT c.name, o.order_date FROM Customer c
JOIN Order o ON c.id = o.customer_id;

SELECT p.name, SUM(o.quantity) AS total_sold FROM Order o
JOIN Product p ON o.product_id = p.id
GROUP BY p.name;
      

Backend

Data Modeling in Classes

  • Simula (1967, Ole-Johan Dahl)
  • Smalltalk (1972, Alan Kay)
  • C++ (1985, Bjarne Stroustrup)
  • Java (1995, James Gosling)
  • Classes define structure & behavior
  • Objects are instances with state
  • Relationships via composition & references
classDiagram
  class Customer {
    +int id
    +string name
    +string email
    +placeOrder(product, qty)
  }
  class Order {
    +int id
    +date orderDate
    +addItem(product, qty)
    +getTotal()
  }
  class Product {
    +int id
    +string name
    +float price
  }

  Customer "1" o-- "*" Order : places
  Order "*" *-- "*" Product : items
      

Encapsulation

  • From data bags → classes: data + methods
  • Encapsulation: hide internals behind APIs
  • Simplify communication: Conway's Law

// ❌ Before
const order = { items: [] };
order.items.push({ name:"Chair", price:50 });
console.log(order.items[0].price);

// ✅ After
class Order {
  constructor(){ this.items=[]; }
  add(p){ this.items.push(p); }
}

const o = new Order();
o.add({ name:"Chair", price:50 });
console.log(o.items[0].price);

OOP and Backend

  • Objects group data + methods
  • Query via methods (e.g., order.total())
  • Available in: Backend tier only
  • Mapping Objects: Frontend/Database
  • ⚠ Risk: Missing/Wrong OOP knowledge limits backend design

class Order {
  constructor(){ this.items=[]; }
  add(p, q){ this.items.push({p,q}); }
  get total(){ return this.items
    .reduce((s,i)=>s+i.p.price*i.q,0); }
}

const o = new Order();
o.add({name:"Chair",price:50},2);
console.log(o.total); // query object
      

Frontend

Data Modeling in JavaScript

Brendan Eich
  • 1995: JavaScript in Browser (Brendan Eich)
  • Objects as native structure
  • JSON introduced for serialization
  • Basis for frontend data exchange

const customer = {
  id: 1,
  name: "Alice",
  orders: [
    { id: 101, product: "Chair", qty: 2 },
    { id: 102, product: "Table", qty: 1 }
  ]
};
console.log(customer.name);
      

Frontend Data Normalization

  • Denormalized: nested, repetitive fields
  • Normalized JSON: break into arrays & IDs
  • Frontend: simple traversal, fast rendering

// Denormalized
{
  "id": 1,
  "name": "Alice",
  "orders": [
    { "product": "Chair", "qty": 2 },
    { "product": "Table", "qty": 1 }
  ]
}

// Normalized
{
  "customers": [{ "id": 1, "name": "Alice" }],
  "orders": [
    { "id": 101, "customerId": 1, "product": "Chair", ... },
    { "id": 102, "customerId": 1, "product": "Table", ... }
  ]
}
      

JSON / Document Modeling

  • 2000s: NoSQL & flexible schema
  • JSON documents, nested objects/arrays
  • Tree-like structures replace joins
  • Denormalization for read efficiency
  • Query using any programming language
{
  "customer": {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com",
    "orders": [
      {
        "id": 101,
        "order_date": "2025-09-01",
        "items": [
          { "product": "Chair", "price": 49.99,... },
          { "product": "Table", "price": 199.99,... }
        ]
      }
    ]
  }
}

JSON in Frontend

  • Query with JS syntax: Map-Filter-Reduce (see the sketch below)
  • Native in frontend: JSON ⇄ JS Objects
  • Basis for APIs: exchange with backend
  • ⚠ Risk: Weak JSON skills → hard to handle frontend data

// Query JSON directly
const data = {
  customer: { name: "Alice" },
  orders: [ { product:"Chair", qty:2 } ]
};

console.log(data.customer.name); // "Alice"
console.log(data.orders[0].product); // "Chair"
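
A short sketch of the Map-Filter-Reduce style mentioned in the first bullet, on an illustrative orders array (the qty and price values are made up for the example):

// Map, filter, reduce over JSON data
const orders = [
  { product: "Chair", qty: 2, price: 49 },
  { product: "Table", qty: 1, price: 199 }
];

const names = orders.map(o => o.product);                       // ["Chair", "Table"]
const bulk  = orders.filter(o => o.qty > 1);                    // orders with qty > 1
const total = orders.reduce((s, o) => s + o.qty * o.price, 0);  // 297

console.log(names, bulk, total);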
      
JSON Format: Frontend (JS Object) ⇄ Backend (OOP Classes) ⇄ Database (SQL Tables)

Today, Tuesday Sept 2nd

  1. From Tables to JSON Documents ✓
  2. JSON Modeling Structures
  3. Design Principles
  4. Lab Use Case Examples
  5. Roster Verification
  6. Homework 1 and Lab 1
  • Created: 2001 by Douglas Crockford
  • Standardized: ECMA-404 (2013), RFC 8259 (IETF)
  • Where used:
    – Frontend: JavaScript objects, APIs
    – Backend: REST, GraphQL, config
    – Databases: NoSQL (MongoDB, CouchDB), IndexedDB
  • Historical context: Lightweight alternative to XML for data exchange on the web

JSON Object

  • Curly braces { }
  • Key–value pairs
  • Keys are always strings

const customer = {
  id: 1,
  name: "Alice",
  email: "alice@example.com"
};
console.log(customer.name); // Alice
      

JSON Array

  • Square brackets [ ]
  • Ordered collection
  • Can hold any JSON values

const products = [
  "Chair",
  "Table",
  "Sofa"
];
console.log(products[1]); // Table
      

JSON String

  • Double quotes required
  • Supports Unicode
  • Escape special chars with \

const message = {
  text: "Hello, World!",
  escaped: "Line1\nLine2"
};
console.log(message.text);
      

Numbers & Booleans

  • Numbers: integer or decimal
  • Boolean: true / false
  • No quotes required

const order = {
  qty: 2,
  price: 49.99,
  available: true
};
console.log(order.price * order.qty); // 99.98
      

Null

  • Represents empty / unknown
  • Written as null
  • No quotes around it

const profile = {
  name: "Alice",
  note: null
};
console.log(profile.note); // null
      
JSON Data Types
https://www.json.org/json-en.html

Is each snippet valid JSON?


1. { "name": "Alice" }
2. { name: "Alice" }
3. { "age": 25, }
4. [ "red", "green", "blue" ]
5. [ "a", "b",, "c" ]
6. { "valid": true, "value": null }
7. { "price": 19.99, "qty": 2 }
8. { "flag": True }
9. { "nested": { "x": 1, "y": 2 } }
10. "Just a string"
  

Today, Tuesday Sept 2nd

  1. From Tables to JSON Documents ✓
  2. JSON Modeling Structures ✓
  3. Design Principles
  4. Lab Use Case Examples
  5. Roster Verification
  6. Homework 1 and Lab 1

Flat


{
  "customer_id": 1,
  "customer_name": "Alice",
  "order_id": 101,
  "product": "Chair",
  "qty": 2
}
      

Nested


{
  "customer": {
    "id": 1,
    "name": "Alice"
  },
  "order": {
    "id": 101,
    "item": {
      "product": "Chair",
      "qty": 2
    }
  }
}
      

Design Principles for JSON Data

Design Principles for JSON Data

DIKW as Nested JSON


{
  "Wisdom": {
    "Knowledge": {
      "Information": {
        "Data": [
          "leaf1",
          "leaf2",
          "leaf3"
        ]
      }
    }
  }
}
  

Today, Tuesday Sept 2nd

  1. From Tables to JSON Documents ✓
  2. JSON Modeling Structures ✓
  3. Design Principles ✓
  4. Lab Use Case Examples
  5. Roster Verification
  6. Homework 1 and Lab 1

Model JSON requirements:

https://jsoncrack.com/editor

Lab Use Case Examples

  • IoT Sensor Data: Time-series data from temperature, humidity sensors
  • Social Media Posts: User posts with text, images, comments, likes
  • Healthcare Records: Patient records with demographics, visits, prescriptions
  • Geospatial Data: Locations with coordinates, metadata, timestamps

Questions

Today, Tuesday Sept 4th

  1. JSON Modeling Structures
  2. Summary IndexedDB
  3. Homework 1 and Lab 1
  4. Roster Verification
  5. Lab Use Case Examples

JSON Data Distributed

🗄️ 📚 🕸️ 📊

Data is stored across multiple machines for scalability, fault tolerance, and availability.

Defining IDs

  • Unique – each ID is distinct, no duplicates
  • No Meaning – ID is just an identifier
  • Purpose: reference only

// ✅ Good ID
{ "id": "7f8c9d12", "name": "Alice" }

// ❌ Bad ID (meaning encoded)
{ "id": "2025-student-Alice", "name": "Alice" }
      

Defining JSON IDs

🔑 Every object should have a unique id
This ensures:

  • Reliable referencing
  • Consistent updates
  • Safe merges in distributed systems

{
  "id": "a12f-34cd-56ef",
  "name": "...",
  "type": "...",
  "createdAt": "...",
  "otherField": "..."
}
      

Defining JSON IDs

  • Business ID – human-readable, domain-specific (e.g., order#1234)
  • Auto-increment – sequential ID (1, 2, 3...)
  • UUID – globally unique 128-bit identifier

// Built-in in modern browsers
const id = crypto.randomUUID();

console.log(id);
// e.g. "3b241101-e2bb-4255-8caf-4136c566a962"
      

Why UUID is a Universal ID

  • Universally Unique Identifier – 128-bit number
  • Guarantees uniqueness across time & space
  • Divided into structured parts:
    • Timestamp / Random / Hash depending on version
    • Variant bits specify format
    • Version bits indicate generation method

UUIDs prevent collisions across distributed systems without coordination.

UUID Structure Breakdown


  123e4567-e89b-12d3-a456-426614174000
  └───────┘ └───┘ └──┘ └──┘ └────────┘
   Time     Ver   Seq  Var   Node/Random
  
  • Time – timestamp (100-ns intervals since 1582, in v1)
  • Version – 1 to 8 (defines how UUID is made)
  • Variant – layout type (e.g., RFC 4122)
  • Node/Random – MAC address or random bits
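
A tiny sketch that reads those parts back out of a UUID string (the hex positions follow the layout shown above):

// Version is the first hex digit of the 3rd group,
// the variant is encoded in the first hex digit of the 4th group.
const uuid = "123e4567-e89b-12d3-a456-426614174000";
const [, , group3, group4] = uuid.split("-");

console.log("version:", group3[0]);        // "1" -> time-based (v1)
console.log("variant nibble:", group4[0]); // "a" -> 10xx bits = RFC 4122 variant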

UUID Versions (History)

Version | Basis | Notes
v1 | Timestamp + MAC | First defined, 1990s
v2 | DCE Security | Rare, site-specific
v3 | MD5 Hash (Name-based) | Stable for same input
v4 | Random | Most common today
v5 | SHA-1 Hash (Name-based) | More secure than v3
v6, v7, v8 | Proposed (time-ordered, Unix time, custom) | 2020s improvements

ISO 8601 Date & Time Format

  • Standardized format for date & time
  • Unambiguous across regions
  • Order: YYYY-MM-DD → hh:mm:ss → timezone
  • Preferred for APIs, JSON, databases
  • Machine & human-readable

// ISO 8601 Examples

"2025-09-04" 
// date only (YYYY-MM-DD)

"2025-09-04T10:15:30Z" 
// UTC time (Z = Zulu)

"2025-09-04T10:15:30-07:00" 
// local time with offset
      

✅ Always use ISO 8601 for storing & exchanging dates

What Does the Z Mean in ISO 8601?

  • Z = Zulu time (UTC+0)
  • UTC (Coordinated Universal Time)
  • No offset from zero meridian (Greenwich)
  • Same everywhere in the world
  • For distributed DB to avoid timezone errors

// ISO 8601 with Z

"2025-09-04T10:15:30Z"
// 10:15:30 UTC

"2025-09-04T03:15:30-07:00"
// Same moment in California (UTC-7)

"2025-09-04T12:15:30+02:00"
// Same moment in Central Europe (UTC+2)
      

✅ Z ensures a single global reference, independent of local timezones.

UTC Timezones

// JavaScript (Browser)
const z = new Date().toISOString();
// e.g. "2025-09-04T13:22:11.123Z"
      

// JavaScript (Node.js)
const z = new Date().toISOString();
      

# Python 3
from datetime import datetime, timezone
z = datetime.now(timezone.utc).isoformat()
# e.g. "2025-09-04T13:22:11.123456+00:00"
      

# Ruby
require "time"
z = Time.now.utc.iso8601
      

// Java (java.time)
import java.time.*;
String z = Instant.now().toString();
// or:
String z2 = OffsetDateTime.now(ZoneOffset.UTC).toString();
      

// Go
import "time"
z := time.Now().UTC().Format(time.RFC3339)
      

// C#
string z = DateTime.UtcNow.ToString("o");
// ISO 8601, e.g. 2025-09-04T13:22:11.1234567Z
      

// PHP
$z = (new DateTime('now', new DateTimeZone('UTC')))
       ->format(DateTime::ATOM);
      

# Shell (Unix)
date -u +"%Y-%m-%dT%H:%M:%SZ"
      

Tip: Prefer ISO 8601 strings (with Z) when storing or exchanging timestamps.

Question

Embed when:

  • 1-to-1 relationship
  • Data is always accessed together
  • Example: user profile + address

{
  "userId": "u1",
  "name": "Alice",
  "address": {
    "city": "Stockton",
    "zip": "95211"
  }
}
      

Reference when:

  • 1-to-n or n-to-n
  • Entities are reused across docs
  • Example: users and groups

// 1 β†’ n (one user in multiple groups)
{ "userId": "u1", "groupIds": ["g1", "g2"] }

// n ↔ n (mutual references between users & groups)
{ "userId": "u2", "groupIds": ["g1"] }

{ "groupId": "g1", "memberIds": ["u1", "u2"] }

Optimize for Access Patterns

  • Model data around how it’s queried, not abstract theory
  • Include redundant fields if it avoids joins/lookups
  • Precompute aggregates when queries need speed
  • Design for read efficiency over write purity

// Store authorName directly with posts
{
  "postId": "p1",
  "title": "JSON Best Practices",
  "authorId": "u1",
  "authorName": "Alice"
}
  

JSON Attribute Naming Conventions

  • Booleans: start with is, has, or can → isActive, hasLicense, canEdit
  • Arrays: use plural or end with List → userList, tags, orderIds
  • Numbers: add units or context → priceUSD, ageYears, timeoutMs
  • Strings: descriptive nouns, no type suffix → userName, email, addressLine
  • Dates / Time: use ISO 8601, suffix with At → createdAt, updatedAt, expiresAt
  • Objects: use clear entity names → userProfile, billingInfo, geoLocation
  • IDs: always include id suffix → userId, orderId, sessionId
{
  "userId": "u123",
  "userName": "Alice",
  "isActive": true,
  "tags": ["student", "premium"],
  "orderIds": ["o1001", "o1002"],
  "priceUSD": 199.99,
  "createdAt": "2025-09-04T10:15:00Z",
  "userProfile": {
    "ageYears": 29,
    "email": "alice@example.com"
  }
}

Common Metadata Attributes

✅ Common fields for metadata object:

  • createdAt, updatedAt, deletedAt
  • author – creator / last editor
  • schemaVersion – track schema changes
  • source – origin system or API
  • tags – labels for categorization
  • status – draft, active, archived, etc.
  • revision – document version number
  • checksum – hash for integrity check
  • permissions – access control rules
{
  "metadata": {
    "createdAt": "2025-09-04T12:00:00Z",
    "updatedAt": "2025-09-04T12:30:00Z",
    "author": "u1", // source device
    "tags": ["json", "nosql", "big-data"],
    "status": "active",
    "checksum": "a1b2c3d4",
    "permissions": {
      "read": ["admin", "editor"],
      "write": ["admin"]
    }
  }
}
Question

Today, Tuesday Sept 4th

  1. JSON Modeling Structures ✓
  2. Summary IndexedDB
  3. Homework 1 and Lab 1
  4. Roster Verification
  5. Lab Use Case Examples

NoSQL Database Properties

Database | Type | Query Language | Data Format | ACID (A C I D) | CAP (C A P)
IndexedDB | Key-Value (browser) | JavaScript API | JSON | ✓ ✓ ✓ ✓ |
MongoDB | Document Store
Neo4j | Graph Database
Cassandra | Wide-Column Store
Redis | Key-Value (in-memory)
ClickHouse | Columnar (OLAP)
FAISS | Vector Index / Library

Homework 1 (Canvas Module 1)

Lab 1: Agricultural Data Collection with IndexedDB

In this lab, you will develop a web application that collects and stores unstructured agricultural data using IndexedDB. You will implement functionality to handle various data types, including sensor readings, images, farmer notes, GPS coordinates, and timestamps

Lab 1 Example

Lab 1: Submission Instructions

  1. Name file: lab1-lastname-XXXX.html (e.g. lab1-john-doe-1234.html)
  2. Push code to GitHub repo
  3. Add 5 unit tests + screenshot of console with retrieved data
  4. Upload GitHub link + screenshot in course portal
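
For item 3, one possible shape for a console-based unit test is sketched below (not a required format). It assumes an open db handle, the "items" store, and a promise-returning wrapper such as the promisify helper sketched earlier.

// Sample unit test: add a record, read it back, assert on the result.
async function testAddAndGet(db) {
  const tx1 = db.transaction("items", "readwrite");
  await promisify(tx1.objectStore("items").put({ id: 999, note: "test record" }));

  const tx2 = db.transaction("items", "readonly");
  const rec = await promisify(tx2.objectStore("items").get(999));

  console.assert(rec && rec.note === "test record", "add/get failed");
  console.log("testAddAndGet passed");
}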

Questions

Course Terms

NoSQL, Key-Value Store, Document Store, Column Store, Graph Database, CAP Theorem, BASE Model, ACID Properties, Eventual Consistency, Strong Consistency, Sharding, Replication, Partitioning, Horizontal Scaling, Vertical Scaling, Consistency, Availability, Partition Tolerance, Vector Clock, Schema-less, Indexing, Secondary Index, Primary Key, Unique Index, Compound Key, MapReduce, Aggregation, Query Engine, Query Planner, Execution Plan, CRUD, Insert, Update, Delete, Read, Transaction, Object Store, Async I/O, Promise, Await, Sandbox, Same-Origin Policy, JSON, BSON, Data Lake, Data Warehouse, ETL, ELT, Streaming, Batch Processing, Lambda Architecture, Kappa Architecture, Pub/Sub, Message Queue, Idempotency, Conflict Resolution, Event Sourcing, CQRS, Distributed Cache, Sharding, Replication, In-Memory Database, Time-Series Database, Search Index, Inverted Index, Full-Text Search, 5 Vs of Big Data: Volume, Velocity, Variety, Veracity, Value.

  1. What is the difference between NoSQL and SQL?
  2. What was the motivation to introduce NoSQL databases?
  3. What is the difference between a SQL database and IndexedDB?
  4. Why is IndexedDB considered asynchronous and I/O bound?
  5. What role do indexes play in both SQL and IndexedDB?
  6. How does access control work for IndexedDB in the browser?
  7. What does big data mean?
  8. How are Full Stack Engineering and JSON related in the way data flows across frontend, backend, and databases?
  9. What are the trade-offs between using a flat JSON structure vs. a nested JSON structure for representing the same data?
  10. What are the disadvantages of modeling data in JSON?

Should You Use IndexedDB?

  1. Do you need to store data locally in the browser?
  2. Do you want to work with structured objects, not just strings?
  3. Do you need to store more than 5–10 MB of data?
  4. Do you need offline access to your data?
  5. Do you need querying by keys or indexes?
  6. Do you want transactions (ACID support) in the browser?
  7. Do you need to store files/blobs (e.g., images) locally?
  8. Do you expect your app to sync later with a backend server?
  9. Do you need better performance than localStorage can provide?
  10. Is your app a progressive web app (PWA) or offline-first design?

Today, Tuesday Sept 4th

  1. JSON Modeling Structures ✓
  2. Summary IndexedDB ✓
  3. Homework 1 and Lab 1 ✓
  4. Roster Verification
  5. Lab Use Case Examples

Lab Use Case

Write a local offline database using IndexedDB to keep track of books borrowed from a library.

  • Each book should have a unique id
  • Store the title and author
  • Record the borrower
  • Save borrowed date and due date
  • Support adding, updating, deleting, and listing books

Today Agenda

  • Data Quality and Standards
  • MongoDB History & Features
  • CRUD, Operators, Practice
  • Homework, Lab, and Project

What is data quality?

How does data quality relate to Big Data?

  • Volume → Errors scale with size
  • Velocity → Errors spread fast
  • Variety → Less usability
  • Veracity → Noise & bias lower trust
  • Value → Errors reduce insight

Use Case Automotive

  • Bad data → Wrong decisions
  • Incomplete → Missed chances
  • Outdated → Irrelevant actions
  • Inconsistent → Conflicts
  • Duplicate → Wasted resources
Autonomous vehicle data

Use Case Agriculture

  • Accurate weather → Planning
  • Reliable yields → Supply
  • Clean data → Security
  • Timely updates → Less waste
Agriculture data

The quality of the data
determines the quality
of the software.

Origins of Data Quality Metrics

NIST Research Data Framework (RDaF)

  • NIST SP 1500-18r2 (2024)
  • Focus: research data management
  • Core: quality, accessibility, reuse
  • Lifecycle: Envision → Plan → Acquire → Analyze → Share → Preserve
  • Read PDF

ISO 8000 vs NIST RDaF

  • ISO 8000 – Focus: data quality in place
  • ISO 8000 – Core: accuracy, consistency, completeness
  • NIST RDaF – Focus: end-to-end research data
  • NIST RDaF – Core: quality, accessibility, reuse
Accuracy Completeness Consistency
Timeliness Validity Uniqueness
Integrity

Data Quality: Consistency

  • Uniform values across records
  • No contradictions in fields
  • Same units, codes, and formats

// ✅ Consistent
{ "id": 1, "unit": "kg" }
{ "id": 2, "unit": "kg" }

// ❌ Inconsistent
{ "id": 3, "unit": "kg" }
{ "id": 4, "unit": "lbs" }
      

Data Quality: Accuracy

  • Reflects real-world truth
  • Correct at the source
  • Free from measurement errors

// ✅ Accurate
{ "crop": "Wheat", "yieldKg": 7200 }

// ❌ Inaccurate (too high to be realistic)
{ "crop": "Wheat", "yieldKg": 7200000 }
      

Data Quality: Completeness

  • All necessary data is present
  • No missing critical fields
  • No empty required arrays

// ✅ Complete
{ "id": 1, "crop": "Maize", "soil": "Loam" }

// ❌ Incomplete (missing soil)
{ "id": 2, "crop": "Maize" }
      

Data Quality: Timeliness

  • Data is current when needed
  • No outdated values
  • Updated on schedule

// ✅ Timely
{ "crop": "Rice", "updatedAt": "2025-09-08T09:00Z" }

// ❌ Outdated
{ "crop": "Rice", "updatedAt": "2020-05-01T10:00Z" }
      

Data Quality: Uniqueness

  • No duplicate records
  • Each entity has a distinct ID
  • One fact represented only once

// ✅ Unique IDs
{ "id": 1, "crop": "Barley" }
{ "id": 2, "crop": "Millet" }

// ❌ Duplicate IDs
{ "id": 3, "crop": "Wheat" }
{ "id": 3, "crop": "Wheat" }
      

Data Quality: Validity

  • Data follows defined formats
  • Adheres to business rules
  • Within acceptable ranges

// ✅ Valid
{ "id": 10, "organic": true, "harvestDate": "2025-09-01" }

// ❌ Invalid (wrong type + bad date)
{ "id": "ten", "organic": "yes", "harvestDate": "jan-2025" }
      

Data Quality: Integrity

  • Data is accurate in its relationships
  • No broken links between entities
  • References are valid and consistent

// ✅ With integrity
{ "farmId": 1, "cropId": 101, "crop": "Maize" }
{ "cropId": 101, "name": "Maize" }

// ❌ Broken reference (farm points to missing crop)
{ "farmId": 2, "cropId": 999, "crop": "Unknown" }
      

How to validate data quality?

SW/Data Design → SW/Data Dev → SW/Data Test → SW/Data Release (Software Fixed) → Data Runtime Changing → Data Runtime Testing

Data Quality Seven Metrics

  • Integrity
  • Validity
  • Consistency
  • Accuracy
  • Completeness
  • Timeliness
  • Uniqueness
NASA Mars Climate Orbiter
1 lb = 0.4536 kg
NASA Mars Climate Orbiter failed in 1999 due to data inconsistency.

How to validate data quality?

Integrity Definition

  • Data unchanged & correct
  • Reliable end-to-end
Data Integrity Example

Integrity Examples

  • No negative yields
  • Immutable harvest dates
  • Signed sensor logs

// MongoDB
db.farm.find({ yieldKg: { $lt: 0 } })

// SQL
SELECT * FROM farm
WHERE yield_kg < 0;
      

Integrity Check

flowchart LR
  A[Source Data] --> B[Generate Checksum]
  B --> C[Transmit Data + Checksum]
  C --> D[Verify Checksum]
  D -->|Match ✅| E[Data Accepted]
  D -->|Mismatch ❌| F[Data Rejected]
  %% Green link for correct
  linkStyle 3 stroke:#0a0,stroke-width:2px;
  %% Red link for incorrect
  linkStyle 4 stroke:#f66,stroke-width:2px;

Generating Checksums

Type: SHA-256


# Python
import hashlib

with open('data.txt', 'rb') as f:
    file_bytes = f.read()
    digest = hashlib.sha256(file_bytes).hexdigest()
    # Example output: 3a7bd3e2360a...
    print(digest)
            

Type: SHA-256


// JavaScript (Node.js)
const crypto = require('crypto');
const fs = require('fs');

const fileBuffer = fs.readFileSync('data.txt');
const hash = crypto.createHash('sha256')
                   .update(fileBuffer)
                   .digest('hex');
// Example output: 3a7bd3e2360a...
console.log(hash); 
            

Data Integrity with Checksums

  • Include checksum in metadata
  • Verify integrity before processing (see the verification sketch below)
  • SHA-256 recommended
  • Checksum is not encryption
{
  "meta": {
    "checksum": "3a7bd3e2360a9e3f2a1c4b..."
  },
  "data": {
    "sensorId": "S1001",
    "temperature": 22.5,
    "humidity": 45,
    "timestamp": "2025-09-08T15:00:00Z"
  }
}
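
A browser-side sketch of the "verify before processing" step using the Web Crypto API; it assumes the checksum was computed over JSON.stringify of the data field, so producer and consumer must agree on that canonical form.

// Recompute SHA-256 of doc.data and compare with doc.meta.checksum.
async function verifyChecksum(doc) {
  const bytes = new TextEncoder().encode(JSON.stringify(doc.data));
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  const hex = [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, "0"))
    .join("");
  return hex === doc.meta.checksum;  // true -> accept, false -> reject
}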
            

Validity Definition

  • Values follow format
  • Within allowed ranges
Data Validity Example

Validity Examples

  • Valid season values
  • Temperature numeric
  • ISO timestamps
  • Use standards

// MongoDB
db.farm.find({
  season:{ $nin:["Spring","Fall"] }
})

// SQL
SELECT * FROM farm
WHERE season NOT IN
 ('Spring','Fall');
      

Validity Check

  • Which data is invalid?
{
  "id": 1,
  "crop": "Wheat",
  "yieldKg": 7500,
  "harvestDate": "32-13-2025",
  "price": 250,
  "irrigation": "Drip",
  "stock": 120,
  "locaton": "Field A",
  "owner": 12345,
  "organic": "maybe"
}
            

Validity MongoDB Check

  • Check if organic field is valid

// MongoDB: find docs where 'organic' is not boolean
db.farm.find({
  $expr: { $ne: [ { $type: "$organic" }, "bool" ] }
})
                

Consistency • Accuracy • Completeness • Timeliness • Uniqueness

➡️ Thursday

Questions?

Which Data Quality metrics are NOT met?

  • Are units consistent across records?
  • Is harvestDate in the correct format?
  • Is the record up to date?
  • Is crop present and non-empty?
  • Is owner the expected type?
{
  "id": 42,
  "crop": "Wheat",
  "yieldKg": 7200,
  "unit": "lbs",
  "owner": "Alice",
  "harvestDate": "13/31/2025",
  "updatedAt": "2022-01-10T10:00:00Z"
}
❌ Metrics not met:
- Consistency
- Validity
- Timeliness

Questions?

Today Agenda

  • Data Quality and Standards ✓
  • MongoDB History & Features
  • CRUD, Operators, Practice
  • Homework, Lab, and Project
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

MongoDB History

  • 2007: Start at 10gen
  • 2009: Open source (GitHub)
  • 2014: WiredTiger
  • 2016: Atlas cloud
  • 2017: IPO
  • 2021+: ACID, time series

Note: The name MongoDB comes from "humongous" (meaning huge).

MongoDB History

The MEAN Stack

  • 2009: Node.js introduced (JS on server)
  • 2010: AngularJS launched (Google)
  • 2013: Term MEAN stack popularized
  • MEAN = MongoDB, Express, Angular, Node
  • Full JavaScript stack → front-end to back-end
MEAN Stack

NoSQL Types & SQL

  • Key–Value: Fast lookups
  • Document: JSON records
  • Column-Family: Wide tables
  • Graph: Nodes + edges
  • SQL: Rows + columns
NoSQL Database Types

MongoDB and the 5 V's of Big Data

  • Volume: Horizontal scaling for petabyte-scale data
  • Velocity: High write throughput
  • Variety: Flexible schema-less docs
  • Veracity: Validation rules & consistency
  • Value: Aggregation & BI integration
Big Data Volume

RDBMS vs MongoDB


Scaling Volume in MongoDB vs RDBMS

  • MongoDB: Horizontal scaling (sharding)
  • RDBMS: Limited to vertical scaling
  • Sharding → distributes data across multiple servers
  • Improves performance and fault-tolerance
Horizontal vs Vertical Scaling
Why does horizontal scaling work now?

Hardware & Scaling

  • 2000s: cheaper commodity computers
  • Made horizontal scaling practical
  • Clusters cheaper than one big server
Computer Hardware Price Trends

What is a Document?

  • Document is self-contained unit of information
  • Stores related data together in one JSON-like structure
  • No need to join across tables to reconstruct meaning
  • Each document has a unique _id (UUID/ObjectId)
  • Each document can stand alone and be understood by itself
{
  "_id": ObjectId("64f9c1a2b1234c5678d90ef1"),
  "crop": "Wheat",
  "yieldKg": 7500,
  "irrigation": "Drip",
  "location": { // No joins needed
    "field": "North",
    "soil": "Loam"
  },
  "farmerName": "John Doe",         
  "farmerPhone": "555-1234",         
  "equipmentUsed": "Tractor-101",    
  "marketPriceUSD": 250       
}

What is a Document?

SQL (Normalized):

-- Crops table
Crops(id, crop, yieldKg, irrigation, locationId)

-- Location table
Locations(id, field, soil)

-- Farmers table
Farmers(id, name, phone)

-- Equipment table
Equipment(id, name)
            

✅ Avoids duplication, but needs joins to query.

MongoDB (Denormalized):

{
  "_id": ObjectId("64f9c1a2b1234c5678d90ef1"),
  "crop": "Wheat",
  "yieldKg": 7500,
  "irrigation": "Drip",
  "location": { "field": "North", "soil": "Loam" },
  "farmer": { "name": "John Doe", "phone": "555-1234" },
  "equipmentUsed": ["Tractor-101", "Seeder-202"]
}
            

⚡ All info in one document → fast reads, but duplicates farmer/equipment across docs.

MongoDB Local vs Online

  • MongoDB Compass (Local)
    • GUI client for local databases
    • Runs on developer’s machine
    • Good for testing & development
    • No internet required
  • MongoDB Atlas (Online)
    • Fully managed cloud service
    • Global availability & scaling
    • Backups, monitoring, security
    • Accessible anywhere via internet

MongoDB Features

  • JSON docs
  • Dynamic schema
  • Sharding & Replica
  • Indexes & Aggs
  • ACID Tx

// Document
{ crop:"Wheat",yieldKg:3000,
  soil:"Loam" }
      

MongoDB Query Language (MQL)

  • Sample JSON Document
    
    {
      "farm": "GreenField",
      "crop": "Wheat",
      "year": 2024,
      "yield_tons": 120,
      "location": "California"
    }
                                  
  • MQL Query Examples
    
    // Find farms growing Wheat
    db.crops.find({ 
        crop: "Wheat" 
    });
                                  
  • I/O Options
  • Compass (GUI JSON view)
  • Shell queries
  • Drivers (Python, Node.js, Java; Node.js sketch below)
  • Atlas API (REST/GraphQL)

// Principles of MQL
- JSON-like syntax
- Expressive operators ($gt, $set, etc.)
- CRUD: Create, Read, Update, Delete
- Works locally or in Atlas
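
As a small illustration of the "Drivers" option above, a minimal sketch with the official Node.js driver; the connection string and the farmDB/crops names are assumptions matching the sample document.

// npm install mongodb
const { MongoClient } = require("mongodb");

async function main() {
  const client = new MongoClient("mongodb://localhost:27017"); // local mongod assumed
  await client.connect();

  const crops = client.db("farmDB").collection("crops");
  const wheat = await crops.find({ crop: "Wheat" }).toArray(); // same filter as the MQL example
  console.log(wheat);

  await client.close();
}

main().catch(console.error);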
                      

MQL Syntax: Filter First

  • MQL methods filter as the first parameter
  • Filter defines which documents match
  • Second param = projection (return field)
  • Options: sort, limit, update operators
// Find all wheat crops
db.crops.find(
  { crop: "Wheat" },          // filter
  { crop: 1, yieldKg: 1 }     // projection
)
// Find with range filter
db.crops.find(
  { yieldKg: { $gt: 5000 } }, // filter
  { crop: 1, irrigation: 1 }  // projection
)
            

MQL Syntax Overview

  • db.collection.find(filter, projection)
  • db.collection.insertOne(document)
  • db.collection.updateOne(filter, update, opt)
  • db.collection.deleteOne(filter)
  • db.collection.aggregate(pipeline)

Reference: MongoDB Query Language (MQL) Documentation

// Example: Find crops with yield > 5000
db.crops.find(
  { yieldKg: { $gt: 5000 } },   // filter
  { crop: 1, yieldKg: 1 }       // projection
)
            

MongoDB ObjectId

  • Auto-generated unique ID for each document
  • Built from timestamp, machine ID, process ID, counter
{
  "_id": ObjectId("650f5a2e1c4d3a6b7f9d1234"),
  "name": "Rose",
  "color": "Red",
  "weightInKg": 0.5,
  "timestamp": ISODate("2025-09-08T15:00:00Z")
}
            

MongoDB Query Results

  • All queries return an object or a cursor
  • Write operations return an acknowledgement object
  • Check `acknowledged` for confirmation
// Insert query
const insertResult = db.flowers.insertOne({ 
  name: "Tulip", 
  color: "Yellow" 
});

// Output:
// {
//   acknowledged: true,
//   insertedId: ObjectId("f9d5678...")
// }
            

CRUD Create


// SQL
CREATE TABLE farm (
    id INT PRIMARY KEY AUTO_INCREMENT,
    crop VARCHAR(50)
);

INSERT INTO farm(crop)
VALUES ('Rice');
            

// MongoDB
use farmDB
db.createCollection("farm")
db.farm.insertOne({ crop:"Rice" })
            

CRUD Read


// SQL
SELECT * 
FROM farm
WHERE crop = 'Rice';
            

// MongoDB
db.farm.find({ crop: "Rice" })
            

CRUD Update


// SQL
UPDATE farm 
SET yield_kg = 4000
WHERE crop = 'Rice';
            

// MongoDB
db.farm.updateOne(
    { crop: "Rice" },
    { $set: { yieldKg: 4000 } }
)
            

CRUD Delete


// SQL
DELETE FROM farm
WHERE crop = 'Rice';
            

// MongoDB
db.farm.deleteOne({ crop: "Rice" })
            

Comparison Ops

  • $gt / $lt
  • $in / $ne
  • $exists
// MongoDB
db.farm.find({yieldKg:{$gt:3000}})

// SQL
SELECT * FROM farm
WHERE yield_kg>3000;
      

Logical Ops

  • $and / $or / $not / $nor
// MongoDB
db.farm.find({$and:[
 {crop:"Wheat"},{yieldKg:{$gt:2000}}]})

// SQL
SELECT * FROM farm
WHERE crop='Wheat' AND yield_kg>2000;
      

Evaluation Ops

  • $regex
  • $type
  • $expr
// MongoDB
db.farm.find({crop:/^W/})

// SQL
SELECT * FROM farm
WHERE crop LIKE 'W%';
      

Practice Query 1


// SQL
SELECT * FROM farm
WHERE yield_kg > 6000;
            

// MongoDB
db.farm.find({ yieldKg: { $gt: 6000 } })
                

Practice Query 2


// SQL
SELECT * FROM farm
WHERE irrigation IS NULL;
            

// MongoDB
db.farm.find({ irrigation: null })
                

Practice Query 3


// SQL
SELECT * 
FROM farm
ORDER BY price ASC
LIMIT 1;
            

// MongoDB
db.farm.find().sort({ price: 1 }).limit(1)
                

Practice Query 4


// SQL
UPDATE farm
SET stock = stock + 20
WHERE crop = 'Maize';
            

// MongoDB
db.farm.updateOne(
    { crop: "Maize" },
    { $inc: { stock: 20 } }
)
                

Practice Query 5


// SQL
DELETE FROM farm
WHERE yield_kg < 1000;
            

// MongoDB
db.farm.deleteMany(
    { yieldKg: { $lt: 1000 } }
)
                

Today Agenda

  • How to Validate Data Quality
  • MongoDB History & Features
  • CRUD, Operators, Practice
  • Homework, Lab, and Project

How to validate data quality?

Accuracy Completeness Consistency
Timeliness Validity Uniqueness
Integrity

Minimum Schema & Types (Part 1)

  • Use $jsonSchema for types & required.
  • Disallow NULL for critical fields.
// Validity
// Enforce required fields
db.createCollection("readings", {
  validator: {
    $jsonSchema: {
      // Completeness
      required: ["deviceId", "ts"]
    }
  }
})

Minimum Schema & Types (Part 2)

  • Range checks for moisture (0–100).
  • Enum for unit (e.g., "percent").
  • Reject malformed docs on write (combined validator sketched below).
// Validity, Accuracy
properties: {
  deviceId: { bsonType: "string" },
  ts: { bsonType: "date" },
  moisture: { minimum: 0, maximum: 100 },
  unit: { enum: ["percent"] }
}
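
Putting Part 1 and Part 2 together, a combined sketch (assuming the readings collection is being created fresh; field names follow the slides):

// Completeness, Validity, Accuracy, Consistency in one validator
db.createCollection("readings", {
  validator: {
    $jsonSchema: {
      required: ["deviceId", "ts"],                 // Completeness
      properties: {
        deviceId: { bsonType: "string" },           // Validity
        ts:       { bsonType: "date" },
        moisture: { minimum: 0, maximum: 100 },     // Accuracy: range check
        unit:     { enum: ["percent"] }             // Consistency: fixed unit
      }
    }
  }
})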

bsonType Example (Part 1)

  • bsonType validates field data types.
  • Ensures string, date, int.
// Validity
properties: {
  deviceId: { bsonType: "string" },
  ts: { bsonType: "date" },
  moisture: { bsonType: "double" }
}

bsonType Example (Part 2)

  • Prevents accidental type drift.
  • Supports arrays, objects, bool, null.
// Consistency
properties: {
  battery: { bsonType: "int" },
  healthy: { bsonType: "bool" },
  tags: { bsonType: "array" },
  meta: { bsonType: "object" }
}

Defaults & Normalization (Part 1)

  • Normalize unit & numeric types.
  • Prefer upsert for idempotence.
// Normalize on write
db.readings.updateOne(
  { deviceId: "plant-007", ts: ts },
  [{
    $set: {
      unit: "percent",
      moisture: { $toDecimal: "$moisture" }
    }
  }],  // pipeline update so $toDecimal is evaluated
  { upsert: true }
)

Defaults & Normalization (Part 2)

  • Apply defaults with $ifNull.
  • Backfill battery if null.
db.readings.updateOne(
  { deviceId: "plant-007", ts: ts },
  [{
    $set: {
      battery: { $ifNull: ["$battery", 100] }
    }
  }],  // pipeline update so $ifNull is evaluated
  { upsert: true }
)

Bulk Normalize (Part 1)

  • Use filter {} to target all docs.
  • Cast types consistently in multiple fields.
db.readings.updateMany(
  {},
  [{ $set: {
    deviceId: { $toString: "$deviceId" },
    ts: { $toDate: "$ts" },
    moisture: { $toDecimal: "$moisture" },
    battery: { $ifNull: ["$battery", 100] }
  }}]
)

Bulk Normalize (Part 2)

  • Cleaning historical IoT data.
  • Safe to run multiple times (idempotent).
db.readings.updateMany(
  {},
  [{ $set: {
    unit: "percent",
    healthy: { $toBool: "$healthy" },
    count: { $toInt: "$count" },
    tags: { $ifNull: ["$tags", []] },
    meta: { $ifNull: ["$meta", {}] }
  }}]
)

Before vs After Normalization

  • Before: Raw IoT readings can be messy.
  • Types are inconsistent (string vs number).
  • Null or missing fields (e.g., battery).
  • Arrays or objects not initialized.
{
  deviceId: 12345,
  ts: "2025-09-11",
  moisture: "42",
  battery: null,
  healthy: "true"
}
  • After: Normalized and validated.
  • All types aligned with schema.
  • Defaults filled (battery=100).
  • Ready for analytics & actuation.
{
  deviceId: "12345",
  ts: ISODate("2025-09-11T00:00:00Z"),
  moisture: NumberDecimal("42"),
  battery: 100,
  healthy: true,
  tags: [],
  meta: {}
}

Enforce Uniqueness

  • Prevent duplicates with unique index.
  • Use compound key (deviceId + ts).
  • Guarantee uniqueness of samples.
// Uniqueness
db.readings.createIndex(
  { deviceId: 1, ts: 1 },
  { unique: true }
)

Detect Outliers

  • Run QA queries for invalid ranges.
  • Quarantine suspicious documents.
  • Protect accuracy of analytics.
// Accuracy, Validity
db.readings.find({
  $or: [
    { moisture: { $lt: 0 } },   // Accuracy
    { moisture: { $gt: 100 } }  // Validity
  ]
})

Use Partial Indexes

  • Ensure rules only apply when fields exist.
  • Avoid blocking inserts with missing keys.
  • Supports validity and uniqueness.
// Validity, Uniqueness
db.readings.createIndex(
  { deviceId: 1, ts: 1 },
  {
    unique: true,
    partialFilterExpression: {
      deviceId: { $exists: true },
      ts: { $type: "date" }
    }
  }
)

Cross-Document Validation

  • Check latest reading before actuation.
  • Abort if invalid, out of range, or null.
  • Protect integrity of control actions.
// Integrity, Accuracy
const r = db.readings.findOne(
  { deviceId: "plant-007" },
  { sort: { ts: -1 } }
)

if (!r || r.moisture < 0 || r.moisture > 100) {
  throw "Invalid input"
}

Data Quality Dashboard

  • Visualize data quality in real time.
  • Good for operational monitoring.
Data Quality Dashboard
  • Business intelligence & decision-making
  • Emphasizes data visibility and trust
  • (Quality) Metrics for organization success
Book Dashboard Effect

MQL Aggregation Functions

  • MQL aggregation.
  • Functions for sum, avg, min, max.
  • Operators: $group, $match, $project.
  • Pipeline for step-by-step transformation.
  • Data quality checks & metrics calculation.
// Accuracy, Completeness
db.readings.aggregate([
  { $match: { unit: "percent" } },
  { $group: {
      _id: "$deviceId",
      avgMoisture: { $avg: "$moisture" },
      minMoisture: { $min: "$moisture" },
      maxMoisture: { $max: "$moisture" },
      totalReadings: { $sum: 1 },
      count: { $count: {} }
  }}
])

Count NULL and Inaccurate Data

  • Run scheduled aggregates for QA reports.
  • Surface % nulls, invalid ranges, duplicates.
  • Improves accuracy and completeness.
// Accuracy, Completeness
db.readings.aggregate([
  { $group: {
      _id: null,
      nullMoisture: { $sum: { 
        $cond: [{$eq:["$moisture",null]},1,0] }},
      outOfRange:   { $sum: { 
        $cond: [
            {
                $or:[{$lt:["$moisture",0]},
                {$gt:["$moisture",100]}]},1,0
        ]}}
  }}
])
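To report shares instead of raw counts, the totals can be divided by the overall document count. A sketch, using the same field names as above:
// Sketch: percentage of null moisture readings
db.readings.aggregate([
  { $group: {
      _id: null,
      total: { $sum: 1 },
      nullMoisture: { $sum: { $cond: [{ $eq: ["$moisture", null] }, 1, 0] } }
  }},
  { $project: {
      _id: 0,
      pctNullMoisture: { $multiply: [{ $divide: ["$nullMoisture", "$total"] }, 100] }
  }}
])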

Logging & Auditing

  • Maintain audit trail of changes.
  • Replay and verify historical decisions.
  • Supports integrity & traceability.
// Integrity
db.audit.insertOne({
  action: "UPDATE",
  target: "readings",
  ts: new Date(),
  user: "system",
  changes: { field: "moisture", old: "42", new: "42.0" }
})
Unstructured
Inconsistent
Normalized
Consistent

Eventually Consistent

Questions?
Accuracy ✓ Completeness ✓ Consistency ✓
Timeliness Validity ✓ Uniqueness ✓
Integrity ✓
Questions?
Slowest component in realtime application?

Slowest Component in Realtime Application?

  • Realtime apps rely on multiple tiers.
  • Bottleneck decides end-to-end speed.
  • Commonly database writes or network.
CLIENT → SERVER → DATABASE

Term: Network Latency

Definition: The time it takes data to travel from source to destination and back (round trip). Contributing factors include bandwidth, distance, routing, congestion, and packet size.
Network Latency

Slowest Component in Realtime Application?

Big Data Transport Latency (100 MB)
-----------------------------------
1 Gbps : ~0.8 s one-way
100 Mbps : ~8 s one-way
10 Mbps : ~80 s one-way


Round-trip β‰ˆ 2x + small ACK delay
CLIENT → SERVER → DATABASE

Timeliness via Sharded Architecture

  • Supports horizontal scale (sharding).
  • Each shard holds part of the dataset.
  • Query router (mongos) directs requests.
  • Config servers maintain cluster metadata.
  • Documentation
MongoDB Sharded Cluster

Horizontal Scaling for Timeliness

  • Distribute workload in multiple nodes.
  • Each shard holds a subset of the data.
  • Improves timeliness of queries.
  • Reduces latency under heavy IoT loads.
// Timeliness
// Shard readings by deviceId
db.adminCommand({
  shardCollection: "iot.readings",
  key: { deviceId: 1 }
})
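A collection can only be sharded after sharding is enabled on its database. A short sketch using the same namespace and shard key:
// Enable sharding on the database first, then shard the collection
sh.enableSharding("iot")
sh.shardCollection("iot.readings", { deviceId: 1 })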

Balanced Cluster Load

  • MongoDB balancer moves chunks automatically.
  • Keeps shards evenly loaded.
  • Maintains consistency and speed.
// Check balancer status
db.adminCommand({ balancerStatus: 1 })

// Move chunk manually if needed
db.adminCommand({
  moveChunk: "iot.readings",
  find: { deviceId: "plant-007" },
  to: "shard0001"
})

Scaling Reads & Writes

  • Reads routed to appropriate shard.
  • Writes distributed by shard key.
  • Supports high-throughput IoT transformations.
  • Preserves timeliness under scale.
// Example query routed by key
db.readings.find({ deviceId: "plant-007" })

// Query router (mongos)
// ensures minimal latency
Other options?

Apply TTL for Ephemeral Data

  • Auto-expire stale telemetry buffers.
  • Preserve only fresh, relevant data.
  • Supports timeliness.
// Timeliness
db.staging.createIndex(
  { expiresAt: 1 },
  { expireAfterSeconds: 0 }
)

TTL Expiration Options

  • Expire After 24h
  • Deletes docs 24 hours after ts.
  • Good for temporary telemetry buffers.
// Timeliness
// Auto-delete 24h after ts
db.staging.createIndex(
  { ts: 1 },
  { expireAfterSeconds: 86400 }
)
  • Expire On Specific Date
  • Deletes docs at given expiresAt.
  • Useful for fixed retention windows.
// Timeliness
// Auto-delete at expiresAt
db.logs.createIndex(
  { expiresAt: 1 },
  { expireAfterSeconds: 0 }
)

db.logs.insertOne({
  msg: "debug",
  expiresAt: ISODate("2025-09-12T00:00:00Z")
})
Accuracy ✓ Completeness ✓ Consistency ✓
Timeliness ✓ Validity ✓ Uniqueness ✓
Integrity ✓

Practice Lab

https://onecompiler.com/mongodb

Conceptual Q1

In a farm database, which data quality metric is the hardest to guarantee?

  • Accuracy
  • Consistency
  • Completeness
  • Timeliness
  • Uniqueness
  • Validity
  • Integrity

Conceptual Q2

  • Why is timeliness critical in IoT pipelines?
Self-driving car

Today Agenda

  • Data Transformation and Processing
  • Map: Transform data items
  • Filter: Transform data rows
  • Reduce: Transform data columns
  • Lab 2, Homework 2, Project Part 1

Data Transformation and Processing (1970)

Ed Codd
  • Edgar F. "Ted" Codd: relational database model (1970)
  • Data (table) transformation came first
  • SQL transformations limited to the RDBMS

-- Filter high-yield crops
SELECT * 
FROM crops 
WHERE yield_per_hectare > 5000;

-- Sum total production
SELECT SUM(production_tonnes) AS total 
FROM crops;

-- Update harvest_date to DATE format
UPDATE crops
SET harvest_date = DATE('2025-09-14')
WHERE id = 1;

Data Transformation and Processing (1966)

BΓΆhm–Jacopini
  • Structured Program Theorem (BΓΆhm–Jacopini)
  • Algorithm first = Sequence, Selection, Iteration
  • Minimal building blocks for computation
#include <stdio.h>

int main() {
    int yields[5] = {4000, 5200, 3100};
    int total = 0;

    for (int i = 0; i < 5; i++) { // Iteration
        if (yields[i] > 5000) {   // Selection
            total += yields[i];    // Sequence
        }
    }
    printf("Total high yield = %d\n", total);
    return 0;
}

Data Transformation and Processing (1972)

Dennis Ritchie
  • Dennis Ritchie: C programming language
  • Transformations limited to if & loops
// Example: transform yields with C
#include <stdio.h>

int main() {
    int yields[5] = {4000, 5200, 3100};
    for (int i = 0; i < 5; i++) {
        if (yields[i] > 5000) {
          printf("High yield: %d\n", yields[i]);
        }
    }
    return 0;
}

Data Transformation and Processing (~2000)

-- Simple SQL query
SELECT name, price
FROM Flowers
WHERE price > 10;
      
-- Loop + if-then-else
FOR each flower IN Flowers:
    IF flower.price > 10 THEN
        PRINT "Expensive"
    ELSE
        PRINT "Affordable"
    END IF
END FOR
      

...and today?

How to transform JSON data?

How not to transform JSON data?

  • Risk bugs (manual index, array size mismatch)
  • Not maintainable (hard to change logic)
  • Not readable (nested loops & conditions)
  • Doesn't scale for real JSON with nested objects
  • Lacks composability (can't easily reuse)
#include <stdio.h>

int main() {
    int yields[5] = {4000, 5200, 3100};
    int total = 0;

    for (int i = 0; i < 5; i++) { // Iteration
        if (yields[i] > 5000) {   // Selection
            total += yields[i];    // Sequence
        }
    }
    printf("Total high yield = %d\n", total);
    return 0;
}

JSON Data Transformation with Map, Filter, Reduce

JSON Data Transformation with Map, Filter, Reduce

  • Map – Declarative: what to transform, not how
  • Filter – Declarative: what rows to keep, not how to loop
  • Reduce – Declarative: what summary you want (sum, avg, min, max, count)

Contrast: Non-declarative requires specifying both what + how (loops, indices, control flow).

MAP

MAP Syntax

  • Transform each JSON object item in an array
  • Apply a function to each element
  • Create a new array from existing data
const numbers = [1, 2, 3, 4];

const doubled = numbers.map(x => x * 2);
// Result: [2, 4, 6, 8]

const objects = [{id:1}, {id:2}];
const transformed = objects.map(obj => ({
    ...obj, 
    value: obj.id * 10
}));
// Result: [{id:1, value:10}, {id:2, value:20}]

Map Example

  • Map applies to item granularity
  • Used for item requirements
  • IndexedDB, MongoDB, and SQL
//Before:
[
  { "crop": "Wheat", "yield": "5000" },
  { "crop": "Corn", "yield": "4000" }
]

//After:
[
  { "crop": "Wheat", "yield": 5000 },
  { "crop": "Corn", "yield": 4000 }
]
            
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)

Map in JavaScript

  • Translation: English β†’ Spanish
  • Data Format: Date β†’ ISO string
  • Type Conversion: String β†’ Integer
  • Default Value: Fill missing fields
// Translation: English to Spanish
["Wheat","Corn"].map(c=>c==="Wheat"?"Trigo":"MaΓ­z");

// Data Format: Local to UTC
["01/12/25"].map(d => new Date(d).toISOString());

// Type Conversion: String to Int
["5000","4000"].map(y => parseInt(y));

// Default Value: Region
[{r:null}].map(x=>({r:x.r||"Unknown"}))
            

Data Mapping Requirements: Req 1 & 2

  • Req 1: Fill missing data with default
  • Req 2: Convert yield to integer
Before:
[
  { "crop": "Wheat", "yield": "5000" },
  { "crop": "Corn" }
]

After Map:
[
  { "crop": "Wheat", "yield": 5000 },
  { "crop": "Corn", "yield": "missing" }
]
            

Map IndexedDB Data with JavaScript

Data Mapping Requirements: Req 1 & 2

const data = [
  { crop: "Wheat", yield: "5000" },
  { crop: "Corn" }
];

const mapped = data.map(d => ({
  crop: d.crop,
  yield: d.yield ? parseInt(d.yield) : "missing"
}));
            
const data = [
  { crop: "Wheat", yield: "5000" },
  { crop: "Corn" }
];

const transformed = [];
for (let i = 0; i < data.length; i++) {
  const d = data[i];
  let yieldVal;
  if (d.yield) {
    yieldVal = parseInt(d.yield);
  } else {
    yieldVal = "missing";
  }
  transformed.push({crop:d.crop,yield:yieldVal});
}
            

Data Mapping Requirements: Req 3 & 4

  • Req 3: Add timestampLastUpdate meta data
  • Req 4: Add uuid meta data
// Before:
const data = [{ "crop": "Wheat" }];

// After:
[
  { 
    "crop": "Wheat", 
    "tsLastUpdate": "2025-09-15T15:00:00Z", 
    "uuid": "a1b2c3-"
  }
]
            

Map IndexedDB Data with JavaScript

Data Mapping Requirements: Req 3 & 4


const data = [
  { crop: "Wheat" },
  { crop: "Corn" }
];

const mapped = data.map(d => ({
  ...d,
  meta: {
    tsLastUpdate: new Date(),
    uuid: crypto.randomUUID()
  }
}));
            

const data = [
  { crop: "Wheat" },
  { crop: "Corn" }
];

const transformed = [];
for (let i = 0; i < data.length; i++) {
  const d = data[i];
  transformed.push({
    ...d,
    meta: {
      tsLastUpdate: new Date(),
      uuid: crypto.randomUUID()
    }
  });
}
            

Map Use Case

  1. Poor quality data ⟢ Better quality data (e.g., remove nulls, fix typos)
  2. Unnormalized data ⟢ Normalized data (e.g., split address: street/city/state)
  3. Unstandardized data ⟢ Standardized data (e.g., timestamps β†’ UTC)
  4. Format 1 ⟢ Format 2 (e.g., English β†’ Spanish)
  5. Incomplete data ⟢ Complete data (e.g., fill missing yield with average)
  6. Duplicate records ⟢ Unique dataset (e.g., remove repeated crop entries)
  7. Inconsistent units ⟢ Standardized units (e.g., lbs β†’ kg, gallons β†’ liters)
  8. Unstructured text ⟢ Structured fields (e.g., parse β€œWheat, 5000kg” into JSON)
  9. Less secure ⟢ More secure (e.g., anonymize SSN β€œ123-45-6789” ⟢ β€œXXX-XX-6789”)
  10. Less integrity ⟢ More integrity (e.g., add metadata like checksum or UUID to verify integrity)
  11. No Data Pipeline ⟶ Data Pipeline (e.g., extract-transform-load (ETL) process for IoT data)
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)

MAP Syntax in MongoDB

  • input: Array to iterate over
  • as: Variable name for each element
  • in: Expression to apply to each element
  • Creates a new array with transformed values

{
  $map: {
    input: [/* array */],
    as: "element",
    in: { /* transformation */ }
  }
}
            

MAP Example: Double Numbers

  • Double each number in an array
// Create a temporary collection
db.tempNumbers.insertMany([ {
     _id: 1, 
     numbers: [1,2,3,4] 
    } 
]);

db.tempNumbers.aggregate([
  {
    $project: {
      doubled: {
        $map: {
          input: "$numbers",
          as: "n",
          in: { $multiply: ["$$n", 2] }
        }
      }
    }
  }
]);
// Result: [{_id:1, doubled: [2,4,6,8]}]
            

MAP Example: Transform Objects

  • Transform each object in an array
  • Add computed values for each element
db.tempObjects.insertOne({ 
    _id: 1, 
    objects: [{id:1},{id:2}] 
});

db.tempObjects.aggregate([
  {
    $project: {
      transformed: {
        $map: {
          input: "$objects",
          as: "obj",
          in: { 
            id: "$$obj.id", 
            value: { $multiply: ["$$obj.id", 10] } 
          }
        }
      }
    }
  }
]);
            

Map MongoDB Data with JavaScript

Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)
Filter

What is a Filter?

Coffee Filter Example

Filter Syntax

  • Filter elements in an array based on a condition
  • Returns new array with elements that meet the condition
  • array.filter(element => condition === true)
// Coffee menu as objects
const drinks = [
  { name: "Espresso", price: 3 },
  { name: "Latte", price: 4 },
  { name: "Cappuccino", price: 5 }
];

// Filter: keep only Latte
const lattes = drinks.filter(drink => drink.name === "Latte");

console.log(lattes);
// [ { name: "Latte", price: 4 } ]

Filter Requirements in JavaScript: Req 1 & 2

// Data set
const drinks = [
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
];

// Req 1: Filter only Cappuccino
      
// Data set
const drinks = [
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
];

// Req 2a: Map price to number
// Req 2b: Filter drinks under $5
      

Filter IndexedDB Data with JavaScript

Filter Requirements in JavaScript: Req 1 & 2

// Data set
const drinks = [
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
];
const cappuccinos = drinks.filter(d => d.name === "Cappuccino");
const cheapNames = drinks
  .map(d => ({ drink: d.name, price: Number(d.price) }))
  .filter(d => d.price < 5);
      
// Data set
const drinks = [
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
];
const cappuccinos = [];
const cheapNames = [];
for (let i = 0; i < drinks.length; i++) {
  const d = drinks[i];
  if (d.name === "Cappuccino") {
    cappuccinos.push(d);
  }
  const price = Number(d.price);
  if (price < 5) {
    cheapNames.push({ drink: d.name, price: price });
  }
}
      
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)

Filter Requirements in MongoDB: Req 1 & 2

// Collection: drinks
// Documents
[
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
]

// Req 1: Filter only Cappuccino
      
// Collection: drinks
// Documents
[
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
]

// Req 2a: Convert price to number
// Req 2b: Filter Cappuccinos priced $5 or less
      

Filter Data with MongoDB (MQL)

Filter Requirements in MongoDB: Req 1 & 2

// Insert sample data
db.drinks.insertMany([
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
])

// Req 1: Filter only Cappuccino
db.drinks.find({ name: "Cappuccino" })

// Result:
[ { name: "Cappuccino", price: "5" } ]
      
// Insert sample data
db.drinks.insertMany([
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
])

// Req 2a: Convert price to number
// Req 2b: Filter Cappuccinos priced $5 or less
db.drinks.aggregate([
  { $match: { name: "Cappuccino" } },
  { $addFields: { price: { $toInt: "$price" } } },
  { $match: { price: { $lte: 5 } } }
])

// Result:
[ { name: "Cappuccino", price: 5 } ]
      

Filter Use Case

  1. Noise ⟢ Signal (e.g., remove irrelevant logs, keep error events)
  2. All crops ⟢ Selected crops (e.g., only β€œWheat”)
  3. All dates ⟢ Relevant period (e.g., last 7 days only)
  4. All customers ⟢ Target customers (e.g., local farmers)
  5. All products ⟢ In-stock products (e.g., quantity > 0)
  6. All diagnoses ⟢ Selected ICD-9 codes (e.g., only 250.xx)
  7. All devices ⟢ Active devices (e.g., last ping < 5 min ago)
  8. All fields ⟢ Sensitive fields removed (e.g., drop SSN, keep name + city)
  9. All records ⟢ Valid records (e.g., only rows passing schema validation)
  10. Unfiltered API ⟢ Specific view (e.g., Query selecting only posts with tag β€œIoT”)
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)
Reduce

What is a Reduce?

  • Reduce like combining coffee grounds into one cup β˜•
  • What does reduce do? β†’ Big data to One Value
Coffee Cup Example

Reduce Syntax

  • Reduce combines array elements into a single result
  • Requires an accumulator and a function
  • array.reduce((acc, element) =>..., initValue)
// Example: sum numbers
const numbers = [1, 2, 3];
const total = numbers.reduce(
  (acc, n) => acc + n, 0
);

console.log(total);
// 6
      

Reduce in JavaScript

// Data set
const drinks = [
  { name: "Espresso", price: 3 },
  { name: "Latte", price: 4 },
  { name: "Cappuccino", price: 5 }
];

// Total price of all drinks
const total = drinks.reduce(
  (sum, d) => sum + d.price, 0
);

console.log(total);
// 12
      
// Same with loop
const drinks = [
  { name: "Espresso", price: 3 },
  { name: "Latte", price: 4 },
  { name: "Cappuccino", price: 5 }
];

let total = 0;
for (let i = 0; i < drinks.length; i++) {
  total += drinks[i].price;
}

console.log(total);
// 12
      

Reduce in MongoDB (Aggregation)

// Insert data
db.drinks.insertMany([
  { name: "Espresso", price: 3 },
  { name: "Latte", price: 4 },
  { name: "Cappuccino", price: 5 }
])

// Total price of all drinks
db.drinks.aggregate([
  { $group: { _id: null, total: { $sum: "$price" } } }
])

// Result:
[ { _id: null, total: 12 } ]
      
// Total price of Cappuccino only
db.drinks.aggregate([
  { $match: { name: "Cappuccino" } },
  { $group: { _id: "$name", total: { $sum: "$price" } } }
])

// Result:
[ { _id: "Cappuccino", total: 5 } ]
      
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)

Reduce Use Case

  1. Many sales ⟢ Total sales (e.g., SUM revenue)
  2. Multiple prices ⟢ Average price (e.g., AVG cost)
  3. All values ⟢ Minimum / Maximum (e.g., lowest or highest temperature)
  4. All customers ⟢ Count of customers (e.g., COUNT records)
  5. All errors ⟢ Error rate summary (e.g., % failed requests)
  6. Sensor data stream ⟢ Aggregated metric (e.g., rolling average)
  7. Daily transactions ⟢ Weekly or monthly totals
  8. All text entries ⟢ One concatenated string (e.g., join comments)
  9. Distributed logs ⟢ Unified summary (e.g., number of events per category)
  10. Detailed dataset ⟢ One KPI dashboard metric (e.g., ROI = (Ξ£revβˆ’Ξ£cost)/Ξ£cost)
Questions
Dashboard

Today Agenda

  • MQL CRUD Query
  • MQL Aggregation Pipeline
  • Lab 2, Homework 2, Project Part 1
MongoDB Aggregation Pipeline Example

Today Agenda

  1. DB Architecture - End-to-End Aspects
  2. DB Architecture - Design Time Aspects
  3. DB Architecture - Runtime Aspects
  4. NoSQL Graph DB - Neo4j
  5. Lab 2, Homework 2, Project Part 1

End-to-End Database Pipeline Aspects

  • Pipeline connects data sources to sinks
  • Ingest, process, store, and deliver data
  • Supports both real-time and batch workloads
End-to-end pipeline overview
Software Required to Transport Data from Database to Database

ETL Transport Software

  • E – Extract: Pull data from source systems
  • T – Transform: Clean, validate, and reshape data
  • L – Load: Insert data into target system
  • E: databases, APIs, IoT devices, logs
  • T: quality checks, joins, aggregations
  • L: lake, warehouse, mart, dashboard
  • β€”

History of ETL vs ELT

  • 1970s–80s: ETL emerges with relational databases
  • 1990s: ETL dominates for data warehouses
  • 2000s: Big data growth β†’ lakes introduced
  • 2010s: ELT rises with scalable cloud storage
  • Today: Hybrid pipelines mixing ETL + ELT
ETL vs ELT history timeline

Source DB Systems

  • Transactional databases (SQL/NoSQL)
  • APIs, IoT devices, logs
  • Streaming sources (Kafka, MQTT)
Data sources

Data Extraction

  • Captures raw data from sources
  • Batch (files, APIs) and streaming (events)
  • Ensures durability and availability
# extract.py - Example of extracting data from an API
import requests
import json

url = "https://api.weather.com/data"
response = requests.get(url)

if response.status_code == 200:
    data = response.json()
    print("Data extracted:", data)
else:
    print("Failed:", response.status_code)
            
Where to Extract Data To?

Concept of Data Lake

  • Central repository for raw data
  • Add metadata for context and discovery
  • Schema-on-read: flexibility for data variety
Data lake concept

Concept of Data Lake

  • Central repository for raw data
  • Add metadata for context and discovery
  • Schema-on-read: flexibility for data variety
  • One query language for multiple source db
  • Raw data is read-only
  • Data available even when source DB offline
# Example structure of a data lake
lake = {
    # READ ONLY
    "uuid": "123e4567-e89b-12d3-a456-426614174000", 
    "item": { 
        "raw_data": {
            "sensor_readings": [...],
            "logs": [...],
            "api_data": [...]
        },
        # READ AND WRITE
        "meta_data": { 
            "source": "IoT sensors",
            "timestamp": "2025-09-23T08:00:00Z",
            "tags": ["temperature", "humidity"]
        }
    }
}
            

Concept of Data Lake

  • Holds source-of-truth data in raw form
  • Supports delayed or retrospective processing
  • Minimizes risk of premature data loss
Truth in data lake

Concept of Data Lake

  • Transform: Enrich raw data with metadata
  • Load: Store enriched data in the data lake
  • Maintain read-only raw data
Truth in data lake

Concept of Data Lake

In many data pipeline architectures, the data lake is often the first and central component.

Data lake as central component

Concept of Data Lake

It is highly recommended to start with a data lake to ensure flexibility and scalability.

Data lake as central component

Case Study: Agriculture Database Pipeline

  • IoT soil moisture sensors collect raw data
  • DB sources: SQL, IndexedDB, and MongoDB
  • Stored in a data lake with metadata
  • Lake holds truth for delayed processing
  • Task: Draw architecture up to lake
  • Include: metadata on quality aspects
Agriculture data pipeline case study
When is software just a tool to transport data from one DB to another?
Example with MongoDB Compass
Questions?

1. What is the main purpose of the Extract step in ETL?

  • Transform data for analysis
  • Load data into the target system
  • Capture raw data from source systems
  • Generate reports for end users

2. Which of the following is a key characteristic of a Data Lake?

  • Only stores structured data
  • Data is read-only and enriched with metadata
  • Requires source databases to be always online
  • Data is stored without any metadata or context

3. In ETL processes, what does the Transform step usually involve?

  • Pulling data from APIs and files
  • Cleaning, validating, and reshaping data
  • Loading data into dashboards
  • Backing up the source database

4. When would you recommend using ETL over ELT?

  • When the target system can handle large-scale raw data processing
  • When you need to clean, validate, and transform data before loading
  • When you want to load raw data first and transform later
  • When working exclusively with streaming data
Questions?

Today Agenda

  1. DB Architecture - End-to-End Aspects ✓
  2. DB Architecture - Design Time Aspects
  3. DB Architecture - Runtime Aspects
  4. NoSQL Graph DB - Neo4j
  5. Lab 2, Homework 2, Project Part 1

Design-Time Aspects

  • Choices made before pipeline runs
  • Impact scalability, cost, reliability
  • Guide long-term architecture quality
Design-time planning

Which Database for Lake?

  • Object storage (e.g., S3, GCS, Azure Blob)
  • NoSQL (e.g., MongoDB, Cassandra)
  • Columnar formats (Parquet, ORC)
  • Relational Database (MySQL, OracleDB)
  • Decision: scalability vs. query speed
Databases for data lake

Which Sync Interval?

  • How often to sync from sources
  • Real-time (seconds) vs batch (hours)
  • Impacts freshness and cost
Interval   Use Case
Real-time  IoT, fraud detection
Hourly     ETL jobs, logs
Daily      Reporting dashboards
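A minimal sketch of how an interval choice becomes code on the Node side (the hourly interval and the syncFromSources function are assumptions):
// Sketch: schedule-based sync, run once per hour (interval is an assumption)
async function syncFromSources() {
  console.log("Extracting from sources and loading into the lake...");
}

setInterval(syncFromSources, 60 * 60 * 1000); // hourly batch sync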

Which Sync Trigger?

  • Event-based: trigger on data arrival
  • Schedule-based: run at fixed times
  • Hybrid: schedule + event validation
# etl.py - all-in-one ETL script
def etl():
    extract()
    transform()
    load()

# Separate modular files
# extract.py
def extract():
    print("Extracting data from source...")

# transform.py
def transform():
    print("Transforming and enriching data...")

# load.py
def load():
    print("Loading data into target system...")
            

Partial or Full Sync?

  • Partial sync: only changed data
  • Full sync: replace entire dataset
  • Tradeoff: performance vs reliability
Type     Pros                  Cons
Partial  Faster, less storage  Risk missing updates
Full     Consistent snapshot   High cost & latency
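A sketch of the partial-sync idea in mongosh, assuming each document carries a lastModified timestamp and the pipeline remembers the time of its previous run:
// Sketch: incremental (partial) sync; field and collection names are assumptions
const lastSync = ISODate("2025-09-23T00:00:00Z")  // saved by the previous run

db.source.find({ lastModified: { $gt: lastSync } }).forEach(doc => {
  const { _id, ...fields } = doc
  db.lake.updateOne({ _id: _id }, { $set: fields }, { upsert: true })
})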

Schema Design

  • Schema-on-read: flexibility
  • Schema-on-write: strict validation
from pymongo import MongoClient

db = MongoClient().Agriculture  # connect to MongoDB
print(db.flowers.count_documents({
    "color": {"$exists": True}
}))  # verify data
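For the schema-on-write side, one option is a MongoDB $jsonSchema validator. A hedged sketch (collection and required fields are assumptions):
// Schema-on-write: reject inserts that do not match the declared schema
db.createCollection("flowers", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "color"],
      properties: {
        name:  { bsonType: "string" },
        color: { bsonType: "string" }
      }
    }
  }
})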
    

What is a Data Retention Policy?

  • How long to keep raw data in lake
  • Short retention reduces cost
  • Long retention supports audits
  • May be driven by legal requirement
# Retention strategies for raw data (illustrative pseudocode)
def retention_policy(item):
    # 1. Auto delete after 30 days
    if item.age_days > 30:
        item.delete()  # remove from the store

    # 2. Deletion flag
    elif item.flag_for_deletion:
        item.deleted = True

    # 3. Do not delete
    else:
        pass  # keep indefinitely
            

Security, Access, and Integrity

  • Data Integrity Check
  • Encryption at rest & in transit
  • Access control (RBAC, ABAC)
  • Audit logs for compliance
import json, hashlib

# Verify checksum of metadata
doc = db.Farm.find_one({"uuid": "123e4567-e89b-..."})
meta_json = json.dumps(doc['meta_data'], sort_keys=True)
checksum = hashlib.md5(meta_json.encode()).hexdigest()

print("Stored checksum:", doc.get("checksum"))
print("Calculated checksum:", checksum)
print("Valid:", doc.get("checksum") == checksum)
            

How much traceability?

  • Track data origin and transformations
  • Enable reproducibility & trust
  • Metadata catalog for discovery
  • PubMed Article: 23684593
Lineage and catalog

Design-Time Summary

  • Database type for lake
  • Sync interval & trigger
  • Partial vs full sync
  • Schema, retention, format
  • Integrity, Traceability
Design-time overview

Case Study: Agriculture Data Pipeline (Part 2)

  • IoT soil moisture sensors collect raw data
  • Data ingested through batch and streaming
  • Stored in a data lake with metadata
  • Lake holds truth for delayed processing
  • Task: Draw end-to-end architecture up to lake
  • Include: Design Time Aspects
Agriculture data pipeline case study
Questions?

1. What is a key design-time consideration when choosing ETL vs ELT?

  • Whether to use Python or Java
  • Data volume, source complexity, and processing capabilities
  • Network bandwidth between users
  • Time of day to run the pipeline

2. Which design-time aspect involves defining cleaning, validation, and metadata?

  • Data Sources
  • Transformation
  • Retention & Security
  • Sync Trigger

3. Which design-time aspect determines how long raw data is kept in the pipeline?

  • Data Sources
  • ETL vs ELT choice
  • Retention & Security
  • Transformation & Metadata

4. When planning design-time aspects, what should you consider about data sources?

  • Connectivity requirements
  • The color of the UI dashboard
  • The font used in reports
  • Only the database name
Questions?

Today Agenda

  1. DB Architecture - End-to-End Aspects ✓
  2. DB Architecture - Design Time Aspects ✓
  3. NoSQL Graph DB - Neo4j
  4. DB Architecture - Runtime Aspects
  5. Lab 2, Homework 2, Project Part 1
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j
NoSQL Design Principles Diagram

When to Use a Graph Database?

To Query Relationships

Query connections, paths, and patterns between entities.

Subject : Predicate : Object
  • Alice: FRIEND_OF: Bob
  • Paris: CAPITAL_OF: France
  • Rose: GROWS_IN: Garden
  • Car: MANUFACTURED_BY: Toyota
  • Student: ENROLLED_IN: Course
Graph Database Diagram

History of Graph Databases

  • 1960s–1970s: Early graph theory research
  • 1980s: OOP databases with graph-like features
  • 1990s: Semantic web and RDF for graph data
  • 2000s: Emergence of native graph databases
  • 2010s: GraphDB in social networks
  • 2010s: GraphDB in recommendation systems
  • 2010s: GraphDB in fraud detection

History of Neo4j

  • 2000: Neo4j project started in Sweden
  • 2007: First public release of Neo4j
  • 2010: Neo4j for social network
  • 2010: Neo4j for recommendation system
  • 2014–2020: Expanded support for clustering
  • Today: Widely used native graph database
  • Query language: Cypher
  • Node/edge properties can use JSON structures
  • Schema-less: nodes/edges flexible properties

Neo4j Aura Console

https://console-preview.neo4j.io/

Create Nodes

  • Create a node with label Person
// Create a Person node
CREATE (p:Person {name: "Alice", age: 30})
RETURN p;
            

Create Relationships

  • Connect two nodes with a FRIEND relationship
// Create FRIEND relationship
MATCH (a:Person {name:"Alice"}),(b:Person {name:"Bob"})
CREATE (a)-[:FRIEND]->(b)
RETURN a, b;
            

Find Nodes

  • Retrieve all Person nodes
// Find all Person nodes
MATCH (p:Person)
RETURN p;

// Sample return object
{
  "p": {
    "name": "Alice",
    "age": 30,
    "email": "alice@example.com"
  }
}
            

Filter Nodes

  • Find Person nodes older than 25
// Filter by age
MATCH (p:Person)
WHERE p.age > 25
RETURN p;

// Sample return objects
[
  {"name": "Alice", "age": 30},
  {"name": "Charlie", "age": 28}
]
            

Update Nodes

  • Update the age of a Person
// Update age
MATCH (p:Person {name: "Alice"})
SET p.age = 31
RETURN p;

// Sample return object
{"name": "Alice", "age": 31}
            

Delete Nodes

  • Remove a Person node
// Delete node
MATCH (p:Person {name: "Bob"})
DETACH DELETE p;

// No return object; node is deleted
            

Count Nodes

  • Count how many Person nodes exist
// Count nodes
MATCH (p:Person)
RETURN count(p) AS totalPersons;

// Sample return object
{"totalPersons": 3}
            

Find Relationships

  • List all FRIEND relationships
// Find all FRIEND relationships
MATCH (a)-[r:FRIEND]->(b)
RETURN a.name, b.name;

// Sample return objects
[
  {"a.name": "Alice", "b.name": "Bob"},
  {"a.name": "Charlie", "b.name": "Alice"}
]
            

Aggregate Data

  • Calculate average age of Persons
// Average age
MATCH (p:Person)
RETURN avg(p.age) AS averageAge;

// Sample return object
{"averageAge": 29.7}
            

Optional Match

  • Find Persons and optional FRIENDs
// Optional FRIEND match
MATCH (p:Person)
OPTIONAL MATCH (p)-[:FRIEND]->(f)
RETURN p.name, collect(f.name) AS friends;

// Sample return objects
[
  {"p.name": "Alice", "friends": ["Bob", "Charlie"]},
  {"p.name": "Charlie", "friends": []}
]
            

SQL vs Cypher

  • Find all friends of Alice
// SQL query
SELECT f.name
FROM Person p
JOIN FRIENDS fr ON p.id = fr.person_id
JOIN Person f ON fr.friend_id = f.id
WHERE p.name = 'Alice';

// Cypher query
MATCH (a:Person {name: "Alice"})-[:FRIEND]->(f)
RETURN f.name;
            
Questions

ACID Compatibility

Database Atomicity Consistency Isolation Durability
IndexedDB Yes Yes Yes (per transaction) Yes
MongoDB Yes (single document) Yes (single document) Limited Yes (with journaling)
Neo4j Yes Yes Yes Yes
SQL (e.g., MySQL, PostgreSQL) Yes Yes Yes Yes

Today Agenda

  1. DB Architecture - End-to-End Aspects ✓
  2. DB Architecture - Design Time Aspects ✓
  3. NoSQL Graph DB - Neo4j
  4. DB Architecture - Runtime Aspects
  5. Lab 2, Homework 2, Project Part 1

ETL Runtime Challenges

  • Data volume spikes
  • Overwhelms downstream systems
  • Causes pipeline bottlenecks
  • Increases job failures
  • Leads to data loss or delays
ETL challenges placeholder

Runtime Aspects

  • Log endpoints usage
  • Log requests & responses
  • Log server uptime
  • Log DB health & queries
  • Use logs to prevent broken pipelines
ETL runtime placeholder
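A minimal request/response logging sketch for the Express app used in the labs (middleware shape and log format are assumptions):
// Log every request and response status with its duration
const express = require("express");
const app = express();

app.use((req, res, next) => {
  const start = Date.now();
  res.on("finish", () => {
    console.log(`${req.method} ${req.url} -> ${res.statusCode} (${Date.now() - start} ms)`);
  });
  next();
});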

Runtime Circuit Breaker

  • Protects services from overload
  • Stops repeated failing requests
  • Allows system to recover
  • Throttles down request rate
  • Write on lake only
Circuit breaker diagram
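A toy circuit-breaker sketch in plain JavaScript to show the open/close behavior (threshold and function names are assumptions, not a library API):
// Toy circuit breaker: stop calling a failing service after N consecutive failures
const THRESHOLD = 5;
let consecutiveFailures = 0;

async function guardedCall(requestFn) {
  if (consecutiveFailures >= THRESHOLD) {
    throw new Error("Circuit open: request skipped so the service can recover");
  }
  try {
    const result = await requestFn();
    consecutiveFailures = 0;   // success closes the circuit again
    return result;
  } catch (err) {
    consecutiveFailures += 1;  // another failure moves toward the open state
    throw err;
  }
}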

Runtime Resilience4j

  • Java library for fault tolerance
  • Circuit Breaker support
  • Rate Limiter & Bulkhead
  • Retry & Time Limiter

Monitoring Pipelines

  • Dashboards for job health
  • Alerts on data drift & anomalies
  • Performance metrics (I/O, memory)
  • Audit logs for compliance

import psutil, time

while True:
    cpu = psutil.cpu_percent()
    mem = psutil.virtual_memory().percent
    print(f"CPU: {cpu}%, MEM: {mem}%")
    time.sleep(5)
Questions

Which NoSQL Database Would You Recommend?

  • You need an offline browser-based app to store user preferences β†’ ?
  • You want to manage a large product catalog with flexible schema β†’ ?
  • You need to model social network relationships like friends-of-friends β†’ ?
  • You are building a data lake for weather and IoT sensors β†’ ?
  • You need fast local caching of session data in a web app β†’ ?
Questions

Lab 2 - Upstream Data

Upstream Data Diagram

History of Node.js

  • Created by Ryan Dahl in 2009
  • Built on Chrome’s V8 JavaScript engine
  • Introduced event-driven, non-blocking I/O
  • npm (Node Package Manager) launched in 2010
  • Widely adopted for web apps & APIs
Node.js history placeholder

Getting Started with Node.js & npm

  • Node.js = JavaScript runtime
  • npm = package manager
  • Install packages with npm install
  • Track dependencies in package.json
  • Run apps with node app.js
  • Lab2-Sample

# initialize a project
npm init -y

# install a package
npm install express

# run the app
node app.js

Review package.json

  • name / version: project identity
  • description: project purpose
  • main: entry file (app.js)
  • scripts: run commands (start, dev)
  • dependencies: required libraries
  • devDependencies: tools for development
  • license: usage terms
{
  "name": "lab2-agriculture-app",
  "version": "1.0.0",
  "description": "Simple Node.js app with Express and MongoDB",
  "main": "app.js",
  "scripts": {
    "start": "node app.js",
    "dev": "nodemon app.js"
  },
  "dependencies": {
    "dotenv": "^16.4.5",
    "express": "^4.19.2",
    "mongodb": "^4.17.2"
  },
  "devDependencies": {
    "nodemon": "^3.1.0"
  },
  "license": "MIT"
}

Review .gitignore

  • node_modules/: installed packages, not tracked
  • .env: sensitive credentials (DB, API keys)
  • Logs: debug/error logs ignored
  • OS / Editor files: system-generated clutter
  • Build artifacts: compiled output, not source
# Node dependencies
node_modules/

# Environment variables
.env

# Logs
*.log

# OS / Editor files
.DS_Store
Thumbs.db

# Build artifacts
dist/
build/

Setup Env & App

  • Load .env early
  • Import Express & MongoDB
  • Create the app instance
  • Read runtime ports/creds
// app.js
require("dotenv").config();

const express = require("express");
const { MongoClient } = require("mongodb");

const app = express();

const PORT = process.env.PORT || 3000;
const HOST = process.env.MONGO_HOST;
const USER = process.env.MONGO_USER;
const PASS = process.env.MONGO_PASS;

Early Env Validation

  • Fail fast if secrets missing
  • Clear error message
  • Exit to avoid half-start
(function validateEnv() {
  const missing = [];
  if (!HOST) missing.push("MONGO_HOST");
  if (!USER) missing.push("MONGO_USER");
  if (!PASS) missing.push("MONGO_PASS");
  if (missing.length) {
    console.error("❌ Missing env var(s):", missing.join(", "));
    console.error("   Ensure .env ext to app.js with these keys.");
    process.exit(1);
  }
})();

Create Mongo Client

  • SRV URL host-only
  • Separate auth credentials
  • Retry & unified topology
const client = new MongoClient(
  `${HOST}/?retryWrites=true&w=majority`,
  {
    useNewUrlParser: true,
    useUnifiedTopology: true,
    auth: { username: USER, password: PASS },
    authSource: "admin"
  }
);

let collection; // set after connect

Connect & Start Server

  • Connect & ping admin
  • Select DB/collection
  • Log document count
  • Start Express on success
(async function start() {
  try {
    console.log("⏳ Connecting to MongoDB...");
    await client.connect();
    await client.db("admin").command({ ping: 1 });

    const db = client.db("Lab2");
    collection = db.collection("Agriculture");

    const host = HOST.replace(/^mongodb\+srv:\/\//, "");
    const count = await collection.estimatedDocumentCount();
    console.log(`βœ… Connected to ${host}`);
    console.log(`πŸ“š Lab2.Agriculture docs: ${count}`);

    app.listen(PORT, () =>
      console.log(`πŸš€ http://localhost:${PORT}`)
    );
  } catch (err) {
    console.error("❌ DB connection error:", err.message || err);
    process.exit(1);
  }
})();

Data Endpoint

  • Guard: DB initialized
  • Query and return JSON
  • Handle 5xx on errors
app.get("/agriculture", async (_req, res) => {
  try {
    if (!collection) {
        return res.status(503).send("DB not initialized");
    }
    const docs = await collection.find({}).toArray();
    res.json(docs);
  } catch (e) {
    res.status(500).send(String(e));
  }
});

Health & Debug

  • /health: basic readiness
  • /debug/agriculture: count + sample
  • Use in runtime checks
app.get("/health", (_req, res) => {
  res.json({
    ok: Boolean(collection),
    cluster: HOST.replace(/^mongodb\+srv:\/\//, "")
  });
});

app.get("/debug/agriculture", async (_req, res) => {
  try {
    if (!collection) {
        return res.status(503).send("Database not initialized");
    }
    const count = await collection.estimatedDocumentCount();
    const sample = await collection.find({}).limit(5).toArray();
    res.json({db:"Lab2",collection:"Agriculture",count,sample });
  } catch (e) {
    res.status(500).json({ error: String(e) });
  }
});

POST Endpoint

  • Accept JSON in request body
  • Insert new document
  • Return confirmation with ID
  • Handle errors gracefully
// Enable JSON parsing
app.use(express.json());

// POST: add agriculture doc
app.post("/agriculture", async (req, res) => {
  try {
    const doc = req.body;
    const result = await collection.insertOne(doc);
    res.json({ insertedId: result.insertedId });
  } catch (e) {
    res.status(500).json({ error: String(e) });
  }
});

Today Agenda

  1. DB Architecture - Data for Secondary Usage
  2. DB Architecture - Data Warehouse
  3. DB Architecture - Data Mart
  4. NoSQL Graph DB - Neo4j
  5. Lab 2, Homework 2
  6. Lab 3, Homework 3
  7. Midterm, Project Part 1
DB Architecture - Data for Primary Usage

Data for Primary Usage

  • Support real-time operations
  • Record transactions or events
  • Maintain accurate current data
  • Enable immediate decision-making
  • Monitor systems and devices

Data for Primary Usage – Hospital

  • Track patient vital signs in real-time
  • Maintain accurate medical records
  • Support immediate clinical decisions
  • Monitor ICU devices and alerts
  • Enable timely diagnosis and treatment
{
    "patient_id":"P2025",
    "heart_rate":82,
    "blood_pressure":"120/80",
    "oxygen_level":97,
    "timestamp":"2025-09-30T09:30:00Z"
}

Data for Primary Usage – Agriculture

  • Monitor soil moisture and temperature
  • Track crop growth and yield data
  • Control irrigation systems
  • Support real-time pest detection
  • Ensure sustainable resource use
{
    "field_id":"F010",
    "soil_moisture":22.5,
    "temperature":29.1,
    "irrigation_status":"ON",
    "timestamp":"2025-09-30T09:30:00Z"
}

Data for Primary Usage – Retail

  • Record customer purchases
  • Maintain inventory levels
  • Support checkout and billing
  • Enable personalized promotions
  • Provide real-time sales insights
{
    "transaction_id":"R7845",
    "store_id":"S123",
    "items":["Laptop","Mouse"],
    "total_amount":1150.50,
    "purchase_time":"2025-09-30T09:45:00Z"
}
DB Architecture - Data for Secondary Usage

Data for Secondary Usage

  • Aggregate historical records
  • Analyze trends over time
  • Compute averages and statistics
  • Support predictive models
  • Identify patterns across datasets
  • Train Large Language Models (LLMs)

Healthcare (Aggregated)

  • Patients by condition
  • Avg. medication usage
  • Treatment success rates
  • Hospital occupancy trends
  • Age-group analysis
{
    "total_patients":1200,
    "avg_medications":2.3,
    "treatment_success_pct":87,
    "occupancy_rate":0.78,
    "age_groups":{"0-18":120,"19-65":890,"65+":190}
}

Automotive (Aggregated)

  • Avg. maintenance cost
  • Failure rates by model
  • Service frequency trends
  • Common repair types
  • Fuel efficiency analysis
{
    "avg_cost":320,
    "failures":{"modelA":12,"modelB":7},
    "service_count":420,
    "common_repairs":["brake","oil","tire"],
    "avg_mpg":28
}

Agriculture (Aggregated)

  • Avg. yield per crop
  • Planting vs harvest trends
  • Soil quality averages
  • Climate impact analysis
  • Water usage statistics
{
    "avg_yield_tons":52.5,
    "planting_dates":["2024-03-15","2024-04-01"],
    "harvest_dates":["2024-09-20","2024-09-30"],
    "soil_ph_avg":6.5,
    "water_used_l":5000
}

Retail (Aggregated)

  • Total sales by product
  • Avg. customer spend
  • Top-selling categories
  • Inventory turnover rates
  • Monthly sales trends
{
    "total_sales":10500,
    "avg_spend":45.7,
    "top_categories":["electronics","books"],
    "turnover_rate":1.8,
    "monthly_sales":[1200,1100,1350]
}

What is a Data Lake?

  • Central repo for all data types
  • Stores raw data, highly flexible
  • Handles large-scale data sources
Data Lake
DB Architecture - How from Primary Usage to Secondary Usage?

Data Lake: From Raw to Structured

  • Raw Zone: Original data stored
  • Processed Zone: Cleaned and curated
  • Analytics Zone: Structured for queries
Lake Architecture

ETL: From Lake to Warehouse

  • Extract: Pull data from lake
  • Transform: Clean and standardize
  • Validate: Ensure data quality
  • Load: Store into warehouse
  • Monitor: Track data pipeline health
{
    "extracted_records":1200000,
    "transformed_records":1185000,
    "errors_found":15000,
    "quality_score":0.987,
    "load_status":"success"
}

Data Warehouse

  • Structured & Normalized: Optimized for queries
  • Subject-Oriented: Focused on domains
  • Integrated: Data from multiple sources
  • Time-Variant: Historical trends stored
  • Non-Volatile: Read-only, stable data
Warehouse

Data Flow in Warehouses

  • Extract: Pull data from sources
  • Transform: Normalize & standardize
  • Load: Store in warehouse tables
db.crops.aggregate([
  { $match: { crop_type: "Wheat" } },
  { $project: { yield_tons: 1, farm_id: 1 } },
  { $group: {
      _id: "$farm_id",
      total_yield: { $sum: "$yield_tons" },
      avg_yield: { $avg: "$yield_tons" }
  }},
  { $sort: { total_yield: -1 } }
])

Challenges of Data Warehouse

  • Single, centralized structure can be overwhelming
  • Difficult to serve diverse departmental needs
  • Performance bottlenecks with large queries
  • Limited flexibility for specialized analysis
  • High cost of maintaining one large system
  • Slow adoption by end-users without tailored views
  • Harder to scale for business-specific use cases
Centralized Data Warehouse

Data Marts

  • Subject-Specific: Focused on one domain
  • Smaller Scope: Limited dataset vs warehouse
  • Faster Access: Optimized for queries
  • Decentralized: Independent warehouse
Data Mart

Types of Data Marts

  • Dependent: From central warehouse
  • Independent: From external sources
  • Hybrid: Combines both approaches
  • By Security: Restricted vs open access
  • By Domain: Sales, marketing, finance
  • By Application: Analytics or reporting
{
  "farm_id": "F12345",
  "crop_type": "Wheat",
  "region": "North Valley",
  "harvest_year": 2024,
  "yield_tons": 50.5,
  "water_used_m3": 1200,
  "fertilizer_kg": 300,
  "pesticides_kg": 20,
  "profit_usd": 7500,
  "irrigation_method": "Drip"
}
Which Database Type for Structured Data and Aggregated Queries?

Why Relational Databases for Warehouses & Marts?

  • Support structured schemas (facts + dimensions)
  • Efficient for joins, aggregations, group-bys
  • ACID properties ensure data integrity
  • Indexes optimize analytical queries
  • Widely supported by BI and reporting tools
  • Data Volume: Lake >>> Warehouse
// SQL Query Example
SELECT d.specialty, SUM(f.cost) AS total_cost
FROM Fact_Treatment f
JOIN Dim_Doctor d ON f.doctor_id = d.doctor_id
GROUP BY d.specialty;

Why Facts & Dimensions Matter

  • Facts: Numeric metrics (sales, yield, treatment cost)
  • Dimensions: Context (time, location, product, patient)
  • Facts alone = numbers without meaning
  • Dimensions alone = context without metrics
  • Together: enable analysis (per crop, per doctor, per region)
// Example: Hospital
Fact_Treatment {
  patient_id, doctor_id, date_id, cost
}

Dim_Doctor { doctor_id, specialty }
Dim_Date   { date_id, month, year }

Data Warehouse – Facts & Dimensions

  • Fact: Measure hospital events (treatment, admission)
  • Dimensions: Describe context (patient, doctor, time)
  • Facts = numeric & additive (e.g., cost, duration)
  • Dimensions = descriptive attributes
  • Together form star or snowflake schema
// Hospital Example
Fact_Treatment {
  patient_id, doctor_id, date_id, cost
}

Dim_Patient { patient_id, age, gender }
Dim_Doctor  { doctor_id, specialty }
Dim_Date    { date_id, year, month, day }

Data Warehouse/Mart – Star Schema

  • Central fact table with measures (e.g., sales, revenue)
  • Surrounded by denormalized dimension tables
  • Simpler design, faster queries
  • Easy for reporting and BI tools
  • Consumes more storage due to redundancy
Star Schema Figure

Data Warehouse/Mart – Snowflake Schema

  • Fact table with normalized dimension tables
  • Dimensions split into sub-dimensions (hierarchies)
  • Reduces data redundancy and storage cost
  • More complex design and joins
  • Better for large, scalable warehouses
Snowflake Schema Figure

Challenges of Secondary Data Usage

  • Data may be outdated or irrelevant
  • Original purpose may be missing
  • Data quality can be inconsistent
  • Limited control over data collection
  • Privacy concerns with sensitive data
  • May require cleaning or transformation
  • Primary usage has higher priority
Data Lake vs Warehouse

Warehouse Query Requirement – Average Rainfall

  • Requirement: Find average rainfall in a region
  • Fact Table: Rainfall measurements
  • Dimensions: Region, Date
  • Aggregate: Average per region
// Fact Table: Fact_Rainfall
// region_id, date_id, mm_rain

SELECT r.region_name, 
       AVG(f.mm_rain) AS avg_rainfall
FROM Fact_Rainfall f
JOIN Dim_Region r ON f.region_id = r.region_id
GROUP BY r.region_name;

Please Draw a Database Pipeline on Paper

Questions?

Lab 2

Lab 2 – What Went Well

Client (.html)
  • Almost no for loops and if conditions
  • Used chaining with map, filter, reduce
  • Focused on enriching metadata
  • Mostly button clicks (some triggers, not automatic)
Server (.js)
  • Explicitly allowed Access-Control-Allow-Methods
  • Asked many good questions

Enrichment of Metadata

  • Using Array(10) instead of for loop
  • Fill and map pattern generates objects concisely
  • Metadata enrichment applied with spread operator
  • Cleaner and more functional programming style
  • Scalable for generating synthetic sensor data
const objects = Array(10).fill(null)
  .map((_, i) => ({
    id: i + 1,
    sensorId: `sensor-${i + 1}`,
    reading: Math.floor(Math.random() * 100),
    timestamp: new Date().toISOString(),
    notes: `Reading ${i + 1}`,
    ...metadata
  }));

Lab 2 – What Did Not Go Well

Client (.html)
  • Extraction means querying data automatically
  • Extraction means posting to the lake automatically
  • Manual copy of objects to collection not scalable
Server (.js)
  • Password in plain code (mongodb+srv://...)
  • Should use .env for secrets
  • Need more specific endpoints ('/sync')
  • Some queries deleted entire lake data
  • GitHub structure not followed (one folder per lab)

Metadata as a Separate Object

  • Metadata should not mix with raw sensor data
  • Keeps the structure clean and organized
  • Prevents accidental changes to measurement values
  • Allows flexible updates without touching raw data
  • Improves clarity for downstream processing
const reading = {
  data: {
    id: 1,
    sensorId: "sensor-1",
    value: 72,
    timestamp: "2025-09-30T10:00:00Z"
  },
  metadata: {
    author: "Alice",
    last_sync: new Date().toISOString(),
    uuid_source: crypto.randomUUID()
  }
};

Use sourceId Instead of id

  • Avoids confusion with database _id
  • Clarifies data origin vs system identifier
  • Makes schema easier to interpret
  • Supports better integration with external sources
  • Improves long-term maintainability
{
  "_id": {
    "$oid": "68d7557c7cbbeeea2631a6cb"
  },
  "id": 1,
  "sensorId": "S1",
  "reading": 53,
  "timestamp": "2025-09-27T03:08:47.143Z",
  "notes": "Record for sensor 1",
  "metadata": {
    "author": "Alice",
    "last_sync": "2025-09-27T03:09:47.140Z",
    "uuid_source": "8a72254d-b18c-4706-9e57-a98bb8ae523d"
  }
}
Extraction: Query Database + Upload Data

HTTP POST Request with Fetch API

  • Send data to the server
  • Specify method as POST
  • Include headers (e.g., Content-Type)
  • Body must be a string (e.g., JSON)
  • Commonly used to create new records
fetch("http://localhost:3000/agriculture/sync/indexeddb", {
  method: "POST",
  headers: {
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    sensorId: "S1",
    reading: 75
  })
})
  .then(response => response.json())
  .then(result => {
    console.log("Data posted:", result);
  })
  .catch(error => {
    console.error("Error:", error);
  });

HTTP GET Request with Fetch API

  • Retrieve data from a server
  • Default method is GET
  • Returns a Promise
  • Use response.json() to parse JSON
  • Commonly used for reading APIs
fetch("http://localhost:3000/agriculture/sync/indexeddb")
  .then(response => response.json())
  .then(data => {
    console.log("Data received:", data);
  })
  .catch(error => {
    console.error("Error:", error);
  });
How to Prevent Deleting Data in a Lake?

MongoDB Lake – Avoid Deletion

  • Do not expose DELETE routes in API
  • Restrict operations to insert and update
  • Protect data at the application layer
// API example: No DELETE endpoint
app.post("/add", async (req, res) => {
  await db.collection("lake").insertOne(req.body);
  res.send("Added safely");
});

// No app.delete(...) defined

Soft Delete Instead of Hard Delete

  • Mark documents as deleted, don’t remove them
  • Preserves history for auditing
  • Supports recovery if needed
// Soft delete with a flag
db.lake.updateOne(
  { _id: ObjectId("...") },
  { $set: { isDeleted: true, deletedAt: new Date() } }
);

// Query only active docs
db.lake.find({ isDeleted: { $ne: true } });

Restrict User Roles

  • Use MongoDB RBAC (Role-Based Access Control)
  • Remove the remove (delete) privilege from users
  • Allow only find, insert, update
db.createRole({
  role: "noDeleteRole",
  privileges: [
    {
      resource: { db: "lakeDB", collection: "" },
      actions: ["find", "insert", "update"]
    }
  ],
  roles: []
});

db.createUser({
  user: "lakeUser",
  pwd: "securePass",
  roles: [ { role: "noDeleteRole", db: "lakeDB" } ]
});

Monitor & Restore

  • Use Change Streams to detect deletes
  • Log and re-insert deleted documents
  • Backups & snapshots as safety net
// Watch for deletes
db.lake.watch([
  { $match: { operationType: "delete" } }
]).on("change", change => {
  console.log("Delete detected:", change);
  // Optionally restore from backup
});

Agenda

  1. DB Architecture - Data Lakehouse
  2. DB Architecture - Streaming Pipelines
  3. NoSQL Graph DB - Neo4j (Queries & Links)
  4. Lab 3 Sample Code, Homework 3
  5. Midterm & Project Part 1 – Guidelines
Data Lakehouse

Recap: Lake β†’ Warehouse β†’ Mart

  • Lake: raw, flexible, cheap
  • Warehouse: curated, reliable
  • Mart: focused, fast to use

Agriculture: Warehouse path

  • Soil tests β†’ ETL β†’ tables
  • Yields by field, by season
  • Reports for decisions

RDBMS for marts; clean, structured facts

Limits: Warehouse

  • Rigid schemas; slow change
  • Costly scale for history
  • Weak for unstructured data
  • Weak for realtime data
  • Raw detail lost on ETL

Agriculture: Lake path

  • Sensor logs in object store
  • Realtime data, weather CSV
  • Ad hoc SQL and scripts

MongoDB for raw docs; flexible ingest

Limits: Data Lake

  • Schema-on-read is fragile
  • No ACID across files
  • Catalog and lineage weak

The Gap

  • Keep raw + curated together
  • Get ACID and time travel
  • Serve SQL, ML, streaming
  • Less ETL: no lake β†’ warehouse
Lake + Warehouse:
Sources → ETL → Data Lake → ETL → Data Warehouse → ETL → Marts / Dashboards → Consumers

Lakehouse:
Sources → ETL → Lakehouse (open formats + ACID) → SQL / ML / Streaming Engines → Consumers

  • Single ingestion step
  • Less ETL (no lake β†’ warehouse shuffle)
  • One store, many engines

Architectural Concept Lakehouse

  • SQL query engine on raw files
  • Tables on open files format
  • ACID: safe merge and update
  • Schema evolve, time travel
  • One store, many engines
  • Strong catalog and lineage
Lakehouse diagram placeholder

Agriculture: Lakehouse flow

  • Ingest sensors, drones, soil
  • Store raw + curated tables
  • Merge late yield updates
  • SQL for Reports; ML on same data

Where tools fit

  • RDBMS: marts and reports
  • MongoDB: raw field logs
  • Neo4j: farm links, supply
  • IndexedDB: offline tablets
  • Lakehouse: Delta Lake / Apache Iceberg

Agriculture with Lakehouse

  • Soil + yield β†’ curated tables
  • Drone images β†’ raw files
  • BI reports on zones
-- Daily report
SELECT zone_id, AVG(yield_kg)
FROM fact_yield
GROUP BY zone_id;

Graph view: Inputs β†’ Fields

  • Fertilizer β†’ Field β†’ Yield
  • Supplier β†’ Batch β†’ Field

Neo4j to explore causal chains

Takeaways

  • Warehouse: trust, but rigid
  • Lake: flexible, but messy
  • Lakehouse: both in one

Recap: Lakehouse

  • One platform for raw + curated
  • ACID transactions on open files
  • SQL, ML, and BI share same data
  • Good for batch, not always for now

Limits of Batch

  • Updates only daily/hourly
  • Late data is hard to merge fast
  • Farm alerts need real time
Batch-Oriented
  • Insurance risk modeling
  • Census & demographic statistics
  • Climate & weather simulations
  • Pharmaceutical clinical trial data
  • Corporate reporting
Real-Time (Streaming)
  • Air traffic control & flight tracking
  • Emergency response & 911 dispatch systems
  • Healthcare monitoring (ICU sensors, wearables)
  • Power grid & smart meters
  • Autonomous vehicles & traffic systems

New Need: Real Time Pipelines

  • Soil dries in minutes
  • Machine breakdown alerts
  • Weather warnings on time
Batch-first:
Sources → ETL → Data Lake → ETL → Warehouse → ETL → Marts → Consumers

Streaming-first:
Sources → Stream Pipeline → Consumers (→ ETL → Lake → ...)

  • Soil alerts in seconds
  • History still stored

Enter Streaming Pipelines

  • Data flows as events
  • Process as it arrives
  • Push alerts & dashboards
  • Store only what’s needed

Streaming Pipeline Flow

  1. Source: sensors, tractors, APIs
  2. Stream: Kafka topics
  3. Process: filters, agg windows
  4. Store: Elasticsearch, DBs
  5. Consume: Kibana, alerts
Streaming Flow

Ag Example: Soil Moisture

  • Sensor sends event every 5 min
  • Stream joins with weather feed
  • Alert if below threshold
// Kafka consumer (Node.js, event-style consumer API)
consumer.on("data", msg => {
  const reading = JSON.parse(msg.value); // payload arrives as raw bytes/JSON
  if (reading.moisture < 15) {
    sendAlert("Field A12 too dry!");
  }
});

Ag Example: Tractor Telemetry

  • Streams engine heat + GPS
  • Detect overheating early
  • Notify operator instantly

Streaming Pipeline Stack: Kafka

  • Capture events as messages
  • Process stream data
  • Scale for high throughput
# producers β†’ Kafka topics
# consumers β†’ Kafka topics

Kafka: Distributed Event Streaming

  • Distributed event streaming platform
  • High throughput (<2β€―ms latency)
  • Scalable to thousands of brokers
  • Durable, fault‑tolerant storage
# create topic
kafka-topics.sh --create --topic events --bootstrap-server localhost:9092
# producer
kafka-console-producer.sh --topic events --bootstrap-server localhost:9092
# consumer
kafka-console-consumer.sh --topic events --from-beginning --bootstrap-server localhost:9092

End-to-End Streaming Example

  • Producers send events to Kafka
  • Consumers read and process events
# produce events via Kafka
# consume events from Kafka
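A hedged end-to-end sketch with the kafkajs Node library (broker address, topic name, and payload are assumptions):
// npm install kafkajs
const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "farm-app", brokers: ["localhost:9092"] });

async function main() {
  // Producer: send a soil-moisture event
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: "events",
    messages: [{ value: JSON.stringify({ field: "A12", moisture: 12 }) }]
  });

  // Consumer: react to each event as it arrives
  const consumer = kafka.consumer({ groupId: "alerts" });
  await consumer.connect();
  await consumer.subscribe({ topic: "events", fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const reading = JSON.parse(message.value.toString());
      if (reading.moisture < 15) console.log(`Alert: field ${reading.field} too dry`);
    }
  });
}

main().catch(console.error);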

Kafka Documentation
Lakehouse (e.g., MongoDB) vs Streaming (e.g., Kafka)

Lakehouse vs Streaming

  • Lakehouse
  • Batch load
  • SQL + ML unification
  • Daily/Hourly insights
  • Streaming
  • Event-by-event
  • Continuous joins
  • Alerts in seconds

Takeaways

  • Lakehouse: unified batch
  • Streaming: real-time alerts
  • Often used together
  • Agri farms need both
Questions?

A city wants to track traffic flow in real time to adjust signals, while also keeping historical data for planning future road expansions.

Would you choose a Data Lake, a Data Warehouse, a Lakehouse, or a Streaming Pipeline?

Traffic data illustration placeholder

Sample Answer: Traffic Use Case

Recommendation
  • Streaming pipeline for live traffic signals
  • Lakehouse (Delta/Iceberg) as shared store
  • SQL / ML for planning & forecasting
  • Consumers:
    • Real-time: traffic ops & systems
    • Reporting: planners, dashboards
Traffic Sensors / Apps
Streaming Ingest
Consumers (Real-Time)
Lakehouse Tables
SQL / ML Engines
Consumers (Reporting)

Realtime Data?

Realtime data processing enables immediate responses. Events are captured, streamed, and acted upon in seconds, supporting use cases like fraud detection, traffic signals, and IoT monitoring.

Realtime data flow placeholder
Questions?

Agenda

  1. DB Architecture - Data Lakehouse ✓
  2. DB Architecture - Streaming Pipelines ✓
  3. NoSQL Graph DB - Neo4j (Queries & Links)
  4. Lab 3 Sample Code, Homework 3
  5. Midterm & Project Part 1 – Guidelines
Data Lakehouse

Neo4j: Add Nodes

Requirement
  • Add Field A12
  • Add Sensor S-001
MERGE (:Field {id:'A12', name:'North'})
MERGE (:Sensor {id:'S-001', type:'soil'});
➜ Created/Found:
  Field A12, Sensor S-001

Neo4j: Add Link

Requirement
  • Link sensor to field
MATCH (f:Field {id:'A12'})
MATCH (s:Sensor {id:'S-001'})
MERGE (s)-[:LOCATED_IN]->(f)
RETURN f.id AS field, s.id AS sensor;
➜ field | sensor
  A12   | S-001

Neo4j: Add Reading

Requirement
  • Store moisture=14
  • Keep time of reading
MATCH (s:Sensor {id:'S-001'})
CREATE (r:Reading {
  at: datetime(), moisture:14
})-[:FROM]->(s)
RETURN s.id AS sensor, r.moisture AS m;
➜ sensor | m
  S-001  | 14

Neo4j: Read Latest

Requirement
  • List sensors in A12
  • Show latest moisture
MATCH (f:Field {id:'A12'})<-[:LOCATED_IN]-
      (s:Sensor)<-[:FROM]-(r:Reading)
WITH s, r
ORDER BY r.at DESC
WITH s, collect(r)[0] AS last
RETURN s.id AS sensor, last.moisture AS m;
➜ sensor | m
  S-001  | 14

Neo4j: Aggregate

Requirement
  • Avg moisture by field
  • Sort lowest first
MATCH (f:Field)<-[:LOCATED_IN]-(s:Sensor)
MATCH (r:Reading)-[:FROM]->(s)
WITH f.id AS field, avg(r.moisture) AS avg_m
RETURN field, round(avg_m,1) AS avg_m
ORDER BY avg_m ASC;
➜ field | avg_m
  A12   | 14.0
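
The same read can be issued from Node.js with the neo4j-driver package used in Lab 3; a minimal sketch (the NEO4J_* variable names are assumptions, match them to your .env):

// Run the "latest moisture" query from Node.js (assumes: npm install neo4j-driver dotenv)
require("dotenv").config();
const neo4j = require("neo4j-driver");

const driver = neo4j.driver(
  process.env.NEO4J_URI, // e.g., neo4j+s://<id>.databases.neo4j.io
  neo4j.auth.basic(process.env.NEO4J_USER, process.env.NEO4J_PASSWORD)
);

async function latestMoisture(fieldId) {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (f:Field {id:$fieldId})<-[:LOCATED_IN]-(s:Sensor)<-[:FROM]-(r:Reading)
       WITH s, r ORDER BY r.at DESC
       WITH s, collect(r)[0] AS last
       RETURN s.id AS sensor, last.moisture AS m`,
      { fieldId }
    );
    result.records.forEach((rec) => console.log(rec.get("sensor"), rec.get("m")));
  } finally {
    await session.close();
    await driver.close();
  }
}

latestMoisture("A12").catch(console.error);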
Questions?
ETL with Visual Studio

Lab 3 – Neo4j + Mongo Lake

  • Pull graph data from Neo4j
  • Push data into MongoDB Atlas
  • Query and screenshot results
  • Submit code + screenshots
# Repo
git clone https://github.com/SE4CPS/2025-COMP-263.git
cd 2025-COMP-263/labs/lab3

# Install
npm init -y
npm install dotenv neo4j-driver mongodb

# Env
cp .env.template .env   # do NOT commit

Step 2 – Test Neo4j

  • Use provided sample script
  • Expect β€œConnected to Neo4j Aura!”
  • Fix .env / network if it fails
# Run
node labs/lab3/sampleNeo4jConnection.js

# Expected
Connected to Neo4j Aura!

# Check if error
# - URI / USER / PASSWORD in .env
# - Internet/TLS reachable

Step 3 – Read / Select Data (Neo4j)

  • Open Neo4j Browser (Aura)
  • Get only nodes and edges
  • Run a generic read query
  • Visualize graph and screenshot
// Read & visualize nodes and edges
MATCH (n)-[r]->(m)
RETURN n,r,m
LIMIT 25;

Step 4 – Push to Mongo

  • Run integration script
  • Expect insert count log
  • Check Mongo Atlas project
# Run
node labs/lab3/pushNeo4jToMongo.js

# Expected log
Inserted X docs into Mongo lake

# Tip
# If auth fails, check .env connection

Step 5 – Query in Compass

  • Connect via Compass URI
  • Select Project1.lake
  • Run two finds
  • Take two screenshots
// All docs
db.lake.find({})

// Only Neo4j-sourced
db.lake.find({ sourceDB: "Neo4j" })

# Submit: code + screenshots

Agenda

  1. Pipeline Bottlenecks – Sources & Symptoms
  2. Throughput vs Latency – Metrics & Tradeoffs
  3. In-Memory Acceleration – Redis Overview
  4. Hands-On – Online Redis CLI (Key/Value, JSON)
  5. Pipeline Caching & Data Eviction Strategies
  6. Performance Tuning – Concurrency & Scaling
Data Pipeline Performance

Pipeline Bottlenecks

Sources & Symptoms in Modern Data Pipelines

Common Sources - Slow ETL

  • Slow I/O – network or disk
  • Serialization & parsing overhead
  • Unbalanced workloads across nodes
  • Blocking APIs or synchronous flows
Bottleneck Icon

Data Flow Example

  • Source β†’ Ingest β†’ Transform β†’ Store
  • Single slow stage throttles the rest
  • Monitoring latency per stage is key
Pipeline Stages Diagram

Symptoms

Symptom Likely Cause
High latency I/O wait or serialization
Low throughput Network or CPU bottleneck
Queue build-up Backpressure not handled
Performance Bottleneck Diagram

Prevention Strategies

  • Parallelize CPU-heavy transformations
  • Batch I/O and compress data blocks
  • Use async I/O and message queues
  • Profile and monitor each pipeline stage
Optimization Illustration
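
As a sketch of the batching idea only (the 500-event batch size and the MongoDB insertMany target are illustrative assumptions):

// Batch many small writes into one bulk insert instead of one round trip per event
const BATCH_SIZE = 500;
const buffer = [];

async function flush(collection) {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length); // take everything currently buffered
  await collection.insertMany(batch);            // one network round trip for the whole batch
}

async function onSensorEvent(collection, event) {
  buffer.push(event);                            // cheap in-memory append
  if (buffer.length >= BATCH_SIZE) await flush(collection);
}

// flush leftovers on a timer so late events do not sit in the buffer forever
function startFlusher(collection) {
  return setInterval(() => flush(collection).catch(console.error), 1000);
}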

Throughput vs Latency

Metrics, Tradeoffs, and Tuning Principles

Core Definitions

  • Latency: Time per request (ms)
  • Throughput: Requests per second
  • Low latency β‰  high throughput
Performance Graph

Tradeoffs

Goal Possible Tradeoff
Increase throughput Higher latency from batching
Reduce latency Less batching β†’ CPU overhead
Parallelize tasks Coordination cost ↑
Tradeoff Balance Icon

Pipeline Example

  • Each stage adds fixed latency
  • Parallelism boosts throughput
  • Balance stages to avoid choke points
Latency vs Throughput Diagram

Measurement Tools

  • Prometheus & Grafana dashboards
  • Apache JMeter, k6, or Locust load tests
  • Redis Monitor (for in-memory ops)
Monitoring Dashboard

Monitoring Frequent & Real-Time Requests

  • Continuous soil & temperature readings from sensors
  • Buffer for ultra-fast ingestion
  • Sliding window to detect spikes or delays
  • Averages sent to dashboard every 10 s

Example: Smart irrigation with 500+ sensors


// Core loop
client.set(sensorId, JSON.stringify({
  moisture: 34.8,
  temp: 21.7,
  ts: Date.now()
}));

// Windowed latency check
const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log(`avg latency ${avg.toFixed(1)} ms`);

// sample Redis data
{
  "sensor:field12": {
    "moisture": 34.8,
    "temp": 21.7,
    "ts": 1738893421312
  }
}
How to Improve Data Pipeline Performance?
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

In-Memory NoSQL DB – Redis Overview

Fastest server-side database for real-time data

  • Data resides in RAM for microsecond access
  • Ideal for caching, queues, and streaming
  • Balances memory speed and persistence
Memory vs Storage

Chronology

  • 2009 – Salvatore Sanfilippo (antirez) releases Redis
  • Focus: speed, simplicity, in-memory design
  • Open-sourced under BSD, later managed by Redis Ltd
  • Now a backbone for caching & streaming
  • https://github.com/redis/redis
Redis logo

Server-Side Architecture

  • Runs as a server process
  • Clients connect via TCP on port 6379
  • Stores data in RAM for low latency
  • Optional persistence to disk (RDB/AOF)

$ redis-server
$ redis-cli ping
PONG

Data Eligible for Redis

  • Small to medium objects (KB–MB range)
  • Frequently accessed or time-sensitive data
  • Ephemeral or cacheable state (sessions, tokens)
  • Real-time metrics (sensor readings, API counts)
  • JSON, integers, or serialized strings
  • Session tokens β†’ session:usr123
  • API key cache β†’ api:key:weather-service

// Example: agriculture sensor data
{
  "sensor:field12": {
    "moisture": 34.8,
    "temperature": 21.7,
    "timestamp": 1738893421312
  }
}

// Session key
"session:usr123" : {
  "token": "eyJhbGciOiJIUzI1NiIs...",
  "expires": 1738893999000
}

// API request count
"api:weather:getCount" : 4821

// Cache entry
"cache:forecast:stockton" : {
  "temp": 21.7,
  "humidity": 65,
  "ts": 1738893480000
}

Why In-Memory?

  • RAM access β‰ˆ 10⁡ Γ— faster than disk
  • Ideal for caching & real-time analytics
  • Used between app ↔ database
cache diagram

Key–Value Model

  • Every record = unique key + value
  • Binary-safe strings β†’ numbers, JSON, etc.
  • O(1) lookup

SET crop:12 34.8
GET crop:12
# "34.8"

Core Data Types

  • String
  • List (linked list)
  • Set (unique items)
  • Hash (key–value map)
  • Sorted Set, Stream
Redis data types diagram

Strings

Single value per key, atomic increment/decrement


SET temp:field3 21.5
INCRBYFLOAT temp:field3 0.3
GET temp:field3
# 21.8

Lists

Push and pop like a queue or log


LPUSH sensor:log 34.1
LPUSH sensor:log 34.5
LRANGE sensor:log 0 -1
# ["34.5","34.1"]

Hashes

Mini key-value maps (like JSON objects)


HSET field:12 moisture 34.8 temp 21.7
HGETALL field:12
HGET field:12 temp
# moisture:34.8, temp:21.7

Sets

Unordered unique elements


SADD crops wheat barley maize
SMEMBERS crops
# wheat barley maize

Sorted Sets

Scores + values β†’ leaderboards or metrics


ZADD soil_moisture 34.8 field12 35.4 field13
ZRANGE soil_moisture 0 -1 WITHSCORES

Streams

Append-only logs for real-time pipelines


XADD sensor:stream * moisture 34.8 temp 21.7
XRANGE sensor:stream - +

Expiry & TTL

  • Sets a time limit (in seconds) for key validity
  • After expiry β†’ key is automatically deleted
  • Used for caches, sessions, temporary data

TTL = Time-To-Live (measured in seconds)


SET weather:now 21.7 EX 60
TTL weather:now
# 60  (seconds remaining)
# ...after a few seconds...
# 57
# -2  (means key expired)

Persistence Modes

  • RDB – snapshotting at intervals
  • AOF – append-only log of commands
  • Hybrid (default)
Persistence diagram

Publish / Subscribe

Real-time messaging between clients


SUBSCRIBE irrigation
PUBLISH irrigation "Start pump A"
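
The same pattern from Node.js, as a sketch assuming the ioredis client and a local Redis (a subscribed connection cannot issue other commands, so two connections are used):

// Pub/Sub with ioredis (assumes: npm install ioredis, Redis at localhost:6379)
const Redis = require("ioredis");

const sub = new Redis(); // dedicated subscriber connection
const pub = new Redis(); // publisher connection

sub.subscribe("irrigation", (err, count) => {
  if (err) return console.error(err);
  console.log(`Subscribed to ${count} channel(s)`);
});

sub.on("message", (channel, message) => {
  console.log(`[${channel}] ${message}`); // e.g., "[irrigation] Start pump A"
});

// publish shortly after, once the subscription is in place
setTimeout(() => pub.publish("irrigation", "Start pump A"), 500);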

ACID in Redis

Property Redis Support
Atomicity Yes (single command)
Consistency Eventual, not strict
Isolation Transactions via MULTI/EXEC
Durability No (by default)
ACID diagram
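
The isolation row refers to Redis transactions (MULTI/EXEC); a minimal ioredis sketch with illustrative keys:

// MULTI/EXEC: queue commands and execute them as one atomic block (ioredis)
const Redis = require("ioredis");
const redis = new Redis();

async function recordReading() {
  const results = await redis
    .multi()                            // start transaction
    .set("sensor:field12:last", "34.8") // queued, not yet executed
    .incr("sensor:field12:count")       // queued
    .exec();                            // run all queued commands together
  console.log(results);                 // e.g., [ [null, "OK"], [null, 1] ]
}

recordReading().catch(console.error);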

Quick Comparison

DB Model Speed Use Case
Redis Key-Value β˜…β˜…β˜…β˜…β˜… Caching / Realtime
IndexedDB Client KV β˜…β˜…β˜…β˜†β˜† Browser storage
MongoDB Document β˜…β˜…β˜…β˜†β˜† Flexible schema
SQL Relational β˜…β˜…β˜†β˜†β˜† Transactional data
Neo4j Graph β˜…β˜…β˜†β˜†β˜† Relationships
DB comparison diagram

Scaling

  • Master-replica replication
  • Redis Cluster for sharding
  • Horizontal scale for large datasets
Redis cluster

Common Use Cases

  • Caching API responses
  • Session management
  • Real-time analytics
  • Message queues
Streaming use case

Redis – Strengths

  • Server-side, in-memory key–value database
  • Ultra-low latency (<1 ms typical)
  • Supports rich data types (String, List, Hash, Set, Stream)
  • Ideal for caching, sessions, and real-time analytics
  • Complements MongoDB / SQL in pipelines

Redis – Limitations

  • Volatile by default: data lost if not persisted
  • Not ideal for heavy, continuous writes to disk
  • RAM-based β†’ limited by available memory
  • Lacks strong ACID durability (config-dependent)
  • No native query language (key-based access only)
Questions?

A city wants to track traffic flow in real time to adjust signals, while also keeping historical data for planning future road expansions.

Would you choose a Data Lake, a Data Warehouse, a Lakehouse, or a Streaming Pipeline?

Traffic data illustration placeholder

Agenda

  1. Redis API and Performance
  2. Caching Strategies in Web Apps
  3. Midterm Review
  4. Midterm Mock Exam
  5. Q&A and Discussion
Data Pipeline Performance

Node.js + MongoDB + Redis

  • Load agriculture docs from MongoDB
  • Cache results in Redis
  • Expose two GET endpoints with timing in JSON
  • Compare latency: Mongo only vs Redis cache
Agriculture data

Install & Env

npm i express mongodb ioredis dotenv
# .env (example)
PORT=3000
MONGO_HOST=mongodb+srv://cluster0.lixbqmp.mongodb.net
MONGO_USER=comp263_2025
MONGO_PASS=***yourpass***
REDIS_URL=redis://localhost:6379
CACHE_TTL_SECONDS=60
      

Keep secrets in .env. Don’t commit.

  • mongodb official driver
  • ioredis robust Redis client
  • Config validated at boot

Connect to MongoDB

Connects to Lab2.Agriculture and validates env.

  • SRV URI + separate auth
  • estimatedDocumentCount() on boot for sanity

const client = new MongoClient(`${HOST}/...`, {
  useNewUrlParser: true, useUnifiedTopology: true,
  auth: { username: USER, password: PASS }, authSource: "admin"
});

Add Redis Client

const Redis = require("ioredis");
const redis = new Redis(process.env.REDIS_URL);
// Optional: redis.on("connect", ...) for logging
      
  • Single instance is fine for class demos
  • Use SETEX/EX TTL for freshness
 

Timing & Response

  • Start timer at request
  • Return { source, timeMs, count, data }
  • Also set X-Response-Time header
function withTimer(handler) {
  return async (req, res) => {
    const t0 = process.hrtime.bigint();
    try { await handler(req, res, t0); }
    catch (e) { res.status(500).json({ error: String(e) }); }
  };
}
function elapsedMs(t0) {
  return Number((process.hrtime.bigint() - t0) / 1000000n);
}

GET /agriculture/mongo

app.get("/agriculture/mongo", withTimer(async (req, res, t0) => {
  const docs = await collection.find({}).limit(500).toArray();
  const body = { 
    source: "mongo", timeMs: elapsedMs(t0), 
    count: docs.length, data: docs 
  };
  res.set("X-Response-Time", body.timeMs + "ms").json(body);
}));
  • Baseline latency (no cache)
  • Limit for demo readability

GET /agriculture/redis

app.get("/agriculture/redis", withTimer(async (req, res, t0) => {
  const key = "agri:all:limit500";
  const ttl = Number(process.env.CACHE_TTL_SECONDS || 60);

  const cached = await redis.get(key);
  if (cached) {
    const data = JSON.parse(cached);
    const body = { 
        source: "redis", timeMs: elapsedMs(t0), 
        count: data.length, data 
    };
    return res.set("X-Response-Time", body.timeMs + "ms").json(body);
  }

  const docs = await collection.find({}).limit(500).toArray();
  await redis.set(key, JSON.stringify(docs), "EX", ttl);
  const body = { 
    source: "mongo→redis(set)", timeMs: elapsedMs(t0), 
    count: docs.length, data: docs 
  };
  res.set("X-Response-Time", body.timeMs + "ms").json(body);
}));
  • Hit: returns in a millisecond or less
  • Miss: fetch + populate cache

See the Difference

# Quick manual check (shows X-Response-Time)
curl -i http://localhost:3000/agriculture/mongo | head
curl -i http://localhost:3000/agriculture/redis | head
curl -i http://localhost:3000/agriculture/redis | head  # 2nd
  • Discuss p50/p95 latencies (see the sketch below)
  • Expect Redis to dominate on repeat calls
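
A small sketch for turning repeated calls into p50/p95 numbers (the run count and endpoint are assumptions; requires Node 18+ for the global fetch):

// Measure p50/p95 latency of an endpoint with repeated requests
const ENDPOINT = "http://localhost:3000/agriculture/redis"; // assumed endpoint from the demo
const RUNS = 50;

function percentile(sorted, p) {
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

async function bench() {
  const times = [];
  for (let i = 0; i < RUNS; i++) {
    const t0 = process.hrtime.bigint();
    const res = await fetch(ENDPOINT);
    await res.text();                                      // include full response in the timing
    times.push(Number(process.hrtime.bigint() - t0) / 1e6); // ms
  }
  times.sort((a, b) => a - b);
  console.log(`p50=${percentile(times, 50).toFixed(1)} ms  p95=${percentile(times, 95).toFixed(1)} ms`);
}

bench().catch(console.error);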

Installing Redis (Ubuntu via WSL2 on Windows, macOS via Homebrew)

Windows (via WSL2 + Ubuntu)

# 1. Install Ubuntu on Windows
wsl --install -d Ubuntu
# Restart when prompted, then open Ubuntu

# 2. Inside Ubuntu
sudo apt update
sudo apt install -y redis-server
sudo service redis-server start
redis-cli ping   # β†’ PONG
      
  • Runs Redis inside Ubuntu under WSL2
  • Works seamlessly with Node.js on Windows
  • Use REDIS_URL=redis://127.0.0.1:6379 in .env

macOS (via Homebrew)

brew update
brew install redis
brew services start redis
redis-cli ping   # β†’ PONG
      
  • Uses Homebrew package manager
  • Start/stop with brew services
  • Redis listens on 127.0.0.1:6379

Your Turn

NoSQL Lab Repository

Performance Comparison: Redis vs MongoDB

Approximate time to retrieve one object at different data scales.

Dataset Size Redis (In-Memory) MongoDB (Disk-Based) Relative Speed
1 record ~1 ms ~100 ms β‰ˆ 100Γ— faster
1,000 records ~2 ms ~120 ms β‰ˆ 60Γ— faster
100,000 records ~5 ms ~250 ms β‰ˆ 50Γ— faster
1,000,000 records ~10 ms ~600 ms β‰ˆ 60Γ— faster

Redis Commands in Node.js: SET & GET

  • SET key value stores a string value.
  • GET key retrieves it.
  • Default behavior: value persists until deleted or replaced.
  • Use for caching simple data (e.g., counts, tokens, JSON).
const Redis = require("ioredis");
const redis = new Redis();

// Store data
await redis.set("crop:rice", JSON.stringify({ yield: 45 }));

// Retrieve data
const value = await redis.get("crop:rice");
console.log(JSON.parse(value));

Redis Commands: SETEX (Expire in Seconds)

  • SETEX key ttl value
  • Common for caching responses temporarily.
  • Expires automatically: no need to delete manually.
  • Node: redis.set(key, value, "EX", ttl)
await redis.set("weather:stockton", 
  JSON.stringify({ temp: 29 }), 
  "EX", 60); // expires in 60 seconds

const cached = await redis.get("weather:stockton");
console.log(cached ? "Cache hit" : "Expired");

Redis Commands: DEL & EXISTS

  • DEL key removes one or more keys.
  • EXISTS key checks if a key is present.
  • Useful for clearing outdated cache or validation.
await redis.del("weather:stockton");

const exists = await redis.exists("weather:stockton");
if (exists) console.log("Still cached");
else console.log("Cache cleared or expired");

Redis Commands: TTL & EXPIRE

  • TTL key shows remaining lifetime (in seconds).
  • EXPIRE key seconds sets a new TTL for an existing key.
  • Can refresh cache without rewriting data.
await redis.set("session:123", "active");
await redis.expire("session:123", 120);

const ttl = await redis.ttl("session:123");
console.log("Time left:", ttl, "seconds");

Redis Commands: MSET & MGET

  • MSET sets multiple keys at once.
  • MGET retrieves multiple keys in one call.
  • Efficient for bulk reads/writes in Node.js.
await redis.mset({
  "crop:wheat": "good",
  "crop:rice": "medium",
  "crop:corn": "low"
});

const values = await redis.mget(
    "crop:wheat", 
    "crop:rice", 
    "crop:corn"
);
console.log(values);

Redis Commands: FLUSHDB & KEYS

  • FLUSHDB removes all keys from the current database.
  • KEYS pattern lists keys matching a pattern.
  • ⚠️ Use with caution: affects all cached data.
const allKeys = await redis.keys("crop:*");
console.log("Found:", allKeys.length, "keys");

await redis.flushdb(); // clears all keys in this DB
console.log("Database cleared");

Agenda

  1. Redis API and Performance ✓
  2. Caching Strategies in Web Apps
  3. Midterm Review
  4. Midterm Mock Exam
  5. Q&A and Discussion
Data Pipeline Performance

Caching Strategy 1: Cache-Aside (Lazy Loading)

  • Application checks cache first.
  • If data is missing β†’ load from DB and store in cache.
  • Common in read-heavy APIs.
  • Simple and cost-efficient.
        graph LR
          A[Client Request] --> B[Check Cache]
          B -->|Hit| C[Return from Cache]
          B -->|Miss| D[Fetch from DB]
          D --> E[Update Cache]
          E --> F[Return to Client]
      

Caching Strategy 2: Write-Through

  • Every write goes to cache and DB together.
  • Keeps cache consistent with DB.
  • Higher write latency, but consistent reads.
        graph LR
          A[Client Write] --> B[Update Cache]
          B --> C[Write to DB]
          C --> D[Success Response]
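
A minimal write-through sketch in Node.js, reusing the ioredis and mongodb clients from the earlier demo (the crops collection and 300 s TTL are illustrative):

// Write-through: every write updates the database and the cache together
async function saveCrop(redis, collection, crop) {
  // 1) write the durable copy to MongoDB
  await collection.updateOne({ _id: crop._id }, { $set: crop }, { upsert: true });
  // 2) write the cached copy, with a TTL so it cannot go stale forever
  await redis.set(`crop:${crop._id}`, JSON.stringify(crop), "EX", 300);
  return crop;
}

// usage: await saveCrop(redis, db.collection("crops"), { _id: "crop-123", name: "Tomato" });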
      

Caching Strategy 3: Write-Behind (Write-Back)

  • Writes go to the cache first.
  • Cache flushes to the database asynchronously.
  • Fast writes; risk if cache fails before flush.
flowchart LR
A["Client Write"] --> B["Write to Cache"]
B --> C["Async Queue"]
C --> D["DB Update (Delayed)"]

Caching Strategy 4: Read-Through

  • App always queries the cache.
  • Cache fetches from DB automatically on a miss.
  • Cache layer abstracts DB access logic.
        graph LR
          A[Client Request] --> B[Cache Layer]
          B -->|Hit| C[Return from Cache]
          B -->|Miss| D[Cache Loads from DB]
          D --> E[Store in Cache & Return]
      

Caching Strategy 5: TTL / Expiration-Based

  • Each key has a time-to-live (TTL).
  • Expired data automatically removed.
  • Ensures freshness and reduces stale data.
  • Common for APIs with time-sensitive data (e.g., weather).
        graph TD
          A[Data Stored] --> B[Timer Starts]
          B -->|TTL Expires| C[Auto Delete]
          C --> D[Cache Miss Next Time]
          D --> E[Reload from DB]
      

Agenda, Oct/14

  1. Midterm Mock Exam
  2. Midterm Review
  3. Q&A and Discussion
 

Midterm Exam Structure

  • Duration: 10:05 - 11:15 AM
  • 20 Questions
  • Each question: 5 points
  • Scope 1/3: HW 1-3, Lab 1-3, Mod 1-7
  • Scope 2/3: IndexedDB, MongoDB, Neo4j, Redis
  • Scope 3/3: Read code, not write
  • Closed book, no computer
  • On paper, please bring pens
  • Read each question carefully
  • If unsure, ask for clarification
  • Manage your time wisely
  • Show all work clearly
  • Check name/student id before submitting
  • Check answers before submitting

COMP 263: Midterm Review

Agenda, Oct/14

  1. Midterm Mock Exam ✓
  2. Midterm Review
  3. Q&A and Discussion
 

Module 1

Which β€œV” primarily addresses the challenge of many formats (logs, images, JSON) in one system?

  1. Veracity
  2. Variety
  3. Velocity
  4. Value

Answer: B: Variety

Module 1

Why must IndexedDB operations be awaited in JavaScript?

  1. They always run synchronously in the UI thread
  2. They require a service worker to complete
  3. They are asynchronous disk I/O behind a Promise
  4. They are limited to WebAssembly callbacks

Answer: C: Asynchronous I/O (Promise)

Module 1

What is automatically indexed in an IndexedDB object store?

  1. The primary key
  2. All string fields
  3. All numeric fields
  4. Every nested property

Answer: A: Primary key

Module 1

Which choice best explains why rigid SQL schemas struggle with fast-changing app fields?

  1. SQL cannot store text
  2. SQL disallows new tables
  3. SQL has no indexes
  4. Require coordinated migrations

Answer: D: Migrations required

Module 2

Which practice improves read performance for common UI queries in document models?

  1. Eliminating all redundancy
  2. Using only flat structures
  3. Denormalizing fields
  4. Storing dates as locale strings

Answer: C: Denormalize frequent fields

Module 2

Which identifier minimizes cross-system collisions without coordination?

  1. Auto-increment integer
  2. UUID (v4)
  3. Concatenated username+date
  4. Sequential timestamp

Answer: B: UUID v4

Module 2

What does the β€œZ” in 2025-09-04T10:15:30Z denote?

  1. UTC (Zulu) time
  2. Local browser timezone
  3. Zero-padded seconds
  4. Zigzag encoding

Answer: A: UTC

Module 2

When is embedding (not reference) generally preferred in JSON document design?

  1. When the related entity is reused by many parents
  2. When many-to-many links dominate
  3. When the child changes independently and often
  4. When data is always fetched together (tight 1:1)

Answer: D: Always fetched together

Module 3

Which metric ensures each real-world entity appears only once?

  1. Timeliness
  2. Uniqueness
  3. Accuracy
  4. Validity

Answer: B: Uniqueness

Module 3

What is the primary benefit of a UNIQUE index on a field such as email?

  1. Prevents duplicate values
  2. Forces numeric storage
  3. Replicates values across shards
  4. Compresses documents

Answer: A: Prevent duplicates

Module 3

A dataset missing required fields most directly violates which quality dimension?

  1. Consistency
  2. Veracity
  3. Integrity
  4. Completeness

Answer: D: Completeness

Module 3

Stale sensor values mainly reduce which property?

  1. Uniqueness
  2. Validity
  3. Timeliness
  4. Integrity

Answer: C: Timeliness

Module 4

In ELT, where is heavy transformation primarily performed?

  1. In the target store
  2. In a pre-processing appliance
  3. At the message broker
  4. On the client UI thread

Answer: A: In the target store

Module 4

Which workload fits streaming best?

  1. Quarterly revenue consolidation
  2. Nightly dimension table rebuild
  3. Monthly GDPR export
  4. Millisecond IoT data ingestion

Answer: D: IoT telemetry

Module 4

In MapReduce, which phase combines intermediate key groups into final values?

  1. Map
  2. Reduce
  3. Shuffle
  4. Partition

Answer: B: Reduce

Module 4

Which choice improves throughput for many small writes?

  1. Per-record transactions
  2. Deeply nested documents
  3. Batching operations
  4. Client-side polling

Answer: C: Batch operations

Module 5

Which property distinguishes a data lake from a warehouse?

  1. Schema-on-write
  2. Schema-on-read
  3. Requires star schema
  4. SQL-only ingestion

Answer: B: Schema-on-read

Module 5

Where are raw, unprocessed files usually stored before cleaning and structuring?

  1. Inside the main analytics dashboard
  2. In the initial lake for incoming data
  3. In the reporting database for end users
  4. In the API response cache

Answer: B: In the initial lake for incoming data

Module 5

Which query pattern is a strength of a graph database?

  1. Multi-hop relationship traversal
  2. Massive table scans with joins
  3. Wide-column time bucketing
  4. Fast multi-node joins

Answer: A: Multi-hop traversal

Module 5

Which component most directly decouples producers and consumers?

  1. Star schema
  2. Document index
  3. Message broker / queue
  4. Foreign key

Answer: C: Broker/Queue

Module 6

Which design is typical for an analytical warehouse model?

  1. Third normal form only
  2. Adjacency lists
  3. Highly nested JSON
  4. Star/snowflake schema

Answer: D: Star/Snowflake

Module 6

How do relational databases typically support aggregation such as SUM or AVG?

  1. By duplicating data into multiple temporary tables
  2. By relying on external scripting languages for computation
  3. By using SQL aggregate functions processed by the query engine
  4. By exporting data to a file before calculation

Answer: C: SQL aggregate functions in the query engine

Module 6

What is the main goal of moving data from a data lake into a data warehouse?

  1. To store unstructured data in raw format
  2. To organize and optimize data for analytical queries
  3. To remove historical records and reduce cost
  4. To bypass schema validation entirely

Answer: B: Organize data for analytics

Module 6

During transformation from data lake to data warehouse, which step typically occurs?

  1. Deleting metadata and lineage records
  2. Compressing data without schema enforcement
  3. Cleaning, joining, and aggregating curated datasets
  4. Moving JSON directly without structural checks

Answer: C: Clean, join, and aggregate curated data

Module 7

In cache-aside, what happens on a cache miss?

  1. Client returns 404 immediately
  2. App loads from DB, populates cache, returns result
  3. Cache fabricates a default value
  4. DB writes are skipped

Answer: B: Load DB β†’ populate cache β†’ return

Module 7

Which Redis option sets expiration when writing a value?

  1. SET ... NX
  2. SET ... PXAT only
  3. PERSIST
  4. SET key value EX <seconds>

Answer: D: SET ... EX <seconds>

Module 7

What best measures API latency difference between Mongo and Redis endpoints?

  1. Server-side timing (e.g., timeMs & X-Response-Time)
  2. Client clock only
  3. Counting HTTP headers
  4. Comparing payload sizes alone

Answer: A: Server-side timing

Module 7

Which practice balances speed and freshness for frequently changing data?

  1. Infinite TTL on all keys
  2. Delete cache on every read
  3. Short TTL with cache-aside reloads
  4. Write-through for read-only data

Answer: C: Short TTL + cache-aside

Questions?

Agriculture Dataset: Data Quality

  1. Are JSON keys consistent in naming?
  2. How to handle missing yield values?
  3. Is harvest_date in valid ISO format?
  4. How to standardize region names?
  5. How to record units accurately for yield?
  6. Which field relationships need validation?
  7. How to find duplicate farm_id?

{
  "farm_id": "F123",
  "crop": "rice",
  "region": "California",
  "yield": "72 tons",
  "soil": {
    "type": "clay",
    "ph": 9.8
  },
  "harvest_date": "10-09-25"
}
Questions?
Midterm (Oct/16)

Midterm Exam Structure

  • Duration: 10:05 - 11:15 AM
  • 20 Questions
  • Each question: 5 points
  • Scope 1/3: HW 1-3, Lab 1-3, Mod 1-7
  • Scope 2/3: IndexedDB, MongoDB, Neo4j, Redis
  • Scope 3/3: Read code, not write
  • Closed book, no computer
  • On paper, please bring pens
  • Read each question carefully
  • If unsure, ask for clarification
  • Manage your time wisely
  • Show all work clearly
  • Check name/student id before submitting
  • Check answers before submitting
Q&A and Discussion

Today, October 21, 2025

  • Midterm Review
  • Database Pipeline Architecture and Data Dashboard
  • Column-based Database - ClickHouse
ClickHouse Columnar Database Diagram
Midterm Review

Database Pipeline Architecture and Data Dashboard

  • Metric
  • KPI
  • Target
  • Benchmark
Book Cover

Metric

  • What it is: A single measurement.
  • Agriculture example: Rainfall today = 8 mm.
  • Use: Simple observation collected daily.
{
  "metric": "rainfall",
  "value": 8,
  "unit": "mm",
  "timestamp_utc": "2025-10-21T00:00:00Z"
}
Raindrops on leaves

Key Performance Indicator (KPI)

  • A group of metrics that describe performance or trend.
  • Average weekly rainfall across all fields.
  • Use: Helps farmers understand overall water supply.
{
  "kpi": "weekly_rainfall",
  "average_value": 42,
  "unit": "mm/week",
  "week_start_utc": "2025-10-20T00:00:00Z",
  "data_points": 7
}
Weekly Rainfall Chart

Target

  • A goal or threshold value for a KPI.
  • Target weekly rainfall = 40 mm for optimal crop growth.
  • Guides irrigation planning when rainfall is below target.
{
  "target": "weekly_rainfall",
  "goal_value": 40,
  "unit": "mm/week",
  "valid_from_utc": "2025-10-01T00:00:00Z",
  "valid_to_utc": "2025-12-31T23:59:59Z"
}
Rainfall KPI with Target and Threshold

Benchmark

  • Reference point by crop type, season, or region.
  • Average weekly rainfall needs: Corn = 40 mm
  • Compare crop’s rainfall KPI to historical benchmark
{
  "benchmark": "weekly_rainfall_by_crop",
  "crops": {
    "corn": 40,
    "wheat": 30,
    "soybean": 35
  },
  "unit": "mm/week",
  "reference_period": "2024",
  "source": "historical_weather_data"
}
Crop-specific Rainfall Benchmark

How They Relate

Relationship Description Example
Metric β†’ KPI Daily rainfall readings combine to calculate weekly averages. {"rainfall_daily":[8,5,6,7,10,4,2]}
KPI ↔ Benchmark Compare weekly rainfall KPI to historical averages by crop type. Corn = 40 mm, Wheat = 30 mm, Soybean = 35 mm
KPI β†’ Target If the KPI is below the 40 mm target, schedule additional irrigation. Current = 32 mm β†’ Add irrigation
Loop Metrics feed into KPI β†’ compared to benchmark β†’ adjusted toward target. Continuous weekly update and correction cycle

From KPI to Aggregate Functions

KPIs are often derived from aggregate functions applied to raw metrics. These functions summarize multiple values to reveal overall trends or conditions.

  • MIN: Lowest rainfall recorded this week.
  • MAX: Highest rainfall recorded this week.
  • SUM: Total rainfall across all days.
  • AVG: Average rainfall: the weekly KPI value.
{
  "rainfall_daily": [8, 5, 6, 7, 10, 4, 2],
  "min": 2,
  "max": 10,
  "sum": 42,
  "avg": 6.0
}
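
A small sketch showing how those aggregate values can be derived from the raw daily metrics:

// Derive MIN, MAX, SUM, AVG from raw daily rainfall metrics
const rainfallDaily = [8, 5, 6, 7, 10, 4, 2];

const sum = rainfallDaily.reduce((acc, v) => acc + v, 0);
const kpi = {
  min: Math.min(...rainfallDaily),
  max: Math.max(...rainfallDaily),
  sum,
  avg: +(sum / rainfallDaily.length).toFixed(1), // weekly KPI value
};

console.log(kpi); // { min: 2, max: 10, sum: 42, avg: 6 }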
ClickHouse Columnar Database Diagram

Today, October 23, 2025

  • Brief Quiz
  • Column-based Database - ClickHouse
  • DB Pipeline Architecture and Data Dashboard
  • Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram
Brief Quiz
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j
Column-based Database - ClickHouse
History of ClickHouse
  • 2009: Built at Yandex for web analytics
  • 2014: Scaled for distributed queries
  • 2016: Open-sourced under Apache 2.0
  • 2018: Adopted by major tech companies
  • 2021: ClickHouse Inc. established
  • 2023: Cloud service officially launched
  • GitHub: ClickHouse Repo
ClickHouse Logo
SQL: The Query Language
  • 1970: Based on Codd’s relational model
  • 1974: IBM developed SEQUEL (Structured English Query Language)
  • 1986: Adopted as ANSI SQL standard
  • 1990s: Widely used in relational databases
  • 2000s: Extended for analytics and joins
  • Today: Common across modern engines (incl. ClickHouse)
SQL Database Illustration
ClickHouse: Supported Data Types
  • Numeric: Int8–Int256, UInt8–UInt256, Float32, Float64
  • Boolean: UInt8 used for true / false
  • String: String, FixedString(N)
  • Date & Time: Date, Date32, DateTime, DateTime64
  • Array: Array(T) for lists of any type
  • Tuple: Tuple(T1, T2, …)
  • Map: Map(KeyType, ValueType)
  • Enum: Enum8 / Enum16 for labeled constants
  • UUID: 128-bit unique identifiers
  • Nested: groups of arrays forming sub-tables
  • LowCardinality: optimized string storage
  • Nullable: supports NULL for any type
CREATE TABLE example_types
(
  id UInt32,
  name String,
  score Float64,
  tags Array(String),
  metrics Map(String, Float32),
  status Enum8('ok'=1, 'fail'=2),
  created_at DateTime64(3),
  uid UUID,
  extra Nullable(String)
)
ENGINE = MergeTree() ORDER BY id;

-- ClickHouse adds Array, Tuple, Map, Enum, LowCardinality, Nested, UUID
      
ClickHouse: Columnar SELECT
  • Why: scan only needed columns
  • Why: high compression, fast I/O
  • Why: vectorized execution
SELECT user_id, ts, duration
FROM logs
WHERE event = 'page_view'
LIMIT 3;

β”Œβ”€user_id─┬────────ts────────┬─duration─┐
β”‚ 101     β”‚ 2025-10-23 10:00 β”‚ 120      β”‚
β”‚ 102     β”‚ 2025-10-23 09:59 β”‚ 95       β”‚
β”‚ 103     β”‚ 2025-10-23 09:58 β”‚ 134      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
ClickHouse: Aggregations
  • Why: sub-second on billions
  • Why: built-in aggregate functions
  • Why: no pre-rollups needed
SELECT url, count() AS views, avg(duration) AS avg_ms
FROM logs
GROUP BY url
ORDER BY views DESC
LIMIT 3;

β”Œβ”€url──────────────┬─views─┬─avg_ms─┐
β”‚ /home            β”‚ 12500 β”‚ 112.4  β”‚
β”‚ /products        β”‚  9432 β”‚ 98.3   β”‚
β”‚ /checkout        β”‚  2831 β”‚ 145.6  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
ClickHouse: ARRAY JOIN
  • Why: flatten arrays directly
  • Why: simple query-based unnest
  • Why: efficient on nested JSON
SELECT user_id, action
FROM user_activity
ARRAY JOIN actions AS action
LIMIT 3;

β”Œβ”€user_id─┬─action───┐
β”‚ 101     β”‚ click    β”‚
β”‚ 101     β”‚ scroll   β”‚
β”‚ 102     β”‚ view     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
ClickHouse: Window Functions
  • Why: moving averages and ranks
  • Why: session and cohort analysis
  • Why: SQL-standard syntax
SELECT
  user_id,
  ts,
  avg(duration) OVER (
    PARTITION BY user_id
    ORDER BY ts
    ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
  ) AS avg_last_5
FROM sessions
LIMIT 3;

β”Œβ”€user_id─┬────────ts────────┬─avg_last_5─┐
β”‚ 101     β”‚ 2025-10-23 10:00 β”‚ 98.4       β”‚
β”‚ 101     β”‚ 2025-10-23 10:05 β”‚ 102.7      β”‚
β”‚ 101     β”‚ 2025-10-23 10:10 β”‚ 95.8       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
ClickHouse: Inserts
  • Why: handle millions of rows/sec
  • Why: batch or streaming supported
  • Why: flexible input formats
INSERT INTO logs (ts, user_id, event, duration)
VALUES (now(), 101, 'view', 120),
       (now(), 102, 'click', 85);

Query OK, 2 rows inserted.
Elapsed: 0.002 sec.
      
ClickHouse: System Tables
  • Why: internal observability
  • Why: track query performance
  • Why: inspect table storage
SELECT query_kind, read_rows, memory_usage
FROM system.query_log
LIMIT 2;

β”Œβ”€query_kind─┬─read_rows─┬─memory_usage─┐
β”‚ Select     β”‚ 21000     β”‚ 15.2 MiB     β”‚
β”‚ Insert     β”‚ 2000      β”‚ 2.8 MiB      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
The Dashboard Effect
  • Real-Time Data Monitoring (incl. Clicks)
  • Improved Decision-Making
  • Improved Team Collaboration
  • Increased Efficiency and Productivity
  • Better Performance Tracking
The Dashboard Effect Book Cover
Row vs Column Storage
Row-Based vs Column-Based Storage
Row vs Column Storage
How columnar databases store and access data
  • Data is stored column by column instead of row by row
  • Each column is stored as a contiguous block on disk
  • Efficient for scanning specific columns during queries
  • Enables better compression due to uniform data types
  • Reduces I/O by skipping irrelevant columns
  • Ideal for analytics and aggregation use cases
Columnar Storage Diagram
Row-Based Tables

Overview & Disadvantages

  • Row-Based Storage
    • Stores entire rows together on disk
    • Optimized for transactional workloads (OLTP)
    • Used in systems like MySQL and PostgreSQL
  • Disadvantages
    • Inefficient for analytical queries (OLAP)
    • Loads full rows even when only one column is needed
    • Limited compression opportunities
    • Slower for large-scale filtering and aggregation
When not to use a columnar database
When Should I Use a Columnar Database?
  • Heavy read and analytics workloads
  • Large-scale aggregation and filtering
  • Dashboards and business intelligence tools
  • Event logging or time-series data
  • Need for high compression and fast scans
When to Use Row vs Column-Based Databases
Query Type → Best Fit
Inserts and Updates → Row-Based DB
Single-row lookups → Row-Based DB
Aggregations over large datasets → Column-Based DB
Reporting and analytics → Column-Based DB
Real-time dashboards → Column-Based DB
When to Use Columnar vs Row-Oriented Databases
Use Columnar Databases When... Use Row-Oriented Databases When...
Your primary use case is analytics Your primary use case is transactions
You need low query latency for large scans You don't need low latency for analytics
You have large amounts of analytical data You're working with smaller data sets
You don't need strict ACID compliance You need strict ACID compliance
You're using event sourcing principles You perform frequent, small updates or deletes
You store and analyze time series data You access records by unique IDs
Compression and performance advantages
Compression in Column-Based Databases
  • Columns store uniform data types
  • Enables use of efficient compression algorithms
  • Examples: run-length encoding, dictionary encoding
  • Significantly reduces storage footprint
  • Faster query performance due to fewer bytes read
  • Compression often applied automatically by the engine
Compression Diagram
Aggregation Query: Time Complexity
Storage Type Aggregation Time Complexity Reason
Row-Based O(n Γ— m) Scans all rows and all columns
Column-Based O(n) Scans only the relevant column

n = number of rows, m = number of columns

Indexing strategies in column stores
Indexing in Column-Based Databases
Concept Column-Based DB Row-Based DB
Data Layout Columns stored independently Rows stored together
Indexing Need Often reduced due to scan efficiency Critical for performance
Default Access Column filters without indexes are fast Needs indexes for efficient filtering
Index Type Skip indexes, min-max indexes, sparse indexes B-tree, hash, GIN, GiST
Compression & Filtering Efficient filtering via compressed column blocks Filtering depends on explicit indexes
Example ClickHouse: Primary index is skip-based by default PostgreSQL: Uses B-tree indexes by default
Comparing PostgreSQL vs. ClickHouse
ClickHouse vs PostgreSQL: Key Differences
Feature ClickHouse PostgreSQL
Storage Model Column-based Row-based
Primary Use Case Analytics (OLAP) Transactions (OLTP)
Indexing Skip index (min-max, sparse) B-tree, hash, GIN, GiST
Write Performance High-throughput batch inserts Optimized for single-row inserts
Read Performance Fast scans on large datasets Fast lookups and joins
Compression Built-in, column-aware compression Optional extensions
SQL Support Subset of ANSI SQL (e.g., no foreign keys, limited transaction support) Full ANSI SQL with rich extensions
ACID Compliance Eventual consistency, no full ACID Full ACID compliance
Open Source Columnar Databases
  • MonetDB – monetdb.org (2004, Mozilla Public License 2.0) – Supports SQL
  • Apache Druid – druid.apache.org (2012, Apache License 2.0) – Supports SQL
  • ClickHouse – clickhouse.com (2016, Apache License 2.0) – Supports SQL
  • Apache Kudu – kudu.apache.org (2016, Apache License 2.0) – Supports SQL via Impala
  • Apache Parquet – parquet.apache.org (2013, Apache License 2.0) – Columnar storage format; SQL support depends on engine
  • DuckDB – duckdb.org (2019, MIT License) – Supports SQL
Average Queries on trips Table
Description SQL Query
Average trip distance
SELECT AVG(trip_distance) FROM trips;
Average fare amount
SELECT AVG(fare_amount) FROM trips;
Average tip by payment type
SELECT payment_type_, AVG(tip_amount) FROM trips GROUP BY payment_type_;
Average total amount by vendor
SELECT vendor_id, AVG(total_amount) FROM trips GROUP BY vendor_id;
ClickHouse Playground
  • Web-based platform to experiment with ClickHouse queries
  • No setup required; start querying instantly
  • Access to various sample datasets for testing
  • Supports SQL queries with real-time results
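
For a quick check from Node.js, ClickHouse also exposes an HTTP interface (default port 8123); a minimal sketch assuming a local server and the logs table from the earlier examples:

// Query ClickHouse over its HTTP interface (Node 18+ global fetch, local server assumed)
const query = `
  SELECT url, count() AS views, avg(duration) AS avg_ms
  FROM logs
  GROUP BY url
  ORDER BY views DESC
  LIMIT 3
  FORMAT JSONEachRow
`;

async function run() {
  const res = await fetch("http://localhost:8123/", { method: "POST", body: query });
  if (!res.ok) throw new Error(await res.text());
  const text = await res.text(); // one JSON object per line
  text.trim().split("\n").forEach((line) => console.log(JSON.parse(line)));
}

run().catch(console.error);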

Today, October 23, 2025

  • Brief Quiz ✓
  • Column-based Database - ClickHouse ✓
  • DB Pipeline Architecture and Data Dashboard
  • Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram
From Dashboard Questions to Database Pipeline
Dashboard First
  • What is the Benchmark?
  • What is the Target?
  • What is the KPI?
  • What is the Metric?
Dashboard Questions Illustration
Case Study: Agriculture Dashboard

A vineyard in California struggled with soil moisture variations, risking crop quality. The benchmark from state guidelines required maintaining 25–35% moisture.

The farm set a target of keeping soil moisture near 30% using IoT-based irrigation control.

The main KPI was average soil moisture per zone, updated hourly through a data pipeline.

Supporting metrics included pump runtime, water flow rate, and daily irrigation volume. Within weeks, dashboard insights cut water use by 15% while staying within target range.

Smart Agriculture Dashboard Illustration

Today, October 23, 2025

  • Brief Quiz ✓
  • Column-based Database - ClickHouse ✓
  • DB Pipeline Architecture and Data Dashboard ✓
  • Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram

Today, October 28, 2025

  1. Brief Review: 3 Questions on Paper
  2. Brief Review: Big Data and CAP Theorem
  3. Database Sharding and Replication
  4. MongoDB and CassandraDB
  5. Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram
Brief Review: 3 Questions on Paper
Brief Review: Big Data and CAP Theorem

The 5 Vs of Big Data

  • Veracity: Ensure data accuracy and reliability
  • Variety: Ingest and process data in multiple formats
  • Velocity: Real-time data processing capabilities
  • Value: Analytics tools to derive actionable insights
  • Volume: Data scales to petabytes (15 zeros) and beyond.
The 5 Vs of Big Data

Visualizing Data Volume

  • 1 GB ≈ 200,000 emails
  • 50 GB ≈ 5,000 photos
  • 60 GB ≈ 1 hour GPS/s
  • 100 GB ≈ 20 h video
  • 200 GB ≈ 1 h driving

Limitations of SQL Databases with High Row Volume

  • Performance Issues: Query response time degrades with increased data volume
  • Scalability Constraints: Hard limits on vertical scaling / complexities in horizontal scaling
  • Maintenance Overheads: Intensive resource needs for indexing and transaction locks
  • Cost Implications: Continuous increase in cost

How to Scale?

Distribute (not Replicate) Data

Horizontal Scaling

Adding more machines or nodes to a system to increase capacity and distribute load.

Vertical Scaling

Increasing the capacity of an existing machine by adding more resources such as CPU or RAM.

Database Scalability Diagram

Horizontal Scaling Leads to Distributed Databases

  • Data spread across multiple nodes.
  • Each node manages part of the data.
  • Enables faster and fault-tolerant systems.

Distributed Databases lead to the CAP Theorem balancing Consistency, Availability, and Partition Tolerance.

Distributed Database Diagram

CAP Theorem

  • C: Consistency: All reads see the latest write
  • A: Availability: Every request gets a response
  • P: Partition Tolerance: Works despite network splits

Under a partition, you can favor C or A, not both.

CAP Theorem triangle
Database Sharding and Replication

Database Sharding and Replication

  • Sharding: Distributing data
  • Replication: Copying data

Together, they improve scalability, availability, and resilience of large systems.

ClickHouse Columnar Database Diagram

Benefits and Drawbacks of Sharding

Benefits
  • Enables horizontal scaling across multiple servers.
  • Improves query performance through data distribution.
  • Improves system reliability and fault isolation.
Drawbacks
  • Increases complexity of data management and queries.
  • Risk of uneven shard sizes over time (data imbalance).
  • Requires careful shard key selection and monitoring.

Sharding is Complex - SQL to NoSQL

For SQL Check Temporary SW/Data Options First

Illustration of Scaling
Database → CAP focus
MongoDB/Redis → CP (Consistency + Partition Tolerance)
Cassandra → AP (Availability + Partition Tolerance)
RDBMS → CA (Consistency + Availability)

  • MongoDB/Redis: Focus on consistency and partition tolerance
  • Cassandra: Prioritizes availability and partition tolerance
  • RDBMS: Emphasizes consistency and availability for transactions

MongoDB Sharding Components

  • Shard Key: Field that defines data distribution.
  • Chunks: Data divided and assigned to shards.
  • Shards: Replica sets storing subsets of data.
  • Node: MongoDB instance (primary or secondary).
  • Query Router (mongos): Sends queries to correct shard.
  • Config Server: Stores shard and chunk metadata.
  • Balancer: Keeps data evenly distributed.
MongoDB Sharding Components Diagram

Replica Set vs Shard: Agriculture Data Example

Aspect Replica Set Shard
Purpose Provides redundancy and failover. Distributes data for scalability.
Data Stored Same crop data copied across nodes. Different crop regions stored on different shards.
Example All nodes store β€œWheat yield 2024” dataset. Shard A = Wheat (North farms), Shard B = Corn (South farms).
Goal Reliability and availability. Scalability and performance.

Role of the Primary in MongoDB

  • Writes: Only the primary accepts write operations.
  • Consistency: Holds latest data; secondaries replicate it.
  • Election: A new primary is chosen if one fails.
  • Clients: Connect mainly to the primary for reads/writes.
MongoDB Primary Role Diagram

MongoDB: Network Partition Logic

  • Partition: Nodes lose connectivity, creating isolated groups
  • Availability Priority: MongoDB favors availability (AP) to stay operational
  • Primary Election: Primary unreachable, new primary is elected in connected partition
  • Inconsistencies: Data may be temporarily inconsistent across partitions.
  • Re-Sync: Once resolved, partitions synchronize data to restore consistency.
Illustration of Scaling

Questions on MongoDB Sharding

  1. How does MongoDB handle writes if the primary node in a shard fails?
  2. What impact does a missing shard have on read operations?
  3. How would you choose an optimal shard key for a high-traffic app?
  4. How does MongoDB maintain data consistency within a shard?
  5. When would you prioritize availability over strict consistency?

Sample Answers

  1. MongoDB elects a new primary from the remaining nodes in the shard's replica set
  2. Queries for data on the missing shard will fail; unaffected shards continue serving data
  3. Choose a shard key with high cardinality and even distribution (e.g., user ID)
  4. Secondaries replicate from the primary node to keep data consistent within the shard
  5. Prioritize availability in applications where uptime is more critical than strict consistency

Today, October 30, 2025

  1. Brief Review: 2 Questions on Paper
  2. Brief Review: Big Data and CAP Theorem
  3. Concept of Eventual Consistency
  4. MongoDB and CassandraDB
  5. Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram
Brief Review: 2 Questions on Paper
Brief Review: Big Data and CAP Theorem
Concept of Eventual Consistency
MongoDB and CassandraDB

What is Cassandra?

  • Open-source, distributed NoSQL wide-column store
  • Designed for massive scale, no single point of failure
  • Prioritizes availability & partition tolerance (AP in CAP)
  • Built for Systems Engineers, less for Developers
  • github.com/apache/cassandra
Cassandra logo

Cassandra Chronology

  • 2007 – Developed at Facebook for Inbox Search
  • 2008 – Open-sourced on Google Code
  • 2009 – Entered Apache Incubator
  • 2010 – Became Apache top-level project
  • 2015+ – Adopted by Netflix, Instagram, Uber, etc.
Cassandra version history

Cluster Architecture

  • Peer-to-peer design, no master node
  • Gossip protocol for node discovery
  • Data partitioned via consistent hashing
Cassandra cluster ring

Gossip Protocol in Cassandra

  • Nodes share status info with random peers
  • No master, fully peer-to-peer
  • Spreads state quickly across the cluster
  • Shares node health, schema, and load data
  • Seed nodes help new nodes join

Keeps cluster consistent and fault-aware.

Cassandra Gossip Protocol Diagram

Data Model & CQL

  • Keyspace β†’ Table β†’ Rows β†’ Columns.
  • CQL syntax is SQL-like, but the data model is denormalized.
  • CQL Documentation
CREATE KEYSPACE farm_data
 WITH REPLICATION = {'class':'SimpleStrategy', 'replication_factor':3};

CREATE TABLE crop_yields (
  farm_id text,
  crop_type text,
  year int,
  yield double,
  PRIMARY KEY ((farm_id), crop_type, year)
);
      
Cassandra table structure

Write & Read Path

  • Coordinator: Routes client requests to replicas
  • RF (Replication Factor): number of data copies in the cluster
  • Consistency: ONE, QUORUM, or ALL

Coordinator enforces consistency across replicas

Cassandra write path

Why Cassandra Chooses Availability

  • Remains available during node or network failure.
  • Writes accepted even if some replicas are unreachable.
  • Consistency achieved later via replication and repair.

β†’ Cassandra is AP (Availability + Partition Tolerance)

Cassandra write path

Agriculture Sensor Data Example

  • IoT sensors generate time-series data
  • Scales for thousands of farms globally
  • Optimized for high write throughput
CREATE TABLE sensor_data (
  farm_id text,
  sensor_id text,
  ts timestamp,
  temp float,
  moisture float,
  PRIMARY KEY ((farm_id, sensor_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
      
Agriculture IoT sensors
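
A minimal Node.js sketch against the table above, assuming the cassandra-driver package and a local single-node cluster (the contact point and data center name are illustrative):

// Write and read sensor_data with the DataStax Node.js driver (npm install cassandra-driver)
const cassandra = require("cassandra-driver");

const client = new cassandra.Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1", // must match the cluster's DC name
  keyspace: "farm_data",
});

async function run() {
  // write one reading (QUORUM = a majority of replicas must acknowledge)
  await client.execute(
    "INSERT INTO sensor_data (farm_id, sensor_id, ts, temp, moisture) VALUES (?, ?, ?, ?, ?)",
    ["farm-7", "S-001", new Date(), 21.7, 34.8],
    { prepare: true, consistency: cassandra.types.consistencies.quorum }
  );

  // read the latest readings for that sensor (clustering order is ts DESC)
  const result = await client.execute(
    "SELECT ts, temp, moisture FROM sensor_data WHERE farm_id = ? AND sensor_id = ? LIMIT 3",
    ["farm-7", "S-001"],
    { prepare: true }
  );
  result.rows.forEach((row) => console.log(row.ts, row.temp, row.moisture));
  await client.shutdown();
}

run().catch(console.error);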

High Cardinality

  • Means many unique key values
  • Example: user_id, sensor_id
  • Spreads data evenly across nodes
  • Prevents hotspots and imbalance

Eventual Consistency

  • Replicas may differ right after a write
  • System stays available during updates
  • Data converges as replicas sync over time
  • Achieved via read repair and background sync

Partition Key

  • Determines where data is stored in the cluster
  • Same key β†’ same node and partition
  • Used for data distribution and lookup
  • Example: (farm_id)

Clustering Key

  • Defines how rows are ordered within a partition
  • Supports range queries and sorting
  • Example: (sensor_id, ts)

Example Table

CREATE TABLE sensor_data (
  farm_id text,
  sensor_id text,
  ts timestamp,
  value float,
  PRIMARY KEY ((farm_id), sensor_id, ts)
);
            

Partition Key: farm_id β†’ decides node placement
Clustering Keys: sensor_id, ts β†’ order within partition

Best Practices

  • Design tables for queries (no joins)
  • Choose partition keys carefully to avoid hotspots
  • Monitor compaction, tombstones, and repair frequency
  • Use QUORUM for balanced consistency vs availability
Compaction and tombstones

Questions?

Agriculture Data Pipeline: Choosing the Right Database

Pipeline Stage Data Type / Need Best Database Reason
Sensor Ingestion High-volume IoT readings (temperature, soil, humidity) ? Handles fast writes, distributed and fault-tolerant
Short-term Analytics Recent sensor trends (last 24h) ? Flexible schema, easy aggregation pipelines
Long-term Storage Historical yields and weather data ? Supports structured queries and analytics
Real-time Alerts Threshold-based triggers (low moisture) ? In-memory, fast pub/sub messaging
Visualization & Dashboards Aggregated farm metrics and KPIs ? Optimized for analytics and fast group-by queries

Data Storage Distribution Questions

  1. Should a data lake be distributed for large data volumes?
  2. Is distribution needed for a data warehouse's performance?
  3. Does a data mart require distribution for efficiency?

Today, November 4, 2025

  • Why migrate: scalability, flexibility, performance
  • Model shift: relational β†’ document
  • 1:1 β†’ merge into one document
  • 1:N β†’ embed child array
  • N:N β†’ use references or link collection
  • Case Study
  • Homework, Lab, Project
  • Review Distributed Databases
SQL to NoSQL Migration Diagram

Volume: Why migrate?

  • Massive farm data exceeds vertical limits
  • Horizontal sharding handles big datasets
  • Denormalized documents reduce joins
  • Better compression for repetitive data
-- SQL
SELECT f.id, f.name, h.crop, h.yield_kg
FROM farmers f
JOIN harvest_lots h ON h.farmer_id = f.id
WHERE f.id = 42;
-- MQL
db.farmers.find(
  { _id: 42 },
  { name: 1, harvestLots: 1 }
);
{
  "_id": 42,
  "name": "Alice Farms",
  "harvestLots": [
    { "crop": "Wheat", "yieldKg": 18000 },
    { "crop": "Corn",  "yieldKg": 24500 }
  ]
}

Velocity: Why migrate?

  • High-frequency sensor data overloads SQL
  • Append-only writes scale horizontally
  • Partitioned by time or field
  • Native time-series storage in NoSQL
-- SQL
INSERT INTO soil_events(sensor_id, ts, moisture, temp)
VALUES (17, NOW(), 23.4, 21.7);
-- MQL
db.soilSensors.updateOne(
  { _id: "sensor-17" },
  { $push: { readings: { ts: ISODate(), moisture: 23.4 } } },
  { upsert: true }
);
{
  "_id": "sensor-17",
  "readings": [
    { "ts": "2025-11-05T04:00:00Z", "moisture": 23.4 }
  ]
}

Velocity: Query Complexity Comparison

Relational (SQL)
  • Join operations across multiple tables
  • Nested-loop join cost can grow toward O(n × m) for tables of n and m rows
  • Each join multiplies row combinations
  • High latency under concurrent writes
  • Difficult to parallelize across nodes
NoSQL (Document / Sharded)
  • Related data stored in the same document
  • A keyed query touches only the relevant shard, not the whole dataset
  • Shards handle independent partitions
  • Linear scalability for inserts and reads
  • Natural fit for distributed sensor data

Variety: Why migrate?

  • Flexible schema for diverse crop data
  • No ALTER TABLE for new fields
  • Store structured and unstructured data
  • Adapt quickly to new sensor formats
-- SQL
ALTER TABLE crops ADD COLUMN agronomy_json JSON;
-- MQL
db.crops.insertOne({
  _id: "crop-123",
  name: "Tomato",
  agronomy: { variety: "Roma", irrigation: "drip" },
  inputs: [{ name: "NPK-10-10-10", rateKgPerHa: 120 }]
});
{
  "_id": "crop-123",
  "name": "Tomato",
  "agronomy": { "variety": "Roma", "irrigation": "drip" },
  "inputs": [{ "name": "NPK-10-10-10", "rateKgPerHa": 120 }]
}

Veracity: Why migrate?

  • Retain both raw and cleaned data
  • Preserve source and lineage info
  • Track quality and validation issues
  • Allow flexible schema-on-read
-- SQL
INSERT INTO yield_clean
SELECT * FROM yield_stage WHERE valid = TRUE;
-- MQL
db.yieldReports.insertOne({
  _id: "YR-7",
  fieldId: "F-101",
  raw: { yieldKg: "??", moisture: "N/A", source: "c_csv" },
  clean: { yieldKg: 17450, moisture: 13.2 },
  quality: { valid: false, issues: ["missing_yield"] },
  lineage: { file: "coop_2025-11-05.csv" }
});
{
  "_id": "YR-7",
  "fieldId": "F-101",
  "raw": { "yieldKg": "??", "moisture": "N/A", "source": "c_csv" },
  "clean": { "yieldKg": 17450, "moisture": 13.2 },
  "quality": { "valid": false, "issues": ["missing_yield"] },
  "lineage": { "file": "coop_2025-11-05.csv" }
}

Value: Why migrate?

  • Faster reads for farm dashboards
  • Pre-aggregate crop yields
  • Reduce multiple joins per query
  • Enable quick insights and analytics
-- SQL
SELECT * FROM farmers WHERE id=42;
SELECT SUM(yield_kg)
FROM harvest_lots WHERE farmer_id=42;
-- MQL
db.farmers.find(
  { _id: 42 },
  { name: 1, harvestLots: 1, metrics: 1 }
);
{
  "_id": 42,
  "name": "Alice Farms",
  "harvestLots": [
    { "crop": "Wheat", "yieldKg": 18000 },
    { "crop": "Corn", "yieldKg": 24500 }
  ],
  "metrics": { "totalYieldKg": 42500, "lastHarvestId": 902 }
}
Questions

Model Shift: From Relational to Document

  • Relational model organizes data into fixed tables
  • Each entity stored separately, linked by foreign keys
  • Designed for transactional integrity (ACID)
  • Requires joins to reconstruct complete context
Relational Model Diagram

Model Shift: Document-Centric Thinking

  • Documents store all related data together
  • Supports nested and flexible structures
  • Optimized for read-heavy agricultural analytics
  • Fewer joins, faster access for field-level summaries
{
  "_id": "...",
  "farmerName": "...",
  "fields": [
    {
      "fieldId": "...",
      "crop": {
        "name": "...",
        "variety": "...",
        "agronomy": {
          "irrigation": "...",
          "spacingCm": ...
        }
      },
      "sensors": [
        {
          "sensorId": "...",
          "type": "...",
          "readings": [
            { "ts": "...", "moisture": ..., "temp": ... },
            ...
          ]
        }
      ]
    }
  ]
}

Model Shift: Example Mapping

  • 1:1 β†’ merge entities into a single document
  • 1:N β†’ embed related items as arrays (e.g., harvest lots)
  • N:N β†’ use reference arrays or link collections
  • Allows mixed, dynamic attributes per document
-- 1:1
{
  "_id": "farmer-101",
  "profile": { "name": "Alice", "address": "..." }
}
-- 1:N
{
  "_id": "field-22",
  "crop": "Wheat",
  "harvestLots": [
    { "lotId": "L-1", "yieldKg": 18000 },
    { "lotId": "L-2", "yieldKg": 24500 }
  ]
}
-- N:N (using references)
{
  "_id": "coop-7",
  "name": "Central Farm Cooperative",
  "farmers": [
    { "$ref": "farmers", "$id": "farmer-101" },
    { "$ref": "farmers", "$id": "farmer-102" }
  ],
  "equipment": [
    { "$ref": "equipment", "$id": "tractor-55" },
    { "$ref": "equipment", "$id": "harvester-21" }
  ]
}

Model Shift: Query Perspective

  • SQL: reconstruct context through joins
  • NoSQL: retrieve complete document in one read
  • Better for data lakes and farm dashboards
  • Natural alignment with JSON-based APIs
Query Perspective Comparison
Questions

1:1 Relationship: Concept

  • Two entities linked one-to-one in SQL
  • Each record in one table has exactly one in the other
  • In NoSQL, we merge them into a single document
  • Reduces joins and simplifies retrieval
 

1:1 Relationship: Relational Design

  • farmer table: stores identity info
  • farmer_profile table: stores extended details
  • Linked through a foreign key: farmer.id = farmer_profile.farmer_id
  • Retrieving data requires a join
-- SQL tables
farmers
 ├─ id
 ├─ name
 └─ region

farmer_profile
 ├─ farmer_id (FK)
 ├─ address
 └─ contact_number

1:1 Relationship: Document Design

  • Merge farmer and profile into one document
  • All details stored together under one key
  • No need for joins: a single read is enough
  • Best for frequently accessed combined data
{
  "_id": "farmer-101",
  "name": "Alice Farms",
  "region": "Central Valley",
  "profile": {
    "address": "...",
    "contactNumber": "...",
    "farmSizeHectares": ...,
    "certifications": ["Organic", "Sustainable"]
  }
}

1:1 Relationship: When to Merge

  • When both entities are always queried together
  • When updates to one imply updates to the other
  • When storage size per document remains moderate
  • When data lifecycle is shared (CRUD together)
 

1:1 Relationship: When Not to Merge

  • When one entity changes much more frequently
  • When one field grows large or optional (e.g., notes)
  • When separation simplifies access control or privacy
  • When embedding violates document limit constraints
 

Discussion Questions

  • What are examples of 1:1 relationships in your datasets?
  • Would merging them simplify or complicate your queries?
  • How do you decide which side becomes the parent document?
  • What trade-offs exist between normalization and embedding?
  • Can a 1:1 merge evolve into a 1:N case over time?

1:1 Relationship: Transforming SQL to JSON

  • ETL processes related SQL rows into a single document
  • Use map to project SQL rows into objects
  • Use filter to select valid or matching pairs
  • Use reduce to merge each farmer–profile pair
  • Final output: one JSON document per farmer
// SQL source rows
const farmers = [
  { id: 1, name: "Alice Farms", region: "Central" },
  { id: 2, name: "Bob Fields", region: "South" }
];

const profiles = [
  { farmer_id: 1, address: "...", contact: "...", size_ha: 12 },
  { farmer_id: 2, address: "...", contact: "...", size_ha: 9 }
];

// Transformation
const merged = farmers
  .map(f => {
    const profile = profiles.find(p => p.farmer_id === f.id);
    return { ...f, profile };
  })
  .filter(d => d.profile)        // ensure valid links
  .reduce((acc, doc) => [...acc, doc], []); // reduce shown for completeness; it simply rebuilds the array

// Result β†’ JSON documents
console.log(JSON.stringify(merged, null, 2));

1:1 Relationship: Example Output

  • Each SQL pair becomes one JSON document
  • No joins required at query time
  • Data is ready for direct NoSQL insertion
[
  {
    "_id": 1,
    "name": "Alice Farms",
    "region": "Central",
    "profile": {
      "address": "...",
      "contact": "...",
      "sizeHa": 12
    }
  },
  {
    "_id": 2,
    "name": "Bob Fields",
    "region": "South",
    "profile": {
      "address": "...",
      "contact": "...",
      "sizeHa": 9
    }
  }
]
Questions

1:N Relationship: Concept

  • One parent record has many child records in SQL
  • Examples: farmer → harvest lots, field → sensor readings
  • In NoSQL, embed children as an array inside the parent
  • Reduces joins; one read returns complete context
 

1:N Relationship: Relational Design

  • farmers table: parent entity
  • harvest_lots table: child rows per farmer
  • Linked by foreign key: harvest_lots.farmer_id → farmers.id
  • Queries typically require joins and grouping
-- SQL tables
farmers
 ├─ id
 ├─ name
 └─ region

harvest_lots
 ├─ id
 ├─ farmer_id (FK)
 ├─ crop
 └─ yield_kg

1:N Relationship: Document Design

  • Embed child records under a descriptive key
  • Keep keys readable and domain-specific
  • Single document read serves the common access path
  • Great for dashboards and farmer-centric queries
{
  "_id": "farmer-101",
  "name": "Alice Farms",
  "region": "Central Valley",
  "harvestLots": [
    { "lotId": "L-1", "crop": "Wheat", "yieldKg": 18000 },
    { "lotId": "L-2", "crop": "Corn",  "yieldKg": 24500 }
  ]
}

1:N Relationship: When to Embed

  • Children are usually read with the parent
  • Child count per parent is bounded/moderate
  • Updates are mostly append or small edits
  • Lifecycle of parent and children is aligned
 

1:N Relationship: When Not to Embed

  • Very large arrays (e.g., millions of readings)
  • Children are updated independently at high write rates
  • Children are queried independently of the parent
  • Document size or hot-spotting becomes a bottleneck
 

Discussion Questions

  • Which parent→child pairs in your data are read together most often?
  • What is a safe upper bound for children per parent in your case?
  • Do children need independent permissions or lifecycles?
  • Where would embedding vs. referencing change performance the most?
  • Could today's 1:N become many-to-many in the future?

1:N: Transforming SQL to JSON

  • map(): project parent rows (farmers) into base docs
  • filter(): select child rows that belong to each parent
  • reduce(): aggregate children into an embedded array
  • Output: one parent document with a harvestLots array
// SQL source rows
const farmers = [
  { id: 1, name: "Alice Farms", region: "Central" },
];

const harvestLots = [
  { id: "L-1", farmer_id: 1, crop: "Wheat", yield_kg: 18000 },
  { id: "L-2", farmer_id: 1, crop: "Corn",  yield_kg: 24500 },
];

// Transformation (group lots under each farmer)
const docs = farmers.map(f => {
  const lots = harvestLots
    .filter(l => l.farmer_id === f.id)
    .reduce((acc, l) => [
      ...acc,
      { lotId: l.id, crop: l.crop, yieldKg: l.yield_kg }
    ], []);
  return { _id: f.id, name: f.name, region: f.region, harvestLots: lots };
});

1:N: Example Output

  • Parent documents embed multiple child records
  • No join required at query time
  • Ideal for farmer-centric dashboards and APIs
[
  {
    "_id": 1,
    "name": "Alice Farms",
    "region": "Central",
    "harvestLots": [
      { "lotId": "L-1", "crop": "Wheat",  "yieldKg": 18000 },
      { "lotId": "L-2", "crop": "Corn",   "yieldKg": 24500 }
    ]
  },
  {
    "_id": 2,
    "name": "Bob Fields",
    "region": "South",
    "harvestLots": [
      { "lotId": "L-3", "crop": "Tomato", "yieldKg": 9000 }
    ]
  }
]
Questions

N:N Relationship: Concept

  • Many farmers share many equipment units (and vice versa)
  • Other examples: farmers ↔ co-ops, crops ↔ suppliers
  • In NoSQL, prefer references or a linking collection
  • Avoid duplicating the same child across many parents
 

N:N Relationship: Relational Design

  • farmers and equipment are peer tables
  • farmer_equipment is the join (pivot) table
  • Queries join across three tables to resolve links
  • Scales poorly for wide, deep associations
-- SQL tables
farmers
 ├─ id
 └─ name

equipment
 ├─ id
 └─ type

farmer_equipment
 ├─ farmer_id (FK → farmers.id)
 └─ equipment_id (FK → equipment.id)

N:N Relationship: Document Design (References)

  • Store arrays of IDs (manual references) on each side
  • Resolve lookups in the application or with aggregation
  • Good when association count is moderate
  • Keeps documents small and avoids duplication
{
  "_id": "farmer-101",
  "name": "Alice Farms",
  "equipmentIds": ["tractor-55", "harvester-21"]
}
{
  "_id": "tractor-55",
  "type": "Tractor",
  "farmerIds": ["farmer-101", "farmer-204"]
}
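
A minimal sketch of resolving such reference arrays in application code (the data below is illustrative):

// Lookup table of equipment documents, keyed by _id
const equipmentById = new Map([
  ["tractor-55",   { _id: "tractor-55",   type: "Tractor" }],
  ["harvester-21", { _id: "harvester-21", type: "Harvester" }]
]);

const farmer = {
  _id: "farmer-101",
  name: "Alice Farms",
  equipmentIds: ["tractor-55", "harvester-21"]
};

// Resolve the IDs to full equipment documents in one pass
const equipment = farmer.equipmentIds
  .map(id => equipmentById.get(id))
  .filter(Boolean); // drop any IDs that no longer resolve

console.log(equipment.map(e => e.type)); // ["Tractor", "Harvester"]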

N:N Relationship: Document Design (Linking Collection)

  • Create a separate collection of link docs
  • Each link holds farmerId and equipmentId
  • Best for high-cardinality or frequently changing links
  • Enables independent indexing and TTL/policies on links
{
  "_id": "link-9001",
  "farmerId": "farmer-101",
  "equipmentId": "tractor-55",
  "since": "..."
}
{
  "_id": "link-9002",
  "farmerId": "farmer-204",
  "equipmentId": "tractor-55",
  "since": "..."
}

N:N: When to Use Which Pattern?

  • Reference arrays: moderate link counts, read with parent
  • Linking collection: high churn, volume, or analytics
  • Consider index needs and shard keys
  • Prefer small, stable parent documents
 

Discussion Questions

  • Are links mostly read with one side (farmer or equipment) or both?
  • What is the maximum expected degree per node (links per farmer/equipment)?
  • Do links change frequently or require their own lifecycle?
  • Could you start with reference arrays and later move to a linking collection?

N:N: Transforming SQL to JSON (Build Reference Arrays)

  • map(): initialize farmer docs
  • filter(): select links for each farmer
  • reduce(): aggregate equipment IDs per farmer
  • Output: farmers with equipmentIds arrays
// SQL source rows
const farmers = [
  { id: 101, name: "Alice Farms" }
];

const equipment = [
  { id: "tractor-55", type: "Tractor" }
];

const farmerEquipment = [
  { farmer_id: 101, equipment_id: "tractor-55" },
  { farmer_id: 101, equipment_id: "harvester-21" }
];

// Build reference arrays on farmer docs
const farmerDocs = farmers.map(f => {
  const eqIds = farmerEquipment
    .filter(link => link.farmer_id === f.id)
    .reduce((acc, link) => [...acc, link.equipment_id], []);
  return { _id: `farmer-${f.id}`, name: f.name, equipmentIds: eqIds };
});

N:N: Example Output (Reference Arrays)

  • Farmers reference many equipment by ID
  • Simple, compact parent documents
[
  {
    "_id": "farmer-101",
    "name": "Alice Farms",
    "equipmentIds": ["tractor-55", "harvester-21"]
  },
  {
    "_id": "farmer-204",
    "name": "Bob Fields",
    "equipmentIds": ["tractor-55"]
  }
]

N:N: Transforming SQL to JSON (Linking Collection)

  • map(): project each join row into a link doc
  • filter(): optional constraints (e.g., active links)
  • reduce(): accumulate into a link collection array
  • Output: independent farmerEquipmentLinks docs
// Build linking collection documents
const farmerEquipmentLinks = farmerEquipment
  .map((l, i) => ({
    _id: `link-${i + 1}`,
    farmerId: `farmer-${l.farmer_id}`,
    equipmentId: l.equipment_id,
    since: "..."
  }))
  .filter(link => link) // placeholder for conditions
  .reduce((acc, link) => [...acc, link], []);

N:N: Example Output (Linking Collection)

  • Links can be indexed and sharded independently
  • Supports high-cardinality and fast-changing relations
  • Clean separation of entity and association lifecycles
[
  {
    "_id": "link-1",
    "farmerId": "farmer-101",
    "equipmentId": "tractor-55",
    "since": "..."
  },
  {
    "_id": "link-2",
    "farmerId": "farmer-101",
    "equipmentId": "harvester-21",
    "since": "..."
  },
  {
    "_id": "link-3",
    "farmerId": "farmer-204",
    "equipmentId": "tractor-55",
    "since": "..."
  }
]
Questions

Best Practices - Migrating from SQL to NoSQL

  • Know your query patterns first
  • Start with one feature or domain
  • Choose embed or reference carefully
  • Keep documents small and consistent
  • Validate and index critical fields
  • Automate ETL for repeat runs
  • Monitor read and write latency
  • Track original table lineage
  • Test integrity before switching traffic
  • Iterate design as data grows

Today, November 4, 2025

  • Why migrate: scalability, flexibility, performance
  • Model shift: relational → document
  • 1:1 → merge into one document
  • 1:N → embed child array
  • N:N → use references or link collection
  • Case Study
  • Homework, Lab, Project
  • Review Distributed Databases
SQL to NoSQL Migration Diagram

Case Study: SQL to NoSQL (Agriculture)

Agriculture system tables:

farmers id name region
farmer_profiles id farmer_id phone email
harvest_lots id farmer_id crop yield_kg
buyers id name
contracts farmer_id buyer_id

Reports are slow due to joins.

  • Goal: one document per farmer
  • Embed harvests inside farmer
  • Reference buyers if needed
  • Compute total yield

Data Flow: From SQL to JSON

ETL (Extract → Transform → Load)

  • Extract from SQL tables
  • Transform to JSON documents
  • Load into NoSQL (e.g., MongoDB)
SQL → map/filter/reduce → JSON

Data Flow: From JSON to Enriched Data

ELT (Extract → Load → Transform)

  • Extract raw JSON
  • Load into data lake
  • Transform with metadata & KPIs
JSON → load → enrich(meta, totals); see the sketch below
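
A minimal sketch of the enrich step, assuming farmer documents with embedded harvests have already been loaded (the pipeline name is illustrative):

// Enrich a loaded document with metadata and a pre-computed total
function enrich(doc) {
  return {
    ...doc,
    total_yield: (doc.harvests || []).reduce((s, h) => s + h.yield_kg, 0),
    meta: {
      uuid: crypto.randomUUID(),           // traceable identifier
      lastSync: new Date().toISOString(),  // sync timestamp
      processedBy: "elt_enrich_v1"         // illustrative pipeline name
    }
  };
}

// enrich({ id: 1, name: "Alem", harvests: [{ crop: "Tomato", yield_kg: 300 }] })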

Setup

  • All examples use agriculture
  • 1x1, 1xn, nxn
const farmers=[{id:1,name:"Alem",region:"Delta"},
{ id:2,name:"Maya",region:"Central" }];

const farmer_profiles=[{farmer_id:1,phone:"555-1111"},
{ farmer_id:2,phone:"555-2222" }];

const harvests=[{id:1,farmer_id:1,crop:"Tomato",yield_kg:300},
{ id:2,farmer_id:1,crop:"Onion",yield_kg:200 },
{ id:3,farmer_id:2,crop:"Wheat",yield_kg:500 }];

const buyers=[{id:10,name:"FreshMart"},
{ id:20,name:"AgriCo" }];

const contracts=[{farmer_id:1,buyer_id:10},
{ farmer_id:1,buyer_id:20 },
{ farmer_id:2,buyer_id:10 }];

Q1

  • How to model 1x1
  • farmer with profile
  • Target is one document
farmers ⟷ farmer_profiles

A1

  • Merge fields
  • map + find
const farmerDocs_1x1=farmers.map(f=>{
  const p=farmer_profiles.find(x=>x.farmer_id===f.id);
  return {...f,profile:p};
});

Q2

  • How to model 1xn
  • farmer with harvests
  • Embed array
farmers ⟷ harvests

A2

  • map + filter
  • child array
const farmerDocs_1xn=farmers.map(f=>({
  ...f,
  harvests:harvests.filter(h=>h.farmer_id===f.id)
}));

Q3

  • Total yield per farmer
  • Which array tool
sum yield_kg

A3

  • reduce inside map
const withTotals=farmerDocs_1xn.map(f=>({
  ...f,
  total_yield:f.harvests.reduce((s,h)=>s+h.yield_kg,0)
}));

Q4

  • Keep only high yield lots
  • Which array tool
yield_kg >= 300

A4

  • filter nested
const highYield=withTotals.map(f=>({
  ...f,
  harvests:f.harvests.filter(h=>h.yield_kg>=300)
}));

Q5

  • How to model nxn
  • farmers with buyers
  • contracts link
farmers ⟷ contracts ⟷ buyers

A5

  • map + filter + includes
  • embed buyer list
const farmerWithBuyers=farmers.map(f=>{
  const buyerIds=contracts
    .filter(c=>c.farmer_id===f.id)
    .map(c=>c.buyer_id);
  const myBuyers=buyers.filter(b=>buyerIds.includes(b.id));
  return {...f,buyers:myBuyers};
});

Q6

  • Alternative for nxn
  • keep link collection
contract documents

A6

  • reference by id
{
  "farmer_id": 1,
  "buyer_id": 10
}

Q7

  • Create farmer document
  • Embed harvests
  • Add total_yield
map + filter + reduce

A7

  • one pass build
const farmerDocs=farmers.map(f=>{
  const hs=harvests.filter(h=>h.farmer_id===f.id);
  const total=hs.reduce((s,h)=>s+h.yield_kg,0);
  return {...f,harvests:hs,total_yield:total};
});

Q8

  • Top farmers by total
  • Keep total greater than 400
filter by total_yield

A8

  • filter outer array
const topFarmers=farmerDocs.filter(f=>f.total_yield>400);

Q9

  • Rules for mapping
  • 1x1, 1xn, nxn
state your rule

A9

  • 1x1 merge
  • 1xn embed array
  • nxn reference or link
use map filter reduce
Questions

Metadata Field: uuid

  • Globally unique identifier
  • Allows traceability across systems
  • Prevents duplicate records
  • Useful for merges and sync
meta.uuid = crypto.randomUUID();

Metadata Field: checksum

  • Represents data size or hash
  • Detects data changes or corruption
  • Used to verify sync integrity
  • Lightweight alternative to full hash
meta.checksum = JSON.stringify(obj).length;
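
If a real hash is preferred over the length shortcut, the Web Crypto API can produce one; a minimal sketch (the helper name hashChecksum is illustrative):

// Compute a SHA-256 hex digest of the serialized object (browser or Node 18+)
async function hashChecksum(obj) {
  const bytes = new TextEncoder().encode(JSON.stringify(obj));
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, "0"))
    .join("");
}

// meta.checksum = await hashChecksum(record);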

Metadata Field: databaseType

  • Identifies storage model
  • Helps route queries correctly
  • Useful for ETL pipelines supporting multiple DBs
  • Examples: "SQL", "NoSQL", "Cache"
meta.databaseType = "NoSQL";

Metadata Field: databaseSource

  • Specifies data origin
  • Useful for audit trails
  • Supports hybrid architectures (e.g., SQL + MongoDB)
  • Enables source-based filtering
meta.databaseSource = "MongoDB";

Metadata Field: databaseDestination

  • Specifies where the enriched data will be stored
  • Includes database and collection (or table) name
  • Useful for routing during ETL or ELT
  • Enables multi-database deployment tracking
meta.databaseDestination = {
  db: "agriculture",
  collection: "farmers_enriched"
};

Metadata Field: lastSync

  • Records last update timestamp
  • Used to detect stale data
  • Supports incremental synchronization
  • Enables time-based analytics
meta.lastSync = new Date().toISOString();

Metadata Field: sourceTimestamp

  • Records when the data was first captured
  • Distinguishes original event time and sync time
  • Essential for timeline reconstruction
meta.sourceTimestamp = "2025-11-06T08:00Z";

Metadata Field: processedBy

  • Marks the script or pipeline that processed the record
  • Supports auditability and debugging
  • Helps track different ETL versions
meta.processedBy = "etl_pipeline_v2";

Metadata Field: recordStatus

  • Indicates lifecycle stage of the record
  • Examples: "new", "validated", "archived"
  • Useful for incremental updates
meta.recordStatus = "validated";

Metadata Field: dataVersion

  • Tracks schema or transformation version
  • Prevents mismatches between old and new data formats
  • Critical for reproducibility
meta.dataVersion = "v1.3";

Metadata Field: verifiedBy

  • Specifies who or what validated the data
  • Can be a QA process, script, or user
  • Useful for accountability and quality logs
meta.verifiedBy = "qualityCheckerBot";
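
Putting the fields above together, one enriched record might carry a meta block like this (a sketch; the record and values are illustrative):

const record = { farm: "F1", rain_mm: 3.4, temp_c: 22.0 };

const meta = {
  uuid: crypto.randomUUID(),
  checksum: JSON.stringify(record).length,  // or a SHA-256 digest
  databaseType: "NoSQL",
  databaseSource: "MongoDB",
  databaseDestination: { db: "agriculture", collection: "farmers_enriched" },
  sourceTimestamp: "2025-11-06T08:00Z",
  lastSync: new Date().toISOString(),
  processedBy: "etl_pipeline_v2",
  recordStatus: "validated",
  dataVersion: "v1.3",
  verifiedBy: "qualityCheckerBot"
};

const enriched = { ...record, meta };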

Today, November 11, 2025

  1. Database Pipeline Security
  2. Similarity Search (Vector DB)
  3. Use Case: Agriculture Data
  4. Homework: Lab 5 & HW 5
FAISS Agriculture Pipeline

NoSQL Database Security: Overview

  • Protect confidentiality, integrity, and availability
  • Prevent breaches, legal exposure, and compliance penalties
  • Model threats across the pipeline (e.g., the ETL stages)
Interconnected Systems: Data paths at risk

NoSQL Database Security: Overview

Why Security Matters

  • Unauthorized access: data theft and monetization risks
  • Integrity loss: tampering, poisoning, or schema drift
  • Ops. impact: downtime, ransom, regulatory penalties
Interconnected Systems: Data paths at risk

Core Security Pillars

Encryption
(in transit, at rest)
Access Control
(authn, RBAC/DAC)
Auditing & Monitoring
(logs, alerts, anomaly detection)
Data Fragmentation
(sharding, data marts)

Encryption and Decryption

Encryption: Where It Applies

Layer Goal Example
In transit Prevent snooping TLS for drivers, HTTPS for APIs
At rest Protect disks/snapshots Database/volume encryption
Client side Minimize trust Field-level AES before insert
Data encryption across systems

IndexedDB: Minimal Field Encryption (AES-GCM)


// Generate a key
const key = await crypto.subtle.generateKey(
  { name: "AES-GCM", length: 256 },
  true,
  ["encrypt", "decrypt"]
);

// Encrypt a string field
const iv = crypto.getRandomValues(new Uint8Array(12));
const enc = await crypto.subtle.encrypt(
  { name: "AES-GCM", iv },
  key,
  new TextEncoder().encode("secret")
);

// Decrypt
const dec = await crypto.subtle.decrypt({ 
    name: "AES-GCM", iv 
}, key, enc);
new TextDecoder().decode(dec); // "secret"
            
Smartphone encryption visualization

Transport Encryption: Client ↔ Server

  • Protects data in motion between app and database
  • Prevents eavesdropping and MITM attacks
  • Use TLS 1.2+ with verified certificates
  • Apply to drivers, APIs, and cluster communication

// Example: MongoDB connection with TLS
mongodb+srv://user:pass@cluster.mongodb.net/
  ?ssl=true&retryWrites=true&w=majority
            
Encrypted connection between client and server

Data Lake Security: When to Encrypt or Decrypt

🔒 Encrypt & Store in Lake
  • Data remains secure at rest in the lake
  • Minimizes exposure during ingestion
  • Ideal for sensitive or regulated datasets
  • Query requires authorized decryption layer

// Example: encrypt before upload
const enc = encryptAES(record);
lake.store(enc);
            
📊 Decrypt & Query Before Storing
  • Allows transformations and analytics pre-lake
  • Useful when encryption breaks query performance
  • Requires trusted compute and short-term exposure
  • Re-encrypt before final lake ingestion

// Example: decrypt β†’ query β†’ store
const data = decryptAES(blob);
runQuery(data);
lake.store(reEncrypt(data));
            

Encryption Models: Public vs Private Key

πŸ” Private (Symmetric) Key Encryption
  • Same key used for encryption and decryption
  • Fast and efficient for large data volumes
  • Common in storage, backups, and data lakes
  • Challenges: secure key distribution, rotation
  • Performance: high speed, minimal overhead
  • Security: compromised key = total data loss risk

// Example: AES-GCM symmetric encryption
const enc = await crypto.subtle.encrypt({
    name:"AES-GCM", iv
}, key, data);
            
🔑 Public (Asymmetric) Key Encryption
  • Key pair: public to encrypt, private to decrypt
  • Secure exchange and identity verification
  • Used in TLS, SSH, and digital signatures
  • Challenges: slower, complex key management
  • Performance: higher CPU cost per operation
  • Security: strong transmission, less bulk data

// Example: RSA public key encryption
const enc = await crypto.subtle.encrypt({
    name:"RSA-OAEP"
}, pubKey, data);
            

Discussion: Encryption in Practice

  • How is a symmetric key shared for sensor data?
  • How does encryption affect queries in NoSQL?
  • Encrypt early or in the lake: which is better?
  • How to balance strong encryption with speed?
Data security in interconnected hospital systems

NoSQL Databases: Encryption at Rest Support

Database Encryption at Rest Notes
Cassandra ✅ Supported Transparent Data Encryption (TDE) for SSTables and commit logs
MongoDB ✅ Supported Encrypted storage engine; integrates with external KMS
Redis ⚠️ Partial Enterprise and Redis Stack support disk-level encryption
IndexedDB ⚠️ Browser dependent Relies on OS or browser sandbox; no built-in encryption
Lakehouse (e.g., Delta, Iceberg) ✅ Supported Encryption handled by storage layer (S3, Azure, GCS)
Neo4j ✅ Supported Native file and log encryption (Enterprise edition)
FAISS ❌ Not native Relies on external filesystem or disk encryption

Access Control

Access Control: Models

  • RBAC: roles map to privileges (read, write, admin)
  • DAC: owners grant access to their data at the user level
  • Implements identity and authorization rules only
  • Note: data is not encrypted; access control ≠ encryption
  • Apply least privilege and separation of duties
Access control in interconnected hospital systems

RBAC Concept (~1996)

  1. Role Assignment: users get roles before access.
  2. Role Authorization: only approved roles are valid.
  3. Permission Authorization: roles define actions.

RBAC groups permissions by role, not by user.

Core RBAC Model

IndexedDB: Access Control Reality

IndexedDB Same-Origin

The same-origin policy restricts DB access to the same protocol, host, and port.

Access Control: Device, Browser, Tab

Device scope
  • OS user account is the root boundary
  • Disk and keychain hold secrets for apps
  • Use full disk encryption and per user key stores
Browser scope
  • Profile and origin define isolation
  • Cookies with SameSite and Secure limit exposure
  • LocalStorage and IndexedDB are per origin
  • Service Worker gates network for the origin
Tab scope
  • Per tab state in sessionStorage
  • Carry role in the key for checks
  • Avoid sharing across tabs unless needed
  • Use BroadcastChannel only for safe signals
Scope Control Example
Device OS user, key store OS login, keychain, disk crypto
Browser Origin, profile Cookie flags, IDB per origin
Tab Session state sessionStorage role for API checks

Soft RBAC with API Keys in IndexedDB (tab scoped)

  • Store API key in sessionStorage so it is per tab
  • Key encodes the role (patient, nurse, physician)
  • Middleware checks role before each read or write
// set per tab
sessionStorage.setItem("API_KEY", "idb_apikey:nurse:98D1B7");

// read role
const role = (sessionStorage.getItem("API_KEY") || "")
    .split(":")[1];

// tiny ACL
const ACL = {
  patient:   ["read:own"],
  nurse:     ["read:any","write:notes"],
  physician: ["read:any","write:any"]
};
const can = (r,a) => ACL[r]?.includes(a);
      
// guarded IndexedDB write
if (!can(role, "write:notes")) 
    throw new Error("forbidden");

const db = await new Promise((res, rej) => {
  const req = indexedDB.open("clinic", 1);
  req.onupgradeneeded = () => 
    req.result.createObjectStore("notes", { 
        keyPath: "_id" 
    });
  req.onsuccess = () => res(req.result);
  req.onerror = () => rej(req.error);
});

const tx = db.transaction("notes", "readwrite");
tx.objectStore("notes").put({ _id, patientId, text });
// native IndexedDB has no tx.done; wait for the transaction to complete
await new Promise((res, rej) => {
  tx.oncomplete = () => res();
  tx.onerror = () => rej(tx.error);
});
      
// example keys per role
"idb_apikey:patient:72F3A9"
"idb_apikey:nurse:98D1B7"
"idb_apikey:physician:65A4C2"
      

Access Control: Backend Layer

  • Applies before database queries are executed
  • Backend enforces user roles, tokens, and scopes
  • Prevents direct client-to-database connections
  • Central place to log, audit, and throttle requests
  • Decouples business logic from database access rules
// Example: Express middleware
app.use((req, res, next) => {
  const token = req.headers.authorization;
  // patient, nurse, physician
  const role = verifyJWT(token).role; 
  req.role = role;
  next();
});

app.get("/records/:id", (req, res) => {
  if (req.role !== "physician") 
    return res.status(403).send("forbidden");
  db.records.findOne({ _id: req.params.id })
    .then(r => res.json(r));
});
      
// API key example
API_KEY = "backend_apikey:physician:65A4C2"
role = API_KEY.split(":")[1]; // physician
      
Backend enforcing access before database

Access decisions occur in the backend: not the database.

IP-Based Access Control: Backend and MongoDB

  • Restricts database access by client IP address
  • Acts as a network-level access control layer
  • Common in managed clusters (MongoDB Atlas, Firewalls)
  • Only trusted subnets or devices can connect
  • Note: IP filtering ≠ authentication; still use user roles
# Example: MongoDB Atlas IP access list
Project → Network Access → IP Whitelist
Add IP: 203.0.113.15 (office)
Add CIDR: 203.0.113.0/24 (VPN subnet)
      
// Self-hosted: bind and firewall
net:
  bindIp: 127.0.0.1,192.168.10.5
  port: 27017

# OS firewall
sudo ufw allow from 192.168.10.0/24 to any port 27017
      
IP-based access control for MongoDB cluster

Only approved network sources can reach the database.

RBAC with API Keys in MongoDB

  • Each API key encodes a role (patient, nurse, physician)
  • API gateway checks key → extracts role → applies rule
  • MongoDB grants permissions by role
// Example API keys
API_KEY_PATIENT   = "mongo_apikey:patient:72F3A9";
API_KEY_NURSE     = "mongo_apikey:nurse:98D1B7";
API_KEY_PHYSICIAN = "mongo_apikey:physician:65A4C2";
      
// Check and enforce
const role = req.get("x-api-key").split(":")[1];
if (role === "nurse") db.notes.updateOne({ patientId }, update);
else if (role === "patient") db.notes.find({ self: true });
      
API key roles controlling MongoDB access

Encryption vs Access Control

Encryption: Pros
  • Protects data if media is stolen
  • Works in transit and at rest
  • Limits blast radius of breaches
Access Control: Pros
  • Fine-grained permissions
  • Matches org roles and duties
  • Policy enforcement at query paths
Encryption: Cons
  • Key rotation and storage complexity
  • Performance overhead
  • Does not decide who may access
Access Control: Cons
  • Misconfig risk and privilege creep
  • Bypass risk if perimeter is weak
  • Ongoing policy maintenance

Data Fragmentation

Data Fragmentation: Concept

  • Divide large datasets into smaller, manageable fragments
  • Improve performance, scalability, and data isolation
  • Fragments can be distributed across nodes or regions
  • Supports privacy by storing sensitive data separately
// Example split
patients_core     → demographics, ID
patients_medical  → diagnoses, treatments
patients_billing  → invoices, payments
      
Data fragmentation across healthcare systems

Horizontal Fragmentation: Sharding

  • Each shard holds a subset of records (horizontal split)
  • Common shard keys: region, department, or patient group
  • Queries route to the correct shard via the router
  • MongoDB, Cassandra, and Neo4j support native sharding
// Example shard keys
hospital_east    → patients.region = "East"
hospital_west    → patients.region = "West"
hospital_central → patients.region = "Central"
      
Horizontal data sharding in hospital systems
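
For example, in MongoDB a region-based shard key can be declared roughly like this (database and collection names are illustrative):

// mongosh sketch: shard the patients collection by region
sh.enableSharding("hospital");
sh.shardCollection("hospital.patients", { region: 1 });

// Queries that include the shard key route to a single shard
db.patients.find({ region: "East" });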

Vertical Fragmentation: Data Marts

  • Each mart stores topic-specific slices of data
  • Feeds from the lake or ETL pipelines
  • Optimized for analytics or departmental access
  • Finance, clinical, and operations marts stay separate
// Example marts
mart_clinical  → patient outcomes
mart_billing   → payments, reimbursements
mart_staffing  → scheduling, hours
      
Data marts built from a unified hospital data lake

In-Class Use Case: Protecting Farm GPS Data

  • Farms use GPS sensors on tractors and equipment.
  • Coordinates reveal private field locations.
  • Data is encrypted before storage in NoSQL.
  • Only verified users can view decrypted data.
  • Enables safe analytics without exposing locations.

Question:
Design a database pipeline from source to dashboard that protects GPS data.

Quick Check

  • Which layers need encryption in your pipeline?
  • What is the least-privilege role for a read-only analytics job?
  • How will you detect and respond to schema drift affecting security?

Security Audits: Purpose

  • Not always preventive: often risk mitigating.
  • Identify weaknesses before they lead to incidents.
  • Assess system controls, access, and data handling.
  • Recommend corrective and compensating actions.
  • Focus on detection, documentation, and improvement.

Audit Goal:
- Detect vulnerabilities
- Verify compliance
- Improve readiness
            

Security Audits: Common Scope

  • Access control and authentication policies
  • Encryption practices and key management
  • Data handling (ETL, logs, backups)
  • Incident response readiness
  • Regulatory and organizational compliance
{
  "audit_area": "Access Control",
  "finding": "Too many admin users",
  "risk": "Privilege misuse",
  "recommendation": "Role-based access"
}

Example: Database Audit for IoT Farm Data

  • Check encryption between sensors and servers.
  • Review API authentication (tokens, keys).
  • Inspect database access logs for anomalies.
  • Validate least-privilege for users.
  • Report on remediation progress.

Audit Report (Excerpt)
----------------------
✔ Encryption in transit (TLS)
⚠ Weak password policy
⚠ Shared DB credentials
✔ Regular patch updates
            

Security Audits: Risk Mitigation

  • Audits reduce impact, even if they can't prevent attacks.
  • Finds misconfigurations before exploitation.
  • Improves recovery and response plans.
  • Builds a security culture and accountability.
  • Essential part of ongoing data governance.

Preventive → stops risk
Mitigating → reduces impact
Audit = Mitigating measure
            

In-Class Use Case: Auditing Farm IoT Data Security

  • Farm IoT sends weather, soil, GPS data to cloud.
  • Audit checks encryption, key storage.
  • Audit checks data access logs.
  • Identifies misuse or weak configurations.
  • Suggests mitigation, not just prevention.
  • Ensures secure analytics without disrupting service.

Question:
What security audit steps would you include to mitigate risk in a farm IoT data pipeline?

Questions?
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

Why Vector Databases?

  • SQL/NoSQL rely on exact matching of values
  • Vector DBs enable *semantic* matching based on meaning
  • AI systems: Find similar text, images, or signals
  • Recommendation, Personalization
  • Run efficiently on billions of high-dimensional vectors
Query Type Example
Keyword Search "corn yield"
Vector Search "find farms like this one"
Result Semantically similar records, not just identical text

Limitations of SQL and NoSQL Databases

  • Designed for *exact* equality or range conditions
  • No built-in understanding of semantic similarity
  • Indexes can't handle high-dimensional vectors
  • JOIN/WHERE clauses cannot rank by similarity score
  • Struggle with unstructured or contextual data
Query Limitation
SELECT * FROM farms WHERE crop='corn' Exact only
SELECT * FROM soils WHERE pH BETWEEN 6 AND 7 Numeric range only
Find farms "like" Farm X Not supported natively

Core Concept: Embeddings

  • Embeddings turn objects into numeric vectors
  • Each dimension represents a latent feature
  • Nearby vectors represent similar meaning or context
  • Enable search by *meaning* instead of keywords
  • Used across modern AI and ML applications
Data Embedding (3D)
"Wheat yield in dry soil" [0.22, 0.45, 0.10]
"Corn yield in low rainfall" [0.21, 0.44, 0.11]
"Rice yield in wet soil" [0.10, 0.30, 0.50]

How Similarity Search Works

  • Each record is stored as a vector in a multi-dimensional space
  • Query is also embedded as a vector
  • Compute distance between query and stored vectors
  • Return top-k nearest vectors (most similar meanings)
  • FAISS optimizes this with efficient indexing structures
Farm Vector Distance
F1 (Corn) [0.20, 0.40, 0.15] 0.02
F2 (Wheat) [0.25, 0.30, 0.18] 0.10
F3 (Rice) [0.15, 0.45, 0.20] 0.04
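
As a small illustration, a brute-force version of this search in JavaScript looks like the following (hypothetical vectors); FAISS replaces the O(n) loop with an index:

const farms = [
  { id: "F1", vec: [0.20, 0.40, 0.15] },
  { id: "F2", vec: [0.25, 0.30, 0.18] },
  { id: "F3", vec: [0.15, 0.45, 0.20] }
];

// Squared L2 distance between two equal-length vectors
const l2sq = (a, b) => a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);

// Return the k nearest farms to the query vector
function topK(query, k) {
  return farms
    .map(f => ({ id: f.id, dist: l2sq(query, f.vec) }))
    .sort((a, b) => a.dist - b.dist)
    .slice(0, k);
}

console.log(topK([0.21, 0.41, 0.16], 2)); // F1 first, then F3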

Why Vector Databases Are the Future

  • Bridge between AI models and data storage
  • Handle unstructured, semantic, and contextual data
  • Support hybrid filtering (metadata + similarity)
  • Enable real-time discovery of "similar" conditions
  • Transform data-driven decision making
Aspect SQL/NoSQL Vector DB
Query Type Exact Semantic
Data Type Structured Unstructured + Numeric
Example WHERE crop='corn' Farms similar to F1
Result Identical matches Similar contexts

Exercise: Similarity, Aggregation, or Document/Row?

  1. "Find farms similar to Farm A in rainfall pattern."
  2. "Calculate average crop yield per region."
  3. "Retrieve all records where crop = 'Corn'"
  4. "List farms whose soil is closest to [0.3, 0.4, 0.2]."
  5. "Group by fertilizer type and count farms."
# Likely Type Typical System Key Value Fit Graph Fit
1 Similarity Vector DB No Sometimes
2 Aggregation SQL or OLAP No Rare
3 Document or Row SQL or NoSQL Yes for id lookup No
4 Similarity Vector DB No Sometimes
5 Aggregation SQL No Rare

Vector Databases

Common Vector Databases

  • FAISS: optimized for fast, local similarity search
  • Milvus: scalable distributed vector database
  • Weaviate: semantic vector DB with GraphQL API
  • Qdrant: lightweight, production-ready
Name License GitHub / Website
FAISS Open Source github.com/facebookresearch/faiss
Milvus Open Source github.com/milvus-io/milvus
Weaviate Open Source github.com/weaviate/weaviate
Qdrant Open Source github.com/qdrant/qdrant

Chronology of FAISS

  • 2017: Released by Facebook AI Research
  • 2018: GPU acceleration and wide adoption
  • 2020: Integrated into ML and AI pipelines
  • 2022: Added hybrid and hierarchical indexes
  • 2024: Core engine for many vector databases
 

FAISS Example: Finding Similar Farms

  • Create sample data for farms (numeric features)
  • Build a FAISS index using L2 distance
  • Add all farm vectors to the index
  • Query with one farm to find its nearest neighbors
  • Full example: GitHub: /nosql/vector

import numpy as np
import faiss
# Sample farm data (rainfall, soil_pH, yield)
data = np.array([
  [0.75, 0.60, 0.82],
  [0.80, 0.62, 0.81],
  [0.25, 0.30, 0.28],
  [0.78, 0.65, 0.85]
], dtype='float32')

# Build index (L2 distance)
index = faiss.IndexFlatL2(3)
index.add(data)  # add all farm vectors

# Query: find 3 nearest farms to first one
query = np.array([[0.75, 0.60, 0.82]], dtype='float32')
D, I = index.search(query, k=3)

print(I)  # indices of similar farms → [[0 1 3]]
print(D)  # squared L2 distances → approximately [[0. 0.003 0.0043]]
      
Questions

Sources β†’ Lake (Raw)

  • Collect weather, soil, GPS data
  • Store as raw JSON in lake
  • Preserve source + timestamp
  • Question: How is GPS data protected at ingestion?

Before:
{"farm":"F1","lat":37.95,"lon":-121.29,"rain":3.4}

After:
{"farm":"F1","gps_enc":"b9a3f1...","rain":3.4}
      

Raw β†’ Curated (ETL)

  • Clean + validate JSON records
  • Convert units, drop invalid rows
  • Aggregate daily averages
  • Question: Who can modify curated data?

Before:
{"farm":"F1","rain":"3.4mm","temp":"22C"}

After:
{"farm":"F1","rain_mm":3.4,"temp_c":22.0}
      

Curated β†’ Analytics

  • Join soil, weather, yield by farm
  • Compute key features for analysis
  • Maintain metadata for traceability
  • Question: How does lineage tracking improve trust?

Before:
{"farm":"F1","rain":3.4}
{"farm":"F1","yield":4.8}

After:
{"farm":"F1","rain":3.4,"yield":4.8}
      

Analytics β†’ Vectors

  • Normalize numeric columns
  • Create farm vectors for similarity
  • Anonymize identifiers before indexing
  • Question: What risks arise if vectors leak?

Before:
{"farm":"F1","rain":3.4,"ph":6.3,"yield":4.8}

After:
{"id":"A1","vector":[0.62,0.58,0.73]}
      

Vector DB β†’ Query

  • Search for similar farms (top-k)
  • Restrict access by user region
  • Return safe, aggregated results
  • Question: How to prevent cross-farm data leaks?

Before:
{"query":"A1"}

After:
{"neighbors":[{"id":"A1","dist":0.0},
              {"id":"A2","dist":0.07}]}
      

API β†’ Secure Dashboard

  • Serve encrypted analytics via API
  • Apply role-based access (farmer/admin)
  • Display results without revealing GPS
  • Question: Which controls protect map layers?

Before:
{"farm":"F1","gps":[37.95,-121.29]}

After:
{"farm":"F1","region":"Valley","gps":"hidden"}
      

Summary: Secure Farm Pipeline

  • Source → Lake → Curated → Vector → Dashboard
  • Encrypt GPS and sensitive info
  • Apply access control at each layer
  • Monitor lineage and anomalies
  • Question: Which stage is most at risk?

{
  "pipeline":"secure_farm_data",
  "stages":["ingest","curate","vector","serve"],
  "security":["encrypt","access","audit"]
}
      

Similarity Search: Concept

  • Goal: find farms with similar conditions
  • Input: numeric features β†’ vector space
  • Metric: cosine or L2 distance
  • Index: FAISS (flat or IVF/PQ)
  • Security: restrict by tenant/region

Before:
{"farm":"F1","rain":3.4,"ph":6.3,"yield":4.8}

After:
{"id":"A1","vector":[0.62,0.58,0.73]}
      

Query Prep: Build Vector

  • Normalize features (0–1)
  • Handle missing with defaults
  • Mask identifiers before search
  • Security: log query lineage

Before:
{"farm":"F2","rain":2.1,"ph":6.7,"yield":4.2}

After:
{"id":"A2","vector":[0.39,0.71,0.61]}
      

Search: Top-k Neighbors

  • Input: query vector + k
  • Output: ids + distances
  • Lower distance = more similar
  • Security: strip PII, rate-limit

Before:
{"query":"A1","k":3}

After:
{"neighbors":[
  {"id":"A1","dist":0.00},
  {"id":"A2","dist":0.07},
  {"id":"A3","dist":0.12}
]}
      

Post-Process: Secure Results

  • Attach safe metadata (region/season)
  • Mask GPS; show region only
  • Filter by user's role/scope
  • Audit: record query + response

Before:
{"neighbors":[{"id":"A2"},{"id":"A3"}]}

After:
{"neighbors":[
  {"id":"A2","region":"Valley","gps":"hidden","dist":0.07},
  {"id":"A3","region":"Foothill","gps":"hidden","dist":0.12}
]}
      

Quality & Safety Checks

  • Outlier guard (feature ranges)
  • Drift monitor (stats over time)
  • Access policy test (ABAC/RBAC)
  • Fail-closed on policy violations

Before:
{"vector":[1.9,-0.4,7.2]}

After:
{"error":"out_of_range","action":"reject"}
      

Abstract: Vector Database Management Systems

  • Describes data as vectors for text, images, and video.
  • Used in recommendation, search, and chatbots.
  • Handles high-dimensional and sparse data.
  • Focus on storage, retrieval, and processing.
  • Summarizes key concepts and challenges.

Today, November 13, 2025

  1. Case Study: Design a Database Pipeline
  2. Homework: Lab 5 & HW 5

Today, November 18, 2025

  1. Synthetic Data: Role and Importance
  2. Review: NoSQL Database
  3. Homework: Lab 4 & HW 4
  4. Homework: Lab 5 & HW 5
  5. Project: Part 2

Synthetic Data: Role and Importance

  • Used to preserve privacy
  • Helps generate test data
  • Source link
AI generated content and synthetic data

Test Data vs Synthetic Data

  • Test Data: taken from real systems
  • Test Data: often limited
  • Synthetic Data: mirrors real patterns
  • Synthetic Data: scalable
AI generated content and synthetic data
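
A minimal sketch of generating synthetic records that mirror a simple weather pattern (field names and value ranges are illustrative):

// Generate n synthetic weather records with plausible, random values
function syntheticWeather(n) {
  const cities = ["Stockton", "Lodi", "Tracy"];
  return Array.from({ length: n }, (_, i) => ({
    id: i + 1,
    city: cities[Math.floor(Math.random() * cities.length)],
    temp: Math.round(60 + Math.random() * 45),      // °F
    humidity: Math.round(20 + Math.random() * 60),  // %
    date: `2025-08-${String(1 + (i % 28)).padStart(2, "0")}`
  }));
}

console.log(syntheticWeather(3));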

Benford's Law: Natural Number Patterns

  • Many real world numbers start with lower digits
  • The digit "1" appears as the first digit about 30% of the time
  • Auditing, fraud detection, and anomaly checks
  • When numbers span orders of magnitude
Benford's Law distribution
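
The 30% figure follows from Benford's expected first-digit probability P(d) = log10(1 + 1/d); a quick JavaScript check:

// Expected first-digit frequencies under Benford's Law
const benford = Array.from({ length: 9 }, (_, i) => {
  const d = i + 1;
  return { digit: d, probability: Number(Math.log10(1 + 1 / d).toFixed(3)) };
});

console.log(benford[0]); // { digit: 1, probability: 0.301 }  (about 30%)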
Questions
IndexedDB vs SQL Recap

IndexedDB Structure

  • Key–value storage.
  • Object store: weatherStore.
  • Each record is a full JSON object.
  • No schema; fields may vary.
  • Lookups use B-tree → O(log n).

// IndexedDB stored record
{
  id: 101,
  city: "Stockton",
  temp: 92,
  humidity: 40,
  date: "2025-08-01"
}

SQL Table Structure

  • Tabular storage.
  • Fixed schema defined with CREATE TABLE.
  • Types enforced (INT, VARCHAR, DATE).
  • Indexes improve lookup → O(log n).

-- SQL row stored in table
id | city      | temp | humidity | date
-----------------------------------------
101 | Stockton |  92  |    40    | 2025-08-01

SQL vs IndexedDB: Core Concepts

  • IndexedDB stores key–value weather records.
  • Uses object stores (e.g., "weatherStore").
  • Queries rely on JavaScript functions (map/filter/reduce).
  • Designed for browser local caching.
  • Typical full-scan operations cost O(n).
  • SQL stores weather rows in tables with fixed columns.
  • Queries use SELECT, INSERT, UPDATE, DELETE.
  • Indexes on city/date can give O(log n) lookups.
  • Runs on server-based weather databases.

CRUD in IndexedDB vs SQL

  • Create: store.add(weather).
  • Read: store.get(key) or store.getAll().
  • Update: store.put(weather).
  • Delete: store.delete(key).
  • Filtering usually costs O(n) with getAll + filter.
  • INSERT INTO weather VALUES (...).
  • SELECT * FROM weather WHERE city='Stockton'.
  • UPDATE weather SET temp = ... WHERE id = ...;
  • DELETE FROM weather WHERE id = ...;
  • With index on city/date, single-row lookup ~O(log n).

Query Style Difference


// IndexedDB Read (weather)
store.get(cityKey);   // primary key lookup (B-tree)
store.getAll();       // load all records
  • get(cityKey) uses a B-tree → O(log n).
  • Scanning all weather entries with getAll is O(n).

SELECT * FROM weather
WHERE city = 'Stockton';

SELECT * FROM weather w
JOIN regions r
  ON w.region_id = r.id;
  • Without indexes, SQL filter is O(n).
  • With index on city, SQL lookup is O(log n).

Performance Question (MCQ)

Which query is fastest for finding yesterday's temperature record?

  • A: IndexedDB getAll() then filter by date.
  • B: IndexedDB cursor scan over all records.
  • C: SQL SELECT date='yesterday' without index.
  • D: SQL SELECT date='yesterday' with date index.

SELECT * FROM weather
WHERE date = 'yesterday';

-- or, after creating an index on date
CREATE INDEX idx_weather_date ON weather(date);

SELECT * FROM weather
WHERE date = 'yesterday';

Write the Query (IndexedDB + JS)

Write an IndexedDB query:

  • Find all cities where temperature ≥ 90°F.
  • Use JavaScript map / filter / reduce.
  • Think about cost: reading all then filtering is O(n).

// Load all weather records, then filter and map
store.getAll().onsuccess = (event) => {
    const weather = event.target.result; // n records

    const hotCities =
        weather
            .filter(w => w.temp >= 90)  // O(n)
            .map(w => w.city);          // O(n)

    // Total time ~ O(n) for n records
};
MongoDB vs SQL Recap

MongoDB Document

  • JSON-like document.
  • Collection: weather.
  • Flexible fields.
  • Index lookup → O(log n).

{
  _id: 101,
  city: "Stockton",
  temp: 92,
  hum: 40,
  date: "2025-08-01"
}

SQL Row

  • Fixed table schema.
  • Defined types (INT, VARCHAR).
  • Strict column structure.
  • Index lookup → O(log n).

id | city     | temp | hum | date
-----------------------------------
101| Stockton | 92   | 40  | 2025-08-01

SQL vs MongoDB: Core Concepts

  • MongoDB uses documents.
  • No schema enforcement.
  • Nested fields allowed.
  • Aggregation pipeline for queries.
  • Full scans often O(n).
  • SQL uses tables.
  • Strict schemas with DDL.
  • Rows + typed columns.
  • Joins supported.
  • Index lookup O(log n).

CRUD in MongoDB vs SQL

  • Create: insertOne()
  • Read: find()
  • Update: updateOne()
  • Delete: deleteOne()
  • Filters rely on document fields.
  • Non-indexed queries → O(n).
  • INSERT INTO weather ...
  • SELECT * FROM weather ...
  • UPDATE weather SET ...
  • DELETE FROM weather ...
  • Relational filtering.
  • Indexed lookup → O(log n).

Find Hot Cities (MongoDB vs SQL)


db.weather.find({
  temp: { $gte: 90 }
})
  • Document filter.
  • Indexed filter → O(log n).
  • No joins; embed or reference.

SELECT city FROM weather
WHERE temp >= 90;
  • Column filter.
  • Index improves to O(log n).

Performance Question (MCQ)

Fastest way to find all cities with temp ≥ 90?

  • A: MongoDB full scan (find() no index)
  • B: MongoDB indexed temp field
  • C: SQL full table scan
  • D: SQL indexed temp column
  • A and C → full scans O(n)
  • B and D → index lookup O(log n)
  • Best: B or D (both indexed)

Write the Query (MongoDB + SQL)

Write a MongoDB query:

  • Find weather records for "Stockton".
  • Short written answer expected.
  • Hint: index lookup → O(log n).

// MongoDB MQL
db.weather.find({ city: "Stockton" })

-- SQL
SELECT * FROM weather
WHERE city = 'Stockton';
Redis vs SQL Recap

Redis Hash

  • In-memory key–value store.
  • Key pattern: weather:101.
  • Stores fields in a hash.
  • Key lookup ~O(1) average.

KEY: weather:101
FIELDS:
  city = "Stockton"
  temp = "92"
  hum  = "40"
  date = "2025-08-01"

SQL Row

  • Row in table weather.
  • Schema defined with DDL.
  • Types fixed (INT, VARCHAR, DATE).
  • Indexed lookup → O(log n).

id | city     | temp | hum | date
-----------------------------------
101| Stockton | 92   | 40  | 2025-08-01

SQL vs Redis: Core Concepts

  • Redis is key–value, in memory.
  • Data types: strings, hashes, sets, lists.
  • No joins, no SQL parser.
  • Key lookup ~O(1) average.
  • SCAN over many keys → O(n).
  • SQL is table and row based.
  • Relational queries with SELECT.
  • Supports joins and aggregates.
  • Index lookup → O(log n).
  • Full table scan → O(n).

CRUD in Redis vs SQL

  • Create: HSET weather hash.
  • Read: HGETALL / HGET.
  • Update: HSET same key.
  • Delete: DEL weather:101.
  • Single key ops ~O(1).
  • Global searches need SCAN → O(n).
  • INSERT INTO weather (...).
  • SELECT * FROM weather ...
  • UPDATE weather SET ...
  • DELETE FROM weather ...
  • Indexed search → O(log n).
  • Full scan → O(n).

Find One Weather Record


HGETALL weather:101
  • Direct key lookup for id 101.
  • Average cost ~O(1).

SELECT * FROM weather
WHERE id = 101;
  • With index on id → O(log n).
  • Without index → O(n) scan.

Performance Question (MCQ)

Fastest way to get yesterday's weather for id 101?

  • A: Redis SCAN all keys then filter by date field.
  • B: Redis HGETALL weather:101.
  • C: SQL full scan on weather table by date.
  • D: SQL indexed lookup on (id, date).
  • A and C read many rows → O(n).
  • B uses Redis key → ~O(1).
  • D uses SQL index → O(log n).
  • Best for id 101: B or D.

Write the Query (Redis + SQL)

Task:

  • Get all fields for weather id 101.
  • Write Redis command and SQL query.
  • Key lookup is O(1) in Redis, O(log n) with index in SQL.

HGETALL weather:101

SELECT * FROM weather
WHERE id = 101;
ClickHouse vs SQL Recap

Weather Aggregates: ClickHouse vs Row-Store

ClickHouse (Daily Aggregates)

  • Pre-aggregated table: weather_daily_city.
  • One row per city per day.
  • Built via materialized view.
  • Query cost vs raw data ~O(1) (fixed small set).

-- ClickHouse aggregate row
city  | date       | avg_temp | max_temp | count
-------------------------------------------------
Stock | 2025-08-01 |   87.2   |   102    | 1440

Row-Store SQL (Raw Readings)

  • Raw table: weather_raw.
  • Many rows per city per day.
  • Aggregates computed at query time.
  • Full scan over readings ~O(n).

-- Row-store raw row
city  | date       | time  | temp | hum
----------------------------------------
Stock | 2025-08-01 | 12:00 |  92  | 40

Why ClickHouse Feels O(1) for Aggregates

  • Aggregates precomputed into small tables.
  • Query reads a few rows, not millions.
  • Fixed-size result per day/city.
  • Work vs original N readings ~O(1).
  • Row-store computes aggregates on raw data.
  • Must scan many readings each query.
  • Cost grows with history size.
  • Work vs N readings ~O(n).

Aggregate Pipeline: ClickHouse vs Row-Store

  • INSERT raw readings into weather_raw.
  • Materialized view updates weather_daily_city.
  • SELECT aggregates from small table.
  • Read queries on aggregates ~O(1).
  • INSERT into single raw table.
  • No pre-aggregate table.
  • SELECT uses GROUP BY on raw data.
  • Aggregate queries ~O(n) over readings.

Daily Average Temperature per City


-- ClickHouse (pre-aggregated)
SELECT city, avg_temp
FROM weather_daily_city
WHERE date = '2025-08-01';
  • Reads a few rows from aggregate table.
  • Work vs raw N readings ~O(1).

-- Row-store (raw)
SELECT city,
       AVG(temp) AS avg_temp
FROM weather_raw
WHERE date = '2025-08-01'
GROUP BY city;
  • Scans all readings for that date.
  • Work vs N daily readings ~O(n).

Performance Question (MCQ)

Average temperature per city for 3 years of data.

  • A: Row-store SQL on weather_raw (GROUP BY).
  • B: ClickHouse on weather_raw (GROUP BY).
  • C: Row-store SQL on daily aggregate table.
  • D: ClickHouse on daily aggregate table weather_daily_city.
  • A, B: scan many raw rows → O(n).
  • C: smaller table, but still grows with days → ~O(n) in days.
  • D: fixed-size aggregates per city, optimized columnar read.
  • With bounded #cities and pre-aggregates, query cost ~O(1) vs raw N.

Write the Aggregate Query

Task:

  • Return max temperature per city for August 2025.
  • Use ClickHouse aggregate table and row-store raw table.
  • ClickHouse aggregate query ~O(1) vs raw N readings.

-- ClickHouse (aggregate table)
SELECT city,
       max_temp
FROM weather_daily_city
WHERE date >= '2025-08-01'
  AND date <= '2025-08-31';

-- Row-store (raw table)
SELECT city,
       MAX(temp) AS max_temp
FROM weather_raw
WHERE date BETWEEN '2025-08-01'
              AND '2025-08-31'
GROUP BY city;
Recap Faiss and SQL

Vector Storage: FAISS vs SQL

FAISS Index

  • Stores dense vectors in an index.
  • Example: weather embedding per city.
  • Precomputed structure for fast search.
  • Query cost vs raw scan ~O(1) (fixed probes).

# vector for a city
[0.12, -0.30, 0.75, ...]

SQL Table with Vectors

  • Table: weather_vectors.
  • One row per city.
  • Vector stored as array/JSON/BLOB.
  • Brute-force distance scan ~O(n).

city  | vec
--------------------------
Stock | [0.12, -0.30, ...]

Why FAISS Feels O(1) for Similarity Search

  FAISS:
  • Index partitions space (IVF, HNSW, etc.).
  • Search probes only a few clusters (see the IVF sketch below).
  • For fixed index params, work per query is bounded.
  • Cost vs n vectors ~O(1) (approximate).

  SQL:
  • No vector index by default.
  • Must compute distance to many rows.
  • Work grows with number of vectors.
  • Cost vs n vectors ~O(n).
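
A minimal sketch of the bounded-probe behavior, using a FAISS IVF index over random vectors; the dimension, nlist, and nprobe values are illustrative choices, not the course's fixed parameters.

# Sketch: IVF partitions the vectors into nlist clusters; each query
# scans only nprobe clusters, so per-query work stays roughly fixed as n grows.
import numpy as np
import faiss

d, n = 64, 100_000                                  # dimension and corpus size (illustrative)
xb = np.random.random((n, d)).astype("float32")     # stored vectors
xq = np.random.random((1, d)).astype("float32")     # one query vector

quantizer = faiss.IndexFlatL2(d)                    # coarse quantizer defining the clusters
index = faiss.IndexIVFFlat(quantizer, d, 256)       # nlist = 256 clusters
index.train(xb)                                     # learn the partitioning
index.add(xb)                                       # add all vectors

index.nprobe = 8                                    # fixed clusters probed per query
D, I = index.search(xq, 10)                         # approximate top-10 neighbors
print(I[0])                                         # ids of the nearest vectors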

Vector Pipeline: FAISS vs SQL Only

  FAISS + SQL:
  • Store metadata in SQL (city, region).
  • Store vectors in FAISS index.
  • Search in FAISS, then join IDs to SQL (sketched below).
  • Nearest neighbor queries ~O(1) vs n.

  SQL only:
  • Store everything in one SQL table.
  • No dedicated vector index.
  • Compute distance for many rows.
  • Nearest neighbor queries ~O(n).
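
A minimal sketch of that hybrid pipeline, assuming the FAISS insertion order doubles as the SQL id; the weather_meta table name and sample rows are made up for illustration.

# Sketch: metadata in SQL, vectors in FAISS; search FAISS first,
# then join the returned ids back to SQL for city/region details.
import sqlite3
import numpy as np
import faiss

# SQL side: metadata keyed by the same id used as the FAISS position.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather_meta (id INTEGER PRIMARY KEY, city TEXT, region TEXT)")
conn.executemany("INSERT INTO weather_meta VALUES (?, ?, ?)",
                 [(0, "Stockton", "CA"), (1, "Sacramento", "CA"), (2, "Reno", "NV")])

# FAISS side: one embedding per city, added in id order.
d = 4
vecs = np.random.random((3, d)).astype("float32")
index = faiss.IndexFlatL2(d)
index.add(vecs)

# Query: nearest 2 cities for an embedding q, then join the ids to metadata.
q = np.random.random((1, d)).astype("float32")
D, I = index.search(q, 2)
ids = [int(i) for i in I[0]]
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT id, city, region FROM weather_meta WHERE id IN ({placeholders})", ids
).fetchall()
print(rows)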

Find Top-k Similar Weather Patterns


# FAISS: query q (1 x d float32 array), get top-k
k = 10
D, I = index.search(q, k)
# I = vector ids for nearest cities
  • Index search with fixed probes.
  • Cost vs n vectors ~O(1) (approximate).

-- SQL pseudo (no vector index)
SELECT city,
       distance(vec, :q) AS d
FROM weather_vectors
ORDER BY d
LIMIT 10;
  • Distance computed for many rows.
  • Cost vs n vectors ~O(n).
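
For contrast, the brute-force scan that the SQL pseudo-query above stands for can be written directly with NumPy; the sizes are illustrative, and every query touches all n rows.

# Sketch: brute-force top-10 by L2 distance over all stored vectors (~O(n) per query).
import numpy as np

n, d = 10_000, 64                                   # illustrative sizes
vecs = np.random.random((n, d)).astype("float32")   # all stored vectors
q = np.random.random(d).astype("float32")           # query vector

dists = np.linalg.norm(vecs - q, axis=1)            # distance to every row
top10 = np.argsort(dists)[:10]                      # ORDER BY d LIMIT 10
print(top10)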

Performance Question (MCQ)

Task: Top 10 similar cities for a weather embedding.

  • A: SQL, full scan, ORDER BY distance.
  • B: SQL, prefilter by region, then full scan.
  • C: FAISS index with fixed probes.
  • D: FAISS + cached results for common queries.

  Answer:
  • A, B: distance on many rows → O(n).
  • C: bounded work per query → ~O(1) vs n.
  • D: cache makes repeated queries almost O(1) lookups.
  • Best practical answer: C (or D if cache fits use case).

Write the Vector Query

Task:

  • Return ids of 5 nearest cities for query vector q.
  • Write FAISS code and SQL-style pseudo query.
  • FAISS behaves ~O(1), SQL ~O(n).

# FAISS
k = 5
D, I = index.search(q, k)
# I = nearest vector ids

-- SQL pseudo
SELECT id
FROM weather_vectors
ORDER BY distance(vec, :q)
LIMIT 5;

Lessons Learned on Database Pipelines
  1. Merging patient data in an MPI
  2. When tests become private health data
  3. Impact of mobile-first design on performance
  4. Sharding sites by region for speed
  5. The risks of delaying your sharding
  6. Using market data for backlog priorities
  7. Encryption's impact on your ETL flow
  8. Why raw data belongs in your lake
  9. Pipeline bottlenecks increase latency
  10. Why sensor data is hard to verify
  11. Why devs change code over DB configs
  12. Why features often beat out quality
  13. Why pipelines need active monitoring

Questions
Lab 4 and Homework 4

Lab 4 – Question 1

Lab 4 Q1

Lab 4 – Question 2

Lab 4 Q2

Lab 4 – Question 3

Lab 4 Q3

Lab 4 – Question 4

Lab 4 Q4

PowerPoint Download

Use the link below to access the Lab 4 presentation:

GitHub – Lab 4 Presentation

Lab 5 and Homework 5

Project Part 2 – DB Pipeline Implementation (due Nov 25)

  • Implement Dashboard
  • Connect to Data Pipeline
  • Visualize Key Metrics
  • Team-based Collaboration
  • Course Presentation
Dashboard Overview
Final Exam Date
Questions

NoSQL, Key-Value Store, Document Store, Column Store, Graph Database, CAP Theorem, BASE Model, Eventual Consistency, ACID Properties, Strong Consistency, Data Warehouse, Sharding, Replication, Partitioning, Data Mart, Fact, Dimension, Volume, Velocity, Variety, Veracity, Value, Horizontal Scaling, Vertical Scaling, Consistency, Availability, Partition Tolerance, Vector Clock, Schema-less, Indexing, Secondary Index, Primary Key, Neo4j, Compound Key, MapReduce, Aggregation, Query Engine, Query Planner, Execution Plan, CRUD, Insert, Update, Delete, Read, Write Concern, Read Concern, Consistency Level, Consistency, onupgradeneeded, createObjectStore, keyPath, autoIncrement, transaction, objectStore.add, createIndex, Latency, Throughput, Fault Tolerance, Failover, Replica Set, Leader Election, Cluster, Data Center, Geo-Replication, Document Model, BSON, XML Store, Wide Column, Super Column, Column Family, CQL, MQL, Cypher, Graph Traversal, Property Graph, RDF, Triple Store, Relationship, Node, Edge, NoSQL, Index-Free Adjacency, Query Optimization, Materialized View, Denormalization, Data Redundancy, Write Amplification, Compaction, Commit Log, Snapshot, Backup, Restore, Hot Data, Cold Data, Data Lake, Data Warehouse, ETL, ELT, Streaming, Batch Processing, Transformation, Lambda Architecture, Pub/Sub, Message Queue, Idempotency, Conflict Resolution, Event Sourcing, CQRS, Distributed Cache, In-Memory Database, Time-Series Database, Search Index, Map, Filter, Reduce, Inverted Index, Full-Text Search, Accuracy, Completeness, Consistency, Timeliness, Validity, Uniqueness, Integrity, Usability, Atlas, Compass, Collection, find, updateOne, deleteOne, deleteMany, aggregate, $match, $group, $sort, $limit, $set, $inc, MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->(friend), MATCH (bob:Person {name: 'Bob'})-[:FRIEND]->(friend), MATCH (alice:Person {name: 'Alice'}), MATCH (bob:Person {name: 'Bob'}), $jsonSchema, createCollection, createIndex, IndexedDB, UUID, ObjectId

Questions?

Thanks

Course Syllabus – Fall 2025

  • Course Code: COMP 263
  • Units: 3
  • Time: Tue/Thu 10:00–11:20 AM
  • Location: Baun Hall 214

Instructor & TA

  • Instructor: Dr. Berhe, CTC 117
    Email: sberhe@pacific.edu
  • TA: Mayuresh Sanjay More
    Email: m_more1@u.pacific.edu
    Office Hours: TBD

Course Resources

Course Description

This course explores managing large volumes of unstructured and evolving data. Emphasis on hands-on labs, projects, and real-world domains (IoT in agriculture, automotive, retail, healthcare). Students contrast modern challenges with traditional SQL concepts.

Course Goals by Role

  • Strategic: Database architecture decision-making
  • Operational: Query NoSQL databases
  • Data Architect: Evaluate technologies
  • Data Engineer: Build pipelines
  • Big Data Expert: Handle large datasets
  • Data Scientist: Analytics with unstructured data
  • Data Analyst: Define KPIs & dashboards

Grading Schema

  • Attendance: 5%
  • Homework: 15%
  • Labs: 15%
  • Project: 30%
  • Midterm Exam: 15%
  • Final Exam: 20% (cumulative)

Policies

  • Late Work: 20% penalty; extensions only if requested before deadline by email.
  • Attendance: Required and tracked; email in advance if absent.
  • Collaboration & AI: Collaboration allowed on labs; work must be your own. AI tools allowed if you understand the output.

University Policies

  • ADA Accommodations: Contact SSD (ssd@pacific.edu).
  • Honor Code: Academic honesty required; violations reported.
  • Assessment: Student work may be retained for program evaluation.
  • Support Services: Tutoring, Writing Center, Care Managers.