
Welcome
Database Management Systems 2


Course COMP 263
10:00 AM - 11:20 AM, Tuesday and Thursday
Location: Baun Hall 214

Week Meeting Dates (TR) Topic Notes Assignments
Week 1 Aug 26, Aug 28 Unstructured and Evolving Data 🔗 IndexedDB Lab 1, HW 1, Roster Verification
Week 2 Sept 2, Sept 4 Data Modeling and Engineering 🔗
Week 3 Sept 9, Sept 11 Data Quality and Standards 🔗 MongoDB Lab 2, HW 2, Project Part 1: Data Lake Queries
Week 4 Sept 16, Sept 18 Data Transformation and Processing 🔗
Week 5 Sept 23, Sept 25 Database Pipeline Architecture and Data Lake 🔗 Neo4j
Week 6 Sept 30, Oct 2 Database Pipeline Architecture and Data Warehouse 🔗 Lab 3, HW 3
Week 7 Oct 7, Oct 9 Database Pipeline Architecture and Performance / Security🔗 Redis
Week 8 Oct 14, Oct 16 Midterm Preparation Midterm: Oct 16th
Week 9 Oct 21, Oct 23 Database Pipeline Architecture and Data Dashboard 🔗 ClickHouse
Week 10 Oct 28, Oct 30 Database Sharding and Replication Cassandra Lab 4, HW 4, Project Part 2
Week 11 Nov 4, Nov 6 Database Migration from SQL to NoSQL 🔗
Week 12 Nov 11, Nov 13 Database Pipeline and Security 🔗 Faiss Lab 5, HW 5
Week 13 Nov 18, Nov 20 Course Review
Week 14 Nov 25, Nov 27 (Thanksgiving) Finals Preparation
Week 15 Dec 2, Dec 4 Project Presentation Final: Dec 4th from 10-11:20 AM in Baun Hall 214
Week 16 Dec 9, Dec 11 Finals Week Final: Dec 11th from 10-11:20 AM in Baun Hall 214

Today

  • What changed?
  • Why DMS 2?
  • Why IndexedDB?

Data Management Before 2000

  • Limited Internet Connectivity
  • Isolated Systems
  • Manual Data Entry
  • From Paper Tables to SQL Tables
  • Static Data Structure
  • Predefined Schemas Before Data Entry
Data Management Illustration

Early 2000s: Data Shift

  • Connectivity: Internet, NFC, BLE, Wi-Fi
  • Human Data: Real-time user data
  • Machine Data: Real-time IoT sensor data
  • Mobile Device: Human and Machine data
Data Shift Illustration

Connectivity and Automation 🙂

  • Healthcare (Scale, Glucose)
  • Agriculture
  • Automotive
  • City (Parking, Energy, Environment)
  • Retail (Stock, JITD)
  • Home (Meters, Laundry Machine)
  • Buildings & facilities
  • Industrial IoT (manufacturing)
  • Utilities (water/gas)
  • Logistics & supply chain
  • Wearables & medical devices
Connectivity and Automation
smart things

Connectivity and Automation 🤔

Connectivity and automation led to increased dependency on complex software systems (Sensors → Embedded IoT → Phone/App → Backend/Cloud) across industries.

Connectivity and Automation 😠

Software components are continuously updated independently of the other components that depend on them.

Updating Relational Schemas

A static, pre-defined data schema does not scale in large software systems

Example in Retail...

Point of Sale System

A point of sale SQLite database should flexibly handle scanned data, adapting to changes in data structure without schema updates.

Point of Sale System
(Image 1: Sunday, Image 2: Friday, Image 3: Monday)
flowchart LR

  %% Producers 
  subgraph PROD[Producers]
    PUMA[Puma SQL 🛢️]
    NIKE[Nike SQL 🛢️]
  end

  %% Backend 
  subgraph BE[Backend Service]
    API[Backend API]
  end

  %% Frontend
  subgraph FE[Frontend Phone App]
    APP[Mobile App]
    LDB[Local SQL DB 🛢️]
  end

  %% Data flows
  PUMA -->|JSON feed| API
  NIKE -->|JSON feed| API

  APP <--> |JSON requests/responses| API

  APP -->|SQL queries| LDB

  API -->|Sync JSON| APP
  APP -->|Send JSON| API
    
{
  "product": "t-shirt",
  "brand": "Puma",
  "size": "M",
  "price": 29.99,
  "stock": 120,
  "color": "orange"   ← added today
}
      
INSERT INTO products 
  (product, brand, size, price, stock, color)
VALUES 
  ('t-shirt', 'Puma', 'M', 29.99, 120, 'orange');
      
⚠️ Mismatch:
JSON has "color"
Table has no "color" column

ERROR: column "color"
does not exist
      

From SQL to NoSQL...

Carlo Strozzi "NoSQL" (1998)

  • No SQL: relational DB without SQL.
  • Unix style: text tables + pipes.
  • Portable: awk/sed/grep tooling.
  • Lightweight: small, script-driven RDBMS.
  • Still relational: algebra via scripts.
  • Focus: simplicity over complexity.
  • Against ORMs: avoided heavy abstraction.
  • Open format: plain text over binaries.
flowchart TB
    SQL[SQL Queries] --> RDBMS[(Relational DB)]
    S1[awk] --> RDBMS
    S2[sed] --> RDBMS
    S3[grep] --> RDBMS
      
Not Only SQL

Schema-less NoSQL – 2000s

  • Google Bigtable (2006): column-oriented
  • Amazon Dynamo (2007): key-value store
  • FB Cassandra (2008): Bigtable + Dynamo
  • 10gen MongoDB (2009): document store
  • W3C IndexedDB (2010): key-value store
  • Shift in meaning: No SQL
  • Use cases: scalability, flexibility, web & IoT
flowchart TB
  MQL[MQL MongoDB] --> DB[NoSQL Data Stores]
  GraphQL[GraphQL] --> DB
  JSIDB[JavaScript IndexedDB] --> DB
  Redis[Redis Commands] --> DB
  CQL[CQL Cassandra] --> DB
      

Why is Data Unstructured and Evolving?

Software → Hardware → Data
Hardware → Software → Data
Hardware → Data → Software

Today, Thursday Aug 28th

  1. Big Data Characteristics (5V)
  2. Browser and IndexedDB Architecture
  3. Lab Use Case Examples
  4. Homework 1 and Lab 1

Volume

  • Massive data size
  • Petabytes to exabytes
  • Scalability critical
Example | RDBMS Limitation
FB: ~4 PB/day | No horizontal scaling
IoT sensors: TB/hour | Storage bottlenecks

Velocity

  • Real-time streams
  • Low-latency ingest
  • Continuous updates
Example | RDBMS Limitation
NYSE: ms updates | No real-time support
Twitter: 6K tweets/sec | Slow insert rate

Variety

  • Structured & unstructured
  • Multiformat input
  • No fixed schema
Example | RDBMS Limitation
Logs, images, JSON | Strict schemas only
Emails, videos | No native support

Veracity

  • Uncertain accuracy
  • Missing/incomplete values
  • Low signal-to-noise
Example | RDBMS Limitation
User-entered data | Rejects nulls/invalid
Web scraping | Requires clean data

Value

  • Actionable insight
  • Advanced analytics
  • Complex queries
Example | RDBMS Limitation
ML pipelines | No parallel compute
BigQuery/Spark jobs | Slow joins & scans
  • Weather forecasting
    ✓ Velocity: because data changes rapidly (e.g., satellite, radar)
  • Traffic navigation
    ✓ Velocity: live GPS and sensor updates
  • Product recommendations
    ✓ Value: insights from user behavior
  • Social media analysis
    ✓ Variety: text, images, video, audio
  • Smart farm sensors
    ✓ Volume: constant data from many devices

Today, Thursday Aug 28th

  1. Big Data Characteristics (5V) ✓
  2. Browser and IndexedDB Architecture
  3. Lab Use Case Examples
  4. Homework 1 and Lab 1
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

IndexedDB Timeline

  • 2010: W3C introduces IndexedDB as a web standard for client-side storage
  • 2011: First browser implementations enable offline apps
  • 2014: IndexedDB 2.0 draft improves performance & API
  • 2016: Broad adoption across Chrome, Firefox, Edge, Safari
  • 2018: IndexedDB 2.0 becomes official W3C recommendation
  • 2020s: Continuous improvements & optimizations

IndexedDB vs SQL: Structure

IndexedDB (Browser)
Object Store: items
keyPath: id
indexes: by_name, by_price
Record (JSON-like):
{
  id: 1,
  name: "Chair",
  price: 49,
  stock: 10
}
⇄
SQL Database
Table: items
CREATE TABLE items (
  id    INTEGER PRIMARY KEY,
  name  TEXT,
  price REAL,
  stock INTEGER
);
Row:
(1, 'Chair', 49, 10)

CRUD: IndexedDB ↔ SQL

IndexedDB
// CREATE
store.add({id: 1, name: "Chair", price: 49, stock: 10});

// READ
store.get(1);

// UPDATE
store.put({id: 1, name: "Chair", price: 59, stock: 10});

// DELETE
store.delete(1);
↔
SQL
-- CREATE
INSERT INTO items VALUES (1, 'Chair', 49, 10);

-- READ
SELECT * FROM items WHERE id = 1;

-- UPDATE
UPDATE items SET price = 59 WHERE id = 1;

-- DELETE
DELETE FROM items WHERE id = 1;

IndexedDB Event Flow

1) const req = indexedDB.open("furnitureDB", 1);
(CREATE DATABASE furnitureDB)
2) req.onupgradeneeded = function (event) { }
(CREATE TABLE items ...)
3) req.onsuccess = function (event) { }
(CONNECT)
4) req.onerror = function (event) { }
(CONNECTION ERROR)
5) const tx = db.transaction("items", "readwrite");
(BEGIN TRANSACTION)
6) const store = tx.objectStore("items");
(SELECT FROM items)
7) request.onsuccess = function (event) { }
(QUERY OK)
8) request.onerror = function (event) { }
(QUERY ERROR)
9) tx.oncomplete = function (event) { }
(COMMIT)
10) tx.onerror = function (event) { }
(ROLLBACK)
11) tx.onabort = function (event) { }
(ROLLBACK)
12) db.onversionchange = function (event) { }
(ALTER DATABASE / SCHEMA MIGRATION)

CRUD (Object Store API)

store.add(object);
(INSERT INTO items VALUES (...))
store.get(key);
(SELECT * FROM items WHERE id = ?)
store.getAll();
(SELECT * FROM items)
store.openCursor();
(SELECT * FROM items ORDER BY id)
store.put(object);
(UPDATE items SET ... WHERE id = ?)
store.delete(key);
(DELETE FROM items WHERE id = ?)
store.clear();
(DELETE FROM items)
store.index("by_name").getAll(query);
(SELECT * FROM items WHERE name = ?)

IndexedDB Resources

GitHub Repository: SE4CPS/2025-COMP-263

API Documentation: MDN IndexedDB API

Coding

β–Ά OneCompiler.com
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

Browser Support for IndexedDB

  • Feature set: getAll(), store/index rename
  • Support: Chrome, Edge, Firefox, Opera
  • Safari: support on current macOS/iOS
  • Mobile: Android and iOS support
  • Takeaway: Safe to use in modern browsers
  • Documentation: caniuse.com
Browser support matrix for IndexedDB improvements

Why IndexedDB is I/O

  • Disk: data stored on device, not memory
  • Async API: returns Promises/events
  • Non-blocking: avoids freezing UI
  • Transactions: safe grouped reads/writes
  • Sandboxed: per-tab storage → still disk I/O
flowchart LR
  subgraph TAB["Browser Tab (JavaScript)"]
    A["async call β†’ Promise"]
    B["await (until resolve)"]
  end

  subgraph PROC["Browser Process"]
    IDB["IndexedDB backend"]
    STOR["Storage service"]
  end

  subgraph OS["Operating System"]
    FS["File System (disk I/O)"]
  end

  A --> IDB
  B -.-> A
  IDB --> STOR --> FS
  FS --> STOR --> IDB
  IDB -->|"completion event"| A

IndexedDB and I/O

  • Does the browser know if IndexedDB data will return from disk I/O?
    ➜ No. Until the request runs, the browser cannot know if a matching record exists.
  • Does the browser know when the IndexedDB read/write will complete?
    ➜ No. Disk latency, scheduling, and transactions make timing unpredictable.
  • What does a Promise from IndexedDB always return to JS?
    ➜ It always returns a handle (the Promise object) immediately, which will later settle (resolve/reject) with the result.
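
The answer above assumes a Promise-style API. Native IndexedDB actually hands back an IDBRequest that fires events; the sketch below (an illustration, not part of any lab starter code) shows how such a request can be wrapped in a Promise so the caller gets the handle immediately and the value later.

// Minimal sketch: wrap an IDBRequest in a Promise.
// The Promise (the handle) is returned right away; it settles
// only when the underlying disk I/O completes.
function promisify(request) {
  return new Promise((resolve, reject) => {
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Usage (assuming an open object store named "items"):
// const item = await promisify(store.get(1));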

Access Control: Browser, Tab, Sandbox

  • Browser process: master controller, manages memory, networking, disk I/O
  • Tab = renderer process: each tab runs in its own thread/process
  • Sandboxed: tabs are isolated for security → no direct memory sharing
  • IndexedDB: each tab/origin gets its own database namespace
  • Async I/O: tab makes request → browser process handles disk → result returned via promise
flowchart TB
  BPROC["Browser Process"]
  TAB1["Tab 1: Renderer
(sandboxed)"] TAB2["Tab 2: Renderer
(sandboxed)"] DB["IndexedDB Storage
(per-origin, on disk)"] BPROC --> TAB1 BPROC --> TAB2 TAB1 -->|async request| DB TAB2 -->|async request| DB

IndexedDB Access Control

  • Can one browser tab read another tab’s IndexedDB data?
    ➜ No. Each tab is sandboxed; access is restricted by the same-origin policy
  • What defines the database namespace in IndexedDB?
    ➜ The origin (protocol + host + port). Databases are scoped per origin, not per tab
  • How can multiple tabs of the same site coordinate IndexedDB access?
    ➜ Through transactions, locks, and optional cross-tab messaging (e.g., BroadcastChannel)
  • Is IndexedDB a single-user or multi-user database?
    ➜ Single-user, per-browser. It is local to one user, not shared across browsers, accounts, or devices
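
As a small illustration of the cross-tab messaging mentioned above, the sketch below uses BroadcastChannel; the channel name "items-sync" is made up for the example.

// Minimal sketch: two same-origin tabs coordinating IndexedDB access.
const channel = new BroadcastChannel("items-sync");

// Tab A: announce that it wrote new data to IndexedDB
channel.postMessage({ type: "items-updated", id: 1 });

// Tab B: re-read the record from IndexedDB when notified
channel.onmessage = (event) => {
  if (event.data.type === "items-updated") {
    console.log("Another tab updated item", event.data.id);
    // e.g., store.get(event.data.id) to refresh this tab's view
  }
};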
%%{init: {'flowchart': { 'htmlLabels': true, 'useMaxWidth': true, 'wrap': true, 'curve': 'linear', 'nodeSpacing': 40, 'rankSpacing': 60 }}}%%
flowchart TB

classDef box fill:#eef,stroke:#333,stroke-width:1px,color:#000;
classDef os  fill:#fffdcc,stroke:#333,stroke-width:1px,color:#000;

APP["Layer 7: Application
(apps: browser (IndexedDB))"]:::box OS["Operating System
(spans multiple layers)"]:::os US["Layers 5–6
User-space libraries & services
(session, TLS, encoding)"]:::box KRN["Layers 2–4
Kernel networking stack
(L2/L3/L4)"]:::box HW["Layer 1: Physical
(NIC, cable, radio)"]:::box APP --> OS OS --> US OS --> KRN US --> HW KRN --> HW %% Explicit widths so labels never clip style APP width:760px style OS width:760px style US width:360px style KRN width:360px style HW width:760px

Why await is Required with IndexedDB

  • JavaScript is async by default β†’ non-blocking, event loop driven
  • IndexedDB CRUD = I/O-bound β†’ returns a Promise
  • Dependency: Read after Write β†’ must wait until write commits
  • await order β†’ prevents stale/undefined data

// Async default: does not wait
store.put({id: 1, name: "Chair"});
let item = store.get(1);   // ❌ may run before put finishes

// Correct with await
await store.put({id: 1, name: "Chair"});
let item = await store.get(1); // ✅ read after write

IndexedDB and Performance

  • How does IndexedDB impact performance of a web app?
    ➜ IndexedDB operations are asynchronous I/O. They do not block the UI, but frequent dependent reads/writes or large payloads require waiting (await) and can add latency and increase memory/disk usage.
  • How can performance be improved?
    ➜ Use indexes for efficient queries, batch writes instead of many single puts, and cache in runtime memory (e.g., a local array of objects) when possible.
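
A small sketch of the batching advice above: group many writes into a single readwrite transaction. The itemsToSave array, the "items" store name, and the open db handle are assumptions taken from the earlier slides.

// Minimal sketch: one transaction for many writes instead of one per put.
const tx = db.transaction("items", "readwrite");
const store = tx.objectStore("items");

for (const item of itemsToSave) {   // itemsToSave: in-memory array (cache)
  store.put(item);                  // all puts queued in the same transaction
}

tx.oncomplete = () => console.log("Batch committed");
tx.onerror = () => console.error("Batch failed", tx.error);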

Questions

IndexedDB and ACID Properties

Property Meaning IndexedDB
Atomicity All operations in a transaction succeed or all fail. Yes: transactions are atomic.
Consistency Database moves from one valid state to another. Yes: enforced within a transaction.
Isolation Concurrent transactions do not interfere. Partial: isolation within one tab, but multiple tabs can interfere.
Durability Once committed, data persists even after crash/power loss. Yes: backed by disk storage.

ACID Properties: Questions

  • Atomicity: What happens if one operation in a transaction fails?
    ➜ The whole transaction is rolled back; nothing is committed.
  • Consistency: If a store has keyPath: "id", autoIncrement: true, what happens if you try to insert two records with the same id?
    ➜ The transaction fails: IndexedDB enforces uniqueness and prevents an invalid state.
  • Isolation: If two tabs write to the same object store at the same time, do their transactions interfere?
    ➜ Within one tab, operations are isolated; across tabs, race conditions may occur.
  • Durability: After committing a transaction, will the data still exist after closing the browser and reopening it?
    ➜ Yes, data is persisted to disk (until explicitly cleared).

What happens every time the page loads?

// Open DB
const req = indexedDB.open("furnitureDB", 1);

req.onupgradeneeded = e => {
  let db = e.target.result;
  db.createObjectStore("items", { keyPath: "id", autoIncrement: true });
};

req.onsuccess = e => {
  let db = e.target.result;
  let tx = db.transaction("items", "readwrite");
  let store = tx.objectStore("items");
  store.add({ name: "Chair" });   // Insert one record each page load
};

Avoiding Duplicate Inserts

  • Problem: autoIncrement inserts a new record each page load
  • Fix: first check if record exists, then update (put) or add

let tx = db.transaction("items", "readwrite");
let store = tx.objectStore("items");

let req = store.get(1); // Check first

req.onsuccess = e => {
  if (e.target.result) {
    store.put({ id: 1, name: "Chair" }); // Record exists β†’ update
  } else {
    store.add({ id: 1, name: "Chair" }); // Not found β†’ insert
  }
};

Questions

Why is it called "IndexedDB"?

Why "IndexedDB"?

  • IndexedDB is a key–value store
  • Every object store requires a primary key
  • The primary key is automatically indexed
  • Lookups by key are O(log n) (via B-tree)
  • It’s an indexed database by default

// Create store with primary key
let store = db.createObjectStore("items", { 
  keyPath: "id", autoIncrement: true 
});

// Primary key "id" is automatically indexed
store.add({ name: "Chair", price: 49 });

// Fast lookup by id
store.get(1);  // uses the implicit index

Database Indexing

  • Without index β†’ full scan: O(n)
  • With index (B-tree) β†’ search: O(log n)
  • Index = sorted structure (keys, pointers)
  • Fast read, slow create, update, delete
B-tree index illustration
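
Because the index is a sorted structure, it also supports range lookups. A minimal sketch, assuming an open db handle, the "items" store, and the "by_price" index from the structure slide above:

// Range query over a sorted index (B-tree): price between 10 and 50
const tx = db.transaction("items", "readonly");
const idx = tx.objectStore("items").index("by_price");

const range = IDBKeyRange.bound(10, 50);   // inclusive bounds
const req = idx.getAll(range);

req.onsuccess = () => console.log(req.result); // only records in the range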

IndexedDB and UNIQUE Index

  • UNIQUE index enforces no duplicates
  • Created when defining the object store
  • Lookup is fast: O(log n) using a B-tree
  • E.g., fields like email or username
  • Insert duplicate value β†’ transaction fails
const req = indexedDB.open("furnitureDB", 1);
req.onupgradeneeded = e => {
  let db = e.target.result;
  let store = db.createObjectStore("items", { 
    keyPath: "id" 
  });
  store.createIndex("by_name", "name", { 
    unique: true 
  });
};
let tx = db.transaction("items", "readonly");
let store = tx.objectStore("items");
let idx = store.index("by_name");
let req = idx.get("Chair");

Questions

Today, Thursday Aug 28th

  1. Big Data Characteristics (5V) ✓
  2. Browser and IndexedDB Architecture ✓
  3. Lab Use Case Examples
  4. Homework 1 and Lab 1

Use Case: IndexedDB as Cache

  • Cache API responses locally
  • Reduce repeated network calls
  • Enable faster page reloads
  • Improve perceived performance
IndexedDB as cache

Use Case: Offline-First Apps

  • Store user actions when offline
  • Sync changes back to server later
  • Essential for mobile and flaky networks
  • E.g., note-taking apps, messaging apps
Offline-first IndexedDB

Use Case: Large Local Datasets

  • MBs–GBs of structured data in-browser
  • Query efficiently using indexes (O(log n))
  • Keep UI responsive with async access
  • Examples: sensor readings, GIS maps
Large datasets IndexedDB

Use Case: Privacy & Access Control

  • Data is stored locally only
  • Access is restricted by the same-origin policy
  • One site can't read another site's IndexedDB
  • User can wipe data via browser settings
  • No server access unless the app explicitly syncs
Privacy and access control IndexedDB

Questions

Do you recommend IndexedDB?

  • Database must support strong aggregation (complex queries, joins)
  • Database must support unstructured data (JSON objects, nested fields)
  • Database must support multi-user access across devices
  • Database must work offline in the browser
  • Database must handle large local datasets efficiently

Today, Tuesday Sept 2nd

  1. From Tables to JSON Documents
  2. JSON Modeling Structures
  3. Design Principles
  4. Lab Use Case Examples
  5. Roster Verification
  6. Homework 1 and Lab 1
JSON Format: Frontend (JS Object) ⇄ Backend (OOP Classes) ⇄ Database (SQL Tables)

Database

Data Modeling in Tables

E. F. Codd
  • 1970s: E.F. Codd and Relational Model
  • ER Diagrams for conceptual design
  • Tables with rows & columns
  • Relationships via Foreign Keys
  • Normalization for consistency
erDiagram
  CUSTOMER ||--o{ ORDER : places
  PRODUCT ||--o{ ORDER : included_in

  CUSTOMER {
    int id
    string name
    string email
  }
  PRODUCT {
    int id
    string name
    float price
  }
  ORDER {
    int id
    date order_date
    int customer_id
    int product_id
    int quantity
  }
      

Normalization (Step 1)

  • Denormalized: Multiple values in one field
  • Normal Form 1 (1NF):
    – All attributes are atomic (no lists)
    – Each row identified by a primary key
    – No repeating groups
OrderID (PK) | Customer | Products
101 | Alice | Chair, Table
102 | Bob | Sofa

In 1NF, Products column would be split so each row has a single product.

SQL and Table Structures

  • SQL (Structured Query Language)
  • Designed to query data in tables
  • Supports filtering, joining, grouping
  • Declarative: state what, not how
  • ⚠ Risk: Missing knowledge of SQL and table structure is a critical gap

SELECT * FROM Customer;

SELECT c.name, o.order_date FROM Customer c
JOIN Order o ON c.id = o.customer_id;

SELECT p.name, SUM(o.quantity) AS total_sold FROM Order o
JOIN Product p ON o.product_id = p.id
GROUP BY p.name;
      

Backend

Data Modeling in Classes

  • Simula (1967, Ole-Johan Dahl)
  • Smalltalk (1972, Alan Kay)
  • C++ (1985, Bjarne Stroustrup)
  • Java (1995, James Gosling)
  • Classes define structure & behavior
  • Objects are instances with state
  • Relationships via composition & references
classDiagram
  class Customer {
    +int id
    +string name
    +string email
    +placeOrder(product, qty)
  }
  class Order {
    +int id
    +date orderDate
    +addItem(product, qty)
    +getTotal()
  }
  class Product {
    +int id
    +string name
    +float price
  }

  Customer "1" o-- "*" Order : places
  Order "*" *-- "*" Product : items
      

Encapsulation

  • From data bags → classes: data + methods
  • Encapsulation: hide internals behind APIs
  • Simplify communication: Conway's Law

// ❌ Before
const order = { items: [] };
order.items.push({ name:"Chair", price:50 });
console.log(order.items[0].price);

// ✅ After
class Order {
  constructor(){ this.items=[]; }
  add(p){ this.items.push(p); }
}

const o = new Order();
o.add({ name:"Chair", price:50 });
console.log(o.items[0].price);

OOP and Backend

  • Objects group data + methods
  • Query via methods (e.g., order.total())
  • Available in: Backend tier only
  • Mapping Objects: Frontend/Database
  • ⚠ Risk: Missing/Wrong OOP knowledge limits backend design

class Order {
  constructor(){ this.items=[]; }
  add(p, q){ this.items.push({p,q}); }
  get total(){ return this.items
    .reduce((s,i)=>s+i.p.price*i.q,0); }
}

const o = new Order();
o.add({name:"Chair",price:50},2);
console.log(o.total); // query object
      

Frontend

Data Modeling in JavaScript

Brendan Eich
  • 1995: JavaScript in Browser (Brendan Eich)
  • Objects as native structure
  • JSON introduced for serialization
  • Basis for frontend data exchange

const customer = {
  id: 1,
  name: "Alice",
  orders: [
    { id: 101, product: "Chair", qty: 2 },
    { id: 102, product: "Table", qty: 1 }
  ]
};
console.log(customer.name);
      

Frontend Data Normalization

  • Denormalized: nested, repetitive fields
  • Normalized JSON: break into arrays & IDs
  • Frontend: simple traversal, fast rendering

// Denormalized
{
  "id": 1,
  "name": "Alice",
  "orders": [
    { "product": "Chair", "qty": 2 },
    { "product": "Table", "qty": 1 }
  ]
}

// Normalized
{
  "customers": [{ "id": 1, "name": "Alice" }],
  "orders": [
    { "id": 101, "customerId": 1, "product": "Chair", ... },
    { "id": 102, "customerId": 1, "product": "Table", ... }
  ]
}
      

JSON / Document Modeling

  • 2000s: NoSQL & flexible schema
  • JSON documents, nested objects/arrays
  • Tree-like structures replace joins
  • Denormalization for read efficiency
  • Query using any programming language
{
  "customer": {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com",
    "orders": [
      {
        "id": 101,
        "order_date": "2025-09-01",
        "items": [
          { "product": "Chair", "price": 49.99,... },
          { "product": "Table", "price": 199.99,... }
        ]
      }
    ]
  }
}

JSON in Frontend

  • Query with JS syntax: Map-Filter-Reduce (see the sketch below)
  • Native in frontend: JSON ⇄ JS Objects
  • Basis for APIs: exchange with backend
  • ⚠ Risk: Weak JSON skills → hard to handle frontend data

// Query JSON directly
const data = {
  customer: { name: "Alice" },
  orders: [ { product:"Chair", qty:2 } ]
};

console.log(data.customer.name); // "Alice"
console.log(data.orders[0].product); // "Chair"
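
A short sketch of the Map-Filter-Reduce style mentioned in the first bullet, on an illustrative orders array (the qty and price values are made up for the example):

// Map, filter, reduce over JSON data
const orders = [
  { product: "Chair", qty: 2, price: 49 },
  { product: "Table", qty: 1, price: 199 }
];

const names = orders.map(o => o.product);                       // ["Chair", "Table"]
const bulk  = orders.filter(o => o.qty > 1);                    // orders with qty > 1
const total = orders.reduce((s, o) => s + o.qty * o.price, 0);  // 297

console.log(names, bulk, total);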
      
JSON Format: Frontend (JS Object) ⇄ Backend (OOP Classes) ⇄ Database (SQL Tables)

Today, Tuesday Sept 2nd

  1. From Tables to JSON Documents ✓
  2. JSON Modeling Structures
  3. Design Principles
  4. Lab Use Case Examples
  5. Roster Verification
  6. Homework 1 and Lab 1
  • Created: 2001 by Douglas Crockford
  • Standardized: ECMA-404 (2013), RFC 8259 (IETF)
  • Where used:
    – Frontend: JavaScript objects, APIs
    – Backend: REST, GraphQL, config
    – Databases: NoSQL (MongoDB, CouchDB), IndexedDB
  • Historical context: Lightweight alternative to XML for data exchange on the web

JSON Object

  • Curly braces { }
  • Key–value pairs
  • Keys are always strings

const customer = {
  id: 1,
  name: "Alice",
  email: "alice@example.com"
};
console.log(customer.name); // Alice
      

JSON Array

  • Square brackets [ ]
  • Ordered collection
  • Can hold any JSON values

const products = [
  "Chair",
  "Table",
  "Sofa"
];
console.log(products[1]); // Table
      

JSON String

  • Double quotes required
  • Supports Unicode
  • Escape special chars with \

const message = {
  text: "Hello, World!",
  escaped: "Line1\nLine2"
};
console.log(message.text);
      

Numbers & Booleans

  • Numbers: integer or decimal
  • Boolean: true / false
  • No quotes required

const order = {
  qty: 2,
  price: 49.99,
  available: true
};
console.log(order.price * order.qty); // 99.98
      

Null

  • Represents empty / unknown
  • Written as null
  • No quotes around it

const profile = {
  name: "Alice",
  note: null
};
console.log(profile.note); // null
      
JSON Data Types
https://www.json.org/json-en.html

Is each snippet valid JSON?


1. { "name": "Alice" }
2. { name: "Alice" }
3. { "age": 25, }
4. [ "red", "green", "blue" ]
5. [ "a", "b",, "c" ]
6. { "valid": true, "value": null }
7. { "price": 19.99, "qty": 2 }
8. { "flag": True }
9. { "nested": { "x": 1, "y": 2 } }
10. "Just a string"
  

Today, Tuesday Sept 2nd

  1. From Tables to JSON Documents ✓
  2. JSON Modeling Structures ✓
  3. Design Principles
  4. Lab Use Case Examples
  5. Roster Verification
  6. Homework 1 and Lab 1

Flat


{
  "customer_id": 1,
  "customer_name": "Alice",
  "order_id": 101,
  "product": "Chair",
  "qty": 2
}
      

Nested


{
  "customer": {
    "id": 1,
    "name": "Alice"
  },
  "order": {
    "id": 101,
    "item": {
      "product": "Chair",
      "qty": 2
    }
  }
}
      

Design Principles for JSON Data

Design Principles for JSON Data

DIKW as Nested JSON


{
  "Wisdom": {
    "Knowledge": {
      "Information": {
        "Data": [
          "leaf1",
          "leaf2",
          "leaf3"
        ]
      }
    }
  }
}
  

Today, Tuesday Sept 2nd

  1. From Tables to JSON Documents ✓
  2. JSON Modeling Structures ✓
  3. Design Principles ✓
  4. Lab Use Case Examples
  5. Roster Verification
  6. Homework 1 and Lab 1

Model JSON requirements:

https://jsoncrack.com/editor

Lab Use Case Examples

  • IoT Sensor Data: Time-series data from temperature, humidity sensors
  • Social Media Posts: User posts with text, images, comments, likes
  • Healthcare Records: Patient records with demographics, visits, prescriptions
  • Geospatial Data: Locations with coordinates, metadata, timestamps

Questions

Today, Tuesday Sept 4th

  1. JSON Modeling Structures
  2. Summary IndexedDB
  3. Homework 1 and Lab 1
  4. Roster Verification
  5. Lab Use Case Examples

JSON Data Distributed

🗄️ 📚 🕸️ 📊

Data is stored across multiple machines for scalability, fault tolerance, and availability.

Defining IDs

  • Unique – each ID is distinct, no duplicates
  • No Meaning – ID is just an identifier
  • Purpose: reference only

// ✅ Good ID
{ "id": "7f8c9d12", "name": "Alice" }

// ❌ Bad ID (meaning encoded)
{ "id": "2025-student-Alice", "name": "Alice" }
      

Defining JSON IDs

🔑 Every object should have a unique id
This ensures:

  • Reliable referencing
  • Consistent updates
  • Safe merges in distributed systems

{
  "id": "a12f-34cd-56ef",
  "name": "...",
  "type": "...",
  "createdAt": "...",
  "otherField": "..."
}
      

Defining JSON IDs

  • Business ID – human-readable, domain-specific (e.g., order#1234)
  • Auto-increment – sequential ID (1, 2, 3...)
  • UUID – globally unique 128-bit identifier

// Built-in in modern browsers
const id = crypto.randomUUID();

console.log(id);
// e.g. "3b241101-e2bb-4255-8caf-4136c566a962"
      

Why UUID is a Universal ID

  • Universally Unique Identifier – 128-bit number
  • Guarantees uniqueness across time & space
  • Divided into structured parts:
    • Timestamp / Random / Hash depending on version
    • Variant bits specify format
    • Version bits indicate generation method

UUIDs prevent collisions across distributed systems without coordination.

UUID Structure Breakdown


  123e4567-e89b-12d3-a456-426614174000
  └───────┘ └───┘ └──┘ └──┘ └────────┘
   Time     Ver   Seq  Var   Node/Random
  
  • Time – timestamp (100-ns intervals since 1582, in v1)
  • Version – 1 to 8 (defines how UUID is made)
  • Variant – layout type (e.g., RFC 4122)
  • Node/Random – MAC address or random bits
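
A tiny sketch that reads those parts back out of a UUID string (the hex positions follow the layout shown above):

// Version is the first hex digit of the 3rd group,
// the variant is encoded in the first hex digit of the 4th group.
const uuid = "123e4567-e89b-12d3-a456-426614174000";
const [, , group3, group4] = uuid.split("-");

console.log("version:", group3[0]);        // "1" -> time-based (v1)
console.log("variant nibble:", group4[0]); // "a" -> 10xx bits = RFC 4122 variant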

UUID Versions (History)

Version | Basis | Notes
v1 | Timestamp + MAC | First defined, 1990s
v2 | DCE Security | Rare, site-specific
v3 | MD5 Hash (Name-based) | Stable for same input
v4 | Random | Most common today
v5 | SHA-1 Hash (Name-based) | More secure than v3
v6, v7, v8 | Proposed (time-ordered, Unix time, custom) | 2020s improvements

ISO 8601 Date & Time Format

  • Standardized format for date & time
  • Unambiguous across regions
  • Order: YYYY-MM-DD → hh:mm:ss → timezone
  • Preferred for APIs, JSON, databases
  • Machine & human-readable

// ISO 8601 Examples

"2025-09-04" 
// date only (YYYY-MM-DD)

"2025-09-04T10:15:30Z" 
// UTC time (Z = Zulu)

"2025-09-04T10:15:30-07:00" 
// local time with offset
      

✅ Always use ISO 8601 for storing & exchanging dates

What Does the Z Mean in ISO 8601?

  • Z = Zulu time (UTC+0)
  • UTC (Coordinated Universal Time)
  • No offset from zero meridian (Greenwich)
  • Same everywhere in the world
  • For distributed DB to avoid timezone errors

// ISO 8601 with Z

"2025-09-04T10:15:30Z"
// 10:15:30 UTC

"2025-09-04T03:15:30-07:00"
// Same moment in California (UTC-7)

"2025-09-04T12:15:30+02:00"
// Same moment in Central Europe (UTC+2)
      

✅ Z ensures a single global reference, independent of local timezones.

UTC Timezones

// JavaScript (Browser)
const z = new Date().toISOString();
// e.g. "2025-09-04T13:22:11.123Z"
      

// JavaScript (Node.js)
const z = new Date().toISOString();
      

# Python 3
from datetime import datetime, timezone
z = datetime.now(timezone.utc).isoformat()
# e.g. "2025-09-04T13:22:11.123456+00:00"
      

# Ruby
require "time"
z = Time.now.utc.iso8601
      

// Java (java.time)
import java.time.*;
String z = Instant.now().toString();
// or:
String z2 = OffsetDateTime.now(ZoneOffset.UTC).toString();
      

// Go
import "time"
z := time.Now().UTC().Format(time.RFC3339)
      

// C#
string z = DateTime.UtcNow.ToString("o");
// ISO 8601, e.g. 2025-09-04T13:22:11.1234567Z
      

// PHP
$z = (new DateTime('now', new DateTimeZone('UTC')))
       ->format(DateTime::ATOM);
      

# Shell (Unix)
date -u +"%Y-%m-%dT%H:%M:%SZ"
      

Tip: Prefer ISO 8601 strings (with Z) when storing or exchanging timestamps.

Question

Embed when:

  • 1-to-1 relationship
  • Data is always accessed together
  • Example: user profile + address

{
  "userId": "u1",
  "name": "Alice",
  "address": {
    "city": "Stockton",
    "zip": "95211"
  }
}
      

Reference when:

  • 1-to-n or n-to-n
  • Entities are reused across docs
  • Example: users and groups

// 1 β†’ n (one user in multiple groups)
{ "userId": "u1", "groupIds": ["g1", "g2"] }

// n ↔ n (mutual references between users & groups)
{ "userId": "u2", "groupIds": ["g1"] }

{ "groupId": "g1", "memberIds": ["u1", "u2"] }

Optimize for Access Patterns

  • Model data around how it’s queried, not abstract theory
  • Include redundant fields if it avoids joins/lookups
  • Precompute aggregates when queries need speed
  • Design for read efficiency over write purity

// Store authorName directly with posts
{
  "postId": "p1",
  "title": "JSON Best Practices",
  "authorId": "u1",
  "authorName": "Alice"
}
  

JSON Attribute Naming Conventions

  • Booleans: start with is, has, or can → isActive, hasLicense, canEdit
  • Arrays: use plural or end with List → userList, tags, orderIds
  • Numbers: add units or context → priceUSD, ageYears, timeoutMs
  • Strings: descriptive nouns, no type suffix → userName, email, addressLine
  • Dates / Time: use ISO 8601, suffix with At → createdAt, updatedAt, expiresAt
  • Objects: use clear entity names → userProfile, billingInfo, geoLocation
  • IDs: always include id suffix → userId, orderId, sessionId
{
  "userId": "u123",
  "userName": "Alice",
  "isActive": true,
  "tags": ["student", "premium"],
  "orderIds": ["o1001", "o1002"],
  "priceUSD": 199.99,
  "createdAt": "2025-09-04T10:15:00Z",
  "userProfile": {
    "ageYears": 29,
    "email": "alice@example.com"
  }
}

Common Metadata Attributes

✅ Common fields for metadata object:

  • createdAt, updatedAt, deletedAt
  • author – creator / last editor
  • schemaVersion – track schema changes
  • source – origin system or API
  • tags – labels for categorization
  • status – draft, active, archived, etc.
  • revision – document version number
  • checksum – hash for integrity check
  • permissions – access control rules
{
  "metadata": {
    "createdAt": "2025-09-04T12:00:00Z",
    "updatedAt": "2025-09-04T12:30:00Z",
    "author": "u1", // source device
    "tags": ["json", "nosql", "big-data"],
    "status": "active",
    "checksum": "a1b2c3d4",
    "permissions": {
      "read": ["admin", "editor"],
      "write": ["admin"]
    }
  }
}
Question

Today, Tuesday Sept 4th

  1. JSON Modeling Structures ✓
  2. Summary IndexedDB
  3. Homework 1 and Lab 1
  4. Roster Verification
  5. Lab Use Case Examples

NoSQL Database Properties

Database | Type | Query Language | Data Format | ACID (A C I D) | CAP (C A P)
IndexedDB | Key-Value (browser) | JavaScript API | JSON | ✓ ✓ ✓ ✓ |
MongoDB | Document Store
Neo4j | Graph Database
Cassandra | Wide-Column Store
Redis | Key-Value (in-memory)
ClickHouse | Columnar (OLAP)
FAISS | Vector Index / Library

Homework 1 (Canvas Module 1)

Lab 1: Agricultural Data Collection with IndexedDB

In this lab, you will develop a web application that collects and stores unstructured agricultural data using IndexedDB. You will implement functionality to handle various data types, including sensor readings, images, farmer notes, GPS coordinates, and timestamps

Lab 1 Example

Lab 1: Submission Instructions

  1. Name file: lab1-lastname-XXXX.html (e.g. lab1-john-doe-1234.html)
  2. Push code to GitHub repo
  3. Add 5 unit tests + screenshot of console with retrieved data
  4. Upload GitHub link + screenshot in course portal
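
For item 3, one possible shape for a console-based unit test is sketched below (not a required format). It assumes an open db handle, the "items" store, and a promise-returning wrapper such as the promisify helper sketched earlier.

// Sample unit test: add a record, read it back, assert on the result.
async function testAddAndGet(db) {
  const tx1 = db.transaction("items", "readwrite");
  await promisify(tx1.objectStore("items").put({ id: 999, note: "test record" }));

  const tx2 = db.transaction("items", "readonly");
  const rec = await promisify(tx2.objectStore("items").get(999));

  console.assert(rec && rec.note === "test record", "add/get failed");
  console.log("testAddAndGet passed");
}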

Questions

Course Terms

NoSQL, Key-Value Store, Document Store, Column Store, Graph Database, CAP Theorem, BASE Model, ACID Properties, Eventual Consistency, Strong Consistency, Sharding, Replication, Partitioning, Horizontal Scaling, Vertical Scaling, Consistency, Availability, Partition Tolerance, Vector Clock, Schema-less, Indexing, Secondary Index, Primary Key, Unique Index, Compound Key, MapReduce, Aggregation, Query Engine, Query Planner, Execution Plan, CRUD, Insert, Update, Delete, Read, Transaction, Object Store, Async I/O, Promise, Await, Sandbox, Same-Origin Policy, JSON, BSON, Data Lake, Data Warehouse, ETL, ELT, Streaming, Batch Processing, Lambda Architecture, Kappa Architecture, Pub/Sub, Message Queue, Idempotency, Conflict Resolution, Event Sourcing, CQRS, Distributed Cache, Sharding, Replication, In-Memory Database, Time-Series Database, Search Index, Inverted Index, Full-Text Search, 5 Vs of Big Data: Volume, Velocity, Variety, Veracity, Value.

  1. What is the difference between NoSQL and SQL?
  2. What was the motivation to introduce NoSQL databases?
  3. What is the difference between a SQL database and IndexedDB?
  4. Why is IndexedDB considered asynchronous and I/O bound?
  5. What role do indexes play in both SQL and IndexedDB?
  6. How does access control work for IndexedDB in the browser?
  7. What does big data mean?
  8. How are Full Stack Engineering and JSON related in the way data flows across frontend, backend, and databases?
  9. What are the trade-offs between using a flat JSON structure vs. a nested JSON structure for representing the same data?
  10. What are the disadvantages of modeling data in JSON?

Should You Use IndexedDB?

  1. Do you need to store data locally in the browser?
  2. Do you want to work with structured objects, not just strings?
  3. Do you need to store more than 5–10 MB of data?
  4. Do you need offline access to your data?
  5. Do you need querying by keys or indexes?
  6. Do you want transactions (ACID support) in the browser?
  7. Do you need to store files/blobs (e.g., images) locally?
  8. Do you expect your app to sync later with a backend server?
  9. Do you need better performance than localStorage can provide?
  10. Is your app a progressive web app (PWA) or offline-first design?

Today, Tuesday Sept 4th

  1. JSON Modeling Structures ✓
  2. Summary IndexedDB ✓
  3. Homework 1 and Lab 1 ✓
  4. Roster Verification
  5. Lab Use Case Examples

Lab Use Case

Write a local offline database using IndexedDB to keep track of books borrowed from a library.

  • Each book should have a unique id
  • Store the title and author
  • Record the borrower
  • Save borrowed date and due date
  • Support adding, updating, deleting, and listing books

Today Agenda

  • Data Quality and Standards
  • MongoDB History & Features
  • CRUD, Operators, Practice
  • Homework, Lab, and Project

What is data quality?

How does data quality relate to Big Data?

  • Volume → Errors scale with size
  • Velocity → Errors spread fast
  • Variety → Less usability
  • Veracity → Noise & bias lower trust
  • Value → Errors reduce insight

Use Case Automotive

  • Bad data → Wrong decisions
  • Incomplete → Missed chances
  • Outdated → Irrelevant actions
  • Inconsistent → Conflicts
  • Duplicate → Wasted resources
Autonomous vehicle data

Use Case Agriculture

  • Accurate weather → Planning
  • Reliable yields → Supply
  • Clean data → Security
  • Timely updates → Less waste
Agriculture data

The quality of the data
determines the quality
of the software.

Origins of Data Quality Metrics

NIST Research Data Framework (RDaF)

  • NIST SP 1500-18r2 (2024)
  • Focus: research data management
  • Core: quality, accessibility, reuse
  • Lifecycle: Envision → Plan → Acquire → Analyze → Share → Preserve
  • Read PDF

ISO 8000 vs NIST RDaF

  • ISO 8000 – Focus: data quality in place
  • ISO 8000 – Core: accuracy, consistency, completeness
  • NIST RDaF – Focus: end-to-end research data
  • NIST RDaF – Core: quality, accessibility, reuse
Accuracy Completeness Consistency
Timeliness Validity Uniqueness
Integrity

Data Quality: Consistency

  • Uniform values across records
  • No contradictions in fields
  • Same units, codes, and formats

// ✅ Consistent
{ "id": 1, "unit": "kg" }
{ "id": 2, "unit": "kg" }

// ❌ Inconsistent
{ "id": 3, "unit": "kg" }
{ "id": 4, "unit": "lbs" }
      

Data Quality: Accuracy

  • Reflects real-world truth
  • Correct at the source
  • Free from measurement errors

// ✅ Accurate
{ "crop": "Wheat", "yieldKg": 7200 }

// ❌ Inaccurate (too high to be realistic)
{ "crop": "Wheat", "yieldKg": 7200000 }
      

Data Quality: Completeness

  • All necessary data is present
  • No missing critical fields
  • No empty required arrays

// ✅ Complete
{ "id": 1, "crop": "Maize", "soil": "Loam" }

// ❌ Incomplete (missing soil)
{ "id": 2, "crop": "Maize" }
      

Data Quality: Timeliness

  • Data is current when needed
  • No outdated values
  • Updated on schedule

// ✅ Timely
{ "crop": "Rice", "updatedAt": "2025-09-08T09:00Z" }

// ❌ Outdated
{ "crop": "Rice", "updatedAt": "2020-05-01T10:00Z" }
      

Data Quality: Uniqueness

  • No duplicate records
  • Each entity has a distinct ID
  • One fact represented only once

// ✅ Unique IDs
{ "id": 1, "crop": "Barley" }
{ "id": 2, "crop": "Millet" }

// ❌ Duplicate IDs
{ "id": 3, "crop": "Wheat" }
{ "id": 3, "crop": "Wheat" }
      

Data Quality: Validity

  • Data follows defined formats
  • Adheres to business rules
  • Within acceptable ranges

// ✅ Valid
{ "id": 10, "organic": true, "harvestDate": "2025-09-01" }

// ❌ Invalid (wrong type + bad date)
{ "id": "ten", "organic": "yes", "harvestDate": "jan-2025" }
      

Data Quality: Integrity

  • Data is accurate in its relationships
  • No broken links between entities
  • References are valid and consistent

// ✅ With integrity
{ "farmId": 1, "cropId": 101, "crop": "Maize" }
{ "cropId": 101, "name": "Maize" }

// ❌ Broken reference (farm points to missing crop)
{ "farmId": 2, "cropId": 999, "crop": "Unknown" }
      

How to validate data quality?

SW/Data Design → SW/Data Dev → SW/Data Test → SW/Data Release (Software Fixed) → Data Runtime Changing → Data Runtime Testing

Data Quality Seven Metrics

  • Integrity
  • Validity
  • Consistency
  • Accuracy
  • Completeness
  • Timeliness
  • Uniqueness
NASA Mars Climate Orbiter
1 lb = 0.4536 kg
NASA Mars Climate Orbiter failed in 1999 due to data inconsistency.

How to validate data quality?

Integrity Definition

  • Data unchanged & correct
  • Reliable end-to-end
Data Integrity Example

Integrity Examples

  • No negative yields
  • Immutable harvest dates
  • Signed sensor logs

// MongoDB
db.farm.find({ yieldKg: { $lt: 0 } })

// SQL
SELECT * FROM farm
WHERE yield_kg < 0;
      

Integrity Check

flowchart LR
  A[Source Data] --> B[Generate Checksum]
  B --> C[Transmit Data + Checksum]
  C --> D[Verify Checksum]
  D -->|Match ✅| E[Data Accepted]
  D -->|Mismatch ❌| F[Data Rejected]
  %% Green link for correct
  linkStyle 3 stroke:#0a0,stroke-width:2px;
  %% Red link for incorrect
  linkStyle 4 stroke:#f66,stroke-width:2px;

Generating Checksums

Type: SHA-256


# Python
import hashlib

with open('data.txt', 'rb') as f:
    file_bytes = f.read()
    digest = hashlib.sha256(file_bytes).hexdigest()
    # Example output: 3a7bd3e2360a...
    print(digest)
            

Type: SHA-256


// JavaScript (Node.js)
const crypto = require('crypto');
const fs = require('fs');

const fileBuffer = fs.readFileSync('data.txt');
const hash = crypto.createHash('sha256')
                   .update(fileBuffer)
                   .digest('hex');
// Example output: 3a7bd3e2360a...
console.log(hash); 
            

Data Integrity with Checksums

  • Include checksum in metadata
  • Verify integrity before processing (see the verification sketch below)
  • SHA-256 recommended
  • Checksum is not encryption
{
  "meta": {
    "checksum": "3a7bd3e2360a9e3f2a1c4b..."
  },
  "data": {
    "sensorId": "S1001",
    "temperature": 22.5,
    "humidity": 45,
    "timestamp": "2025-09-08T15:00:00Z"
  }
}
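
A browser-side sketch of the "verify before processing" step using the Web Crypto API; it assumes the checksum was computed over JSON.stringify of the data field, so producer and consumer must agree on that canonical form.

// Recompute SHA-256 of doc.data and compare with doc.meta.checksum.
async function verifyChecksum(doc) {
  const bytes = new TextEncoder().encode(JSON.stringify(doc.data));
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  const hex = [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, "0"))
    .join("");
  return hex === doc.meta.checksum;  // true -> accept, false -> reject
}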
            

Validity Definition

  • Values follow format
  • Within allowed ranges
Data Validity Example

Validity Examples

  • Valid season values
  • Temperature numeric
  • ISO timestamps
  • Use standards

// MongoDB
db.farm.find({
  season:{ $nin:["Spring","Fall"] }
})

// SQL
SELECT * FROM farm
WHERE season NOT IN
 ('Spring','Fall');
      

Validity Check

  • Which data is invalid?
{
  "id": 1,
  "crop": "Wheat",
  "yieldKg": 7500,
  "harvestDate": "32-13-2025",
  "price": 250,
  "irrigation": "Drip",
  "stock": 120,
  "locaton": "Field A",
  "owner": 12345,
  "organic": "maybe"
}
            

Validity MongoDB Check

  • Check if organic field is valid

// MongoDB: find docs where 'organic' is not boolean
db.farm.find({
  $expr: { $ne: [ { $type: "$organic" }, "bool" ] }
})
                

Consistency • Accuracy • Completeness • Timeliness • Uniqueness

➡️ Thursday

Questions?

Which Data Quality metrics are NOT met?

  • Are units consistent across records?
  • Is harvestDate in the correct format?
  • Is the record up to date?
  • Is crop present and non-empty?
  • Is owner the expected type?
{
  "id": 42,
  "crop": "Wheat",
  "yieldKg": 7200,
  "unit": "lbs",
  "owner": "Alice",
  "harvestDate": "13/31/2025",
  "updatedAt": "2022-01-10T10:00:00Z"
}
❌ Metrics not met:
- Consistency
- Validity
- Timeliness

Questions?

Today Agenda

  • Data Quality and Standards ✓
  • MongoDB History & Features
  • CRUD, Operators, Practice
  • Homework, Lab, and Project
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

MongoDB History

  • 2007: Start at 10gen
  • 2009: Open source (GitHub)
  • 2014: WiredTiger
  • 2016: Atlas cloud
  • 2017: IPO
  • 2021+: ACID, time series

Note: The name MongoDB comes from "humongous" (meaning huge).

MongoDB History

The MEAN Stack

  • 2009: Node.js introduced (JS on server)
  • 2010: AngularJS launched (Google)
  • 2013: Term MEAN stack popularized
  • MEAN = MongoDB, Express, Angular, Node
  • Full JavaScript stack → front-end to back-end
MEAN Stack

NoSQL Types & SQL

  • Key–Value: Fast lookups
  • Document: JSON records
  • Column-Family: Wide tables
  • Graph: Nodes + edges
  • SQL: Rows + columns
NoSQL Database Types

MongoDB and the 5 V's of Big Data

  • Volume: Horizontal scaling for petabyte-scale data
  • Velocity: High write throughput
  • Variety: Flexible schema-less docs
  • Veracity: Validation rules & consistency
  • Value: Aggregation & BI integration
Big Data Volume

RDBMS vs MongoDB


Scaling Volume in MongoDB vs RDBMS

  • MongoDB: Horizontal scaling (sharding)
  • RDBMS: Limited to vertical scaling
  • Sharding → distributes data across multiple servers
  • Improves performance and fault-tolerance
Horizontal vs Vertical Scaling
Why does horizontal scaling work now?

Hardware & Scaling

  • 2000s: cheaper commodity computers
  • Made horizontal scaling practical
  • Clusters cheaper than one big server
Computer Hardware Price Trends

What is a Document?

  • Document is self-contained unit of information
  • Stores related data together in one JSON-like structure
  • No need to join across tables to reconstruct meaning
  • Each document has a unique _id (UUID/ObjectId)
  • Each document can stand alone and be understood by itself
{
  "_id": ObjectId("64f9c1a2b1234c5678d90ef1"),
  "crop": "Wheat",
  "yieldKg": 7500,
  "irrigation": "Drip",
  "location": { // No joins needed
    "field": "North",
    "soil": "Loam"
  },
  "farmerName": "John Doe",         
  "farmerPhone": "555-1234",         
  "equipmentUsed": "Tractor-101",    
  "marketPriceUSD": 250       
}

What is a Document?

SQL (Normalized):

-- Crops table
Crops(id, crop, yieldKg, irrigation, locationId)

-- Location table
Locations(id, field, soil)

-- Farmers table
Farmers(id, name, phone)

-- Equipment table
Equipment(id, name)
            

✅ Avoids duplication, but needs joins to query.

MongoDB (Denormalized):

{
  "_id": ObjectId("64f9c1a2b1234c5678d90ef1"),
  "crop": "Wheat",
  "yieldKg": 7500,
  "irrigation": "Drip",
  "location": { "field": "North", "soil": "Loam" },
  "farmer": { "name": "John Doe", "phone": "555-1234" },
  "equipmentUsed": ["Tractor-101", "Seeder-202"]
}
            

⚡ All info in one document → fast reads, but duplicates farmer/equipment across docs.

MongoDB Local vs Online

  • MongoDB Compass (Local)
    • GUI client for local databases
    • Runs on developer’s machine
    • Good for testing & development
    • No internet required
  • MongoDB Atlas (Online)
    • Fully managed cloud service
    • Global availability & scaling
    • Backups, monitoring, security
    • Accessible anywhere via internet

MongoDB Features

  • JSON docs
  • Dynamic schema
  • Sharding & Replica
  • Indexes & Aggs
  • ACID Tx

// Document
{ crop:"Wheat",yieldKg:3000,
  soil:"Loam" }
      

MongoDB Query Language (MQL)

  • Sample JSON Document
    
    {
      "farm": "GreenField",
      "crop": "Wheat",
      "year": 2024,
      "yield_tons": 120,
      "location": "California"
    }
                                  
  • MQL Query Examples
    
    // Find farms growing Wheat
    db.crops.find({ 
        crop: "Wheat" 
    });
                                  
  • I/O Options
  • Compass (GUI JSON view)
  • Shell queries
  • Drivers (Python, Node.js, Java; Node.js sketch below)
  • Atlas API (REST/GraphQL)

// Principles of MQL
- JSON-like syntax
- Expressive operators ($gt, $set, etc.)
- CRUD: Create, Read, Update, Delete
- Works locally or in Atlas
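
As a small illustration of the "Drivers" option above, a minimal sketch with the official Node.js driver; the connection string and the farmDB/crops names are assumptions matching the sample document.

// npm install mongodb
const { MongoClient } = require("mongodb");

async function main() {
  const client = new MongoClient("mongodb://localhost:27017"); // local mongod assumed
  await client.connect();

  const crops = client.db("farmDB").collection("crops");
  const wheat = await crops.find({ crop: "Wheat" }).toArray(); // same filter as the MQL example
  console.log(wheat);

  await client.close();
}

main().catch(console.error);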
                      

MQL Syntax: Filter First

  • MQL methods filter as the first parameter
  • Filter defines which documents match
  • Second param = projection (return field)
  • Options: sort, limit, update operators
// Find all wheat crops
db.crops.find(
  { crop: "Wheat" },          // filter
  { crop: 1, yieldKg: 1 }     // projection
)
// Find with range filter
db.crops.find(
  { yieldKg: { $gt: 5000 } }, // filter
  { crop: 1, irrigation: 1 }  // projection
)
            

MQL Syntax Overview

  • db.collection.find(filter, projection)
  • db.collection.insertOne(document)
  • db.collection.updateOne(filter, update, opt)
  • db.collection.deleteOne(filter)
  • db.collection.aggregate(pipeline)

Reference: MongoDB Query Language (MQL) Documentation

// Example: Find crops with yield > 5000
db.crops.find(
  { yieldKg: { $gt: 5000 } },   // filter
  { crop: 1, yieldKg: 1 }       // projection
)
            

MongoDB ObjectId

  • Auto-generated unique ID for each document
  • Built from timestamp, machine ID, process ID, counter
{
  "_id": ObjectId("650f5a2e1c4d3a6b7f9d1234"),
  "name": "Rose",
  "color": "Red",
  "weightInKg": 0.5,
  "timestamp": ISODate("2025-09-08T15:00:00Z")
}
            

MongoDB Query Results

  • All queries return an object or a cursor
  • Write operations return an acknowledgement object
  • Check `acknowledged` for confirmation
// Insert query
const insertResult = db.flowers.insertOne({ 
  name: "Tulip", 
  color: "Yellow" 
});

// Output:
// {
//   acknowledged: true,
//   insertedId: ObjectId("f9d5678...")
// }
            

CRUD Create


// SQL
CREATE TABLE farm (
    id INT PRIMARY KEY AUTO_INCREMENT,
    crop VARCHAR(50)
);

INSERT INTO farm(crop)
VALUES ('Rice');
            

// MongoDB
use farmDB
db.createCollection("farm")
db.farm.insertOne({ crop:"Rice" })
            

CRUD Read


// SQL
SELECT * 
FROM farm
WHERE crop = 'Rice';
            

// MongoDB
db.farm.find({ crop: "Rice" })
            

CRUD Update


// SQL
UPDATE farm 
SET yield_kg = 4000
WHERE crop = 'Rice';
            

// MongoDB
db.farm.updateOne(
    { crop: "Rice" },
    { $set: { yieldKg: 4000 } }
)
            

CRUD Delete


// SQL
DELETE FROM farm
WHERE crop = 'Rice';
            

// MongoDB
db.farm.deleteOne({ crop: "Rice" })
            

Comparison Ops

  • $gt / $lt
  • $in / $ne
  • $exists
// MongoDB
db.farm.find({yieldKg:{$gt:3000}})

// SQL
SELECT * FROM farm
WHERE yield_kg>3000;
      

Logical Ops

  • $and / $or / $not / $nor
// MongoDB
db.farm.find({$and:[
 {crop:"Wheat"},{yieldKg:{$gt:2000}}]})

// SQL
SELECT * FROM farm
WHERE crop='Wheat' AND yield_kg>2000;
      

Evaluation Ops

  • $regex
  • $type
  • $expr
// MongoDB
db.farm.find({crop:/^W/})

// SQL
SELECT * FROM farm
WHERE crop LIKE 'W%';
      

Practice Query 1


// SQL
SELECT * FROM farm
WHERE yield_kg > 6000;
            

// MongoDB
db.farm.find({ yieldKg: { $gt: 6000 } })
                

Practice Query 2


// SQL
SELECT * FROM farm
WHERE irrigation IS NULL;
            

// MongoDB
db.farm.find({ irrigation: null })
                

Practice Query 3


// SQL
SELECT * 
FROM farm
ORDER BY price ASC
LIMIT 1;
            

// MongoDB
db.farm.find().sort({ price: 1 }).limit(1)
                

Practice Query 4


// SQL
UPDATE farm
SET stock = stock + 20
WHERE crop = 'Maize';
            

// MongoDB
db.farm.updateOne(
    { crop: "Maize" },
    { $inc: { stock: 20 } }
)
                

Practice Query 5


// SQL
DELETE FROM farm
WHERE yield_kg < 1000;
            

// MongoDB
db.farm.deleteMany(
    { yieldKg: { $lt: 1000 } }
)
                

Today Agenda

  • How to Validate Data Quality
  • MongoDB History & Features
  • CRUD, Operators, Practice
  • Homework, Lab, and Project

How to validate data quality?

Accuracy Completeness Consistency
Timeliness Validity Uniqueness
Integrity

Minimum Schema & Types (Part 1)

  • Use $jsonSchema for types & required.
  • Disallow NULL for critical fields.
// Validity
// Enforce required fields
db.createCollection("readings", {
  validator: {
    $jsonSchema: {
      // Completeness
      required: ["deviceId", "ts"]
    }
  }
})

Minimum Schema & Types (Part 2)

  • Range checks for moisture (0–100).
  • Enum for unit (e.g., "percent").
  • Reject malformed docs on write (combined validator sketched below).
// Validity, Accuracy
properties: {
  deviceId: { bsonType: "string" },
  ts: { bsonType: "date" },
  moisture: { minimum: 0, maximum: 100 },
  unit: { enum: ["percent"] }
}
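
Putting Part 1 and Part 2 together, a combined sketch (assuming the readings collection is being created fresh; field names follow the slides):

// Completeness, Validity, Accuracy, Consistency in one validator
db.createCollection("readings", {
  validator: {
    $jsonSchema: {
      required: ["deviceId", "ts"],                 // Completeness
      properties: {
        deviceId: { bsonType: "string" },           // Validity
        ts:       { bsonType: "date" },
        moisture: { minimum: 0, maximum: 100 },     // Accuracy: range check
        unit:     { enum: ["percent"] }             // Consistency: fixed unit
      }
    }
  }
})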

bsonType Example (Part 1)

  • bsonType validates field data types.
  • Ensures string, date, int.
// Validity
properties: {
  deviceId: { bsonType: "string" },
  ts: { bsonType: "date" },
  moisture: { bsonType: "double" }
}

bsonType Example (Part 2)

  • Prevents accidental type drift.
  • Supports arrays, objects, bool, null.
// Consistency
properties: {
  battery: { bsonType: "int" },
  healthy: { bsonType: "bool" },
  tags: { bsonType: "array" },
  meta: { bsonType: "object" }
}

Defaults & Normalization (Part 1)

  • Normalize unit & numeric types.
  • Prefer upsert for idempotence.
// Normalize on write
db.readings.updateOne(
  { deviceId: "plant-007", ts: ts },
  [{
    $set: {
      unit: "percent",
      moisture: { $toDecimal: "$moisture" }
    }
  }],  // pipeline update so $toDecimal is evaluated
  { upsert: true }
)

Defaults & Normalization (Part 2)

  • Apply defaults with $ifNull.
  • Backfill battery if null.
db.readings.updateOne(
  { deviceId: "plant-007", ts: ts },
  [{
    $set: {
      battery: { $ifNull: ["$battery", 100] }
    }
  }],  // pipeline update so $ifNull is evaluated
  { upsert: true }
)

Bulk Normalize (Part 1)

  • Use filter {} to target all docs.
  • Cast types consistently in multiple fields.
db.readings.updateMany(
  {},
  [{ $set: {
    deviceId: { $toString: "$deviceId" },
    ts: { $toDate: "$ts" },
    moisture: { $toDecimal: "$moisture" },
    battery: { $ifNull: ["$battery", 100] }
  }}]
)

Bulk Normalize (Part 2)

  • Cleaning historical IoT data.
  • Safe to run multiple times (idempotent).
db.readings.updateMany(
  {},
  [{ $set: {
    unit: "percent",
    healthy: { $toBool: "$healthy" },
    count: { $toInt: "$count" },
    tags: { $ifNull: ["$tags", []] },
    meta: { $ifNull: ["$meta", {}] }
  }}]
)

Before vs After Normalization

  • Before: Raw IoT readings can be messy.
  • Types are inconsistent (string vs number).
  • Null or missing fields (e.g., battery).
  • Arrays or objects not initialized.
{
  deviceId: 12345,
  ts: "2025-09-11",
  moisture: "42",
  battery: null,
  healthy: "true"
}
  • After: Normalized and validated.
  • All types aligned with schema.
  • Defaults filled (battery=100).
  • Ready for analytics & actuation.
{
  deviceId: "12345",
  ts: ISODate("2025-09-11T00:00:00Z"),
  moisture: NumberDecimal("42"),
  battery: 100,
  healthy: true,
  tags: [],
  meta: {}
}

Enforce Uniqueness

  • Prevent duplicates with unique index.
  • Use compound key (deviceId + ts).
  • Guarantee uniqueness of samples.
// Uniqueness
db.readings.createIndex(
  { deviceId: 1, ts: 1 },
  { unique: true }
)

Detect Outliers

  • Run QA queries for invalid ranges.
  • Quarantine suspicious documents.
  • Protect accuracy of analytics.
// Accuracy, Validity
db.readings.find({
  $or: [
    { moisture: { $lt: 0 } },   // Accuracy
    { moisture: { $gt: 100 } }  // Validity
  ]
})

Use Partial Indexes

  • Ensure rules only apply when fields exist.
  • Avoid blocking inserts with missing keys.
  • Supports validity and uniqueness.
// Validity, Uniqueness
db.readings.createIndex(
  { deviceId: 1, ts: 1 },
  {
    unique: true,
    partialFilterExpression: {
      deviceId: { $exists: true },
      ts: { $type: "date" }
    }
  }
)

Cross-Document Validation

  • Check latest reading before actuation.
  • Abort if invalid, out of range, or null.
  • Protect integrity of control actions.
// Integrity, Accuracy
const r = db.readings.findOne(
  { deviceId: "plant-007" },
  { sort: { ts: -1 } }
)

if (!r || r.moisture < 0 || r.moisture > 100) {
  throw "Invalid input"
}

Data Quality Dashboard

  • Visualize data quality in real time.
  • Good for operational monitoring.
Data Quality Dashboard
  • Business intelligence & decision-making
  • Emphasizes data visibility and trust
  • (Quality) Metrics for organization success
Book Dashboard Effect

MQL Aggregation Functions

  • MQL aggregation.
  • Functions for sum, avg, min, max.
  • Operators: $group, $match, $project.
  • Pipeline for step-by-step transformation.
  • Data quality checks & metrics calculation.
// Accuracy, Completeness
db.readings.aggregate([
  { $match: { unit: "percent" } },
  { $group: {
      _id: "$deviceId",
      avgMoisture: { $avg: "$moisture" },
      minMoisture: { $min: "$moisture" },
      maxMoisture: { $max: "$moisture" },
      totalReadings: { $sum: 1 },
      count: { $count: {} }
  }}
])

Count NULL and Inaccurate Data

  • Run scheduled aggregates for QA reports.
  • Surface % nulls, invalid ranges, duplicates.
  • Improves accuracy and completeness.
// Accuracy, Completeness
db.readings.aggregate([
  { $group: {
      _id: null,
      nullMoisture: { $sum: { 
        $cond: [{$eq:["$moisture",null]},1,0] }},
      outOfRange:   { $sum: { 
        $cond: [
            {
                $or:[{$lt:["$moisture",0]},
                {$gt:["$moisture",100]}]},1,0
        ]}}
  }}
])
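To report shares instead of raw counts, the totals can be divided by the overall document count. A sketch, using the same field names as above:
// Sketch: percentage of null moisture readings
db.readings.aggregate([
  { $group: {
      _id: null,
      total: { $sum: 1 },
      nullMoisture: { $sum: { $cond: [{ $eq: ["$moisture", null] }, 1, 0] } }
  }},
  { $project: {
      _id: 0,
      pctNullMoisture: { $multiply: [{ $divide: ["$nullMoisture", "$total"] }, 100] }
  }}
])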

Logging & Auditing

  • Maintain audit trail of changes.
  • Replay and verify historical decisions.
  • Supports integrity & traceability.
// Integrity
db.audit.insertOne({
  action: "UPDATE",
  target: "readings",
  ts: new Date(),
  user: "system",
  changes: { field: "moisture", old: "42", new: "42.0" }
})
Unstructured
Inconsistent
Normalized
Consistent

Eventually Consistent

Questions?
Accuracy ✓ Completeness ✓ Consistency ✓
Timeliness Validity ✓ Uniqueness ✓
Integrity ✓
Questions?
Slowest component in realtime application?

Slowest Component in Realtime Application?

  • Realtime apps rely on multiple tiers.
  • Bottleneck decides end-to-end speed.
  • Commonly database writes or network.
CLIENT → SERVER → DATABASE

Term: Network Latency

Definition: The time it takes data to travel from source to destination and back (round trip). Contributing factors include bandwidth, distance, routing, congestion, and packet size.
Network Latency

Slowest Component in Realtime Application?

Big Data Transport Latency (100 MB)
-----------------------------------
1 Gbps : ~0.8 s one-way
100 Mbps : ~8 s one-way
10 Mbps : ~80 s one-way


Round-trip β‰ˆ 2x + small ACK delay
CLIENT → SERVER → DATABASE

Timeliness via Sharded Architecture

  • Supports horizontal scale (sharding).
  • Each shard holds part of the dataset.
  • Query router (mongos) directs requests.
  • Config servers maintain cluster metadata.
  • Documentation
MongoDB Sharded Cluster

Horizontal Scaling for Timeliness

  • Distribute workload in multiple nodes.
  • Each shard holds a subset of the data.
  • Improves timeliness of queries.
  • Reduces latency under heavy IoT loads.
// Timeliness
// Shard readings by deviceId
db.adminCommand({
  shardCollection: "iot.readings",
  key: { deviceId: 1 }
})
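A collection can only be sharded after sharding is enabled on its database. A short sketch using the same namespace and shard key:
// Enable sharding on the database first, then shard the collection
sh.enableSharding("iot")
sh.shardCollection("iot.readings", { deviceId: 1 })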

Balanced Cluster Load

  • MongoDB balancer moves chunks automatically.
  • Keeps shards evenly loaded.
  • Maintains consistency and speed.
// Check balancer status
db.adminCommand({ balancerStatus: 1 })

// Move chunk manually if needed
db.adminCommand({
  moveChunk: "iot.readings",
  find: { deviceId: "plant-007" },
  to: "shard0001"
})

Scaling Reads & Writes

  • Reads routed to appropriate shard.
  • Writes distributed by shard key.
  • Supports high-throughput IoT transformations.
  • Preserves timeliness under scale.
// Example query routed by key
db.readings.find({ deviceId: "plant-007" })

// Query router (mongos)
// ensures minimal latency
Other options?

Apply TTL for Ephemeral Data

  • Auto-expire stale telemetry buffers.
  • Preserve only fresh, relevant data.
  • Supports timeliness.
// Timeliness
db.staging.createIndex(
  { expiresAt: 1 },
  { expireAfterSeconds: 0 }
)

TTL Expiration Options

  • Expire After 24h
  • Deletes docs 24 hours after ts.
  • Good for temporary telemetry buffers.
// Timeliness
// Auto-delete 24h after ts
db.staging.createIndex(
  { ts: 1 },
  { expireAfterSeconds: 86400 }
)
  • Expire On Specific Date
  • Deletes docs at given expiresAt.
  • Useful for fixed retention windows.
// Timeliness
// Auto-delete at expiresAt
db.logs.createIndex(
  { expiresAt: 1 },
  { expireAfterSeconds: 0 }
)

db.logs.insertOne({
  msg: "debug",
  expiresAt: ISODate("2025-09-12T00:00:00Z")
})
Accuracy ✓ Completeness ✓ Consistency ✓
Timeliness ✓ Validity ✓ Uniqueness ✓
Integrity ✓

Practice Lab

https://onecompiler.com/mongodb

Conceptual Q1

In a farm database, which data quality metric is the hardest to guarantee?

  • Accuracy
  • Consistency
  • Completeness
  • Timeliness
  • Uniqueness
  • Validity
  • Integrity

Conceptual Q2

  • Why is timeliness critical in IoT pipelines?
Self-driving car

Today Agenda

  • Data Transformation and Processing
  • Map: Transform data items
  • Filter: Transform data rows
  • Reduce: Transform data columns
  • Lab 2, Homework 2, Project Part 1

Data Transformation and Processing (1970)

Ed Codd
  • Edgar F. "Ted" Codd: relational database model (1970)
  • Data (table) transformation came first
  • SQL transformations limited to the RDBMS

-- Filter high-yield crops
SELECT * 
FROM crops 
WHERE yield_per_hectare > 5000;

-- Sum total production
SELECT SUM(production_tonnes) AS total 
FROM crops;

-- Update harvest_date to DATE format
UPDATE crops
SET harvest_date = DATE('2025-09-14')
WHERE id = 1;

Data Transformation and Processing (1966)

BΓΆhm–Jacopini
  • Structured Program Theorem (BΓΆhm–Jacopini)
  • Algorithm first = Sequence, Selection, Iteration
  • Minimal building blocks for computation
#include <stdio.h>

int main() {
    int yields[5] = {4000, 5200, 3100};
    int total = 0;

    for (int i = 0; i < 5; i++) { // Iteration
        if (yields[i] > 5000) {   // Selection
            total += yields[i];    // Sequence
        }
    }
    printf("Total high yield = %d\n", total);
    return 0;
}

Data Transformation and Processing (1972)

Dennis Ritchie
  • Dennis Ritchie: C programming language
  • Transformations limited to if & loops
// Example: transform yields with C
#include <stdio.h>

int main() {
    int yields[5] = {4000, 5200, 3100};
    for (int i = 0; i < 5; i++) {
        if (yields[i] > 5000) {
          printf("High yield: %d\n", yields[i]);
        }
    }
    return 0;
}

Data Transformation and Processing (~2000)

-- Simple SQL query
SELECT name, price
FROM Flowers
WHERE price > 10;
      
-- Loop + if-then-else
FOR each flower IN Flowers:
    IF flower.price > 10 THEN
        PRINT "Expensive"
    ELSE
        PRINT "Affordable"
    END IF
END FOR
      

...and today?

How to transform JSON data?

How not to transform JSON data?

  • Risk bugs (manual index, array size mismatch)
  • Not maintainable (hard to change logic)
  • Not readable (nested loops & conditions)
  • Doesn't scale for real JSON with nested objects
  • Lacks composability (can't easily reuse)
#include <stdio.h>

int main() {
    int yields[5] = {4000, 5200, 3100};
    int total = 0;

    for (int i = 0; i < 5; i++) { // Iteration
        if (yields[i] > 5000) {   // Selection
            total += yields[i];    // Sequence
        }
    }
    printf("Total high yield = %d\n", total);
    return 0;
}

JSON Data Transformation with Map, Filter, Reduce

JSON Data Transformation with Map, Filter, Reduce

  • Map – Declarative: what to transform, not how
  • Filter – Declarative: what rows to keep, not how to loop
  • Reduce – Declarative: what summary you want (sum, avg, min, max, count)

Contrast: Non-declarative requires specifying both what + how (loops, indices, control flow).

MAP

MAP Syntax

  • Transform each JSON object item in an array
  • Apply a function to each element
  • Create a new array from existing data
const numbers = [1, 2, 3, 4];

const doubled = numbers.map(x => x * 2);
// Result: [2, 4, 6, 8]

const objects = [{id:1}, {id:2}];
const transformed = objects.map(obj => ({
    ...obj, 
    value: obj.id * 10
}));
// Result: [{id:1, value:10}, {id:2, value:20}]

Map Example

  • Map applies to item granularity
  • Used for item requirements
  • IndexedDB, MongoDB, and SQL
//Before:
[
  { "crop": "Wheat", "yield": "5000" },
  { "crop": "Corn", "yield": "4000" }
]

//After:
[
  { "crop": "Wheat", "yield": 5000 },
  { "crop": "Corn", "yield": 4000 }
]
            
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)

Map in JavaScript

  • Translation: English β†’ Spanish
  • Data Format: Date β†’ ISO string
  • Type Conversion: String β†’ Integer
  • Default Value: Fill missing fields
// Translation: English to Spanish
["Wheat","Corn"].map(c=>c==="Wheat"?"Trigo":"MaΓ­z");

// Data Format: Local to UTC
["01/12/25"].map(d => new Date(d).toISOString());

// Type Conversion: String to Int
["5000","4000"].map(y => parseInt(y));

// Default Value: Region
[{r:null}].map(x=>({r:x.r||"Unknown"}))
            

Data Mapping Requirements: Req 1 & 2

  • Req 1: Fill missing data with default
  • Req 2: Convert yield to integer
Before:
[
  { "crop": "Wheat", "yield": "5000" },
  { "crop": "Corn" }
]

After Map:
[
  { "crop": "Wheat", "yield": 5000 },
  { "crop": "Corn", "yield": "missing" }
]
            

Map IndexedDB Data with JavaScript

Data Mapping Requirements: Req 1 & 2

const data = [
  { crop: "Wheat", yield: "5000" },
  { crop: "Corn" }
];

const mapped = data.map(d => ({
  crop: d.crop,
  yield: d.yield ? parseInt(d.yield) : "missing"
}));
            
const data = [
  { crop: "Wheat", yield: "5000" },
  { crop: "Corn" }
];

const transformed = [];
for (let i = 0; i < data.length; i++) {
  const d = data[i];
  let yieldVal;
  if (d.yield) {
    yieldVal = parseInt(d.yield);
  } else {
    yieldVal = "missing";
  }
  transformed.push({crop:d.crop,yield:yieldVal});
}
            

Data Mapping Requirements: Req 3 & 4

  • Req 3: Add timestampLastUpdate meta data
  • Req 4: Add uuid meta data
// Before:
const data = [{ "crop": "Wheat" }];

// After:
[
  { 
    "crop": "Wheat", 
    "tsLastUpdate": "2025-09-15T15:00:00Z", 
    "uuid": "a1b2c3-"
  }
]
            

Map IndexedDB Data with JavaScript

Data Mapping Requirements: Req 3 & 4


const data = [
  { crop: "Wheat" },
  { crop: "Corn" }
];

const mapped = data.map(d => ({
  ...d,
  meta: {
    tsLastUpdate: new Date(),
    uuid: crypto.randomUUID()
  }
}));
            

const data = [
  { crop: "Wheat" },
  { crop: "Corn" }
];

const transformed = [];
for (let i = 0; i < data.length; i++) {
  const d = data[i];
  transformed.push({
    ...d,
    meta: {
      tsLastUpdate: new Date(),
      uuid: crypto.randomUUID()
    }
  });
}
            

Map Use Case

  1. Poor quality data ⟢ Better quality data (e.g., remove nulls, fix typos)
  2. Unnormalized data ⟢ Normalized data (e.g., split address: street/city/state)
  3. Unstandardized data ⟢ Standardized data (e.g., timestamps β†’ UTC)
  4. Format 1 ⟢ Format 2 (e.g., English β†’ Spanish)
  5. Incomplete data ⟢ Complete data (e.g., fill missing yield with average)
  6. Duplicate records ⟢ Unique dataset (e.g., remove repeated crop entries)
  7. Inconsistent units ⟢ Standardized units (e.g., lbs β†’ kg, gallons β†’ liters)
  8. Unstructured text ⟢ Structured fields (e.g., parse β€œWheat, 5000kg” into JSON)
  9. Less secure ⟢ More secure (e.g., anonymize SSN β€œ123-45-6789” ⟢ β€œXXX-XX-6789”)
  10. Less integrity ⟢ More integrity (e.g., add metadata like checksum or UUID to verify integrity)
  11. No Data Pipeline ⟶ Data Pipeline (e.g., extract-transform-load (ETL) process for IoT data)
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)

MAP Syntax in MongoDB

  • input: Array to iterate over
  • as: Variable name for each element
  • in: Expression to apply to each element
  • Creates a new array with transformed values

{
  $map: {
    input: [/* array */],
    as: "element",
    in: { /* transformation */ }
  }
}
            

MAP Example: Double Numbers

  • Double each number in an array
// Create a temporary collection
db.tempNumbers.insertMany([ {
     _id: 1, 
     numbers: [1,2,3,4] 
    } 
]);

db.tempNumbers.aggregate([
  {
    $project: {
      doubled: {
        $map: {
          input: "$numbers",
          as: "n",
          in: { $multiply: ["$$n", 2] }
        }
      }
    }
  }
]);
// Result: [{_id:1, doubled: [2,4,6,8]}]
            

MAP Example: Transform Objects

  • Transform each object in an array
  • Add computed values for each element
db.tempObjects.insertOne({ 
    _id: 1, 
    objects: [{id:1},{id:2}] 
});

db.tempObjects.aggregate([
  {
    $project: {
      transformed: {
        $map: {
          input: "$objects",
          as: "obj",
          in: { 
            id: "$$obj.id", 
            value: { $multiply: ["$$obj.id", 10] } 
          }
        }
      }
    }
  }
]);
            

Map MongoDB Data with JavaScript

Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)
Filter

What is a Filter?

Coffee Filter Example

Filter Syntax

  • Filter elements in an array based on a condition
  • Returns new array with elements that meet the condition
  • array.filter(element => condition === true)
// Coffee menu as objects
const drinks = [
  { name: "Espresso", price: 3 },
  { name: "Latte", price: 4 },
  { name: "Cappuccino", price: 5 }
];

// Filter: keep only Latte
const lattes = drinks.filter(drink => drink.name === "Latte");

console.log(lattes);
// [ { name: "Latte", price: 4 } ]

Filter Requirements in JavaScript: Req 1 & 2

// Data set
const drinks = [
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
];

// Req 1: Filter only Cappuccino
      
// Data set
const drinks = [
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
];

// Req 2a: Map price to number
// Req 2b: Filter drinks under $5
      

Filter IndexedDB Data with JavaScript

Filter Requirements in JavaScript: Req 1 & 2

// Data set
const drinks = [
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
];
const cappuccinos = drinks.filter(d => d.name === "Cappuccino");
const cheapNames = drinks
  .map(d => ({ drink: d.name, price: Number(d.price) }))
  .filter(d => d.price < 5);
      
// Data set
const drinks = [
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
];
const cappuccinos = [];
const cheapNames = [];
for (let i = 0; i < drinks.length; i++) {
  const d = drinks[i];
  if (d.name === "Cappuccino") {
    cappuccinos.push(d);
  }
  const price = Number(d.price);
  if (price < 5) {
    cheapNames.push({ drink: d.name, price: price });
  }
}
      
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)

Filter Requirements in MongoDB: Req 1 & 2

// Collection: drinks
// Documents
[
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
]

// Req 1: Filter only Cappuccino
      
// Collection: drinks
// Documents
[
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
]

// Req 2a: Convert price to number
// Req 2b: Filter Cappuccinos priced $5 or less
      

Filter Data with MongoDB (MQL)

Filter Requirements in MongoDB: Req 1 & 2

// Insert sample data
db.drinks.insertMany([
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
])

// Req 1: Filter only Cappuccino
db.drinks.find({ name: "Cappuccino" })

// Result:
[ { name: "Cappuccino", price: "5" } ]
      
// Insert sample data
db.drinks.insertMany([
  { name: "Espresso", price: "3" },
  { name: "Latte", price: "4" },
  { name: "Cappuccino", price: "5" }
])

// Req 2a: Convert price to number
// Req 2b: Filter Cappuccinos priced $5 or less
db.drinks.aggregate([
  { $match: { name: "Cappuccino" } },
  { $addFields: { price: { $toInt: "$price" } } },
  { $match: { price: { $lte: 5 } } }
])

// Result:
[ { name: "Cappuccino", price: 5 } ]
      

Filter Use Case

  1. Noise ⟢ Signal (e.g., remove irrelevant logs, keep error events)
  2. All crops ⟢ Selected crops (e.g., only β€œWheat”)
  3. All dates ⟢ Relevant period (e.g., last 7 days only)
  4. All customers ⟢ Target customers (e.g., local farmers)
  5. All products ⟢ In-stock products (e.g., quantity > 0)
  6. All diagnoses ⟢ Selected ICD-9 codes (e.g., only 250.xx)
  7. All devices ⟢ Active devices (e.g., last ping < 5 min ago)
  8. All fields ⟢ Sensitive fields removed (e.g., drop SSN, keep name + city)
  9. All records ⟢ Valid records (e.g., only rows passing schema validation)
  10. Unfiltered API ⟢ Specific view (e.g., Query selecting only posts with tag β€œIoT”)
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)
Reduce

What is a Reduce?

  • Reduce like combining coffee grounds into one cup β˜•
  • What does reduce do? β†’ Big data to One Value
Coffee Cup Example

Reduce Syntax

  • Reduce combines array elements into a single result
  • Requires an accumulator and a function
  • array.reduce((acc, element) =>..., initValue)
// Example: sum numbers
const numbers = [1, 2, 3];
const total = numbers.reduce(
  (acc, n) => acc + n, 0
);

console.log(total);
// 6
      

Reduce in JavaScript

// Data set
const drinks = [
  { name: "Espresso", price: 3 },
  { name: "Latte", price: 4 },
  { name: "Cappuccino", price: 5 }
];

// Total price of all drinks
const total = drinks.reduce(
  (sum, d) => sum + d.price, 0
);

console.log(total);
// 12
      
// Same with loop
const drinks = [
  { name: "Espresso", price: 3 },
  { name: "Latte", price: 4 },
  { name: "Cappuccino", price: 5 }
];

let total = 0;
for (let i = 0; i < drinks.length; i++) {
  total += drinks[i].price;
}

console.log(total);
// 12
      

Reduce in MongoDB (Aggregation)

// Insert data
db.drinks.insertMany([
  { name: "Espresso", price: 3 },
  { name: "Latte", price: 4 },
  { name: "Cappuccino", price: 5 }
])

// Total price of all drinks
db.drinks.aggregate([
  { $group: { _id: null, total: { $sum: "$price" } } }
])

// Result:
[ { _id: null, total: 12 } ]
      
// Total price of Cappuccino only
db.drinks.aggregate([
  { $match: { name: "Cappuccino" } },
  { $group: { _id: "$name", total: { $sum: "$price" } } }
])

// Result:
[ { _id: "Cappuccino", total: 5 } ]
      
Map Filter Reduce
IndexedDB (JS)
MongoDB (MQL)

Reduce Use Case

  1. Many sales ⟢ Total sales (e.g., SUM revenue)
  2. Multiple prices ⟢ Average price (e.g., AVG cost)
  3. All values ⟢ Minimum / Maximum (e.g., lowest or highest temperature)
  4. All customers ⟢ Count of customers (e.g., COUNT records)
  5. All errors ⟢ Error rate summary (e.g., % failed requests)
  6. Sensor data stream ⟢ Aggregated metric (e.g., rolling average)
  7. Daily transactions ⟢ Weekly or monthly totals
  8. All text entries ⟢ One concatenated string (e.g., join comments)
  9. Distributed logs ⟢ Unified summary (e.g., number of events per category)
  10. Detailed dataset ⟢ One KPI dashboard metric (e.g., ROI = (Ξ£revβˆ’Ξ£cost)/Ξ£cost)
Questions
Dashboard

Today Agenda

  • MQL CRUD Query
  • MQL Aggregation Pipeline
  • Lab 2, Homework 2, Project Part 1
MongoDB Aggregation Pipeline Example

Today Agenda

  1. DB Architecture - End-to-End Aspects
  2. DB Architecture - Design Time Aspects
  3. DB Architecture - Runtime Aspects
  4. NoSQL Graph DB - Neo4j
  5. Lab 2, Homework 2, Project Part 1

End-to-End Database Pipeline Aspects

  • Pipeline connects data sources to sinks
  • Ingest, process, store, and deliver data
  • Supports both real-time and batch workloads
End-to-end pipeline overview
Software Required to Transport Data from Database to Database

ETL Transport Software

  • E – Extract: Pull data from source systems
  • T – Transform: Clean, validate, and reshape data
  • L – Load: Insert data into target system
  • E: databases, APIs, IoT devices, logs
  • T: quality checks, joins, aggregations
  • L: lake, warehouse, mart, dashboard
  • β€”

History of ETL vs ELT

  • 1970s–80s: ETL emerges with relational databases
  • 1990s: ETL dominates for data warehouses
  • 2000s: Big data growth β†’ lakes introduced
  • 2010s: ELT rises with scalable cloud storage
  • Today: Hybrid pipelines mixing ETL + ELT
ETL vs ELT history timeline

Source DB Systems

  • Transactional databases (SQL/NoSQL)
  • APIs, IoT devices, logs
  • Streaming sources (Kafka, MQTT)
Data sources

Data Extraction

  • Captures raw data from sources
  • Batch (files, APIs) and streaming (events)
  • Ensures durability and availability
# extract.py - Example of extracting data from an API
import requests
import json

url = "https://api.weather.com/data"
response = requests.get(url)

if response.status_code == 200:
    data = response.json()
    print("Data extracted:", data)
else:
    print("Failed:", response.status_code)
            
Where to Extract Data To?

Concept of Data Lake

  • Central repository for raw data
  • Add metadata for context and discovery
  • Schema-on-read: flexibility for data variety
Data lake concept

Concept of Data Lake

  • Central repository for raw data
  • Add metadata for context and discovery
  • Schema-on-read: flexibility for data variety
  • One query language for multiple source db
  • Raw data is read-only
  • Data available even when source DB offline
# Example structure of a data lake
lake = {
    # READ ONLY
    "uuid": "123e4567-e89b-12d3-a456-426614174000", 
    "item": { 
        "raw_data": {
            "sensor_readings": [...],
            "logs": [...],
            "api_data": [...]
        },
        # READ AND WRITE
        "meta_data": { 
            "source": "IoT sensors",
            "timestamp": "2025-09-23T08:00:00Z",
            "tags": ["temperature", "humidity"]
        }
    }
}
            

Concept of Data Lake

  • Holds source-of-truth data in raw form
  • Supports delayed or retrospective processing
  • Minimizes risk of premature data loss
Truth in data lake

Concept of Data Lake

  • Transform: Enrich raw data with metadata
  • Load: Store enriched data in the data lake
  • Maintain read-only raw data
Truth in data lake

Concept of Data Lake

In many data pipeline architectures, the data lake is often the first and central component.

Data lake as central component

Concept of Data Lake

It is highly recommended to start with a data lake to ensure flexibility and scalability.

Data lake as central component

Case Study: Agriculture Database Pipeline

  • IoT soil moisture sensors collect raw data
  • DB sources: SQL, IndexedDB, and MongoDB
  • Stored in a data lake with metadata
  • Lake holds truth for delayed processing
  • Task: Draw architecture up to lake
  • Include: metadata on quality aspects
Agriculture data pipeline case study
When is software just a tool to transport data from one DB to another?
Example with MongoDB Compass
Questions?

1. What is the main purpose of the Extract step in ETL?

  • Transform data for analysis
  • Load data into the target system
  • Capture raw data from source systems
  • Generate reports for end users

2. Which of the following is a key characteristic of a Data Lake?

  • Only stores structured data
  • Data is read-only and enriched with metadata
  • Requires source databases to be always online
  • Data is stored without any metadata or context

3. In ETL processes, what does the Transform step usually involve?

  • Pulling data from APIs and files
  • Cleaning, validating, and reshaping data
  • Loading data into dashboards
  • Backing up the source database

4. When would you recommend using ETL over ELT?

  • When the target system can handle large-scale raw data processing
  • When you need to clean, validate, and transform data before loading
  • When you want to load raw data first and transform later
  • When working exclusively with streaming data
Questions?

Today Agenda

  1. DB Architecture - End-to-End Aspects ✓
  2. DB Architecture - Design Time Aspects
  3. DB Architecture - Runtime Aspects
  4. NoSQL Graph DB - Neo4j
  5. Lab 2, Homework 2, Project Part 1

Design-Time Aspects

  • Choices made before pipeline runs
  • Impact scalability, cost, reliability
  • Guide long-term architecture quality
Design-time planning

Which Database for Lake?

  • Object storage (e.g., S3, GCS, Azure Blob)
  • NoSQL (e.g., MongoDB, Cassandra)
  • Columnar formats (Parquet, ORC)
  • Relational Database (MySQL, OracleDB)
  • Decision: scalability vs. query speed
Databases for data lake

Which Sync Interval?

  • How often to sync from sources
  • Real-time (seconds) vs batch (hours)
  • Impacts freshness and cost
Interval   Use Case
Real-time  IoT, fraud detection
Hourly     ETL jobs, logs
Daily      Reporting dashboards
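A minimal sketch of how an interval choice becomes code on the Node side (the hourly interval and the syncFromSources function are assumptions):
// Sketch: schedule-based sync, run once per hour (interval is an assumption)
async function syncFromSources() {
  console.log("Extracting from sources and loading into the lake...");
}

setInterval(syncFromSources, 60 * 60 * 1000); // hourly batch sync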

Which Sync Trigger?

  • Event-based: trigger on data arrival
  • Schedule-based: run at fixed times
  • Hybrid: schedule + event validation
# etl.py - all-in-one ETL script
def etl():
    extract()
    transform()
    load()

# Separate modular files
# extract.py
def extract():
    print("Extracting data from source...")

# transform.py
def transform():
    print("Transforming and enriching data...")

# load.py
def load():
    print("Loading data into target system...")
            

Partial or Full Sync?

  • Partial sync: only changed data
  • Full sync: replace entire dataset
  • Tradeoff: performance vs reliability
Type     Pros                  Cons
Partial  Faster, less storage  Risk missing updates
Full     Consistent snapshot   High cost & latency
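A sketch of the partial-sync idea in mongosh, assuming each document carries a lastModified timestamp and the pipeline remembers the time of its previous run:
// Sketch: incremental (partial) sync; field and collection names are assumptions
const lastSync = ISODate("2025-09-23T00:00:00Z")  // saved by the previous run

db.source.find({ lastModified: { $gt: lastSync } }).forEach(doc => {
  const { _id, ...fields } = doc
  db.lake.updateOne({ _id: _id }, { $set: fields }, { upsert: true })
})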

Schema Design

  • Schema-on-read: flexibility
  • Schema-on-write: strict validation
from pymongo import MongoClient

db = MongoClient().Agriculture  # connect to MongoDB
print(db.flowers.count_documents({
    "color": {"$exists": True}
}))  # verify data
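For the schema-on-write side, one option is a MongoDB $jsonSchema validator. A hedged sketch (collection and required fields are assumptions):
// Schema-on-write: reject inserts that do not match the declared schema
db.createCollection("flowers", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "color"],
      properties: {
        name:  { bsonType: "string" },
        color: { bsonType: "string" }
      }
    }
  }
})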
    

What is a Data Retention Policy?

  • How long to keep raw data in lake
  • Short retention reduces cost
  • Long retention supports audits
  • May be driven by legal requirement
# Retention strategies for raw data (illustrative pseudocode)
def retention_policy(item):
    # 1. Auto delete after 30 days
    if item.age_days > 30:
        item.delete()  # remove from the store

    # 2. Deletion flag
    elif item.flag_for_deletion:
        item.deleted = True

    # 3. Do not delete
    else:
        pass  # keep indefinitely
            

Security, Access, and Integrity

  • Data Integrity Check
  • Encryption at rest & in transit
  • Access control (RBAC, ABAC)
  • Audit logs for compliance
import json, hashlib

# Verify checksum of metadata
doc = db.Farm.find_one({"uuid": "123e4567-e89b-..."})
meta_json = json.dumps(doc['meta_data'], sort_keys=True)
checksum = hashlib.md5(meta_json.encode()).hexdigest()

print("Stored checksum:", doc.get("checksum"))
print("Calculated checksum:", checksum)
print("Valid:", doc.get("checksum") == checksum)
            

How much traceability?

  • Track data origin and transformations
  • Enable reproducibility & trust
  • Metadata catalog for discovery
  • PubMed Article: 23684593
Lineage and catalog

Design-Time Summary

  • Database type for lake
  • Sync interval & trigger
  • Partial vs full sync
  • Schema, retention, format
  • Integrity, Traceability
Design-time overview

Case Study: Agriculture Data Pipeline (Part 2)

  • IoT soil moisture sensors collect raw data
  • Data ingested through batch and streaming
  • Stored in a data lake with metadata
  • Lake holds truth for delayed processing
  • Task: Draw end-to-end architecture up to lake
  • Include: Design Time Aspects
Agriculture data pipeline case study
Questions?

1. What is a key design-time consideration when choosing ETL vs ELT?

  • Whether to use Python or Java
  • Data volume, source complexity, and processing capabilities
  • Network bandwidth between users
  • Time of day to run the pipeline

2. Which design-time aspect involves defining cleaning, validation, and metadata?

  • Data Sources
  • Transformation
  • Retention & Security
  • Sync Trigger

3. Which design-time aspect determines how long raw data is kept in the pipeline?

  • Data Sources
  • ETL vs ELT choice
  • Retention & Security
  • Transformation & Metadata

4. When planning design-time aspects, what should you consider about data sources?

  • Connectivity requirements
  • The color of the UI dashboard
  • The font used in reports
  • Only the database name
Questions?

Today Agenda

  1. DB Architecture - End-to-End Aspects ✓
  2. DB Architecture - Design Time Aspects ✓
  3. NoSQL Graph DB - Neo4j
  4. DB Architecture - Runtime Aspects
  5. Lab 2, Homework 2, Project Part 1
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j
NoSQL Design Principles Diagram

When to Use a Graph Database?

To Query Relationships

Query connections, paths, and patterns between entities.

Subject : Predicate : Object
  • Alice: FRIEND_OF: Bob
  • Paris: CAPITAL_OF: France
  • Rose: GROWS_IN: Garden
  • Car: MANUFACTURED_BY: Toyota
  • Student: ENROLLED_IN: Course
Graph Database Diagram

History of Graph Databases

  • 1960s–1970s: Early graph theory research
  • 1980s: OOP databases with graph-like features
  • 1990s: Semantic web and RDF for graph data
  • 2000s: Emergence of native graph databases
  • 2010s: GraphDB in social networks
  • 2010s: GraphDB in recommendation systems
  • 2010s: GraphDB in fraud detection

History of Neo4j

  • 2000: Neo4j project started in Sweden
  • 2007: First public release of Neo4j
  • 2010: Neo4j for social network
  • 2010: Neo4j for recommendation system
  • 2014–2020: Expanded support for clustering
  • Today: Widely used native graph database
  • Query language: Cypher
  • Node/edge properties can use JSON structures
  • Schema-less: nodes/edges flexible properties

Neo4j Aura Console

https://console-preview.neo4j.io/

Create Nodes

  • Create a node with label Person
// Create a Person node
CREATE (p:Person {name: "Alice", age: 30})
RETURN p;
            

Create Relationships

  • Connect two nodes with a FRIEND relationship
// Create FRIEND relationship
MATCH (a:Person {name:"Alice"}),(b:Person {name:"Bob"})
CREATE (a)-[:FRIEND]->(b)
RETURN a, b;
            

Find Nodes

  • Retrieve all Person nodes
// Find all Person nodes
MATCH (p:Person)
RETURN p;

// Sample return object
{
  "p": {
    "name": "Alice",
    "age": 30,
    "email": "alice@example.com"
  }
}
            

Filter Nodes

  • Find Person nodes older than 25
// Filter by age
MATCH (p:Person)
WHERE p.age > 25
RETURN p;

// Sample return objects
[
  {"name": "Alice", "age": 30},
  {"name": "Charlie", "age": 28}
]
            

Update Nodes

  • Update the age of a Person
// Update age
MATCH (p:Person {name: "Alice"})
SET p.age = 31
RETURN p;

// Sample return object
{"name": "Alice", "age": 31}
            

Delete Nodes

  • Remove a Person node
// Delete node
MATCH (p:Person {name: "Bob"})
DETACH DELETE p;

// No return object; node is deleted
            

Count Nodes

  • Count how many Person nodes exist
// Count nodes
MATCH (p:Person)
RETURN count(p) AS totalPersons;

// Sample return object
{"totalPersons": 3}
            

Find Relationships

  • List all FRIEND relationships
// Find all FRIEND relationships
MATCH (a)-[r:FRIEND]->(b)
RETURN a.name, b.name;

// Sample return objects
[
  {"a.name": "Alice", "b.name": "Bob"},
  {"a.name": "Charlie", "b.name": "Alice"}
]
            

Aggregate Data

  • Calculate average age of Persons
// Average age
MATCH (p:Person)
RETURN avg(p.age) AS averageAge;

// Sample return object
{"averageAge": 29.7}
            

Optional Match

  • Find Persons and optional FRIENDs
// Optional FRIEND match
MATCH (p:Person)
OPTIONAL MATCH (p)-[:FRIEND]->(f)
RETURN p.name, collect(f.name) AS friends;

// Sample return objects
[
  {"p.name": "Alice", "friends": ["Bob", "Charlie"]},
  {"p.name": "Charlie", "friends": []}
]
            

SQL vs Cypher

  • Find all friends of Alice
// SQL query
SELECT f.name
FROM Person p
JOIN FRIENDS fr ON p.id = fr.person_id
JOIN Person f ON fr.friend_id = f.id
WHERE p.name = 'Alice';

// Cypher query
MATCH (a:Person {name: "Alice"})-[:FRIEND]->(f)
RETURN f.name;
            
Questions

ACID Compatibility

Database Atomicity Consistency Isolation Durability
IndexedDB Yes Yes Yes (per transaction) Yes
MongoDB Yes (single document) Yes (single document) Limited Yes (with journaling)
Neo4j Yes Yes Yes Yes
SQL (e.g., MySQL, PostgreSQL) Yes Yes Yes Yes

Today Agenda

  1. DB Architecture - End-to-End Aspects ✓
  2. DB Architecture - Design Time Aspects ✓
  3. NoSQL Graph DB - Neo4j
  4. DB Architecture - Runtime Aspects
  5. Lab 2, Homework 2, Project Part 1

ETL Runtime Challenges

  • Data volume spikes
  • Overwhelms downstream systems
  • Causes pipeline bottlenecks
  • Increases job failures
  • Leads to data loss or delays
ETL challenges placeholder

Runtime Aspects

  • Log endpoints usage
  • Log requests & responses
  • Log server uptime
  • Log DB health & queries
  • Use logs to prevent broken pipelines
ETL runtime placeholder
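A minimal request/response logging sketch for the Express app used in the labs (middleware shape and log format are assumptions):
// Log every request and response status with its duration
const express = require("express");
const app = express();

app.use((req, res, next) => {
  const start = Date.now();
  res.on("finish", () => {
    console.log(`${req.method} ${req.url} -> ${res.statusCode} (${Date.now() - start} ms)`);
  });
  next();
});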

Runtime Circuit Breaker

  • Protects services from overload
  • Stops repeated failing requests
  • Allows system to recover
  • Throttles down request rate
  • Write on lake only
Circuit breaker diagram
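A toy circuit-breaker sketch in plain JavaScript to show the open/close behavior (threshold and function names are assumptions, not a library API):
// Toy circuit breaker: stop calling a failing service after N consecutive failures
const THRESHOLD = 5;
let consecutiveFailures = 0;

async function guardedCall(requestFn) {
  if (consecutiveFailures >= THRESHOLD) {
    throw new Error("Circuit open: request skipped so the service can recover");
  }
  try {
    const result = await requestFn();
    consecutiveFailures = 0;   // success closes the circuit again
    return result;
  } catch (err) {
    consecutiveFailures += 1;  // another failure moves toward the open state
    throw err;
  }
}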

Runtime Resilience4j

  • Java library for fault tolerance
  • Circuit Breaker support
  • Rate Limiter & Bulkhead
  • Retry & Time Limiter

Monitoring Pipelines

  • Dashboards for job health
  • Alerts on data drift & anomalies
  • Performance metrics (I/O, memory)
  • Audit logs for compliance

import psutil, time

while True:
    cpu = psutil.cpu_percent()
    mem = psutil.virtual_memory().percent
    print(f"CPU: {cpu}%, MEM: {mem}%")
    time.sleep(5)
Questions

Which NoSQL Database Would You Recommend?

  • You need an offline browser-based app to store user preferences β†’ ?
  • You want to manage a large product catalog with flexible schema β†’ ?
  • You need to model social network relationships like friends-of-friends β†’ ?
  • You are building a data lake for weather and IoT sensors β†’ ?
  • You need fast local caching of session data in a web app β†’ ?
Questions

Lab 2 - Upstream Data

Upstream Data Diagram

History of Node.js

  • Created by Ryan Dahl in 2009
  • Built on Chrome’s V8 JavaScript engine
  • Introduced event-driven, non-blocking I/O
  • npm (Node Package Manager) launched in 2010
  • Widely adopted for web apps & APIs
Node.js history placeholder

Getting Started with Node.js & npm

  • Node.js = JavaScript runtime
  • npm = package manager
  • Install packages with npm install
  • Track dependencies in package.json
  • Run apps with node app.js
  • Lab2-Sample

# initialize a project
npm init -y

# install a package
npm install express

# run the app
node app.js

Review package.json

  • name / version: project identity
  • description: project purpose
  • main: entry file (app.js)
  • scripts: run commands (start, dev)
  • dependencies: required libraries
  • devDependencies: tools for development
  • license: usage terms
{
  "name": "lab2-agriculture-app",
  "version": "1.0.0",
  "description": "Simple Node.js app with Express and MongoDB",
  "main": "app.js",
  "scripts": {
    "start": "node app.js",
    "dev": "nodemon app.js"
  },
  "dependencies": {
    "dotenv": "^16.4.5",
    "express": "^4.19.2",
    "mongodb": "^4.17.2"
  },
  "devDependencies": {
    "nodemon": "^3.1.0"
  },
  "license": "MIT"
}

Review .gitignore

  • node_modules/: installed packages, not tracked
  • .env: sensitive credentials (DB, API keys)
  • Logs: debug/error logs ignored
  • OS / Editor files: system-generated clutter
  • Build artifacts: compiled output, not source
# Node dependencies
node_modules/

# Environment variables
.env

# Logs
*.log

# OS / Editor files
.DS_Store
Thumbs.db

# Build artifacts
dist/
build/

Setup Env & App

  • Load .env early
  • Import Express & MongoDB
  • Create the app instance
  • Read runtime ports/creds
// app.js
require("dotenv").config();

const express = require("express");
const { MongoClient } = require("mongodb");

const app = express();

const PORT = process.env.PORT || 3000;
const HOST = process.env.MONGO_HOST;
const USER = process.env.MONGO_USER;
const PASS = process.env.MONGO_PASS;

Early Env Validation

  • Fail fast if secrets missing
  • Clear error message
  • Exit to avoid half-start
(function validateEnv() {
  const missing = [];
  if (!HOST) missing.push("MONGO_HOST");
  if (!USER) missing.push("MONGO_USER");
  if (!PASS) missing.push("MONGO_PASS");
  if (missing.length) {
    console.error("❌ Missing env var(s):", missing.join(", "));
    console.error("   Ensure .env ext to app.js with these keys.");
    process.exit(1);
  }
})();

Create Mongo Client

  • SRV URL host-only
  • Separate auth credentials
  • Retry & unified topology
const client = new MongoClient(
  `${HOST}/?retryWrites=true&w=majority`,
  {
    useNewUrlParser: true,
    useUnifiedTopology: true,
    auth: { username: USER, password: PASS },
    authSource: "admin"
  }
);

let collection; // set after connect

Connect & Start Server

  • Connect & ping admin
  • Select DB/collection
  • Log document count
  • Start Express on success
(async function start() {
  try {
    console.log("⏳ Connecting to MongoDB...");
    await client.connect();
    await client.db("admin").command({ ping: 1 });

    const db = client.db("Lab2");
    collection = db.collection("Agriculture");

    const host = HOST.replace(/^mongodb\+srv:\/\//, "");
    const count = await collection.estimatedDocumentCount();
    console.log(`βœ… Connected to ${host}`);
    console.log(`πŸ“š Lab2.Agriculture docs: ${count}`);

    app.listen(PORT, () =>
      console.log(`πŸš€ http://localhost:${PORT}`)
    );
  } catch (err) {
    console.error("❌ DB connection error:", err.message || err);
    process.exit(1);
  }
})();

Data Endpoint

  • Guard: DB initialized
  • Query and return JSON
  • Handle 5xx on errors
app.get("/agriculture", async (_req, res) => {
  try {
    if (!collection) {
        return res.status(503).send("DB not initialized");
    }
    const docs = await collection.find({}).toArray();
    res.json(docs);
  } catch (e) {
    res.status(500).send(String(e));
  }
});

Health & Debug

  • /health: basic readiness
  • /debug/agriculture: count + sample
  • Use in runtime checks
app.get("/health", (_req, res) => {
  res.json({
    ok: Boolean(collection),
    cluster: HOST.replace(/^mongodb\+srv:\/\//, "")
  });
});

app.get("/debug/agriculture", async (_req, res) => {
  try {
    if (!collection) {
        return res.status(503).send("Database not initialized");
    }
    const count = await collection.estimatedDocumentCount();
    const sample = await collection.find({}).limit(5).toArray();
    res.json({db:"Lab2",collection:"Agriculture",count,sample });
  } catch (e) {
    res.status(500).json({ error: String(e) });
  }
});

POST Endpoint

  • Accept JSON in request body
  • Insert new document
  • Return confirmation with ID
  • Handle errors gracefully
// Enable JSON parsing
app.use(express.json());

// POST: add agriculture doc
app.post("/agriculture", async (req, res) => {
  try {
    const doc = req.body;
    const result = await collection.insertOne(doc);
    res.json({ insertedId: result.insertedId });
  } catch (e) {
    res.status(500).json({ error: String(e) });
  }
});

Today Agenda

  1. DB Architecture - Data for Secondary Usage
  2. DB Architecture - Data Warehouse
  3. DB Architecture - Data Mart
  4. NoSQL Graph DB - Neo4j
  5. Lab 2, Homework 2
  6. Lab 3, Homework 3
  7. Midterm, Project Part 1
DB Architecture - Data for Primary Usage

Data for Primary Usage

  • Support real-time operations
  • Record transactions or events
  • Maintain accurate current data
  • Enable immediate decision-making
  • Monitor systems and devices

Data for Primary Usage – Hospital

  • Track patient vital signs in real-time
  • Maintain accurate medical records
  • Support immediate clinical decisions
  • Monitor ICU devices and alerts
  • Enable timely diagnosis and treatment
{
    "patient_id":"P2025",
    "heart_rate":82,
    "blood_pressure":"120/80",
    "oxygen_level":97,
    "timestamp":"2025-09-30T09:30:00Z"
}

Data for Primary Usage – Agriculture

  • Monitor soil moisture and temperature
  • Track crop growth and yield data
  • Control irrigation systems
  • Support real-time pest detection
  • Ensure sustainable resource use
{
    "field_id":"F010",
    "soil_moisture":22.5,
    "temperature":29.1,
    "irrigation_status":"ON",
    "timestamp":"2025-09-30T09:30:00Z"
}

Data for Primary Usage – Retail

  • Record customer purchases
  • Maintain inventory levels
  • Support checkout and billing
  • Enable personalized promotions
  • Provide real-time sales insights
{
    "transaction_id":"R7845",
    "store_id":"S123",
    "items":["Laptop","Mouse"],
    "total_amount":1150.50,
    "purchase_time":"2025-09-30T09:45:00Z"
}
DB Architecture - Data for Secondary Usage

Data for Secondary Usage

  • Aggregate historical records
  • Analyze trends over time
  • Compute averages and statistics
  • Support predictive models
  • Identify patterns across datasets
  • Train Large Language Models (LLMs)

Healthcare (Aggregated)

  • Patients by condition
  • Avg. medication usage
  • Treatment success rates
  • Hospital occupancy trends
  • Age-group analysis
{
    "total_patients":1200,
    "avg_medications":2.3,
    "treatment_success_pct":87,
    "occupancy_rate":0.78,
    "age_groups":{"0-18":120,"19-65":890,"65+":190}
}

Automotive (Aggregated)

  • Avg. maintenance cost
  • Failure rates by model
  • Service frequency trends
  • Common repair types
  • Fuel efficiency analysis
{
    "avg_cost":320,
    "failures":{"modelA":12,"modelB":7},
    "service_count":420,
    "common_repairs":["brake","oil","tire"],
    "avg_mpg":28
}

Agriculture (Aggregated)

  • Avg. yield per crop
  • Planting vs harvest trends
  • Soil quality averages
  • Climate impact analysis
  • Water usage statistics
{
    "avg_yield_tons":52.5,
    "planting_dates":["2024-03-15","2024-04-01"],
    "harvest_dates":["2024-09-20","2024-09-30"],
    "soil_ph_avg":6.5,
    "water_used_l":5000
}

Retail (Aggregated)

  • Total sales by product
  • Avg. customer spend
  • Top-selling categories
  • Inventory turnover rates
  • Monthly sales trends
{
    "total_sales":10500,
    "avg_spend":45.7,
    "top_categories":["electronics","books"],
    "turnover_rate":1.8,
    "monthly_sales":[1200,1100,1350]
}

What is a Data Lake?

  • Central repo for all data types
  • Stores raw data, highly flexible
  • Handles large-scale data sources
Data Lake
DB Architecture - How from Primary Usage to Secondary Usage?

Data Lake: From Raw to Structured

  • Raw Zone: Original data stored
  • Processed Zone: Cleaned and curated
  • Analytics Zone: Structured for queries
Lake Architecture

ETL: From Lake to Warehouse

  • Extract: Pull data from lake
  • Transform: Clean and standardize
  • Validate: Ensure data quality
  • Load: Store into warehouse
  • Monitor: Track data pipeline health
{
    "extracted_records":1200000,
    "transformed_records":1185000,
    "errors_found":15000,
    "quality_score":0.987,
    "load_status":"success"
}

Data Warehouse

  • Structured & Normalized: Optimized for queries
  • Subject-Oriented: Focused on domains
  • Integrated: Data from multiple sources
  • Time-Variant: Historical trends stored
  • Non-Volatile: Read-only, stable data
Warehouse

Data Flow in Warehouses

  • Extract: Pull data from sources
  • Transform: Normalize & standardize
  • Load: Store in warehouse tables
db.crops.aggregate([
  { $match: { crop_type: "Wheat" } },
  { $project: { yield_tons: 1, farm_id: 1 } },
  { $group: {
      _id: "$farm_id",
      total_yield: { $sum: "$yield_tons" },
      avg_yield: { $avg: "$yield_tons" }
  }},
  { $sort: { total_yield: -1 } }
])

Challenges of Data Warehouse

  • Single, centralized structure can be overwhelming
  • Difficult to serve diverse departmental needs
  • Performance bottlenecks with large queries
  • Limited flexibility for specialized analysis
  • High cost of maintaining one large system
  • Slow adoption by end-users without tailored views
  • Harder to scale for business-specific use cases
Centralized Data Warehouse

Data Marts

  • Subject-Specific: Focused on one domain
  • Smaller Scope: Limited dataset vs warehouse
  • Faster Access: Optimized for queries
  • Decentralized: Independent warehouse
Data Mart

Types of Data Marts

  • Dependent: From central warehouse
  • Independent: From external sources
  • Hybrid: Combines both approaches
  • By Security: Restricted vs open access
  • By Domain: Sales, marketing, finance
  • By Application: Analytics or reporting
{
  "farm_id": "F12345",
  "crop_type": "Wheat",
  "region": "North Valley",
  "harvest_year": 2024,
  "yield_tons": 50.5,
  "water_used_m3": 1200,
  "fertilizer_kg": 300,
  "pesticides_kg": 20,
  "profit_usd": 7500,
  "irrigation_method": "Drip"
}
Which Database Type for Structured Data and Aggregated Queries?

Why Relational Databases for Warehouses & Marts?

  • Support structured schemas (facts + dimensions)
  • Efficient for joins, aggregations, group-bys
  • ACID properties ensure data integrity
  • Indexes optimize analytical queries
  • Widely supported by BI and reporting tools
  • Data Volume: Lake >>> Warehouse
// SQL Query Example
SELECT d.specialty, SUM(f.cost) AS total_cost
FROM Fact_Treatment f
JOIN Dim_Doctor d ON f.doctor_id = d.doctor_id
GROUP BY d.specialty;

Why Facts & Dimensions Matter

  • Facts: Numeric metrics (sales, yield, treatment cost)
  • Dimensions: Context (time, location, product, patient)
  • Facts alone = numbers without meaning
  • Dimensions alone = context without metrics
  • Together: enable analysis (per crop, per doctor, per region)
// Example: Hospital
Fact_Treatment {
  patient_id, doctor_id, date_id, cost
}

Dim_Doctor { doctor_id, specialty }
Dim_Date   { date_id, month, year }

Data Warehouse – Facts & Dimensions

  • Fact: Measure hospital events (treatment, admission)
  • Dimensions: Describe context (patient, doctor, time)
  • Facts = numeric & additive (e.g., cost, duration)
  • Dimensions = descriptive attributes
  • Together form star or snowflake schema
// Hospital Example
Fact_Treatment {
  patient_id, doctor_id, date_id, cost
}

Dim_Patient { patient_id, age, gender }
Dim_Doctor  { doctor_id, specialty }
Dim_Date    { date_id, year, month, day }

Data Warehouse/Mart – Star Schema

  • Central fact table with measures (e.g., sales, revenue)
  • Surrounded by denormalized dimension tables
  • Simpler design, faster queries
  • Easy for reporting and BI tools
  • Consumes more storage due to redundancy
Star Schema Figure

Data Warehouse/Mart – Snowflake Schema

  • Fact table with normalized dimension tables
  • Dimensions split into sub-dimensions (hierarchies)
  • Reduces data redundancy and storage cost
  • More complex design and joins
  • Better for large, scalable warehouses
Snowflake Schema Figure

Challenges of Secondary Data Usage

  • Data may be outdated or irrelevant
  • Original purpose may be missing
  • Data quality can be inconsistent
  • Limited control over data collection
  • Privacy concerns with sensitive data
  • May require cleaning or transformation
  • Primary usage has higher priority
Data Lake vs Warehouse

Warehouse Query Requirement – Average Rainfall

  • Requirement: Find average rainfall in a region
  • Fact Table: Rainfall measurements
  • Dimensions: Region, Date
  • Aggregate: Average per region
// Fact Table: Fact_Rainfall
// region_id, date_id, mm_rain

SELECT r.region_name, 
       AVG(f.mm_rain) AS avg_rainfall
FROM Fact_Rainfall f
JOIN Dim_Region r ON f.region_id = r.region_id
GROUP BY r.region_name;

Please Draw a Database Pipeline on Paper

Questions?

Lab 2

Lab 2 – What Went Well

Client (.html)
  • Almost no for loops and if conditions
  • Used chaining with map, filter, reduce
  • Focused on enriching metadata
  • Mostly button clicks (some triggers, not automatic)
Server (.js)
  • Explicitly allowed Access-Control-Allow-Methods
  • Asked many good questions

Enrichment of Metadata

  • Using Array(10) instead of for loop
  • Fill and map pattern generates objects concisely
  • Metadata enrichment applied with spread operator
  • Cleaner and more functional programming style
  • Scalable for generating synthetic sensor data
const objects = Array(10).fill(null)
  .map((_, i) => ({
    id: i + 1,
    sensorId: `sensor-${i + 1}`,
    reading: Math.floor(Math.random() * 100),
    timestamp: new Date().toISOString(),
    notes: `Reading ${i + 1}`,
    ...metadata
  }));

Lab 2 – What Did Not Go Well

Client (.html)
  • Extraction means querying data automatically
  • Extraction means posting to the lake automatically
  • Manual copy of objects to collection not scalable
Server (.js)
  • Password in plain code (mongodb+srv://...)
  • Should use .env for secrets
  • Need more specific endpoints ('/sync')
  • Some queries deleted entire lake data
  • GitHub structure not followed (one folder per lab)

Metadata as a Separate Object

  • Metadata should not mix with raw sensor data
  • Keeps the structure clean and organized
  • Prevents accidental changes to measurement values
  • Allows flexible updates without touching raw data
  • Improves clarity for downstream processing
const reading = {
  data: {
    id: 1,
    sensorId: "sensor-1",
    value: 72,
    timestamp: "2025-09-30T10:00:00Z"
  },
  metadata: {
    author: "Alice",
    last_sync: new Date().toISOString(),
    uuid_source: crypto.randomUUID()
  }
};

Use sourceId Instead of id

  • Avoids confusion with database _id
  • Clarifies data origin vs system identifier
  • Makes schema easier to interpret
  • Supports better integration with external sources
  • Improves long-term maintainability
{
  "_id": {
    "$oid": "68d7557c7cbbeeea2631a6cb"
  },
  "id": 1,
  "sensorId": "S1",
  "reading": 53,
  "timestamp": "2025-09-27T03:08:47.143Z",
  "notes": "Record for sensor 1",
  "metadata": {
    "author": "Alice",
    "last_sync": "2025-09-27T03:09:47.140Z",
    "uuid_source": "8a72254d-b18c-4706-9e57-a98bb8ae523d"
  }
}
Extraction: Query Database + Upload Data

HTTP POST Request with Fetch API

  • Send data to the server
  • Specify method as POST
  • Include headers (e.g., Content-Type)
  • Body must be a string (e.g., JSON)
  • Commonly used to create new records
fetch("http://localhost:3000/agriculture/sync/indexeddb", {
  method: "POST",
  headers: {
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    sensorId: "S1",
    reading: 75
  })
})
  .then(response => response.json())
  .then(result => {
    console.log("Data posted:", result);
  })
  .catch(error => {
    console.error("Error:", error);
  });

HTTP GET Request with Fetch API

  • Retrieve data from a server
  • Default method is GET
  • Returns a Promise
  • Use response.json() to parse JSON
  • Commonly used for reading APIs
fetch("http://localhost:3000/agriculture/sync/indexeddb")
  .then(response => response.json())
  .then(data => {
    console.log("Data received:", data);
  })
  .catch(error => {
    console.error("Error:", error);
  });
How to Prevent Deleting Data in a Lake?

MongoDB Lake – Avoid Deletion

  • Do not expose DELETE routes in API
  • Restrict operations to insert and update
  • Protect data at the application layer
// API example: No DELETE endpoint
app.post("/add", async (req, res) => {
  await db.collection("lake").insertOne(req.body);
  res.send("Added safely");
});

// No app.delete(...) defined

Soft Delete Instead of Hard Delete

  • Mark documents as deleted, don’t remove them
  • Preserves history for auditing
  • Supports recovery if needed
// Soft delete with a flag
db.lake.updateOne(
  { _id: ObjectId("...") },
  { $set: { isDeleted: true, deletedAt: new Date() } }
);

// Query only active docs
db.lake.find({ isDeleted: { $ne: true } });

Restrict User Roles

  • Use MongoDB RBAC (Role-Based Access Control)
  • Remove the remove (delete) privilege from users
  • Allow only find, insert, update
db.createRole({
  role: "noDeleteRole",
  privileges: [
    {
      resource: { db: "lakeDB", collection: "" },
      actions: ["find", "insert", "update"]
    }
  ],
  roles: []
});

db.createUser({
  user: "lakeUser",
  pwd: "securePass",
  roles: [ { role: "noDeleteRole", db: "lakeDB" } ]
});

Monitor & Restore

  • Use Change Streams to detect deletes
  • Log and re-insert deleted documents
  • Backups & snapshots as safety net
// Watch for deletes
db.lake.watch([
  { $match: { operationType: "delete" } }
]).on("change", change => {
  console.log("Delete detected:", change);
  // Optionally restore from backup
});

Agenda

  1. DB Architecture - Data Lakehouse
  2. DB Architecture - Streaming Pipelines
  3. NoSQL Graph DB - Neo4j (Queries & Links)
  4. Lab 3 Sample Code, Homework 3
  5. Midterm & Project Part 1 – Guidelines
Data Lakehouse

Recap: Lake β†’ Warehouse β†’ Mart

  • Lake: raw, flexible, cheap
  • Warehouse: curated, reliable
  • Mart: focused, fast to use

Agriculture: Warehouse path

  • Soil tests β†’ ETL β†’ tables
  • Yields by field, by season
  • Reports for decisions

RDBMS for marts; clean, structured facts

Limits: Warehouse

  • Rigid schemas; slow change
  • Costly scale for history
  • Weak for unstructured data
  • Weak for realtime data
  • Raw detail lost on ETL

Agriculture: Lake path

  • Sensor logs in object store
  • Realtime data, weather CSV
  • Ad hoc SQL and scripts

MongoDB for raw docs; flexible ingest

Limits: Data Lake

  • Schema-on-read is fragile
  • No ACID across files
  • Catalog and lineage weak

The Gap

  • Keep raw + curated together
  • Get ACID and time travel
  • Serve SQL, ML, streaming
  • Less ETL: no lake β†’ warehouse
Lake + Warehouse:
Sources → ETL → Data Lake → ETL → Data Warehouse → ETL → Marts / Dashboards → Consumers

Lakehouse:
Sources → ETL → Lakehouse (open formats + ACID) → SQL / ML / Streaming Engines → Consumers

  • Single ingestion step
  • Less ETL (no lake β†’ warehouse shuffle)
  • One store, many engines

Architectural Concept Lakehouse

  • SQL query engine on raw files
  • Tables on open files format
  • ACID: safe merge and update
  • Schema evolve, time travel
  • One store, many engines
  • Strong catalog and lineage
Lakehouse diagram placeholder

Agriculture: Lakehouse flow

  • Ingest sensors, drones, soil
  • Store raw + curated tables
  • Merge late yield updates
  • SQL for Reports; ML on same data

Where tools fit

  • RDBMS: marts and reports
  • MongoDB: raw field logs
  • Neo4j: farm links, supply
  • IndexedDB: offline tablets
  • Lakehouse: Delta Lake / Apache Iceberg

Agriculture with Lakehouse

  • Soil + yield β†’ curated tables
  • Drone images β†’ raw files
  • BI reports on zones
-- Daily report
SELECT zone_id, AVG(yield_kg)
FROM fact_yield
GROUP BY zone_id;

Graph view: Inputs β†’ Fields

  • Fertilizer β†’ Field β†’ Yield
  • Supplier β†’ Batch β†’ Field

Neo4j to explore causal chains

Takeaways

  • Warehouse: trust, but rigid
  • Lake: flexible, but messy
  • Lakehouse: both in one

Recap: Lakehouse

  • One platform for raw + curated
  • ACID transactions on open files
  • SQL, ML, and BI share same data
  • Good for batch, not always for now

Limits of Batch

  • Updates only daily/hourly
  • Late data is hard to merge fast
  • Farm alerts need real time
Batch-Oriented
  • Insurance risk modeling
  • Census & demographic statistics
  • Climate & weather simulations
  • Pharmaceutical clinical trial data
  • Corporate reporting
Real-Time (Streaming)
  • Air traffic control & flight tracking
  • Emergency response & 911 dispatch systems
  • Healthcare monitoring (ICU sensors, wearables)
  • Power grid & smart meters
  • Autonomous vehicles & traffic systems

New Need: Real Time Pipelines

  • Soil dries in minutes
  • Machine breakdown alerts
  • Weather warnings on time
Batch-first:
Sources → ETL → Data Lake → ETL → Warehouse → ETL → Marts → Consumers

Streaming-first:
Sources → Stream Pipeline → Consumers (→ ETL → Lake → ...)

  • Soil alerts in seconds
  • History still stored

Enter Streaming Pipelines

  • Data flows as events
  • Process as it arrives
  • Push alerts & dashboards
  • Store only what’s needed

Streaming Pipeline Flow

  1. Source: sensors, tractors, APIs
  2. Stream: Kafka topics
  3. Process: filters, agg windows
  4. Store: Elasticsearch, DBs
  5. Consume: Kibana, alerts
Streaming Flow

Ag Example: Soil Moisture

  • Sensor sends event every 5 min
  • Stream joins with weather feed
  • Alert if below threshold
// Kafka consumer (Node.js, event-style consumer API)
consumer.on("data", msg => {
  const reading = JSON.parse(msg.value); // payload arrives as raw bytes/JSON
  if (reading.moisture < 15) {
    sendAlert("Field A12 too dry!");
  }
});

Ag Example: Tractor Telemetry

  • Streams engine heat + GPS
  • Detect overheating early
  • Notify operator instantly

Streaming Pipeline Stack: Kafka

  • Capture events as messages
  • Process stream data
  • Scale for high throughput
# producers β†’ Kafka topics
# consumers β†’ Kafka topics

Kafka: Distributed Event Streaming

  • Distributed event streaming platform
  • High throughput (<2β€―ms latency)
  • Scalable to thousands of brokers
  • Durable, fault‑tolerant storage
# create topic
kafka-topics.sh --create --topic events --bootstrap-server localhost:9092
# producer
kafka-console-producer.sh --topic events --bootstrap-server localhost:9092
# consumer
kafka-console-consumer.sh --topic events --from-beginning --bootstrap-server localhost:9092

End-to-End Streaming Example

  • Producers send events to Kafka
  • Consumers read and process events
# produce events via Kafka
# consume events from Kafka
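A hedged end-to-end sketch with the kafkajs Node library (broker address, topic name, and payload are assumptions):
// npm install kafkajs
const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "farm-app", brokers: ["localhost:9092"] });

async function main() {
  // Producer: send a soil-moisture event
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: "events",
    messages: [{ value: JSON.stringify({ field: "A12", moisture: 12 }) }]
  });

  // Consumer: react to each event as it arrives
  const consumer = kafka.consumer({ groupId: "alerts" });
  await consumer.connect();
  await consumer.subscribe({ topic: "events", fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const reading = JSON.parse(message.value.toString());
      if (reading.moisture < 15) console.log(`Alert: field ${reading.field} too dry`);
    }
  });
}

main().catch(console.error);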

Kafka Documentation
Lakehouse (e.g., MongoDB) vs Streaming (e.g., Kafka)

Lakehouse vs Streaming

  • Lakehouse
  • Batch load
  • SQL + ML unification
  • Daily/Hourly insights
  • Streaming
  • Event-by-event
  • Continuous joins
  • Alerts in seconds

Takeaways

  • Lakehouse: unified batch
  • Streaming: real-time alerts
  • Often used together
  • Agri farms need both
Questions?

A city wants to track traffic flow in real time to adjust signals, while also keeping historical data for planning future road expansions.

Would you choose a Data Lake, a Data Warehouse, a Lakehouse, or a Streaming Pipeline?

Traffic data illustration placeholder

Sample Answer: Traffic Use Case

Recommendation
  • Streaming pipeline for live traffic signals
  • Lakehouse (Delta/Iceberg) as shared store
  • SQL / ML for planning & forecasting
  • Consumers:
    • Real-time: traffic ops & systems
    • Reporting: planners, dashboards
Traffic Sensors / Apps
Streaming Ingest
Consumers (Real-Time)
Lakehouse Tables
SQL / ML Engines
Consumers (Reporting)

Realtime Data?

Realtime data processing enables immediate responses. Events are captured, streamed, and acted upon in seconds, supporting use cases like fraud detection, traffic signals, and IoT monitoring.

Realtime data flow placeholder
Questions?

Agenda

  1. DB Architecture - Data Lakehouse ✓
  2. DB Architecture - Streaming Pipelines ✓
  3. NoSQL Graph DB - Neo4j (Queries & Links)
  4. Lab 3 Sample Code, Homework 3
  5. Midterm & Project Part 1 – Guidelines
Data Lakehouse

Neo4j: Add Nodes

Requirement
  • Add Field A12
  • Add Sensor S-001
MERGE (:Field {id:'A12', name:'North'})
MERGE (:Sensor {id:'S-001', type:'soil'});
➜ Created/Found:
  Field A12, Sensor S-001

Neo4j: Add Link

Requirement
  • Link sensor to field
MATCH (f:Field {id:'A12'})
MATCH (s:Sensor {id:'S-001'})
MERGE (s)-[:LOCATED_IN]->(f)
RETURN f.id AS field, s.id AS sensor;
➜ field | sensor
  A12   | S-001

Neo4j: Add Reading

Requirement
  • Store moisture=14
  • Keep time of reading
MATCH (s:Sensor {id:'S-001'})
CREATE (r:Reading {
  at: datetime(), moisture:14
})-[:FROM]->(s)
RETURN s.id AS sensor, r.moisture AS m;
➜ sensor | m
  S-001  | 14

Neo4j: Read Latest

Requirement
  • List sensors in A12
  • Show latest moisture
MATCH (f:Field {id:'A12'})<-[:LOCATED_IN]-
      (s:Sensor)<-[:FROM]-(r:Reading)
WITH s, r
ORDER BY r.at DESC
WITH s, collect(r)[0] AS last
RETURN s.id AS sensor, last.moisture AS m;
➜ sensor | m
  S-001  | 14

Neo4j: Aggregate

Requirement
  • Avg moisture by field
  • Sort lowest first
MATCH (f:Field)<-[:LOCATED_IN]-(s:Sensor)
MATCH (r:Reading)-[:FROM]->(s)
WITH f.id AS field, avg(r.moisture) AS avg_m
RETURN field, round(avg_m,1) AS avg_m
ORDER BY avg_m ASC;
➜ field | avg_m
  A12   | 14.0
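
The same read can be issued from Node.js with the neo4j-driver package used in Lab 3; a minimal sketch (the NEO4J_* variable names are assumptions, match them to your .env):

// Run the "latest moisture" query from Node.js (assumes: npm install neo4j-driver dotenv)
require("dotenv").config();
const neo4j = require("neo4j-driver");

const driver = neo4j.driver(
  process.env.NEO4J_URI, // e.g., neo4j+s://<id>.databases.neo4j.io
  neo4j.auth.basic(process.env.NEO4J_USER, process.env.NEO4J_PASSWORD)
);

async function latestMoisture(fieldId) {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (f:Field {id:$fieldId})<-[:LOCATED_IN]-(s:Sensor)<-[:FROM]-(r:Reading)
       WITH s, r ORDER BY r.at DESC
       WITH s, collect(r)[0] AS last
       RETURN s.id AS sensor, last.moisture AS m`,
      { fieldId }
    );
    result.records.forEach((rec) => console.log(rec.get("sensor"), rec.get("m")));
  } finally {
    await session.close();
    await driver.close();
  }
}

latestMoisture("A12").catch(console.error);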
Questions?
ETL with Visual Studio

Lab 3 – Neo4j + Mongo Lake

  • Pull graph data from Neo4j
  • Push data into MongoDB Atlas
  • Query and screenshot results
  • Submit code + screenshots
# Repo
git clone https://github.com/SE4CPS/2025-COMP-263.git
cd 2025-COMP-263/labs/lab3

# Install
npm init -y
npm install dotenv neo4j-driver mongodb

# Env
cp .env.template .env   # do NOT commit

Step 2 – Test Neo4j

  • Use provided sample script
  • Expect β€œConnected to Neo4j Aura!”
  • Fix .env / network if it fails
# Run
node labs/lab3/sampleNeo4jConnection.js

# Expected
Connected to Neo4j Aura!

# Check if error
# - URI / USER / PASSWORD in .env
# - Internet/TLS reachable

Step 3 – Read / Select Data (Neo4j)

  • Open Neo4j Browser (Aura)
  • Get only nodes and edges
  • Run a generic read query
  • Visualize graph and screenshot
// Read & visualize nodes and edges
MATCH (n)-[r]->(m)
RETURN n,r,m
LIMIT 25;

Step 4 – Push to Mongo

  • Run integration script
  • Expect insert count log
  • Check Mongo Atlas project
# Run
node labs/lab3/pushNeo4jToMongo.js

# Expected log
Inserted X docs into Mongo lake

# Tip
# If auth fails, check .env connection

Step 5 – Query in Compass

  • Connect via Compass URI
  • Select Project1.lake
  • Run two finds
  • Take two screenshots
// All docs
db.lake.find({})

// Only Neo4j-sourced
db.lake.find({ sourceDB: "Neo4j" })

# Submit: code + screenshots

Agenda

  1. Pipeline Bottlenecks – Sources & Symptoms
  2. Throughput vs Latency – Metrics & Tradeoffs
  3. In-Memory Acceleration – Redis Overview
  4. Hands-On – Online Redis CLI (Key/Value, JSON)
  5. Pipeline Caching & Data Eviction Strategies
  6. Performance Tuning – Concurrency & Scaling
Data Pipeline Performance

Pipeline Bottlenecks

Sources & Symptoms in Modern Data Pipelines

Common Sources - Slow ETL

  • Slow I/O – network or disk
  • Serialization & parsing overhead
  • Unbalanced workloads across nodes
  • Blocking APIs or synchronous flows
Bottleneck Icon

Data Flow Example

  • Source β†’ Ingest β†’ Transform β†’ Store
  • Single slow stage throttles the rest
  • Monitoring latency per stage is key
Pipeline Stages Diagram

Symptoms

Symptom Likely Cause
High latency I/O wait or serialization
Low throughput Network or CPU bottleneck
Queue build-up Backpressure not handled
Performance Bottleneck Diagram

Prevention Strategies

  • Parallelize CPU-heavy transformations
  • Batch I/O and compress data blocks
  • Use async I/O and message queues
  • Profile and monitor each pipeline stage
Optimization Illustration
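
As a sketch of the batching idea only (the 500-event batch size and the MongoDB insertMany target are illustrative assumptions):

// Batch many small writes into one bulk insert instead of one round trip per event
const BATCH_SIZE = 500;
const buffer = [];

async function flush(collection) {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length); // take everything currently buffered
  await collection.insertMany(batch);            // one network round trip for the whole batch
}

async function onSensorEvent(collection, event) {
  buffer.push(event);                            // cheap in-memory append
  if (buffer.length >= BATCH_SIZE) await flush(collection);
}

// flush leftovers on a timer so late events do not sit in the buffer forever
function startFlusher(collection) {
  return setInterval(() => flush(collection).catch(console.error), 1000);
}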

Throughput vs Latency

Metrics, Tradeoffs, and Tuning Principles

Core Definitions

  • Latency: Time per request (ms)
  • Throughput: Requests per second
  • Low latency β‰  high throughput
Performance Graph

Tradeoffs

Goal Possible Tradeoff
Increase throughput Higher latency from batching
Reduce latency Less batching β†’ CPU overhead
Parallelize tasks Coordination cost ↑
Tradeoff Balance Icon

Pipeline Example

  • Each stage adds fixed latency
  • Parallelism boosts throughput
  • Balance stages to avoid choke points
Latency vs Throughput Diagram

Measurement Tools

  • Prometheus & Grafana dashboards
  • Apache JMeter, k6, or Locust load tests
  • Redis Monitor (for in-memory ops)
Monitoring Dashboard

Monitoring Frequent & Real-Time Requests

  • Continuous soil & temperature readings from sensors
  • Buffer for ultra-fast ingestion
  • Sliding window to detect spikes or delays
  • Averages sent to dashboard every 10 s

Example: Smart irrigation with 500+ sensors


// Core loop
client.set(sensorId, JSON.stringify({
  moisture: 34.8,
  temp: 21.7,
  ts: Date.now()
}));

// Windowed latency check
const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log(`avg latency ${avg.toFixed(1)} ms`);

// sample Redis data
{
  "sensor:field12": {
    "moisture": 34.8,
    "temp": 21.7,
    "ts": 1738893421312
  }
}
How to Improve Data Pipeline Performance?
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

In-Memory NoSQL DB – Redis Overview

Fastest server-side database for real-time data

  • Data resides in RAM for microsecond access
  • Ideal for caching, queues, and streaming
  • Balances memory speed and persistence
Memory vs Storage

Chronology

  • 2009 – Salvatore Sanfilippo (antirez) releases Redis
  • Focus: speed, simplicity, in-memory design
  • Open-sourced under BSD, later managed by Redis Ltd
  • Now a backbone for caching & streaming
  • https://github.com/redis/redis
Redis logo

Server-Side Architecture

  • Runs as a server process
  • Clients connect via TCP on port 6379
  • Stores data in RAM for low latency
  • Optional persistence to disk (RDB/AOF)

$ redis-server
$ redis-cli ping
PONG

Data Eligible for Redis

  • Small to medium objects (KB–MB range)
  • Frequently accessed or time-sensitive data
  • Ephemeral or cacheable state (sessions, tokens)
  • Real-time metrics (sensor readings, API counts)
  • JSON, integers, or serialized strings
  • Session tokens β†’ session:usr123
  • API key cache β†’ api:key:weather-service

// Example: agriculture sensor data
{
  "sensor:field12": {
    "moisture": 34.8,
    "temperature": 21.7,
    "timestamp": 1738893421312
  }
}

// Session key
"session:usr123" : {
  "token": "eyJhbGciOiJIUzI1NiIs...",
  "expires": 1738893999000
}

// API request count
"api:weather:getCount" : 4821

// Cache entry
"cache:forecast:stockton" : {
  "temp": 21.7,
  "humidity": 65,
  "ts": 1738893480000
}

Why In-Memory?

  • RAM access β‰ˆ 10⁡ Γ— faster than disk
  • Ideal for caching & real-time analytics
  • Used between app ↔ database
cache diagram

Key–Value Model

  • Every record = unique key + value
  • Binary-safe strings β†’ numbers, JSON, etc.
  • O(1) lookup

SET crop:12 34.8
GET crop:12
# "34.8"

Core Data Types

  • String
  • List (linked list)
  • Set (unique items)
  • Hash (key–value map)
  • Sorted Set, Stream
Redis data types diagram

Strings

Single value per key, atomic increment/decrement


SET temp:field3 21.5
INCRBYFLOAT temp:field3 0.3
GET temp:field3
# 21.8

Lists

Push and pop like a queue or log


LPUSH sensor:log 34.1
LPUSH sensor:log 34.5
LRANGE sensor:log 0 -1
# ["34.5","34.1"]

Hashes

Mini key-value maps (like JSON objects)


HSET field:12 moisture 34.8 temp 21.7
HGETALL field:12
HGET field:12 temp
# moisture:34.8, temp:21.7

Sets

Unordered unique elements


SADD crops wheat barley maize
SMEMBERS crops
# wheat barley maize

Sorted Sets

Scores + values β†’ leaderboards or metrics


ZADD soil_moisture 34.8 field12 35.4 field13
ZRANGE soil_moisture 0 -1 WITHSCORES

Streams

Append-only logs for real-time pipelines


XADD sensor:stream * moisture 34.8 temp 21.7
XRANGE sensor:stream - +

Expiry & TTL

  • Sets a time limit (in seconds) for key validity
  • After expiry β†’ key is automatically deleted
  • Used for caches, sessions, temporary data

TTL = Time-To-Live (measured in seconds)


SET weather:now 21.7 EX 60
TTL weather:now
# 60  (seconds remaining)
# ...after a few seconds...
# 57
# -2  (means key expired)

Persistence Modes

  • RDB – snapshotting at intervals
  • AOF – append-only log of commands
  • Hybrid (default)
Persistence diagram

Publish / Subscribe

Real-time messaging between clients


SUBSCRIBE irrigation
PUBLISH irrigation "Start pump A"
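
The same pattern from Node.js, as a sketch assuming the ioredis client and a local Redis (a subscribed connection cannot issue other commands, so two connections are used):

// Pub/Sub with ioredis (assumes: npm install ioredis, Redis at localhost:6379)
const Redis = require("ioredis");

const sub = new Redis(); // dedicated subscriber connection
const pub = new Redis(); // publisher connection

sub.subscribe("irrigation", (err, count) => {
  if (err) return console.error(err);
  console.log(`Subscribed to ${count} channel(s)`);
});

sub.on("message", (channel, message) => {
  console.log(`[${channel}] ${message}`); // e.g., "[irrigation] Start pump A"
});

// publish shortly after, once the subscription is in place
setTimeout(() => pub.publish("irrigation", "Start pump A"), 500);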

ACID in Redis

Property Redis Support
Atomicity Yes (single command)
Consistency Eventual, not strict
Isolation Transactions via MULTI/EXEC
Durability No (by default)
ACID diagram
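
The isolation row refers to Redis transactions (MULTI/EXEC); a minimal ioredis sketch with illustrative keys:

// MULTI/EXEC: queue commands and execute them as one atomic block (ioredis)
const Redis = require("ioredis");
const redis = new Redis();

async function recordReading() {
  const results = await redis
    .multi()                            // start transaction
    .set("sensor:field12:last", "34.8") // queued, not yet executed
    .incr("sensor:field12:count")       // queued
    .exec();                            // run all queued commands together
  console.log(results);                 // e.g., [ [null, "OK"], [null, 1] ]
}

recordReading().catch(console.error);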

Quick Comparison

DB Model Speed Use Case
Redis Key-Value β˜…β˜…β˜…β˜…β˜… Caching / Realtime
IndexedDB Client KV β˜…β˜…β˜…β˜†β˜† Browser storage
MongoDB Document β˜…β˜…β˜…β˜†β˜† Flexible schema
SQL Relational β˜…β˜…β˜†β˜†β˜† Transactional data
Neo4j Graph β˜…β˜…β˜†β˜†β˜† Relationships
DB comparison diagram

Scaling

  • Master-replica replication
  • Redis Cluster for sharding
  • Horizontal scale for large datasets
Redis cluster

Common Use Cases

  • Caching API responses
  • Session management
  • Real-time analytics
  • Message queues
Streaming use case

Redis – Strengths

  • Server-side, in-memory key–value database
  • Ultra-low latency (<1 ms typical)
  • Supports rich data types (String, List, Hash, Set, Stream)
  • Ideal for caching, sessions, and real-time analytics
  • Complements MongoDB / SQL in pipelines

Redis – Limitations

  • Volatile by default: data lost if not persisted
  • Not ideal for heavy, continuous writes to disk
  • RAM-based β†’ limited by available memory
  • Lacks strong ACID durability (config-dependent)
  • No native query language (key-based access only)
Questions?

A city wants to track traffic flow in real time to adjust signals, while also keeping historical data for planning future road expansions.

Would you choose a Data Lake, a Data Warehouse, a Lakehouse, or a Streaming Pipeline?

Traffic data illustration placeholder

Agenda

  1. Redis API and Performance
  2. Caching Strategies in Web Apps
  3. Midterm Review
  4. Midterm Mock Exam
  5. Q&A and Discussion
Data Pipeline Performance

Node.js + MongoDB + Redis

  • Load agriculture docs from MongoDB
  • Cache results in Redis
  • Expose two GET endpoints with timing in JSON
  • Compare latency: Mongo only vs Redis cache
Agriculture data

Install & Env

npm i express mongodb ioredis dotenv
# .env (example)
PORT=3000
MONGO_HOST=mongodb+srv://cluster0.lixbqmp.mongodb.net
MONGO_USER=comp263_2025
MONGO_PASS=***yourpass***
REDIS_URL=redis://localhost:6379
CACHE_TTL_SECONDS=60
      

Keep secrets in .env. Don’t commit.

  • mongodb official driver
  • ioredis robust Redis client
  • Config validated at boot

Connect to MongoDB

Connects to Lab2.Agriculture and validates env.

  • SRV URI + separate auth
  • estimatedDocumentCount() on boot for sanity

const client = new MongoClient(`${HOST}/...`, {
  useNewUrlParser: true, useUnifiedTopology: true,
  auth: { username: USER, password: PASS }, authSource: "admin"
});

Add Redis Client

const Redis = require("ioredis");
const redis = new Redis(process.env.REDIS_URL);
// Optional: redis.on("connect", ...) for logging
      
  • Single instance is fine for class demos
  • Use SETEX/EX TTL for freshness
 

Timing & Response

  • Start timer at request
  • Return { source, timeMs, count, data }
  • Also set X-Response-Time header
function withTimer(handler) {
  return async (req, res) => {
    const t0 = process.hrtime.bigint();
    try { await handler(req, res, t0); }
    catch (e) { res.status(500).json({ error: String(e) }); }
  };
}
function elapsedMs(t0) {
  return Number((process.hrtime.bigint() - t0) / 1000000n);
}

GET /agriculture/mongo

app.get("/agriculture/mongo", withTimer(async (req, res, t0) => {
  const docs = await collection.find({}).limit(500).toArray();
  const body = { 
    source: "mongo", timeMs: elapsedMs(t0), 
    count: docs.length, data: docs 
  };
  res.set("X-Response-Time", body.timeMs + "ms").json(body);
}));
  • Baseline latency (no cache)
  • Limit for demo readability

GET /agriculture/redis

app.get("/agriculture/redis", withTimer(async (req, res, t0) => {
  const key = "agri:all:limit500";
  const ttl = Number(process.env.CACHE_TTL_SECONDS || 60);

  const cached = await redis.get(key);
  if (cached) {
    const data = JSON.parse(cached);
    const body = { 
        source: "redis", timeMs: elapsedMs(t0), 
        count: data.length, data 
    };
    return res.set("X-Response-Time", body.timeMs + "ms").json(body);
  }

  const docs = await collection.find({}).limit(500).toArray();
  await redis.set(key, JSON.stringify(docs), "EX", ttl);
  const body = { 
    source: "mongo→redis(set)", timeMs: elapsedMs(t0), 
    count: docs.length, data: docs 
  };
  res.set("X-Response-Time", body.timeMs + "ms").json(body);
}));
  • Hit: returns in a millisecond or less
  • Miss: fetch + populate cache

See the Difference

# Quick manual check (shows X-Response-Time)
curl -i http://localhost:3000/agriculture/mongo | head
curl -i http://localhost:3000/agriculture/redis | head
curl -i http://localhost:3000/agriculture/redis | head  # 2nd
  • Discuss p50/p95 latencies (see the sketch below)
  • Expect Redis to dominate on repeat calls
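
A small sketch for turning repeated calls into p50/p95 numbers (the run count and endpoint are assumptions; requires Node 18+ for the global fetch):

// Measure p50/p95 latency of an endpoint with repeated requests
const ENDPOINT = "http://localhost:3000/agriculture/redis"; // assumed endpoint from the demo
const RUNS = 50;

function percentile(sorted, p) {
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

async function bench() {
  const times = [];
  for (let i = 0; i < RUNS; i++) {
    const t0 = process.hrtime.bigint();
    const res = await fetch(ENDPOINT);
    await res.text();                                      // include full response in the timing
    times.push(Number(process.hrtime.bigint() - t0) / 1e6); // ms
  }
  times.sort((a, b) => a - b);
  console.log(`p50=${percentile(times, 50).toFixed(1)} ms  p95=${percentile(times, 95).toFixed(1)} ms`);
}

bench().catch(console.error);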

Installing Redis (Ubuntu via WSL2 on Windows, macOS via Homebrew)

Windows (via WSL2 + Ubuntu)

# 1. Install Ubuntu on Windows
wsl --install -d Ubuntu
# Restart when prompted, then open Ubuntu

# 2. Inside Ubuntu
sudo apt update
sudo apt install -y redis-server
sudo service redis-server start
redis-cli ping   # β†’ PONG
      
  • Runs Redis inside Ubuntu under WSL2
  • Works seamlessly with Node.js on Windows
  • Use REDIS_URL=redis://127.0.0.1:6379 in .env

macOS (via Homebrew)

brew update
brew install redis
brew services start redis
redis-cli ping   # β†’ PONG
      
  • Uses Homebrew package manager
  • Start/stop with brew services
  • Redis listens on 127.0.0.1:6379

Your Turn

NoSQL Lab Repository

Performance Comparison: Redis vs MongoDB

Approximate time to retrieve one object at different data scales.

Dataset Size Redis (In-Memory) MongoDB (Disk-Based) Relative Speed
1 record ~1 ms ~100 ms β‰ˆ 100Γ— faster
1,000 records ~2 ms ~120 ms β‰ˆ 60Γ— faster
100,000 records ~5 ms ~250 ms β‰ˆ 50Γ— faster
1,000,000 records ~10 ms ~600 ms β‰ˆ 60Γ— faster

Redis Commands in Node.js: SET & GET

  • SET key value stores a string value.
  • GET key retrieves it.
  • Default behavior: value persists until deleted or replaced.
  • Use for caching simple data (e.g., counts, tokens, JSON).
const Redis = require("ioredis");
const redis = new Redis();

// Store data
await redis.set("crop:rice", JSON.stringify({ yield: 45 }));

// Retrieve data
const value = await redis.get("crop:rice");
console.log(JSON.parse(value));

Redis Commands: SETEX (Expire in Seconds)

  • SETEX key ttl value
  • Common for caching responses temporarily.
  • Expires automatically: no need to delete manually.
  • Node: redis.set(key, value, "EX", ttl)
await redis.set("weather:stockton", 
  JSON.stringify({ temp: 29 }), 
  "EX", 60); // expires in 60 seconds

const cached = await redis.get("weather:stockton");
console.log(cached ? "Cache hit" : "Expired");

Redis Commands: DEL & EXISTS

  • DEL key removes one or more keys.
  • EXISTS key checks if a key is present.
  • Useful for clearing outdated cache or validation.
await redis.del("weather:stockton");

const exists = await redis.exists("weather:stockton");
if (exists) console.log("Still cached");
else console.log("Cache cleared or expired");

Redis Commands: TTL & EXPIRE

  • TTL key shows remaining lifetime (in seconds).
  • EXPIRE key seconds sets a new TTL for an existing key.
  • Can refresh cache without rewriting data.
await redis.set("session:123", "active");
await redis.expire("session:123", 120);

const ttl = await redis.ttl("session:123");
console.log("Time left:", ttl, "seconds");

Redis Commands: MSET & MGET

  • MSET sets multiple keys at once.
  • MGET retrieves multiple keys in one call.
  • Efficient for bulk reads/writes in Node.js.
await redis.mset({
  "crop:wheat": "good",
  "crop:rice": "medium",
  "crop:corn": "low"
});

const values = await redis.mget(
    "crop:wheat", 
    "crop:rice", 
    "crop:corn"
);
console.log(values);

Redis Commands: FLUSHDB & KEYS

  • FLUSHDB removes all keys from the current database.
  • KEYS pattern lists keys matching a pattern.
  • ⚠️ Use with caution: affects all cached data.
const allKeys = await redis.keys("crop:*");
console.log("Found:", allKeys.length, "keys");

await redis.flushdb(); // clears all keys in this DB
console.log("Database cleared");

Agenda

  1. Redis API and Performance ✓
  2. Caching Strategies in Web Apps
  3. Midterm Review
  4. Midterm Mock Exam
  5. Q&A and Discussion
Data Pipeline Performance

Caching Strategy 1: Cache-Aside (Lazy Loading)

  • Application checks cache first.
  • If data is missing β†’ load from DB and store in cache.
  • Common in read-heavy APIs.
  • Simple and cost-efficient.
        graph LR
          A[Client Request] --> B[Check Cache]
          B -->|Hit| C[Return from Cache]
          B -->|Miss| D[Fetch from DB]
          D --> E[Update Cache]
          E --> F[Return to Client]
      

Caching Strategy 2: Write-Through

  • Every write goes to cache and DB together.
  • Keeps cache consistent with DB.
  • Higher write latency, but consistent reads.
        graph LR
          A[Client Write] --> B[Update Cache]
          B --> C[Write to DB]
          C --> D[Success Response]
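
A minimal write-through sketch in Node.js, reusing the ioredis and mongodb clients from the earlier demo (the crops collection and 300 s TTL are illustrative):

// Write-through: every write updates the database and the cache together
async function saveCrop(redis, collection, crop) {
  // 1) write the durable copy to MongoDB
  await collection.updateOne({ _id: crop._id }, { $set: crop }, { upsert: true });
  // 2) write the cached copy, with a TTL so it cannot go stale forever
  await redis.set(`crop:${crop._id}`, JSON.stringify(crop), "EX", 300);
  return crop;
}

// usage: await saveCrop(redis, db.collection("crops"), { _id: "crop-123", name: "Tomato" });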
      

Caching Strategy 3: Write-Behind (Write-Back)

  • Writes go to the cache first.
  • Cache flushes to the database asynchronously.
  • Fast writes; risk if cache fails before flush.
flowchart LR
A["Client Write"] --> B["Write to Cache"]
B --> C["Async Queue"]
C --> D["DB Update (Delayed)"]

Caching Strategy 4: Read-Through

  • App always queries the cache.
  • Cache fetches from DB automatically on a miss.
  • Cache layer abstracts DB access logic.
        graph LR
          A[Client Request] --> B[Cache Layer]
          B -->|Hit| C[Return from Cache]
          B -->|Miss| D[Cache Loads from DB]
          D --> E[Store in Cache & Return]
      

Caching Strategy 5: TTL / Expiration-Based

  • Each key has a time-to-live (TTL).
  • Expired data automatically removed.
  • Ensures freshness and reduces stale data.
  • Common for APIs with time-sensitive data (e.g., weather).
        graph TD
          A[Data Stored] --> B[Timer Starts]
          B -->|TTL Expires| C[Auto Delete]
          C --> D[Cache Miss Next Time]
          D --> E[Reload from DB]
      

Agenda, Oct/14

  1. Midterm Mock Exam
  2. Midterm Review
  3. Q&A and Discussion
 

Midterm Exam Structure

  • Duration: 10:05 - 11:15 AM
  • 20 Questions
  • Each question: 5 points
  • Scope 1/3: HW 1-3, Lab 1-3, Mod 1-7
  • Scope 2/3: IndexedDB, MongoDB, Neo4j, Redis
  • Scope 3/3: Read code, not write
  • Closed book, no computer
  • On paper, please bring pens
  • Read each question carefully
  • If unsure, ask for clarification
  • Manage your time wisely
  • Show all work clearly
  • Check name/student id before submitting
  • Check answers before submitting

COMP 263: Midterm Review

Agenda, Oct/14

  1. Midterm Mock Exam ✓
  2. Midterm Review
  3. Q&A and Discussion
 

Module 1

Which β€œV” primarily addresses the challenge of many formats (logs, images, JSON) in one system?

  1. Veracity
  2. Variety
  3. Velocity
  4. Value

Answer: B: Variety

Module 1

Why must IndexedDB operations be awaited in JavaScript?

  1. They always run synchronously in the UI thread
  2. They require a service worker to complete
  3. They are asynchronous disk I/O behind a Promise
  4. They are limited to WebAssembly callbacks

Answer: C: Asynchronous I/O (Promise)

Module 1

What is automatically indexed in an IndexedDB object store?

  1. The primary key
  2. All string fields
  3. All numeric fields
  4. Every nested property

Answer: A: Primary key

Module 1

Which choice best explains why rigid SQL schemas struggle with fast-changing app fields?

  1. SQL cannot store text
  2. SQL disallows new tables
  3. SQL has no indexes
  4. Require coordinated migrations

Answer: D: Migrations required

Module 2

Which practice improves read performance for common UI queries in document models?

  1. Eliminating all redundancy
  2. Using only flat structures
  3. Denormalizing fields
  4. Storing dates as locale strings

Answer: C: Denormalize frequent fields

Module 2

Which identifier minimizes cross-system collisions without coordination?

  1. Auto-increment integer
  2. UUID (v4)
  3. Concatenated username+date
  4. Sequential timestamp

Answer: B: UUID v4

Module 2

What does the β€œZ” in 2025-09-04T10:15:30Z denote?

  1. UTC (Zulu) time
  2. Local browser timezone
  3. Zero-padded seconds
  4. Zigzag encoding

Answer: A: UTC

Module 2

When is embedding (not reference) generally preferred in JSON document design?

  1. When the related entity is reused by many parents
  2. When many-to-many links dominate
  3. When the child changes independently and often
  4. When data is always fetched together (tight 1:1)

Answer: D: Always fetched together

Module 3

Which metric ensures each real-world entity appears only once?

  1. Timeliness
  2. Uniqueness
  3. Accuracy
  4. Validity

Answer: B: Uniqueness

Module 3

What is the primary benefit of a UNIQUE index on a field such as email?

  1. Prevents duplicate values
  2. Forces numeric storage
  3. Replicates values across shards
  4. Compresses documents

Answer: A: Prevent duplicates

Module 3

A dataset missing required fields most directly violates which quality dimension?

  1. Consistency
  2. Veracity
  3. Integrity
  4. Completeness

Answer: D: Completeness

Module 3

Stale sensor values mainly reduce which property?

  1. Uniqueness
  2. Validity
  3. Timeliness
  4. Integrity

Answer: C: Timeliness

Module 4

In ELT, where is heavy transformation primarily performed?

  1. In the target store
  2. In a pre-processing appliance
  3. At the message broker
  4. On the client UI thread

Answer: A: In the target store

Module 4

Which workload fits streaming best?

  1. Quarterly revenue consolidation
  2. Nightly dimension table rebuild
  3. Monthly GDPR export
  4. Millisecond IoT data ingestion

Answer: D: IoT telemetry

Module 4

In MapReduce, which phase combines intermediate key groups into final values?

  1. Map
  2. Reduce
  3. Shuffle
  4. Partition

Answer: B: Reduce

Module 4

Which choice improves throughput for many small writes?

  1. Per-record transactions
  2. Deeply nested documents
  3. Batching operations
  4. Client-side polling

Answer: C: Batch operations

Module 5

Which property distinguishes a data lake from a warehouse?

  1. Schema-on-write
  2. Schema-on-read
  3. Requires star schema
  4. SQL-only ingestion

Answer: B: Schema-on-read

Module 5

Where are raw, unprocessed files usually stored before cleaning and structuring?

  1. Inside the main analytics dashboard
  2. In the initial lake for incoming data
  3. In the reporting database for end users
  4. In the API response cache

Answer: B: In the initial lake for incoming data

Module 5

Which query pattern is a strength of a graph database?

  1. Multi-hop relationship traversal
  2. Massive table scans with joins
  3. Wide-column time bucketing
  4. Fast multi-node joins

Answer: A: Multi-hop traversal

Module 5

Which component most directly decouples producers and consumers?

  1. Star schema
  2. Document index
  3. Message broker / queue
  4. Foreign key

Answer: C: Broker/Queue

Module 6

Which design is typical for an analytical warehouse model?

  1. Third normal form only
  2. Adjacency lists
  3. Highly nested JSON
  4. Star/snowflake schema

Answer: D: Star/Snowflake

Module 6

How do relational databases typically support aggregation such as SUM or AVG?

  1. By duplicating data into multiple temporary tables
  2. By relying on external scripting languages for computation
  3. By using SQL aggregate functions processed by the query engine
  4. By exporting data to a file before calculation

Answer: C: SQL aggregate functions in the query engine

Module 6

What is the main goal of moving data from a data lake into a data warehouse?

  1. To store unstructured data in raw format
  2. To organize and optimize data for analytical queries
  3. To remove historical records and reduce cost
  4. To bypass schema validation entirely

Answer: B: Organize data for analytics

Module 6

During transformation from data lake to data warehouse, which step typically occurs?

  1. Deleting metadata and lineage records
  2. Compressing data without schema enforcement
  3. Cleaning, joining, and aggregating curated datasets
  4. Moving JSON directly without structural checks

Answer: C: Clean, join, and aggregate curated data

Module 7

In cache-aside, what happens on a cache miss?

  1. Client returns 404 immediately
  2. App loads from DB, populates cache, returns result
  3. Cache fabricates a default value
  4. DB writes are skipped

Answer: B: Load DB β†’ populate cache β†’ return

Module 7

Which Redis option sets expiration when writing a value?

  1. SET ... NX
  2. SET ... PXAT only
  3. PERSIST
  4. SET key value EX <seconds>

Answer: D: SET ... EX <seconds>

Module 7

What best measures API latency difference between Mongo and Redis endpoints?

  1. Server-side timing (e.g., timeMs & X-Response-Time)
  2. Client clock only
  3. Counting HTTP headers
  4. Comparing payload sizes alone

Answer: A: Server-side timing

Module 7

Which practice balances speed and freshness for frequently changing data?

  1. Infinite TTL on all keys
  2. Delete cache on every read
  3. Short TTL with cache-aside reloads
  4. Write-through for read-only data

Answer: C: Short TTL + cache-aside

Questions?

Agriculture Dataset: Data Quality

  1. Are JSON keys consistent in naming?
  2. How to handle missing yield values?
  3. Is harvest_date in valid ISO format?
  4. How to standardize region names?
  5. How to record units accurately for yield?
  6. Which field relationships need validation?
  7. How to find duplicate farm_id?

{
  "farm_id": "F123",
  "crop": "rice",
  "region": "California",
  "yield": "72 tons",
  "soil": {
    "type": "clay",
    "ph": 9.8
  },
  "harvest_date": "10-09-25"
}
Questions?
Midterm (Oct/16)

Midterm Exam Structure

  • Duration: 10:05 - 11:15 AM
  • 20 Questions
  • Each question: 5 points
  • Scope 1/3: HW 1-3, Lab 1-3, Mod 1-7
  • Scope 2/3: IndexedDB, MongoDB, Neo4j, Redis
  • Scope 3/3: Read code, not write
  • Closed book, no computer
  • On paper, please bring pens
  • Read each question carefully
  • If unsure, ask for clarification
  • Manage your time wisely
  • Show all work clearly
  • Check name/student id before submitting
  • Check answers before submitting
Q&A and Discussion

Today, October 21, 2025

  • Midterm Review
  • Database Pipeline Architecture and Data Dashboard
  • Column-based Database - ClickHouse
ClickHouse Columnar Database Diagram
Midterm Review

Database Pipeline Architecture and Data Dashboard

  • Metric
  • KPI
  • Target
  • Benchmark
Book Cover

Metric

  • What it is: A single measurement.
  • Agriculture example: Rainfall today = 8 mm.
  • Use: Simple observation collected daily.
{
  "metric": "rainfall",
  "value": 8,
  "unit": "mm",
  "timestamp_utc": "2025-10-21T00:00:00Z"
}
Raindrops on leaves

Key Performance Indicator (KPI)

  • A group of metrics that describe performance or trend.
  • Average weekly rainfall across all fields.
  • Use: Helps farmers understand overall water supply.
{
  "kpi": "weekly_rainfall",
  "average_value": 42,
  "unit": "mm/week",
  "week_start_utc": "2025-10-20T00:00:00Z",
  "data_points": 7
}
Weekly Rainfall Chart

Target

  • A goal or threshold value for a KPI.
  • Target weekly rainfall = 40 mm for optimal crop growth.
  • Guides irrigation planning when rainfall is below target.
{
  "target": "weekly_rainfall",
  "goal_value": 40,
  "unit": "mm/week",
  "valid_from_utc": "2025-10-01T00:00:00Z",
  "valid_to_utc": "2025-12-31T23:59:59Z"
}
Rainfall KPI with Target and Threshold

Benchmark

  • Reference point by crop type, season, or region.
  • Average weekly rainfall needs: Corn = 40 mm
  • Compare crop’s rainfall KPI to historical benchmark
{
  "benchmark": "weekly_rainfall_by_crop",
  "crops": {
    "corn": 40,
    "wheat": 30,
    "soybean": 35
  },
  "unit": "mm/week",
  "reference_period": "2024",
  "source": "historical_weather_data"
}
Crop-specific Rainfall Benchmark

How They Relate

Relationship Description Example
Metric β†’ KPI Daily rainfall readings combine to calculate weekly averages. {"rainfall_daily":[8,5,6,7,10,4,2]}
KPI ↔ Benchmark Compare weekly rainfall KPI to historical averages by crop type. Corn = 40 mm, Wheat = 30 mm, Soybean = 35 mm
KPI β†’ Target If the KPI is below the 40 mm target, schedule additional irrigation. Current = 32 mm β†’ Add irrigation
Loop Metrics feed into KPI β†’ compared to benchmark β†’ adjusted toward target. Continuous weekly update and correction cycle

From KPI to Aggregate Functions

KPIs are often derived from aggregate functions applied to raw metrics. These functions summarize multiple values to reveal overall trends or conditions.

  • MIN: Lowest rainfall recorded this week.
  • MAX: Highest rainfall recorded this week.
  • SUM: Total rainfall across all days.
  • AVG: Average rainfall: the weekly KPI value.
{
  "rainfall_daily": [8, 5, 6, 7, 10, 4, 2],
  "min": 2,
  "max": 10,
  "sum": 42,
  "avg": 6.0
}
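
A small sketch showing how those aggregate values can be derived from the raw daily metrics:

// Derive MIN, MAX, SUM, AVG from raw daily rainfall metrics
const rainfallDaily = [8, 5, 6, 7, 10, 4, 2];

const sum = rainfallDaily.reduce((acc, v) => acc + v, 0);
const kpi = {
  min: Math.min(...rainfallDaily),
  max: Math.max(...rainfallDaily),
  sum,
  avg: +(sum / rainfallDaily.length).toFixed(1), // weekly KPI value
};

console.log(kpi); // { min: 2, max: 10, sum: 42, avg: 6 }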
ClickHouse Columnar Database Diagram

Today, October 23, 2025

  • Brief Quiz
  • Column-based Database - ClickHouse
  • DB Pipeline Architecture and Data Dashboard
  • Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram
Brief Quiz
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j
Column-based Database - ClickHouse
History of ClickHouse
  • 2009: Built at Yandex for web analytics
  • 2014: Scaled for distributed queries
  • 2016: Open-sourced under Apache 2.0
  • 2018: Adopted by major tech companies
  • 2021: ClickHouse Inc. established
  • 2023: Cloud service officially launched
  • GitHub: ClickHouse Repo
ClickHouse Logo
SQL: The Query Language
  • 1970: Based on Codd’s relational model
  • 1974: IBM developed SEQUEL (Structured English Query Language)
  • 1986: Adopted as ANSI SQL standard
  • 1990s: Widely used in relational databases
  • 2000s: Extended for analytics and joins
  • Today: Common across modern engines (incl. ClickHouse)
SQL Database Illustration
ClickHouse: Supported Data Types
  • Numeric: Int8–Int256, UInt8–UInt256, Float32, Float64
  • Boolean: UInt8 used for true / false
  • String: String, FixedString(N)
  • Date & Time: Date, Date32, DateTime, DateTime64
  • Array: Array(T) for lists of any type
  • Tuple: Tuple(T1, T2, …)
  • Map: Map(KeyType, ValueType)
  • Enum: Enum8 / Enum16 for labeled constants
  • UUID: 128-bit unique identifiers
  • Nested: groups of arrays forming sub-tables
  • LowCardinality: optimized string storage
  • Nullable: supports NULL for any type
CREATE TABLE example_types
(
  id UInt32,
  name String,
  score Float64,
  tags Array(String),
  metrics Map(String, Float32),
  status Enum8('ok'=1, 'fail'=2),
  created_at DateTime64(3),
  uid UUID,
  extra Nullable(String)
)
ENGINE = MergeTree() ORDER BY id;

-- ClickHouse adds Array, Tuple, Map, Enum, LowCardinality, Nested, UUID
      
ClickHouse: Columnar SELECT
  • Why: scan only needed columns
  • Why: high compression, fast I/O
  • Why: vectorized execution
SELECT user_id, ts, duration
FROM logs
WHERE event = 'page_view'
LIMIT 3;

β”Œβ”€user_id─┬────────ts────────┬─duration─┐
β”‚ 101     β”‚ 2025-10-23 10:00 β”‚ 120      β”‚
β”‚ 102     β”‚ 2025-10-23 09:59 β”‚ 95       β”‚
β”‚ 103     β”‚ 2025-10-23 09:58 β”‚ 134      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
ClickHouse: Aggregations
  • Why: sub-second on billions
  • Why: built-in aggregate functions
  • Why: no pre-rollups needed
SELECT url, count() AS views, avg(duration) AS avg_ms
FROM logs
GROUP BY url
ORDER BY views DESC
LIMIT 3;

β”Œβ”€url──────────────┬─views─┬─avg_ms─┐
β”‚ /home            β”‚ 12500 β”‚ 112.4  β”‚
β”‚ /products        β”‚  9432 β”‚ 98.3   β”‚
β”‚ /checkout        β”‚  2831 β”‚ 145.6  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
ClickHouse: ARRAY JOIN
  • Why: flatten arrays directly
  • Why: simple query-based unnest
  • Why: efficient on nested JSON
SELECT user_id, action
FROM user_activity
ARRAY JOIN actions AS action
LIMIT 3;

β”Œβ”€user_id─┬─action───┐
β”‚ 101     β”‚ click    β”‚
β”‚ 101     β”‚ scroll   β”‚
β”‚ 102     β”‚ view     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
ClickHouse: Window Functions
  • Why: moving averages and ranks
  • Why: session and cohort analysis
  • Why: SQL-standard syntax
SELECT
  user_id,
  ts,
  avg(duration) OVER (
    PARTITION BY user_id
    ORDER BY ts
    ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
  ) AS avg_last_5
FROM sessions
LIMIT 3;

β”Œβ”€user_id─┬────────ts────────┬─avg_last_5─┐
β”‚ 101     β”‚ 2025-10-23 10:00 β”‚ 98.4       β”‚
β”‚ 101     β”‚ 2025-10-23 10:05 β”‚ 102.7      β”‚
β”‚ 101     β”‚ 2025-10-23 10:10 β”‚ 95.8       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
ClickHouse: Inserts
  • Why: handle millions of rows/sec
  • Why: batch or streaming supported
  • Why: flexible input formats
INSERT INTO logs (ts, user_id, event, duration)
VALUES (now(), 101, 'view', 120),
       (now(), 102, 'click', 85);

Query OK, 2 rows inserted.
Elapsed: 0.002 sec.
      
ClickHouse: System Tables
  • Why: internal observability
  • Why: track query performance
  • Why: inspect table storage
SELECT query_kind, read_rows, memory_usage
FROM system.query_log
LIMIT 2;

β”Œβ”€query_kind─┬─read_rows─┬─memory_usage─┐
β”‚ Select     β”‚ 21000     β”‚ 15.2 MiB     β”‚
β”‚ Insert     β”‚ 2000      β”‚ 2.8 MiB      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      
The Dashboard Effect
  • Real-Time Data Monitoring (incl. Clicks)
  • Improved Decision-Making
  • Improved Team Collaboration
  • Increased Efficiency and Productivity
  • Better Performance Tracking
The Dashboard Effect Book Cover
Row vs Column Storage
Row-Based vs Column-Based Storage
Row vs Column Storage
How columnar databases store and access data
  • Data is stored column by column instead of row by row
  • Each column is stored as a contiguous block on disk
  • Efficient for scanning specific columns during queries
  • Enables better compression due to uniform data types
  • Reduces I/O by skipping irrelevant columns
  • Ideal for analytics and aggregation use cases
Columnar Storage Diagram
Row-Based Tables

Overview & Disadvantages

  • Row-Based Storage
    • Stores entire rows together on disk
    • Optimized for transactional workloads (OLTP)
    • Used in systems like MySQL and PostgreSQL
  • Disadvantages
    • Inefficient for analytical queries (OLAP)
    • Loads full rows even when only one column is needed
    • Limited compression opportunities
    • Slower for large-scale filtering and aggregation
When not to use a columnar database
When Should I Use a Columnar Database?
  • Heavy read and analytics workloads
  • Large-scale aggregation and filtering
  • Dashboards and business intelligence tools
  • Event logging or time-series data
  • Need for high compression and fast scans
When to Use Row vs Column-Based Databases
Query Type → Best Fit
Inserts and Updates → Row-Based DB
Single-row lookups → Row-Based DB
Aggregations over large datasets → Column-Based DB
Reporting and analytics → Column-Based DB
Real-time dashboards → Column-Based DB
When to Use Columnar vs Row-Oriented Databases
Use Columnar Databases When... Use Row-Oriented Databases When...
Your primary use case is analytics Your primary use case is transactions
You need low query latency for large scans You don't need low latency for analytics
You have large amounts of analytical data You're working with smaller data sets
You don't need strict ACID compliance You need strict ACID compliance
You're using event sourcing principles You perform frequent, small updates or deletes
You store and analyze time series data You access records by unique IDs
Compression and performance advantages
Compression in Column-Based Databases
  • Columns store uniform data types
  • Enables use of efficient compression algorithms
  • Examples: run-length encoding, dictionary encoding
  • Significantly reduces storage footprint
  • Faster query performance due to fewer bytes read
  • Compression often applied automatically by the engine
Compression Diagram
Aggregation Query: Time Complexity
Storage Type Aggregation Time Complexity Reason
Row-Based O(n Γ— m) Scans all rows and all columns
Column-Based O(n) Scans only the relevant column

n = number of rows, m = number of columns

Indexing strategies in column stores
Indexing in Column-Based Databases
Concept Column-Based DB Row-Based DB
Data Layout Columns stored independently Rows stored together
Indexing Need Often reduced due to scan efficiency Critical for performance
Default Access Column filters without indexes are fast Needs indexes for efficient filtering
Index Type Skip indexes, min-max indexes, sparse indexes B-tree, hash, GIN, GiST
Compression & Filtering Efficient filtering via compressed column blocks Filtering depends on explicit indexes
Example ClickHouse: Primary index is skip-based by default PostgreSQL: Uses B-tree indexes by default
Comparing PostgreSQL vs. ClickHouse
ClickHouse vs PostgreSQL: Key Differences
Feature ClickHouse PostgreSQL
Storage Model Column-based Row-based
Primary Use Case Analytics (OLAP) Transactions (OLTP)
Indexing Skip index (min-max, sparse) B-tree, hash, GIN, GiST
Write Performance High-throughput batch inserts Optimized for single-row inserts
Read Performance Fast scans on large datasets Fast lookups and joins
Compression Built-in, column-aware compression Optional extensions
SQL Support Subset of ANSI SQL (e.g., no foreign keys, limited transaction support) Full ANSI SQL with rich extensions
ACID Compliance Eventual consistency, no full ACID Full ACID compliance
Open Source Columnar Databases
  • MonetDB – monetdb.org (2004, Mozilla Public License 2.0) – Supports SQL
  • Apache Druid – druid.apache.org (2012, Apache License 2.0) – Supports SQL
  • ClickHouse – clickhouse.com (2016, Apache License 2.0) – Supports SQL
  • Apache Kudu – kudu.apache.org (2016, Apache License 2.0) – Supports SQL via Impala
  • Apache Parquet – parquet.apache.org (2013, Apache License 2.0) – Columnar storage format; SQL support depends on engine
  • DuckDB – duckdb.org (2019, MIT License) – Supports SQL
Average Queries on trips Table
Description SQL Query
Average trip distance
SELECT AVG(trip_distance) FROM trips;
Average fare amount
SELECT AVG(fare_amount) FROM trips;
Average tip by payment type
SELECT payment_type_, AVG(tip_amount) FROM trips GROUP BY payment_type_;
Average total amount by vendor
SELECT vendor_id, AVG(total_amount) FROM trips GROUP BY vendor_id;
ClickHouse Playground
  • Web-based platform to experiment with ClickHouse queries
  • No setup required; start querying instantly
  • Access to various sample datasets for testing
  • Supports SQL queries with real-time results
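
For a quick check from Node.js, ClickHouse also exposes an HTTP interface (default port 8123); a minimal sketch assuming a local server and the logs table from the earlier examples:

// Query ClickHouse over its HTTP interface (Node 18+ global fetch, local server assumed)
const query = `
  SELECT url, count() AS views, avg(duration) AS avg_ms
  FROM logs
  GROUP BY url
  ORDER BY views DESC
  LIMIT 3
  FORMAT JSONEachRow
`;

async function run() {
  const res = await fetch("http://localhost:8123/", { method: "POST", body: query });
  if (!res.ok) throw new Error(await res.text());
  const text = await res.text(); // one JSON object per line
  text.trim().split("\n").forEach((line) => console.log(JSON.parse(line)));
}

run().catch(console.error);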

Today, October 23, 2025

  • Brief Quiz ✓
  • Column-based Database - ClickHouse ✓
  • DB Pipeline Architecture and Data Dashboard
  • Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram
From Dashboard Questions to Database Pipeline
Dashboard First
  • What is the Benchmark?
  • What is the Target?
  • What is the KPI?
  • What is the Metric?
Dashboard Questions Illustration
Case Study: Agriculture Dashboard

A vineyard in California struggled with soil moisture variations, risking crop quality. The benchmark from state guidelines required maintaining 25–35% moisture.

The farm set a target of keeping soil moisture near 30% using IoT-based irrigation control.

The main KPI was average soil moisture per zone, updated hourly through a data pipeline.

Supporting metrics included pump runtime, water flow rate, and daily irrigation volume. Within weeks, dashboard insights cut water use by 15% while staying within target range.

Smart Agriculture Dashboard Illustration

Today, October 23, 2025

  • Brief Quiz ✓
  • Column-based Database - ClickHouse ✓
  • DB Pipeline Architecture and Data Dashboard ✓
  • Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram

Today, October 28, 2025

  1. Brief Review: 3 Questions on Paper
  2. Brief Review: Big Data and CAP Theorem
  3. Database Sharding and Replication
  4. MongoDB and CassandraDB
  5. Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram
Brief Review: 3 Questions on Paper
Brief Review: Big Data and CAP Theorem

The 5 Vs of Big Data

  • Veracity: Ensure data accuracy and reliability
  • Variety: Ingest and process data in multiple formats
  • Velocity: Real-time data processing capabilities
  • Value: Analytics tools to derive actionable insights
  • Volume: Data scales to petabytes (15 zeros) and beyond.
The 5 Vs of Big Data

Visualizing Data Volume

  • 1 GB ≈ 200,000 emails
  • 50 GB ≈ 5,000 photos
  • 60 GB ≈ 1 hour GPS/s
  • 100 GB ≈ 20 h video
  • 200 GB ≈ 1 h driving

Limitations of SQL Databases with High Row Volume

  • Performance Issues: Query response time degrades with increased data volume
  • Scalability Constraints: Hard limits on vertical scaling / complexities in horizontal scaling
  • Maintenance Overheads: Intensive resource needs for indexing and transaction locks
  • Cost Implications: Continuous increase in cost

How to Scale?

Distribute (not Replicate) Data

Horizontal Scaling

Adding more machines or nodes to a system to increase capacity and distribute load.

Vertical Scaling

Increasing the capacity of an existing machine by adding more resources such as CPU or RAM.

Database Scalability Diagram

Horizontal Scaling Leads to Distributed Databases

  • Data spread across multiple nodes.
  • Each node manages part of the data.
  • Enables faster and fault-tolerant systems.

Distributed Databases lead to the CAP Theorem balancing Consistency, Availability, and Partition Tolerance.

Distributed Database Diagram

CAP Theorem

  • C: Consistency: All reads see the latest write
  • A: Availability: Every request gets a response
  • P: Partition Tolerance: Works despite network splits

Under a partition, you can favor C or A, not both.

CAP Theorem triangle
Database Sharding and Replication

Database Sharding and Replication

  • Sharding: Distributing data
  • Replication: Copying data

Together, they improve scalability, availability, and resilience of large systems.

ClickHouse Columnar Database Diagram

Benefits and Drawbacks of Sharding

Benefits
  • Enables horizontal scaling across multiple servers.
  • Improves query performance through data distribution.
  • Improves system reliability and fault isolation.
Drawbacks
  • Increases complexity of data management and queries.
  • Risk of uneven shard sizes over time (data imbalance).
  • Requires careful shard key selection and monitoring.

Sharding is Complex - SQL to NoSQL

For SQL Check Temporary SW/Data Options First

Illustration of Scaling
Database → CAP focus
MongoDB/Redis → CP (Consistency + Partition Tolerance)
Cassandra → AP (Availability + Partition Tolerance)
RDBMS → CA (Consistency + Availability)

  • MongoDB/Redis: Focus on consistency and partition tolerance
  • Cassandra: Prioritizes availability and partition tolerance
  • RDBMS: Emphasizes consistency and availability for transactions

MongoDB Sharding Components

  • Shard Key: Field that defines data distribution.
  • Chunks: Data divided and assigned to shards.
  • Shards: Replica sets storing subsets of data.
  • Node: MongoDB instance (primary or secondary).
  • Query Router (mongos): Sends queries to correct shard.
  • Config Server: Stores shard and chunk metadata.
  • Balancer: Keeps data evenly distributed.
MongoDB Sharding Components Diagram

Replica Set vs Shard: Agriculture Data Example

Aspect Replica Set Shard
Purpose Provides redundancy and failover. Distributes data for scalability.
Data Stored Same crop data copied across nodes. Different crop regions stored on different shards.
Example All nodes store β€œWheat yield 2024” dataset. Shard A = Wheat (North farms), Shard B = Corn (South farms).
Goal Reliability and availability. Scalability and performance.

Role of the Primary in MongoDB

  • Writes: Only the primary accepts write operations.
  • Consistency: Holds latest data; secondaries replicate it.
  • Election: A new primary is chosen if one fails.
  • Clients: Connect mainly to the primary for reads/writes.
MongoDB Primary Role Diagram

MongoDB: Network Partition Logic

  • Partition: Nodes lose connectivity, creating isolated groups
  • Availability Priority: MongoDB favors availability (AP) to stay operational
  • Primary Election: Primary unreachable, new primary is elected in connected partition
  • Inconsistencies: Data may be temporarily inconsistent across partitions.
  • Re-Sync: Once resolved, partitions synchronize data to restore consistency.
Illustration of Scaling

Questions on MongoDB Sharding

  1. How does MongoDB handle writes if the primary node in a shard fails?
  2. What impact does a missing shard have on read operations?
  3. How would you choose an optimal shard key for a high-traffic app?
  4. How does MongoDB maintain data consistency within a shard?
  5. When would you prioritize availability over strict consistency?

Sample Answers

  1. MongoDB elects a new primary from the remaining nodes in the shard's replica set
  2. Queries for data on the missing shard will fail; unaffected shards continue serving data
  3. Choose a shard key with high cardinality and even distribution (e.g., user ID)
  4. Secondaries replicate from the primary node to keep data consistent within the shard
  5. Prioritize availability in applications where uptime is more critical than strict consistency

Today, October 30, 2025

  1. Brief Review: 2 Questions on Paper
  2. Brief Review: Big Data and CAP Theorem
  3. Concept of Eventual Consistency
  4. MongoDB and CassandraDB
  5. Homework 4, Lab 4, Project Part 1, Final Exam Date
ClickHouse Columnar Database Diagram
Brief Review: 2 Questions on Paper
Brief Review: Big Data and CAP Theorem
Concept of Eventual Consistency
MongoDB and CassandraDB

What is Cassandra?

  • Open-source, distributed NoSQL wide-column store
  • Designed for massive scale, no single point of failure
  • Prioritizes availability & partition tolerance (AP in CAP)
  • Built for Systems Engineers, less for Developers
  • github.com/apache/cassandra
Cassandra logo

Cassandra Chronology

  • 2007 – Developed at Facebook for Inbox Search
  • 2008 – Open-sourced on Google Code
  • 2009 – Entered Apache Incubator
  • 2010 – Became Apache top-level project
  • 2015+ – Adopted by Netflix, Instagram, Uber, etc.
Cassandra version history

Cluster Architecture

  • Peer-to-peer design, no master node
  • Gossip protocol for node discovery
  • Data partitioned via consistent hashing
Cassandra cluster ring

Gossip Protocol in Cassandra

  • Nodes share status info with random peers
  • No master, fully peer-to-peer
  • Spreads state quickly across the cluster
  • Shares node health, schema, and load data
  • Seed nodes help new nodes join

Keeps cluster consistent and fault-aware.

Cassandra Gossip Protocol Diagram

Data Model & CQL

  • Keyspace β†’ Table β†’ Rows β†’ Columns.
  • CQL syntax is SQL-like, but the data model is denormalized.
  • CQL Documentation
CREATE KEYSPACE farm_data
 WITH REPLICATION = {'class':'SimpleStrategy', 'replication_factor':3};

CREATE TABLE crop_yields (
  farm_id text,
  crop_type text,
  year int,
  yield double,
  PRIMARY KEY ((farm_id), crop_type, year)
);
      
Cassandra table structure

Write & Read Path

  • Coordinator: Routes client requests to replicas
  • RF (Replication Factor): number of data copies in the cluster
  • Consistency: ONE, QUORUM, or ALL

Coordinator enforces consistency across replicas

Cassandra write path

Why Cassandra Chooses Availability

  • Remains available during node or network failure.
  • Writes accepted even if some replicas are unreachable.
  • Consistency achieved later via replication and repair.

β†’ Cassandra is AP (Availability + Partition Tolerance)

Cassandra write path

Agriculture Sensor Data Example

  • IoT sensors generate time-series data
  • Scales for thousands of farms globally
  • Optimized for high write throughput
CREATE TABLE sensor_data (
  farm_id text,
  sensor_id text,
  ts timestamp,
  temp float,
  moisture float,
  PRIMARY KEY ((farm_id, sensor_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
      
Agriculture IoT sensors
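
A minimal Node.js sketch against the table above, assuming the cassandra-driver package and a local single-node cluster (the contact point and data center name are illustrative):

// Write and read sensor_data with the DataStax Node.js driver (npm install cassandra-driver)
const cassandra = require("cassandra-driver");

const client = new cassandra.Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1", // must match the cluster's DC name
  keyspace: "farm_data",
});

async function run() {
  // write one reading (QUORUM = a majority of replicas must acknowledge)
  await client.execute(
    "INSERT INTO sensor_data (farm_id, sensor_id, ts, temp, moisture) VALUES (?, ?, ?, ?, ?)",
    ["farm-7", "S-001", new Date(), 21.7, 34.8],
    { prepare: true, consistency: cassandra.types.consistencies.quorum }
  );

  // read the latest readings for that sensor (clustering order is ts DESC)
  const result = await client.execute(
    "SELECT ts, temp, moisture FROM sensor_data WHERE farm_id = ? AND sensor_id = ? LIMIT 3",
    ["farm-7", "S-001"],
    { prepare: true }
  );
  result.rows.forEach((row) => console.log(row.ts, row.temp, row.moisture));
  await client.shutdown();
}

run().catch(console.error);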

High Cardinality

  • Means many unique key values
  • Example: user_id, sensor_id
  • Spreads data evenly across nodes
  • Prevents hotspots and imbalance

Eventual Consistency

  • Replicas may differ right after a write
  • System stays available during updates
  • Data converges as replicas sync over time
  • Achieved via read repair and background sync

Partition Key

  • Determines where data is stored in the cluster
  • Same key β†’ same node and partition
  • Used for data distribution and lookup
  • Example: (farm_id)

Clustering Key

  • Defines how rows are ordered within a partition
  • Supports range queries and sorting
  • Example: (sensor_id, ts)

Example Table

CREATE TABLE sensor_data (
  farm_id text,
  sensor_id text,
  ts timestamp,
  value float,
  PRIMARY KEY ((farm_id), sensor_id, ts)
);
            

Partition Key: farm_id β†’ decides node placement
Clustering Keys: sensor_id, ts β†’ order within partition

Best Practices

  • Design tables for queries (no joins)
  • Choose partition keys carefully to avoid hotspots
  • Monitor compaction, tombstones, and repair frequency
  • Use QUORUM for balanced consistency vs availability
Compaction and tombstones

Questions?

Agriculture Data Pipeline: Choosing the Right Database

Pipeline Stage Data Type / Need Best Database Reason
Sensor Ingestion High-volume IoT readings (temperature, soil, humidity) ? Handles fast writes, distributed and fault-tolerant
Short-term Analytics Recent sensor trends (last 24h) ? Flexible schema, easy aggregation pipelines
Long-term Storage Historical yields and weather data ? Supports structured queries and analytics
Real-time Alerts Threshold-based triggers (low moisture) ? In-memory, fast pub/sub messaging
Visualization & Dashboards Aggregated farm metrics and KPIs ? Optimized for analytics and fast group-by queries

Data Storage Distribution Questions

  1. Should a data lake be distributed for large data volumes?
  2. Is distribution needed for a data warehouse's performance?
  3. Does a data mart require distribution for efficiency?

Today, November 4, 2025

  • Why migrate: scalability, flexibility, performance
  • Model shift: relational β†’ document
  • 1:1 β†’ merge into one document
  • 1:N β†’ embed child array
  • N:N β†’ use references or link collection
  • Case Study
  • Homework, Lab, Project
  • Review Distributed Databases
SQL to NoSQL Migration Diagram

Volume: Why migrate?

  • Massive farm data exceeds vertical limits
  • Horizontal sharding handles big datasets
  • Denormalized documents reduce joins
  • Better compression for repetitive data
-- SQL
SELECT f.id, f.name, h.crop, h.yield_kg
FROM farmers f
JOIN harvest_lots h ON h.farmer_id = f.id
WHERE f.id = 42;
-- MQL
db.farmers.find(
  { _id: 42 },
  { name: 1, harvestLots: 1 }
);
{
  "_id": 42,
  "name": "Alice Farms",
  "harvestLots": [
    { "crop": "Wheat", "yieldKg": 18000 },
    { "crop": "Corn",  "yieldKg": 24500 }
  ]
}

Velocity: Why migrate?

  • High-frequency sensor data overloads SQL
  • Append-only writes scale horizontally
  • Partitioned by time or field
  • Native time-series storage in NoSQL
-- SQL
INSERT INTO soil_events(sensor_id, ts, moisture, temp)
VALUES (17, NOW(), 23.4, 21.7);
-- MQL
db.soilSensors.updateOne(
  { _id: "sensor-17" },
  { $push: { readings: { ts: ISODate(), moisture: 23.4 } } },
  { upsert: true }
);
{
  "_id": "sensor-17",
  "readings": [
    { "ts": "2025-11-05T04:00:00Z", "moisture": 23.4 }
  ]
}

Velocity: Query Complexity Comparison

Relational (SQL)
  • Join operations across multiple tables
  • Nested-loop join cost can grow toward O(n × m) for tables of n and m rows
  • Each join multiplies row combinations
  • High latency under concurrent writes
  • Difficult to parallelize across nodes
NoSQL (Document / Sharded)
  • Related data stored in the same document
  • A keyed query touches only the relevant shard, not the whole dataset
  • Shards handle independent partitions
  • Linear scalability for inserts and reads
  • Natural fit for distributed sensor data

Variety: Why migrate?

  • Flexible schema for diverse crop data
  • No ALTER TABLE for new fields
  • Store structured and unstructured data
  • Adapt quickly to new sensor formats
-- SQL
ALTER TABLE crops ADD COLUMN agronomy_json JSON;
-- MQL
db.crops.insertOne({
  _id: "crop-123",
  name: "Tomato",
  agronomy: { variety: "Roma", irrigation: "drip" },
  inputs: [{ name: "NPK-10-10-10", rateKgPerHa: 120 }]
});
{
  "_id": "crop-123",
  "name": "Tomato",
  "agronomy": { "variety": "Roma", "irrigation": "drip" },
  "inputs": [{ "name": "NPK-10-10-10", "rateKgPerHa": 120 }]
}

Veracity: Why migrate?

  • Retain both raw and cleaned data
  • Preserve source and lineage info
  • Track quality and validation issues
  • Allow flexible schema-on-read
-- SQL
INSERT INTO yield_clean
SELECT * FROM yield_stage WHERE valid = TRUE;
-- MQL
db.yieldReports.insertOne({
  _id: "YR-7",
  fieldId: "F-101",
  raw: { yieldKg: "??", moisture: "N/A", source: "c_csv" },
  clean: { yieldKg: 17450, moisture: 13.2 },
  quality: { valid: false, issues: ["missing_yield"] },
  lineage: { file: "coop_2025-11-05.csv" }
});
{
  "_id": "YR-7",
  "fieldId": "F-101",
  "raw": { "yieldKg": "??", "moisture": "N/A", "source": "c_csv" },
  "clean": { "yieldKg": 17450, "moisture": 13.2 },
  "quality": { "valid": false, "issues": ["missing_yield"] },
  "lineage": { "file": "coop_2025-11-05.csv" }
}

Value: Why migrate?

  • Faster reads for farm dashboards
  • Pre-aggregate crop yields
  • Reduce multiple joins per query
  • Enable quick insights and analytics
-- SQL
SELECT * FROM farmers WHERE id=42;
SELECT SUM(yield_kg)
FROM harvest_lots WHERE farmer_id=42;
-- MQL
db.farmers.find(
  { _id: 42 },
  { name: 1, harvestLots: 1, metrics: 1 }
);
{
  "_id": 42,
  "name": "Alice Farms",
  "harvestLots": [
    { "crop": "Wheat", "yieldKg": 18000 },
    { "crop": "Corn", "yieldKg": 24500 }
  ],
  "metrics": { "totalYieldKg": 42500, "lastHarvestId": 902 }
}
Questions

Model Shift: From Relational to Document

  • Relational model organizes data into fixed tables
  • Each entity stored separately, linked by foreign keys
  • Designed for transactional integrity (ACID)
  • Requires joins to reconstruct complete context
Relational Model Diagram

Model Shift: Document-Centric Thinking

  • Documents store all related data together
  • Supports nested and flexible structures
  • Optimized for read-heavy agricultural analytics
  • Fewer joins, faster access for field-level summaries
{
  "_id": "...",
  "farmerName": "...",
  "fields": [
    {
      "fieldId": "...",
      "crop": {
        "name": "...",
        "variety": "...",
        "agronomy": {
          "irrigation": "...",
          "spacingCm": ...
        }
      },
      "sensors": [
        {
          "sensorId": "...",
          "type": "...",
          "readings": [
            { "ts": "...", "moisture": ..., "temp": ... },
            ...
          ]
        }
      ]
    }
  ]
}

Model Shift: Example Mapping

  • 1:1 β†’ merge entities into a single document
  • 1:N β†’ embed related items as arrays (e.g., harvest lots)
  • N:N β†’ use reference arrays or link collections
  • Allows mixed, dynamic attributes per document
-- 1:1
{
  "_id": "farmer-101",
  "profile": { "name": "Alice", "address": "..." }
}
-- 1:N
{
  "_id": "field-22",
  "crop": "Wheat",
  "harvestLots": [
    { "lotId": "L-1", "yieldKg": 18000 },
    { "lotId": "L-2", "yieldKg": 24500 }
  ]
}
-- N:N (using references)
{
  "_id": "coop-7",
  "name": "Central Farm Cooperative",
  "farmers": [
    { "$ref": "farmers", "$id": "farmer-101" },
    { "$ref": "farmers", "$id": "farmer-102" }
  ],
  "equipment": [
    { "$ref": "equipment", "$id": "tractor-55" },
    { "$ref": "equipment", "$id": "harvester-21" }
  ]
}

Model Shift: Query Perspective

  • SQL: reconstruct context through joins
  • NoSQL: retrieve complete document in one read
  • Better for data lakes and farm dashboards
  • Natural alignment with JSON-based APIs
Query Perspective Comparison
Questions

1:1 Relationship: Concept

  • Two entities linked one-to-one in SQL
  • Each record in one table has exactly one in the other
  • In NoSQL, we merge them into a single document
  • Reduces joins and simplifies retrieval
 

1:1 Relationship: Relational Design

  • farmer table: stores identity info
  • farmer_profile table: stores extended details
  • Linked through a foreign key: farmer.id = farmer_profile.farmer_id
  • Retrieving data requires a join
-- SQL tables
farmers
 ├─ id
 ├─ name
 └─ region

farmer_profile
 ├─ farmer_id (FK)
 ├─ address
 └─ contact_number

1:1 Relationship: Document Design

  • Merge farmer and profile into one document
  • All details stored together under one key
  • No need for joins: a single read is enough
  • Best for frequently accessed combined data
{
  "_id": "farmer-101",
  "name": "Alice Farms",
  "region": "Central Valley",
  "profile": {
    "address": "...",
    "contactNumber": "...",
    "farmSizeHectares": ...,
    "certifications": ["Organic", "Sustainable"]
  }
}

1:1 Relationship: When to Merge

  • When both entities are always queried together
  • When updates to one imply updates to the other
  • When storage size per document remains moderate
  • When data lifecycle is shared (CRUD together)
 

1:1 Relationship: When Not to Merge

  • When one entity changes much more frequently
  • When one field grows large or optional (e.g., notes)
  • When separation simplifies access control or privacy
  • When embedding violates document limit constraints
 

Discussion Questions

  • What are examples of 1:1 relationships in your datasets?
  • Would merging them simplify or complicate your queries?
  • How do you decide which side becomes the parent document?
  • What trade-offs exist between normalization and embedding?
  • Can a 1:1 merge evolve into a 1:N case over time?

1:1 Relationship: Transforming SQL to JSON

  • ETL processes related SQL rows into a single document
  • Use map to project SQL rows into objects
  • Use filter to select valid or matching pairs
  • Use reduce to merge each farmer–profile pair
  • Final output: one JSON document per farmer
// SQL source rows
const farmers = [
  { id: 1, name: "Alice Farms", region: "Central" },
  { id: 2, name: "Bob Fields", region: "South" }
];

const profiles = [
  { farmer_id: 1, address: "...", contact: "...", size_ha: 12 },
  { farmer_id: 2, address: "...", contact: "...", size_ha: 9 }
];

// Transformation
const merged = farmers
  .map(f => {
    const profile = profiles.find(p => p.farmer_id === f.id);
    return { ...f, profile };
  })
  .filter(d => d.profile)        // ensure valid links
  .reduce((acc, doc) => [...acc, doc], []); // reduce shown for completeness; it simply rebuilds the array

// Result β†’ JSON documents
console.log(JSON.stringify(merged, null, 2));

1:1 Relationship: Example Output

  • Each SQL pair becomes one JSON document
  • No joins required at query time
  • Data is ready for direct NoSQL insertion
[
  {
    "_id": 1,
    "name": "Alice Farms",
    "region": "Central",
    "profile": {
      "address": "...",
      "contact": "...",
      "sizeHa": 12
    }
  },
  {
    "_id": 2,
    "name": "Bob Fields",
    "region": "South",
    "profile": {
      "address": "...",
      "contact": "...",
      "sizeHa": 9
    }
  }
]
Questions

1:N Relationship: Concept

  • One parent record has many child records in SQL
  • Examples: farmer → harvest lots, field → sensor readings
  • In NoSQL, embed children as an array inside the parent
  • Reduces joins; one read returns complete context
 

1:N Relationship: Relational Design

  • farmers table: parent entity
  • harvest_lots table: child rows per farmer
  • Linked by foreign key: harvest_lots.farmer_id → farmers.id
  • Queries typically require joins and grouping
-- SQL tables
farmers
 ├─ id
 ├─ name
 └─ region

harvest_lots
 ├─ id
 ├─ farmer_id (FK)
 ├─ crop
 └─ yield_kg

1:N Relationship: Document Design

  • Embed child records under a descriptive key
  • Keep keys readable and domain-specific
  • Single document read serves the common access path
  • Great for dashboards and farmer-centric queries
{
  "_id": "farmer-101",
  "name": "Alice Farms",
  "region": "Central Valley",
  "harvestLots": [
    { "lotId": "L-1", "crop": "Wheat", "yieldKg": 18000 },
    { "lotId": "L-2", "crop": "Corn",  "yieldKg": 24500 }
  ]
}

1:N Relationship: When to Embed

  • Children are usually read with the parent
  • Child count per parent is bounded/moderate
  • Updates are mostly append or small edits
  • Lifecycle of parent and children is aligned
 

1:N Relationship: When Not to Embed

  • Very large arrays (e.g., millions of readings)
  • Children are updated independently at high write rates
  • Children are queried independently of the parent
  • Document size or hot-spotting becomes a bottleneck
 

Discussion Questions

  • Which parent→child pairs in your data are read together most often?
  • What is a safe upper bound for children per parent in your case?
  • Do children need independent permissions or lifecycles?
  • Where would embedding vs. referencing change performance the most?
  • Could today's 1:N become many-to-many in the future?

1:N: Transforming SQL to JSON

  • map(): project parent rows (farmers) into base docs
  • filter(): select child rows that belong to each parent
  • reduce(): aggregate children into an embedded array
  • Output: one parent document with a harvestLots array
// SQL source rows
const farmers = [
  { id: 1, name: "Alice Farms", region: "Central" },
];

const harvestLots = [
  { id: "L-1", farmer_id: 1, crop: "Wheat", yield_kg: 18000 },
  { id: "L-2", farmer_id: 1, crop: "Corn",  yield_kg: 24500 },
];

// Transformation (group lots under each farmer)
const docs = farmers.map(f => {
  const lots = harvestLots
    .filter(l => l.farmer_id === f.id)
    .reduce((acc, l) => [
      ...acc,
      { lotId: l.id, crop: l.crop, yieldKg: l.yield_kg }
    ], []);
  return { _id: f.id, name: f.name, region: f.region, harvestLots: lots };
});

1:N: Example Output

  • Parent documents embed multiple child records
  • No join required at query time
  • Ideal for farmer-centric dashboards and APIs
[
  {
    "_id": 1,
    "name": "Alice Farms",
    "region": "Central",
    "harvestLots": [
      { "lotId": "L-1", "crop": "Wheat",  "yieldKg": 18000 },
      { "lotId": "L-2", "crop": "Corn",   "yieldKg": 24500 }
    ]
  },
  {
    "_id": 2,
    "name": "Bob Fields",
    "region": "South",
    "harvestLots": [
      { "lotId": "L-3", "crop": "Tomato", "yieldKg": 9000 }
    ]
  }
]
Questions

N:N Relationship: Concept

  • Many farmers share many equipment units (and vice versa)
  • Other examples: farmers ↔ co-ops, crops ↔ suppliers
  • In NoSQL, prefer references or a linking collection
  • Avoid duplicating the same child across many parents
 

N:N Relationship: Relational Design

  • farmers and equipment are peer tables
  • farmer_equipment is the join (pivot) table
  • Queries join across three tables to resolve links
  • Scales poorly for wide, deep associations
-- SQL tables
farmers
 ├─ id
 └─ name

equipment
 ├─ id
 └─ type

farmer_equipment
 ├─ farmer_id (FK → farmers.id)
 └─ equipment_id (FK → equipment.id)

N:N Relationship: Document Design (References)

  • Store arrays of IDs (manual references) on each side
  • Resolve lookups in the application or with aggregation
  • Good when association count is moderate
  • Keeps documents small and avoids duplication
{
  "_id": "farmer-101",
  "name": "Alice Farms",
  "equipmentIds": ["tractor-55", "harvester-21"]
}
{
  "_id": "tractor-55",
  "type": "Tractor",
  "farmerIds": ["farmer-101", "farmer-204"]
}
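
A minimal sketch of resolving such reference arrays in application code (the data below is illustrative):

// Lookup table of equipment documents, keyed by _id
const equipmentById = new Map([
  ["tractor-55",   { _id: "tractor-55",   type: "Tractor" }],
  ["harvester-21", { _id: "harvester-21", type: "Harvester" }]
]);

const farmer = {
  _id: "farmer-101",
  name: "Alice Farms",
  equipmentIds: ["tractor-55", "harvester-21"]
};

// Resolve the IDs to full equipment documents in one pass
const equipment = farmer.equipmentIds
  .map(id => equipmentById.get(id))
  .filter(Boolean); // drop any IDs that no longer resolve

console.log(equipment.map(e => e.type)); // ["Tractor", "Harvester"]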

N:N Relationship: Document Design (Linking Collection)

  • Create a separate collection of link docs
  • Each link holds farmerId and equipmentId
  • Best for high-cardinality or frequently changing links
  • Enables independent indexing and TTL/policies on links
{
  "_id": "link-9001",
  "farmerId": "farmer-101",
  "equipmentId": "tractor-55",
  "since": "..."
}
{
  "_id": "link-9002",
  "farmerId": "farmer-204",
  "equipmentId": "tractor-55",
  "since": "..."
}

N:N: When to Use Which Pattern?

  • Reference arrays: moderate link counts, read with parent
  • Linking collection: high churn, volume, or analytics
  • Consider index needs and shard keys
  • Prefer small, stable parent documents
 

Discussion Questions

  • Are links mostly read with one side (farmer or equipment) or both?
  • What is the maximum expected degree per node (links per farmer/equipment)?
  • Do links change frequently or require their own lifecycle?
  • Could you start with reference arrays and later move to a linking collection?

N:N: Transforming SQL to JSON (Build Reference Arrays)

  • map(): initialize farmer docs
  • filter(): select links for each farmer
  • reduce(): aggregate equipment IDs per farmer
  • Output: farmers with equipmentIds arrays
// SQL source rows
const farmers = [
  { id: 101, name: "Alice Farms" }
];

const equipment = [
  { id: "tractor-55", type: "Tractor" }
];

const farmerEquipment = [
  { farmer_id: 101, equipment_id: "tractor-55" },
  { farmer_id: 101, equipment_id: "harvester-21" }
];

// Build reference arrays on farmer docs
const farmerDocs = farmers.map(f => {
  const eqIds = farmerEquipment
    .filter(link => link.farmer_id === f.id)
    .reduce((acc, link) => [...acc, link.equipment_id], []);
  return { _id: `farmer-${f.id}`, name: f.name, equipmentIds: eqIds };
});

N:N: Example Output (Reference Arrays)

  • Farmers reference many equipment by ID
  • Simple, compact parent documents
[
  {
    "_id": "farmer-101",
    "name": "Alice Farms",
    "equipmentIds": ["tractor-55", "harvester-21"]
  },
  {
    "_id": "farmer-204",
    "name": "Bob Fields",
    "equipmentIds": ["tractor-55"]
  }
]

N:N: Transforming SQL to JSON (Linking Collection)

  • map(): project each join row into a link doc
  • filter(): optional constraints (e.g., active links)
  • reduce(): accumulate into a link collection array
  • Output: independent farmerEquipmentLinks docs
// Build linking collection documents
const farmerEquipmentLinks = farmerEquipment
  .map((l, i) => ({
    _id: `link-${i + 1}`,
    farmerId: `farmer-${l.farmer_id}`,
    equipmentId: l.equipment_id,
    since: "..."
  }))
  .filter(link => link) // placeholder for conditions
  .reduce((acc, link) => [...acc, link], []);

N:N: Example Output (Linking Collection)

  • Links can be indexed and sharded independently
  • Supports high-cardinality and fast-changing relations
  • Clean separation of entity and association lifecycles
[
  {
    "_id": "link-1",
    "farmerId": "farmer-101",
    "equipmentId": "tractor-55",
    "since": "..."
  },
  {
    "_id": "link-2",
    "farmerId": "farmer-101",
    "equipmentId": "harvester-21",
    "since": "..."
  },
  {
    "_id": "link-3",
    "farmerId": "farmer-204",
    "equipmentId": "tractor-55",
    "since": "..."
  }
]
Questions

Best Practices - Migrating from SQL to NoSQL

  • Know your query patterns first
  • Start with one feature or domain
  • Choose embed or reference carefully
  • Keep documents small and consistent
  • Validate and index critical fields
  • Automate ETL for repeat runs
  • Monitor read and write latency
  • Track original table lineage
  • Test integrity before switching traffic
  • Iterate design as data grows

Today, November 4, 2025

  • Why migrate: scalability, flexibility, performance
  • Model shift: relational → document
  • 1:1 → merge into one document
  • 1:N → embed child array
  • N:N → use references or link collection
  • Case Study
  • Homework, Lab, Project
  • Review Distributed Databases
SQL to NoSQL Migration Diagram

Case Study: SQL to NoSQL (Agriculture)

Agriculture system tables:

farmers id name region
farmer_profiles id farmer_id phone email
harvest_lots id farmer_id crop yield_kg
buyers id name
contracts farmer_id buyer_id

Reports are slow due to joins.

  • Goal: one document per farmer
  • Embed harvests inside farmer
  • Reference buyers if needed
  • Compute total yield

Data Flow: From SQL to JSON

ETL (Extract → Transform → Load)

  • Extract from SQL tables
  • Transform to JSON documents
  • Load into NoSQL (e.g., MongoDB)
SQL → map/filter/reduce → JSON

Data Flow: From JSON to Enriched Data

ELT (Extract → Load → Transform)

  • Extract raw JSON
  • Load into data lake
  • Transform with metadata & KPIs
JSON → load → enrich(meta, totals); see the sketch below
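
A minimal sketch of the enrich step, assuming farmer documents with embedded harvests have already been loaded (the pipeline name is illustrative):

// Enrich a loaded document with metadata and a pre-computed total
function enrich(doc) {
  return {
    ...doc,
    total_yield: (doc.harvests || []).reduce((s, h) => s + h.yield_kg, 0),
    meta: {
      uuid: crypto.randomUUID(),           // traceable identifier
      lastSync: new Date().toISOString(),  // sync timestamp
      processedBy: "elt_enrich_v1"         // illustrative pipeline name
    }
  };
}

// enrich({ id: 1, name: "Alem", harvests: [{ crop: "Tomato", yield_kg: 300 }] })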

Setup

  • All examples use agriculture
  • 1x1, 1xn, nxn
const farmers=[{id:1,name:"Alem",region:"Delta"},
{ id:2,name:"Maya",region:"Central" }];

const farmer_profiles=[{farmer_id:1,phone:"555-1111"},
{ farmer_id:2,phone:"555-2222" }];

const harvests=[{id:1,farmer_id:1,crop:"Tomato",yield_kg:300},
{ id:2,farmer_id:1,crop:"Onion",yield_kg:200 },
{ id:3,farmer_id:2,crop:"Wheat",yield_kg:500 }];

const buyers=[{id:10,name:"FreshMart"},
{ id:20,name:"AgriCo" }];

const contracts=[{farmer_id:1,buyer_id:10},
{ farmer_id:1,buyer_id:20 },
{ farmer_id:2,buyer_id:10 }];

Q1

  • How to model 1x1
  • farmer with profile
  • Target is one document
farmers ⟷ farmer_profiles

A1

  • Merge fields
  • map + find
const farmerDocs_1x1=farmers.map(f=>{
  const p=farmer_profiles.find(x=>x.farmer_id===f.id);
  return {...f,profile:p};
});

Q2

  • How to model 1xn
  • farmer with harvests
  • Embed array
farmers ⟷ harvests

A2

  • map + filter
  • child array
const farmerDocs_1xn=farmers.map(f=>({
  ...f,
  harvests:harvests.filter(h=>h.farmer_id===f.id)
}));

Q3

  • Total yield per farmer
  • Which array tool
sum yield_kg

A3

  • reduce inside map
const withTotals=farmerDocs_1xn.map(f=>({
  ...f,
  total_yield:f.harvests.reduce((s,h)=>s+h.yield_kg,0)
}));

Q4

  • Keep only high yield lots
  • Which array tool
yield_kg >= 300

A4

  • filter nested
const highYield=withTotals.map(f=>({
  ...f,
  harvests:f.harvests.filter(h=>h.yield_kg>=300)
}));

Q5

  • How to model nxn
  • farmers with buyers
  • contracts link
farmers ⟷ contracts ⟷ buyers

A5

  • map + filter + includes
  • embed buyer list
const farmerWithBuyers=farmers.map(f=>{
  const buyerIds=contracts
    .filter(c=>c.farmer_id===f.id)
    .map(c=>c.buyer_id);
  const myBuyers=buyers.filter(b=>buyerIds.includes(b.id));
  return {...f,buyers:myBuyers};
});

Q6

  • Alternative for nxn
  • keep link collection
contract documents

A6

  • reference by id
{
  "farmer_id": 1,
  "buyer_id": 10
}

Q7

  • Create farmer document
  • Embed harvests
  • Add total_yield
map + filter + reduce

A7

  • one pass build
const farmerDocs=farmers.map(f=>{
  const hs=harvests.filter(h=>h.farmer_id===f.id);
  const total=hs.reduce((s,h)=>s+h.yield_kg,0);
  return {...f,harvests:hs,total_yield:total};
});

Q8

  • Top farmers by total
  • Keep total greater than 400
filter by total_yield

A8

  • filter outer array
const topFarmers=farmerDocs.filter(f=>f.total_yield>400);

Q9

  • Rules for mapping
  • 1x1, 1xn, nxn
state your rule

A9

  • 1x1 merge
  • 1xn embed array
  • nxn reference or link
use map filter reduce
Questions

Metadata Field: uuid

  • Globally unique identifier
  • Allows traceability across systems
  • Prevents duplicate records
  • Useful for merges and sync
meta.uuid = crypto.randomUUID();

Metadata Field: checksum

  • Represents data size or hash
  • Detects data changes or corruption
  • Used to verify sync integrity
  • Lightweight alternative to full hash
meta.checksum = JSON.stringify(obj).length;
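
If a real hash is preferred over the length shortcut, the Web Crypto API can produce one; a minimal sketch (the helper name hashChecksum is illustrative):

// Compute a SHA-256 hex digest of the serialized object (browser or Node 18+)
async function hashChecksum(obj) {
  const bytes = new TextEncoder().encode(JSON.stringify(obj));
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, "0"))
    .join("");
}

// meta.checksum = await hashChecksum(record);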

Metadata Field: databaseType

  • Identifies storage model
  • Helps route queries correctly
  • Useful for ETL pipelines supporting multiple DBs
  • Examples: "SQL", "NoSQL", "Cache"
meta.databaseType = "NoSQL";

Metadata Field: databaseSource

  • Specifies data origin
  • Useful for audit trails
  • Supports hybrid architectures (e.g., SQL + MongoDB)
  • Enables source-based filtering
meta.databaseSource = "MongoDB";

Metadata Field: databaseDestination

  • Specifies where the enriched data will be stored
  • Includes database and collection (or table) name
  • Useful for routing during ETL or ELT
  • Enables multi-database deployment tracking
meta.databaseDestination = {
  db: "agriculture",
  collection: "farmers_enriched"
};

Metadata Field: lastSync

  • Records last update timestamp
  • Used to detect stale data
  • Supports incremental synchronization
  • Enables time-based analytics
meta.lastSync = new Date().toISOString();

Metadata Field: sourceTimestamp

  • Records when the data was first captured
  • Distinguishes original event time and sync time
  • Essential for timeline reconstruction
meta.sourceTimestamp = "2025-11-06T08:00Z";

Metadata Field: processedBy

  • Marks the script or pipeline that processed the record
  • Supports auditability and debugging
  • Helps track different ETL versions
meta.processedBy = "etl_pipeline_v2";

Metadata Field: recordStatus

  • Indicates lifecycle stage of the record
  • Examples: "new", "validated", "archived"
  • Useful for incremental updates
meta.recordStatus = "validated";

Metadata Field: dataVersion

  • Tracks schema or transformation version
  • Prevents mismatches between old and new data formats
  • Critical for reproducibility
meta.dataVersion = "v1.3";

Metadata Field: verifiedBy

  • Specifies who or what validated the data
  • Can be a QA process, script, or user
  • Useful for accountability and quality logs
meta.verifiedBy = "qualityCheckerBot";
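
Putting the fields above together, one enriched record might carry a meta block like this (a sketch; the record and values are illustrative):

const record = { farm: "F1", rain_mm: 3.4, temp_c: 22.0 };

const meta = {
  uuid: crypto.randomUUID(),
  checksum: JSON.stringify(record).length,  // or a SHA-256 digest
  databaseType: "NoSQL",
  databaseSource: "MongoDB",
  databaseDestination: { db: "agriculture", collection: "farmers_enriched" },
  sourceTimestamp: "2025-11-06T08:00Z",
  lastSync: new Date().toISOString(),
  processedBy: "etl_pipeline_v2",
  recordStatus: "validated",
  dataVersion: "v1.3",
  verifiedBy: "qualityCheckerBot"
};

const enriched = { ...record, meta };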

Today, November 11, 2025

  1. Database Pipeline Security
  2. Similarity Search (Vector DB)
  3. Use Case: Agriculture Data
  4. Homework: Lab 5 & HW 5
FAISS Agriculture Pipeline

NoSQL Database Security: Overview

  • Protect confidentiality, integrity, and availability
  • Prevent breaches, legal exposure, and compliance penalties
  • Model threats across the pipeline (e.g., the ETL stages)
Interconnected Systems: Data paths at risk

NoSQL Database Security: Overview

Why Security Matters

  • Unauthorized access: data theft and monetization risks
  • Integrity loss: tampering, poisoning, or schema drift
  • Ops. impact: downtime, ransom, regulatory penalties
Interconnected Systems: Data paths at risk

Core Security Pillars

Encryption
(in transit, at rest)
Access Control
(authn, RBAC/DAC)
Auditing & Monitoring
(logs, alerts, anomaly detection)
Data Fragmentation
(sharding, data marts)

Encryption and Decryption

Encryption: Where It Applies

Layer Goal Example
In transit Prevent snooping TLS for drivers, HTTPS for APIs
At rest Protect disks/snapshots Database/volume encryption
Client side Minimize trust Field-level AES before insert
Data encryption across systems

IndexedDB: Minimal Field Encryption (AES-GCM)


// Generate a key
const key = await crypto.subtle.generateKey(
  { name: "AES-GCM", length: 256 },
  true,
  ["encrypt", "decrypt"]
);

// Encrypt a string field
const iv = crypto.getRandomValues(new Uint8Array(12));
const enc = await crypto.subtle.encrypt(
  { name: "AES-GCM", iv },
  key,
  new TextEncoder().encode("secret")
);

// Decrypt
const dec = await crypto.subtle.decrypt({ 
    name: "AES-GCM", iv 
}, key, enc);
new TextDecoder().decode(dec); // "secret"
            
Smartphone encryption visualization

Transport Encryption: Client ↔ Server

  • Protects data in motion between app and database
  • Prevents eavesdropping and MITM attacks
  • Use TLS 1.2+ with verified certificates
  • Apply to drivers, APIs, and cluster communication

// Example: MongoDB connection with TLS
mongodb+srv://user:pass@cluster.mongodb.net/
  ?ssl=true&retryWrites=true&w=majority
            
Encrypted connection between client and server

Data Lake Security: When to Encrypt or Decrypt

🔒 Encrypt & Store in Lake
  • Data remains secure at rest in the lake
  • Minimizes exposure during ingestion
  • Ideal for sensitive or regulated datasets
  • Query requires authorized decryption layer

// Example: encrypt before upload
const enc = encryptAES(record);
lake.store(enc);
            
📊 Decrypt & Query Before Storing
  • Allows transformations and analytics pre-lake
  • Useful when encryption breaks query performance
  • Requires trusted compute and short-term exposure
  • Re-encrypt before final lake ingestion

// Example: decrypt β†’ query β†’ store
const data = decryptAES(blob);
runQuery(data);
lake.store(reEncrypt(data));
            

Encryption Models: Public vs Private Key

πŸ” Private (Symmetric) Key Encryption
  • Same key used for encryption and decryption
  • Fast and efficient for large data volumes
  • Common in storage, backups, and data lakes
  • Challenges: secure key distribution, rotation
  • Performance: high speed, minimal overhead
  • Security: compromised key = total data loss risk

// Example: AES-GCM symmetric encryption
const enc = await crypto.subtle.encrypt({
    name:"AES-GCM", iv
}, key, data);
            
🔑 Public (Asymmetric) Key Encryption
  • Key pair: public to encrypt, private to decrypt
  • Secure exchange and identity verification
  • Used in TLS, SSH, and digital signatures
  • Challenges: slower, complex key management
  • Performance: higher CPU cost per operation
  • Security: strong transmission, less bulk data

// Example: RSA public key encryption
const enc = await crypto.subtle.encrypt({
    name:"RSA-OAEP"
}, pubKey, data);
            

Discussion: Encryption in Practice

  • How is a symmetric key shared for sensor data?
  • How does encryption affect queries in NoSQL?
  • Encrypt early or in the lake: which is better?
  • How to balance strong encryption with speed?
Data security in interconnected hospital systems

NoSQL Databases: Encryption at Rest Support

Database Encryption at Rest Notes
Cassandra ✅ Supported Transparent Data Encryption (TDE) for SSTables and commit logs
MongoDB ✅ Supported Encrypted storage engine; integrates with external KMS
Redis ⚠️ Partial Enterprise and Redis Stack support disk-level encryption
IndexedDB ⚠️ Browser dependent Relies on OS or browser sandbox; no built-in encryption
Lakehouse (e.g., Delta, Iceberg) ✅ Supported Encryption handled by storage layer (S3, Azure, GCS)
Neo4j ✅ Supported Native file and log encryption (Enterprise edition)
FAISS ❌ Not native Relies on external filesystem or disk encryption

Access Control

Access Control: Models

  • RBAC: roles map to privileges (read, write, admin)
  • DAC: owners grant access to their data at the user level
  • Implements identity and authorization rules only
  • Note: data is not encrypted; access control ≠ encryption
  • Apply least privilege and separation of duties
Access control in interconnected hospital systems

RBAC Concept (~1996)

  1. Role Assignment: users get roles before access.
  2. Role Authorization: only approved roles are valid.
  3. Permission Authorization: roles define actions.

RBAC groups permissions by role, not by user.

Core RBAC Model

IndexedDB: Access Control Reality

IndexedDB Same-Origin

The same-origin policy restricts DB access to the same protocol, host, and port.

Access Control: Device, Browser, Tab

Device scope
  • OS user account is the root boundary
  • Disk and keychain hold secrets for apps
  • Use full disk encryption and per user key stores
Browser scope
  • Profile and origin define isolation
  • Cookies with SameSite and Secure limit exposure
  • LocalStorage and IndexedDB are per origin
  • Service Worker gates network for the origin
Tab scope
  • Per tab state in sessionStorage
  • Carry role in the key for checks
  • Avoid sharing across tabs unless needed
  • Use BroadcastChannel only for safe signals
Scope Control Example
Device OS user, key store OS login, keychain, disk crypto
Browser Origin, profile Cookie flags, IDB per origin
Tab Session state sessionStorage role for API checks

Soft RBAC with API Keys in IndexedDB (tab scoped)

  • Store API key in sessionStorage so it is per tab
  • Key encodes the role (patient, nurse, physician)
  • Middleware checks role before each read or write
// set per tab
sessionStorage.setItem("API_KEY", "idb_apikey:nurse:98D1B7");

// read role
const role = (sessionStorage.getItem("API_KEY") || "")
    .split(":")[1];

// tiny ACL
const ACL = {
  patient:   ["read:own"],
  nurse:     ["read:any","write:notes"],
  physician: ["read:any","write:any"]
};
const can = (r,a) => ACL[r]?.includes(a);
      
// guarded IndexedDB write
if (!can(role, "write:notes")) 
    throw new Error("forbidden");

const db = await new Promise((res, rej) => {
  const req = indexedDB.open("clinic", 1);
  req.onupgradeneeded = () => 
    req.result.createObjectStore("notes", { 
        keyPath: "_id" 
    });
  req.onsuccess = () => res(req.result);
  req.onerror = () => rej(req.error);
});

const tx = db.transaction("notes", "readwrite");
tx.objectStore("notes").put({ _id, patientId, text });
// native IndexedDB has no tx.done; wait for the transaction to complete
await new Promise((res, rej) => {
  tx.oncomplete = () => res();
  tx.onerror = () => rej(tx.error);
});
      
// example keys per role
"idb_apikey:patient:72F3A9"
"idb_apikey:nurse:98D1B7"
"idb_apikey:physician:65A4C2"
      

Access Control: Backend Layer

  • Applies before database queries are executed
  • Backend enforces user roles, tokens, and scopes
  • Prevents direct client-to-database connections
  • Central place to log, audit, and throttle requests
  • Decouples business logic from database access rules
// Example: Express middleware
app.use((req, res, next) => {
  const token = req.headers.authorization;
  // patient, nurse, physician
  const role = verifyJWT(token).role; 
  req.role = role;
  next();
});

app.get("/records/:id", (req, res) => {
  if (req.role !== "physician") 
    return res.status(403).send("forbidden");
  db.records.findOne({ _id: req.params.id })
    .then(r => res.json(r));
});
      
// API key example
API_KEY = "backend_apikey:physician:65A4C2"
role = API_KEY.split(":")[1]; // physician
      
Backend enforcing access before database

Access decisions occur in the backend: not the database.

IP-Based Access Control: Backend and MongoDB

  • Restricts database access by client IP address
  • Acts as a network-level access control layer
  • Common in managed clusters (MongoDB Atlas, Firewalls)
  • Only trusted subnets or devices can connect
  • Note: IP filtering ≠ authentication; still use user roles
# Example: MongoDB Atlas IP access list
Project → Network Access → IP Whitelist
Add IP: 203.0.113.15 (office)
Add CIDR: 203.0.113.0/24 (VPN subnet)
      
// Self-hosted: bind and firewall
net:
  bindIp: 127.0.0.1,192.168.10.5
  port: 27017

# OS firewall
sudo ufw allow from 192.168.10.0/24 to any port 27017
      
IP-based access control for MongoDB cluster

Only approved network sources can reach the database.

RBAC with API Keys in MongoDB

  • Each API key encodes a role (patient, nurse, physician)
  • API gateway checks key → extracts role → applies rule
  • MongoDB grants permissions by role
// Example API keys
API_KEY_PATIENT   = "mongo_apikey:patient:72F3A9";
API_KEY_NURSE     = "mongo_apikey:nurse:98D1B7";
API_KEY_PHYSICIAN = "mongo_apikey:physician:65A4C2";
      
// Check and enforce
const role = req.get("x-api-key").split(":")[1];
if (role === "nurse") db.notes.updateOne({ patientId }, update);
else if (role === "patient") db.notes.find({ self: true });
      
API key roles controlling MongoDB access

Encryption vs Access Control

Encryption: Pros
  • Protects data if media is stolen
  • Works in transit and at rest
  • Limits blast radius of breaches
Access Control: Pros
  • Fine-grained permissions
  • Matches org roles and duties
  • Policy enforcement at query paths
Encryption: Cons
  • Key rotation and storage complexity
  • Performance overhead
  • Does not decide who may access
Access Control: Cons
  • Misconfig risk and privilege creep
  • Bypass risk if perimeter is weak
  • Ongoing policy maintenance

Data Fragmentation

Data Fragmentation: Concept

  • Divide large datasets into smaller, manageable fragments
  • Improve performance, scalability, and data isolation
  • Fragments can be distributed across nodes or regions
  • Supports privacy by storing sensitive data separately
// Example split
patients_core     → demographics, ID
patients_medical  → diagnoses, treatments
patients_billing  → invoices, payments
      
Data fragmentation across healthcare systems

Horizontal Fragmentation: Sharding

  • Each shard holds a subset of records (horizontal split)
  • Common shard keys: region, department, or patient group
  • Queries route to the correct shard via the router
  • MongoDB, Cassandra, and Neo4j support native sharding
// Example shard keys
hospital_east    → patients.region = "East"
hospital_west    → patients.region = "West"
hospital_central → patients.region = "Central"
      
Horizontal data sharding in hospital systems
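
For example, in MongoDB a region-based shard key can be declared roughly like this (database and collection names are illustrative):

// mongosh sketch: shard the patients collection by region
sh.enableSharding("hospital");
sh.shardCollection("hospital.patients", { region: 1 });

// Queries that include the shard key route to a single shard
db.patients.find({ region: "East" });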

Vertical Fragmentation: Data Marts

  • Each mart stores topic-specific slices of data
  • Feeds from the lake or ETL pipelines
  • Optimized for analytics or departmental access
  • Finance, clinical, and operations marts stay separate
// Example marts
mart_clinical  → patient outcomes
mart_billing   → payments, reimbursements
mart_staffing  → scheduling, hours
      
Data marts built from a unified hospital data lake

In-Class Use Case: Protecting Farm GPS Data

  • Farms use GPS sensors on tractors and equipment.
  • Coordinates reveal private field locations.
  • Data is encrypted before storage in NoSQL.
  • Only verified users can view decrypted data.
  • Enables safe analytics without exposing locations.

Question:
Design a database pipeline from source to dashboard that protects GPS data.

Quick Check

  • Which layers need encryption in your pipeline?
  • What is the least-privilege role for a read-only analytics job?
  • How will you detect and respond to schema drift affecting security?

Security Audits: Purpose

  • Not always preventive: often risk mitigating.
  • Identify weaknesses before they lead to incidents.
  • Assess system controls, access, and data handling.
  • Recommend corrective and compensating actions.
  • Focus on detection, documentation, and improvement.

Audit Goal:
- Detect vulnerabilities
- Verify compliance
- Improve readiness
            

Security Audits: Common Scope

  • Access control and authentication policies
  • Encryption practices and key management
  • Data handling (ETL, logs, backups)
  • Incident response readiness
  • Regulatory and organizational compliance
{
  "audit_area": "Access Control",
  "finding": "Too many admin users",
  "risk": "Privilege misuse",
  "recommendation": "Role-based access"
}

Example: Database Audit for IoT Farm Data

  • Check encryption between sensors and servers.
  • Review API authentication (tokens, keys).
  • Inspect database access logs for anomalies.
  • Validate least-privilege for users.
  • Report on remediation progress.

Audit Report (Excerpt)
----------------------
✔ Encryption in transit (TLS)
⚠ Weak password policy
⚠ Shared DB credentials
✔ Regular patch updates
            

Security Audits: Risk Mitigation

  • Audits reduce impact, even if they can't prevent attacks.
  • Finds misconfigurations before exploitation.
  • Improves recovery and response plans.
  • Builds a security culture and accountability.
  • Essential part of ongoing data governance.

Preventive → stops risk
Mitigating → reduces impact
Audit = Mitigating measure
            

In-Class Use Case: Auditing Farm IoT Data Security

  • Farm IoT sends weather, soil, GPS data to cloud.
  • Audit checks encryption, key storage.
  • Audit checks data access logs.
  • Identifies misuse or weak configurations.
  • Suggests mitigation, not just prevention.
  • Ensures secure analytics without disrupting service.

Question:
What security audit steps would you include to mitigate risk in a farm IoT data pipeline?

Questions?
IndexedDB
Redis
MongoDB
CassandraDB
Faiss
ClickHouse
Neo4j

Why Vector Databases?

  • SQL/NoSQL rely on exact matching of values
  • Vector DBs enable *semantic* matching based on meaning
  • AI systems: Find similar text, images, or signals
  • Recommendation, Personalization
  • Run efficiently on billions of high-dimensional vectors
Query Type Example
Keyword Search "corn yield"
Vector Search "find farms like this one"
Result Semantically similar records, not just identical text

Limitations of SQL and NoSQL Databases

  • Designed for *exact* equality or range conditions
  • No built-in understanding of semantic similarity
  • Indexes can't handle high-dimensional vectors
  • JOIN/WHERE clauses cannot rank by similarity score
  • Struggle with unstructured or contextual data
Query Limitation
SELECT * FROM farms WHERE crop='corn' Exact only
SELECT * FROM soils WHERE pH BETWEEN 6 AND 7 Numeric range only
Find farms "like" Farm X Not supported natively

Core Concept: Embeddings

  • Embeddings turn objects into numeric vectors
  • Each dimension represents a latent feature
  • Nearby vectors represent similar meaning or context
  • Enable search by *meaning* instead of keywords
  • Used across modern AI and ML applications
Data Embedding (3D)
"Wheat yield in dry soil" [0.22, 0.45, 0.10]
"Corn yield in low rainfall" [0.21, 0.44, 0.11]
"Rice yield in wet soil" [0.10, 0.30, 0.50]

How Similarity Search Works

  • Each record is stored as a vector in a multi-dimensional space
  • Query is also embedded as a vector
  • Compute distance between query and stored vectors
  • Return top-k nearest vectors (most similar meanings)
  • FAISS optimizes this with efficient indexing structures
Farm Vector Distance
F1 (Corn) [0.20, 0.40, 0.15] 0.02
F2 (Wheat) [0.25, 0.30, 0.18] 0.10
F3 (Rice) [0.15, 0.45, 0.20] 0.04
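
As a small illustration, a brute-force version of this search in JavaScript looks like the following (hypothetical vectors); FAISS replaces the O(n) loop with an index:

const farms = [
  { id: "F1", vec: [0.20, 0.40, 0.15] },
  { id: "F2", vec: [0.25, 0.30, 0.18] },
  { id: "F3", vec: [0.15, 0.45, 0.20] }
];

// Squared L2 distance between two equal-length vectors
const l2sq = (a, b) => a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);

// Return the k nearest farms to the query vector
function topK(query, k) {
  return farms
    .map(f => ({ id: f.id, dist: l2sq(query, f.vec) }))
    .sort((a, b) => a.dist - b.dist)
    .slice(0, k);
}

console.log(topK([0.21, 0.41, 0.16], 2)); // F1 first, then F3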

Why Vector Databases Are the Future

  • Bridge between AI models and data storage
  • Handle unstructured, semantic, and contextual data
  • Support hybrid filtering (metadata + similarity)
  • Enable real-time discovery of "similar" conditions
  • Transform data-driven decision making
Aspect SQL/NoSQL Vector DB
Query Type Exact Semantic
Data Type Structured Unstructured + Numeric
Example WHERE crop='corn' Farms similar to F1
Result Identical matches Similar contexts

Exercise: Similarity, Aggregation, or Document/Row?

  1. "Find farms similar to Farm A in rainfall pattern."
  2. "Calculate average crop yield per region."
  3. "Retrieve all records where crop = 'Corn'"
  4. "List farms whose soil is closest to [0.3, 0.4, 0.2]."
  5. "Group by fertilizer type and count farms."
# Likely Type Typical System Key Value Fit Graph Fit
1 Similarity Vector DB No Sometimes
2 Aggregation SQL or OLAP No Rare
3 Document or Row SQL or NoSQL Yes for id lookup No
4 Similarity Vector DB No Sometimes
5 Aggregation SQL No Rare

Vector Databases

Common Vector Databases

  • FAISS: optimized for fast, local similarity search
  • Milvus: scalable distributed vector database
  • Weaviate: semantic vector DB with GraphQL API
  • Qdrant: lightweight, production-ready
Name License GitHub / Website
FAISS Open Source github.com/facebookresearch/faiss
Milvus Open Source github.com/milvus-io/milvus
Weaviate Open Source github.com/weaviate/weaviate
Qdrant Open Source github.com/qdrant/qdrant

Chronology of FAISS

  • 2017: Released by Facebook AI Research
  • 2018: GPU acceleration and wide adoption
  • 2020: Integrated into ML and AI pipelines
  • 2022: Added hybrid and hierarchical indexes
  • 2024: Core engine for many vector databases
 

FAISS Example: Finding Similar Farms

  • Create sample data for farms (numeric features)
  • Build a FAISS index using L2 distance
  • Add all farm vectors to the index
  • Query with one farm to find its nearest neighbors
  • Full example: GitHub: /nosql/vector

import numpy as np
import faiss
# Sample farm data (rainfall, soil_pH, yield)
data = np.array([
  [0.75, 0.60, 0.82],
  [0.80, 0.62, 0.81],
  [0.25, 0.30, 0.28],
  [0.78, 0.65, 0.85]
], dtype='float32')

# Build index (L2 distance)
index = faiss.IndexFlatL2(3)
index.add(data)  # add all farm vectors

# Query: find 3 nearest farms to first one
query = np.array([[0.75, 0.60, 0.82]], dtype='float32')
D, I = index.search(query, k=3)

print(I)  # indices of similar farms → [[0 1 3]]
print(D)  # squared L2 distances → approximately [[0. 0.003 0.0043]]
      
Questions

Sources β†’ Lake (Raw)

  • Collect weather, soil, GPS data
  • Store as raw JSON in lake
  • Preserve source + timestamp
  • Question: How is GPS data protected at ingestion?

Before:
{"farm":"F1","lat":37.95,"lon":-121.29,"rain":3.4}

After:
{"farm":"F1","gps_enc":"b9a3f1...","rain":3.4}
      

Raw β†’ Curated (ETL)

  • Clean + validate JSON records
  • Convert units, drop invalid rows
  • Aggregate daily averages
  • Question: Who can modify curated data?

Before:
{"farm":"F1","rain":"3.4mm","temp":"22C"}

After:
{"farm":"F1","rain_mm":3.4,"temp_c":22.0}
      

Curated β†’ Analytics

  • Join soil, weather, yield by farm
  • Compute key features for analysis
  • Maintain metadata for traceability
  • Question: How does lineage tracking improve trust?

Before:
{"farm":"F1","rain":3.4}
{"farm":"F1","yield":4.8}

After:
{"farm":"F1","rain":3.4,"yield":4.8}
      

Analytics β†’ Vectors

  • Normalize numeric columns
  • Create farm vectors for similarity
  • Anonymize identifiers before indexing
  • Question: What risks arise if vectors leak?

Before:
{"farm":"F1","rain":3.4,"ph":6.3,"yield":4.8}

After:
{"id":"A1","vector":[0.62,0.58,0.73]}
      

Vector DB β†’ Query

  • Search for similar farms (top-k)
  • Restrict access by user region
  • Return safe, aggregated results
  • Question: How to prevent cross-farm data leaks?

Before:
{"query":"A1"}

After:
{"neighbors":[{"id":"A1","dist":0.0},
              {"id":"A2","dist":0.07}]}
      

API β†’ Secure Dashboard

  • Serve encrypted analytics via API
  • Apply role-based access (farmer/admin)
  • Display results without revealing GPS
  • Question: Which controls protect map layers?

Before:
{"farm":"F1","gps":[37.95,-121.29]}

After:
{"farm":"F1","region":"Valley","gps":"hidden"}
      

Summary: Secure Farm Pipeline

  • Source → Lake → Curated → Vector → Dashboard
  • Encrypt GPS and sensitive info
  • Apply access control at each layer
  • Monitor lineage and anomalies
  • Question: Which stage is most at risk?

{
  "pipeline":"secure_farm_data",
  "stages":["ingest","curate","vector","serve"],
  "security":["encrypt","access","audit"]
}
      

Similarity Search: Concept

  • Goal: find farms with similar conditions
  • Input: numeric features β†’ vector space
  • Metric: cosine or L2 distance
  • Index: FAISS (flat or IVF/PQ)
  • Security: restrict by tenant/region

Before:
{"farm":"F1","rain":3.4,"ph":6.3,"yield":4.8}

After:
{"id":"A1","vector":[0.62,0.58,0.73]}
      

Query Prep: Build Vector

  • Normalize features (0–1)
  • Handle missing with defaults
  • Mask identifiers before search
  • Security: log query lineage

Before:
{"farm":"F2","rain":2.1,"ph":6.7,"yield":4.2}

After:
{"id":"A2","vector":[0.39,0.71,0.61]}
      

Search: Top-k Neighbors

  • Input: query vector + k
  • Output: ids + distances
  • Lower distance = more similar
  • Security: strip PII, rate-limit

Before:
{"query":"A1","k":3}

After:
{"neighbors":[
  {"id":"A1","dist":0.00},
  {"id":"A2","dist":0.07},
  {"id":"A3","dist":0.12}
]}
      

Post-Process: Secure Results

  • Attach safe metadata (region/season)
  • Mask GPS; show region only
  • Filter by user's role/scope
  • Audit: record query + response

Before:
{"neighbors":[{"id":"A2"},{"id":"A3"}]}

After:
{"neighbors":[
  {"id":"A2","region":"Valley","gps":"hidden","dist":0.07},
  {"id":"A3","region":"Foothill","gps":"hidden","dist":0.12}
]}
      

Quality & Safety Checks

  • Outlier guard (feature ranges)
  • Drift monitor (stats over time)
  • Access policy test (ABAC/RBAC)
  • Fail-closed on policy violations

Before:
{"vector":[1.9,-0.4,7.2]}

After:
{"error":"out_of_range","action":"reject"}
      

Abstract: Vector Database Management Systems

  • Describes data as vectors for text, images, and video.
  • Used in recommendation, search, and chatbots.
  • Handles high-dimensional and sparse data.
  • Focus on storage, retrieval, and processing.
  • Summarizes key concepts and challenges.

Today, November 13, 2025

  1. Case Study: Design a Database Pipeline
  2. Homework: Lab 5 & HW 5

Today, November 18, 2025

  1. Synthetic Data: Role and Importance
  2. Review: NoSQL Database
  3. Homework: Lab 4 & HW 4
  4. Homework: Lab 5 & HW 5
  5. Project: Part 2

Synthetic Data: Role and Importance

  • Used to preserve privacy
  • Helps generate test data
  • Source link
AI generated content and synthetic data

Test Data vs Synthetic Data

  • Test Data: taken from real systems
  • Test Data: often limited
  • Synthetic Data: mirrors real patterns
  • Synthetic Data: scalable
AI generated content and synthetic data
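
A minimal sketch of generating synthetic records that mirror a simple weather pattern (field names and value ranges are illustrative):

// Generate n synthetic weather records with plausible, random values
function syntheticWeather(n) {
  const cities = ["Stockton", "Lodi", "Tracy"];
  return Array.from({ length: n }, (_, i) => ({
    id: i + 1,
    city: cities[Math.floor(Math.random() * cities.length)],
    temp: Math.round(60 + Math.random() * 45),      // °F
    humidity: Math.round(20 + Math.random() * 60),  // %
    date: `2025-08-${String(1 + (i % 28)).padStart(2, "0")}`
  }));
}

console.log(syntheticWeather(3));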

Benford's Law: Natural Number Patterns

  • Many real world numbers start with lower digits
  • The digit "1" appears as the first digit about 30% of the time
  • Auditing, fraud detection, and anomaly checks
  • When numbers span orders of magnitude
Benford's Law distribution
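
The 30% figure follows from Benford's expected first-digit probability P(d) = log10(1 + 1/d); a quick JavaScript check:

// Expected first-digit frequencies under Benford's Law
const benford = Array.from({ length: 9 }, (_, i) => {
  const d = i + 1;
  return { digit: d, probability: Number(Math.log10(1 + 1 / d).toFixed(3)) };
});

console.log(benford[0]); // { digit: 1, probability: 0.301 }  (about 30%)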
Questions
IndexedDB vs SQL Recap

IndexedDB Structure

  • Key–value storage.
  • Object store: weatherStore.
  • Each record is a full JSON object.
  • No schema; fields may vary.
  • Lookups use B-tree → O(log n).

// IndexedDB stored record
{
  id: 101,
  city: "Stockton",
  temp: 92,
  humidity: 40,
  date: "2025-08-01"
}

SQL Table Structure

  • Tabular storage.
  • Fixed schema defined with CREATE TABLE.
  • Types enforced (INT, VARCHAR, DATE).
  • Indexes improve lookup → O(log n).

-- SQL row stored in table
id | city      | temp | humidity | date
-----------------------------------------
101 | Stockton |  92  |    40    | 2025-08-01

SQL vs IndexedDB: Core Concepts

  • IndexedDB stores key–value weather records.
  • Uses object stores (e.g., "weatherStore").
  • Queries rely on JavaScript functions (map/filter/reduce).
  • Designed for browser local caching.
  • Typical full-scan operations cost O(n).
  • SQL stores weather rows in tables with fixed columns.
  • Queries use SELECT, INSERT, UPDATE, DELETE.
  • Indexes on city/date can give O(log n) lookups.
  • Runs on server-based weather databases.

CRUD in IndexedDB vs SQL

  • Create: store.add(weather).
  • Read: store.get(key) or store.getAll().
  • Update: store.put(weather).
  • Delete: store.delete(key).
  • Filtering usually costs O(n) with getAll + filter.
  • INSERT INTO weather VALUES (...).
  • SELECT * FROM weather WHERE city='Stockton'.
  • UPDATE weather SET temp = ... WHERE id = ...;
  • DELETE FROM weather WHERE id = ...;
  • With index on city/date, single-row lookup ~O(log n).

Query Style Difference


// IndexedDB Read (weather)
store.get(cityKey);   // primary key lookup (B-tree)
store.getAll();       // load all records
  • get(cityKey) uses a B-tree → O(log n).
  • Scanning all weather entries with getAll is O(n).

SELECT * FROM weather
WHERE city = 'Stockton';

SELECT * FROM weather w
JOIN regions r
  ON w.region_id = r.id;
  • Without indexes, SQL filter is O(n).
  • With index on city, SQL lookup is O(log n).

Performance Question (MCQ)

Which query is fastest for finding yesterday's temperature record?

  • A: IndexedDB getAll() then filter by date.
  • B: IndexedDB cursor scan over all records.
  • C: SQL SELECT date='yesterday' without index.
  • D: SQL SELECT date='yesterday' with date index.

SELECT * FROM weather
WHERE date = 'yesterday';

-- or, after creating an index on date
CREATE INDEX idx_weather_date ON weather(date);

SELECT * FROM weather
WHERE date = 'yesterday';

Write the Query (IndexedDB + JS)

Write an IndexedDB query:

  • Find all cities where temperature ≥ 90°F.
  • Use JavaScript map / filter / reduce.
  • Think about cost: reading all then filtering is O(n).

// Load all weather records, then filter and map
store.getAll().onsuccess = (event) => {
    const weather = event.target.result; // n records

    const hotCities =
        weather
            .filter(w => w.temp >= 90)  // O(n)
            .map(w => w.city);          // O(n)

    // Total time ~ O(n) for n records
};
MongoDB vs SQL Recap

MongoDB Document

  • JSON-like document.
  • Collection: weather.
  • Flexible fields.
  • Index lookup → O(log n).

{
  _id: 101,
  city: "Stockton",
  temp: 92,
  hum: 40,
  date: "2025-08-01"
}

SQL Row

  • Fixed table schema.
  • Defined types (INT, VARCHAR).
  • Strict column structure.
  • Index lookup → O(log n).

id | city     | temp | hum | date
-----------------------------------
101| Stockton | 92   | 40  | 2025-08-01

SQL vs MongoDB: Core Concepts

  • MongoDB uses documents.
  • No schema enforcement.
  • Nested fields allowed.
  • Aggregation pipeline for queries.
  • Full scans often O(n).
  • SQL uses tables.
  • Strict schemas with DDL.
  • Rows + typed columns.
  • Joins supported.
  • Index lookup O(log n).

CRUD in MongoDB vs SQL

  • Create: insertOne()
  • Read: find()
  • Update: updateOne()
  • Delete: deleteOne()
  • Filters rely on document fields.
  • Non-indexed queries → O(n).
  • INSERT INTO weather ...
  • SELECT * FROM weather ...
  • UPDATE weather SET ...
  • DELETE FROM weather ...
  • Relational filtering.
  • Indexed lookup → O(log n).

Find Hot Cities (MongoDB vs SQL)


db.weather.find({
  temp: { $gte: 90 }
})
  • Document filter.
  • Indexed filter → O(log n).
  • No joins; embed or reference.

SELECT city FROM weather
WHERE temp >= 90;
  • Column filter.
  • Index improves to O(log n).

Performance Question (MCQ)

Fastest way to find all cities with temp ≥ 90?

  • A: MongoDB full scan (find() no index)
  • B: MongoDB indexed temp field
  • C: SQL full table scan
  • D: SQL indexed temp column
  • A and C → full scans O(n)
  • B and D → index lookup O(log n)
  • Best: B or D (both indexed)

Write the Query (MongoDB + SQL)

Write a MongoDB query:

  • Find weather records for "Stockton".
  • Short written answer expected.
  • Hint: index lookup → O(log n).

// MongoDB MQL
db.weather.find({ city: "Stockton" })

-- SQL
SELECT * FROM weather
WHERE city = 'Stockton';
Redis vs SQL Recap

Redis Hash

  • In-memory key–value store.
  • Key pattern: weather:101.
  • Stores fields in a hash.
  • Key lookup ~O(1) average.

KEY: weather:101
FIELDS:
  city = "Stockton"
  temp = "92"
  hum  = "40"
  date = "2025-08-01"

SQL Row

  • Row in table weather.
  • Schema defined with DDL.
  • Types fixed (INT, VARCHAR, DATE).
  • Indexed lookup → O(log n).

id | city     | temp | hum | date
-----------------------------------
101| Stockton | 92   | 40  | 2025-08-01

SQL vs Redis: Core Concepts

  • Redis is key–value, in memory.
  • Data types: strings, hashes, sets, lists.
  • No joins, no SQL parser.
  • Key lookup ~O(1) average.
  • SCAN over many keys → O(n).
  • SQL is table and row based.
  • Relational queries with SELECT.
  • Supports joins and aggregates.
  • Index lookup → O(log n).
  • Full table scan → O(n).

CRUD in Redis vs SQL

  • Create: HSET weather hash.
  • Read: HGETALL / HGET.
  • Update: HSET same key.
  • Delete: DEL weather:101.
  • Single key ops ~O(1).
  • Global searches need SCAN → O(n).
  • INSERT INTO weather (...).
  • SELECT * FROM weather ...
  • UPDATE weather SET ...
  • DELETE FROM weather ...
  • Indexed search → O(log n).
  • Full scan → O(n).

Find One Weather Record


HGETALL weather:101
  • Direct key lookup for id 101.
  • Average cost ~O(1).

SELECT * FROM weather
WHERE id = 101;
  • With index on id → O(log n).
  • Without index → O(n) scan.

Performance Question (MCQ)

Fastest way to get yesterday's weather for id 101?

  • A: Redis SCAN all keys then filter by date field.
  • B: Redis HGETALL weather:101.
  • C: SQL full scan on weather table by date.
  • D: SQL indexed lookup on (id, date).
  • A and C read many rows → O(n).
  • B uses Redis key → ~O(1).
  • D uses SQL index → O(log n).
  • Best for id 101: B or D.

Write the Query (Redis + SQL)

Task:

  • Get all fields for weather id 101.
  • Write Redis command and SQL query.
  • Key lookup is O(1) in Redis, O(log n) with index in SQL.

HGETALL weather:101

SELECT * FROM weather
WHERE id = 101;
ClickHouse vs SQL Recap

Weather Aggregates: ClickHouse vs Row-Store

ClickHouse (Daily Aggregates)

  • Pre-aggregated table: weather_daily_city.
  • One row per city per day.
  • Built via materialized view.
  • Query cost vs raw data ~O(1) (fixed small set).

-- ClickHouse aggregate row
city  | date       | avg_temp | max_temp | count
-------------------------------------------------
Stock | 2025-08-01 |   87.2   |   102    | 1440

Row-Store SQL (Raw Readings)

  • Raw table: weather_raw.
  • Many rows per city per day.
  • Aggregates computed at query time.
  • Full scan over readings ~O(n).

-- Row-store raw row
city  | date       | time  | temp | hum
----------------------------------------
Stock | 2025-08-01 | 12:00 |  92  | 40

Why ClickHouse Feels O(1) for Aggregates

  • Aggregates precomputed into small tables.
  • Query reads a few rows, not millions.
  • Fixed-size result per day/city.
  • Work vs original N readings ~O(1).
  • Row-store computes aggregates on raw data.
  • Must scan many readings each query.
  • Cost grows with history size.
  • Work vs N readings ~O(n).

Aggregate Pipeline: ClickHouse vs Row-Store

  • INSERT raw readings into weather_raw.
  • Materialized view updates weather_daily_city.
  • SELECT aggregates from small table.
  • Read queries on aggregates ~O(1).
  • INSERT into single raw table.
  • No pre-aggregate table.
  • SELECT uses GROUP BY on raw data.
  • Aggregate queries ~O(n) over readings.

Daily Average Temperature per City


-- ClickHouse (pre-aggregated)
SELECT city, avg_temp
FROM weather_daily_city
WHERE date = '2025-08-01';
  • Reads a few rows from aggregate table.
  • Work vs raw N readings ~O(1).

-- Row-store (raw)
SELECT city,
       AVG(temp) AS avg_temp
FROM weather_raw
WHERE date = '2025-08-01'
GROUP BY city;
  • Scans all readings for that date.
  • Work vs N daily readings ~O(n).

Performance Question (MCQ)

Average temperature per city for 3 years of data.

  • A: Row-store SQL on weather_raw (GROUP BY).
  • B: ClickHouse on weather_raw (GROUP BY).
  • C: Row-store SQL on daily aggregate table.
  • D: ClickHouse on daily aggregate table weather_daily_city.
  • A, B: scan many raw rows → O(n).
  • C: smaller table, but still grows with days → ~O(n) in days.
  • D: fixed-size aggregates per city, optimized columnar read.
  • With bounded #cities and pre-aggregates, query cost ~O(1) vs raw N.

Write the Aggregate Query

Task:

  • Return max temperature per city for August 2025.
  • Use ClickHouse aggregate table and row-store raw table.
  • ClickHouse aggregate query ~O(1) vs raw N readings.

-- ClickHouse (aggregate table)
SELECT city,
       max_temp
FROM weather_daily_city
WHERE date >= '2025-08-01'
  AND date <= '2025-08-31';

-- Row-store (raw table)
SELECT city,
       MAX(temp) AS max_temp
FROM weather_raw
WHERE date BETWEEN '2025-08-01'
              AND '2025-08-31'
GROUP BY city;
Recap Faiss and SQL

Vector Storage: FAISS vs SQL

FAISS Index

  • Stores dense vectors in an index.
  • Example: weather embedding per city.
  • Precomputed structure for fast search.
  • Query cost vs raw scan ~O(1) (fixed probes).

# vector for a city
[0.12, -0.30, 0.75, ...]

SQL Table with Vectors

  • Table: weather_vectors.
  • One row per city.
  • Vector stored as array/JSON/BLOB.
  • Brute-force distance scan ~O(n).

city  | vec
--------------------------
Stock | [0.12, -0.30, ...]

Why FAISS Feels O(1) for Similarity Search

  FAISS:
  • Index partitions space (IVF, HNSW, etc.).
  • Search probes only a few clusters (see the IVF sketch below).
  • For fixed index params, work per query is bounded.
  • Cost vs n vectors ~O(1) (approximate).

  SQL:
  • No vector index by default.
  • Must compute distance to many rows.
  • Work grows with number of vectors.
  • Cost vs n vectors ~O(n).
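
A minimal sketch of the bounded-probe behavior, using a FAISS IVF index over random vectors; the dimension, nlist, and nprobe values are illustrative choices, not the course's fixed parameters.

# Sketch: IVF partitions the vectors into nlist clusters; each query
# scans only nprobe clusters, so per-query work stays roughly fixed as n grows.
import numpy as np
import faiss

d, n = 64, 100_000                                  # dimension and corpus size (illustrative)
xb = np.random.random((n, d)).astype("float32")     # stored vectors
xq = np.random.random((1, d)).astype("float32")     # one query vector

quantizer = faiss.IndexFlatL2(d)                    # coarse quantizer defining the clusters
index = faiss.IndexIVFFlat(quantizer, d, 256)       # nlist = 256 clusters
index.train(xb)                                     # learn the partitioning
index.add(xb)                                       # add all vectors

index.nprobe = 8                                    # fixed clusters probed per query
D, I = index.search(xq, 10)                         # approximate top-10 neighbors
print(I[0])                                         # ids of the nearest vectors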

Vector Pipeline: FAISS vs SQL Only

  FAISS + SQL:
  • Store metadata in SQL (city, region).
  • Store vectors in FAISS index.
  • Search in FAISS, then join IDs to SQL (sketched below).
  • Nearest neighbor queries ~O(1) vs n.

  SQL only:
  • Store everything in one SQL table.
  • No dedicated vector index.
  • Compute distance for many rows.
  • Nearest neighbor queries ~O(n).
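
A minimal sketch of that hybrid pipeline, assuming the FAISS insertion order doubles as the SQL id; the weather_meta table name and sample rows are made up for illustration.

# Sketch: metadata in SQL, vectors in FAISS; search FAISS first,
# then join the returned ids back to SQL for city/region details.
import sqlite3
import numpy as np
import faiss

# SQL side: metadata keyed by the same id used as the FAISS position.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather_meta (id INTEGER PRIMARY KEY, city TEXT, region TEXT)")
conn.executemany("INSERT INTO weather_meta VALUES (?, ?, ?)",
                 [(0, "Stockton", "CA"), (1, "Sacramento", "CA"), (2, "Reno", "NV")])

# FAISS side: one embedding per city, added in id order.
d = 4
vecs = np.random.random((3, d)).astype("float32")
index = faiss.IndexFlatL2(d)
index.add(vecs)

# Query: nearest 2 cities for an embedding q, then join the ids to metadata.
q = np.random.random((1, d)).astype("float32")
D, I = index.search(q, 2)
ids = [int(i) for i in I[0]]
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT id, city, region FROM weather_meta WHERE id IN ({placeholders})", ids
).fetchall()
print(rows)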

Find Top-k Similar Weather Patterns


# FAISS: query q (1 x d float32 array), get top-k
k = 10
D, I = index.search(q, k)
# I = vector ids for nearest cities
  • Index search with fixed probes.
  • Cost vs n vectors ~O(1) (approximate).

-- SQL pseudo (no vector index)
SELECT city,
       distance(vec, :q) AS d
FROM weather_vectors
ORDER BY d
LIMIT 10;
  • Distance computed for many rows.
  • Cost vs n vectors ~O(n).
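
For contrast, the brute-force scan that the SQL pseudo-query above stands for can be written directly with NumPy; the sizes are illustrative, and every query touches all n rows.

# Sketch: brute-force top-10 by L2 distance over all stored vectors (~O(n) per query).
import numpy as np

n, d = 10_000, 64                                   # illustrative sizes
vecs = np.random.random((n, d)).astype("float32")   # all stored vectors
q = np.random.random(d).astype("float32")           # query vector

dists = np.linalg.norm(vecs - q, axis=1)            # distance to every row
top10 = np.argsort(dists)[:10]                      # ORDER BY d LIMIT 10
print(top10)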

Performance Question (MCQ)

Task: Top 10 similar cities for a weather embedding.

  • A: SQL, full scan, ORDER BY distance.
  • B: SQL, prefilter by region, then full scan.
  • C: FAISS index with fixed probes.
  • D: FAISS + cached results for common queries.

  Answer:
  • A, B: distance on many rows → O(n).
  • C: bounded work per query → ~O(1) vs n.
  • D: cache makes repeated queries almost O(1) lookups.
  • Best practical answer: C (or D if cache fits use case).

Write the Vector Query

Task:

  • Return ids of 5 nearest cities for query vector q.
  • Write FAISS code and SQL-style pseudo query.
  • FAISS behaves ~O(1), SQL ~O(n).

# FAISS
k = 5
D, I = index.search(q, k)
# I = nearest vector ids

-- SQL pseudo
SELECT id
FROM weather_vectors
ORDER BY distance(vec, :q)
LIMIT 5;

Lessons Learned on Database Pipelines
  1. Merging patient data in an MPI
  2. When tests become private health data
  3. Impact of mobile-first design on performance
  4. Sharding sites by region for speed
  5. The risks of delaying your sharding
  6. Using market data for backlog priorities
  7. Encryption's impact on your ETL flow
  8. Why raw data belongs in your lake
  9. Pipeline bottlenecks increase latency
  10. Why sensor data is hard to verify
  11. Why devs change code over DB configs
  12. Why features often beat out quality
  13. Why pipelines need active monitoring

Questions
Lab 4 and Homework 4

Lab 4 – Question 1

Lab 4 Q1

Lab 4 – Question 2

Lab 4 Q2

Lab 4 – Question 3

Lab 4 Q3

Lab 4 – Question 4

Lab 4 Q4

PowerPoint Download

Use the link below to access the Lab 4 presentation:

GitHub – Lab 4 Presentation

Lab 5 and Homework 5

Project Part 2 – DB Pipeline Implementation (due Nov 25)

  • Implement Dashboard
  • Connect to Data Pipeline
  • Visualize Key Metrics
  • Team-based Collaboration
  • Course Presentation
Dashboard Overview
Final Exam Date
Questions

NoSQL, Key-Value Store, Document Store, Column Store, Graph Database, CAP Theorem, BASE Model, Eventual Consistency, ACID Properties, Strong Consistency, Data Warehouse, Sharding, Replication, Partitioning, Data Mart, Fact, Dimension, Volume, Velocity, Variety, Veracity, Value, Horizontal Scaling, Vertical Scaling, Consistency, Availability, Partition Tolerance, Vector Clock, Schema-less, Indexing, Secondary Index, Primary Key, Neo4j, Compound Key, MapReduce, Aggregation, Query Engine, Query Planner, Execution Plan, CRUD, Insert, Update, Delete, Read, Write Concern, Read Concern, Consistency Level, Consistency, onupgradeneeded, createObjectStore, keyPath, autoIncrement, transaction, objectStore.add, createIndex, Latency, Throughput, Fault Tolerance, Failover, Replica Set, Leader Election, Cluster, Data Center, Geo-Replication, Document Model, BSON, XML Store, Wide Column, Super Column, Column Family, CQL, MQL, Cypher, Graph Traversal, Property Graph, RDF, Triple Store, Relationship, Node, Edge, NoSQL, Index-Free Adjacency, Query Optimization, Materialized View, Denormalization, Data Redundancy, Write Amplification, Compaction, Commit Log, Snapshot, Backup, Restore, Hot Data, Cold Data, Data Lake, Data Warehouse, ETL, ELT, Streaming, Batch Processing, Transformation, Lambda Architecture, Pub/Sub, Message Queue, Idempotency, Conflict Resolution, Event Sourcing, CQRS, Distributed Cache, In-Memory Database, Time-Series Database, Search Index, Map, Filter, Reduce, Inverted Index, Full-Text Search, Accuracy, Completeness, Consistency, Timeliness, Validity, Uniqueness, Integrity, Usability, Atlas, Compass, Collection, find, updateOne, deleteOne, deleteMany, aggregate, $match, $group, $sort, $limit, $set, $inc, MATCH (alice:Person {name: 'Alice'})-[:FRIEND]->(friend), MATCH (bob:Person {name: 'Bob'})-[:FRIEND]->(friend), MATCH (alice:Person {name: 'Alice'}), MATCH (bob:Person {name: 'Bob'}), $jsonSchema, createCollection, createIndex, IndexedDB, UUID, ObjectId

Questions?

Thanks

Course Syllabus – Fall 2025

  • Course Code: COMP 263
  • Units: 3
  • Time: Tue/Thu 10:00–11:20 AM
  • Location: Baun Hall 214

Instructor & TA

  • Instructor: Dr. Berhe, CTC 117
    Email: sberhe@pacific.edu
  • TA: Mayuresh Sanjay More
    Email: m_more1@u.pacific.edu
    Office Hours: TBD

Course Resources

Course Description

This course explores managing large volumes of unstructured and evolving data. Emphasis on hands-on labs, projects, and real-world domains (IoT in agriculture, automotive, retail, healthcare). Students contrast modern challenges with traditional SQL concepts.

Course Goals by Role

  • Strategic: Database architecture decision-making
  • Operational: Query NoSQL databases
  • Data Architect: Evaluate technologies
  • Data Engineer: Build pipelines
  • Big Data Expert: Handle large datasets
  • Data Scientist: Analytics with unstructured data
  • Data Analyst: Define KPIs & dashboards

Grading Schema

  • Attendance: 5%
  • Homework: 15%
  • Labs: 15%
  • Project: 30%
  • Midterm Exam: 15%
  • Final Exam: 20% (cumulative)

Policies

  • Late Work: 20% penalty; extensions only if requested before deadline by email.
  • Attendance: Required and tracked; email in advance if absent.
  • Collaboration & AI: Collaboration allowed on labs; work must be your own. AI tools allowed if you understand the output.

University Policies

  • ADA Accommodations: Contact SSD (ssd@pacific.edu).
  • Honor Code: Academic honesty required; violations reported.
  • Assessment: Student work may be retained for program evaluation.
  • Support Services: Tutoring, Writing Center, Care Managers.