Dec 10, 8:00 am - 11:00 am, Baun Hall 214
Dec 12, 7:00 pm - 10:00 pm, John T Chambers Technology Center 114
Syllabus
Textbooks
"Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL
Movement, 2nd Edition"
by Eric Redmond and Jim Wilson
"Fundamentals of Data Engineering"
by Joe Reis and Matt Housley
"The Dashboard Effect: Transform Your Company"
by Jon Thompson and Brick Thompson
Office Hours
Monday - Friday from 1:00 PM to 3:00 PM, located at CTC 117
Grading Schema
Homework: 15%
Labs: 15%
Project: 40%
Midterm Exam: 15%
Final Exam: 15%
Assignments and Exams
The course includes regular homework assignments, hands-on labs, a comprehensive project,
a midterm, and a final exam
Course Assumptions
It is assumed that students have a basic understanding of databases and programming
concepts. Familiarity with JavaScript and Python development is recommended
GitHub Repository
The course code and materials will be available on GitHub. Please clone the repository at
https://github.com/SE4CPS/dms2.
Collaboration and AI Rules
Collaboration is encouraged on assignments, but all submitted work must be your own.
The use of AI tools for generating code or assignments is allowed
Research Opportunities
Students interested in research can expand their course project into a research
opportunity, particularly in the areas of NoSQL databases, big data, and data analytics.
Course Goals 1/3
Data Architect: Learn to examine database technologies for a specific application use case
Data Engineer: Gain expertise in building data pipelines using NoSQL databases
Big Data Specialist: Gain experience in handling and processing large volumes of data using NoSQL technologies
Database Administrator: Develop skills to manage, monitor, and optimize NoSQL databases
Course Goals 2/3
Data Quality Specialist: Learn to define and implement data quality requirements
Data Scientist: Learn to work with unstructured data in NoSQL databases for advanced analytics
Data Analyst: Learn to define KPIs and metrics relevant for dashboards
BI Developer: Learn to integrate NoSQL databases into business intelligence tools for better decision-making
Course Goals 3/3
Requirement: Effectively interact with domain stakeholders to examine data
requirements.
Strategic: Making architectural database and data decisions
Operational: Ability to query NoSQL databases for specific requirements
Today
What changed? Why DMS 2?
Why is data unstructured and evolving?
Why is IndexedDB the Standard NoSQL Database for the Web?
Data Management Before 2000
Limited Internet Connectivity: Isolated Systems
Manual Data Entry: Transition from Paper Tables to SQL Tables
Static Data Models: Predefined Schemas Before Data Entry
Early 2000s: Data Shift
Connectivity: Rapid adoption of internet, NFC, BLE, and Wi-Fi;
Human Data: Surge in real-time, user-generated data
Machine Data: Cheap transistors fueled machine-generated IoT data
Mobile: Mobile devices; ubiquitous data access
Connectivity and Automation 🙂
Healthcare: Medical record systems, telemedicine, and patient
monitoring
Agriculture: Precision farming, automated irrigation, and crop monitoring.
Automotive: Autonomous driving systems, connected vehicles, and predictive
maintenance
Connectivity and Automation 🤔
Connectivity and automation
led to increased dependency on complex software
systems across industries
Connectivity and Automation 😠
Software is continuously updated independently of the other components that depend on it.
Updating Relational Schemas
A consistent data model (SQL) is not realistic in complex software systems
Point of Sale System
A point of sale (POS) database should flexibly handle scanned data, adapting to changes in data structure without a schema update
Review the architectural differences between relational (SQL) and non-relational (NoSQL)
databases
Analyze the trade-offs between SQL's ACID properties and NoSQL's CAP theorem implications.
Lab 1: Agricultural Data Collection
Canvas Due Date: Tuesday, September 10
Lab 1: Agricultural Data Collection with IndexedDB
In this lab, you will develop a web application that collects and stores unstructured
agricultural data using IndexedDB. You will implement functionality to handle various data
types, including sensor readings, images, farmer notes, GPS coordinates, and timestamps
Lab 1: Submission Instructions
After completing the lab, follow these steps to submit your work:
Create a new Git branch named lab1-firstname-lastname-XXXX (e.g.,
lab1-john-doe-1234).
Text Data Sample:
--------------------
"The quick brown fox jumps over the lazy dog.
This sentence contains every letter of the alphabet."
"Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge from data."
Email Data
Store: Email
Entries:
{ id: 1, sender: "user@example.com", text: "Dear Team, Please find attached
the report. Best Regards, John", timestamp: "2024-08-28T09:00:00Z" }
{ id: 2, sender: "admin@example.com", text: "System Maintenance scheduled for
12:00 AM tonight. Expect downtime.", timestamp: "2024-08-28T12:00:00Z" }
Log Data
Log Data Sample:
--------------------
2024-08-28 12:45:23 INFO: User logged in - UserID: 12345
2024-08-28 13:02:10 WARNING: High memory usage detected
2024-08-28 14:15:45 ERROR: Unable to connect to the database
...
Data stored as free-form text, such as emails or social media posts
Data organized in a predefined format, such as tables and columns
Data generated by sensors, such as temperature readings
Which is unstructured data?
A relational database storing customer information
A JSON document representing a user profile
A collection of log files generated by a web server
Which data is stored in NoSQL DBs?
Tabular data in SQL databases
Key-value pairs or document-based data like JSON
Structured data with strict schema enforcement
Why is unstructured data complex to manage?
It requires a predefined schema before storage
It lacks a predefined structure, making it harder to organize and query
It is always stored in text files
Unstructured vs structured?
50% unstructured, 50% structured
80% unstructured, 20% structured
30% unstructured, 70% structured
Data structure updates before table structure updates ✓
Data values that cannot be structured in RDBMS tables ✓
Today
Unstructured data ✓
IndexedDB Part 2
Code Lab
Why IndexedDB?
Performance
Comprehensive API
ACID Compliance
Performance via Asynchronous API
Non-blocking operations keep the UI responsive
Allows multiple requests to run simultaneously
Prevents the browser from freezing during large data operations
let request = indexedDB.open("BookStoreDB", 1);
request.onsuccess = function(event) {
let db = event.target.result;
let transaction = db.transaction("Books", "readwrite");
let store = transaction.objectStore("Books");
store.add({ title: "JavaScript Basics", author: "John Doe" });
};
console.log("next Instruction...");
Performance via Asynchronous API
Improves performance for data-intensive applications
Enables smooth user experience without lag
Handles large datasets effectively
Fetching all books in the background while the user continues browsing
let transaction = db.transaction("Books", "readonly");
let store = transaction.objectStore("Books");
let request = store.getAll();
request.onsuccess = function(event) {
console.log(event.target.result);
};
console.log("next Instruction...");
Performance via Indexing
Fast data retrieval by creating indexes on frequently queried fields
Reduces the time complexity of search operations
Improves performance when dealing with large datasets
let request = indexedDB.open("BookStoreDB", 1);
request.onupgradeneeded = function(event) {
let db = event.target.result;
let store = db.createObjectStore("Books", { keyPath: "id", autoIncrement: true });
store.createIndex("authorIndex", "author", { unique: false });
};
Performance via Indexing
Quickly retrieve all books by a specific author
Efficiently search and filter data
Minimizes the overhead of scanning the entire database
let transaction = db.transaction("Books", "readonly");
let store = transaction.objectStore("Books");
let index = store.index("authorIndex");
let request = index.getAll("John Doe");
request.onsuccess = function(event) {
console.log(event.target.result);
};
Performance via Parallel Reads
let transaction = db.transaction("Books", "readonly");
let store = transaction.objectStore("Books");
let request1 = store.get(1);
let request2 = store.get(2);
request1.onsuccess = function(event) {
console.log("Book 1:", event.target.result);
};
request2.onsuccess = function(event) {
console.log("Book 2:", event.target.result);
};
Performance via No Schema Verification
Faster data storage and retrieval as no schema validation is required
Less overhead, especially for large datasets or frequently updated data
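As a minimal sketch (assuming an open database handle db and an object store named SensorData with keyPath "id", both assumptions), objects with different shapes can be written to the same store because nothing is validated at write time:
let tx = db.transaction("SensorData", "readwrite");
let store = tx.objectStore("SensorData");
store.add({ id: 1, type: "temperature", value: 21.5 });
store.add({ id: 2, type: "photo", fileName: "field.jpg", note: "north plot" }); // different fields, still accepted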
Why IndexedDB?
Performance ✓
Comprehensive API
ACID Compliance
Comprehensive API (1/4)
SQL Query
IndexedDB API
CREATE TABLE Books (id INT, title TEXT, author TEXT);
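A rough IndexedDB counterpart (a sketch; the database and store names are assumptions): instead of declaring columns, the object store is created in the upgrade callback, and rows are plain JavaScript objects:
let request = indexedDB.open("BookStoreDB", 1);
request.onupgradeneeded = function(event) {
  let db = event.target.result;
  // No column definitions; any object shape can be stored later
  db.createObjectStore("Books", { keyPath: "id" });
};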
ACID Compliance
Atomicity
Supported. Transactions are atomic, meaning all operations within a transaction are completed successfully, or none are applied. If an error occurs, the transaction can be aborted
Consistency
Partially. While IndexedDB does not enforce a schema, it maintains consistency
within the data by ensuring that transactions either fully succeed or fail,
preventing partial updates
Isolation
Partial. Transactions are isolated to an extent, meaning data changes in a
transaction are not visible to other transactions until the transaction is complete.
However, there is no strict locking mechanism as in traditional RDBMS
Durability
Supported. Once a transaction is committed, the data is guaranteed to be saved, even
if the system crashes immediately afterward. Data is stored persistently in the
browser
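As a sketch of how these guarantees surface in the API (store and field names assumed): a transaction can be aborted, discarding its writes, or allowed to complete, after which oncomplete signals the committed state:
let tx = db.transaction("Books", "readwrite");
let store = tx.objectStore("Books");
store.add({ id: 42, title: "Draft Entry" });
tx.abort(); // roll back: the add above is discarded and never becomes visible
tx.onabort = function() { console.log("Transaction aborted, no changes applied"); };
tx.oncomplete = function() { console.log("Transaction committed"); };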
Please review the syllabus, homework, lab, and reading list
Data Modeling and Engineering
Today
Data Modeling
Data Engineering
Questions Assignment 1
Glossary and Terminology
Which of the following statements is true about IndexedDB?
A) IndexedDB is a NoSQL database built into the browser, allowing for storage of significant
amounts of structured data
B) IndexedDB does not support transactions, making it unsuitable for complex operations.
C) IndexedDB only supports string data types and cannot store objects
D) IndexedDB can only be used in server-side applications and not in the browser
Which code snippet creates an object store in IndexedDB?
A)
var request = indexedDB.open('MyDatabase', 1);
request.onupgradeneeded = function(event) { // when created or updated
var db = event.target.result;
db.createObjectStore('MyObjectStore', { keyPath: 'id' });
};
B)
var db = new IndexedDB('MyDatabase');
db.createStore('MyObjectStore', 'id');
What happens if a database is created twice in IndexedDB with the same
name but different version numbers?
A) The database is re-created and any existing data is
overwritten
B) The database is ignored and no changes are made
C) The database is upgraded, and the
onupgradeneeded event is triggered.
D) The database is deleted and recreated from scratch.
What happens if a database is created twice in IndexedDB with the same
name and the same version number?
A) The database is re-created and any existing data is
overwritten
B) The database is ignored and no changes are made
C) The database is upgraded, and the
onupgradeneeded event is triggered.
D) The database is deleted and recreated from scratch.
Data Modeling: SQL vs. NoSQL
SQL: Structured schemas with tables, rows, and
columns
NoSQL: Flexible schema with key-value pairs for
quick lookups
Data Modeling: SQL vs. NoSQL
SQL: Models relationships between
entities
NoSQL: Models relationships within a
single entity
SQL Example: Crops Data
-- Create table for crops and farms
CREATE TABLE Crops (
CropID INT PRIMARY KEY,
CropName VARCHAR(50),
PlantingDate DATE,
HarvestDate DATE,
FarmID INT
);
-- Insert crop data
INSERT INTO Crops VALUES
(1, 'Wheat', '2024-03-15', '2024-08-10', 1),
(2, 'Corn', '2024-04-01', '2024-09-15', 2);
SQL databases use a structured query language (SQL) for complex queries, enabling
powerful filtering, joining, and aggregation capabilities
Process:
SQL Query → Returns Structured Data
NoSQL Querying
NoSQL databases use flexible key-based lookups and simple filtering. Queries typically return JSON values and may require additional processing with JavaScript
Process:
NoSQL Query → Returns JSON Data
JavaScript Query → Additional Processing
Result → Processed Data Output
⚠️ Warning: The schema of the JSON data must be known for accurate processing and aggregation. Since NoSQL schemas are flexible and not enforced, ensure proper schema knowledge for consistent data handling.
SQL vs. NoSQL Data Structures
SQL Query & Aggregation
-- Sample SQL Query
SELECT CropName, PlantingDate, HarvestDate
FROM Crops
WHERE FarmID = 1;
-- Average Harvest Duration Calculation
SELECT AVG(DATEDIFF(DAY, PlantingDate, HarvestDate)) AS AvgHarvestDays
FROM Crops
WHERE FarmID = 1;
Structured query language (SQL) allows direct querying and aggregation, such as
calculating averages
NoSQL Query & Aggregation
// JavaScript to query IndexedDB and calculate average harvest days
let db; // Assume `db` is an open IndexedDB instance
// Query to get data from IndexedDB
let transaction = db.transaction(['Crops'], 'readonly');
let objectStore = transaction.objectStore('Crops');
let request = objectStore.getAll();
request.onsuccess = function(event) {
let crops = event.target.result;
// Example of processing data to calculate average harvest days
let totalHarvestDays = crops.reduce((sum, crop) => {
let plantingDate = new Date(crop.PlantingDate);
let harvestDate = new Date(crop.HarvestDate);
let days = (harvestDate - plantingDate) / (1000 * 60 * 60 * 24);
return sum + days;
}, 0);
let avgHarvestDays = totalHarvestDays / crops.length;
console.log('Average Harvest Days:', avgHarvestDays);
};
NoSQL querying returns JSON data, which can be processed with JavaScript for
aggregations
such as average calculations
Flexible data structure using nested key-value pairs. The schema can vary between documents and might not be uniform
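For illustration only (field names beyond the SQL example are assumptions), two crop documents stored side by side can carry different nested structures:
let crop1 = {
  CropID: 1,
  CropName: "Wheat",
  PlantingDate: "2024-03-15",
  HarvestDate: "2024-08-10",
  farm: { FarmID: 1, name: "North Field" }          // nested farm object
};
let crop2 = {
  CropID: 2,
  CropName: "Corn",
  PlantingDate: "2024-04-01",
  sensors: [{ type: "soilMoisture", value: 0.31 }]  // extra array, no HarvestDate yet
};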
⚠️ Important: To ensure accurate processing and aggregation, it's crucial to know the schema of the JSON data. In NoSQL, since the schema is not enforced, you must understand and manage it at the application level to maintain consistency and correctness in data handling.
Evaluating NoSQL Data Modeling
Pros
Flexible Schema: Allows for varying structures and rapid schema changes
Nested Data: Can model complex hierarchical relationships within a single document
Efficient Reads: Reduces the need for joins by storing related data together
Scalability: Easily scales horizontally to handle large volumes of data
Cons
Complex Queries: Nested structures can make querying and data manipulation more complex
Data Duplication: May lead to data duplication if not managed properly
Performance Issues: Large and deeply nested documents can impact performance
Inconsistent Structure: Variability in document structure can lead to data inconsistency
Case Study I - Precision Farming Database
Requirements Overview
Sensor Data: Real-time, high volume. Data from IoT sensors monitoring soil
moisture, temperature, and humidity, requiring efficient processing and storage
Weather Data: Data from APIs. External weather data including rainfall,
temperature, and wind speed, crucial for predictive analysis and decision-making
Drone Imagery: High-resolution images & videos. Large files from drones
used for crop health monitoring, demanding robust storage solutions
Farm Management Data: Information on crop types,
planting schedules, and irrigation plans, essential for managing daily operations
Which type of database (SQL or NoSQL/IndexedDB) would you recommend for each requirement, and
why?
Today
Data Modeling ✓
Data Engineering
Questions Assignment 1
Glossary and Terminology
What is Data Engineering?
Data Collection and Ingestion
Data Transformation and Cleaning
Data Storage and Management
Data Pipeline Development
Data Security and Governance
Data Versioning
Data Maintenance
Data Archiving
Integration of Software Engineering with Data Engineering
How can principles of software engineering be integrated with data engineering practices to
improve the reliability, scalability, and maintainability of data-driven systems in agriculture?
UUID (Universally Unique Identifier): a 128-bit identifier. UUIDs are designed to be globally unique across space and time, with a very low probability of duplication
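In current browsers (and recent Node.js), a UUID can be generated with the built-in Web Crypto API:
const uuid = crypto.randomUUID(); // RFC 4122 version 4, a 36-character string like "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx"
console.log(uuid);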
Book Inventory: Cataloging information including
ISBN, title, author, genre, and availability status, crucial for managing book lending and
returns
Member Records: Detailed information on library members,
such as name, contact details, membership type, and borrowing history, requiring secure and
organized storage
Transaction Logs: High volume. Records of book checkouts, returns, and
fines, which need to be tracked efficiently over time for auditing purposes
Digital Media: Storage of e-books, audiobooks, and
digital magazines, requiring flexible storage solutions to handle various formats and sizes.
Which type of database (SQL or NoSQL/IndexedDB) would you recommend for each requirement, and
why?
Today
Data Modeling Part 2 ✓
Data Engineering Part 2
Assignment Review
Roster Verification
When to verify the data model?
Data Model Verification
Always verify the data model before reading or writing to:
Ensure consistency
Maintain data integrity
Prevent errors
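A minimal verification sketch (assuming an incoming object reading, an open object store store, and the field ranges used in the asserts shown below):
function isValidReading(r) {
  return typeof r.sensorId === "string" &&
         typeof r.temperature === "number" && r.temperature >= 0 && r.temperature <= 50 &&
         typeof r.humidity === "number" && r.humidity >= 0 && r.humidity <= 100;
}
if (isValidReading(reading)) {
  store.add(reading);           // write only after the model check passes
} else {
  console.error("Invalid reading, not stored:", reading);
}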
What is transformation?
Map - Filter - Reduce
Library: Map Example
Map: Convert the list of books to show only titles
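A sketch of map, filter, and reduce on a small book list (the sample data is made up):
const books = [
  { title: "JavaScript Basics", author: "John Doe", pages: 250 },
  { title: "NoSQL in Practice", author: "Jane Roe", pages: 480 }
];
// Map: convert the list of books to show only titles
const titles = books.map(b => b.title);
// Filter: keep only books longer than 300 pages
const longBooks = books.filter(b => b.pages > 300);
// Reduce: total page count across all books
const totalPages = books.reduce((sum, b) => sum + b.pages, 0);
console.log(titles, longBooks, totalPages);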
// Assert
console.assert(result.id === sensorId,
'Test failed: Sensor ID does not match.');
console.assert(result.temperature >= 0 && result.temperature <= 50,
'Test failed: Temperature is out of the range (0-50°C).');
console.assert(result.humidity >= 0 && result.humidity <= 100,
'Test failed: Humidity is out of the range (0-100%).');
console.assert(result.soilMoisture >= 0 && result.soilMoisture <= 100,
'Test failed: Soil moisture is out of the range (0-100%).');
Today
Data Modeling Part 2 ✓
Data Engineering Part 2 ✓
Assignment Review ✓
Roster Verification
Reminder
Please review the syllabus, homework, lab, and reading list
Data Quality and Standards
Today
Why Data Quality?
Definition of Data Quality
Implementation of Data Quality
Introduction MongoDB
Project Part 1, Homework 2, Lab 2
Definition of Data Quality
Integrity: Data maintains accuracy and consistency
Validity: Data adheres to established formats and rules
Consistency: Data is consistent across datasets
Accuracy: Data correctly reflects real-world conditions
Completeness: All necessary data is included; essential for analysis
Timeliness: Data is available when needed; current conditions
Uniqueness: No unnecessary duplicates; each entry is distinct
Definition of Data Quality
Healthcare: Compliance with HIPAA for data privacy, accuracy in
patient records
Automotive: Compliance with ISO/TS 16949 for quality and safety
standards
Manufacturing: Conformity with ISO 9001 for quality management
systems
Retail: Compliance with PCI DSS for payment data security,
accuracy in inventory data
Implementing Data Integrity
Data Tampering
Data tampering involves unauthorized alterations or manipulations of data, potentially compromising
its integrity during transmission or storage.
db.flowerCollection.find({
createdAt: { $gte: new Date("2023-01-01"), $lt: new Date("2023-12-31") }
});
What else do you recommend?
This week
Practice NoSQL Queries ✓
Why Data? ✓
Ingestion (Extract) ✓
Processing (Transform) → Thursday
Loading (Load) → Thursday
Project Part 1: Motivation
Simulate working with large data sets to explore performance optimization in NoSQL databases.
Techniques include read-only flags, indexing, and dedicated object stores
Using a "Todo List" domain, this project improves data query performance and teamwork skills.
Alternate domains may be used, provided they maintain the same 'todo' object structure
Project Part 1: Implementation
Create "TodoList" IndexedDB store with 100,000 objects; display on the browser
Assign 'completed' status to 1,000 objects, others to 'in progress'
Time and display the reading of all 'completed' status objects
Apply a read-only flag, remeasure read time for 'completed' tasks
Index the 'status' field, time read operations for 'completed' tasks
Establish "TodoListCompleted" store, move all completed tasks, and time reads
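One way to time the 'completed' reads (a sketch that assumes the TodoList store and a status index as described in the steps above):
let t0 = performance.now();
let tx = db.transaction("TodoList", "readonly");
let index = tx.objectStore("TodoList").index("status");  // index on the 'status' field
let request = index.getAll("completed");
request.onsuccess = function(event) {
  let t1 = performance.now();
  console.log(event.target.result.length, "completed todos read in", (t1 - t0).toFixed(1), "ms");
};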
What is a key feature of unstructured data in NoSQL databases?
Data can be stored without a fixed schema
Which of the following is an example of unstructured data?
A JSON document containing varied fields for different products
Which NoSQL database type is most commonly used for storing unstructured data?
Document database
Why are NoSQL databases often preferred for storing unstructured data?
They allow flexible, schema-less data storage
What does the 'C' in the CAP theorem stand for?
Consistency
Question
Explanation
Please explain in a brief paragraph, according to the CAP theorem, which two
properties can be guaranteed simultaneously in a distributed database?
In the context of the CAP theorem, only two out of the three properties —
Consistency, Availability, and Partition Tolerance — can be fully achieved at the
same time. Typically, systems need to choose between Consistency and Availability
when Partition Tolerance is a necessity
Please explain in a brief paragraph, in the context of the CAP theorem, what does
Partition Tolerance refer to?
Partition Tolerance in the CAP theorem means the system continues to operate despite
an arbitrary number of messages being dropped or delayed by the network between
nodes. Essentially, the system can sustain any amount of network failure that
doesn't result in a failure of the entire network
Please explain in a brief paragraph, which property is often sacrificed in a
distributed system to achieve Availability and Partition Tolerance?
Consistency is often sacrificed in a distributed system to achieve Availability and
Partition Tolerance, leading to eventual consistency rather than guaranteed
consistency at all times
Please explain in a brief paragraph, what does the 'A' in ACID stand for?
The 'A' in ACID stands for Atomicity. This property ensures that all parts of a
transaction are treated as a single, indivisible unit, and either all of its
operations are completed without error, or none of them are applied
Please explain in a brief paragraph, which ACID property ensures that once a
transaction is committed, it cannot be undone?
The ACID property that ensures once a transaction is committed it cannot be undone
is Durability. This means that once a transaction has been committed, it will remain
so, even in the event of power loss, crashes, or errors
// Unit tests
console.assert(typeof gpsCoordinates === "number" && !isNaN(gpsCoordinates), "GPS coordinates must be a valid number");
console.assert(Array.isArray(sensorReadings) && sensorReadings.length > 0, "Sensor readings must be a non-empty array");
console.assert(typeof cropPhoto === "string" && cropPhoto.trim() !== "", "Crop photo path must be a non-empty string");
console.assert(typeof farmerNote === "string" && farmerNote.trim() !== "", "Farmer note must be a non-empty string");
console.assert(timestamp instanceof Date && !isNaN(timestamp.valueOf()), "Timestamp must be a valid Date object");
// Unit tests https://jestjs.io/
test('GPS coordinates should be a valid number', () => {
const gpsCoordinates = getGPS(); // Function to retrieve GPS coordinates
expect(typeof gpsCoordinates).toBe('number');
expect(gpsCoordinates).not.toBeNaN();
});
Lessons for Lab 1
Auto-increment adds objects upon each page reload
Automatically add UUIDs to new entries
Automatically add created, updated timestamps to new entries
Testing is effective only with scripts embedded in the HTML page
GPS data includes latitude, longitude, and altitude
Add good default values for empty fields (e.g., "not available")
Methods for displaying multiple objects vary across browsers
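A sketch (field names assumed) that applies several of these lessons before inserting a record: a UUID key, created/updated timestamps, and a default for empty fields:
function withDefaults(entry) {
  const now = new Date().toISOString();
  return {
    farmerNote: "not available",     // default value for an empty field
    ...entry,                        // caller-supplied fields override the default above
    uuid: crypto.randomUUID(),       // stable key instead of relying on auto-increment
    createdAt: now,
    updatedAt: now
  };
}
// GPS data with latitude, longitude, and altitude
store.add(withDefaults({ gps: { latitude: 37.98, longitude: -121.31, altitude: 4 } }));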
2015: Neo4j launched its cloud offering, Neo4j Aura
2018: Neo4j 3.5: “Graph Data Science.”
2020: Neo4j 4.0: multi-database, scalability
2021: Neo4j 4.1: data visualization and performance
Table < Tree < Graph
Why Graph Data Model?
Social Networks (e.g., Facebook, LinkedIn)
Supply Chain Management
Recommendation Engines (e.g., Netflix, Amazon)
Biological Networks (e.g., Protein Interaction)
Road and Transport Networks
Telecommunication Networks
Knowledge Graphs
Energy Grids (Power Distribution Networks)
What is the shortest path between two nodes?
Who are the neighbors of a specific node?
Does the graph contain any cycles?
What are the strongly connected components?
Which nodes are within N-hops from a given node?
Which nodes have the highest centrality?
Are there any nodes that match a certain pattern?
What are the communities or clusters in the network?
What are all the possible paths between two nodes?
Roads Within a 3-Road Radius
-- Find all roads within three roads (edges) distance from a start node
SELECT n1.id AS start_node, n2.id AS second_node, n3.id AS third_node
FROM nodes n1
JOIN edges e1 ON n1.id = e1.start_node_id
JOIN nodes n2 ON e1.end_node_id = n2.id
JOIN edges e2 ON n2.id = e2.start_node_id
JOIN nodes n3 ON e2.end_node_id = n3.id
JOIN edges e3 ON n3.id = e3.start_node_id
WHERE n1.name = 'StartIntersection';
Joins: The Multiplication of Tables
SQL Joins multiply tables
😔
...resulting in an explosion of rows before filtering!
Example:
SELECT *
FROM table1
JOIN table2 ON table1.id = table2.id;
Document-Based DBs: Avoid Modeling Relationships
Document databases like MongoDB prioritize flexibility and performance for storing unstructured
data
❌ Avoid trying to replicate relational DB modeling
Feature | Graph | Tree | Table
Structure | ? | Hierarchical structure with no cycles | Flat structure with rows and columns
Node Relationships | ? | Each node (except the root) has exactly one parent | Rows can relate through foreign keys
Root Node | ? | Has a single root node from which all nodes descend | No root; all rows are equal
Edges | ? | Typically unweighted | No direct edges; relationships managed via keys
Use Cases | ? | Hierarchical data representation (e.g., file systems) | Transaction management, data retrieval, reporting
Traversal | ? | Typically traversed in a top-down manner | Data retrieved through SQL queries
Memory Usage | ? | Generally more memory-efficient due to structure | Efficient storage using normalization
Feature | Graph | Tree | Table
Structure | Can have cycles and multiple paths between nodes | Hierarchical structure with no cycles | Flat structure with rows and columns
Node Relationships | Nodes can connect in various ways (multiple parents) | Each node (except the root) has exactly one parent | Rows can relate through foreign keys
Root Node | No designated root node | Has a single root node from which all nodes descend | No root; all rows are equal
Edges | Can be weighted or unweighted | Typically unweighted | No direct edges; relationships managed via keys
Use Cases | Social networks, recommendation systems, etc. | Hierarchical data representation (e.g., file systems) | Transaction management, data retrieval, reporting
Traversal | Various traversal methods (DFS, BFS) | Typically traversed in a top-down manner | Data retrieved through SQL queries
Memory Usage | Can require more memory due to flexibility | Generally more memory-efficient due to structure | Efficient storage using normalization
Operation | Graph | Tree | SQL Table
Create | O(1) (adding a node and edge) | O(1) (adding a node) | O(1) (inserting a row)
Read | O(V + E) (traversing edges and vertices) | O(h) (h is the height of the tree) | O(log n) (for indexed read, O(n) for non-indexed)
Update | O(V + E) (find the node and update it) | O(h) (find the node and update it) | O(log n) (for indexed updates, O(n) for non-indexed)
Delete | O(V + E) (find and remove a node and its edges) | O(h) (find and remove a node) | O(log n) (for indexed deletes, O(n) for non-indexed)
Read Cases
Binary Search Tree (Balanced): O(log n)
Binary Search Tree (Unbalanced): O(n)
N-ary Tree (Balanced): O(log n) depending on structure
N-ary Tree (Unbalanced): O(n)
Neo4j Query Language: Cypher
CREATE (f:Flower {name: 'Rose', color: 'Red', type: 'Perennial'})
RETURN f
MATCH (f:Flower {name: 'Rose'})
RETURN f
MATCH (f:Flower {name: 'Rose'})
SET f.color = 'Pink'
RETURN f
MATCH (f:Flower {name: 'Rose'})
DELETE f
MATCH (f:Flower {name: 'Rose'})
RETURN f.color
MATCH (f:Flower)
WHERE f.color = 'Red'
RETURN f.name, f.type
Supplier Relationships: Complex network of suppliers, distributors, and
manufacturers, requiring real-time tracking of goods flow
Shipment Tracking: Monitoring shipments across different regions, requiring
visibility of delays and bottlenecks
Product Lifecycle: Tracking the product journey from raw materials to
finished goods, with a need to identify critical path disruptions
Inventory Optimization: Real-time data on stock levels across warehouses,
aiming to minimize excess inventory while avoiding stockouts
Which database type do you recommend?
Why recommend a Graph Database for managing this supply chain network? It excels at tracking complex, interdependent relationships between suppliers, products, and shipments, providing real-time insights and uncovering critical issues such as shipment delays or bottlenecks.
Operation | SQL | IndexedDB | MongoDB | Neo4j
Create | INSERT INTO table_name (columns) VALUES (values) | objectStore.add(data) | db.collection.insertOne(data) | CREATE (n:Label {properties})
Read | SELECT * FROM table_name WHERE condition | objectStore.get(key) or objectStore.openCursor() | db.collection.find(query) | MATCH (n:Label) WHERE condition RETURN n
Update | UPDATE table_name SET column = value WHERE condition | objectStore.put(data) | db.collection.updateOne(query, update) | MATCH (n:Label) WHERE condition SET n.property = value
Delete | DELETE FROM table_name WHERE condition | objectStore.delete(key) | db.collection.deleteOne(query) | MATCH (n:Label) WHERE condition DELETE n
Operation | SQL | IndexedDB | MongoDB | Neo4j
Create Relationship | INSERT INTO relation_table (fk1, fk2) VALUES (id1, id2) | IndexedDB doesn't support relationships natively | Embed related document or use references (manual handling) | CREATE (a)-[:RELATION]->(b)
Read Relationship | SELECT * FROM relation_table WHERE fk1 = id1 | Manually retrieve related objects | db.collection.aggregate() with $lookup | MATCH (a)-[:RELATION]->(b) RETURN b
Update Relationship | UPDATE relation_table SET fk2 = new_id WHERE fk1 = id1 | Manually update linked objects | Manually update references or embedded docs | MATCH (a)-[r:RELATION]->(b) SET r.property = value
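As a sketch of the manual handling IndexedDB requires (store and field names assumed), a reference is stored as a plain id field and resolved with a second lookup:
let tx = db.transaction(["Books", "Authors"], "readonly");
let books = tx.objectStore("Books");
let authors = tx.objectStore("Authors");
books.get(1).onsuccess = function(event) {
  let book = event.target.result;              // e.g. { id: 1, title: "...", authorId: 7 }
  authors.get(book.authorId).onsuccess = function(e) {
    console.log(book.title, "written by", e.target.result.name);
  };
};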
// Create relationships between Flower Nodes
MATCH (r:Flower {name: 'Rose'}), (t:Flower {name: 'Tulip'})
CREATE (r)-[:NEIGHBOR]->(t)
MATCH (s:Flower {name: 'Sunflower'}), (d:Flower {name: 'Daisy'})
CREATE (s)-[:NEIGHBOR]->(d)
MATCH (l:Flower {name: 'Lily'}), (r:Flower {name: 'Rose'})
CREATE (l)-[:NEIGHBOR]->(r)
// Find all neighbors of a specific flower
MATCH (f:Flower {name: 'Rose'})-[:NEIGHBOR]->(neighbor)
RETURN neighbor.name
// Find all flowers connected by "NEIGHBOR" relationships
MATCH (f:Flower)-[:NEIGHBOR]->(neighbor)
RETURN f.name, neighbor.name
// Count how many neighbors each flower has
MATCH (f:Flower)-[:NEIGHBOR]->(neighbor)
RETURN f.name, COUNT(neighbor) AS neighbor_count
Database Architecture MongoDB (document)
MongoDB Architecture
Efficient Query Optimization: Structures like indexing, partitioning, and
sharding enhance query performance
Scalable Data Distribution: Supports horizontal and vertical scaling to
manage data growth and high transaction volumes
Data Consistency and Transactions: Implements ACID compliance or BASE
principles based on system requirements
Fault Tolerance and Recovery: Utilizes replication and backup strategies to
ensure high availability and data protection
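For example, creating an index in MongoDB is a single call (collection and field names assumed here), and explain() shows whether the planner uses it:
// mongosh sketch: index the 'color' field to speed up lookups
db.flowers.createIndex({ color: 1 })
// Inspect the query plan; with the index in place this avoids a full collection scan
db.flowers.find({ color: "Red" }).explain("executionStats")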
Agriculture: How will sensor data (e.g., soil moisture, weather) be
collected and integrated?
Retail: How will customer purchase data be tracked and analyzed to optimize
inventory?
Healthcare: What kind of patient data (e.g., medical history, vital signs)
needs to be securely stored and accessed?
Logistics: How will shipment and fleet tracking data be collected and used
for real-time decision-making?
Education: How will student performance data be collected and analyzed to
improve learning outcomes?
Database Architecture IndexedDB (key-value)
Question on Database Architecture?
Database Dashboard
Today
NoSQL Query Practice
Dashboard Effect
Delta vs. Full Data Load
Dashboard Prototype
How to Create a Database in IndexedDB?
A. indexedDB.open('FlowerDB', 1);
B. createIndexedDB('FlowerDB', version: 1);
C. openDatabase('FlowerDB');
D. new IndexedDB('FlowerDB', 1);
Answer
A. indexedDB.open('FlowerDB', 1);
How to Create an Object Store in IndexedDB?
A. db.createObjectStore('flowers', { keyPath: 'id' });
B. db.createStore('flowers');
C. createObjectStore('flowers', keyPath: 'id');
D. indexedDB.createObjectStore('flowers', { keyPath: 'id' });
Answer
A. db.createObjectStore('flowers', { keyPath: 'id' });
How to Add (Create) a Flower in the 'flowers' Object Store in IndexedDB?
A. store.add({ id: 1, name: 'Rose', color: 'Red' })
B. store.put({ id: 1, name: 'Rose', color: 'Red' })
C. store.insert({ id: 1, name: 'Rose', color: 'Red' })
D. store.save({ id: 1, name: 'Rose', color: 'Red' })
Answer
A. store.add({ id: 1, name: 'Rose', color: 'Red' })
How to Read a Flower from the 'flowers' Object Store in IndexedDB?
A. store.find(1)
B. store.get(1)
C. store.read(1)
D. store.fetch(1)
Answer
B. store.get(1)
How to Update a Flower in the 'flowers' Object Store in IndexedDB?
A. store.update({ id: 1, name: 'Rose', color: 'Pink' })
B. store.put({ id: 1, name: 'Rose', color: 'Pink' })
C. store.modify({ id: 1, name: 'Rose', color: 'Pink' })
D. store.change({ id: 1, name: 'Rose', color: 'Pink' })
Answer
B. store.put({ id: 1, name: 'Rose', color: 'Pink' })
How to Delete a Flower from the 'flowers' Object Store in IndexedDB?
A. store.remove(1)
B. store.delete(1)
C. store.destroy(1)
D. store.erase(1)
Answer
B. store.delete(1)
How to Add (Create) a Document in the 'flowers' Collection in MongoDB?
A. db.flowers.insertOne({ _id: 1, name: 'Rose', color: 'Red' })
B. db.flowers.add({ _id: 1, name: 'Rose', color: 'Red' })
C. db.flowers.put({ _id: 1, name: 'Rose', color: 'Red' })
D. db.flowers.create({ _id: 1, name: 'Rose', color: 'Red' })
Answer
A. db.flowers.insertOne({ _id: 1, name: 'Rose', color: 'Red' })
How to Read a Document from the 'flowers' Collection in MongoDB?
A. db.flowers.find({ _id: 1 })
B. db.flowers.get({ _id: 1 })
C. db.flowers.read({ _id: 1 })
D. db.flowers.fetch({ _id: 1 })
Answer
A. db.flowers.find({ _id: 1 })
How to Update a Document in the 'flowers' Collection in MongoDB?
Efficiency: Streamlines processes via centralized access to
crucial data
Continuous Improvement: Identifies trends and areas for improvement through
data tracking
What makes a good Dashboard?
Defining KPIs, Metrics, and Terms
KPI: A measurable value that shows progress toward a business goal
Metric: A specific measurement that supports KPIs
Benchmark: A reference point for comparing performance
Target: A set goal for a KPI or metric
Flower Store Database
KPI: Monthly revenue growth from flower sales.
Metric: Number of roses sold per day
Benchmark: Average monthly sales compared to other local flower stores.
Target: Increase online orders by 15% this quarter
Case Study - Coffee Store Database
Requirements Overview
Sales Data: Real-time sales data including customer orders, payments, and
discounts,
requiring constant updates and monitoring to track revenue and popular products
Inventory Data: Track stock levels of coffee beans, cups, milk, and other
supplies, ensuring that critical supplies are always available and no overstock occurs
Customer Feedback: Collect ratings and reviews from customers, including
any complaints, through online surveys and feedback forms
Employee Performance: Monitor barista performance, including order handling
speed and customer service quality, to optimize staff efficiency
What data (kpi, metric, benchmark, target) would you load into the dashboard for real-time
monitoring? What type of database (SQL
or NoSQL) would you recommend to store each type of data, and why?
Today
NoSQL Query Practice ✓
Dashboard Effect ✓
Delta vs. Full Data Load
Dashboard Prototype
Delta Load
Definition: Loads only changed (new or updated) data
Efficiency: Faster, using less system and network resources
Use Case: Large datasets with frequent updates
Full Data Load
Definition: Reloads the entire dataset, regardless of changes
Efficiency: Slower, higher resource usage
Use Case: Small datasets or full refreshes needed
Examples of Delta and Full Data Loads
GitHub: Uses delta load to track and update only changed files in
repositories
Google Drive: Syncs only changed or new files in delta load mode
Amazon RDS: Full load for data backups, delta load for incremental backups
Case Study - Coffee Store Database
Requirements Overview
Sales Data: Real-time sales data including customer orders, payments, and
discounts,
requiring constant updates and monitoring to track revenue and popular products
Inventory Data: Track stock levels of coffee beans, cups, milk, and other
supplies, ensuring that critical supplies are always available and no overstock occurs
Customer Feedback: Collect ratings and reviews from customers, including
any complaints, through online surveys and feedback forms
Employee Performance: Monitor barista performance, including order handling
speed and customer service quality, to optimize staff efficiency
For which requirement do you recommend a full or delta load and why?
Delta Load: Deletions & Timestamps
Deletions: Track deleted records to ensure they are removed in the target
system
Timestamps: Use "last updated" timestamps to identify changes after the
last load
Soft Deletes: Consider marking records as inactive rather than removing
them entirely
Requirements: Ensure each record has a unique ID and an updated timestamp
for tracking
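A delta-load sketch with IndexedDB (store, index, and field names assumed): an index on the 'updatedAt' timestamp lets the dashboard read only records changed since the previous load:
let lastLoad = "2024-08-28T00:00:00Z";          // timestamp recorded at the previous load
let tx = db.transaction("Orders", "readonly");
let byUpdated = tx.objectStore("Orders").index("updatedAt");
// Fetch only records whose updatedAt is strictly greater than lastLoad
let request = byUpdated.getAll(IDBKeyRange.lowerBound(lastLoad, true));
request.onsuccess = function(event) {
  console.log("Delta load:", event.target.result.length, "changed records");
};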
Today
Flower Graph Inc
Data Engineer
git clone --no-checkout https://github.com/SE4CPS/dms2.git
cd dms2
git sparse-checkout init --cone
git sparse-checkout set neo4j
git checkout main
cd neo4j
npm install
node app.js
How to Create a Node in Neo4j?
A. CREATE (:Flower {id: 1, name: 'Rose', color: 'Red'})
B. INSERT (:Flower {id: 1, name: 'Rose', color: 'Red'})
C. ADD (:Flower {id: 1, name: 'Rose', color: 'Red'})
D. PUT (:Flower {id: 1, name: 'Rose', color: 'Red'})
Answer
A. CREATE (:Flower {id: 1, name: 'Rose', color: 'Red'})
How to Read a Node from Neo4j?
A. MATCH (f:Flower {id: 1}) RETURN f
B. GET (f:Flower {id: 1}) RETURN f
C. FETCH (f:Flower {id: 1}) RETURN f
D. FIND (f:Flower {id: 1}) RETURN f
Answer
A. MATCH (f:Flower {id: 1}) RETURN f
How to Update a Node in Neo4j?
A. MATCH (f:Flower {id: 1}) SET f.color = 'Pink'
B. MODIFY (f:Flower {id: 1}) SET f.color = 'Pink'
C. UPDATE (f:Flower {id: 1}) SET f.color = 'Pink'
D. CHANGE (f:Flower {id: 1}) SET f.color = 'Pink'
Answer
A. MATCH (f:Flower {id: 1}) SET f.color = 'Pink'
How to Delete a Node in Neo4j?
A. MATCH (f:Flower {id: 1}) DELETE f
B. MATCH (f:Flower {id: 1}) REMOVE f
C. MATCH (f:Flower {id: 1}) ERASE f
D. MATCH (f:Flower {id: 1}) DESTROY f
Answer
A. MATCH (f:Flower {id: 1}) DELETE f
How to Create a Relationship Between Nodes in Neo4j?
A. MATCH (f:Flower {id: 1}), (g:Garden {id: 1}) CREATE (f)-[:PLANTED_IN]->(g)
B. MATCH (f:Flower {id: 1}), (g:Garden {id: 1}) ADD (f)-[:PLANTED_IN]->(g)
C. MATCH (f:Flower {id: 1}), (g:Garden {id: 1}) CONNECT (f)-[:PLANTED_IN]->(g)
D. MATCH (f:Flower {id: 1}), (g:Garden {id: 1}) LINK (f)-[:PLANTED_IN]->(g)
Answer
A. MATCH (f:Flower {id: 1}), (g:Garden {id: 1}) CREATE
(f)-[:PLANTED_IN]->(g)
How to Read a Relationship in Neo4j?
A. MATCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) RETURN r
B. GET (f:Flower)-[r:PLANTED_IN]->(g:Garden) RETURN r
C. FIND (f:Flower)-[r:PLANTED_IN]->(g:Garden) RETURN r
D. FETCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) RETURN r
Answer
A. MATCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) RETURN r
How to Update a Relationship in Neo4j?
A. MATCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) SET r.date = '2024-01-01'
B. MODIFY (f:Flower)-[r:PLANTED_IN]->(g:Garden) SET r.date = '2024-01-01'
C. UPDATE (f:Flower)-[r:PLANTED_IN]->(g:Garden) SET r.date = '2024-01-01'
D. CHANGE (f:Flower)-[r:PLANTED_IN]->(g:Garden) SET r.date = '2024-01-01'
Answer
A. MATCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) SET r.date = '2024-01-01'
How to Delete a Relationship in Neo4j?
A. MATCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) DELETE r
B. MATCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) REMOVE r
C. MATCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) DESTROY r
D. MATCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) ERASE r
Answer
A. MATCH (f:Flower)-[r:PLANTED_IN]->(g:Garden) DELETE r
Data Product Owner
What is our minimum viable product (MVP)?
Data Product Customer
What is your request?
Please review Project 1, Homework 2/3, Lab 2/3, and the reading list
Reading List - Module 6 Database Dashboard in Canvas
Database and Data Security 🔒
Today
Introduction Security
Security by Encryption
Security by Access Control
Introduction Security
Protect sensitive information from unauthorized access
Prevent data breaches and legal consequences
Ensure compliance with data protection laws
Maintain data accuracy and integrity
Introduction Security
Security by Encryption
Security by Access Control
Introduction Security
Scenario
Access Control
Encryption
Both
Protecting sensitive files
X
Securing API endpoints from unauthorized access
X
Encrypting data during transmission (e.g., over HTTPS)
X
Database access based on user roles
X
Securing cloud storage (e.g., S3 or Redis)
X
Encrypting personal data (e.g., credit card info)
X
Ensuring secure access to an encrypted database
X
Security by Encryption
Encrypting Data for IndexedDB
// Step 1: Generate a key (for both encryption and decryption)
async function generateKey() {
return await crypto.subtle.generateKey(
{ name: "AES-GCM", length: 256 }, // AES-GCM encryption algorithm
true, // Key can be exported if needed (true = exportable)
["encrypt", "decrypt"] // The key will be used for both encrypt and decrypt
);
}
// Step 2: Encrypt the data
async function encryptData(key, data) {
const encoder = new TextEncoder();
const encodedData = encoder.encode(data); // Convert data to bytes
const iv = crypto.getRandomValues(new Uint8Array(12)); // Create a random initialization vector (IV)
// Encrypt the data using the key and IV
const encrypted = await crypto.subtle.encrypt(
{ name: "AES-GCM", iv: iv },
key,
encodedData
);
return { encrypted, iv }; // Return the encrypted data and IV
}
// Step 3: Decrypt the data
async function decryptData(key, encryptedData, iv) {
const decrypted = await crypto.subtle.decrypt(
{ name: "AES-GCM", iv: iv }, // Use the same IV as used in encryption
key, // Use the same key
encryptedData
);
const decoder = new TextDecoder();
return decoder.decode(decrypted); // Convert the decrypted bytes back to string
}
// Example usage:
(async () => {
const key = await generateKey(); // Generate the key
const data = "Sensitive Information"; // Data to encrypt
// Encrypt the data
const { encrypted, iv } = await encryptData(key, data);
console.log("Encrypted Data:", new Uint8Array(encrypted));
// Decrypt the data
const decryptedData = await decryptData(key, encrypted, iv);
console.log("Decrypted Data:", decryptedData); // Should print the original data
})();
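To tie this to IndexedDB (store and field names are assumptions), the ciphertext and its IV can be persisted together and decrypted after reading; the CryptoKey itself should be kept out of the database:
// Sketch: persist { id, iv, data } in an object store named "Secrets"
async function saveEncrypted(db, key, id, plaintext) {
  const { encrypted, iv } = await encryptData(key, plaintext);
  db.transaction("Secrets", "readwrite")
    .objectStore("Secrets")
    .put({ id, iv, data: encrypted });   // ArrayBuffers and typed arrays survive structured cloning
}
async function readDecrypted(db, key, id) {
  return new Promise((resolve) => {
    const req = db.transaction("Secrets", "readonly").objectStore("Secrets").get(id);
    req.onsuccess = async () => {
      const { iv, data } = req.result;
      resolve(await decryptData(key, data, iv));
    };
  });
}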
IndexedDB follows the same-origin policy, which means that only scripts from
the same origin (protocol, host, and port) can access the database. This is a core part of
access control in browsers.
Which database architecture supports horizontal scaling the best?
A. Relational databases
B. NoSQL databases
C. In-memory databases
D. File-based databases
Answer
B. NoSQL databases
Which architecture stores data as documents rather than tables?
A. Key-Value stores
B. Document databases
C. Columnar databases
D. Relational databases
Answer
B. Document databases
Module 6 - Database Dashboard
What is a key advantage of a partial data load in dashboards?
A. It reduces the time needed to load the dashboard
B. It ensures all data is always updated
C. It loads the entire dataset regardless of size
D. It simplifies the data processing pipeline
Answer
A. It reduces the time needed to load the dashboard
When would you use a full data load instead of a partial data load?
A. When only small parts of the data change frequently
B. When data consistency is critical and the entire dataset needs to be refreshed
C. When you want to load only recent data
D. When the data load time is irrelevant
Answer
B. When data consistency is critical and the entire dataset needs to be refreshed
Which load method is more efficient for large, frequently changing datasets?
A. Full data load
B. Partial data load
C. No data load is necessary
D. Both methods are equally efficient
Answer
B. Partial data load
What is the primary purpose of a KPI (Key Performance Indicator) in a dashboard?
A. To display the total number of records in a dataset
B. To measure progress toward a specific business goal
C. To summarize past performance
D. To calculate the average value of a metric
Answer
B. To measure progress toward a specific business goal
What is a metric used for in a dashboard?
A. To track and measure specific data points
B. To set the business objectives
C. To compare against industry standards
D. To predict future outcomes
Answer
A. To track and measure specific data points
What is the purpose of a benchmark in a business dashboard?
A. To set a new goal for the next quarter
B. To compare actual performance against a standard or reference point
C. To highlight the highest-performing metrics
D. To calculate the average of all KPIs
Answer
B. To compare actual performance against a standard or reference point
Module 7 - Database and Data Security
What is the main benefit of using encryption for securing NoSQL big data?
A. It ensures that only authorized users can modify the database schema
B. It protects data from unauthorized access, even if the database is compromised
C. It simplifies data replication across nodes
D. It removes the need for access control policies
Answer
B. It protects data from unauthorized access, even if the database is compromised
How does access control improve security in NoSQL databases?
A. It ensures all data is encrypted
B. It restricts access to sensitive data based on user roles and permissions
C. It removes the need for data backups
D. It enables faster querying by limiting access to large datasets
Answer
B. It restricts access to sensitive data based on user roles and permissions
Which of the following is a challenge when using encryption in NoSQL databases?
A. Encryption increases data redundancy
B. Encrypted data can be more difficult to index and query efficiently
C. Encryption requires no additional processing power
D. Encryption allows unauthorized users to bypass access control
Answer
B. Encrypted data can be more difficult to index and query efficiently
Lab Questions
Which method is used to open a database in IndexedDB?
A. indexedDB.open()
B. openDatabase()
C. openIndexedDB()
D. database.open()
Answer
A. indexedDB.open()
How do you create an object store in IndexedDB?
A. db.createObjectStore()
B. createStore()
C. db.createStore()
D. objectStore.create()
Answer
A. db.createObjectStore()
Which method is used to add data to an object store in IndexedDB?
A. store.insert()
B. store.add()
C. db.addData()
D. objectStore.insert()
Answer
B. store.add()
Which method retrieves data from an object store in IndexedDB?
A. store.find()
B. db.query()
C. store.get()
D. db.getData()
Answer
C. store.get()
Which query is used to find all documents in a MongoDB collection?
A. db.collection.findAll()
B. db.collection.get()
C. db.collection.find()
D. db.collection.selectAll()
Answer
C. db.collection.find()
How do you update a document in MongoDB?
A. db.collection.modify()
B. db.collection.update()
C. db.collection.change()
D. db.collection.set()
Answer
B. db.collection.update()
Which operator is used to delete a document in MongoDB?
A. db.collection.remove()
B. db.collection.delete()
C. db.collection.erase()
D. db.collection.drop()
Answer
A. db.collection.remove()
How do you query for documents where a field equals a specific value in MongoDB?
A. db.collection.find({ "field": value })
B. db.collection.findByField()
C. db.collection.selectBy()
D. db.collection.queryField()
Answer
A. db.collection.find({ "field": value })
Which Cypher query retrieves all nodes in Neo4j?
A. MATCH (n) RETURN n
B. FIND n RETURN ALL
C. SELECT * FROM nodes
D. GET nodes
Answer
A. MATCH (n) RETURN n
How do you create a relationship between two nodes in Neo4j?
A. CONNECT (a)-[:RELATES_TO]->(b)
B. MATCH (a), (b) CREATE (a)-[:RELATES_TO]->(b)
C. INSERT RELATION a, b
D. ADD RELATIONSHIP a -> b
Answer
B. MATCH (a), (b) CREATE (a)-[:RELATES_TO]->(b)
Question 1: Data Modeling for NoSQL vs SQL
How does data modeling in NoSQL databases differ from SQL, particularly when dealing with
unstructured or semi-structured data? Provide examples of use cases where NoSQL's flexible
schema is more beneficial than SQL's rigid schema
Question 2: Partial vs Full Data Load in Dashboards
In what scenarios would you choose a partial data load over a full data load for a real-time
dashboard? Discuss the trade-offs in terms of performance, consistency, and data freshness
Question 3: NoSQL Data Security
When securing a NoSQL database, how do encryption and access control complement each other?
Discuss the challenges of applying encryption and access control in distributed NoSQL
environments like MongoDB and Neo4j
Questions?
Homework and Lab 4
Glossary
Map: A higher-order function that applies a given function to each item in a collection, producing a new collection with the results. In data processing, it transforms each element of a dataset according to a specified operation.
Reduce: A higher-order function that aggregates elements of a collection into a single result by iteratively applying a combining function. In data processing, it consolidates data from multiple sources into a summary or single value.
Filter: A higher-order function that selects elements of a collection that satisfy a given condition, producing a new collection of filtered items. In data processing, it isolates data that meets specific criteria.
Assets: In software development, assets include files such as MP3s, MP4s, images, and other media types that are used to improve the visual and auditory experience of the application. Assets are managed separately from the main application code, allowing for easier updates and maintenance without altering the core software.
Data Lakes, Data Warehouses & Secondary Data Usage
How can this data be used for research beyond its original purpose?
Analyze crop yield trends over time
Study the impact of planting and harvesting times on yield
Optimize farming strategies based on weather and yield data
Challenges of Secondary Data Usage
Data may be outdated or irrelevant to current needs
Original data context or purpose may be missing, leading to misinterpretation
Data quality can be inconsistent, impacting the accuracy of new analyses
Limited control over how the data was collected or processed
Privacy concerns when repurposing sensitive or personal data
May require extensive cleaning or transformation before use
Primary data usage has higher priority than secondary data usage
Today
Midterm Review ✓
Secondary Data Usage ✓
Data Lakes
Data Warehouses
Data Marts
Assignments
What is a Data Lake?
A centralized repository that stores vast amounts of structured, semi-structured, and
unstructured data
Stores data in its raw format, making it highly flexible
Can handle large-scale data from diverse sources (IoT, logs, databases, etc.)
Data Lake: Supporting Secondary Data Usage
A Data Lake enables secondary usage by providing flexible, scalable storage for a wide range of
data types
Advantages of Data Lakes for Secondary Usage
Schema-on-Read: Data is ingested in its raw form, and schemas are applied
when querying. This flexibility allows for a wide variety of secondary analyses
Support for Diverse Data Types: Data Lakes handle structured,
semi-structured, and unstructured data, enabling the use of different formats (e.g., JSON,
CSV, images)
Scalability: Data Lakes scale easily to handle large volumes of data,
making them suitable for secondary research over long periods and with massive datasets.
Data Lake Architecture for Secondary Usage
Raw Zone: Stores data in its original format, making it available for
various secondary analyses like trend analysis, predictive modeling, and historical
research
Processed Zone: Data is cleaned and transformed as needed, which supports
secondary research that requires high-quality, curated datasets
Analytics Zone: Provides a structured and indexed version of the data,
optimized for querying, enabling fast analysis for research purposes
Secondary Research Use Cases in Data Lakes
Historical Analysis: Data lakes store historical data across multiple
years, allowing researchers to study long-term trends and patterns
Machine Learning and AI: The large datasets in data lakes are ideal for
training models that support predictive analytics and advanced research applications
Ad-Hoc Queries: Schema-on-read flexibility allows researchers to perform
different kinds of ad-hoc queries without needing to reformat the data
Data Science & Machine Learning: Why is a data lake suitable for storing
large, unstructured datasets?
Historical Data Analysis: How can a data lake support long-term trend
analysis across diverse data types?
Ad-hoc Queries: How does the schema-on-read capability of a data lake make it easier to explore insights?
Why are Data Lakes an optimal solution for these cases?
Today
Midterm Review ✓
Secondary Data Usage ✓
Data Lakes ✓
Data Warehouses (Thu)
Data Marts (Thu)
Redis DB (Thu)
Assignments
Data Warehouse
A centralized system for storing integrated, structured data from multiple sources, optimized for
analytics and reporting.
Key Properties
Subject-Oriented: Focused on specific business domains (e.g., sales)
Integrated: Data from heterogeneous sources unified
Time-Variant: Stores historical data for trends
Non-Volatile: Data is read-only, no frequent updates
Optimized for Analytics: Built for complex queries, not transactions
Data Flow
Extract data from sources
Transform to standardized formats
Load into the warehouse for querying
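A toy ETL sketch following those three steps (the endpoint, field names, and the loadRow helper are placeholders, not part of any real system):
async function etl() {
  // Extract: pull raw records from a source system
  const response = await fetch("https://example.com/api/sales");
  const raw = await response.json();
  // Transform: standardize formats and drop incomplete rows
  const rows = raw
    .filter(r => r.amount != null)
    .map(r => ({ date: new Date(r.date).toISOString().slice(0, 10), amount: Number(r.amount) }));
  // Load: write the cleaned rows into the warehouse table (loadRow is a placeholder)
  for (const row of rows) {
    await loadRow("sales_fact", row);
  }
}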
Questions?
Data Marts
A subset of a data warehouse focused on specific business areas, providing
faster, targeted access to data
Key Properties
Subject-Specific: Focused on a single domain (e.g., customer service,
marketing).
Smaller Scope: Contains a limited dataset compared to a data warehouse.
Faster Access: Optimized for specific queries in a particular department.
Decentralized: Can exist independently or within a data warehouse
Types of Data Marts
Dependent: Sourced from a central data warehouse
Independent: Built directly from external data sources
Hybrid: Combines both dependent and independent approaches
Questions?
Redis - the fastest database
Cache Layers Overview
Cache Layer | Persistence | Latency | Size | Examples
CPU Registers | Non-Persistent | Nanoseconds | Bytes | N/A
L1 Cache | Non-Persistent | Nanoseconds | Kilobytes | N/A
L2 Cache | Non-Persistent | Nanoseconds | KB to MB | N/A
L3 Cache | Non-Persistent | Nanoseconds | Megabytes | N/A
RAM (Main Memory) | Non-Persistent | Microseconds | Gigabytes | Redis, Memcached
Disk Cache | Non-Persistent | Milliseconds | GB | OS Page Cache
Persistent Storage | Persistent | Milliseconds to Seconds | TB to PB | HDD, S3, HDFS
Redis History
2009: Redis created for real-time log analysis
2010: Redis 1.0 released with basic key-value store and Pub/Sub
2011-2012: Introduced advanced data types (lists, sets, hashes)
2013: Added Lua scripting and persistence options (RDB, AOF)
2015: Redis Cluster introduced for horizontal scalability
2020: Redis 6.0 launched with ACLs and multi-threaded I/O
2022: Redis 7.0 introduced functions and advanced replication
Database Selection
1. Select Database 1:
SELECT 1
(Start by using database 1 for isolation of the Flower Store's data)
Inventory Management (CRUD)
Create (Add New Flower to Inventory - Hash):
HSET flower:roses name "Roses" price 10 stock 100
Read (Get Flower Details - Hash):
HGETALL flower:roses
Update (Update Stock Count - Hash):
HINCRBY flower:roses stock -5
Delete (Remove Flower from Inventory - DEL):
DEL flower:roses
Inventory Management
1. Add New Flower to Inventory (Hash):
HSET flower:roses name "Roses" price 10 stock 100
2. Update Stock Count (Hash):
HINCRBY flower:roses stock -5
3. Add Flower to Popularity Ranking (Sorted Set):
ZADD flower_rankings 100 "Roses"
4. Get Flower with Highest Popularity (Sorted Set):
ZREVRANGE flower_rankings 0 0
Order Management
1. Add a Customer Order (List):
LPUSH orders "order:1001"
2. View Pending Orders (List):
LRANGE orders 0 -1
3. Process Oldest Order (List):
RPOP orders
4. Store Order Details (Hash):
HSET order:1001 customer "John" total 30 status "pending"
EXPLAIN ANALYZE
SELECT
AVG(lowest_price) AS avg_lowest_price,
MAX(highest_price) AS max_highest_price,
MIN(menus_appeared) AS min_menus_appeared,
SUM(times_appeared) AS total_times_appeared,
COUNT(*) AS total_dishes,
STDDEV(lowest_price) AS stddev_lowest_price,
STDDEV(highest_price) AS stddev_highest_price
FROM dish;
SELECT
avg(lowest_price) AS avg_lowest_price,
max(highest_price) AS max_highest_price,
min(menus_appeared) AS min_menus_appeared,
sum(times_appeared) AS total_times_appeared,
count(*) AS total_dishes,
stddevPop(lowest_price) AS stddev_lowest_price,
stddevPop(highest_price) AS stddev_highest_price
FROM dish;
ClickHouse caches query results to reduce response times and
computational load on subsequent runs.
SELECT COUNT(*) AS total_records FROM dish;
SELECT COUNT(DISTINCT name) AS unique_names FROM dish;
SELECT SUM(menus_appeared) AS total_menus_appeared FROM dish;
SELECT AVG(times_appeared) AS avg_times_appeared FROM dish;
SELECT MIN(lowest_price) AS min_price, MAX(highest_price) AS max_price FROM dish;
SELECT MIN(first_appeared) AS first_year, MAX(last_appeared) AS last_year FROM dish;
SELECT first_appeared, COUNT(*) AS total_by_first_year
FROM dish
GROUP BY first_appeared
ORDER BY first_appeared;
SELECT last_appeared, AVG(highest_price) AS avg_highest_price
FROM dish
GROUP BY last_appeared
ORDER BY last_appeared;
SELECT name, SUM(times_appeared) AS total_times
FROM dish
GROUP BY name
ORDER BY total_times DESC
LIMIT 5;
SELECT stddevPop(lowest_price) AS stddev_lowest_price FROM dish;