Lab 3: Data Normalization & Schema Design
Overview
In this lab you will take on the role of a database consultant hired by Lab 3 Music Shop, a small business that has been tracking all of its sales data in a single, sprawling spreadsheet. Your job is to analyze the flat data, identify the problems it causes, and design a proper normalized relational database schema.
You will work through the normalization process step by step — from an unnormalized flat file through 1NF, 2NF, and 3NF — discovering anomalies at each stage and showing how each normal form resolves them. Finally, you will implement your normalized design in MySQL and prove that no data was lost by reconstructing the original flat view with a query.
Background: Why Normalize?
When data is stored in a single flat table, three types of anomalies can occur:
Normalization systematically eliminates these anomalies by decomposing tables so that each table represents a single, coherent fact about the world.
The Dataset: Lab 3 Music Shop
Below is a sample of the spreadsheet the shop owner has been maintaining. Each row represents a line item on a customer order. The full dataset is available in lab3_data.csv on the course website (170 rows). Note: scroll the table to the right to see all columns.
| OrderID | OrderDate | CustID | CustName | CustEmail | CustCity | CustState | CustZip | ProductID | ProductName | Category | CategoryMgr | Qty | UnitPrice | LineTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | 2024-11-02 | 501 | Maria Lopez | mlopez@mail.com | Portland | OR | 97201 | 100 | Yamaha FG800 Guitar | Guitars | Rick Samuels | 1 | 219.99 | 219.99 |
| 1001 | 2024-11-02 | 501 | Maria Lopez | mlopez@mail.com | Portland | OR | 97201 | 205 | Ernie Ball Strings (3-pack) | Accessories | Dana Kim | 3 | 16.99 | 50.97 |
| 1001 | 2024-11-02 | 501 | Maria Lopez | mlopez@mail.com | Portland | OR | 97201 | 310 | Fender Guitar Picks (12-pack) | Accessories | Dana Kim | 2 | 5.99 | 11.98 |
| 1002 | 2024-11-03 | 502 | James Reed | jreed@mail.com | Seattle | WA | 98101 | 100 | Yamaha FG800 Guitar | Guitars | Rick Samuels | 1 | 219.99 | 219.99 |
| 1002 | 2024-11-03 | 502 | James Reed | jreed@mail.com | Seattle | WA | 98101 | 420 | Roland TD-1K Drum Kit | Drums | Rick Samuels | 1 | 499.99 | 499.99 |
| 1003 | 2024-11-05 | 501 | Maria Lopez | mlopez@mail.com | Portland | OR | 97201 | 205 | Ernie Ball Strings (3-pack) | Accessories | Dana Kim | 2 | 16.99 | 33.98 |
| 1004 | 2024-11-07 | 503 | Susan Park | spark@mail.com | Beaverton | OR | 97005 | 550 | Casio CT-S300 Keyboard | Keyboards | Dana Kim | 1 | 229.99 | 229.99 |
| 1005 | 2024-11-10 | 504 | Kevin Tran | ktran@mail.com | Portland | OR | 97201 | 100 | Yamaha FG800 Guitar | Guitars | Rick Samuels | 2 | 219.99 | 439.98 |
The highlighted cells illustrate redundant data — the same customer information is repeated for every line item on the same order, and the same product information is repeated every time that product is purchased. We can do better!
- Each order is placed by exactly one customer on a single date
- Each order can contain multiple products (line items)
- Each product belongs to exactly one category
- Each category has exactly one category manager
- A product's unit price is set per product and does not vary by order (e.g., no discounts)
LineTotalis always calculated asQty × UnitPrice. (Note: you may have to scroll the table right to seeLineTotal)
Part 1 — Identify entities and relationships between them (2 points)
Start with the sample data above, the full CSV file, and the business rules.
1.1 List each entity (1 point)
List each of the entities present in the data above and give the attributes for each entity. You should find between 3 and 5 entities, depending on how you decide to handle attributes in the data above. Underline the primary key for each entity. Format your answer as:
Entity1(Attribute1, Attribute2, ...)
Entity2(Attribute1, Attribute2, ...)
1.2 Describe relationships between entities (1 point)
Recall entities are related as 1:1, 1:M, or M:N. Describe the relationships between entities (e.g., Entity1 1:M Entity2) and say in English how you will model each relationship.
Provide these answers in a Part 1 of a lab3.pdf file. Make sure you include the header specified at the end of Lab 1. You will submit lab3.pdf as part of your solution to this lab.
Part 2 — Normalize each entity (3 points)
In a section called Part 2 of lab3.pdf, for each table from Part 1, state any partial or transitive dependencies. If your relation is already in 3NF, say so. Otherwise, update your entities from Part 1 to 3NF following the same format and add any needed new tables. If you choose not to normalize to 3NF, state your reason.
1NF: First Normal Form (1 point)
Ensure that:
- All column values are atomic (no repeating groups, no multi-valued cells).
- There is a clearly defined primary key.
2NF: Second Normal Form (1 point)
Eliminate all partial dependencies:
- Identify every attribute that depends on only part of the composite primary key.
- Decompose the table(s) to move those attributes into separate tables.
- Show the resulting tables with their primary and foreign keys.
- Give one concrete example of an anomaly that was eliminated by moving to 2NF.
- Give one concrete example of an anomaly that still exists.
3NF: Third Normal Form (1 point)
Eliminate all transitive dependencies:
- Identify every attribute that depends on another non-key attribute.
- Decompose further to remove these transitive dependencies.
- Show the resulting tables with their primary and foreign keys.
- Give one concrete example of an anomaly that was eliminated by moving to 3NF.
- State whether any anomalies remain.
CustZip, CustCity, and CustState, and
also between Category and CategoryMgr.
Part 3 — ER Diagram (3 points)
Use MySQL Workbench to create an Entity-Relationship (ER) diagram for your final normalized schema.
3.1 Create a new ER model (2 points)
From MySQL Workbench choose File->New Model. Rename your model to lab3 and add an ER diagram. Your diagram must include:
- Entities — A rectangle for each table with all attributes listed with appropriate data types and
UNIQUEorNOT NULLwhere appropriate. - Primary keys — Attributes identified as
primary keysandauto_increment(if applicable) - Relationships — Lines connecting related entities clearly showing relationship cardinality using crow's foot notation and appropriate
ON DELETE/ON UPDATEbehavior.
Save your model as lab3.mwb. You will submit this diagram as part of your solution to this lab.
3.2 Create the database (1 point)
Use MySQL Workbench's Forward Engineer to generate CREATE TABLE statements for every table in your ER model.
Save the SQL commands that Forward Engineer creates in a file called lab3.sql. Make sure to include the header from Lab 1. You will submit this file as part of your solution to this lab.
Part 4 — Implementation & Verification (7 points)
Load the data into your normalized tables and prove that no data was lost during decomposition.
4.1 Load the unnormalized data (1 point)
After the CREATE TABLE commands in lab3.sql from Part 3, load the lab3_data.csv data into a single staging table and then use INSERT ... SELECT DISTINCT
queries to populate each normalized table. Note that you will need to handle the date format in the lab3_data.csv (m/d/Y, not Y-m-d that SQL expects), but this
code should get you started:
-- Load CSV into staging table
CREATE TABLE lab3_data (
OrderID INT,
OrderDate DATE,
CustomerID INT,
CustomerName VARCHAR(100),
CustomerEmail VARCHAR(150),
CustomerCity VARCHAR(100),
CustomerState VARCHAR(2),
CustomerZip VARCHAR(10),
ProductID VARCHAR(10),
ProductName VARCHAR(150),
Category VARCHAR(50),
CategoryMgr VARCHAR(100),
Qty INT,
UnitPrice DECIMAL(10,2),
LineTotal DECIMAL(10,2)
);
LOAD DATA LOCAL INFILE '<your directory>lab3_data.csv'
INTO TABLE lab3_data
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
4.2 Populate your normalized tables from the unnormalized data (2 points)
Query the unnormalized table (lab3_data above) and load your normalized tables so that you store data one time.
For example, customers are listed on multiple rows in the unnormalized data. Make one entry in the Customers table
for each customer.
-- Example: populate the Customer table with distinct customers, storing each customer one time
INSERT INTO Customer (CustomerID, CustomerName, CustomerEmail, CustomerZip)
SELECT DISTINCT CustomerID, CustomerName, CustomerEmail, CustomerZip
FROM lab3_data;
4.3. Reconstruct the original data using your normalized tables (1 point)
Write a single SELECT query (or CREATE VIEW) that joins
all of your normalized tables back together to reproduce the original flat
spreadsheet. The output should have the same columns and the same data as the
original CSV (column order may differ).
-- Reconstruction query
CREATE VIEW original_flat_view AS
SELECT
o.OrderID AS OrderID,
o.OrderDate AS OrderDate,
c.CustomerID AS CustomerID,
c.CustomerName AS CustomerName,
c.CustomerEmail AS CustomerEmail,
z.City AS CustomerCity,
z.State AS CustomerState,
c.CustomerZip AS CustomerZip,
p.ProductID AS ProductID,
p.ProductName AS ProductName,
cat.CategoryName AS Category,
cat.CategoryManager AS CategoryManager,
li.Quantity AS Quantity,
p.UnitPrice AS UnitPrice,
(li.Quantity * p.UnitPrice) AS LineTotal
FROM ... -- your JOINs here
;
After creating the view, run the following verification:
-- Verify row count matches
SELECT COUNT(*) AS flat_row_count FROM lab3_data;
SELECT COUNT(*) AS view_row_count FROM original_flat_view;
-- Spot-check a specific order
SELECT * FROM original_flat_view WHERE OrderID = 1001 ORDER BY ProductID;
4.4 Demonstrate anomaly resolution (3 points)
Write SQL statements that demonstrate how your normalized schema prevents the anomalies you identified in Part B. Provide at least one example of each:
- Insertion without redundancy (1 point): Insert a new product that has never been ordered. Show that this is now possible without creating a fake order.
-
Single-row update (1 point): Update a customer's email address with a single
UPDATEstatement. Show that the change is reflected everywhere (query the reconstruction view to confirm). - Safe deletion (1 point): Delete an order without losing product or customer information. Show that the product and customer still exist after the delete.
Submission Instructions
You should submit one zip file including:
- lab3.pdf with your answers to Part 1 and 2
- lab3.mwb with your ER model from Part 3
- lab3.sql with your answers to Part 3 and 4.
If you worked with a partner, both partners should submit the same solution. Note your partner's name in the textbox on the Canvas submission site. Remember to include the header as described in the submission instructions for Lab 1 for lab3.pdf and lab3.sql.
Frequently Asked Questions
Q: How should I store LineTotal in my normalized schema?
You can use a generated field or you can store the LineTotal as an attribute — think about it and justify your decision.
You may choose either approach as long as your reconstruction
view produces the correct LineTotal values.
Q: How many tables should I end up with?
There is no single "right" number, but most correct solutions for this dataset will have between 5 and 7 tables. If you have significantly more or fewer, double-check your functional dependencies.
Q: My reconstruction view row count doesn't match. What should I check?
Common causes: (1) You used INNER JOIN where a LEFT JOIN
was needed, dropping rows. (2) Your INSERT ... SELECT DISTINCT lost
rows that differ in ways you didn't expect. (3) A foreign key value is missing in
a parent table. Run the staging count vs. each table count to find where rows
disappeared.