Lab 3: Data Normalization & Schema Design

Overview

In this lab you will take on the role of a database consultant hired by Lab 3 Music Shop, a small business that has been tracking all of its sales data in a single, sprawling spreadsheet. Your job is to analyze the flat data, identify the problems it causes, and design a proper normalized relational database schema.

You will work through the normalization process step by step — from an unnormalized flat file through 1NF, 2NF, and 3NF — discovering anomalies at each stage and showing how each normal form resolves them. Finally, you will implement your normalized design in MySQL and prove that no data was lost by reconstructing the original flat view with a query.

Background: Why Normalize?

When data is stored in a single flat table, three types of anomalies can occur:

Insertion Anomaly: You cannot add new data without also adding unrelated data. For example, you cannot add a new product to the catalog unless a customer has purchased it.
Update Anomaly: The same fact is stored in multiple rows, so updating it requires changing many rows. If you miss one, the data becomes inconsistent. For example, if a customer changes their address, every row containing that customer must be updated.
Deletion Anomaly: Deleting a row may accidentally remove the only record of an independent fact. For example, deleting the last order for a product might erase all knowledge that the product exists.

Normalization systematically eliminates these anomalies by decomposing tables so that each table represents a single, coherent fact about the world.

The Dataset: Lab 3 Music Shop

Below is a sample of the spreadsheet the shop owner has been maintaining. Each row represents a line item on a customer order. The full dataset is available in lab3_data.csv on the course website (170 rows). Note: scroll the table to the right to see all columns.

OrderID OrderDate CustID CustName CustEmail CustCity CustState CustZip ProductID ProductName Category CategoryMgr Qty UnitPrice LineTotal
1001 2024-11-02 501 Maria Lopez mlopez@mail.com Portland OR 97201 100 Yamaha FG800 Guitar Guitars Rick Samuels 1 219.99 219.99
1001 2024-11-02 501 Maria Lopez mlopez@mail.com Portland OR 97201 205 Ernie Ball Strings (3-pack) Accessories Dana Kim 3 16.99 50.97
1001 2024-11-02 501 Maria Lopez mlopez@mail.com Portland OR 97201 310 Fender Guitar Picks (12-pack) Accessories Dana Kim 2 5.99 11.98
1002 2024-11-03 502 James Reed jreed@mail.com Seattle WA 98101 100 Yamaha FG800 Guitar Guitars Rick Samuels 1 219.99 219.99
1002 2024-11-03 502 James Reed jreed@mail.com Seattle WA 98101 420 Roland TD-1K Drum Kit Drums Rick Samuels 1 499.99 499.99
1003 2024-11-05 501 Maria Lopez mlopez@mail.com Portland OR 97201 205 Ernie Ball Strings (3-pack) Accessories Dana Kim 2 16.99 33.98
1004 2024-11-07 503 Susan Park spark@mail.com Beaverton OR 97005 550 Casio CT-S300 Keyboard Keyboards Dana Kim 1 229.99 229.99
1005 2024-11-10 504 Kevin Tran ktran@mail.com Portland OR 97201 100 Yamaha FG800 Guitar Guitars Rick Samuels 2 219.99 439.98

The highlighted cells illustrate redundant data — the same customer information is repeated for every line item on the same order, and the same product information is repeated every time that product is purchased. We can do better!

Business Rules:
  • Each order is placed by exactly one customer on a single date
  • Each order can contain multiple products (line items)
  • Each product belongs to exactly one category
  • Each category has exactly one category manager
  • A product's unit price is set per product and does not vary by order (e.g., no discounts)
  • LineTotal is always calculated as Qty × UnitPrice. (Note: you may have to scroll the table right to see LineTotal)

Part 1 — Identify entities and relationships between them (2 points)

Start with the sample data above, the full CSV file, and the business rules.

1.1 List each entity (1 point)

List each of the entities present in the data above and give the attributes for each entity. You should find between 3 and 5 entities, depending on how you decide to handle attributes in the data above. Underline the primary key for each entity. Format your answer as:

Entity1(Attribute1, Attribute2, ...)
Entity2(Attribute1, Attribute2, ...)

1.2 Describe relationships between entities (1 point)

Recall entities are related as 1:1, 1:M, or M:N. Describe the relationships between entities (e.g., Entity1 1:M Entity2) and say in English how you will model each relationship.

Provide these answers in a Part 1 of a lab3.pdf file. Make sure you include the header specified at the end of Lab 1. You will submit lab3.pdf as part of your solution to this lab.

Part 2 — Normalize each entity (3 points)

In a section called Part 2 of lab3.pdf, for each table from Part 1, state any partial or transitive dependencies. If your relation is already in 3NF, say so. Otherwise, update your entities from Part 1 to 3NF following the same format and add any needed new tables. If you choose not to normalize to 3NF, state your reason.

1NF: First Normal Form (1 point)

Ensure that:

  • All column values are atomic (no repeating groups, no multi-valued cells).
  • There is a clearly defined primary key.

2NF: Second Normal Form (1 point)

Eliminate all partial dependencies:

  • Identify every attribute that depends on only part of the composite primary key.
  • Decompose the table(s) to move those attributes into separate tables.
  • Show the resulting tables with their primary and foreign keys.
  • Give one concrete example of an anomaly that was eliminated by moving to 2NF.
  • Give one concrete example of an anomaly that still exists.

3NF: Third Normal Form (1 point)

Eliminate all transitive dependencies:

  • Identify every attribute that depends on another non-key attribute.
  • Decompose further to remove these transitive dependencies.
  • Show the resulting tables with their primary and foreign keys.
  • Give one concrete example of an anomaly that was eliminated by moving to 3NF.
  • State whether any anomalies remain.
Hint: Pay close attention to the relationship between CustZip, CustCity, and CustState, and also between Category and CategoryMgr.

Part 3 — ER Diagram (3 points)

Use MySQL Workbench to create an Entity-Relationship (ER) diagram for your final normalized schema.

3.1 Create a new ER model (2 points)

From MySQL Workbench choose File->New Model. Rename your model to lab3 and add an ER diagram. Your diagram must include:

  1. Entities — A rectangle for each table with all attributes listed with appropriate data types and UNIQUE or NOT NULL where appropriate.
  2. Primary keys — Attributes identified as primary keys and auto_increment (if applicable)
  3. Relationships — Lines connecting related entities clearly showing relationship cardinality using crow's foot notation and appropriate ON DELETE / ON UPDATE behavior.

Save your model as lab3.mwb. You will submit this diagram as part of your solution to this lab.

3.2 Create the database (1 point)

Use MySQL Workbench's Forward Engineer to generate CREATE TABLE statements for every table in your ER model.

Save the SQL commands that Forward Engineer creates in a file called lab3.sql. Make sure to include the header from Lab 1. You will submit this file as part of your solution to this lab.

Part 4 — Implementation & Verification (7 points)

Load the data into your normalized tables and prove that no data was lost during decomposition.

4.1 Load the unnormalized data (1 point)

After the CREATE TABLE commands in lab3.sql from Part 3, load the lab3_data.csv data into a single staging table and then use INSERT ... SELECT DISTINCT queries to populate each normalized table. Note that you will need to handle the date format in the lab3_data.csv (m/d/Y, not Y-m-d that SQL expects), but this code should get you started:

-- Load CSV into staging table
CREATE TABLE lab3_data (
    OrderID     INT,
    OrderDate   DATE,
    CustomerID      INT,
    CustomerName    VARCHAR(100),
    CustomerEmail   VARCHAR(150),
    CustomerCity    VARCHAR(100),
    CustomerState   VARCHAR(2),
    CustomerZip     VARCHAR(10),
    ProductID   VARCHAR(10),
    ProductName VARCHAR(150),
    Category    VARCHAR(50),
    CategoryMgr VARCHAR(100),
    Qty         INT,
    UnitPrice   DECIMAL(10,2),
    LineTotal   DECIMAL(10,2)
);

LOAD DATA LOCAL INFILE '<your directory>lab3_data.csv'
INTO TABLE lab3_data
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

4.2 Populate your normalized tables from the unnormalized data (2 points)

Query the unnormalized table (lab3_data above) and load your normalized tables so that you store data one time. For example, customers are listed on multiple rows in the unnormalized data. Make one entry in the Customers table for each customer.

-- Example: populate the Customer table with distinct customers, storing each customer one time
INSERT INTO Customer (CustomerID, CustomerName, CustomerEmail, CustomerZip)
SELECT DISTINCT CustomerID, CustomerName, CustomerEmail, CustomerZip
FROM lab3_data;

4.3. Reconstruct the original data using your normalized tables (1 point)

Write a single SELECT query (or CREATE VIEW) that joins all of your normalized tables back together to reproduce the original flat spreadsheet. The output should have the same columns and the same data as the original CSV (column order may differ).

-- Reconstruction query
CREATE VIEW original_flat_view AS
SELECT
    o.OrderID           AS OrderID,
    o.OrderDate         AS OrderDate,
    c.CustomerID        AS CustomerID,
    c.CustomerName      AS CustomerName,
    c.CustomerEmail     AS CustomerEmail,
    z.City              AS CustomerCity,
    z.State             AS CustomerState,
    c.CustomerZip       AS CustomerZip,
    p.ProductID         AS ProductID,
    p.ProductName       AS ProductName,
    cat.CategoryName    AS Category,
    cat.CategoryManager AS CategoryManager,
    li.Quantity         AS Quantity,
    p.UnitPrice         AS UnitPrice,
    (li.Quantity * p.UnitPrice) AS LineTotal
FROM ...  -- your JOINs here
;

After creating the view, run the following verification:

-- Verify row count matches
SELECT COUNT(*) AS flat_row_count FROM lab3_data;
SELECT COUNT(*) AS view_row_count FROM original_flat_view;

-- Spot-check a specific order
SELECT * FROM original_flat_view WHERE OrderID = 1001 ORDER BY ProductID;

Lossless Decomposition: If your row counts do not match, your decomposition may have introduced spurious tuples or lost data. Go back and check your functional dependencies and decomposition steps.

4.4 Demonstrate anomaly resolution (3 points)

Write SQL statements that demonstrate how your normalized schema prevents the anomalies you identified in Part B. Provide at least one example of each:

  1. Insertion without redundancy (1 point): Insert a new product that has never been ordered. Show that this is now possible without creating a fake order.
  2. Single-row update (1 point): Update a customer's email address with a single UPDATE statement. Show that the change is reflected everywhere (query the reconstruction view to confirm).
  3. Safe deletion (1 point): Delete an order without losing product or customer information. Show that the product and customer still exist after the delete.

Submission Instructions

You should submit one zip file including:

  • lab3.pdf with your answers to Part 1 and 2
  • lab3.mwb with your ER model from Part 3
  • lab3.sql with your answers to Part 3 and 4.

If you worked with a partner, both partners should submit the same solution. Note your partner's name in the textbox on the Canvas submission site. Remember to include the header as described in the submission instructions for Lab 1 for lab3.pdf and lab3.sql.

Frequently Asked Questions

Q: How should I store LineTotal in my normalized schema?

You can use a generated field or you can store the LineTotal as an attribute — think about it and justify your decision. You may choose either approach as long as your reconstruction view produces the correct LineTotal values.

Q: How many tables should I end up with?

There is no single "right" number, but most correct solutions for this dataset will have between 5 and 7 tables. If you have significantly more or fewer, double-check your functional dependencies.

Q: My reconstruction view row count doesn't match. What should I check?

Common causes: (1) You used INNER JOIN where a LEFT JOIN was needed, dropping rows. (2) Your INSERT ... SELECT DISTINCT lost rows that differ in ways you didn't expect. (3) A foreign key value is missing in a parent table. Run the staging count vs. each table count to find where rows disappeared.