Lesson 2: SQL Data Structures and Importing Data

Understand the building blocks of relational databases and three practical methods for importing data.

Estimated Read Time: 2 - 3 Hours

Learning Goals

In this lesson, you will learn to:

Technical & Analytical Skills:
  • Explain how relational databases organize data using tables, keys, and relationships
  • Identify and apply appropriate data types for different business data
  • Import data from multiple file formats (SQL dumps, CSV files, compressed archives)
  • Execute basic SQL queries to view and verify data
  • Navigate database schemas using Entity Relationship Diagrams (ERDs)
  • Create and maintain data dictionaries for documentation and analysis
 
Business Acumen:
  • Understand why proper data structure prevents costly analytical errors
  • Recognize how database design affects query performance and business decisions
  • Document data assets so teams can find and use information efficiently
  • Ensure data integrity through appropriate constraints and relationships
 
AI Literacy & AI-Proofing:
  • Use AI tools to accelerate database setup and data type selection
  • Verify AI-generated SQL code for accuracy and security
  • Understand when manual verification is critical (data types, constraints, business logic)

You’ve built your first database and executed your first query. Now comes the foundation that separates competent analysts from those who constantly battle data quality issues: understanding how data should be structured.

 

Poor data structure isn’t just a technical problem – it’s a business liability. Consider these real scenarios:

  • A retailer loses €2.3M because their inventory system stores product IDs as text instead of integers, causing duplicate entries and incorrect stock counts
  • An analytics team spends 40% of their time fixing date formats because the database accepts inconsistent inputs
  • A company can’t answer “Which customers bought Product X?” because their tables aren’t properly linked

 

This lesson teaches you how to avoid these problems by understanding the building blocks of relational databases: tables, keys, data types, and relationships. You’ll also learn three practical methods for importing data – skills you’ll use in every real-world project.

 

Let’s start with the fundamentals.

 

1. How Relational Databases Organize Data

1.1. Tables: The Core Structure

A table is a collection of related data organized in rows and columns. Think of it as a spreadsheet, but with strict rules about what goes where.

Every table represents a specific entity—a person, product, transaction, or any business concept you need to track.

Example: E-commerce Order System

orders table:
┌──────────┬─────────────┬────────────┬──────────────┐
│ order_id │ customer_id │ order_date │ total_amount │
├──────────┼─────────────┼────────────┼──────────────┤
│ 1001     │ 5847        │ 2024-11-15 │ 127.50       │
│ 1002     │ 2193        │ 2024-11-15 │ 89.00        │
│ 1003     │ 5847        │ 2024-11-16 │ 210.75       │
└──────────┴─────────────┴────────────┴──────────────┘
1.2. Terminology
  • Column (or field): Represents an attribute (e.g., order_date)
  • Row (or record/tuple): Represents one complete instance (e.g., Order #1001)
  • Cell: The intersection of a row and column (e.g., 127.50 in row 1, column 4)
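As a sketch, the orders table above could be defined in SQL like this (the column types are assumptions chosen for illustration, not the only valid choices):

```sql
-- Hypothetical definition of the orders table shown above
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,    -- uniquely identifies each order
    customer_id  INTEGER NOT NULL,       -- which customer placed the order
    order_date   DATE NOT NULL,
    total_amount NUMERIC(10,2) NOT NULL  -- exact decimal, suitable for currency
);
```

Every row inserted into this table must supply values that match these rules, which is exactly the consistency discussed next.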

 

1.3. Why This Structure Matters

Unlike spreadsheets where you can put anything anywhere, database tables enforce consistency:

  • Every column must have a defined data type
  • Every row must follow the same structure
  • Relationships between tables must be explicitly defined

 

This rigidity is actually a strength – it prevents the chaos that destroys data quality in unstructured systems.

 

2. Keys – The Foundation of Data Integrity

Keys are special columns that identify records and create relationships between tables. Understanding keys is non-negotiable for anyone working with databases.

 

2.1. Primary Key

A primary key uniquely identifies each row in a table. It must be:

  1. Unique for every row
  2. Never NULL (always has a value)
  3. Unchanging once set

 

Example:

customers table:
┌─────────────┬────────────┬───────────────────────┐
│ customer_id │ first_name │ email                 │
├─────────────┼────────────┼───────────────────────┤
│ 5847        │ Sarah      │ sarah.m@example.com   │
│ 2193        │ Marcus     │ marcus.j@example.com  │
│ 8402        │ Elena      │ elena.k@example.com   │
└─────────────┴────────────┴───────────────────────┘
      ↑
 PRIMARY KEY

Here, customer_id is the primary key. Even if two customers have the same name, their IDs will always be different.
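In SQL, the primary key is declared when the table is created. A minimal sketch of the customers table above:

```sql
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,  -- unique and never NULL, enforced by PostgreSQL
    first_name  VARCHAR(30),
    email       VARCHAR(255)
);
```

With this constraint in place, the database rejects any INSERT that reuses an existing customer_id or leaves it NULL.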

 

2.2. Foreign Key

A foreign key is a column that references the primary key of another table, creating a relationship.

 

Example:

orders table:
┌──────────┬─────────────┬────────────┐
│ order_id │ customer_id │ order_date │  ← customer_id is a FOREIGN KEY
├──────────┼─────────────┼────────────┤    referencing customers.customer_id
│ 1001     │ 5847        │ 2024-11-15 │
│ 1002     │ 2193        │ 2024-11-15 │
│ 1003     │ 5847        │ 2024-11-16 │
└──────────┴─────────────┴────────────┘

This relationship tells us:

  • Order 1001 belongs to customer 5847 (Sarah)
  • Order 1002 belongs to customer 2193 (Marcus)
  • Order 1003 also belongs to customer 5847 (Sarah)

 

Foreign keys enforce referential integrity – you can’t create an order for a customer that doesn’t exist in the customers table.
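A sketch of how this constraint is declared, assuming the customers table from section 2.1 already exists:

```sql
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
    order_date  DATE NOT NULL
);

-- This INSERT fails: customer 9999 does not exist in customers
INSERT INTO orders (order_id, customer_id, order_date)
VALUES (1004, 9999, '2024-11-17');
```

The failed INSERT is referential integrity in action: bad references are stopped at write time instead of corrupting your analysis later.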

 

2.3. Composite Key

When no single column can uniquely identify a row, combine multiple columns.

enrollment table:
┌────────────┬───────────┬─────────────┐
│ student_id │ course_id │ semester    │
└────────────┴───────────┴─────────────┘
       ↓          ↓
   COMPOSITE PRIMARY KEY
   (student + course identifies one enrollment)
2.4. Surrogate Key

A system-generated ID with no business meaning, added purely for identification.

films table:
┌─────────┬──────────────┬──────────┐
│ film_id │ title        │ duration │  ← film_id is a SURROGATE KEY
├─────────┼──────────────┼──────────┤    (just a number, no real-world meaning)
│ 1       │ Inception    │ 148      │
│ 2       │ Interstellar │ 169      │
└─────────┴──────────────┴──────────┘
Business Rule

If you’re unsure whether a table needs a key, the answer is always yes. Every table should have a primary key.
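When no natural key exists, a common pattern is to let PostgreSQL generate a surrogate key for you. A sketch for the films table above (the older SERIAL type is a legacy equivalent of IDENTITY):

```sql
CREATE TABLE films (
    film_id  INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- auto-generated surrogate key
    title    VARCHAR(100) NOT NULL,
    duration INTEGER  -- runtime in minutes
);

-- You never supply film_id; the database assigns 1, 2, 3, ...
INSERT INTO films (title, duration) VALUES ('Inception', 148);
```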

3. Query Performance: Why Do Some Queries Take Forever?

3.1. The Problem

You run a query to find all orders from a specific customer. You wait. And wait. And wait.

What’s happening: Your database is checking every single row in a table with 2.3 million records. This is called a full table scan, and it’s the performance killer in most slow queries.

The business impact:

  • Analysts waste hours waiting for results
  • Dashboard loading times frustrate executives
  • Real-time applications (e.g., customer portals) become unusable

 

3.2. The Solution: Indexes

An index is a special lookup structure that speeds up data retrieval, similar to an index in a book. Instead of scanning every row, the database jumps directly to the relevant data.

Example:

-- Without index: Database scans all 2.3 million rows (3.2 seconds)
SELECT * FROM orders WHERE customer_id = 5847;

-- With index on customer_id: Database jumps to matching rows (0.04 seconds)
SELECT * FROM orders WHERE customer_id = 5847;

That’s an 80x speed improvement from a single index.

 
3.2.1. How Indexes Work

Think of a phone book. Finding “Sarah Martinez” without the alphabetical index would require reading every page. With the index, you jump straight to the “M” section.

Databases work the same way:

  • Without index: Check row 1, check row 2, check row 3… (slow)
  • With index: Jump to the index entry, follow pointer to matching rows (fast)
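You can see which strategy PostgreSQL chose with EXPLAIN. The exact plan text varies by installation and table size, but the pattern looks like this:

```sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 5847;

-- Typical output before indexing:  "Seq Scan on orders ..."   (full table scan)
-- Typical output after indexing:   "Index Scan using idx_customer_id on orders ..."
```

Checking the plan before and after CREATE INDEX is the fastest way to confirm an index is actually being used.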

 

3.2.2. Types of Indexes

Single-Column Index: Built on one column (most common)

CREATE INDEX idx_customer_id ON orders(customer_id);

 

Composite Index: Built on multiple columns (for queries filtering by multiple fields)

CREATE INDEX idx_customer_date ON orders(customer_id, order_date);

 

Unique Index: Enforces uniqueness (like a primary key, but for non-primary columns)

CREATE UNIQUE INDEX idx_email ON customers(email);
3.3. Performance Optimization Decision Tree

Create indexes when:

  • Columns are frequently used in WHERE clauses
  • You’re joining large tables on specific columns
  • Queries consistently take more than 1-2 seconds
  • Tables have thousands or more rows

 

Avoid indexes when:

  • Tables are small (< 1,000 rows) — full scans are already fast
  • Columns are frequently updated — maintaining the index slows down INSERT/UPDATE/DELETE operations
  • Columns have many NULL values — indexes are less effective
  • Storage space is severely limited — indexes consume disk space
 
3.4. The Trade-Off: Speed vs. Storage

Indexes aren’t free. Every index:

  • Speeds up SELECT queries (often dramatically)
  • Slows down INSERT, UPDATE, DELETE (because the index must be maintained)
  • Consumes storage space (typically 10-30% of table size)

Best Practice

Index strategically. Add indexes for columns you query frequently, but don’t index everything – you’ll slow down data modifications and waste storage.

4. Data Types: The Rules of Data Consistency

Data types define what kind of information a column can hold. Choosing the wrong type causes errors, wasted storage, and incorrect analyses.

4.1. Common PostgreSQL Data Types
Category    | Data Type     | Example             | Use Case
Integers    | INTEGER       | 42                  | IDs, counts, quantities
Integers    | BIGINT        | 9223372036854775807 | Large numbers (e.g., global user IDs)
Decimals    | NUMERIC(10,2) | 1499.99             | Currency, precise calculations
Decimals    | REAL          | 3.14159             | Scientific measurements
Text        | VARCHAR(255)  | Sarah Martinez      | Names, emails (limited length)
Text        | TEXT          | Long descriptions   | Product descriptions, comments
Dates/Times | DATE          | 2024-11-15          | Birthdates, order dates
Dates/Times | TIMESTAMP     | 2024-11-15 14:32:07 | Event logs, transaction times
Boolean     | BOOLEAN       | TRUE / FALSE        | Flags (is_active, is_verified)
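A sketch showing several of these types working together in one table (the table and column names are illustrative, not a prescribed design):

```sql
CREATE TABLE products (
    product_id  INTEGER PRIMARY KEY,
    name        VARCHAR(255) NOT NULL,            -- limited-length text
    description TEXT,                             -- unlimited-length text
    price       NUMERIC(10,2) NOT NULL,           -- exact decimal for currency
    released_on DATE,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    is_active   BOOLEAN DEFAULT TRUE
);
```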
 
4.2. Why Data Types Matter: A Real Example

Scenario: A retail company stores product prices.

Bad approach (using TEXT):

products table:
┌────────────┬───────────┐
│ product_id │ price     │
├────────────┼───────────┤
│ 101        │ $49.99    │
│ 102        │ 89 dollars│
│ 103        │ 125.5     │
└────────────┴───────────┘

 

Problem: You can’t calculate average price or filter by price range because the column contains inconsistent text.

-- This query FAILS or returns wrong results:
SELECT AVG(price) FROM products;

 

Good approach (using NUMERIC):

products table:
┌────────────┬─────────┐
│ product_id │ price   │
├────────────┼─────────┤
│ 101        │ 49.99   │
│ 102        │ 89.00   │
│ 103        │ 125.50  │
└────────────┴─────────┘

 

Now you can run:

SELECT AVG(price) FROM products;  -- Returns 88.16
SELECT * FROM products WHERE price > 50;  -- Works correctly
SELECT SUM(price) FROM products WHERE category = 'Electronics';  -- Accurate totals (assuming a category column)
Rule of Thumb

Always choose the most specific data type that fits your data. Don’t use TEXT when INTEGER or DATE is appropriate.
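If you inherit a table where price was stored as TEXT, PostgreSQL can often convert the column in place, provided the existing values are clean numbers. A sketch (values like "$49.99" or "89 dollars" would have to be cleaned up first):

```sql
ALTER TABLE products
    ALTER COLUMN price TYPE NUMERIC(10,2)
    USING price::NUMERIC(10,2);  -- fails if any existing value isn't a valid number
```

The failure on dirty data is a feature: it forces you to fix inconsistencies instead of silently carrying them forward.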

4.3. Data Type Selection Guide
Data          | Wrong Type        | Right Type        | Why
Age (34)      | TEXT              | INTEGER           | Need to calculate averages, compare ranges
Price (49.99) | TEXT or REAL      | NUMERIC(10,2)     | Currency requires exact precision (no rounding errors)
Email         | TEXT (unlimited)  | VARCHAR(255)      | Emails have practical length limits; VARCHAR saves space
Order Date    | TEXT              | DATE or TIMESTAMP | Need to filter by date ranges, calculate time differences
Is Active?    | TEXT ("yes"/"no") | BOOLEAN           | Only two states; BOOLEAN is explicit, efficient, and usable in logical operations

5. Importing Data

Real-world data arrives in different formats. You need to handle all of them.

 

Method 1: Importing SQL Dumps

An SQL dump is a file containing SQL commands to recreate tables and insert data. This is the most common format for sharing complete databases.

 

When to use?  Receiving a complete database export, migrating between systems, or working with standard datasets (e.g., Northwind, DVD Rental).

 

Steps to import: Follow the video walkthrough below to see how to import data using SQL dumps. For this module, we will use the Northwind database, which you can download from Lumen’s GitHub repository.

Video 1 – Importing data from SQL dumps

What happens? PostgreSQL reads the SQL commands and executes them sequentially – creating tables, defining keys, inserting data.
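If you prefer the command line to pgAdmin, the same import can be done with PostgreSQL’s bundled tools (the database name and file path here are placeholders for your own):

```shell
# Create an empty database, then replay the dump's SQL commands into it
createdb northwind
psql -d northwind -f northwind.sql
```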

 

Advantages:

  • Fast and automated
  • Includes complete table structure and relationships
  • Standard format across database systems

 

Method 2: Importing Compressed Archives (.tar, .zip with .dat files)

Some databases come as compressed folders containing raw data files (`.dat` format).

 

When to use? Large datasets that are too big for text-based SQL dumps, or when receiving exports from specific database tools. 

 

Steps to Import: Download the DVD Rental database from Lumen’s GitHub repository and follow the video walkthrough below to see how to import data from compressed archives.

Video 2 – Importing data from compressed archives

What happens? pgAdmin reads the compressed format and reconstructs tables with data and structure.
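Behind the scenes, pgAdmin’s Restore dialog wraps the pg_restore utility. The command-line equivalent looks like this (names are placeholders):

```shell
# Create an empty database, then rebuild tables, data, and indexes from the archive
createdb dvdrental
pg_restore -d dvdrental dvdrental.tar
```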

 

Advantages:

  • Efficient for very large datasets (millions of rows)
  • Preserves complete database structure including indexes and constraints
  • Smaller file size than SQL dumps

 

Method 3: Importing CSV Files

CSV (Comma-Separated Values) is the most common format for open data sources (Kaggle, government databases, business exports).

 

When to use? Almost always. CSVs are universal, human-readable, and easy to inspect before importing.

 

Steps to Import: Download the loan database from Lumen’s GitHub repository and follow the video walkthrough below to see how to import data from CSV files.

  • Important difference: Unlike Methods 1-2, CSVs don’t include table structure. You must create the table first. 
  • Use AI to Generate Table Structure: Instead of manually writing SQL, leverage AI tools like ChatGPT or Claude.

Video 3 – Importing data from CSVs

Pro tip: Always test with a small sample (10-20 rows) before importing large files. Create a test CSV, import it, verify structure, then import the full dataset.
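A minimal sketch of the two-step CSV workflow (the loans table and its columns are illustrative; your CREATE TABLE must match your actual file):

```sql
-- Step 1: create the table structure yourself (CSVs don't include it)
CREATE TABLE loans (
    loan_id     INTEGER PRIMARY KEY,
    amount      NUMERIC(12,2),
    issued_date DATE
);

-- Step 2: load the file (server-side COPY; inside psql you can use \copy instead)
COPY loans FROM '/path/to/loans.csv' WITH (FORMAT csv, HEADER true);
```

HEADER true tells PostgreSQL to skip the first row of column names; without it, that row would be rejected as invalid data.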

 

6. Database Schemas and ERDs

 

6.1. The Northwind Database: A Real-World Example

 

6.2. Northwind Table Overview

 

6.3. Reading the Northwind ERD
Northwind Entity Relationship Diagram
6.4. Business Questions Enabled by These Relationships

 

6.5. Star Schema vs. Snowflake Schema

 

Classification: Hybrid (Star with snowflake elements)

Why this matters for analysis:

  • Simple queries (e.g., “total sales by product”) = fast (star structure)
  • Complex queries (e.g., “sales by category and supplier”) = slightly slower (snowflake branches require more joins)
  • Data integrity = high (normalized design prevents duplication)

How to Identify Schema Patterns

Steps to analyze any database schema:

  1. Find the fact table(s)
    • Look for transaction/event data (orders, sales, shipments)
    • Contains metrics (quantities, amounts, counts)
    • Usually has multiple foreign keys
  2. Identify dimension tables
    • Contain descriptive information (who, what, where, when)
    • Referenced by fact tables via foreign keys
    • Examples: customers, products, employees, time
  3. Check for sub-dimensions
    • Do dimension tables link to other dimension tables?
    • Yes = snowflake characteristics
    • No = pure star schema
  4. Count the “join distance”
    • How many tables between fact and dimension?
    • 1 join = star schema
    • 2+ joins = snowflake schema

Northwind example:

  • Fact → Product = 1 join (star)
  • Fact → Category = 2 joins via Product (snowflake)
  • Fact → Customer = 2 joins via Orders (snowflake)
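To find candidate fact tables (step 1 above) in any PostgreSQL database, you can count foreign keys per table. A sketch using the standard information_schema views:

```sql
-- Tables with the most foreign keys are usually fact tables
SELECT table_name, COUNT(*) AS foreign_key_count
FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY'
GROUP BY table_name
ORDER BY foreign_key_count DESC;
```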

ERD Symbols Reference

When viewing ERDs, you’ll see these common symbols:

Symbol      | Meaning
(PK)        | Primary Key: uniquely identifies rows
(FK)        | Foreign Key: references another table’s PK
Line/Arrow  | Relationship between tables
1           | “One” side of a relationship
Crow’s foot | “Many” side of a relationship

Relationship types in Northwind:

  • 1:N (One-to-Many): customers to orders (one customer, many orders)
  • N:1 (Many-to-One): orders to customers (many orders, one customer)
  • N:M (Many-to-Many): products to orders (through order_details junction table)
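The N:M relationship is resolved by the junction table. A simplified sketch of Northwind’s order_details, which also shows a composite primary key in practice (column details are abbreviated from the real schema):

```sql
CREATE TABLE order_details (
    order_id   INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES products(product_id),
    quantity   INTEGER NOT NULL,
    unit_price NUMERIC(10,2) NOT NULL,
    PRIMARY KEY (order_id, product_id)  -- composite key: one row per product per order
);
```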

7. Creating a Data Dictionary

A data dictionary is documentation that explains what each table and column contains. It’s essential for:

  • Onboarding new team members
  • Ensuring consistent understanding across teams
  • Supporting your future self when you return to a project months later

What to Include

For each table:

  1. Table name and purpose
  2. Column names, data types, and descriptions
  3. Primary and foreign keys
  4. Constraints (e.g., NOT NULL, UNIQUE)
  5. Sample values (if helpful)

Example Data Dictionary Entry: Northwind Customers Table

Table: customers
Purpose: Stores customer company information including contact details and addresses for B2B sales tracking.

Column        | Data Type   | Description                                                | Constraints  | Sample Values
customer_id   | CHAR(5)     | Unique 5-character customer identifier code                | PRIMARY KEY  | ALFKI, BONAP, BOTTM
company_name  | VARCHAR(40) | Legal customer company name                                | NOT NULL     | Alfreds Futterkiste, Bon app’, Bottom-Dollar Markets
contact_name  | VARCHAR(30) | Primary contact person at customer company                 |              | Maria Anders, Laurence Lebihan
contact_title | VARCHAR(30) | Job title of contact person                                |              | Sales Representative, Owner, Marketing Manager
address       | VARCHAR(60) | Street address of customer                                 |              | Obere Str. 57, 12 rue des Bouchers
city          | VARCHAR(15) | City where customer is located                             |              | Berlin, Marseille, London
region        | VARCHAR(15) | State/province/region (NULL for countries without regions) | NULL allowed | NULL, OR, BC
postal_code   | VARCHAR(10) | Postal/ZIP code                                            |              | 12209
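You don’t have to transcribe column names and types by hand. PostgreSQL can generate the skeleton of a data dictionary for you; the descriptions still need a human:

```sql
-- Starting point for a data dictionary entry
SELECT column_name, data_type, character_maximum_length, is_nullable
FROM information_schema.columns
WHERE table_name = 'customers'
ORDER BY ordinal_position;
```

Export the result to a spreadsheet, then add the Description and Sample Values columns yourself.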

Summary

Well-structured databases prevent costly analytical errors:

  • Tables enforce consistency through defined columns and data types
  • Primary and foreign keys uniquely identify records and maintain referential integrity
  • Appropriate data types enable accurate calculations, filtering, and storage
  • Strategic indexes replace slow full table scans with fast lookups

 

You can now import data using three methods:

  • SQL dumps recreate complete databases, including structure and relationships
  • Compressed archives handle very large datasets efficiently
  • CSV files are universal but require you to define the table structure first

 

Schemas and documentation make data usable:

  • Entity Relationship Diagrams show how tables connect (star, snowflake, or hybrid designs)
  • Data dictionaries document what each table and column contains so teams can find and use information efficiently

 

You’ve built the structural foundation you’ll rely on in every real-world project.

Suggested References

Official Documentation:

 

Industry Context:

Exercise

Estimated Time to Complete: 1-2 hours

Exercise: Environment Validation and First Analysis
Objective

Verify your database environment is properly configured.

  1. Confirm PostgreSQL is running

    • Take a screenshot showing PostgreSQL service/process active
  2. Verify pgAdmin connection

    • Execute a version query in the Query Tool (for example, SELECT version();) to identify your PostgreSQL version.
    • Take a screenshot of the results
 
Submission Format

Submit a single PDF containing two screenshots (PostgreSQL running, query results)

 

Submission & Resubmission Guidelines

1. When submitting your exercise, use this naming format:

YourName_Submission_Lesson#.xxx

If you need to revise and resubmit, add a version suffix:

YourName_Submission_Lesson#_v2.xxx
YourName_Submission_Lesson#_v3.xxx

2. Do not overwrite or change the original evaluation entries in the rubric.

Instead, enter your updated responses or corrections in a new “v2” (or later) column in the rubric for mentor review.

Evaluation Rubric

Criterion: Technical Verification

Meets Expectation:

  • Both screenshots clearly show required elements
  • PostgreSQL service/process is demonstrably running
  • Query executes successfully showing PostgreSQL version

Needs Improvement:

  • Screenshots are unclear or partially incomplete
  • One screenshot is missing
  • Query results are present but screenshot quality makes details difficult to verify

Incomplete/Off-Track:

  • Installation is incomplete or pgAdmin cannot connect to PostgreSQL
  • Screenshots are missing or don’t show required verification elements
  • Query did not execute or shows errors

Contact

Talk to us

Have questions or feedback about Lumen? We’d love to hear from you.