Lesson 5: Data Aggregation

Learn basic SQL techniques to answer complex business questions

Estimated Read Time: 2 - 3 Hours

Learning Goals

In this lesson, you will learn to:

Use aggregate functions to calculate summary statistics
Group data to analyze patterns across categories
Combine multiple SQL techniques to answer complex business questions

1. Data Aggregation

Individual records tell us facts. Aggregated data tells us stories. Aggregation is the foundation of business intelligence – it transforms raw transactions into actionable insights. In this lesson, you’ll learn how to aggregate data and answer complex business questions.

1.1. Aggregate Functions Overview

SQL provides five essential aggregate functions that summarize data:

Function	Purpose	Works On
COUNT()	Counts rows	Any data type
SUM()	Adds values	Numeric types only
AVG()	Calculates average	Numeric types only
MIN()	Finds minimum	Numeric, text, dates
MAX()	Finds maximum	Numeric, text, dates

Critical Understanding: Aggregate functions process multiple rows and return a single value.

1.2. COUNT: Counting Rows and Values

COUNT is the most versatile aggregate function. It answers questions like “How many?” and “How much?”

Three Forms of COUNT:

1. COUNT(*) – Count All Rows:

				
					-- Business Context: "How many products do we have in total?"
SELECT COUNT(*) AS total_products
FROM products;

COUNT(*) counts every row, including those with NULL values.

2. COUNT(column_name) – Count Non-NULL Values:

				
					-- Business Context: "How many customers have a region specified?"
SELECT COUNT(region) AS customers_with_region
FROM customers;

COUNT(column_name) ignores NULL values in that column.

3. COUNT(DISTINCT column_name) – Count Unique Values:

				
					-- Business Context: "How many different countries do we sell to?"
SELECT COUNT(DISTINCT country) AS countries_served
FROM customers;

Practical Comparison:

				
					SELECT 
    COUNT(*) AS total_customers,
    COUNT(region) AS customers_with_region,
    COUNT(DISTINCT country) AS unique_countries
FROM customers;

Result:

total_customers: 91
customers_with_region: 31 (some customers don’t have a region)
unique_countries: 21

1.3. SUM: Adding Numeric Values

SUM calculates the total of a numeric column.

Business Context: “What’s the total value of all inventory we’re holding?”

				
					SELECT SUM(units_in_stock) AS total_inventory_units
FROM products;

More Complex Example with Calculation:

Business Context: “Calculate the total value of our inventory at current prices.”

				
					SELECT 
    SUM(units_in_stock * unit_price) AS total_inventory_value
FROM products
WHERE discontinued = 0;  -- Only active products

Important

SUM ignores NULL values. If units_in_stock is NULL for some products, those rows don’t affect the sum.

1.4. AVG: Calculating Averages

AVG computes the arithmetic mean of numeric values.

Business Context: “What’s our average product price?”

				
					SELECT AVG(unit_price) AS average_price
FROM products;

Controlling Decimal Places:

				
					SELECT 
    ROUND(AVG(unit_price), 2) AS average_price
FROM products;

ROUND(value, 2) rounds to 2 decimal places.

Business Insight Example:

				
					-- Compare individual prices to average
SELECT 
    product_name,
    unit_price,
    ROUND(AVG(unit_price) OVER (), 2) AS avg_price,
    CASE 
        WHEN unit_price > AVG(unit_price) OVER () THEN 'Above Average'
        ELSE 'Below Average'
    END AS price_position
FROM products
ORDER BY unit_price DESC;

(Don’t worry if the OVER() syntax looks unfamiliar—you’ll learn window functions in advanced lessons.)

1.5. MIN and MAX: Finding Extremes

MIN and MAX identify the smallest and largest values.

Business Context: “What are our cheapest and most expensive products?”

				
					SELECT 
    MIN(unit_price) AS cheapest_product_price,
    MAX(unit_price) AS most_expensive_product_price
FROM products;

Finding the Actual Products (Not Just Prices):

				
					-- Cheapest product
SELECT product_name, unit_price
FROM products
WHERE unit_price = (SELECT MIN(unit_price) FROM products);

-- Most expensive product
SELECT product_name, unit_price
FROM products
WHERE unit_price = (SELECT MAX(unit_price) FROM products);

MIN and MAX with Dates:

Business Context: “When was our first and most recent order?”

				
					SELECT 
    MIN(order_date) AS first_order,
    MAX(order_date) AS most_recent_order
FROM orders;

1.6. GROUP BY: The Foundation of Analytical Thinking

GROUP BY is where SQL becomes truly powerful for analysis. It lets you split data into categories and compute aggregate statistics for each category.

Conceptual Understanding:

Think of GROUP BY as creating buckets:

SQL examines your grouping column(s)
It creates a separate “bucket” for each unique value
It puts all rows with that value into the corresponding bucket
It calculates aggregates for each bucket separately
It returns one result row per bucket

Basic Syntax:

				
					SELECT 
    grouping_column,
    AGGREGATE_FUNCTION(column) AS alias
FROM table_name
GROUP BY grouping_column;

Business Context 1: “How many products are in each category?”

				
					SELECT 
    category_id,
    COUNT(*) AS product_count
FROM products
GROUP BY category_id
ORDER BY product_count DESC;

Business Context 2: “What’s the average product price in each category?”

				
					SELECT 
    category_id,
    COUNT(*) AS product_count,
    ROUND(AVG(unit_price), 2) AS average_price,
    MIN(unit_price) AS cheapest,
    MAX(unit_price) AS most_expensive
FROM products
GROUP BY category_id
ORDER BY average_price DESC;

Multiple Column Grouping:

Business Context: “Analyze our customer distribution by country and city.”

				
					SELECT 
    country,
    city,
    COUNT(*) AS customer_count
FROM customers
GROUP BY country, city
ORDER BY country, customer_count DESC;

This creates buckets for each unique country-city combination.

The Golden Rule of GROUP BY:

Every column in your SELECT that isn’t inside an aggregate function MUST appear in the GROUP BY clause.

Wrong:

				
					-- This will cause an error
SELECT category_id, product_name, COUNT(*)
FROM products
GROUP BY category_id;

Why? SQL doesn’t know which product_name to show for each category (since each category has multiple products).

Correct Option 1 – Add to GROUP BY:

				
					SELECT category_id, product_name, COUNT(*)
FROM products
GROUP BY category_id, product_name;

Correct Option 2 – Use Only in Aggregate:

				
					SELECT category_id, COUNT(product_name)
FROM products
GROUP BY category_id;

1.7. HAVING: Filtering Grouped Data

WHERE filters individual rows before grouping.

HAVING filters grouped results after aggregation.

The Distinction:

				
					FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY

WHERE: “Which rows should I include in the calculation?”
HAVING: “Which group results should I show?”

Business Context: “Show me only categories with more than 10 products.”

				
					SELECT 
    category_id,
    COUNT(*) AS product_count
FROM products
GROUP BY category_id
HAVING COUNT(*) > 10
ORDER BY product_count DESC;

Why Not Use WHERE?

				
					-- WRONG - This won't work
SELECT category_id, COUNT(*)
FROM products
WHERE COUNT(*) > 10  -- X Can't use aggregates in WHERE
GROUP BY category_id;

-- CORRECT
SELECT category_id, COUNT(*)
FROM products
GROUP BY category_id
HAVING COUNT(*) > 10;  -- ✓ HAVING is for aggregate conditions

Combining WHERE and HAVING:

Business Context: “Among non-discontinued products, show categories where average price exceeds $30.”

				
					SELECT 
    category_id,
    COUNT(*) AS product_count,
    ROUND(AVG(unit_price), 2) AS avg_price
FROM products
WHERE discontinued = 0  -- Filter rows before grouping
GROUP BY category_id
HAVING AVG(unit_price) > 30  -- Filter groups after aggregation
ORDER BY avg_price DESC;

Decision Tree for WHERE vs HAVING:

Does the condition involve an aggregate function (COUNT, SUM, AVG, etc.)?

├─ NO → Use WHERE (filters rows before grouping)

└─ YES → Use HAVING (filters groups after aggregation)

2. Real-World Business Scenarios

Let’s apply everything you’ve learned so far to solve a business problem:

Scenario: Geographic Market Analysis

Business Question: “Analyze customer concentration across countries. Show countries with at least 5 customers, along with the count and percentage of total customers.”

Solution: Follow the video walkthrough to learn how to convert the business question in analytical terms and answer using SQL.

Video 1 – Answer Geographic Market Analysis question using SQL

3. Query Optimization and Best Practices

3.1. Understanding Query Execution Order

SQL executes your query in a specific order, regardless of how you write it:

				
					1. FROM        - Identifies tables
2. WHERE       - Filters individual rows
3. GROUP BY    - Creates groups
4. HAVING      - Filters groups
5. SELECT      - Defines output columns
6. DISTINCT    - Removes duplicates
7. ORDER BY    - Sorts results
8. LIMIT       - Restricts row count

Why This Matters:

You can’t use SELECT aliases in WHERE (they don’t exist yet)
You CAN use SELECT aliases in ORDER BY (they exist by then)
HAVING comes after GROUP BY (because it filters groups)

Example:

				
					-- This works
SELECT unit_price * 1.2 AS price_with_tax
FROM products
ORDER BY price_with_tax;  -- Alias exists here

-- This doesn't work
SELECT unit_price * 1.2 AS price_with_tax
FROM products
WHERE price_with_tax > 50;  -- ❌ Alias doesn't exist yet

-- Correct version
SELECT unit_price * 1.2 AS price_with_tax
FROM products
WHERE unit_price * 1.2 > 50;  -- ✓ Use the calculation

3.2. Performance Considerations

1. Use LIMIT During Development:

When testing queries on large tables, always use LIMIT:

				
					-- During development
SELECT *
FROM orders
LIMIT 100;  -- Just to see the structure

2. Filter Early:

				
					-- Better: Filter first, then aggregate
SELECT category_id, COUNT(*)
FROM products
WHERE discontinued = 0
GROUP BY category_id;

-- Less efficient: Aggregate everything, then filter
SELECT category_id, COUNT(*)
FROM products
GROUP BY category_id
HAVING discontinued = 0;  -- This doesn't even make logical sense!

3. Be Specific in SELECT:

				
					-- Good
SELECT product_name, unit_price
FROM products;

-- Wasteful
SELECT *
FROM products;  -- Returns all columns you might not need

Summary

You’ve completed a comprehensive journey through SQL fundamentals. Let’s review what you now command:

Data Aggregation & Analysis: You’ve mastered aggregate functions (COUNT, SUM, AVG, MIN, MAX) combined with GROUP BY to analyze data by categories, and HAVING to filter grouped results—enabling you to answer sophisticated business questions like “What’s the average price per category for products with more than 10 items in stock?”

Professional SQL Practices: You understand query execution order (FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT), write readable and well-formatted code with meaningful aliases, and think systematically about converting business questions into optimized SQL queries.

Suggested References

Exercise: SQL Fundamentals for Data Analysts

Using the NYC Yellow Taxi Trip Dataset

Exercise Set Overview

These exercises use the NYC Taxi and Limousine Commission’s Yellow Taxi trip data. You’ll be analyzing real-world transportation data to answer business questions that taxi fleet managers, city planners, and operations teams face daily.

Database Schema Reference:

fact_yellow_trips_raw_2025 – Main trip records table
dim_taxi_zones – Location information (pickup/dropoff zones)
dim_rate_codes – Rate code descriptions
dim_payment_types – Payment method descriptions
dim_vendors – Vendor/provider information
dim_date – Date dimension for time-based analysis

Important:

For each of the tasks below:

write the appropriate query to extract information from the database
generate insights and answer the business question

Part 1: Data Aggregation

Exercise 1.1: Overall Trip Statistics

Business Context: The executive team needs a high-level summary of taxi operations.

Task: Calculate the following metrics across ALL trips (no limit):

Total number of trips (count all records)
Average trip distance (rounded to 2 decimals)
Average fare amount (rounded to 2 decimals)
Minimum fare amount
Maximum fare amount
Average total amount (rounded to 2 decimals)

Display all metrics in a single row with meaningful column aliases.

Learning Objective: Practice aggregate functions (COUNT, AVG, MIN, MAX) with ROUND

Exercise 1.2: Payment Method Distribution

Business Context: Finance needs to understand payment method preferences across all trips.

Task: Group trips by payment_type and calculate:

Count of trips for each payment type
Average fare amount for each payment type (rounded to 2 decimals)
Average tip amount for each payment type (rounded to 2 decimals)
Total revenue (SUM of total_amount) for each payment type (rounded to 2 decimals)

Sort by trip count descending to see most popular payment methods first.

Learning Objective: Practice GROUP BY with multiple aggregates

Exercise 1.3: Vendor Performance Comparison

Business Context: Operations wants to compare performance across different taxi technology vendors.

Task: For each VendorID, calculate:

Total number of trips
Average trip distance (rounded to 2 decimals)
Average fare amount (rounded to 2 decimals)
Average total amount (rounded to 2 decimals)

Sort by vendor ID. Only include vendors with more than 100 trips.

Learning Objective: Practice GROUP BY with HAVING clause

Exercise 1.4: Passenger Load Analysis

Business Context: Fleet planning needs to understand typical passenger counts for vehicle allocation.

Task: Group trips by passenger_count and show:

Passenger count
Number of trips with that passenger count
Average trip distance for that passenger count (rounded to 2 decimals)
Average total amount for that passenger count (rounded to 2 decimals)

Only include passenger counts that appear in at least 50 trips. Sort by passenger count ascending.

Learning Objective: Practice GROUP BY with HAVING for minimum thresholds

Exercise 1.5: High-Value Trip Analysis by Rate Code

Business Context: Pricing team wants to understand revenue patterns across different rate types.

Task: For each RatecodeID, calculate:

Count of trips
Average fare amount (rounded to 2 decimals)
Minimum total amount
Maximum total amount
Total revenue (SUM of total_amount, rounded to 2 decimals)

Only show rate codes with at least 20 trips AND where average fare exceeds $15. Sort by total revenue descending.

Learning Objective: Practice GROUP BY with multiple HAVING conditions

Exercise 1.6: Trip Distance Distribution

Business Context: Operations wants to understand how trips cluster by distance ranges.

Task: Create distance categories and analyze them:

Use CASE to create categories: ‘0-5 miles’, ‘5-10 miles’, ’10-20 miles’, ’20+ miles’
For each category, calculate:
- Count of trips
- Average fare amount (rounded to 2 decimals)
- Average total amount (rounded to 2 decimals)

Group by the distance category. Sort by the category (you’ll need to use the same CASE logic in ORDER BY to sort properly).

Learning Objective: Practice GROUP BY with CASE expressions

Exercise 1.7: Tipping Behavior by Trip Value

Business Context: Marketing wants to understand if expensive trips correlate with better tipping.

Task: Create trip value categories based on total_amount:

‘Budget’ (< $20)
‘Standard’ ($20-50)
‘Premium’ ($50-100)
‘Luxury’ ($100+)

For each category, calculate:

Count of trips
Average tip amount (rounded to 2 decimals)
Average tip percentage (tip_amount / fare_amount * 100, rounded to 2 decimals)

Only include credit card payments (payment_type = 1) where fare_amount > 0. Only show categories with at least 30 trips. Sort by average tip percentage descending.

Learning Objective: Practice GROUP BY with CASE, calculated fields, and HAVING

Exercise 1.8: Peak Pricing Impact Analysis

Business Context: Finance wants to understand how extra charges (peak hour surcharges) affect total trip costs.

Task: Create surcharge categories:

‘No Extra’ (extra = 0)
‘Low Extra’ (extra > 0 and extra <= 1)
‘High Extra’ (extra > 1)

For each category, calculate:

Number of trips
Average fare amount (rounded to 2 decimals)
Average extra charge (rounded to 2 decimals)
Average total amount (rounded to 2 decimals)
Percentage of trips in this category (count / total trips * 100)

Only include categories with at least 100 trips. Sort by average total amount descending.

Learning Objective: Practice GROUP BY with CASE, multiple aggregates, and percentage calculations

Exercise 1.9: Zone-Based Trip Analysis

Business Context: City planners want to understand trip patterns by pickup location.

Task: Group trips by PULocationID (pickup location) and calculate:

Location ID
Total number of pickups from that location
Average trip distance from that location (rounded to 2 decimals)
Average total amount for trips from that location (rounded to 2 decimals)
Maximum trip distance from that location

Only include locations with at least 500 pickups. Sort by number of pickups descending. Show top 20 locations.

Learning Objective: Practice GROUP BY with HAVING, ORDER BY, and LIMIT

Exercise 1.10: Comprehensive Trip Summary with Filters

Business Context: Executive dashboard needs a complex summary of recent high-value operations.

Task: For trips where total_amount >= $30, group by payment_type and calculate:

Payment type
Number of trips
Average trip distance (rounded to 2 decimals)
Average fare amount (rounded to 2 decimals)
Average tip amount (rounded to 2 decimals)
Total revenue (SUM of total_amount, rounded to 2 decimals)
Percentage of total trips (within this filtered set)

Only include payment types with at least 200 trips in this high-value segment. Sort by total revenue descending.

Learning Objective: Practice combining WHERE filters with GROUP BY, HAVING, and complex calculations

Bonus Challenge: Multi-Faceted Analysis

Business Context: The executive team is preparing for a board meeting and needs a comprehensive summary answering: “What defines our most profitable trip segment?”

Task: Create a single query that identifies trips meeting ALL these criteria:

Trip distance between 10 and 30 miles
Total amount over $50
Credit card payment (payment_type = 1)
Tip percentage >= 15%

Display: trip distance, fare amount, tip amount, tip percentage (calculated), total amount, and a category label ‘High-Value Long-Distance’. Calculate the average of all metrics for this segment and display as a summary row (you’ll need UNION for this – if not covered yet, just calculate the individual trip metrics).

Sort by total amount descending. Limit to 50 records.

Learning Objective: Integrate multiple concepts – filtering, calculations, CASE statements, sorting

Exercise Submission Guidelines

For each exercise:

Write the SQL query
Execute it in pgAdmin
Take a screenshot of both the query and the first 10-20 rows of results
For aggregate exercises, include the complete result set (usually just a few summary rows)
Write 1-2 sentences explaining what business insight the results reveal

Example Format:

Exercise 2.1: High-Value Trip Analysis

Query:
[Your SQL code here]

Results:
[Screenshot]

Business Insight:
The top 25 trips range from $100 to $350, with the highest-value trips covering distances of 15-25 miles, suggesting airport or long-distance suburban routes are the most lucrative for drivers.

Submission & Resubmission Guidelines

1. When submitting your exercise, use this naming format:

YourName_Submission_Lesson#.xxx

If you need to revise and resubmit, add a version suffix:

YourName_Submission_Lesson#_v2.xxx

YourName_Submission_Lesson#_v3.xxx

2. Do not overwrite or change the original evaluation entries in the rubric.

Instead, enter your updated responses or corrections in a new “v2” (or later) column in the rubric for mentor review.

Evaluation Rubric

Criteria	Meets Expectations	Needs Improvement	Incomplete/Off-Track
Query Execution	All 10 required queries execute without errors and return correct results	8-9 queries work correctly; 1-2 have minor errors	Fewer than 8 queries execute successfully
Aggregation Logic	Correctly uses COUNT, SUM, AVG, MIN, MAX with GROUP BY; includes all non-aggregated columns in GROUP BY	Aggregates mostly correct but 2-3 queries have missing GROUP BY columns, wrong functions, or calculation errors	Frequent GROUP BY errors, wrong aggregate functions, or fundamental misunderstanding of grouping
HAVING vs WHERE	Properly distinguishes WHERE (row filtering) from HAVING (group filtering); uses each correctly	Understands concept but 1-2 queries place conditions in wrong clause	Consistently confuses WHERE and HAVING or misuses both
Business Insights	Insights reference specific metrics and explain patterns/trends revealed by aggregation	Insights present but don’t fully interpret aggregated results; missing for 2-3 exercises	Most insights missing or show no understanding of aggregation meaning

Student Submissions

Check out recently submitted work by other students to get an idea of what’s required for this Exercise: