Lesson 5: Data Aggregation
Learn basic SQL techniques to answer complex business questions
Estimated Read Time: 2 - 3 Hours
Learning Goals
In this lesson, you will learn to:
- Use aggregate functions to calculate summary statistics
- Group data to analyze patterns across categories
- Combine multiple SQL techniques to answer complex business questions
1. Data Aggregation
Individual records tell us facts. Aggregated data tells us stories. Aggregation is the foundation of business intelligence – it transforms raw transactions into actionable insights. In this lesson, you’ll learn how to aggregate data and answer complex business questions.
1.1. Aggregate Functions Overview
SQL provides five essential aggregate functions that summarize data:
| Function | Purpose | Works On |
|---|---|---|
| COUNT() | Counts rows | Any data type |
| SUM() | Adds values | Numeric types only |
| AVG() | Calculates average | Numeric types only |
| MIN() | Finds minimum | Numeric, text, dates |
| MAX() | Finds maximum | Numeric, text, dates |
Critical Understanding: Aggregate functions process multiple rows and return a single value.
1.2. COUNT: Counting Rows and Values
COUNT is the most versatile aggregate function. It answers questions like “How many?” and “How much?”
Three Forms of COUNT:
1. COUNT(*) – Count All Rows:
-- Business Context: "How many products do we have in total?"
SELECT COUNT(*) AS total_products
FROM products;
COUNT(*) counts every row, including those with NULL values.
2. COUNT(column_name) – Count Non-NULL Values:
-- Business Context: "How many customers have a region specified?"
SELECT COUNT(region) AS customers_with_region
FROM customers;
COUNT(column_name) ignores NULL values in that column.
3. COUNT(DISTINCT column_name) – Count Unique Values:
-- Business Context: "How many different countries do we sell to?"
SELECT COUNT(DISTINCT country) AS countries_served
FROM customers;
Practical Comparison:
SELECT
COUNT(*) AS total_customers,
COUNT(region) AS customers_with_region,
COUNT(DISTINCT country) AS unique_countries
FROM customers;
Result:
- total_customers: 91
- customers_with_region: 31 (some customers don’t have a region)
- unique_countries: 21
1.3. SUM: Adding Numeric Values
SUM calculates the total of a numeric column.
Business Context: “What’s the total value of all inventory we’re holding?”
SELECT SUM(units_in_stock) AS total_inventory_units
FROM products;
More Complex Example with Calculation:
Business Context: “Calculate the total value of our inventory at current prices.”
SELECT
SUM(units_in_stock * unit_price) AS total_inventory_value
FROM products
WHERE discontinued = 0; -- Only active products
Important
SUM ignores NULL values. If units_in_stock is NULL for some products, those rows don’t affect the sum.
1.4. AVG: Calculating Averages
AVG computes the arithmetic mean of numeric values.
Business Context: “What’s our average product price?”
SELECT AVG(unit_price) AS average_price
FROM products;
Controlling Decimal Places:
SELECT
ROUND(AVG(unit_price), 2) AS average_price
FROM products;
ROUND(value, 2) rounds to 2 decimal places.
Business Insight Example:
-- Compare individual prices to average
SELECT
product_name,
unit_price,
ROUND(AVG(unit_price) OVER (), 2) AS avg_price,
CASE
WHEN unit_price > AVG(unit_price) OVER () THEN 'Above Average'
ELSE 'Below Average'
END AS price_position
FROM products
ORDER BY unit_price DESC;
(Don’t worry if the OVER() syntax looks unfamiliar—you’ll learn window functions in advanced lessons.)
1.5. MIN and MAX: Finding Extremes
MIN and MAX identify the smallest and largest values.
Business Context: “What are our cheapest and most expensive products?”
SELECT
MIN(unit_price) AS cheapest_product_price,
MAX(unit_price) AS most_expensive_product_price
FROM products;
Finding the Actual Products (Not Just Prices):
-- Cheapest product
SELECT product_name, unit_price
FROM products
WHERE unit_price = (SELECT MIN(unit_price) FROM products);
-- Most expensive product
SELECT product_name, unit_price
FROM products
WHERE unit_price = (SELECT MAX(unit_price) FROM products);
MIN and MAX with Dates:
Business Context: “When was our first and most recent order?”
SELECT
MIN(order_date) AS first_order,
MAX(order_date) AS most_recent_order
FROM orders;
1.6. GROUP BY: The Foundation of Analytical Thinking
GROUP BY is where SQL becomes truly powerful for analysis. It lets you split data into categories and compute aggregate statistics for each category.
Conceptual Understanding:
Think of GROUP BY as creating buckets:
- SQL examines your grouping column(s)
- It creates a separate “bucket” for each unique value
- It puts all rows with that value into the corresponding bucket
- It calculates aggregates for each bucket separately
- It returns one result row per bucket
Basic Syntax:
SELECT
grouping_column,
AGGREGATE_FUNCTION(column) AS alias
FROM table_name
GROUP BY grouping_column;
Business Context 1: “How many products are in each category?”
SELECT
category_id,
COUNT(*) AS product_count
FROM products
GROUP BY category_id
ORDER BY product_count DESC;
Business Context 2: “What’s the average product price in each category?”
SELECT
category_id,
COUNT(*) AS product_count,
ROUND(AVG(unit_price), 2) AS average_price,
MIN(unit_price) AS cheapest,
MAX(unit_price) AS most_expensive
FROM products
GROUP BY category_id
ORDER BY average_price DESC;
Multiple Column Grouping:
Business Context: “Analyze our customer distribution by country and city.”
SELECT
country,
city,
COUNT(*) AS customer_count
FROM customers
GROUP BY country, city
ORDER BY country, customer_count DESC;
This creates buckets for each unique country-city combination.
The Golden Rule of GROUP BY:
Every column in your SELECT that isn’t inside an aggregate function MUST appear in the GROUP BY clause.
Wrong:
-- This will cause an error
SELECT category_id, product_name, COUNT(*)
FROM products
GROUP BY category_id;
Why? SQL doesn’t know which product_name to show for each category (since each category has multiple products).
Correct Option 1 – Add to GROUP BY:
SELECT category_id, product_name, COUNT(*)
FROM products
GROUP BY category_id, product_name;
Correct Option 2 – Use Only in Aggregate:
SELECT category_id, COUNT(product_name)
FROM products
GROUP BY category_id;
1.7. HAVING: Filtering Grouped Data
WHERE filters individual rows before grouping.
HAVING filters grouped results after aggregation.
The Distinction:
FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY
- WHERE: “Which rows should I include in the calculation?”
- HAVING: “Which group results should I show?”
Business Context: “Show me only categories with more than 10 products.”
SELECT
category_id,
COUNT(*) AS product_count
FROM products
GROUP BY category_id
HAVING COUNT(*) > 10
ORDER BY product_count DESC;
Why Not Use WHERE?
-- WRONG - This won't work
SELECT category_id, COUNT(*)
FROM products
WHERE COUNT(*) > 10 -- X Can't use aggregates in WHERE
GROUP BY category_id;
-- CORRECT
SELECT category_id, COUNT(*)
FROM products
GROUP BY category_id
HAVING COUNT(*) > 10; -- ✓ HAVING is for aggregate conditions
Combining WHERE and HAVING:
Business Context: “Among non-discontinued products, show categories where average price exceeds $30.”
SELECT
category_id,
COUNT(*) AS product_count,
ROUND(AVG(unit_price), 2) AS avg_price
FROM products
WHERE discontinued = 0 -- Filter rows before grouping
GROUP BY category_id
HAVING AVG(unit_price) > 30 -- Filter groups after aggregation
ORDER BY avg_price DESC;
Decision Tree for WHERE vs HAVING:
Does the condition involve an aggregate function (COUNT, SUM, AVG, etc.)?
├─ NO → Use WHERE (filters rows before grouping)
└─ YES → Use HAVING (filters groups after aggregation)
2. Real-World Business Scenarios
Let’s apply everything you’ve learned so far to solve a business problem:
Scenario: Geographic Market Analysis
Business Question: “Analyze customer concentration across countries. Show countries with at least 5 customers, along with the count and percentage of total customers.”
Solution: Follow the video walkthrough to learn how to convert the business question in analytical terms and answer using SQL.
Video 1 – Answer Geographic Market Analysis question using SQL
3. Query Optimization and Best Practices
3.1. Understanding Query Execution Order
SQL executes your query in a specific order, regardless of how you write it:
1. FROM - Identifies tables
2. WHERE - Filters individual rows
3. GROUP BY - Creates groups
4. HAVING - Filters groups
5. SELECT - Defines output columns
6. DISTINCT - Removes duplicates
7. ORDER BY - Sorts results
8. LIMIT - Restricts row count
Why This Matters:
- You can’t use SELECT aliases in WHERE (they don’t exist yet)
- You CAN use SELECT aliases in ORDER BY (they exist by then)
- HAVING comes after GROUP BY (because it filters groups)
Example:
-- This works
SELECT unit_price * 1.2 AS price_with_tax
FROM products
ORDER BY price_with_tax; -- Alias exists here
-- This doesn't work
SELECT unit_price * 1.2 AS price_with_tax
FROM products
WHERE price_with_tax > 50; -- ❌ Alias doesn't exist yet
-- Correct version
SELECT unit_price * 1.2 AS price_with_tax
FROM products
WHERE unit_price * 1.2 > 50; -- ✓ Use the calculation
3.2. Performance Considerations
1. Use LIMIT During Development:
When testing queries on large tables, always use LIMIT:
-- During development
SELECT *
FROM orders
LIMIT 100; -- Just to see the structure
2. Filter Early:
-- Better: Filter first, then aggregate
SELECT category_id, COUNT(*)
FROM products
WHERE discontinued = 0
GROUP BY category_id;
-- Less efficient: Aggregate everything, then filter
SELECT category_id, COUNT(*)
FROM products
GROUP BY category_id
HAVING discontinued = 0; -- This doesn't even make logical sense!
3. Be Specific in SELECT:
-- Good
SELECT product_name, unit_price
FROM products;
-- Wasteful
SELECT *
FROM products; -- Returns all columns you might not need
Summary
You’ve completed a comprehensive journey through SQL fundamentals. Let’s review what you now command:
- Data Aggregation & Analysis: You’ve mastered aggregate functions (COUNT, SUM, AVG, MIN, MAX) combined with GROUP BY to analyze data by categories, and HAVING to filter grouped results—enabling you to answer sophisticated business questions like “What’s the average price per category for products with more than 10 items in stock?”
- Professional SQL Practices: You understand query execution order (FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT), write readable and well-formatted code with meaningful aliases, and think systematically about converting business questions into optimized SQL queries.
Suggested References
Exercise: SQL Fundamentals for Data Analysts
Using the NYC Yellow Taxi Trip Dataset
Exercise Set Overview
These exercises use the NYC Taxi and Limousine Commission’s Yellow Taxi trip data. You’ll be analyzing real-world transportation data to answer business questions that taxi fleet managers, city planners, and operations teams face daily.
Database Schema Reference:
- fact_yellow_trips_raw_2025 – Main trip records table
- dim_taxi_zones – Location information (pickup/dropoff zones)
- dim_rate_codes – Rate code descriptions
- dim_payment_types – Payment method descriptions
- dim_vendors – Vendor/provider information
- dim_date – Date dimension for time-based analysis
Important:
For each of the tasks below:
- write the appropriate query to extract information from the database
- generate insights and answer the business question
Part 1: Data Aggregation
Exercise 1.1: Overall Trip Statistics
Business Context: The executive team needs a high-level summary of taxi operations.
Task: Calculate the following metrics across ALL trips (no limit):
- Total number of trips (count all records)
- Average trip distance (rounded to 2 decimals)
- Average fare amount (rounded to 2 decimals)
- Minimum fare amount
- Maximum fare amount
- Average total amount (rounded to 2 decimals)
Display all metrics in a single row with meaningful column aliases.
Learning Objective: Practice aggregate functions (COUNT, AVG, MIN, MAX) with ROUND
Exercise 1.2: Payment Method Distribution
Business Context: Finance needs to understand payment method preferences across all trips.
Task: Group trips by payment_type and calculate:
- Count of trips for each payment type
- Average fare amount for each payment type (rounded to 2 decimals)
- Average tip amount for each payment type (rounded to 2 decimals)
- Total revenue (SUM of total_amount) for each payment type (rounded to 2 decimals)
Sort by trip count descending to see most popular payment methods first.
Learning Objective: Practice GROUP BY with multiple aggregates
Exercise 1.3: Vendor Performance Comparison
Business Context: Operations wants to compare performance across different taxi technology vendors.
Task: For each VendorID, calculate:
- Total number of trips
- Average trip distance (rounded to 2 decimals)
- Average fare amount (rounded to 2 decimals)
- Average total amount (rounded to 2 decimals)
Sort by vendor ID. Only include vendors with more than 100 trips.
Learning Objective: Practice GROUP BY with HAVING clause
Exercise 1.4: Passenger Load Analysis
Business Context: Fleet planning needs to understand typical passenger counts for vehicle allocation.
Task: Group trips by passenger_count and show:
- Passenger count
- Number of trips with that passenger count
- Average trip distance for that passenger count (rounded to 2 decimals)
- Average total amount for that passenger count (rounded to 2 decimals)
Only include passenger counts that appear in at least 50 trips. Sort by passenger count ascending.
Learning Objective: Practice GROUP BY with HAVING for minimum thresholds
Exercise 1.5: High-Value Trip Analysis by Rate Code
Business Context: Pricing team wants to understand revenue patterns across different rate types.
Task: For each RatecodeID, calculate:
- Count of trips
- Average fare amount (rounded to 2 decimals)
- Minimum total amount
- Maximum total amount
- Total revenue (SUM of total_amount, rounded to 2 decimals)
Only show rate codes with at least 20 trips AND where average fare exceeds $15. Sort by total revenue descending.
Learning Objective: Practice GROUP BY with multiple HAVING conditions
Exercise 1.6: Trip Distance Distribution
Business Context: Operations wants to understand how trips cluster by distance ranges.
Task: Create distance categories and analyze them:
- Use CASE to create categories: ‘0-5 miles’, ‘5-10 miles’, ’10-20 miles’, ’20+ miles’
- For each category, calculate:
- Count of trips
- Average fare amount (rounded to 2 decimals)
- Average total amount (rounded to 2 decimals)
Group by the distance category. Sort by the category (you’ll need to use the same CASE logic in ORDER BY to sort properly).
Learning Objective: Practice GROUP BY with CASE expressions
Exercise 1.7: Tipping Behavior by Trip Value
Business Context: Marketing wants to understand if expensive trips correlate with better tipping.
Task: Create trip value categories based on total_amount:
- ‘Budget’ (< $20)
- ‘Standard’ ($20-50)
- ‘Premium’ ($50-100)
- ‘Luxury’ ($100+)
For each category, calculate:
- Count of trips
- Average tip amount (rounded to 2 decimals)
- Average tip percentage (tip_amount / fare_amount * 100, rounded to 2 decimals)
Only include credit card payments (payment_type = 1) where fare_amount > 0. Only show categories with at least 30 trips. Sort by average tip percentage descending.
Learning Objective: Practice GROUP BY with CASE, calculated fields, and HAVING
Exercise 1.8: Peak Pricing Impact Analysis
Business Context: Finance wants to understand how extra charges (peak hour surcharges) affect total trip costs.
Task: Create surcharge categories:
- ‘No Extra’ (extra = 0)
- ‘Low Extra’ (extra > 0 and extra <= 1)
- ‘High Extra’ (extra > 1)
For each category, calculate:
- Number of trips
- Average fare amount (rounded to 2 decimals)
- Average extra charge (rounded to 2 decimals)
- Average total amount (rounded to 2 decimals)
- Percentage of trips in this category (count / total trips * 100)
Only include categories with at least 100 trips. Sort by average total amount descending.
Learning Objective: Practice GROUP BY with CASE, multiple aggregates, and percentage calculations
Exercise 1.9: Zone-Based Trip Analysis
Business Context: City planners want to understand trip patterns by pickup location.
Task: Group trips by PULocationID (pickup location) and calculate:
- Location ID
- Total number of pickups from that location
- Average trip distance from that location (rounded to 2 decimals)
- Average total amount for trips from that location (rounded to 2 decimals)
- Maximum trip distance from that location
Only include locations with at least 500 pickups. Sort by number of pickups descending. Show top 20 locations.
Learning Objective: Practice GROUP BY with HAVING, ORDER BY, and LIMIT
Exercise 1.10: Comprehensive Trip Summary with Filters
Business Context: Executive dashboard needs a complex summary of recent high-value operations.
Task: For trips where total_amount >= $30, group by payment_type and calculate:
- Payment type
- Number of trips
- Average trip distance (rounded to 2 decimals)
- Average fare amount (rounded to 2 decimals)
- Average tip amount (rounded to 2 decimals)
- Total revenue (SUM of total_amount, rounded to 2 decimals)
- Percentage of total trips (within this filtered set)
Only include payment types with at least 200 trips in this high-value segment. Sort by total revenue descending.
Learning Objective: Practice combining WHERE filters with GROUP BY, HAVING, and complex calculations
Bonus Challenge: Multi-Faceted Analysis
Business Context: The executive team is preparing for a board meeting and needs a comprehensive summary answering: “What defines our most profitable trip segment?”
Task: Create a single query that identifies trips meeting ALL these criteria:
- Trip distance between 10 and 30 miles
- Total amount over $50
- Credit card payment (payment_type = 1)
- Tip percentage >= 15%
Display: trip distance, fare amount, tip amount, tip percentage (calculated), total amount, and a category label ‘High-Value Long-Distance’. Calculate the average of all metrics for this segment and display as a summary row (you’ll need UNION for this – if not covered yet, just calculate the individual trip metrics).
Sort by total amount descending. Limit to 50 records.
Learning Objective: Integrate multiple concepts – filtering, calculations, CASE statements, sorting
Exercise Submission Guidelines
For each exercise:
- Write the SQL query
- Execute it in pgAdmin
- Take a screenshot of both the query and the first 10-20 rows of results
- For aggregate exercises, include the complete result set (usually just a few summary rows)
- Write 1-2 sentences explaining what business insight the results reveal
Example Format:
Exercise 2.1: High-Value Trip Analysis
Query:
[Your SQL code here]
Results:
[Screenshot]
Business Insight:
The top 25 trips range from $100 to $350, with the highest-value trips covering distances of 15-25 miles, suggesting airport or long-distance suburban routes are the most lucrative for drivers.
Submission & Resubmission Guidelines
1. When submitting your exercise, use this naming format:
YourName_Submission_Lesson#.xxx
If you need to revise and resubmit, add a version suffix:
YourName_Submission_Lesson#_v2.xxxYourName_Submission_Lesson#_v3.xxx
2. Do not overwrite or change the original evaluation entries in the rubric.
Instead, enter your updated responses or corrections in a new “v2” (or later) column in the rubric for mentor review.
Evaluation Rubric
| Criteria | Meets Expectations | Needs Improvement | Incomplete/Off-Track |
|---|---|---|---|
| Query Execution | All 10 required queries execute without errors and return correct results | 8-9 queries work correctly; 1-2 have minor errors | Fewer than 8 queries execute successfully |
| Aggregation Logic | Correctly uses COUNT, SUM, AVG, MIN, MAX with GROUP BY; includes all non-aggregated columns in GROUP BY | Aggregates mostly correct but 2-3 queries have missing GROUP BY columns, wrong functions, or calculation errors | Frequent GROUP BY errors, wrong aggregate functions, or fundamental misunderstanding of grouping |
| HAVING vs WHERE | Properly distinguishes WHERE (row filtering) from HAVING (group filtering); uses each correctly | Understands concept but 1-2 queries place conditions in wrong clause | Consistently confuses WHERE and HAVING or misuses both |
| Business Insights | Insights reference specific metrics and explain patterns/trends revealed by aggregation | Insights present but don’t fully interpret aggregated results; missing for 2-3 exercises | Most insights missing or show no understanding of aggregation meaning |
Contact
Talk to us
Have questions or feedback about Lumen? We’d love to hear from you.