Lesson 7: Subqueries in SQL

Learn how to construct nested queries.

Estimated Read Time: 2 - 3 Hours

Learning Goals

In this lesson, you will learn to:

  • Construct subqueries in SELECT, FROM, and WHERE clauses to create calculated columns, filter datasets, and build temporary result sets
  • Differentiate between correlated and non-correlated subqueries and apply each appropriately
  • Apply the IN, EXISTS, and NOT IN operators to efficiently filter data using subquery results
  • Build derived tables to isolate complex logic and create reusable result sets

1. Introduction: The Power of Nested Logic

Over the past six lessons, you have built a formidable toolkit of SQL capabilities. You can extract and filter data, aggregate metrics, and join multiple tables to create comprehensive analytical views. These skills enable you to answer the vast majority of straightforward business questions that analysts encounter daily.

However, there exists a class of business questions that cannot be answered through simple, linear SQL queries—questions that require iterative logic, dynamic filtering based on calculated results, or comparisons against aggregated subsets of data. Consider these scenarios that a data analyst at Northwind Traders might face:

  • “Which products have unit prices higher than the average price in their category?”
  • “Who are our top 5 customers by revenue, and what percentage of total company revenue do they represent?”
  • “Which employees have processed more orders than the average employee in their region?”
  • “Show me all orders that contain products from our most profitable category.”

 

Each of these questions contains an implicit two-step logic:

  1. Calculate something (an average, a ranking, a filtered subset)
  2. Use that calculated result to filter or enhance the main query

 

This is where subqueries become essential. A subquery—also called a nested query or inner query—is a SELECT statement embedded within another SQL statement (the outer query or main query). The subquery executes first, produces a temporary result set, and then feeds that result into the outer query.

Subqueries enable you to:

  • Create calculated columns based on aggregated data from other tables
  • Filter records dynamically using computed values rather than static criteria
  • Build complex comparison logic that would be impossible with WHERE clauses alone
  • Decompose intricate business logic into readable, step-by-step operations

 

By the end of this lesson, you will be able to construct queries that solve business problems requiring multiple levels of logic, positioning you to tackle analyst responsibilities that demand advanced SQL proficiency.

 

2. Subquery Fundamentals: Structure and Execution

Before exploring different types of subqueries and their applications, we must establish a clear mental model of how subqueries work.

2.1. Anatomy of a Subquery

A subquery follows this fundamental structure:

				
					SELECT column1, column2
FROM table1
WHERE column3 OPERATOR (
    SELECT column4
    FROM table2
    WHERE condition
);
				
			

Key components:

  1. Parentheses: Subqueries must always be enclosed in parentheses. This distinguishes them from the main query and signals to the database engine that the enclosed SELECT statement should be executed independently.
  2. Independence: In most cases, a subquery can be executed as a standalone query. You can test and verify subquery logic by running it separately before embedding it in the outer query.
  3. Execution Order: The database engine evaluates the subquery first (from innermost to outermost if there are nested levels), then uses its results in the outer query.
  4. Return Value Types: Subqueries can return:
    • A single value (scalar subquery): Used with comparison operators (=, >, <, etc.)
    • A single column of values (column subquery): Used with IN, EXISTS
    • A complete result set (table subquery): Used in FROM clauses to create derived tables

 

2.2. Correlated vs. Non-Correlated Subqueries

Subqueries fall into two fundamental categories based on their relationship with the outer query:

 

Non-Correlated Subqueries (Independent):

A non-correlated subquery can execute completely independently of the outer query. It has no reference to columns from the outer query’s tables.

				
					-- Example: Non-correlated subquery
SELECT product_name, unit_price
FROM products
WHERE unit_price > (
    SELECT AVG(unit_price)
    FROM products
);
				
			

The inner query SELECT AVG(unit_price) FROM products executes once, calculates the average, and returns a single value. The outer query then uses that value for comparison.

 

Execution characteristics:

  • Executes once before the outer query runs
  • More efficient for large datasets
  • Preferred when possible

Correlated Subqueries (Dependent):

A correlated subquery references columns from the outer query, creating a dependency relationship. It must re-execute for each row processed by the outer query.

				
					-- Example: Correlated subquery
SELECT p.product_name, p.unit_price, c.category_name
FROM products p
INNER JOIN categories c ON p.category_id = c.category_id
WHERE p.unit_price > (
    SELECT AVG(p2.unit_price)
    FROM products p2
    WHERE p2.category_id = p.category_id  -- References outer query
);
				
			

The inner query references p.category_id from the outer query’s products table. This means the subquery must execute repeatedly—once for each product—calculating the average price for that specific product’s category each time.

 

Execution characteristics:

  • Executes repeatedly (once per row in outer query)
  • More resource-intensive
  • Necessary for row-by-row comparative logic

 

Performance Note: Correlated subqueries can be slow on large datasets because they execute repeatedly. For tables with millions of rows, consider joins or window functions (covered in later Lessons) as alternatives.

 

3. Subqueries in the WHERE Clause: Dynamic Filtering

The WHERE clause is the most common location for subqueries, enabling dynamic filtering based on calculated criteria. This approach is invaluable when filter conditions depend on aggregated values or complex logic.

 

3.1. Scalar Subqueries with Comparison Operators

A scalar subquery returns exactly one value (one row, one column). This value can be used with standard comparison operators: =, >, <, >=, <=, !=.

 

Business Case: Products Priced Above Average

The product management team wants to identify products priced above the company-wide average to understand our premium product portfolio.

				
					-- Find products priced above the overall average
-- Business purpose: Identify premium products for strategic positioning

SELECT 
    product_id,
    product_name,
    unit_price,
    units_in_stock,
    discontinued
FROM products
WHERE unit_price > (
    SELECT AVG(unit_price)
    FROM products
)
ORDER BY unit_price DESC;
				
			

Query Analysis:

  1. Inner Query Execution: SELECT AVG(unit_price) FROM products calculates the average unit price across all products (returns 28.87).
  2. Outer Query Execution: The database then executes WHERE unit_price > 28.87, filtering products to show only those exceeding this average.
  3. Business Value: This immediately identifies the 43 products that are positioned as premium offerings, helping pricing strategy and inventory management decisions.

 

Important Constraint: If the subquery returns multiple rows when you use =, >, < operators, the query will fail. The database expects exactly one value for comparison.

 

3.2. Using IN with Column Subqueries

When your subquery returns multiple values (a column of results), you must use the IN operator instead of = .

 

Business Case: Products from Top Revenue Categories

Finance wants to focus marketing efforts on products from the top 3 revenue-generating categories.

				
					-- Identify products in top 3 revenue-generating categories
-- Business purpose: Target marketing budget on highest-performing categories

SELECT 
    p.product_name,
    c.category_name,
    p.unit_price,
    p.units_on_order
FROM products p
INNER JOIN categories c ON p.category_id = c.category_id
WHERE p.category_id IN (
    SELECT cat.category_id
    FROM categories cat
    INNER JOIN products prod ON cat.category_id = prod.category_id
    INNER JOIN order_details od ON prod.product_id = od.product_id
    GROUP BY cat.category_id
    ORDER BY SUM(od.unit_price * od.quantity * (1 - od.discount)) DESC
    LIMIT 3
)
ORDER BY c.category_name, p.product_name;
				
			

Query Breakdown:

Subquery Logic (innermost):

  1. Joins categories → products → order_details
  2. Calculates total revenue per category
  3. Sorts categories by revenue descending
  4. Returns the category_id values for the top 3 categories (e.g., categories 1, 4, and 8)

 

Outer Query Logic:

  1. Selects product information
  2. Filters to only products WHERE category_id IN (1, 4, 8)
  3. The IN operator allows matching against multiple values

 

Business Application: Marketing can now focus their campaigns on products within the top-performing categories, optimizing advertising ROI.

 

3.3. Using NOT IN for Exclusion Logic

The inverse operator NOT IN identifies records that do not match any value in the subquery results.

 

Business Case: Customers Who Have Never Ordered Certain Products

The marketing team wants to run a targeted campaign for customers who have never purchased products from the Condiments category, offering them a first-time discount.

				
					-- Find customers who have never ordered condiments
-- Business purpose: Target non-customers of specific category for acquisition campaign

SELECT 
    customer_id,
    company_name,
    contact_name,
    city,
    country
FROM customers
WHERE customer_id NOT IN (
    SELECT DISTINCT o.customer_id
    FROM orders o
    INNER JOIN order_details od ON o.order_id = od.order_id
    INNER JOIN products p ON od.product_id = p.product_id
    INNER JOIN categories c ON p.category_id = c.category_id
    WHERE c.category_name = 'Condiments'
      AND o.customer_id IS NOT NULL  -- Important: handle NULLs
)
ORDER BY country, company_name;
				
			

Subquery Purpose: Returns a list of all customer IDs who have purchased condiments at least once.

 

Outer Query Purpose: Shows all customers NOT in that list—customers who are untapped for this product category.

 

Critical NULL Handling: NOT IN with NULL values can produce unexpected results. If the subquery returns any NULL values, the entire NOT IN condition evaluates to NULL (not TRUE or FALSE), and no rows will be returned. Always filter out NULLs in NOT IN subqueries using IS NOT NULL, or use NOT EXISTS instead (covered in section 3.5).

 

3.4. Correlated Subqueries in WHERE Clause

Correlated subqueries enable row-by-row comparison logic that would be impossible with non-correlated subqueries.

 

Business Case: Products Priced Above Their Category Average

Product managers want to identify products that are premium-positioned within their specific category (not company-wide), to evaluate category-level pricing strategies.

				
					-- Find products priced above their own category's average
-- Business purpose: Identify category-specific premium products

SELECT 
    p.product_id,
    p.product_name,
    c.category_name,
    p.unit_price,
    (
        SELECT ROUND(AVG(p2.unit_price)::numeric, 2)
        FROM products p2
        WHERE p2.category_id = p.category_id
    ) AS category_avg_price
FROM products p
INNER JOIN categories c ON p.category_id = c.category_id
WHERE p.unit_price > (
    SELECT AVG(p2.unit_price)
    FROM products p2
    WHERE p2.category_id = p.category_id  -- Correlated reference
)
ORDER BY c.category_name, p.unit_price DESC;
				
			

How it works: For each product, the subquery calculates that product’s category average price, then the WHERE clause compares the product price against it. This process repeats for every product row.

 

Performance Note: On large datasets, correlated subqueries can be slow since they execute repeatedly. We’ll learn more efficient alternatives in future lessons.

 

Business Insight: The results show products like “Côte de Blaye” (wine) priced at $263.50 against a Beverages category average of $37.98, indicating extreme premium positioning that may warrant review.

 

3.5. EXISTS and NOT EXISTS: Checking for Presence

The EXISTS operator tests whether a subquery returns any rows at all. It doesn’t care about the actual values—only whether results exist.

 

Business Case: Identify Customers with Recent Activity

The sales team wants to focus account management efforts on customers who have placed orders in the last 90 days, indicating active engagement.

				
					-- Find customers with orders in the last 90 days
-- Business purpose: Prioritize relationship management for active accounts

SELECT 
    c.customer_id,
    c.company_name,
    c.contact_name,
    c.city,
    c.country
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.order_date >= CURRENT_DATE - INTERVAL '90 days'
)
ORDER BY c.country, c.company_name;
				
			

How EXISTS Works:

  1. For each customer, the subquery checks: “Are there any orders for this customer in the last 90 days?”
  2. If yes (even one order exists), the customer is included in results
  3. If no, the customer is excluded
  4. Performance advantage: EXISTS stops searching as soon as it finds the first matching row—it doesn’t need to retrieve all matches

 

EXISTS vs. IN:

  • EXISTS is generally faster for large datasets because it stops at the first match
  • IN must retrieve all matching values and then perform comparisons
  • EXISTS handles NULLs more gracefully than IN

 

NOT EXISTS for Gap Analysis:

Conversely, NOT EXISTS identifies records where no related records exist.

 

Business Case: Categories with No Discontinued Products

Quality assurance wants to identify product categories where we’ve never had to discontinue any products, indicating reliable supplier relationships.

				
					-- Find categories with no discontinued products
-- Business purpose: Identify high-reliability product categories

SELECT 
    c.category_id,
    c.category_name,
    c.description
FROM categories c
WHERE NOT EXISTS (
    SELECT 1
    FROM products p
    WHERE p.category_id = c.category_id
      AND p.discontinued = 1
)
ORDER BY c.category_name;
				
			

This returns categories where the entire product lineup is currently active—no discontinuations ever occurred.

 

4. Subqueries in the SELECT Clause: Calculated Columns

Subqueries in the SELECT clause create calculated columns based on related data. Each subquery must return a single scalar value and executes once per row in the result set.

 

Business Case: Product Comparison to Category Average

Product managers want a report showing each product alongside its category’s average price for quick visual comparison.

				
					-- Show products with their category average price for comparison
-- Business purpose: Enable quick pricing strategy evaluation

SELECT 
    p.product_id,
    p.product_name,
    c.category_name,
    p.unit_price AS product_price,
    (
        SELECT ROUND(AVG(p2.unit_price)::numeric, 2)
        FROM products p2
        WHERE p2.category_id = p.category_id
    ) AS category_avg_price,
    ROUND((p.unit_price / (
        SELECT AVG(p2.unit_price)
        FROM products p2
        WHERE p2.category_id = p.category_id
    ) * 100)::numeric, 1) AS percent_of_category_avg
FROM products p
INNER JOIN categories c ON p.category_id = c.category_id
WHERE p.discontinued = 0
ORDER BY c.category_name, p.unit_price DESC;
				
			

Query Breakdown:

  1. First subquery (in SELECT): Calculates the average price for each product’s category
  2. Second subquery (in calculation): Used to compute the percentage comparison
  3. Execution: Both subqueries run once for each product row

 

Performance Warning: Notice that we’ve calculated the category average twice in this query. Each calculation requires a separate subquery execution. For better performance on larger datasets, use CTEs (Common Table Expressions, covered in Lesson 8) to calculate once and reference multiple times.

 

Business Insight: The results immediately reveal products priced at 150-250% of their category average, highlighting aggressive premium positioning that may require justification or repricing.

 

5. Subqueries in the FROM Clause: Derived Tables

When you place a subquery in the FROM clause, you create a derived table (also called an inline view)—a temporary result set that the outer query treats as if it were an actual table.

Requirements for derived tables:

  1. Must be enclosed in parentheses
  2. Must have an alias (required in PostgreSQL and most databases)
  3. Can include any valid SELECT statement with its own JOINs, filters, and aggregations

 

Business Case: Top Customers Revenue Analysis

The executive team wants to analyze the characteristics of the top 10 customers by revenue—where they’re located, which employees manage them, and their average order value.

				
					-- Analyze top revenue customers in detail
-- Business purpose: Understand high-value customer profiles

SELECT 
    top_customers.customer_id,
    top_customers.company_name,
    top_customers.total_revenue,
    top_customers.order_count,
    ROUND((top_customers.total_revenue / top_customers.order_count)::numeric, 2) AS avg_order_value,
    c.city,
    c.country,
    c.contact_name,
    COUNT(DISTINCT e.employee_id) AS employees_serving
FROM (
    -- Subquery: Calculate revenue and order metrics for all customers
    SELECT 
        c.customer_id,
        c.company_name,
        COUNT(DISTINCT o.order_id) AS order_count,
        SUM(od.unit_price * od.quantity * (1 - od.discount)) AS total_revenue
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    INNER JOIN order_details od ON o.order_id = od.order_id
    GROUP BY c.customer_id, c.company_name
    ORDER BY total_revenue DESC
    LIMIT 10
) AS top_customers  -- Derived table alias is required
INNER JOIN customers c ON top_customers.customer_id = c.customer_id
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN employees e ON o.employee_id = e.employee_id
GROUP BY 
    top_customers.customer_id,
    top_customers.company_name,
    top_customers.total_revenue,
    top_customers.order_count,
    c.city,
    c.country,
    c.contact_name
ORDER BY top_customers.total_revenue DESC;
				
			

Execution Flow:

  1. Inner Subquery executes first:
    • Calculates total revenue and order count for each customer
    • Ranks customers by revenue descending
    • Returns only the top 10
    • This creates a temporary table called top_customers
  2. Outer Query then executes:
    • Treats top_customers as a regular table
    • Joins it back to the actual customer, order, and employee tables
    • Enriches the top 10 with additional contextual information

 

Why Use This Approach?

Without the subquery, calculating “top 10” after joining to employees would give incorrect results—the join would multiply rows, and LIMIT 10 would return 10 order-employee combinations, not 10 customers.

The subquery isolates the ranking logic, ensuring we genuinely get the top 10 customers before enriching with additional details.

Note: Derived tables can also filter on aggregated results. While HAVING within a single query often works too, derived tables provide more flexibility for complex transformations.

 

6. Practical Guidelines and Best Practices

6.1. When to Choose Subqueries

Use subqueries when:

  1. You need a calculated value for filtering
    • “Products priced above the company average”
    • “Customers who ordered more than the median order count”
  2. Logic is clearer with step-by-step decomposition
    • “Find top 5 customers, then show their order details”
    • Subqueries make the intent explicit
  3. You’re checking for existence/absence of relationships
    • “Customers who have never ordered”
    • EXISTS is purpose-built for this
  4. You need to filter on aggregated results
    • “Categories where average product price > $50”

 

Avoid subqueries when:

  1. A simple JOIN accomplishes the same goal more efficiently
  2. The subquery is correlated and processes millions of rows
  3. You’re calculating the same thing multiple times (use CTEs instead – Lesson 8)

 

6.2. Formatting Best Practices

1. Use clear indentation to show hierarchy:

				
					-- GOOD: Clear structure
SELECT product_name, unit_price
FROM products
WHERE category_id IN (
    SELECT category_id
    FROM categories
    WHERE category_name LIKE '%Food%'
);

-- BAD: Difficult to read
SELECT product_name, unit_price FROM products WHERE category_id IN (SELECT category_id FROM categories WHERE category_name LIKE '%Food%');
				
			

 

2. Add comments explaining subquery purpose:

				
					SELECT product_name, unit_price
FROM products
WHERE unit_price > (
    -- Calculate company-wide average as pricing benchmark
    SELECT AVG(unit_price)
    FROM products
);
				
			
6.3. Debugging Subqueries

1. Test subqueries independently first:

Before embedding a subquery, run it standalone to verify it returns expected results:

				
					-- Test subquery first
SELECT AVG(unit_price) FROM products;
-- Returns: 28.87

-- Then use in main query
SELECT product_name, unit_price
FROM products
WHERE unit_price > 28.87;
				
			

2. Verify return value type matches operator:

  • Single value needed for =, >, < operators
  • Multiple values needed for IN, EXISTS
  • Check if subquery returns what you expect

Summary

This lesson has equipped you with the ability to construct multi-layered analytical queries using subqueries. You now understand:

  1. Subquery Fundamentals: How nested queries execute within outer queries, the distinction between correlated and non-correlated subqueries, and different return value types.
  2. WHERE Clause Subqueries: Using scalar subqueries with comparison operators, IN for multiple-value matching, NOT IN for exclusions (with NULL handling), correlated subqueries for row-by-row comparisons, and EXISTS for checking record presence.
  3. SELECT Clause Subqueries: Creating calculated columns based on aggregate values and related table data, with awareness of performance implications.
  4. FROM Clause Subqueries: Building derived tables that act as temporary result sets, enabling complex filtering and ranking patterns.
  5. Best Practices: When to use subqueries vs. alternatives, formatting for readability, and basic debugging approaches.

 

Subqueries are essential tools for intermediate SQL analysis. While powerful, they are not always the optimal solution. In the next lesson, you will learn about Common Table Expressions (CTEs), which often provide a more readable and maintainable alternative to complex subqueries, particularly for multi-step logic.

 

You are now capable of answering business questions that require iterative logic, dynamic filtering, and comparative analysis against aggregated benchmarks—analytical capabilities that distinguish intermediate SQL practitioners from beginners.

Exercise: SQL Subqueries

Using the NYC Yellow Taxi Trip Dataset

 

Exercise Set Overview

These exercises use the NYC Taxi and Limousine Commission’s Yellow Taxi trip data. You’ll be analyzing real-world transportation data to answer business questions that taxi fleet managers, city planners, and operations teams face daily.

Database Schema Reference:

  • fact_yellow_trips_raw_2025 – Main trip records table
  • dim_taxi_zones – Location information (pickup/dropoff zones)
  • dim_rate_codes – Rate code descriptions
  • dim_payment_types – Payment method descriptions
  • dim_vendors – Vendor/provider information
  • dim_date – Date dimension for time-based analysis

 

Important:

For each of the tasks below:

  1. write the appropriate query to extract information from the database
  2. Execute the query and capture results
  3. generate insights and answer the business question 

 

Exercise 1: Scalar Subquery – Trips Above Average Fare

Business Question: “The operations team wants to analyze premium trips. Identify all trips with total fares exceeding the system-wide average fare. Show vendor name, pickup zone, dropoff zone, trip distance, and total fare.”

Understanding the characteristics of high-fare trips helps optimize pricing strategies and identify premium service opportunities.

  • Use a scalar subquery to calculate the average total fare
  • Filter trips exceeding this average
  • Join to vendor and zone dimension tables
  • Order by total fare descending
  • Limit to 100 rows for readability
  • vendor_name
  • pickup_zone
  • dropoff_zone
  • trip_distance
  • total_amount
  • fare_above_average (calculate the difference)

Scalar subquery in WHERE clause, multi-table joins, calculated columns

 
Exercise 2: IN Operator – Trips from High-Volume Zones

Business Question: “Identify trips that originated from the top 5 busiest pickup zones (by trip count). Show the first 500 trips with pickup zone, dropoff zone, pickup time, fare, and tip.”

High-volume zones represent critical demand centers. Analyzing trips from these zones helps with fleet allocation and surge pricing strategies.

  • Subquery to identify top 5 zones by pickup count
  • Use IN operator to filter main query
  • Join to dim_taxi_zones for zone names
  • Filter to January 2025
  • Order by pickup datetime
  • pickup_datetime
  • pickup_zone
  • pickup_borough
  • dropoff_zone
  • dropoff_borough
  • fare_amount
  • tip_amount
  • total_amount

Column subquery with IN, aggregation in subquery, multi-dimensional joins

 

 

Exercise 3: NOT IN – Zones with No Airport Pickups

Business Question: “The logistics team wants to identify taxi zones that have never been used for airport pickups (airport_fee > 0). List these zones with their borough to understand coverage gaps.”

Bonus Challenge: Why might this query be slow if the dataset is large? Suggest an alternative using NOT EXISTS. 

Zones with no airport business might indicate either geographical barriers or missed marketing opportunities for airport service.

  • Subquery to find all zones with airport pickups (airport_fee > 0)
  • Use NOT IN to exclude these zones
  • Show zone name and borough
  • Order by borough, then zone name
  • Zone
  • Borough
  • service_zone

NOT IN with subquery, exclusion logic, data gap analysis

 

 

Exercise 4: Correlated Subquery – Trips Above Zone Average

Business Question: “For each trip, determine if its fare exceeds the average fare for trips originating from the same pickup zone. Show trips where the fare is at least 50% higher than their zone’s average.”

Identifying trips with unusually high fares relative to their pickup zone can reveal premium service patterns, surge pricing effectiveness, or potential pricing anomalies.

  • Correlated subquery calculating average fare per pickup zone
  • Filter for trips exceeding zone average by 50%
  • Show both the trip fare and zone average for comparison
  • Limit to 200 rows
  • Order by the fare difference descending
  • pickup_zone
  • dropoff_zone
  • tpep_pickup_datetime
  • total_amount
  • zone_avg_fare (from subquery)
  • percent_above_avg (calculated: (total_amount / zone_avg – 1) * 100)

Correlated subquery, per-group comparisons, calculated metrics

 

 

Exercise 5: EXISTS – Vendors with Recent Activity

Business Question: “Identify vendors that have processed trips in the last 7 days of January 2025. Show vendor name and their total trip count for the entire month.”

Understanding which vendors are actively operating helps operations teams coordinate with active partners and identify inactive vendors for follow-up.

  • EXISTS subquery to check for trips in last 7 days
  • Main query aggregates all January trips
  • Join to vendor dimension
  • Show vendor name and trip count
  • vendor_name
  • total_trips_january
  • has_recent_activity (calculated: ‘Yes’ since these are filtered by EXISTS)

EXISTS operator, correlated subquery, aggregation in main query

 

 

Exercise 6: Subquery in SELECT – Trip Count per Zone

Business Question: “Create a report showing each taxi zone with the total number of trips that originated from that zone in January 2025. Include zones with zero trips (showing 0, not NULL).”

(Optional) Challenge: Write two versions: one with a subquery in SELECT, one using LEFT JOIN with GROUP BY. Compare which is more readable.

This analysis identifies demand patterns and coverage gaps across the city’s taxi zones.

  • Subquery in SELECT clause to count trips per zone
  • Use COALESCE to handle zones with no trips
  • Show zone name, borough, and trip count
  • Order by trip count descending
  • Zone
  • Borough
  • trip_count (using subquery with COALESCE for zeros)

Scalar subquery in SELECT, NULL handling with COALESCE

 

 

Exercise 7: Subquery in FROM – High-Fare Trip Analysis

Business Question: “Analyze the top 50 highest-fare trips. Create a derived table of these trips, then enhance it with vendor information, payment type descriptions, and pickup/dropoff zone details.”

Premium trips represent significant revenue opportunities. Understanding their characteristics (vendors, payment methods, locations) informs business development strategies.

  • Subquery in FROM creating a derived table of top 50 trips by total_amount
  • Alias the derived table appropriately
  • Join to all relevant dimension tables
  • Calculate trip duration in minutes
  • Order by total_amount descending
  • vendor_name
  • pickup_zone
  • dropoff_zone
  • tpep_pickup_datetime
  • tpep_dropoff_datetime
  • trip_duration_minutes (calculate: EXTRACT(EPOCH FROM (dropoff – pickup))/60)
  • trip_distance
  • payment_type_desc
  • total_amount

Derived table in FROM, multiple joins, datetime calculations, table aliasing

 

Example Submission Format:
Exercise: Scalar Subquery – Trips Above Average Fare

Query:
SELECT 
v.vendor_name,
pz.Zone AS pickup_zone,
...
[Your complete query here] Results: [Screenshot] Business Insight: The analysis reveals that 42% of trips exceed the average fare of $19.47, with premium trips concentrated in Manhattan and airport routes. The average fare for above-average trips is $34.23, indicating a significant premium segment that generates disproportionate revenue. Vendors should focus premium vehicle deployment on these high-value zones during peak hours.

 

 

Submission & Resubmission Guidelines

1. When submitting your exercise, use this naming format:

YourName_Submission_Lesson#.xxx

If you need to revise and resubmit, add a version suffix:

YourName_Submission_Lesson#_v2.xxx
YourName_Submission_Lesson#_v3.xxx

2. Do not overwrite or change the original evaluation entries in the rubric.

Instead, enter your updated responses or corrections in a new “v2” (or later) column in the rubric for mentor review.

Evaluation Rubric

Criteria Meets Expectations Needs Improvement Incomplete
Query Correctness

All 7 exercises execute without errors and return logically correct results. Appropriate subquery types used (scalar, column, correlated, derived table). Challenge exercises attempted show good effort.

Key Success Indicators:

✅ Exercise 1 uses scalar subquery with single value return

✅ Exercise 2 uses IN operator correctly

✅ Exercise 3 handles NULLs in NOT IN (or uses NOT EXISTS)

✅ Exercise 4 uses correlated subquery correctly

✅ Exercise 5 uses EXISTS (not IN or joins)

✅ Exercise 7 properly aliases derived table

5-6 exercises work correctly; 1-2 have minor logical errors. Subquery types mostly appropriate. Fewer than 5 exercises execute successfully. Major logical errors or fundamental misunderstanding of subquery mechanics.
Subquery Technique Application Correctly applies scalar subqueries, IN/NOT IN, EXISTS, correlated subqueries, and derived tables where specified. Shows understanding of when each technique is appropriate. Understands most techniques but confuses 2-3 (e.g., using IN when scalar is needed, or incorrect correlation). Cannot distinguish between subquery types or apply them correctly.
NULL Handling Properly handles NULLs in NOT IN queries (uses IS NOT NULL or EXISTS). Uses COALESCE appropriately for zero values. Basic understanding but may miss NULL handling in 1-2 queries. No consideration of NULL handling. Queries fail due to NULL issues.
Business Insights Provides meaningful 2-3 sentence insights for each exercise that interpret results in business context with specific metrics. Insights present but shallow or generic. Missing for 2-3 exercises. Most insights missing or show no understanding of business application.
Student Submissions

Check out recently submitted work by other students to get an idea of what’s required for this Exercise:

Approved on March 5th
Got Feedback?

Drop us a line here.

Contact

Talk to us

Have questions or feedback about Lumen? We’d love to hear from you.