Lesson 6 - Descriptive Analysis
Master the essential statistical techniques that transform raw data into actionable business insights.
Estimated Read Time: 1–1.5 hours
Learning Goals
In this lesson, you will learn descriptive and statistical analysis concepts to analyze data distribution and central tendencies. By the end of this lesson, you will be able to:
- Use Excel’s statistical functions with confidence and precision
- Calculate and interpret measures of central tendency (mean, median, mode) to understand typical values in your dataset
- Create histograms in Excel to visualize data distribution and identify patterns
- Distinguish between symmetrical and skewed distributions and understand their business implications
- Apply measures of spread (IQR, outlier detection) to assess data variability and reliability
- Turn raw data into clear business insight with foundational descriptive analysis.
According to McKinsey, companies that use data-driven decision making are 23 times more likely to acquire new customers — and 9 times more likely to retain them — compared to their competitors. But to make those smart decisions, you first need to extract meaningful insights from your data. That’s where statistical analysis comes in.
Statistical analysis forms the cornerstone of data-driven decision making in modern business environments. Descriptive statistics are instrumental in EDA, providing numerical summaries that characterize the central tendency, dispersion, and shape of the dataset’s distribution. These statistics form the basis for deeper analysis and hypothesis testing.
In this lesson, you’ll learn how to apply basic statistical concepts to your dataset to uncover trends, identify problem areas, and make confident, data-backed decisions.
However, the profound difference between competent analysts and exceptional ones lies not in their ability to calculate statistics, but in their capacity to interpret these measures within the broader business context and translate findings into strategic insights.
Consider the complexity underlying a seemingly simple question posed by Rossmann’s executive team: “What’s our typical daily sales performance?”
A simple answer might be: “Our average daily sales is €5,545.” (See Figure 1.) But what if most stores actually sell around €3,000-€4,000, while a few flagship locations pull in €15,000-€20,000 daily? Suddenly, the average becomes misleading rather than illuminating.
This lesson teaches you to uncover the complete picture behind your data, moving beyond simple averages to understand patterns, outliers, and the shape of business performance.
1. Start with the Right Business Questions
Before calculating numbers, it is essential to approach your dataset with a business mindset. Simply running statistics without purpose does not lead to useful insights.
For the Rossmann Dataset, here are some relevant business questions you might consider:
- Are sales consistent across stores?
- What are the minimum, maximum, and average sales values?
- Does customer traffic have an impact on sales?
- Which stores are receiving poor customer traffic, and why?
- What is the impact of promotions (or Holidays) on sales?
Sometimes, the questions are more indirect and are phrased with respect to business concepts, e.g.:
- The executive team wants to understand the overall sales performance patterns across their stores.
- Do the extreme values represent genuine business performance or data quality issues?
- What business factors could explain the maximum sales figure?
- The VP of Operations asks: “What’s our typical weekly performance for the store? I need this for budget planning and performance target setting.”
- The marketing team wants to understand sales patterns to optimize promotional strategies. They’ve heard that the top store performed similarly overall but want data-driven evidence.
- The finance team is concerned about sales volatility for cash flow planning. They need to understand whether stores have predictable performance or high variability.
- The strategy team hypothesizes that holiday periods significantly boost sales, but they want to quantify this relationship and understand if it’s consistent across all stores for the top-performing department.
By thinking through these questions, you can focus your statistical analysis on the areas that truly matter for business decision-making.
2. Prepare Your Stats Sheet
To keep your analysis organized, create a new sheet within your Excel file. Name this new sheet sales_stats. This is where you will calculate and document your key statistical metrics. Start by creating column headers in the sales_stats sheet:
| Metric | Value | Comments and Observations |
| --- | --- | --- |
| Minimum | (formula output – to be filled in next steps) | (your analysis) |
| Maximum | (formula output – to be filled in next steps) | (your analysis) |
| Average | (formula output – to be filled in next steps) | (your analysis) |
| … | … | … |
3. Measures of Central Tendency
“What does a typical day look like for this business?”
This is the essence of central tendency. You’re trying to capture the “center” of the data—the point around which other values cluster.
3.1. Minimum and Maximum
Let’s begin by analyzing the sales from the Rossmann dataset. The sales are located in Column D of your main data sheet. We can start by calculating simple stats like minimum and maximum.
This is further elaborated in the video walkthrough below.
Calculating Minimum and Maximum Sales
To calculate the Minimum Sales:
- In the sales_stats sheet, under “Value” next to “Minimum Sales,” enter:
=MIN(raw_data!D:D)
- Press Enter.
This will display the lowest sales across all stores.
Reflect: Any surprises here? What could be the reason?
There is indeed a surprise: the minimum sales value comes out to zero. We need to explore the cause. To do so, you can either sort the sales in ascending order or filter the sales column for zeroes.
A closer look at the Open column shows that the stores with zero sales were actually closed (Open = 0). This leads to another question: do we want to include closed stores in the remaining analysis? The decision affects our analysis.
Let’s say you decide to leave closed stores out. The simplest solution is to duplicate the raw_data sheet and delete the rows with sales = 0.
You can also filter data while writing the min (or any other) formula.
=MIN(FILTER(raw_data!D:D, raw_data!D:D>0))
This gives you a value of €612, which is the minimum sales among stores that are still open.
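If you want to sanity-check this filtering logic outside Excel, here is a minimal Python sketch with made-up sales figures (not the actual Rossmann values):

```python
# Hypothetical daily sales; zeros mark days when a store was closed
sales = [0, 612, 3400, 0, 5200, 7800]

min_all = min(sales)                      # like =MIN(D:D): includes closed stores
open_only = [s for s in sales if s > 0]   # mirrors FILTER(D:D, D:D>0)
min_open = min(open_only)

print(min_all, min_open)  # 0 612
```

The zero minimum disappears as soon as the closed-store rows are excluded, exactly as with the FILTER formula.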
To calculate the Max sales:
- In the next row under “Value,” enter:
=MAX(raw_data!D:D)
- Press Enter.
This shows the highest sales value: €38,037.
This is already thought-provoking: the difference between the worst-performing and best-performing stores is huge. Are these stores exceptional cases? And what does the typical store look like? To answer that, we move to the mean.
3.1.1 When to Use Min/Max Analysis
Use Min/Max When:
- Identifying performance boundaries and operational limits
- Detecting data quality issues (unexpected zeros, negative values)
- Setting realistic targets based on historical extremes
- Assessing risk exposure (worst-case scenarios)
Avoid Min/Max When:
- Making typical performance assessments (outliers skew perception)
- Forecasting future performance (extremes are rarely predictive)
- Comparing datasets of different sizes (larger datasets have more extreme values)
Business Decision Rule:
Min/Max tells you what’s possible, not what’s probable. Use for boundary setting, not performance expectations.
3.2 The Mean (Arithmetic Average)
The arithmetic mean represents the mathematical center of your dataset, calculated by summing all values and dividing by the number of observations. While conceptually straightforward, the mean’s interpretation requires sophisticated understanding of its sensitivity to extreme values and its appropriate application contexts.
When to use it: When data is relatively symmetrical without extreme outliers
Mathematical Foundation and Excel Implementation:
The mean serves as your initial anchor point for understanding typical performance.
Let’s calculate the average sales / customer for Rossmann:
- Calculate average daily sales: =AVERAGE(raw_data!D:D)
- Calculate average customers: =AVERAGE(raw_data!E:E)
- Calculate Sales per Customer: =AVERAGE(raw_data!D:D)/AVERAGE(raw_data!E:E)
The average sales comes out to €5,545.36, which is quite low compared to the maximum. If you sort the data by sales in descending order, a quick glance reveals that a promotion was running (Promo = 1) for most of the high values. This explains the high maximum.
Also, if you calculate the average for open stores only, the value is a bit higher: €6,694.55.
In a similar way you can conduct segmented analysis to uncover meaningful patterns. Consider these key metrics to gain deeper insight:
Overall Mean Sales: =AVERAGE(raw_data!D:D)
Promotional Mean: =AVERAGEIF(raw_data!G:G, 1, raw_data!D:D)
Non-Promotional Mean: =AVERAGEIF(raw_data!G:G, 0, raw_data!D:D)
Weekend Mean: =AVERAGEIFS(raw_data!D:D, raw_data!B:B, ">=6")
Weekday Mean: =AVERAGEIFS(raw_data!D:D, raw_data!B:B, "<6")
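The conditional averages above can be sketched in Python as well. The rows below are hypothetical, purely to show how the AVERAGEIF/AVERAGEIFS conditions translate into filters:

```python
from statistics import mean

# Hypothetical rows: (day_of_week, promo_flag, sales)
rows = [
    (1, 1, 8200), (2, 0, 5100), (3, 1, 7900),
    (5, 0, 4800), (6, 0, 6000), (7, 0, 6400),
]

promo_mean    = mean(s for _, p, s in rows if p == 1)  # AVERAGEIF on Promo = 1
no_promo_mean = mean(s for _, p, s in rows if p == 0)  # AVERAGEIF on Promo = 0
weekend_mean  = mean(s for d, _, s in rows if d >= 6)  # AVERAGEIFS on DayOfWeek >= 6

print(promo_mean, no_promo_mean, weekend_mean)  # 8050 5575 6200
```

Each segmented mean is just the plain mean of the rows that satisfy the condition, which is exactly what the Excel functions compute.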
Consider the videos below:
Calculating Average or Mean of Data
Calculating Conditional Average of Data
3.2.1. When to Use Mean vs. Alternatives
Use Mean When:
- Data is approximately symmetrical (normal distribution)
- No extreme outliers present
- Need mathematical precision for calculations (budgeting, forecasting)
- Stakeholders understand and expect traditional averages
- Planning resource allocation based on total performance
Avoid Mean When:
- Data is heavily skewed (retail sales, income data)
- Extreme outliers are present (Black Friday vs. regular days)
- Need to understand “typical” employee/customer experience
- Communicating to audiences who might misinterpret averages
Business Decision Examples:
- Use Mean for: Annual budget planning, total revenue calculations
- Avoid Mean for: “What does a typical customer spend?” (use median instead)
3.3. The Median
The median represents the middle value when all observations are arranged in ascending order, providing a measure of central tendency that remains stable in the presence of extreme values. This robustness makes the median particularly valuable in business contexts where outliers may distort the mean and mislead strategic decision-making.
When to use it: When you suspect outliers or want to understand typical performance
Imagine if Store 85 had a few promotional days with €25,000 in sales: the average would jump, but the median would stay grounded, showing what a “normal” day really looks like.
3.3.1. Excel Implementation and Analytical Framework
Calculating Central Tendency of Data via Median
Median is calculated by:
=MEDIAN(raw_data!D:D) //Central sales performance across all stores
The analytical power of the median emerges through comparative analysis with the mean. This comparison reveals the underlying distribution characteristics and potential business implications:
Mean Sales: =AVERAGE(raw_data!D:D)
Median Sales: =MEDIAN(raw_data!D:D)
Mean-Median Difference: =[Mean Cell] - [Median Cell]
Relative Difference: =([Mean Cell] - [Median Cell])/[Median Cell]
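As a quick illustration of this comparison, here is a small Python sketch with invented numbers where one promotional spike pulls the mean well above the median:

```python
from statistics import mean, median

# Hypothetical daily sales; one promotional spike pulls the mean upward
sales = [3000, 3200, 3500, 3800, 4000, 4200, 20000]

m, med = mean(sales), median(sales)
rel_diff = (m - med) / med  # mirrors the Relative Difference formula

print(med, round(m, 2), round(rel_diff, 3))  # 3800 5957.14 0.568
```

A relative difference this large (about 57%) is a strong hint of positive skew: most days sit near €3,000-€4,200, while the €20,000 day drags the average up.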
3.3.2. Distribution Interpretation Guide:
The relationship between mean and median provides immediate insight into your data’s distribution characteristics. This is explained below:
Interpreting Median and Mean in Business
Positive Skew (Mean > Median): A relatively small number of high-performing stores or exceptional sales days pull the average upward. This pattern suggests that while most locations operate at moderate performance levels, certain stores or conditions generate exceptional results. From a strategic perspective, this indicates opportunities to identify and replicate success factors across the broader network.
Symmetric Distribution (Mean ≈ Median): Performance is relatively balanced across the store network, suggesting consistent operational standards and predictable customer demand patterns. This distribution type facilitates more accurate forecasting and resource allocation planning.
Negative Skew (Mean < Median): Some underperforming locations or challenging conditions pull the average below the typical performance level. This pattern signals potential operational issues that require investigation and remediation strategies.
For Rossmann’s retail context, we observed positive skew in daily sales data for open stores. Retail environments typically exhibit this pattern due to seasonal peaks, promotional events, and the natural variation in store locations and customer demographics.
3.3.3. When to Choose Median Over Mean
Use Median When:
- Data contains outliers that skew the mean
- Distribution is skewed (most business metrics are)
- Need to communicate “typical” performance to stakeholders
- Comparing datasets with different outlier patterns
- Assessing employee or customer experience metrics
Avoid Median When:
- Need exact mathematical precision for financial calculations
- Data is perfectly symmetrical with no outliers
- Stakeholders specifically require traditional averages
- Performing calculations that depend on all data points
Strategic Framework:
- CEO Dashboard: Use median for “typical store performance”
- Financial Planning: Use mean for total revenue projections
- Performance Reviews: Use median to set realistic targets
Quick Test:
If Mean – Median > 20% of median value, always report both with explanation.
3.4. The Mode
The mode represents the most frequently occurring value within your dataset, providing insight into standard operating levels and common business patterns. While less commonly used than mean or median, the mode offers unique strategic value in identifying typical operational states and standard performance levels.
When to use it: To understand common patterns or standard operating levels
Use MODE when you want to know, for example, which customer count happens most often. This can be useful for staffing, capacity planning, or marketing.
3.4.1. Excel Implementation:
Calculating Mode of Data
In Excel, you calculate mode using the formula:
=MODE.SNGL(raw_data!D:D) // Most common sales figure across observations
We get a zero here – a surprise (or maybe not anymore). Let’s sort sales in ascending order and see what we get. We find that the first 15,666 rows have a sales value of zero. On further exploration, we notice that for all of these rows the Customers value is also zero, and the Open status is zero as well. Now it makes sense, because these stores were closed. The question is whether to keep these rows in our analysis or filter them out. This is a question that you, as an analyst, can answer after talking to the stakeholder.
Consider another scenario, where we want to calculate mode of sales only when promotions are running. To do this (excluding the stores with zero sales):
Promo Mode: =MODE.SNGL(IF(raw_data!G15666:G91258=1, raw_data!D15666:D91258))
Note that the data is sorted and the first 15,666 rows have sales = 0, so we select only the remaining rows, for both the condition range and the value range, when calculating the mode. You can also apply a filter, or simply clean the data upfront.
You can compare this with non-promo sales to see the differences.
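The same conditional-mode idea, sketched in Python with made-up rows (closed-store zeros already removed):

```python
from statistics import mode

# Hypothetical (promo_flag, sales) rows; zero-sales (closed) days already removed
rows = [(1, 7500), (1, 7500), (1, 8200), (0, 4800), (0, 4800), (0, 5100)]

promo_mode    = mode(s for p, s in rows if p == 1)  # like MODE.SNGL on promo rows
no_promo_mode = mode(s for p, s in rows if p == 0)  # like MODE.SNGL on non-promo rows

print(promo_mode, no_promo_mode)  # 7500 4800
```

Filtering before taking the mode is the whole trick: the most common value among promo days can differ sharply from the most common value among regular days.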
3.4.2. Business Context for Rossmann:
The mode’s business value becomes apparent when analyzing operational patterns rather than seeking central tendency measures. Consider these analytical applications:
Standard Promotional Levels: The mode may reveal common promotional sales figures, indicating standard discount structures or typical promotional campaign results across the store network.
Operational Baseline Identification: Frequent sales values might indicate standard closing procedures, minimum viable daily operations, or common inventory turnover patterns.
Weekend and Holiday Patterns: Mode analysis across different day types can reveal standard weekend performance levels or common holiday shopping patterns.
3.4.3 When Mode Provides Strategic Value
Use Mode When:
- Identifying most common operational states
- Understanding standard customer behavior patterns
- Capacity planning and resource allocation
- Quality control and process standardization
- Analyzing categorical data (product types, customer segments)
Avoid Mode When:
- Data is continuous with few repeated values
- Need central tendency for financial calculations
- Distribution has multiple peaks (bimodal/multimodal)
- Sample size is too small for meaningful patterns
Business Applications:
- Inventory Planning: Most common order sizes
- Staffing Models: Most frequent customer traffic levels
- Pricing Strategy: Most common price points customers accept
Limitation: Mode may not exist in continuous data or may mislead if multiple modes exist.
4. Understanding the Shape of Data
Histograms transform raw numerical data into visual narratives that reveal distribution patterns, outliers, and underlying business dynamics.
Let’s create a histogram for Rossmann sales distribution.
4.1 Creating Histograms in Excel
Step-by-Step Process:
- Select your data: Select Sales data (column D)
- Insert Histogram:
- Select Data → Insert Statistic Chart → Histogram
- Or select Insert Menu → Chart → Histogram
When you include all stores, closed ones as well, the histogram shows a spike at the far left.
When you consider open stores only, you get a right-skewed distribution, which is in line with real-world retail patterns. Consider Figure 2 for the histograms when all stores are included (left) vs. when only open stores are included (right).
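Under the hood, a histogram just counts how many observations fall into each bin. The sketch below, with hypothetical open-store sales and an assumed €2,000 bin width, prints a crude text histogram:

```python
from collections import Counter

# Hypothetical open-store sales, bucketed into 2,000-wide bins like Excel's histogram
sales = [612, 1500, 2400, 3100, 3600, 4200, 4800, 5300, 6100, 7400, 9800, 15200]
bin_width = 2000

# Map each value to the lower edge of its bin, then count per bin
bins = Counter((s // bin_width) * bin_width for s in sales)
for lower in sorted(bins):
    print(f"{lower:>6}-{lower + bin_width:<6} {'#' * bins[lower]}")
```

The gap before the final 14000-16000 bin is the long right tail: most values cluster low, with a few high outliers, i.e. a right-skewed shape.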
4.2 Understanding Distribution Shapes
The shape of the data distribution provides profound insights into underlying business dynamics, operational consistency, and market conditions.
Symmetrical Distribution:
Symmetrical distributions exhibit balanced patterns where mean, median, and mode converge to similar values. The visual representation resembles a bell curve with equal tail lengths extending in both directions from the central peak.
- Mean ≈ Median ≈ Mode
- Bell-shaped curve
- Business meaning: Predictable, consistent performance
Symmetrical distributions suggest consistent operational performance across the store network, predictable customer demand patterns, and effective standardization of business processes. This distribution type facilitates accurate forecasting, reliable performance benchmarking, and confident resource allocation planning.
Right-Skewed (Positive Skew):
Right-skewed distributions exhibit extended tails stretching toward higher values, with the bulk of observations concentrated at lower levels. The mean exceeds the median, which exceeds the mode, creating a characteristic pattern where exceptional performance periods or locations pull the average upward.
- Mean > Median > Mode
- Long tail extending right
- Business meaning: Most days are average, but some exceptional high-performance days
Right-skewed sales patterns are common in retail environments and typically indicate that while most days or stores operate at moderate performance levels, certain conditions generate exceptional results. These exceptional periods might result from successful promotional campaigns, seasonal demand surges, or high-traffic location advantages.
Left-Skewed (Negative Skew):
- Mode > Median > Mean
- Long tail extending left
- Business meaning: Most days are good, but some poor-performance days drag average down
Left-skewed distributions feature extended tails toward lower values, indicating that while most observations cluster at higher performance levels, certain conditions or locations generate significantly lower results. The mode exceeds the median, which exceeds the mean.
4.2.1. Business Interpretation
Left-skewed sales patterns suggest that while the majority of operations perform well, certain systematic issues or challenging conditions create underperformance problems. This pattern requires immediate investigation to identify and address root causes of poor performance.
Focus analytical efforts on understanding the factors driving poor performance periods. Investigate whether these issues result from operational problems, market challenges, competitive pressures, or inadequate resource allocation.
Distribution Shape Assessment
You can calculate data skewness using the Skew Formula in Excel.
=SKEW(raw_data!D:D)
Interpretation:
- Positive = Right skewed
- Negative = Left skewed
- Near zero = Symmetric
For all stores, the sales skewness comes out to 0.2, which suggests roughly symmetric data. For open stores only, the skewness increases to 1.15, which confirms the positive (right) skew.
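For intuition, Excel’s SKEW uses the sample skewness formula, which can be reproduced in a few lines of Python (illustrative data, not the Rossmann figures):

```python
from statistics import mean, stdev

def skew(values):
    """Sample skewness, matching Excel's SKEW:
    n / ((n-1)(n-2)) * sum(((x - mean) / s)^3), with s the sample std dev."""
    n, m, s = len(values), mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

symmetric = [1, 2, 3, 4, 5]
skewed = [3000, 3200, 3500, 3800, 4000, 4200, 20000]
print(abs(skew(symmetric)) < 1e-9, skew(skewed) > 0)  # True True
```

Perfectly symmetric data yields a skewness of zero, while the long right tail in the second list produces a clearly positive value.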
5. Measures of Spread
Consider the video below, which is an overview to spreads using the Rossmann dataset.
Calculating Spread of Data
With this information you can identify your top-performing stores, e.g. Store ID 57 performs consistently well, followed by stores 84 and 25. We can also see that there were mostly no school holidays on those dates, although a promotion was running.
5.1. Interquartile Range (IQR) – Robust Spread Measurement
Building upon your outliers work from previous lesson, the interquartile range provides a robust measure of data spread that remains stable in the presence of extreme values. The IQR represents the range containing the middle 50% of your observations, offering insights into typical performance variation while minimizing outlier influence.
Let’s continue with our IQR calculation of Rossmann sales:
For this, let’s copy the “open” stores’ data in another sheet called: cleaned_data and use it moving forward:
// Comprehensive quartile analysis
Q1 (25th Percentile): =QUARTILE(cleaned_data!D:D, 1)
Q2 (50th Percentile/Median): =QUARTILE(cleaned_data!D:D, 2)
Q3 (75th Percentile): =QUARTILE(cleaned_data!D:D, 3)
IQR: =[Q3] - [Q1]
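The same quartile analysis, sketched in Python on a hypothetical cleaned dataset. Note that `statistics.quantiles` with `method="inclusive"` uses the same interpolation as Excel’s QUARTILE:

```python
from statistics import quantiles

# Hypothetical cleaned (open-store only) daily sales, sorted for readability
sales = [612, 2400, 3100, 3600, 4200, 4800, 5300, 6100, 7400, 9800, 15200]

# method="inclusive" matches Excel's QUARTILE / QUARTILE.INC
q1, q2, q3 = quantiles(sales, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 3350.0 4800.0 6750.0 3400.0
```

The middle 50% of these sample values spans €3,350 to €6,750, so the IQR of €3,400 summarizes typical variation without being touched by the €15,200 extreme.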
5.1.1. Interpretation
The IQR’s business value extends beyond simple spread measurement to strategic performance assessment.
Narrow IQR values indicate consistent performance across the middle 50% of observations, suggesting effective operational standardization and predictable business conditions. Wide IQR values reveal significant performance variation that may indicate optimization opportunities or operational inconsistencies requiring investigation. In short:
- Small IQR: Consistent performance across stores
- Large IQR: High variability, potential for optimization
Use quartile boundaries to establish performance categories: below Q1 (needs improvement), Q1-Q2 (developing), Q2-Q3 (meeting expectations), above Q3 (exceeding expectations). This framework provides objective performance evaluation criteria for individual stores or operational periods.
5.1.2. Outlier Detection and Business Context
Your previous work in data cleaning established outlier identification methodologies using the 1.5 × IQR rule. In the context of descriptive statistics, outliers take on strategic significance beyond data quality considerations.
Statistical Outlier Boundaries:
Lower Outlier Boundary: =[Q1] - 1.5 * [IQR]
Upper Outlier Boundary: =[Q3] + 1.5 * [IQR]
In descriptive statistics, outliers represent exceptional business conditions that deserve investigation rather than removal. These extreme values often contain valuable insights about optimal performance conditions or systematic problems requiring attention.
- Outliers affect mean more than median
- Understanding outliers helps explain distribution shape
- Business decision: Investigate vs. exclude outliers
High-Performance Outliers: Exceptionally high sales days provide case studies for success factor identification. Investigate the conditions surrounding these outliers: promotional activities, seasonal factors, competitive dynamics, or operational innovations that might be replicated across the network.
Low-Performance Outliers: Unusually poor sales performance periods signal potential operational problems, market challenges, or competitive threats requiring immediate attention and remediation strategies.
Outlier Impact Assessment: Quantify how outliers influence your central tendency measures. Compare mean calculations with and without outliers to understand their impact on performance benchmarks and forecasting accuracy.
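Putting the 1.5 × IQR rule and the impact assessment together in one small Python sketch (hypothetical values, with one extreme promotional day):

```python
from statistics import mean, median, quantiles

# Hypothetical open-store sales with one extreme promotional day
sales = [612, 2400, 3100, 3600, 4200, 4800, 5300, 6100, 7400, 9800, 38000]

q1, _, q3 = quantiles(sales, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the 1.5 x IQR boundaries

outliers = [s for s in sales if s < lower or s > upper]
print(outliers)                     # [38000]
print(mean(sales) > median(sales))  # True: the outlier drags the mean up, not the median
</```

Only the €38,000 day falls outside the boundaries, and comparing mean vs. median immediately quantifies its pull on the central tendency measures.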
5.2. Five-Number Summary
The five-number summary is a foundational snapshot of data distribution, frequently used in dashboards and reports. If a decision-maker asks for a quick performance profile of Rossmann stores, this five-number summary is the fastest way to communicate the full story: consistency, variability, and extremes.
How-to in Excel (Rossmann – Sales column D):
| Metric | Excel Formula |
| --- | --- |
| Minimum | =MIN(cleaned_data!D:D) |
| Q1 | =QUARTILE(cleaned_data!D:D,1) |
| Median (also referred to as Q2) | =QUARTILE(cleaned_data!D:D,2) or =MEDIAN(cleaned_data!D:D) |
| Q3 | =QUARTILE(cleaned_data!D:D,3) |
| Maximum | =MAX(cleaned_data!D:D) |
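The whole table can also be produced by one small helper function, shown here in Python on hypothetical cleaned data:

```python
from statistics import quantiles

def five_number_summary(values):
    """Min, Q1, median, Q3, and max: the same metrics as the Excel table."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")  # matches Excel QUARTILE
    return {"min": min(values), "Q1": q1, "median": q2, "Q3": q3, "max": max(values)}

# Hypothetical cleaned sales data
sales = [612, 2400, 3100, 3600, 4200, 4800, 5300, 6100, 7400, 9800, 15200]
print(five_number_summary(sales))
```

Five numbers compress the whole distribution: the min/max give the extremes, Q1/Q3 give typical variation, and the median gives the center.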
6. Measuring Consistency in Business Performance
6.1. Variance
Variance is one of the most fundamental concepts in statistics, yet often misunderstood. It tells us how far, on average, each data point is from the mean—not in raw units, but squared. It is the mathematical foundation of standard deviation, and plays a key role in everything from risk modeling to AI algorithms.
6.1.1. Why It Matters in Business?
Imagine two Rossmann stores with the same average daily sales of €6,000. One achieves that figure through steady performance—every day between €5,500 and €6,500. The other swings between €2,000 and €12,000. Which one is easier to manage? Which one is easier to forecast?
That’s what variance tells us: how stable or volatile the data is around its center.
- Low variance (< 10% of mean²) → predictable, operationally efficient
- Medium Variance (10-50% of mean²) → normal business variation, monitor trends
- High variance (> 50% of mean²) → volatile, potentially risky or opportunity-rich
6.1.2. Excel Implementation: Calculating Variance
In Excel, you can calculate population variance using:
=VAR.P([range_of_data])
For Rossmann, to compute the variance in daily sales:
=VAR.P(cleaned_data!D:D)
This gives you a single value—measured in squared euros—that captures how widely daily sales vary from their average.
If you wanted store-level variance, you could filter by Store ID (Column A) and calculate, say for store number 35:
=VAR.P(FILTER(cleaned_data!D:D, cleaned_data!A:A=35))
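The filter-then-variance pattern translates directly to Python. The store IDs and sales below are invented, chosen so both stores share the same mean but differ sharply in spread:

```python
from statistics import pvariance

# Hypothetical (store_id, daily_sales) rows: same mean, very different spread
rows = [(35, 5500), (35, 6000), (35, 6500),
        (17, 2000), (17, 6000), (17, 10000)]

def store_variance(store_id):
    """Mirrors =VAR.P(FILTER(cleaned_data!D:D, cleaned_data!A:A=store_id))."""
    return pvariance([s for sid, s in rows if sid == store_id])

print(store_variance(35) < store_variance(17))  # True: store 17 is far more volatile
```

Both stores average €6,000 per day, yet store 17’s variance is dozens of times larger, which is exactly the stability difference the executive-team question is about.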
6.1.3. Interpretation Framework
| Variance Value | Business Interpretation |
| --- | --- |
| Close to 0 | Sales are tightly clustered around the mean → consistent performance |
| Moderate | Some day-to-day variation, normal for most retail environments |
| High | High volatility – explore causes: promos, location, seasonality, errors? |
Note: Variance is sensitive to extreme values, since it squares the deviations. That’s why analysts often prefer standard deviation—which brings the result back into the same unit (euros, customers, etc.).
6.1.4. When Should You Use Variance?
Use variance when:
- Comparing the stability of different stores or departments
- Understanding risk exposure in operations or forecasts
- Feeding statistical models that require mathematical variance, such as regression, clustering, or machine learning algorithms
6.1.5. Business Scenario: Comparing Variance Across Rossmann Stores
Suppose the executive team is deciding which regions need operational support. They ask:
“Which of our stores are consistent, and which ones swing wildly day to day?”
You build a variance summary table:
| Store ID | Avg Sales | Sales Variance | Std Dev | Comment |
| --- | --- | --- | --- | --- |
| 17 | €5,820 | 1,280,000 | €1,131 | Consistent |
| 85 | €5,770 | 6,420,000 | €2,533 | Highly volatile – flag |
| 24 | €5,800 | 1,100,000 | €1,049 | Stable, well-managed |
| … | … | … | … | … |
This table becomes a strategic input—a way to decide where to allocate attention, marketing, or AI forecasting models.
6.2. Range and Standard Deviation
While the IQR is robust to outliers, the range and standard deviation tell you how extreme or volatile your data is. These measures are often used by finance and operations teams to assess risk, forecasting reliability, and operational control.
6.2.1. Excel Implementation: Calculating Standard Deviation
- Range of Sales:
- First calculate the max and min: =MAX(cleaned_data!D:D) and =MIN(cleaned_data!D:D)
- Then find the range: =MAX(cleaned_data!D:D)-MIN(cleaned_data!D:D)
- Standard Deviation:
=STDEV.P(cleaned_data!D:D)
High standard deviation may suggest inconsistent store performance or volatile customer behavior, both of which impact forecasting reliability and staffing models.
Add a summary row in your dashboard:
| Metric | Formula | Interpretation |
| --- | --- | --- |
| Max Daily Sales | =MAX(cleaned_data!D:D) | Peak performance possible |
| Min Daily Sales | =MIN(cleaned_data!D:D) | Worst-case daily performance |
| Range | =MAX(cleaned_data!D:D)-MIN(cleaned_data!D:D) | Total spread of observed performance |
| Standard Deviation | =STDEV.P(cleaned_data!D:D) | Volatility of daily sales |
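The range and standard deviation rows of the dashboard can be sanity-checked in Python with illustrative numbers (one extreme day inflates both measures):

```python
from statistics import pstdev

# Hypothetical daily sales with one extreme day
sales = [3000, 3200, 3500, 3800, 4000, 4200, 20000]

sales_range = max(sales) - min(sales)  # like =MAX(...)-MIN(...)
volatility = pstdev(sales)             # like =STDEV.P(...)

print(sales_range)             # 17000
print(round(volatility, 2))
```

Notice how a single €20,000 day stretches the range to €17,000 and pushes the standard deviation above €5,000, even though most days are tightly clustered. This is why range and standard deviation are read as volatility measures, not typical-performance measures.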
7. Explore Relationships with Correlation
Beyond individual statistics, you can explore how different factors in your dataset relate to one another. For example, you might ask: Does more customer traffic lead to more sales?
To check for a relationship between Customers and Sales, use the CORREL function.
Calculating Correlation between Variables
To calculate correlation between Sales (Column D) and Customers (column E):
- In the sales_stats sheet, enter:
=CORREL(cleaned_data!D:D, cleaned_data!E:E)
- Press Enter.
The result will be a number between -1 and 1:
- A value close to 1 indicates a strong positive relationship — as one increases, so does the other.
- A value close to -1 suggests a strong negative relationship — as one increases, the other decreases.
- A value near 0 indicates little or no relationship between the two.
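CORREL computes the Pearson correlation coefficient, which can be written out in a few lines of Python. The customer/sales pairs below are invented to show a strong positive relationship:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient, the same quantity Excel's CORREL returns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical (customers, sales) pairs: traffic and revenue move together
customers = [410, 520, 610, 700, 820, 950]
sales     = [3800, 4600, 5500, 6100, 7200, 8400]

r = pearson(customers, sales)
print(0.99 < r <= 1)  # True: a strong positive relationship
```

Because these sample sales rise almost linearly with customer counts, r lands very close to 1. Real data is noisier, which is what pulls the coefficient toward 0.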
8. Overall Decision Summary
Quick Reference: Which Statistic Should I Use?
For Central Tendency:
Is data symmetrical with no outliers?
- Yes → Use Mean
- No → Use Median (always report both if different)
For Spread:
Does data contain outliers?
- Yes → Use IQR
- No → Use Standard Deviation
For Relationships:
Is correlation > 0.5 AND makes business sense?
- Yes → Investigate further, design tests
- No → Note but don’t base decisions on it
Red Flags That Require Further Investigation:
- Mean and median differ by >25%
- Outliers represent >5% of data
- Correlation contradicts business logic
- Variance suggests major inconsistencies
Final Decision Rule:
When in doubt, calculate and report multiple measures with clear explanation of what each tells you about the business.
Summary
In this lesson you learned:
- How to calculate and interpret measures of central tendency — mean, median, and mode — to understand typical values in your data
- Why measuring the spread of data (range, interquartile range, standard deviation) reveals variability and reliability beyond averages
- How to identify and handle outliers using quartiles, IQR, and visual tools like box plots
- What the relationship between mean and median tells you about data skew and real-world business patterns
- How to use Excel formulas (e.g. AVERAGE, MEDIAN, MODE.SNGL, QUARTILE, CORREL) to compute descriptive statistics with precision
- How to explore relationships between variables using correlation, both statistically and visually
- Why descriptive analysis is essential for business insight, forecasting readiness, and strategic decision-making in real-world datasets like Rossmann
Exercise
Estimated Time to Complete: 1-2 hours
Task 1: Exploratory Data Analysis
Business Context:
The executive team wants to understand the overall sales performance patterns across their stores. Calculate the following central tendency measures and generate a summary. You can use the “Initial Business Questions” as inspiration.
Hint: Simple phrases like “what is the overall sales performance?” or “are there red flags?” have underlying data questions that can be answered using data distribution and central tendency etc. Check out this short tutorial that uses the Rossmann Sales Dataset to answer the business question: How does Sales Look Like?
1. Data Boundaries Analysis:
- Calculate minimum, maximum, and range of Weekly_Sales
- Identify which store and department had the highest weekly sales
- Investigate any suspicious minimum values (zeros, negatives, or extremely low numbers)
- Create a new cleaned_data sheet with suspicious values removed – as much as possible.
- Which problems did you face while cleaning?
- Note: You may not be able to clean the data completely. Do your best, then move on to the next task.
2. Initial Business Questions:
- Do the extreme values represent genuine business performance or data quality issues?
- What business factors could explain the maximum sales figure?
- How would you communicate these ranges to executives who might misinterpret them?
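As an illustration of the boundary-check pattern in this task (not a solution — the data below is made up), a minimal pandas sketch of computing min, max, and range, and flagging suspicious values:

```python
import pandas as pd

# Made-up example data; the real exercise uses the Walmart Weekly_Sales column.
df = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3],
    "Weekly_Sales": [24_000.0, -500.0, 0.0, 31_000.0, 18_000.0],
})

lo = df["Weekly_Sales"].min()   # Excel: =MIN(range)
hi = df["Weekly_Sales"].max()   # Excel: =MAX(range)
value_range = hi - lo           # Excel: =MAX(range) - MIN(range)

# Flag suspicious rows: zeros or negatives are likely data-quality issues.
suspicious = df[df["Weekly_Sales"] <= 0]
cleaned = df[df["Weekly_Sales"] > 0]  # analogue of the cleaned_data sheet

print(lo, hi, value_range, len(suspicious), len(cleaned))
```

In Excel you would achieve the same filtering with a filter or a helper column (e.g., `=IF(B2<=0, "suspicious", "ok")`) before copying rows into the cleaned sheet.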
Task 2: Central Tendency Deep Dive
Business Context:
The VP of Operations has asked:
“What’s our typical weekly performance for the top-performing store-department combination (identified in Task 1)? I need this for budget planning and setting realistic performance targets.”
Hint: For a corresponding analysis on Rossmann Sales Dataset, refer to the tutorial: Finding “Typical” Sales for Top Stores with Statistics.
To answer this, conduct the following analysis:
1. Calculate Core Central Tendency Metrics:
- Compute the mean, median, and mode of weekly sales for the identified store-department combination.
2. Perform a Segmented Analysis:
- Calculate and compare the mean weekly sales during holiday vs. non-holiday periods for the same combination.
3. Answer the Following Analytical Questions:
- How do the mean and median compare for overall weekly sales? What does this suggest about the distribution’s shape (e.g., skewness)?
- What is the holiday premium (Holiday Mean – Non-Holiday Mean)? Is it statistically or practically significant?
- Based on the shape of the distribution and business context, which central tendency measure would you recommend for setting performance benchmarks, and why? Justify your choice using the decision framework discussed in the lesson.
4. Business Recommendation:
Write a concise, three-sentence executive recommendation to the VP that summarizes your analysis. Be sure to:
- State the “typical” weekly performance,
- Name the central tendency metric you recommend,
- Explain how this supports budget planning or target-setting.
Task 3: Distribution Shape and Business Implications
Business Context:
The marketing team is exploring how sales patterns can inform more targeted promotional strategies. They’ve heard that “the top department (identified in Task 1) performs similarly across all stores” — but they now want data-backed validation.
Hint: Check out the short tutorial for a corresponding analysis on the Rossmann Sales Dataset: How to Read a Histogram in Excel to Uncover Performance Patterns?
Visual Analysis:
1. Create a histogram of Weekly_Sales using Excel’s histogram tool.
(Tip: It may be helpful to first copy the top department’s data into a separate sheet for easier analysis.)
2. Describe the distribution’s shape in 2–3 sentences based on the histogram.
Interpretation Challenge:
Use your visual and statistical observations to reflect on the following:
- Is the overall distribution right-skewed, left-skewed, or symmetric?
- What does this reveal about performance patterns across stores for the top department?
- How should these insights influence the marketing team’s promotional planning or targeting?
Task 4: Spread Analysis and Risk Assessment
Business Context:
The finance team is evaluating sales volatility to support effective cash flow planning. They need to determine whether stores in the top-performing department (identified in Task 1) show stable, predictable patterns — or high variability that poses financial risk.
Hint: “Volatility” or “stability” translates to standard deviation or variance in data terms. In business, they often go hand-in-hand with risk analysis. Refer to the short tutorial that uses the Rossmann Sales Dataset to answer the question: How to Measure Sales Stability with Standard Deviation in Excel?
Technical Analysis:
- Calculate the Interquartile Range (IQR) and upper/lower bounds using standard statistical formulas.
- Alternatively, visualize the distribution using a Box and Whisker plot.
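The standard formulas referenced here are the 1.5 × IQR fences: Lower = Q1 - 1.5 × IQR and Upper = Q3 + 1.5 × IQR. A minimal Python sketch with illustrative (made-up) numbers, with the equivalent Excel formulas in comments:

```python
import pandas as pd

# Illustrative figures only; the exercise uses the Walmart Weekly_Sales column.
sales = pd.Series([18_000, 21_000, 22_500, 24_000, 26_000, 95_000])

q1 = sales.quantile(0.25)  # Excel: =QUARTILE(range, 1)
q3 = sales.quantile(0.75)  # Excel: =QUARTILE(range, 3)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr  # values below this are flagged as low outliers
upper_bound = q3 + 1.5 * iqr  # values above this are flagged as high outliers

outliers = sales[(sales < lower_bound) | (sales > upper_bound)]
print(lower_bound, upper_bound, list(outliers))
```

These are the same fences Excel’s Box and Whisker chart uses to decide which points to plot as individual outlier dots.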
Business Risk Assessment:
- Which measure of spread, IQR or standard deviation, is more suitable for this dataset? Justify using the decision framework discussed in the lesson.
- Identify the best-performing and worst-performing stores within the top department.
- Do these stores exhibit high or low variability?
(Hint: Calculate standard deviation or variance for the shortlisted stores.)
Finance Team Recommendation:
Write a short memo explaining which stores have predictable cash flows and which may require larger cash buffers due to volatility.
Task 5: Correlation and Relationship Analysis
Business Context:
The strategy team believes that holiday periods significantly influence sales. They want to quantify this relationship and assess whether the effect is consistent across all stores within the top-performing department identified earlier.
Hint: Whenever the business talks about “influence” or “factors”, it translates to “correlation” in the data world. Check out the short tutorial that uses the Rossmann Sales Dataset to answer the question: Do Holidays impact Sales?
Use an appropriate statistical metric to measure the correlation between IsHoliday and Weekly_Sales.
(Hint: Convert the IsHoliday column from TRUE/FALSE to 1/0 for calculation.)
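The TRUE/FALSE-to-1/0 conversion and the CORREL calculation can be sketched as follows (illustrative numbers only, not the actual Walmart data):

```python
import pandas as pd

# Made-up sample; the exercise uses the Walmart IsHoliday and Weekly_Sales columns.
df = pd.DataFrame({
    "IsHoliday":    [True, False, False, True, False, False],
    "Weekly_Sales": [30_000, 22_000, 21_000, 33_000, 23_000, 22_500],
})

# Convert TRUE/FALSE to 1/0, then apply Pearson correlation
# (Excel: =CORREL(holiday_range, sales_range)).
df["Holiday_Flag"] = df["IsHoliday"].astype(int)
corr = df["Holiday_Flag"].corr(df["Weekly_Sales"])
print(round(corr, 3))
```

In Excel, the conversion step can be done with a helper column such as `=IF(C2=TRUE, 1, 0)` (or `=--C2`) before applying CORREL.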
Correlation Interpretation – Use the Decision Framework to Reflect On:
- Is the correlation between sales and holiday periods strong enough to guide business decisions?
- Does the observed relationship align with business intuition and context?
- Based on your findings, what strategic recommendations would you make to the team?
Submission Guidelines
Submit your solution as an Excel workbook and a Word/PDF document:
Workbook:
- Appropriate worksheets with your analysis
Document:
- Executive reports and reflections
Filename Format:
- YourName_Lesson6_Walmart_Descriptive_Analysis.xlsx
- YourName_Lesson6_Walmart_Reflections.docx
When you’re ready, submit your completed exercise to the designated folder in OneDrive, and drop your mentor a note that you have submitted.
Important: Please scan your files for viruses before uploading.
Submission & Resubmission Guidelines
- Initial Submission Format: YourName_Lesson6_…
- Resubmission Format:
- YourName_Lesson6_…_v2
- YourName_Lesson6_…_v3
- Rubric Updates:
- Do not overwrite original evaluation entries
- Add updated responses in new “v2” or “v3” columns
- This allows mentors to track your improvement process
Evaluation Rubric
| Criterion | Exceeds Expectation | Meets Expectation | Needs Improvement | Incomplete / Off-Track |
| --- | --- | --- | --- | --- |
| 1. Data Boundaries & Cleaning | | | | |
| 2. Central Tendency & Recommendation | | | | |
| 3. Histogram & Distribution Shape | | | | |
| 4. Spread Analysis & Risk Memo | | | | |
| 5. Correlation & Strategic Insight | | | | |
Got Feedback?
Have questions or feedback about Lumen? We’d love to hear from you.