Lesson 5 - Outliers: What to Clean and What to Keep?

Learn how to detect, interpret, and handle outliers — not just as data errors, but as signals that reveal edge cases, expose risk, and impact business decisions.

Estimated Read Time: 1 Hour

Learning Goals

In this lesson, you will learn about outliers:

Technical & Analytical

Use statistical formulas like mean, standard deviation, 3σ rule and quartiles to identify data ranges and outliers

Business Impact

Recognize when an outlier is a data error vs when it’s a business signal.
Possible options to handle outliers based on business case

AI Literacy & AI-Proofing

Leverage AI tools to identify outliers
Learn what AI can’t see: context, nuance, business logic.

Identifying outliers is a key component of EDA. An outlier is a data point that deviates significantly from the rest of the values — but the real question is not “how different is it?” It is, “why is it different?” And that distinction matters because how you handle outliers can directly influence business recommendations.

Remember that not all extreme values are mistakes. In fact, in business data, they rarely are.

In this lesson, you’ll learn how to identify, investigate, and treat outliers in a systematic, context-aware way. We’ll work directly with the rossmann-sales.xlsx file, using the Sales and Customers columns.

1. What Causes an Outlier?

Before opening Excel, let’s understand the possibilities. In retail data like Rossmann’s, high or low values might stem from:

Genuine business events such as grand openings, clearance sales, or national holidays that cause spikes in sales.
Data entry errors, for example, an extra zero added accidentally (€50,000 instead of €5,000).
System glitches during data logging or integration.
Store closures, weather events, or regional disruptions, which might cause zero sales on what should have been open days.

2. Identifying Potential Outliers in Excel

Let’s begin with the Sales column in rossmann-sales.xlsx. The analyst’s question would be:

“Are there values in the Sales column that are unusually high or low compared to typical daily sales?”

Before deciding whether a value is problematic, we must first define what qualifies as an outlier — and this begins with a foundation in descriptive statistics.

3. Understanding the Statistical Definition of an Outlier

An outlier is commonly defined as a data point that lies far outside the range of expected values, given the distribution of the dataset. To quantify this, we rely on two fundamental statistical measures:

1. The Mean (Average)

The mean represents the central tendency of a dataset. It is calculated by summing all values and dividing by the number of observations.

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

Where:

$x_i$ = each individual value in the dataset
$n$ = total number of observations

In Excel, to calculate the mean of the Sales column (where the data starts in cell D2), you would use:

=AVERAGE(D2:D91257)

or simply:

=AVERAGE(D:D)

2. The Standard Deviation (σ)

The standard deviation measures how spread out the data is around the mean. A small standard deviation means most values are close to the mean; a large standard deviation indicates high variability.

$$
\sigma = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i – \mu)^2 }
$$

Where:

μ is the mean

σ is the population standard deviation

To calculate the standard deviation in Excel, use:

=STDEV.P(range)

3.1. Defining Outliers Using the 3σ Rule

A commonly used rule in analytics is the Three Sigma Rule, also known as the 68–95–99.7 Rule. This principle comes from the properties of a normal distribution:

~68% of values lie within 1 standard deviation of the mean
~95% within 2 standard deviations
~99.7% within 3 standard deviations

By this rule, any value that lies beyond three standard deviations from the mean is statistically rare and may be considered an outlier.

$$
\text{Upper Outlier Threshold} = \mu + 3\sigma
$$

$$
\text{Lower Outlier Threshold} = \mu – 3\sigma
$$

These thresholds help identify data points that fall outside the typical range — either suspiciously high or unusually low.

Here is a short (external) video on Using the mean and standard deviation to identify outliers. Note that the video uses 2σ for calculating outliers – this is also common, based on distribution of data.

Applying the 3σ Rule in Excel

Now that you understand the statistical definition of an outlier using the Three Sigma Rule, let’s apply it to real data from the Rossmann sales file.

Suppose you’ve opened rossmann-sales.xlsx, and your Sales values are in column D, beginning from cell D2.

Step 1: Compute the Mean

The mean (average) sales value gives you a sense of the central value around which other data points fluctuate. It forms the baseline for comparison.

In an empty cell — say, L1 — enter:

=AVERAGE(D2:D91257)

Label this cell as “Mean” in M1 for clarity.

Step 2: Compute the Standard Deviation

Standard deviation measures how spread out the sales values are around the mean. A high standard deviation suggests high variability — possibly due to store size, location, seasonality, or inconsistent data quality.

In L2, enter:

=STDEV.P(D2:D91257)

Label this cell as “Standard Deviation” in M2.

Note: Use STDEV.P when analyzing the entire population (e.g., all sales data in your file). Use STDEV.S only when working with a sample. We will cover samples later in the module.

Step 3: Set the Outlier Thresholds

You’ll now calculate the upper and lower boundaries beyond which a sales value may be considered a statistical outlier.

In L3 (Upper Threshold):

=L1 + 3*L2

Label this as “Upper Outlier Threshold”.

In L4 (Lower Threshold):

=L1 – 3*L2

Label this as “Lower Outlier Threshold”.

These values represent the outer bounds of what would be expected in a normally distributed dataset. Any value above or below these is flagged for review.

Step 4: Flag Potential Outliers

Create a helper column in your data table — for example, in column J, insert a heading called “Outlier?” in J1.

In J2, enter the following formula:

=OR(D2 > $L$3, D2 < $L$4)

which means if the cell value of D2 is more than Upper Threshold OR less than Lower Threshold, then mark that entry as True. Else, mark it as False.

Copy this formula down for the entire column (e.g., through J91257). This will return:

TRUE for rows where sales exceed the threshold,
FALSE otherwise.

This step gives you a structured, documented way to isolate unusual values for further inspection.

Tip: Instead of deleting these rows immediately, consider filtering or highlighting them for manual review first.

Step 5: Highlight Outliers Visually (Optional)

To support visual review:

Select your Sales column (D2:D91257).
Go to Home → Conditional Formatting → New Rule → Use a formula to determine which cells to format.
Enter the formula:

=OR(D2 > $L$3, D2 < $L$4)

Choose a fill color (e.g., light red) and click OK.

This highlights all outlier values directly in the dataset — a useful visual aid when reviewing data in team discussions or stakeholder sessions.

3.2. Alternative Method: Using Interquartile Range (IQR)

The Three Sigma Rule is appropriate when your data is approximately normally distributed — that is, it follows a bell-shaped curve. However, business data — especially retail sales — is often skewed. Some stores may have vastly different sales patterns due to location, size, or operational differences. In such cases, the Interquartile Range (IQR) method offers a more robust and distribution-independent approach to detecting outliers.

What Is the IQR?

The Interquartile Range (IQR) measures the spread of the middle 50% of the data — the range between the first and third quartiles.

Definitions:

Q1 (First Quartile): The 25th percentile — 25% of the data falls below this value.
Q3 (Third Quartile): The 75th percentile — 75% of the data falls below this value.
IQR = Q3 – Q1

This method ignores the extreme lower and upper 25% of values and focuses on the “core” of your data — making it highly resistant to skewed distributions or existing outliers.

Why Use IQR for Outlier Detection?

Unlike standard deviation-based methods, the IQR method does not assume any specific distribution. It is particularly effective when:

The dataset is not symmetrical, or clearly skewed.
There are known segments with inherently different scales (e.g., high-volume vs. low-volume stores).
You wish to limit the influence of extreme values when determining thresholds.

Detecting Outliers with IQR in Excel

Let’s apply the IQR method to the Sales column in rossmann-sales.xlsx. Your sales data is in D2:D91257.

Step 1: Compute Q1 and Q3

In an empty section of your sheet (e.g., column N), use the following formulas:

Cell	Label	Formula
N2	Q1 (25th percentile)	=QUARTILE.EXC(D2:D91257, 1)
N3	Q3 (75th percentile)	=QUARTILE.EXC(D2:D91257, 3)

Note: Use QUARTILE.EXC to exclude the minimum and maximum values, which improves robustness for outlier detection. If your Excel version lacks QUARTILE.EXC, use PERCENTILE.EXC(D2:D91257, 0.25) and 0.75 respectively.

Step 2: Calculate the IQR

In N4 (labelled “IQR”):

=N3 – N2

This gives the spread of the middle 50% of your data.

Step 3: Determine Outlier Thresholds

Now calculate the upper and lower bounds beyond which a value is considered an outlier:

Cell	Label	Formula
N5	Lower Bound	=N2 – 1.5 * N4
N6	Upper Bound	=N3 + 1.5 * N4

These thresholds are based on the Tukey Rule for outlier detection, which considers values more than 1.5 × IQR outside Q1 and Q3 as outliers.

In more conservative applications, you may use 3 × IQR to flag only extreme outliers. However, 1.5× is widely accepted in exploratory data analysis.

Step 4: Flag Outliers with a Helper Column

Now create a new column in your dataset (e.g., column O) called “IQR Outlier?”. In cell O2, write:

=OR(D2 < $H$5, D2 > $H$6)

Copy this formula down for all rows in your dataset. This formula will return:

TRUE if the sales value is an outlier under the IQR method
FALSE otherwise

You may optionally combine this with the earlier 3σ method to compare the overlap and robustness of both approaches.

Visual Comparison with a Box Plot

A box plot (also called a box-and-whisker plot) visually depicts the IQR and its associated outliers. Here’s how to insert one:

1. Select your Sales

2. Go to Insert → Charts → Box and Whisker.

3. As seen in Figure 5 below, Excel will generate a box plot with:

- A box: from Q1 to Q3
- A line inside the box: the median (Q2)
- Whiskers: extending to the smallest and largest non-outlier values
- Dots or asterisks: for outlier points beyond 1.5×IQR

You are encouraged to interpret this plot critically. A long upper whisker suggests a right-skewed distribution — often the case with sales data from stores with large promotions or higher footfall.

3.3. When to Use IQR vs. 3σ?

Situation	Preferred Method
Normally distributed data	3σ Rule
Skewed or irregular data	IQR
Small datasets	IQR (more robust)
Need for stricter control	IQR with 1.5× or 3× multiplier
Business review setting	IQR + box plot for visual communication

Best Practices

Do not delete outliers automatically. First, interpret them in business context. Could they represent peak holiday sales? Inventory clearance? Promotions?

Use both statistical and visual methods. Flag outliers numerically and inspect them visually before making decisions.

Document all thresholds. Always log which method you used (IQR vs. 3σ), which multiplier, and why.

3.4. Alternate Method: Conditional Formatting

To visualize these outliers:

Select the Sales
Go to Home → Conditional Formatting → Color Scales.
Choose a gradient to highlight high and low values visually.

This won’t precisely match the statistical definition, but it can quickly guide your attention to extremes worth reviewing.

4. Investigating Outliers

Identifying potential outliers using statistical rules is only the beginning. In professional analysis, no number is an error until it is understood. An outlier might represent a data entry mistake—or it might capture a genuine, high-stakes business event. We need to ask:

“Is this outlier real, or is it an error?”

4.1. What Are We Looking For?

At this stage, your core objective is to determine the nature of each outlier:

Is it a real business event or an error?
Does it reflect abnormal but valid performance?
Does it require correction, removal, or deeper explanation in your analysis?

This requires both data reasoning and business understanding. You are not simply cleaning numbers—you are reconstructing the story behind them.

4.2. Example: A Suspicious Sales Spike

Let’s return to the rossmann-sales.xlsx dataset. Suppose the store in row 10 reports a sales value of €120,000, while the dataset’s mean sales hover around €5,500.

Statistically, this row has already been flagged as an outlier under both the 3σ and IQR methods. But what next?

Here is a structured investigation process.

1. Cross-Reference the Store ID

Look at the Store column for the outlier row. Suppose it’s Store 35.

Now open the rossmann-stores.xlsx file and locate information about Store 35.

Key questions to ask:

What is the store type (e.g., A, B, C, D)?
What is the assortment level?
Is this store in a high-footfall location or does it serve a niche market?

Perhaps Store 35 is a Type A store with a wide assortment and is known to handle larger volumes. This context may explain higher sales figures.

2. Check for Promotions or Business Events

In the rossmann-sales.xlsx file, look at columns:

Promo: Was a promotion running that day? (1 indicates yes)
StateHoliday: Was there a holiday?
SchoolHoliday: Was school closed?

Ask:

Was this spike due to a planned promotion?
Was there a holiday-driven shopping surge?

For example, if the high sales occurred on 24 December 2023, and the Promo field is 1, the outlier may reflect a legitimate spike due to Christmas Eve promotion.

3. Review the Open Status

In the Open column:

Was the store open on that day? (1 means open)

This seems obvious, but double-checking prevents situations where data exists for a day the store was officially closed — a likely sign of incorrect entries.

4. Compare Against Nearby Days

Create a small time series view of this store’s sales:

Filter rossmann-sales.xlsx to show only Store 35, then examine the sales figures:

3 days before the outlier
The outlier day
3 days after

Ask:

Is this a one-day anomaly or part of a broader trend?
Was the sales spike followed by a dip (e.g., due to clearance or inventory depletion)?
Or is it part of a multi-day campaign?

Insight: Patterns matter more than points. A single-day outlier may be an error; a trend suggests intention.

5. Examine the Customers Column

If the Customers column is filled, check if the customer count aligns with the sales number.

For example:

Did the number of customers also spike on the same day?
Or did the sales increase dramatically without a corresponding rise in customers?

If customer count stayed flat while sales jumped, this might indicate:

An error in the Sales field (e.g., misplaced decimal point)
A large individual purchase or bulk order
A data processing issue

6. Add Context

To support your investigation:

Use filters to isolate the outlier row and sort by date.
Use a line chart to visualize sales over time for that store.
Insert a calculated column:

=D2/E2

to derive sales per customer and evaluate anomalies in customer behavior.

Be sure to label new columns and calculations clearly. Good analysts make their thinking visible.

7. Consult Domain Knowledge or Stakeholders

Some questions cannot be answered purely through spreadsheet logic.

If after data analysis the value is still unclear:

Ask your business contact: “Were there any special campaigns or pricing experiments at Store 35 during late December 2023?”
Check documentation or marketing calendars, if available.
If you’re working in a company: consult marketing, operations, or IT teams.

4.3. Summary of the Investigation Process

Step	Question
Store metadata	Is this a high-volume or special-format store?
Promotion & holidays	Was the spike timed with a known business event?
Store open status	Was the store officially open on that date?
Neighboring dates	Does the value follow a pattern or stand alone?
Customer data	Are the sales volumes plausible given footfall?
Additional insight	Does business context confirm or contradict the spike?

4.4. Best Practices for Documentation

As you investigate each outlier:

Record your findings in your cleaning log (e.g., in a sheet called cleaning_log).
For each flagged row, include:
- Outlier type (high/low)
- Reason for suspicion
- Source of supporting evidence
- Final action: Keep / Correct / Remove / Flag

Example Entry:

Row	Store	Date	Issue	Investigation Summary	Decision
10	35	2023-12-24	High sales	Confirmed Christmas Eve promotion, Type A store	Keep

5. Treating Outliers

Once you have identified and investigated an outlier, the final step is to decide how to handle it. This is not merely a technical task; it is a matter of analytical judgment. Your decision must balance statistical rigor, business context, and the goals of your analysis.

At this stage, you’re no longer asking, “Is this number unusual?”—you’re asking:

“What is the most appropriate way to treat this data point so that it supports reliable, truthful, and relevant analysis?”

There are four possible actions:

5.1. Keep the Outlier

You should retain an outlier if your investigation suggests it reflects a real business event. This includes:

Spikes in sales due to major promotions
High footfall during holidays
Exceptional transactions (e.g., large corporate order)

Example: In the Rossmann example above, Store 35 showed an unusually high sales figure on 2023-12-24, just before Christmas. If the store was open, a promotion was active (Promo = 1), and customer count was also high, then this outlier reflects real business performance. It should be preserved, not corrected or removed.

Keeping such data ensures your analysis reflects reality, even if it is not “typical.”

However, you must document it clearly so stakeholders understand its context.

Best Practice: In your cleaning log, mark such rows with a comment like “Confirmed holiday sales spike—kept as valid outlier.”

5.2. Correct the Outlier

Some outliers are valid in structure (i.e., no missing values) but clearly contain errors—often due to:

Misplaced decimal points
Incorrect unit conversions
Partial data merges

Example: Suppose Store 87 has a sales entry of €500,000, but typical daily sales for this store are around €5,000. On investigation:

The store was not running a promotion
The customer count is typical (e.g., 550)
No holiday was occurring
The per-customer spend is ~€909

This suggests a likely data entry error. You might consult stakeholders or refer to original source systems. If the likely intent was €5,000, and a zero was added accidentally, you may correct the value.

However, never correct based on guesswork alone. Always document your reasoning and source of authority for the correction.

Best Practice: If correction is made, include a column in your cleaned dataset called Corrected_Sales and never overwrite the original value. Also add a note in the log: “Corrected from 500000 to 5000—decimal error confirmed via finance export.”

5.3. Remove the Outlier

You may choose to remove an outlier entirely when:

It is confirmed to be an error
It cannot be corrected with confidence
It distorts the analysis and is not part of your analytical focus

Example: If Store 96 shows sales of -€2435 (negative revenue), and there is no business context for refunds or returns in the dataset schema, this likely represents a data integrity issue.

If you cannot identify a correction and it is not critical to the analysis, it is appropriate to remove this row.

Caution: Deletion should be a last resort. You are not allowed to delete inconvenient truths—only confirmed errors.

Best Practice: Log the removal with justification: “Negative sales with no refunds field in schema—confirmed error, row deleted.”

5.4. Flag for Stakeholder Review

If you have exhausted your investigation but are still unsure about an outlier, do not act in isolation. Instead, flag the row for business review.

This approach is especially appropriate when:

The value is unusual, but not provably incorrect
You lack context (e.g., localized promotion or pricing test)
You suspect system issues that require IT verification

Example: Store 289 shows zero customers but reports sales of €9,850. This could indicate a missing Customer value, or a bulk purchase not properly logged. You cannot determine the truth without external input.

Best Practice: Add a new column, Review_Flag, and mark it as TRUE. Include a note in your executive summary and cleaning log: “Outlier requires stakeholder input—retained in cleaned file but excluded from aggregate analysis.”

6. A Note on Transparency and Ethics

Your credibility as an analyst depends not only on technical skill but on integrity. It is never acceptable to:

Delete or “fix” data to match expectations
Remove outliers simply because they reduce variance
Hide cleaning decisions from stakeholders

All decisions—especially involving outliers—must be traceable, reproducible, and justifiable. “Data cleaning is not about getting the answer you want. It’s about creating a dataset that reflects reality as accurately as possible.”

7. Excel’s AI – “Analyze Data”

Excel’s Analyze Data (formerly known as “Ideas”) is a built-in AI feature available in modern versions of Microsoft Excel (Microsoft 365). It allows analysts to explore datasets using natural language queries. Rather than building formulas from scratch, you can ask plain English questions like:

“Are there any outliers in Sales?”

Excel will then scan your data using behind-the-scenes statistical techniques to return charts, summaries, or highlighted insights.

But here’s the crucial point: the feature suggests patterns—it does not confirm truths. As an analyst, your responsibility is to verify every suggestion, interpret its business relevance, and decide whether any action is necessary.

7.1. A Practical Scenario: Detecting Outliers in Rossmann Sales

Suppose you’re working with rossmann-sales.xlsx, and you’ve already cleaned and standardized the dataset.

You activate Analyze Data by:

Clicking anywhere inside your dataset (formatted as a Table)
Selecting Home → Analyze Data (found in the ribbon)

A panel appears on the right side of your screen.

Now you type:

“Are there any outliers in the Sales column?”

What Excel Does Behind the Scenes

It calculates descriptive statistics: mean, standard deviation, quartiles
It looks for values significantly different from the rest of the dataset
It creates charts (like histograms or line graphs) and summaries such as:

“For ‘DayOfWeek: 6’, ‘Sales’ changed significantly from ‘Date’: 14.12.22 to 21.12.22 with 6 outliers.”

Consider Figure 6 that shows the prompt and the results.

Note: This is not a final verdict. It is a lead—an insight you must investigate and validate using your domain knowledge and Excel tools (e.g., calculating standard deviation, applying filters, checking against store promotions or holidays).

7.2. Step-by-Step: From Insight to Analyst Judgment

Let’s walk through the complete thinking process.

1. Ask a Specific Question

Avoid vague prompts. Instead of “What’s wrong with this data?”, ask:

“Are there stores with unusually high sales?”
“Which stores had the largest weekly fluctuation in customer counts?”
“Are there dates where Sales values spike significantly?”

2. Interpret the Output

Excel might highlight that Store 98 had €72,000 in sales on a single day—much higher than usual. Is this an error? Or a holiday spike?

At this stage, treat this like a hypothesis, not a conclusion.

3. Verify with Standard Techniques

Use traditional Excel tools to confirm or refute the insight:

Use AVERAGE() and P() to calculate the normal range for that store
Use conditional formatting to highlight extreme values
Cross-reference with the Promo and StateHoliday fields—was there a special event on that date?

4. Check for Patterns

Is the anomaly isolated?

Do other stores show similar spikes on the same day? (Suggests a holiday)
Does this store often show such variation? (Suggests volatility or error)

5. Document Your Conclusion

Whether you keep the value, adjust it, or flag it for review, record your decision in the cleaning log.

7.3. Best Practices

Best Practice	Why It Matters
Ask precise questions	AI can only act on what you specify. “Are there outliers in Sales?” is better than “Is this data okay?”
Always validate	Analyze Data offers clues, not final answers
Combine with Excel formulas	Use IF(), STDEV.P(), Z-score, or filters to cross-check results
Keep business context in mind	A spike in December may be expected; a spike in February might signal an error
Record every accepted insight	Manual documentation is essential—Analyze Data doesn’t track decisions

7.4. Limitations You Must Account For

Excel’s AI can surface suggestions faster than you can scroll—but it does not know:

Which stores had promotions
Whether missing values are “acceptable”
Whether a sales figure of €48,000 is realistic for a given store

You still need:

A structured workflow
Analyst intuition
Business context
Explicit documentation

Excel’s Analyze Data feature offers a fast way to explore your dataset for patterns and potential anomalies. But it does not know your business. It cannot distinguish between a holiday promotion and a data entry error. Only you can do that.

The most effective analysts treat AI-generated insights as questions to explore—not conclusions to accept.

You bring the business understanding, the investigative process, and the professional responsibility to decide what stays, what changes, and what gets flagged.

As with all AI tools, use it as a partner in thinking, not a substitute for judgment.

Summary

In this lesson, you’ve learned:

How to identify, interpret and handle outliers.
The role of AI tools in data cleaning, including their capabilities and limitations.

In the next lesson, you’ll get started with descriptive analysis.

Criteria	Exceeds Expectation	Meets Expectation	Needs Improvement	Incomplete / Off-Track
1. Outlier Identification and Handling	Demonstrates precise control over Excel tools and formulas. Tailors outlier handling logic based on context (e.g., preserving blanks for holidays). Inventive solutions backed by sound reasoning. Decisions are annotated and reproducible.	Executes all major outlier identification and handling steps accurately and fully Cleaning log is detailed, professional, and well-justified.	Some actions incomplete or applied inconsistently. Rationale unclear or generic. May rely on manual steps without clear logic. Cleaning log is vague or fragmented.	Outlier detection and handling is minimal, incorrect, or irrelevant. Key issues unresolved. Log missing or meaningless.
2. AI Literacy	Uses Analyze Data or other Excel AI features effectively. Generates relevant questions. Cross-validates results. Critically reflects on AI’s strengths and boundaries. Explores co-pilot and compares results with that of built-in “Analyze Data” feature.	Uses Analyze Data meaningfully. Reflects briefly on accuracy or limitations. Tries to validate results.	Uses AI tool but with minimal insight or reflection. Repeats suggestions without evaluation.	Not attempted or demonstrates misunderstanding of the tool.

Got Feedback?

Drop us a line here.

Contact

Talk to us

Have questions or feedback about Lumen? We’d love to hear from you.