Lesson 4 - Cleaning and Preparing Data
Transform messy, incomplete data into reliable insights.
Estimated Read Time: 1.5-2 Hours
Learning Goals
Technical & Analytical
- Identify and address common data quality issues: missing values, duplicates, inconsistent formats, and outliers.
- Apply systematic data cleaning techniques using Excel’s built-in tools and formulas.
- Validate data integrity through cross-checking, range testing, and logical consistency checks.
- Document cleaning decisions to ensure transparency and reproducibility in your analytical process.
- Learn to think like an analyst: ask better questions, validate assumptions, and make methodical, transparent decisions.
Business Impact
- Recognize how data quality issues directly impact business decisions and organizational trust.
Personal Branding & Career Development
- Build cleaning logs that reflect your analytical discipline.
- Document your decision-making process to gain stakeholder trust.
- Treat data cleaning as professional storytelling — and a source of competitive edge.
1. The Reality of Real-World Data
Data is rarely pristine. In fact, most analysts spend 60-80% of their time cleaning and preparing data before any meaningful analysis can begin. This isn’t a sign of failure — it’s the reality of working with information generated by imperfect systems, human input, and complex business processes.
Effective data cleaning is foundational for successful EDA. This process involves handling missing values, correcting errors, and ensuring consistency, thereby enabling accurate visualizations and reliable insights during the exploratory phase.
Consider Rossmann’s daily operations: thousands of transactions across hundreds of stores, recorded by different systems, entered by various staff members, and transmitted across networks. Each step introduces potential for inconsistency, errors, or missing information. Yet business decisions must be made with this imperfect data.
Your ability to systematically identify, address, and document data quality issues is what transforms raw information into reliable business insights. This lesson will teach you to approach messy data with structured thinking, appropriate tools, and professional judgment.
2. Understanding Data Quality Issues
Before diving into cleaning techniques, you must first learn to recognize the most common data quality problems. Like a skilled detective, an analyst develops the ability to spot inconsistencies, anomalies, and missing pieces that could undermine analysis.
2.1. Missing Values
Missing data occurs when expected information is absent from your dataset. In the Rossmann data, you might find:
- Sales figures missing for certain dates
- Customer counts blank (or zero) during store closures
- Promotional information incomplete for specific periods
Missing values aren’t always errors — sometimes they’re meaningful. A missing sales figure might indicate a store closure, while a blank promotional flag could mean no promotion was active. The key is understanding the difference between missing information and meaningful absences.
What Counts as Missing?
When we speak of “missing data,” we often imagine something technical or hidden. But in Excel, the first kind of missing data you will encounter is simply: nothing. A cell left empty.
This kind of missingness is common in real operational data. It can occur for many reasons:
- A store didn’t report a value.
- The system didn’t collect it at the time.
- The field didn’t apply to that store.
- The data was lost or never entered.
Your task isn’t just to spot these blank cells. It’s to interpret what they might mean — and what to do about them.
2.2. Duplicate Records
Duplicate entries occur when the same information appears multiple times in your dataset. This could result from:
- System errors during data export
- Multiple data sources being combined without proper deduplication
- Manual entry mistakes
Duplicates can severely skew your analysis, artificially inflating totals and distorting trends. A single day’s sales appearing twice would double the apparent revenue for that period.
2.3. Inconsistent Formats
Data inconsistency manifests in various ways:
- Date formats mixed between DD/MM/YYYY and MM/DD/YYYY
- Text entries with different capitalization (e.g., “Store A” vs “store a”)
- Numbers formatted as text, preventing calculations
- Currency symbols inconsistently applied
These inconsistencies prevent accurate sorting, filtering, and analysis. Excel might treat “12/03/2023” and “2023-03-12” as different entries, even though they represent the same date.
2.4. Outliers and Anomalies
Outliers are data points that appear significantly different from the majority of observations. In retail sales data, outliers might indicate:
- Data entry errors (e.g., sales of €500,000 instead of €5,000)
- Exceptional events (e.g., grand opening sales, clearance events)
- System malfunctions or processing errors
The challenge is distinguishing between legitimate outliers that reflect real business events and errors that should be corrected or removed. We will cover outliers in detail in lesson 5.
3. Data Cleaning Fundamentals
Effective data cleaning follows a systematic approach. Rather than randomly addressing issues as you encounter them, professional analysts follow structured workflows that ensure comprehensive, documented, and reproducible cleaning processes.
3.1. The Cleaning Workflow
- Assess and Document: Before making any changes, examine your data thoroughly and document what you find
- Prioritize Issues: Not all problems require immediate attention — focus on issues that would most impact your analysis
- Clean Systematically: Address issues in logical order, from most fundamental to most specific
- Validate Results: After cleaning, verify that your changes achieved the intended results
- Document Decisions: Record what you changed and why, ensuring transparency and reproducibility
3.2. Setting Up Your Cleaning Environment
Before beginning any cleaning process, establish a structured workspace:
Create a Cleaning Log
Add a new worksheet named “cleaning_log” to document every significant change you make. It is recommended to include:
- Date and time of cleaning
- Issue identified
- Action taken
- Rationale for the decision
- Rows/columns affected
Preserve Original Data
Never work directly on your original dataset. Create a copy named “data_cleaned” or similar, keeping the original intact for reference.
Use Consistent Naming
Establish clear naming conventions for:
- Worksheet tabs (e.g., “raw_data”, “cleaned_data”, “analysis”)
- Column headers (consistent capitalization and formatting)
- File versions (e.g., “rossmann_analysis_v1.xlsx”)
Note that creating a cleaning log is part of best practice and establishes trust in your work. Imagine you removed some data points while cleaning without leaving any comment: anyone who later looks at the data will be confused about what changed and why. Hence, create and maintain these logs.
Working with the Rossmann Dataset
Let’s apply these principles to the Rossmann dataset you’ve been working with. Open both rossmann-sales.xlsx and rossmann-stores.xlsx to begin your cleaning process.
Please note that a few numbers/values used in this lesson are indicative only and do not represent the real Rossmann dataset values. Treat the examples as illustrative.
4. Initial Data Assessment
Before making any changes, spend time understanding what you’re working with:
Examine the Sales Data Structure
- How many rows and columns do you have?
- What date range does the data cover?
- Are all expected stores represented?
- Do you notice any immediately obvious issues?
Review the Stores Data
- Does every store in the sales data have corresponding store information?
- Are store types and assortments consistently formatted?
- Are there any missing values in critical fields?
Cross-Reference the Datasets
- Are there stores in one dataset but not the other?
- Do store identifiers match consistently across both files?
5. Identifying Missing Values
Let’s consider rossmann-stores.xlsx.
You’ll see one row per store, with columns describing its characteristics. Focus your attention on the following:
- CompetitionDistance
- CompetitionOpenSinceMonth
- CompetitionOpenSinceYear
- Promo2SinceWeek
- Promo2SinceYear
- PromoInterval
Excel provides several ways to identify missing data:
Option 1 – Visual Inspection
Scroll through your data looking for blank cells, “N/A” entries, or other indicators of missing information. Pay particular attention to:
- Required fields like Store ID, Date, and Sales
- Patterns in missing data (e.g., all Sundays missing, specific stores affected)
Option 2 – Using Excel’s Go To Special Feature
- Select your data range
- Press Ctrl+G (or F5) to open the “Go To” dialog
- Click “Special”
- Select “Blanks” and click OK
This will highlight all empty cells in your selection, making missing values immediately visible – see Figure 1.
Option 3 – Filtering for Missing Values
Use Excel’s filter feature to identify missing data:
- Apply filters to your dataset
- In relevant columns, look for “(Blanks)” in the filter dropdown
- Select “(Blanks)” to display only rows with missing values in that column
Option 4 – Using COUNTBLANK()
You can count the number of missing cells in a range, e.g. for rows (store #) or columns (CompetitionOpenSinceMonth).
For example, to check how many entries are missing for store #1, type the following in a blank cell (K2):
=COUNTBLANK(A2:J2)
This returns 3, indicating that three entries are missing for store #1.
Similarly, for all missing entries of column CompetitionOpenSinceMonth, which is column E, you can enter the formula:
=COUNTBLANK(E2:E1116)
This gives you 354, indicating that the column has 354 missing values.
If these numbers are high (e.g., 60% or more blanks), the row/column might require special handling – review it for context before keeping or excluding it.
This function becomes especially powerful when used across:
- Key columns (e.g., date fields, distance fields)
- Newly added rows during incremental data updates
- Merged datasets where gaps are expected
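The COUNTBLANK() logic above is also easy to script once you move beyond Excel. Here is a minimal Python sketch of the same per-column blank count; the sample rows and values are invented for illustration and do not come from the real dataset:

```python
# Count blank entries per column, mirroring Excel's COUNTBLANK().
# Sample rows are invented; None and "" both count as blank.
rows = [
    {"Store": 1, "CompetitionDistance": 1270, "CompetitionOpenSinceMonth": None},
    {"Store": 2, "CompetitionDistance": None, "CompetitionOpenSinceMonth": 11},
    {"Store": 3, "CompetitionDistance": 14130, "CompetitionOpenSinceMonth": None},
]

def count_blanks(rows, column):
    """Return how many rows have a blank (None or empty string) in `column`."""
    return sum(1 for r in rows if r.get(column) in (None, ""))

print(count_blanks(rows, "CompetitionOpenSinceMonth"))  # 2
print(count_blanks(rows, "CompetitionDistance"))        # 1
```

As in Excel, a high blank count for a column is a prompt to investigate context before deciding to fill, keep, or exclude it.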
5.1. Why Is the Data Missing?
Now that you’ve identified where values are missing, pause to ask why.
Here are three (hypothetical) examples you might notice in the Rossmann data:
Store | CompetitionDistance | CompetitionOpenSinceMonth | Promo2 | Promo2SinceWeek |
18 | (blank) | (blank) | 0 | (blank) |
24 | 1630 | 9 | 0 | (blank) |
35 | 2435 | (blank) | 1 | 18 |
What could this mean?
- Is CompetitionDistance blank because the store has no competitors nearby?
- Is Promo2SinceWeek blank because the store never participated in Promo2?
- Are some blanks the result of data entry issues, or are they legitimate absences?
These questions are not side-notes. They are the work.
5.2. Addressing Missing Values
Once you’ve identified missing values, you must decide how to handle them. The appropriate approach depends on business context and the nature of the missing data.
Let’s take three of the blank columns as examples.
1. CompetitionDistance
This field tells us how far away the nearest competing store is. For most rows, this is a numeric value — such as 215.0 or 4890.0. But for some stores, the value is missing.
Possible interpretations:
- The data was not recorded.
- The store is in a remote area where competition isn’t relevant.
- The store is new, and the data hasn’t been updated yet.
As an analyst, you might:
- Fill in the missing values with the average competition distance across similar stores.
- Fill with a high number (e.g., 99999) to signal “unknown” or “assumed distant”.
- Leave them blank and exclude these rows from distance-based analysis.
All three options are valid — but each should be documented.
2. Promo2SinceWeek and Promo2SinceYear
You may notice that when Promo2 = 0, both of these fields are blank.
This suggests that missingness here is not random — it is structured.
In this case, you might decide:
- Not to fill these blanks at all.
- Or to fill them with a placeholder like “N/A” (not applicable) in a cleaned version.
Whether you fill or not matters, but equally important is understanding why the value is missing and making that clear in your cleaning log.
3. CompetitionOpenSinceMonth and CompetitionOpenSinceYear
These two columns work together — they tell us when a nearby competitor opened.
In some rows, one or both of these fields are missing. That creates ambiguity.
For example:
- If CompetitionDistance is given, but the opening date is missing, we might wonder if the competitor existed at the time of the store’s opening.
- If both are missing, we may consider removing the row, or flagging it for manual review.
Again: no automated tool can tell you what to do. You decide, and you document.
Now you have multiple options to handle missing values:
Option 1: Fill with Calculated Values – Sometimes you can derive missing values from other information:
- Calculate missing customer counts from sales and average transaction values
- Derive missing dates from sequence patterns
- Use store averages to fill missing promotional information
Option 2: Use Domain Knowledge – Apply business understanding to make informed decisions:
- If sales are missing but the store shows “Open = 0”, sales should be 0
- If customer count is missing but sales exist, estimate based on average transaction value
- If promotional information is missing, assume no promotion unless evidence suggests otherwise
Option 3: Flag for Further Investigation – When missing data might indicate important business issues:
- Large gaps in sales data might reveal system problems
- Missing store information could indicate new openings or closures
- Promotional data gaps might reflect incomplete campaign tracking
Option 4: Delete Records – Appropriate when:
- Missing data represents a small percentage of your dataset
- The missing information is not critical to your analysis
- You have sufficient remaining data for reliable analysis
- Use this option sparingly: don’t discard a row or column just because a few of its entries are missing. Your current analysis may not need the missing field, but other analyses might, and deleting the row also throws away the data that is present.
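To make Option 1 concrete, here is a small Python sketch of a median fill for CompetitionDistance, mirroring the approach of replacing blanks with a typical value; the distances below are invented for illustration:

```python
# Fill missing CompetitionDistance values with the median of the known ones.
# Values are invented for illustration; None marks a blank cell.
from statistics import median

distances = [215.0, None, 4890.0, 1630.0, None, 2435.0]

known = [d for d in distances if d is not None]
fill_value = median(known)  # 2032.5 for this sample
filled = [d if d is not None else fill_value for d in distances]

print(fill_value)
print(filled)
```

Whatever fill strategy you choose, record it in the cleaning log so readers know which values are observed and which are imputed.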
Here’s a real-world lesson worth remembering: deleting data, even accidentally, can create much bigger blind spots than you’d expect.
5.3. Recording Your Decisions
Create a worksheet in your Excel file named cleaning_log.
Here’s how it might look:
Column | Type of Missingness | Decision | Rationale |
CompetitionDistance | True blank | Filled with median for same StoreType | To preserve records and allow fair comparison |
Promo2SinceWeek | Blank when Promo2 = 0 | Left as blank | Promo never ran, so no start week is expected |
CompetitionOpenSinceMonth | Some partial blanks | Removed rows with only one of month/year present | Incomplete time info makes analysis unreliable |
You’ll build on this log throughout your cleaning process.
6. Handling Duplicates
In structured datasets, each row is meant to represent a unique observation — a distinct transaction, store, or event. When the same row appears more than once, your counts, totals, and trends can become unreliable. But not all repeated entries are duplicates. Learning to distinguish between true duplicates and legitimate repetition is an essential analyst skill.
In this section, we’ll work with the rossmann-sales.xlsx file to:
- Understand how duplicates can arise
- Determine whether they are errors or valid data
- Apply Excel techniques to identify and manage them responsibly
6.1. What Is a Duplicate?
At first glance, two rows might look identical — same store, same date, same sales. But before deleting anything, we must ask:
Does this row truly repeat a recorded event, or is it just coincidentally similar?
In real data:
- Some repetition is normal (e.g., two stores with the same sales on a given day).
- But duplicate records for the same store on the same date — with identical values in every column — usually indicate accidental duplication.
6.2. Business Impact of Unchecked Duplicates
Let’s suppose one sales record has been accidentally entered twice.
Store | Date | Sales | Customers | Promo |
12 | 2024-07-10 | 7,854 | 823 | 1 |
12 | 2024-07-10 | 7,854 | 823 | 1 |
If this goes unnoticed:
- Total sales will be inflated by €7,854
- Daily averages will be distorted
- Forecasting models will overweight this data point
In business settings, such distortions can lead to poor inventory planning, overstaffing, or misjudged marketing success.
6.3. Step-by-Step: How to Check for Duplicates in Excel
Let’s walk through how to find and handle duplicates in your Rossmann dataset.
Step 1: Identify Your Uniqueness Criteria
Before Excel can tell you what’s duplicated, you must define what counts as a duplicate.
In rossmann-sales.xlsx, a clean row should represent one day of sales for one store. So the combination of Store and Date should be unique.
We’ll check whether this pair appears more than once.
Step 2: Use a Formula to Flag Duplicates
Add a helper column to the right of your table and name it Duplicate_Check. Then enter the following formula in the first data row:
=COUNTIFS(A:A; A2; C:C; C2)
Explanation:
- Column A: Store
- Column C: Date
- The formula counts how many times this exact Store-Date pair appears.
If the result is greater than 1, that record is duplicated.
Drag the formula down the entire column.
Reminder: Formula Separators Depend on Regional Settings
If Excel gives you an error when entering the COUNTIFS() formula, the issue might not be the formula itself — but your system’s regional formatting. In many non-European versions of Excel, you’ll need to use comma (,) instead of semi-colon (;). So the formula would become:
=COUNTIFS(A:A, A2, C:C, C2)
Please verify what works for you before proceeding.
As an analyst, it’s important to be aware of these subtle environmental differences — they affect formula behavior, date parsing, and number formatting across systems and teams.
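For readers who later script these checks, the COUNTIFS() duplicate flag has a direct Python equivalent; the rows below are invented examples in the spirit of this section:

```python
# Flag duplicate (Store, Date) pairs, mirroring the COUNTIFS() check:
# each row gets the count of how often its Store-Date pair appears.
from collections import Counter

rows = [
    (12, "2024-07-10", 7854),
    (12, "2024-07-10", 7854),   # accidental repeat
    (24, "2024-07-10", 5230),
]

pair_counts = Counter((store, date) for store, date, _ in rows)
flags = [pair_counts[(store, date)] for store, date, _ in rows]

print(flags)  # any value > 1 marks a duplicate candidate
```

Just as with the Excel formula, a flag greater than 1 identifies a candidate for review, not an automatic deletion.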
Step 3: Filter to Review Duplicate Candidates
Now:
- Apply a filter to the Duplicate_Check column.
- Filter for rows where the value is >1.
- Carefully inspect a few examples:
- Are all columns identical? (Sales, Customers, Promo, etc.)
- Are there slight differences in the numbers?
If all fields are identical, the second (or third) entry is likely a mistake.
If there are differences, further investigation is needed.
Analyst Judgment in Practice
Let’s say you find this:
Store | Date | Sales | Customers | Promo |
24 | 2024-08-01 | 5,230 | 612 | 0 |
24 | 2024-08-01 | 5,230 | 612 | 0 |
Identical in every column — likely an accidental repeat during data export. You would mark one for removal.
Now consider:
Store | Date | Sales | Customers | Promo |
35 | 2024-08-15 | 3,490 | 412 | 1 |
35 | 2024-08-15 | 3,490 | 409 | 1 |
Sales match, but Customers is off by three. Now the situation is ambiguous. This might be a data entry correction — or it might indicate something deeper, like a merge error between systems.
In such cases, don’t delete immediately. Flag for review or consult the data source team.
Step 4: Remove Confirmed Duplicates
Once you’ve identified exact duplicates, you can remove them using Excel’s built-in feature:
Using “Remove Duplicates”
- Select your data range (including headers).
- Go to the Data tab.
- Click Remove Duplicates.
- Select all columns (to ensure complete duplicates only).
- Click OK.
Excel will remove exact duplicates and report how many were found and deleted.
Caution:
- This method only removes rows where every column is identical. It will not catch near-duplicates.
- Always create a backup or copy of your data before running this step.
Step 5: Document What You Did
In your cleaning_log worksheet, record:
Issue | Detection Method | Rows Affected | Action Taken | Notes |
Duplicate sales entries (Store + Date) | COUNTIFS on Store and Date | 7 | Removed one duplicate per match | All fields identical. Reviewed visually. |
A log like this ensures:
- Reproducibility for others
- Transparency in your decisions
- A clear trail in case someone questions the numbers later
6.4. When Not to Remove Duplicates
If rows are flagged but contain subtle differences, or if they reflect genuine business operations (e.g., multiple shifts per day, merged records), hold off.
Instead:
- Add a comment or flag.
- Create a column such as Needs_Review.
- Inform your manager or client if it may affect decisions.
7. Standardizing Formats
“Dirty data” doesn’t refer only to missing values or duplicates. In practice, some of the most disruptive issues stem from inconsistent formatting. These inconsistencies may not be obvious at first glance, yet they interfere with filtering, grouping, formula logic, and merging datasets.
You may have the right data in front of you. But if Excel does not interpret it in the way you expect, the results of your analysis can be misleading or completely wrong.
In this section, you will learn how to identify and resolve common formatting inconsistencies. You will also begin to develop the habit of asking, “How is Excel actually reading this value?” — a question that sits at the core of analytical reliability.
We will work with real columns from the Rossmann datasets and walk through the steps of preparing the data for decision-making.
7.1. Dates
The Date column in rossmann-sales.xlsx may appear consistent on the surface, but we must always confirm whether Excel recognizes each entry as a true date value — not just as text that happens to look like a date.
For example, suppose you see the value 12/03/2024 in a cell. If Excel is treating this value as a string of text, then sorting, filtering, and date-based calculations will not work as expected. Worse, Excel may interpret that same string differently depending on your regional settings. In some countries, 12/03/2024 means March 12th. In others, it means December 3rd.
To determine whether Excel recognizes the value as a date, click the cell and check both the formula bar and the alignment: by default, Excel right-aligns real dates (which it stores internally as serial numbers) and left-aligns text. A leading apostrophe in the formula bar (e.g., ’12/03/2024) is a sure sign that Excel is treating the value as text.
You can also open the Format Cells dialog to confirm whether the cell is formatted as a date, as shown in Figure 3.
Converting Text to Real Dates
If you identify date entries that are stored as text, you can convert them using Excel’s DATEVALUE() function. For example, if cell C2 contains a text date, you can enter the following formula in a new column:
=DATEVALUE(C2)
Then, format the result as a readable date by right-clicking the cell, selecting Format Cells, choosing Date, and selecting a standard format such as YYYY-MM-DD.
Make sure the entire column follows the same format — not just in terms of appearance, but in terms of how Excel processes it.
A consistent date format ensures that your filters, timelines, and formulas behave predictably across the entire dataset.
The Analyst’s Question
Let’s consider the Date column in rossmann-sales.xlsx.
“Are all entries in the Date column recognized by Excel as real dates, or are some of them stored as text?”
Here’s how to investigate:
- In a blank column (e.g., column J), enter:
=ISNUMBER(C2)
This checks whether the value in C2, which should contain a date, is stored as a number (which is how Excel encodes actual dates). If it returns TRUE, the cell holds a valid Excel date. If it returns FALSE, the cell holds text and must be converted. Tip: add a single quote (‘) in front of the value in C2 and ISNUMBER(C2) will return FALSE, because Excel now interprets the cell as text.
- Drag the formula down to check the entire column.
What to Expect?
In this dataset, the dates are correctly recognized as Excel dates — you’ll see TRUE throughout.
But this is not always the case in real datasets. If Excel ever fails to recognize a date (for example, due to inconsistent separators or localization issues), you can convert it using:
=DATEVALUE(C2)
Then format the result properly:
- Right-click → Format Cells → Date → Select YYYY-MM-DD.
Tip: When dates are imported from CSVs or scraped from websites, they often lose their native format. Always check — never assume.
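If you ever need to run this date check outside Excel, a small Python sketch of the same “is this really a date?” question might look like this (the ISO format here is an assumption; adjust it to match your data):

```python
# Check whether date strings parse under an expected format,
# mirroring the spirit of the ISNUMBER()/DATEVALUE() workflow.
from datetime import datetime

def is_valid_date(text, fmt="%Y-%m-%d"):
    """Return True if `text` parses as a date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

print(is_valid_date("2024-03-12"))   # True
print(is_valid_date("12/03/2024"))   # False under the ISO format
```

A FALSE-style result is the scripted analogue of ISNUMBER() returning FALSE: the value needs conversion before any date arithmetic.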
7.2. Text Fields: Case, Spacing, and Inconsistencies
Several columns in rossmann-stores.xlsx — such as StoreType and Assortment — are categorical variables stored as text. At first glance, values like “a” and “A”, or “a” and “a ” (with a trailing space), may seem interchangeable, but Excel can treat them as different entries. This can fragment your analysis, causing a pivot table to show what looks like duplicate categories, or making lookup formulas return unexpected results.
Let’s take the StoreType column as an example. As an analyst preparing this data for summary reporting, you need to make sure that all values follow a consistent format — for instance, all uppercase with no leading or trailing spaces.
To achieve this, Excel provides several helpful text-cleaning functions. You can apply them one by one, or combine them into a single cleaning formula.
- =UPPER(A2) converts all characters to uppercase.
- =LOWER(A2) converts them to lowercase.
- =PROPER(A2) capitalizes the first letter of each word.
- =TRIM(A2) removes any leading or trailing spaces.
- =CLEAN(A2) removes non-printable characters.
In practice, you might combine these functions like this:
=UPPER(TRIM(CLEAN(A2)))
You can apply this formula in a new column, then copy the cleaned values over the originals using Paste Values. This ensures that every entry is formatted consistently, without accidentally deleting the original data before verifying your changes.
Always scan through your text fields after cleaning to confirm that the transformation achieved the intended result. Subtle differences — such as non-breaking spaces copied from other software — may not be visible on screen, but can still interfere with grouping or filtering.
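The UPPER(TRIM(CLEAN(...))) combination can be mirrored in Python if you later automate this cleanup; this sketch treats non-printable characters and surrounding spaces the same way:

```python
# Normalize a categorical text value: drop non-printable characters,
# trim surrounding whitespace, and uppercase -- like UPPER(TRIM(CLEAN(A2))).
def normalize(value):
    cleaned = "".join(ch for ch in value if ch.isprintable())
    return cleaned.strip().upper()

print(normalize("  store a\x07 "))                    # "STORE A"
print(normalize("Store A") == normalize("store a"))   # True
```

After normalization, values that only differed in case, spacing, or hidden characters collapse into one category, which is exactly what grouping and filtering need.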
7.3. Numbers: Are They Truly Numeric?
Some columns in rossmann-stores.xlsx — such as CompetitionDistance — may contain numbers that appear normal but are actually stored as text. Just like the dates discussed above, these values may look fine visually, but Excel will silently ignore them in calculations like AVERAGE or SUM, or when sorting from smallest to largest.
To check whether Excel recognizes a number as numeric, use the formula:
=ISNUMBER(A2)
If the result is FALSE, Excel is treating the value as text — even if it contains digits. Another visual cue is alignment: by default, Excel aligns text to the left and numbers to the right.
To convert a text-based number into a real number, you can use:
=VALUE(A2)
Once you’ve confirmed that the numbers are properly recognized, you can format them consistently. Right-click the column, choose Format Cells, and select the appropriate format — such as Number with two decimal places or Currency, depending on context.
Tip: Text values often get pasted in accidentally from PDFs, web sources, or Excel files using different regional settings.
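Outside Excel, the VALUE()-style conversion can be sketched like this; values that do not convert cleanly are returned as None so they can be flagged for review rather than silently coerced:

```python
# Convert text-based numbers to real numbers, like Excel's VALUE().
def to_number(value):
    """Return a float if `value` converts cleanly, else None for review."""
    try:
        return float(str(value).strip())
    except ValueError:
        return None

print(to_number(" 1270 "))   # 1270.0
print(to_number("n/a"))      # None -> flag for manual review
```

Returning None instead of raising an error keeps the cleaning pass running while still leaving a visible trail of problem values.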
7.4. Decimal and Currency Formats
If you are analyzing monetary values — such as average daily sales or promotional uplift — you will often work with decimal numbers. Be aware that some systems use a comma as the decimal separator (e.g., 3,14), while others use a period (e.g., 3.14).
Excel inherits these conventions from your operating system’s regional settings. If your dataset includes values copied from different sources (e.g., one in German settings, another in English), you may encounter inconsistencies.
To ensure proper numeric handling, first confirm that all values are being interpreted as numbers. Then standardize the decimal format and apply a consistent currency symbol, if appropriate.
- Select the column
- Right-click → Format Cells → Choose Number or Currency
- Set the number of decimal places
- Avoid mixing units (e.g., EUR and USD) unless your analysis accounts for it explicitly
You should also be cautious of hidden currency symbols or formatting that may have been pasted in from external sources — especially if your data was exported from a PDF or a web-based report.
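If you ever normalize decimal-comma values in a script rather than through regional settings, a minimal sketch might look like this (it assumes no thousands separators are present in the input):

```python
# Normalize decimal-comma values (e.g. German "3,14") to decimal-point floats.
# A sketch only: it assumes no thousands separators such as "1.234,56".
def parse_decimal(text):
    return float(text.strip().replace(",", "."))

print(parse_decimal("3,14"))   # 3.14
print(parse_decimal("3.14"))   # 3.14
```

Mixed sources with thousands separators need more careful handling; as always, inspect a sample of the raw values before converting in bulk.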
7.5. Reflection
Take a moment to reflect on the following questions. Write down your responses in your journal or course notes — they will become part of your career story.
- Have you ever worked with data that looked fine until you tried to filter, calculate, or merge?
- What subtle formatting problems might go unnoticed in a dataset like Rossmann’s — and how could they affect a business decision?
- As a professional analyst, how would you explain the importance of format consistency to a manager who doesn’t use Excel?
Your ability to notice what others overlook — and to explain it clearly — is part of what sets you apart in an analytics career.
8. Data Validation and Quality Checks
Before any dataset can be considered analysis-ready, it must pass a set of structured validation checks. These checks verify that your cleaning process has not introduced new problems, that the cleaned data respects business logic, and that it is suitable for decision-making.
This step reflects professional discipline. It ensures that the dataset is not only neat—but also trustworthy.
Let’s walk through a couple of essential types of validation, using the Rossmann dataset as our context.
8.1. Logical Consistency Checks
Your data must make sense across fields. This means the relationships between columns should follow expected business rules.
Example 1: Store Status vs. Sales
In the Rossmann sales.xlsx file, a store marked as closed (Open = 0) should not report any sales. If it does, this indicates a logic error.
Validation Formula:
In a new column, enter:
=IF(AND(F2=0;D2>0);"Inconsistent";"OK")
where column D is Sales and column F is Open (swap the semicolons for commas if your regional settings require it).
Any row flagged as “Inconsistent” requires investigation.
Example 2: Customers and Sales
If Sales is positive but Customers is missing or zero, you must ask: how were those sales made? Did the Customer field fail to record footfall? Or is the sales figure incorrect?
Here, you are cross-validating plausibility, not just format.
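The closed-store consistency rule can be expressed in a few lines of Python as well; the rows below are invented for illustration:

```python
# Flag rows where a closed store (Open == 0) still reports sales,
# mirroring =IF(AND(F2=0;D2>0);"Inconsistent";"OK").
rows = [
    {"Store": 1, "Open": 1, "Sales": 5263},
    {"Store": 2, "Open": 0, "Sales": 0},
    {"Store": 3, "Open": 0, "Sales": 812},  # closed but selling: inconsistent
]

flags = ["Inconsistent" if r["Open"] == 0 and r["Sales"] > 0 else "OK"
         for r in rows]

print(flags)
```

Each flagged row should be investigated, not auto-corrected: the error may sit in the Open flag, the Sales figure, or the upstream system.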
8.2. Completeness Verification
You need to confirm that critical fields are fully populated—especially those that will be used in grouping, filtering, or aggregation.
Example: Missing Store Metadata
If a store in sales.xlsx does not have a corresponding entry in store.xlsx, your regional or assortment-based analysis will be incomplete or incorrect.
Validation Strategy:
- Use VLOOKUP() or XLOOKUP() to match Store IDs across datasets (We will cover Lookups later in the module).
- Flag rows in sales.xlsx where the store metadata returns an error.
Sample Formula:
=IF(ISNA(XLOOKUP([@Store],store!A:A,store!B:B)),"Missing Store Info","OK")
(where [@Store] refers to the Store column, column A, in sales.xlsx).
This confirms whether each transaction has an associated store type.
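The same completeness check can be sketched as a set difference in Python; the store IDs below are invented for illustration:

```python
# Cross-check that every store in the sales data has metadata,
# mirroring the XLOOKUP() completeness check with a set difference.
sales_store_ids = {1, 2, 3, 7}          # invented IDs for illustration
metadata_store_ids = {1, 2, 3, 4, 5}

missing_metadata = sorted(sales_store_ids - metadata_store_ids)
print(missing_metadata)  # these stores would be flagged "Missing Store Info"
```

An empty result confirms that every transaction can be joined to its store metadata before any regional or assortment-based analysis begins.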
9. Documentation and Communication
Data cleaning is not merely a technical task. It is an analytical decision-making process. Each transformation—whether it’s filling a missing value, removing a duplicate, or resolving an inconsistency—is based on a series of judgments. Professional analysts do not keep these decisions in their heads. They make them visible.
In this section, we focus on two key outcomes:
- A clear, structured cleaning log that captures what was changed, why, and how.
- A concise, business-focused executive summary that communicates the implications of your cleaning work to non-technical stakeholders.
9.1. Why Documentation Matters
There are three primary reasons analysts document their cleaning work:
- Transparency: Business leaders need to know what happened to the raw data. A cleaning log builds trust by showing that no shortcuts or distortions were introduced.
- Reproducibility: Future team members should be able to retrace your steps. This is especially important in regulated industries, audits, or when datasets are refreshed periodically.
- Professionalism: A well-maintained cleaning log is a mark of analytical discipline. It signals that you take your work seriously, not just in outcome, but in method.
9.2. The Cleaning Log
The cleaning log is a structured worksheet in your Excel file. It should follow a consistent format that captures five key elements for every action taken.
| Date | Issue Identified | Action Taken | Rationale | Rows / Fields Affected |
| --- | --- | --- | --- | --- |
| 2024-06-24 | Blank entries in CompetitionDistance | Replaced blanks with median value | Median better reflects distribution | Store.xlsx, Column D, 13 rows |
| 2024-06-25 | Duplicate rows in sales.xlsx | Removed exact duplicates | Prevents double-counting revenue | Sales.xlsx, Rows 12434, 33255 |
| 2024-06-25 | Date format inconsistency | Standardized to DD/MM/YYYY format | Ensures accurate sorting/filtering | Sales.xlsx, Column C, all rows |
| 2024-06-26 | Sales outlier: €535,000 | Confirmed as error, removed | Store closed on that day | Sales.xlsx, Row 813 |
Best Practice: Use consistent terminology in your log. For instance, always refer to “blanks” (not sometimes “nulls”), and use consistent date formats.
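If you script any of your cleaning, the log can be built in code alongside the fixes so no action goes unrecorded. This is a sketch, not part of the Excel workflow the lesson requires; the helper name and entries are illustrative:

```python
import pandas as pd

# Keep the log as structured records mirroring the five worksheet
# columns above; append one entry per cleaning action.
log = []

def record(date, issue, action, rationale, affected):
    """Append one cleaning-log entry (all fields are required)."""
    log.append({"Date": date, "Issue Identified": issue,
                "Action Taken": action, "Rationale": rationale,
                "Rows / Fields Affected": affected})

record("2024-06-24", "Blank entries in CompetitionDistance",
       "Replaced blanks with median value",
       "Median better reflects distribution",
       "Store.xlsx, Column D, 13 rows")

log_df = pd.DataFrame(log)
# log_df.to_excel("cleaning_log.xlsx", index=False)  # ship with the data
print(log_df)
```

Writing the log at the moment you make each change, rather than reconstructing it afterwards, is what keeps it honest.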
9.3. Guidelines for Writing “Rationale”
Your rationale column is the most important. This is not just a “what”—it is a “why.”
Some weak examples:
- “Needed fixing”
- “To make things consistent”
- “Excel didn’t like the format”
Strong examples:
- “Filled missing Promo2SinceYear using known promo launch dates from campaign archives”
- “Deleted duplicate records after verifying that timestamps and transaction IDs were identical”
- “Standardized all text case in Assortment to avoid grouping errors during pivot table analysis”
Clarity in your rationale helps you defend your decisions and invites constructive discussion with others.
9.4. Executive Summary for Stakeholders
The executive summary is a short written document (or slide) that accompanies your submission. It addresses non-technical stakeholders—typically a business manager, team lead, or senior decision-maker.
The purpose is not to explain how Excel formulas work. Instead, it focuses on:
- What key issues were present in the original data
- How those issues could have affected business decisions
- What was done to address them
- What limitations remain
- What this cleaned dataset enables now
Structure of a Strong Executive Summary
Here is a sample 4-paragraph format:
- Data Quality Overview
The initial Rossmann dataset contained several quality issues that could have misled strategic decisions—most notably missing values in the store metadata, duplicate sales records, and inconsistencies in date formats across both files.
- Cleaning Strategy
We followed a structured five-step cleaning process: identifying issues, prioritizing based on business impact, applying appropriate solutions, validating changes, and documenting all decisions. All critical fields were checked for completeness, logical consistency, and range accuracy. Outliers were carefully investigated and resolved based on operational context.
- Business Implications
Without cleaning, several stores appeared to be over-performing due to duplicate sales rows. Some promotional analytics would have excluded stores with missing metadata. These corrections now enable more reliable insights into campaign performance, store clustering, and regional sales trends.
- Remaining Considerations
We flagged five stores with incomplete promotional history for further review. These are noted in the cleaning log. The dataset now provides a stable foundation for the upcoming Q3 retail performance analysis.
9.5. Versioning and Communication
When submitting cleaned data, always preserve version control:
- rossmann_raw.xlsx – untouched original
- rossmann_cleaned_v1.xlsx – cleaned and validated
- rossmann_cleaned_v2.xlsx – cleaned with additional QA fixes
Never overwrite prior versions unless explicitly approved. Versioning not only helps with rollback but also demonstrates analytical maturity.
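The never-overwrite rule can even be enforced in code. This small sketch (the function name and .xlsx suffix are assumptions for illustration) always returns the next free versioned filename instead of clobbering an existing one:

```python
from pathlib import Path

def next_version_path(stem: str, folder: str = ".") -> Path:
    """Return the next unused versioned filename, never overwriting.

    For a stem like "rossmann_cleaned" this yields
    rossmann_cleaned_v1.xlsx, then _v2, _v3, ... depending on
    which versions already exist in the folder.
    """
    n = 1
    while (p := Path(folder) / f"{stem}_v{n}.xlsx").exists():
        n += 1
    return p

print(next_version_path("rossmann_cleaned"))
```

The same idea applies manually in Excel: check what versions exist before saving, and always bump the suffix.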
Communicate clearly when handing off datasets:
- What has been done
- What can now be done with the data
- What must still be done (e.g., manual confirmation, additional enrichment)
A cleaned dataset without documentation is like a building without blueprints—perhaps beautiful, but structurally unverifiable. As analysts, your goal is not just to prepare data—but to make your decisions visible, logical, and communicable.
This is what earns the trust of your team—and the confidence of decision-makers.
Summary
In this lesson, you’ve learned:
- Why effective data cleaning is foundational for successful EDA, and how to systematically identify common data quality issues: missing values, duplicates, inconsistent formats, and outliers.
- Practical techniques for addressing each type of data quality problem using Excel’s built-in tools and formulas.
- The importance of documentation and validation in professional data cleaning processes.
- How to balance technical cleaning requirements with business context and decision-making needs.
- The role of AI tools in data cleaning, including their capabilities and limitations.
In the next lesson, you’ll dig deeper into outliers – when to clean and when to keep them.
Suggested Readings & References
- Only 3% of Companies’ Data Meets Basic Quality Standards – Harvard Business Review
Exercise
Estimated Time to Complete: ~2 hours
Context
You’ve been hired as a data analyst on a consulting team preparing Walmart’s upcoming quarterly operations review. Leadership will use your analysis to decide which store formats to prioritize for marketing investment.
Before any modeling or dashboarding can begin, they need assurance that the data you’re working with is clean, consistent, and trustworthy. Your role: prepare an analysis-ready dataset, provide clear documentation of your process, and deliver a concise business-facing summary of the decisions you made.
Objectives
You are expected to:
- Identify and assess data quality issues across the walmart-sales file.
- Systematically clean the dataset in Excel using techniques covered in Lesson 4.
- Validate your cleaning with cross-field logic, range tests, and consistency checks.
- Document your decisions transparently using a cleaning log.
- Write a stakeholder-facing summary explaining the business impact of the data quality issues and how your cleaning enables accurate insights.
- Reflect on your analytical mindset, your communication clarity, and your development as a professional analyst.
Data
You will be working with the following dataset in this exercise:
- walmart-sales.xlsx
You can download this file from Lumen’s GitHub repository.
You can ignore the other two files for now:
- walmart-features.xlsx
- walmart-stores.xlsx
Tasks
Task 1: Data Quality Assessment
Using Excel, conduct a structured audit of the sales dataset. Document your findings in a table or cleaning log with columns such as:
- Field/Variable
- Type of Issue (e.g., Missing, Duplicate, Format Inconsistency, Outlier)
- Frequency or Affected Rows
- Possible Business Impact
- Initial Thoughts on Action
Focus questions:
- Are there any missing values? Where and how might they affect business decisions?
- Are there duplicated rows or transactions?
- Are there inconsistent formats (dates, text capitalization, currencies)?
- Do any values seem implausible (e.g., very large negative sales)?
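The exercise itself is done in Excel, but the shape of a quality audit is the same in any tool. As a sketch, here is the audit in pandas on a few stand-in rows (the column names Store, Date, and Weekly_Sales are assumptions about the walmart-sales file; real data would be loaded with pd.read_excel):

```python
import pandas as pd

# Stand-in rows for walmart-sales.xlsx (illustrative values only).
df = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3],
    "Date": ["05-02-2010", "12-02-2010", "05/02/2010", None, "19-02-2010"],
    "Weekly_Sales": [24924.50, 46039.49, 46039.49, -1098.00, None],
})

# One audit line per field: how many blanks, plus dataset-level checks
# for exact duplicates and implausible (negative) sales.
print(df.isna().sum().rename("Missing blanks"))
print("Exact duplicate rows:", df.duplicated().sum())
print("Negative Weekly_Sales:", (df["Weekly_Sales"] < 0).sum())
```

Note that the stand-in data also contains a format inconsistency (dashes vs. slashes in Date), the kind of issue the counts above will not catch on their own; a visual scan or a format check still matters.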
Task 2: Implement Cleaning
Apply structured cleaning steps as per the workflow taught in this lesson.
Your cleaning should include:
- Identifying and addressing missing data using strategies aligned with business meaning
- Removing or flagging duplicate records logically.
- Standardizing date formats (ensure all dates use one consistent format, converting to ISO YYYY-MM-DD where appropriate).
- Fixing inconsistent column types (e.g., ensure prices are numeric).
- Creating a structured cleaning log in a separate worksheet with justifications for each action taken.
- Range validation: Are there any negative Weekly_Sales or illogical Store Sizes?
- Completeness check: Are all stores in sales represented in the stores file? Are all dates covered?
Include summary statistics and screenshots as needed.
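For orientation, the core of those cleaning steps can be sketched in a few lines of pandas (again a sketch only, not a substitute for the Excel work; the column names and sample values are assumptions):

```python
import pandas as pd

# Stand-in rows for walmart-sales.xlsx, with a duplicate, string-typed
# sales, and a negative value planted for illustration.
df = pd.DataFrame({
    "Store": [1, 1, 1, 2],
    "Date": ["05-02-2010", "05-02-2010", "12-02-2010", "19-02-2010"],
    "Weekly_Sales": ["24924.50", "24924.50", "46039.49", "-50.00"],
})

df = df.drop_duplicates()                                  # remove exact duplicates
df["Date"] = (pd.to_datetime(df["Date"], dayfirst=True)    # standardize to ISO
                .dt.strftime("%Y-%m-%d"))
df["Weekly_Sales"] = pd.to_numeric(df["Weekly_Sales"],     # force numeric type
                                   errors="coerce")
df["RangeFlag"] = (df["Weekly_Sales"] < 0).map(            # range validation
    {True: "Check: negative", False: "OK"})
print(df)
```

Flagging the negative value rather than deleting it mirrors the lesson's advice: investigate an implausible value in its business context before deciding its fate.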
Task 3: Stakeholder Communication
Write a one-page business-facing summary. Your goal is to communicate your cleaning work to a non-technical stakeholder (e.g., Walmart’s regional ops manager).
You may use this structure:
- What issues you found and why they mattered
- Your decisions and rationale
- Remaining risks or limitations
- What your cleaning now enables (e.g., performance comparison by store type)
Be clear, concise, and use business-friendly language.
Task 4: Engage with the Professional Analytics Community
Read the LinkedIn post on Data Cleaning Lessons. Reflect on the role of subtle assumptions and technical flaws in your own cleaning process, then leave an insightful comment on the post. Your comment can draw on your Walmart reflection or a broader lesson you have learned, or you can pose a thoughtful question.
Submission Guidelines
Submit the following as a single Excel workbook and one Word/PDF document:
Workbook:
- raw_data: Original datasets copied as separate worksheets
- cleaned_data: Cleaned versions, clearly labeled
- cleaning_log: Your structured log of all issues + actions
Filename Format:
YourName_Lesson4_Walmart_Cleaning.xlsx
Document:
- YourName_Lesson4.docx
Include:
- Business summary (max 1 page)
- Reflection (1–2 pages)
When you’re ready, submit your completed exercise to the designated folder in OneDrive. Drop your mentor a note about submission.
Important: Please scan your files for viruses before uploading.
Submission & Resubmission Guidelines
- Initial Submission Format: YourName_Lesson4_…
- Resubmission Format:
- YourName_Lesson4_…_v2
- YourName_Lesson4_…_v3
- Rubric Updates:
- Do not overwrite original evaluation entries
- Add updated responses in new “v2” or “v3” columns
- This allows mentors to track your improvement process
Evaluation Rubric
| Criteria | Exceeds Expectation | Meets Expectation | Needs Improvement | Incomplete / Off-Track |
| --- | --- | --- | --- | --- |
| 1. Data Quality Assessment | | | | |
| 2. Cleaning Implementation | | | | |
| 3. Stakeholder Summary | | | | |
| 4. Personal Branding | | | | |
Student Submissions
Check out recently submitted work by other students to get an idea of what’s required for this Exercise: