Lesson 4 - Cleaning and Preparing Data

Transform messy, incomplete data into reliable insights.

Estimated Read Time: 1.5-2 Hours

Learning Goals

In this lesson, you will learn to clean data. By the end, you will:
 

Technical & Analytical

  • Identify and address common data quality issues: missing values, duplicates, inconsistent formats, and outliers.
  • Apply systematic data cleaning techniques using Excel’s built-in tools and formulas.
  • Validate data integrity through cross-checking, range testing, and logical consistency checks.
  • Document cleaning decisions to ensure transparency and reproducibility in your analytical process.
  • Learn to think like an analyst: ask better questions, validate assumptions, and make methodical, transparent decisions.

Business Impact

  • Recognize how data quality issues directly impact business decisions and organizational trust.

Personal Branding & Career Development

  • Build cleaning logs that reflect your analytical discipline.
  • Document your decision-making process to gain stakeholder trust.
  • Treat data cleaning as professional storytelling — and a source of competitive edge.

1. The Reality of Real-World Data

Data is rarely pristine. In fact, most analysts spend 60-80% of their time cleaning and preparing data before any meaningful analysis can begin. This isn’t a sign of failure — it’s the reality of working with information generated by imperfect systems, human input, and complex business processes.

Effective data cleaning is foundational for successful EDA. This process involves handling missing values, correcting errors, and ensuring consistency, thereby enabling accurate visualizations and reliable insights during the exploratory phase.

Consider Rossmann’s daily operations: thousands of transactions across hundreds of stores, recorded by different systems, entered by various staff members, and transmitted across networks. Each step introduces potential for inconsistency, errors, or missing information. Yet business decisions must be made with this imperfect data.

Your ability to systematically identify, address, and document data quality issues is what transforms raw information into reliable business insights. This lesson will teach you to approach messy data with structured thinking, appropriate tools, and professional judgment.

 

2. Understanding Data Quality Issues

Before diving into cleaning techniques, you must first learn to recognize the most common data quality problems. Like a skilled detective, an analyst develops the ability to spot inconsistencies, anomalies, and missing pieces that could undermine analysis.

 

2.1. Missing Values

Missing data occurs when expected information is absent from your dataset. In the Rossmann data, you might find:

  • Sales figures missing for certain dates
  • Customer counts blank (or zero) during store closures
  • Promotional information incomplete for specific periods

Missing values aren’t always errors — sometimes they’re meaningful. A missing sales figure might indicate a store closure, while a blank promotional flag could mean no promotion was active. The key is understanding the difference between missing information and meaningful absences.

 

What Counts as Missing?

When we speak of “missing data,” we often imagine something technical or hidden. But in Excel, the first kind of missing data you will encounter is simply: nothing. A cell left empty.

This kind of missingness is common in real operational data. It can occur for many reasons:

  • A store didn’t report a value.
  • The system didn’t collect it at the time.
  • The field didn’t apply to that store.
  • The data was lost or never entered.

Your task isn’t just to spot these blank cells. It’s to interpret what they might mean — and what to do about them.

 

2.2. Duplicate Records

Duplicate entries occur when the same information appears multiple times in your dataset. This could result from:

  • System errors during data export
  • Multiple data sources being combined without proper deduplication
  • Manual entry mistakes

Duplicates can severely skew your analysis, artificially inflating totals and distorting trends. A single day’s sales appearing twice would double the apparent revenue for that period.

 

2.3. Inconsistent Formats

Data inconsistency manifests in various ways:

  • Date formats mixed between DD/MM/YYYY and MM/DD/YYYY
  • Text entries with different capitalization (e.g., “Store A” vs “store a”)
  • Numbers formatted as text, preventing calculations
  • Currency symbols inconsistently applied

These inconsistencies prevent accurate sorting, filtering, and analysis. Excel might treat “12/03/2023” and “2023-03-12” as different entries, even though they represent the same date.

 

2.4. Outliers and Anomalies

Outliers are data points that appear significantly different from the majority of observations. In retail sales data, outliers might indicate:

  • Data entry errors (e.g., sales of €500,000 instead of €5,000)
  • Exceptional events (e.g., grand opening sales, clearance events)
  • System malfunctions or processing errors

The challenge is distinguishing between legitimate outliers that reflect real business events and errors that should be corrected or removed. We will cover outliers in detail in lesson 5.

 

3. Data Cleaning Fundamentals

Effective data cleaning follows a systematic approach. Rather than randomly addressing issues as you encounter them, professional analysts follow structured workflows that ensure comprehensive, documented, and reproducible cleaning processes.

 

3.1. The Cleaning Workflow

  1. Assess and Document: Before making any changes, examine your data thoroughly and document what you find
  2. Prioritize Issues: Not all problems require immediate attention — focus on issues that would most impact your analysis
  3. Clean Systematically: Address issues in logical order, from most fundamental to most specific
  4. Validate Results: After cleaning, verify that your changes achieved the intended results
  5. Document Decisions: Record what you changed and why, ensuring transparency and reproducibility

 

3.2. Setting Up Your Cleaning Environment

Before beginning any cleaning process, establish a structured workspace:

Create a Cleaning Log

Add a new worksheet named “cleaning_log” to document every significant change you make. It is recommended to include:

  • Date and time of cleaning
  • Issue identified
  • Action taken
  • Rationale for the decision
  • Rows/columns affected

Preserve Original Data

Never work directly on your original dataset. Create a copy named “data_cleaned” or similar, keeping the original intact for reference.

Use Consistent Naming

Establish clear naming conventions for:

  • Worksheet tabs (e.g., “raw_data”, “cleaned_data”, “analysis”)
  • Column headers (consistent capitalization and formatting)
  • File versions (e.g., “rossmann_analysis_v1.xlsx”)

Note that a cleaning log is more than a formality: it establishes trust in your work. Imagine removing data points during cleaning without leaving any comment. Anyone who later reviews the data will be left guessing where those records went. Maintaining a log prevents that confusion and makes your work auditable.

 


 
Working with the Rossmann Dataset

Let’s apply these principles to the Rossmann dataset you’ve been working with. Open both rossmann-sales.xlsx and rossmann-stores.xlsx to begin your cleaning process.

Please note that some numbers and values used in this lesson are indicative only and do not reflect the real Rossmann dataset. Treat the examples as illustrations rather than exact figures.

 

4. Initial Data Assessment

Before making any changes, spend time understanding what you’re working with:

Examine the Sales Data Structure

  • How many rows and columns do you have?
  • What date range does the data cover?
  • Are all expected stores represented?
  • Do you notice any immediately obvious issues?

Review the Stores Data

  • Does every store in the sales data have corresponding store information?
  • Are store types and assortments consistently formatted?
  • Are there any missing values in critical fields?

Cross-Reference the Datasets

  • Are there stores in one dataset but not the other?
  • Do store identifiers match consistently across both files?

 

5. Identifying Missing Values

Let’s consider rossmann-stores.xlsx.

You’ll see one row per store, with columns describing its characteristics. Focus your attention on the following:

  • CompetitionDistance
  • CompetitionOpenSinceMonth
  • CompetitionOpenSinceYear
  • Promo2SinceWeek
  • Promo2SinceYear
  • PromoInterval

Excel provides several ways to identify missing data:

Option 1 – Visual Inspection

Scroll through your data looking for blank cells, “N/A” entries, or other indicators of missing information. Pay particular attention to:

  • Required fields like Store ID, Date, and Sales
  • Patterns in missing data (e.g., all Sundays missing, specific stores affected)

Option 2 – Using Excel’s Go To Special Feature

  1. Select your data range
  2. Press Ctrl+G (or F5) to open the “Go To” dialog
  3. Click “Special”
  4. Select “Blanks” and click OK

This will highlight all empty cells in your selection, making missing values immediately visible – see Figure 1.

Figure 1 - Excel's Go To Special Feature Highlighting Blanks

 

Option 3 – Filtering for Missing Values

Use Excel’s filter feature to identify missing data:

  1. Apply filters to your dataset
  2. In relevant columns, look for “(Blanks)” in the filter dropdown
  3. Select “(Blanks)” to display only rows with missing values in that column

 

Figure 2 - Excel's Filter Feature to Identify Blanks

Option 4 – Using COUNTBLANK()

You can count the number of missing cells in a range, e.g. for rows (store #) or columns (CompetitionOpenSinceMonth).

For example, to check how many entries are missing for store #1, type the following in a blank cell (K2):

=COUNTBLANK(A2:J2)

This returns 3, indicating that three entries are missing for store #1.

Similarly, for all missing entries of column CompetitionOpenSinceMonth, which is column E, you can enter the formula:

=COUNTBLANK(E2:E1116)

This returns 354, indicating that the column contains 354 missing values.

If these numbers are high (e.g., 60% or more blanks), the row/column might require special handling – review it for context before keeping or excluding it.

This function becomes especially powerful when used across:

  • Key columns (e.g., date fields, distance fields)
  • Newly added rows during incremental data updates
  • Merged datasets where gaps are expected
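If you later reproduce this COUNTBLANK-style audit outside Excel, the same idea can be sketched in pandas. The library choice and the tiny inline dataset below are illustrative assumptions, not part of the lesson files:

```python
import pandas as pd

# Invented sample mimicking rossmann-stores.xlsx (None = blank cell)
stores = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, None, 14130.0, None],
    "CompetitionOpenSinceMonth": [9.0, None, 12.0, None],
    "PromoInterval": ["Jan,Apr,Jul,Oct", None, None, "Feb,May,Aug,Nov"],
})

# Equivalent of COUNTBLANK per column: count missing cells in each field
blanks_per_column = stores.isna().sum()
print(blanks_per_column)

# Equivalent of COUNTBLANK per row: how many fields are blank for each store
stores["missing_count"] = stores.isna().sum(axis=1)
print(stores[["Store", "missing_count"]])
```

Just as in Excel, a high blank count for a row or column is a signal to investigate context before deciding to keep, fill, or exclude it.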

 

5.1. Why Is the Data Missing?

Now that you’ve identified where values are missing, pause to ask why.

Here are three (hypothetical) examples you might notice in the Rossmann data:

Store | CompetitionDistance | CompetitionOpenSinceMonth | Promo2 | Promo2SinceWeek
------|---------------------|---------------------------|--------|----------------
18    | (blank)             | (blank)                   | 0      | (blank)
24    | 1630                | 9                         | 0      | (blank)
35    | 2435                | (blank)                   | 1      | 18

What could this mean?

  • Is CompetitionDistance blank because the store has no competitors nearby?
  • Is Promo2SinceWeek blank because the store never participated in Promo2?
  • Are some blanks the result of data entry issues, or are they legitimate absences?

These questions are not side-notes. They are the work.

 

5.2. Addressing Missing Values

Once you’ve identified missing values, you must decide how to handle them. The appropriate approach depends on business context and the nature of the missing data.

Let’s take three of the blank columns as examples.

1. CompetitionDistance

This field tells us how far away the nearest competing store is. For most rows, this is a numeric value — such as 215.0 or 4890.0. But for some stores, the value is missing.

Possible interpretations:

  • The data was not recorded.
  • The store is in a remote area where competition isn’t relevant.
  • The store is new, and the data hasn’t been updated yet.

As an analyst, you might:

  • Fill in the missing values with the average competition distance across similar stores.
  • Fill with a high number (e.g., 99999) to signal “unknown” or “assumed distant”.
  • Leave them blank and exclude these rows from distance-based analysis.

All three options are valid — but each should be documented.

2. Promo2SinceWeek and Promo2SinceYear

You may notice that when Promo2 = 0, both of these fields are blank.

This suggests that missingness here is not random — it is structured.

In this case, you might decide:

  • Not to fill these blanks at all.
  • Or to fill them with a placeholder like “N/A” (not applicable) in a cleaned version.

Whether you fill the blanks matters — but equally important is understanding why the value is missing and making that reasoning clear in your cleaning log.

3. CompetitionOpenSinceMonth and CompetitionOpenSinceYear

These two columns work together — they tell us when a nearby competitor opened.

In some rows, one or both of these fields are missing. That creates ambiguity.

For example:

  • If CompetitionDistance is given, but the opening date is missing, we might wonder if the competitor existed at the time of the store’s opening.
  • If both are missing, we may consider removing the row, or flagging it for manual review.

Again: no automated tool can tell you what to do. You decide, and you document.

Now you have multiple options to handle missing values:

Option 1: Fill with Calculated Values – Sometimes you can derive missing values from other information:

  • Calculate missing customer counts from sales and average transaction values
  • Derive missing dates from sequence patterns
  • Use store averages to fill missing promotional information

Option 2: Use Domain Knowledge – Apply business understanding to make informed decisions:

  • If sales are missing but the store shows “Open = 0”, sales should be 0
  • If customer count is missing but sales exist, estimate based on average transaction value
  • If promotional information is missing, assume no promotion unless evidence suggests otherwise

Option 3: Flag for Further Investigation – When missing data might indicate important business issues:

  • Large gaps in sales data might reveal system problems
  • Missing store information could indicate new openings or closures
  • Promotional data gaps might reflect incomplete campaign tracking

Option 4: Delete Records – Appropriate when:

  • Missing data represents a small percentage of your dataset
  • The missing information is not critical to your analysis
  • You have sufficient remaining data for reliable analysis
  • Use sparingly – don’t discard a whole row or column just because a few entries are missing. Your current analysis may not need that particular field, but other use cases might, and deleting the record also destroys the values that were present.
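The options above can be sketched in code. Below is a minimal pandas illustration of three of them (fill with a calculated value, apply domain knowledge, flag for review); the library and the invented sample rows are assumptions for demonstration only:

```python
import pandas as pd

# Invented sample: CompetitionDistance with blanks, Promo2 participation flags
stores = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [215.0, None, 4890.0, None],
    "Promo2": [0, 0, 1, 1],
    "Promo2SinceWeek": [None, None, 14.0, None],
})

cleaned = stores.copy()  # never work directly on the original data

# Option 1: fill a numeric gap with a calculated value (here, the median)
median_distance = cleaned["CompetitionDistance"].median()
cleaned["CompetitionDistance"] = cleaned["CompetitionDistance"].fillna(median_distance)

# Option 2: domain knowledge - Promo2 = 0 means the promo never ran,
# so a blank start week is "not applicable", not an error
cleaned["Promo2SinceWeek"] = cleaned["Promo2SinceWeek"].astype(object)
cleaned.loc[cleaned["Promo2"] == 0, "Promo2SinceWeek"] = "N/A"

# Option 3: flag remaining unexplained gaps for investigation, not guessing
cleaned["Needs_Review"] = cleaned["Promo2SinceWeek"].isna()
print(cleaned)
```

Each of these choices would still be recorded in the cleaning log, exactly as with the Excel workflow.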

🎧 Here’s a real-world story showing why deleting data — even accidentally — can create much bigger blind spots than you’d expect.

5.3. Recording Your Decisions

Create a worksheet in your Excel file named cleaning_log.

Here’s how it might look:

Column                    | Type of Missingness     | Decision                                          | Rationale
--------------------------|-------------------------|---------------------------------------------------|-----------------------------------------------
CompetitionDistance       | True blank              | Filled with median for same StoreType             | To preserve records and allow fair comparison
Promo2SinceWeek           | Blank when Promo2 = 0   | Left as blank                                     | Promo never ran, so no start week is expected
CompetitionOpenSinceMonth | Some partial blanks     | Removed rows with only one of month/year present  | Incomplete time info makes analysis unreliable

You’ll build on this log throughout your cleaning process.

 

6. Handling Duplicates

In structured datasets, each row is meant to represent a unique observation — a distinct transaction, store, or event. When the same row appears more than once, your counts, totals, and trends can become unreliable. But not all repeated entries are duplicates. Learning to distinguish between true duplicates and legitimate repetition is an essential analyst skill.

 

In this section, we’ll work with the rossmann-sales.xlsx file to:

  • Understand how duplicates can arise
  • Determine whether they are errors or valid data
  • Apply Excel techniques to identify and manage them responsibly

 

6.1. What Is a Duplicate?

At first glance, two rows might look identical — same store, same date, same sales. But before deleting anything, we must ask:

Does this row truly repeat a recorded event, or is it just coincidentally similar?

In real data:

  • Some repetition is normal (e.g., two stores with the same sales on a given day).
  • But duplicate records for the same store on the same date — with identical values in every column — usually indicate accidental duplication.

 

6.2. Business Impact of Unchecked Duplicates

Let’s suppose one sales record has been accidentally entered twice.

Store | Date       | Sales | Customers | Promo
------|------------|-------|-----------|------
12    | 2024-07-10 | 7,854 | 823       | 1
12    | 2024-07-10 | 7,854 | 823       | 1

If this goes unnoticed:

  • Total sales will be inflated by €7,854
  • Daily averages will be distorted
  • Forecasting models will overweight this data point

 

In business settings, such distortions can lead to poor inventory planning, overstaffing, or misjudged marketing success.

 

6.3. Step-by-Step: How to Check for Duplicates in Excel

Let’s walk through how to find and handle duplicates in your Rossmann dataset.

 
Step 1: Identify Your Uniqueness Criteria

Before Excel can tell you what’s duplicated, you must define what counts as a duplicate.

In rossmann-sales.xlsx, a clean row should represent one day of sales for one store. So the combination of Store and Date should be unique.

We’ll check whether this pair appears more than once.

 
Step 2: Use a Formula to Flag Duplicates

Add a helper column to the right of your table and name it Duplicate_Check. Then enter the following formula in the first data row:

=COUNTIFS(A:A; A2; C:C; C2)

Explanation:

  • Column A: Store
  • Column C: Date
  • The formula counts how many times this exact Store-Date pair appears.

 

If the result is greater than 1, that record is duplicated.

Drag the formula down the entire column.
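The COUNTIFS helper column has a direct analogue outside Excel. Here is a minimal pandas sketch of the same Store-Date uniqueness check (the library and the three invented rows are assumptions for illustration):

```python
import pandas as pd

# Invented sample: Store 12 on 2024-07-10 appears twice
sales = pd.DataFrame({
    "Store": [12, 12, 13],
    "Date": ["2024-07-10", "2024-07-10", "2024-07-10"],
    "Sales": [7854, 7854, 5120],
})

# Equivalent of the COUNTIFS helper column: count each Store-Date pair
sales["Duplicate_Check"] = (
    sales.groupby(["Store", "Date"])["Store"].transform("size")
)

# Any count greater than 1 marks a duplicate candidate for review
print(sales[sales["Duplicate_Check"] > 1])
```

Note that two different stores with the same date (and even the same sales) are not flagged — only repeated Store-Date pairs are.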

 

Reminder: Formula Separators Depend on Regional Settings

 

If Excel gives you an error when entering the COUNTIFS() formula, the issue might not be the formula itself — but your system’s regional formatting. In many non-European versions of Excel, you’ll need to use comma (,) instead of semi-colon (;). So the formula would become:

=COUNTIFS(A:A, A2, C:C, C2)

 

Please verify what works for you before proceeding.

 

As an analyst, it’s important to be aware of these subtle environmental differences — they affect formula behavior, date parsing, and number formatting across systems and teams.

 
Step 3: Filter to Review Duplicate Candidates

Now:

  1. Apply a filter to the Duplicate_Check column.
  2. Filter for rows where the value is >1.
  3. Carefully inspect a few examples:
    • Are all columns identical? (Sales, Customers, Promo, etc.)
    • Are there slight differences in the numbers?

If all fields are identical, the second (or third) entry is likely a mistake.

If there are differences, further investigation is needed.

 

Analyst Judgment in Practice

Let’s say you find this:

Store | Date       | Sales | Customers | Promo
------|------------|-------|-----------|------
24    | 2024-08-01 | 5,230 | 612       | 0
24    | 2024-08-01 | 5,230 | 612       | 0

Identical in every column — likely an accidental repeat during data export. You would mark one for removal.

Now consider:

Store | Date       | Sales | Customers | Promo
------|------------|-------|-----------|------
35    | 2024-08-15 | 3,490 | 412       | 1
35    | 2024-08-15 | 3,490 | 409       | 1

Sales match, but Customers is off by three. Now the situation is ambiguous. This might be a data entry correction — or it might indicate something deeper, like a merge error between systems.

In such cases, don’t delete immediately. Flag for review or consult the data source team.

 
Step 4: Remove Confirmed Duplicates

Once you’ve identified exact duplicates, you can remove them using Excel’s built-in feature:

Using “Remove Duplicates”

  1. Select your data range (including headers).
  2. Go to the Data tab.
  3. Click Remove Duplicates.
  4. Select all columns (to ensure complete duplicates only).
  5. Click OK.

 

Excel will remove exact duplicates and report how many were found and deleted.

Caution:

  • This method only removes rows where every column is identical. It will not catch near-duplicates.
  • Always create a backup or copy of your data before running this step.
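For comparison, the same "exact duplicates only" removal can be sketched outside Excel. This pandas example (library and invented rows are illustrative assumptions) also shows the backup step from the caution above:

```python
import pandas as pd

# Invented sample: rows 0 and 1 are an exact accidental repeat
sales = pd.DataFrame({
    "Store": [24, 24, 35],
    "Date": ["2024-08-01", "2024-08-01", "2024-08-15"],
    "Sales": [5230, 5230, 3490],
    "Customers": [612, 612, 412],
})

backup = sales.copy()  # always keep a copy before deleting anything

# Equivalent of Remove Duplicates with all columns selected:
# only rows identical in every column are dropped, keeping the first
deduped = sales.drop_duplicates(keep="first")

print(f"Removed {len(sales) - len(deduped)} exact duplicate row(s)")
```

As in Excel, near-duplicates (rows differing in even one field) survive this step and must be reviewed by hand.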
 
Step 5: Document What You Did

In your cleaning_log worksheet, record:

Issue                                  | Detection Method          | Rows Affected | Action Taken                   | Notes
---------------------------------------|---------------------------|---------------|--------------------------------|----------------------------------------
Duplicate sales entries (Store + Date) | COUNTIFS on Store and Date| 7             | Removed one duplicate per match| All fields identical. Reviewed visually.

A log like this ensures:

  • Reproducibility for others
  • Transparency in your decisions
  • A clear trail in case someone questions the numbers later

 

6.4. When Not to Remove Duplicates

If rows are flagged but contain subtle differences, or if they reflect genuine business operations (e.g., multiple shifts per day, merged records), hold off.

Instead:

  • Add a comment or flag.
  • Create a column such as Needs_Review.
  • Inform your manager or client if it may affect decisions.

 

7. Standardizing Formats

“Dirty data” doesn’t refer only to missing values or duplicates. In practice, some of the most disruptive issues stem from inconsistent formatting. These inconsistencies may not always be obvious at first glance — yet they interfere with filtering, grouping, formula logic, and merging datasets.

 

You may have the right data in front of you. But if Excel does not interpret it in the way you expect, the results of your analysis can be misleading or completely wrong.

 

In this section, you will learn how to identify and resolve common formatting inconsistencies. You will also begin to develop the habit of asking, “How is Excel actually reading this value?” — a question that sits at the core of analytical reliability.

 

We will work with real columns from the Rossmann datasets and walk through the steps of preparing the data for decision-making.

 

7.1. Dates

The Date column in rossmann-sales.xlsx may appear consistent on the surface, but we must always confirm whether Excel recognizes each entry as a true date value — not just as text that happens to look like a date.

 

For example, suppose you see the value 12/03/2024 in a cell. If Excel is treating this value as a string of text, then sorting, filtering, and date-based calculations will not work as expected. Worse, Excel may interpret that same string differently depending on your regional settings. In some countries, 12/03/2024 means March 12th. In others, it means December 3rd.

 

To determine whether Excel recognizes the value as a date, click the cell and look at the formula bar. If the formula bar shows a plain date (for example, 12/03/2024) and the value is right-aligned in the cell, it is a proper Excel date. If the value is left-aligned, or the formula bar shows a leading apostrophe (’12/03/2024), then Excel is treating it as text.

 

You can also open “Format Cells” to check whether the cell is formatted as a date, as shown in Figure 3.

Figure 3 - Date in Excel
Converting Text to Real Dates

If you identify date entries that are stored as text, you can convert them using Excel’s DATEVALUE() function. For example, if cell C2 contains a text date, you can enter the following formula in a new column:

=DATEVALUE(C2)

Then, format the result as a readable date by right-clicking the cell, selecting Format Cells, choosing Date, and selecting a standard format such as YYYY-MM-DD.

Make sure the entire column follows the same format — not just in terms of appearance, but in terms of how Excel processes it.

A consistent date format ensures that your filters, timelines, and formulas behave predictably across the entire dataset.

 
The Analyst’s Question

Let’s consider the Date column in rossmann-sales.xlsx.

“Are all entries in the Date column recognized by Excel as real dates, or are some of them stored as text?”

Here’s how to investigate:

  • In a blank column (e.g., column J), enter:

=ISNUMBER(C2)

This checks whether the value in C2 — which should contain a date — is stored as a number (which is how Excel encodes actual dates). If it returns TRUE, the cell is a valid Excel date. If it returns FALSE, the cell is text and must be converted. Tip: try typing a single quote (‘) in front of the value in C2; ISNUMBER(C2) will then return FALSE, because Excel now treats the cell as text.

  • Drag the formula down to check the entire column.

 

What to Expect?

In this dataset, the dates are correctly recognized as Excel dates — you’ll see TRUE throughout.

But this is not always the case in real datasets. If Excel ever fails to recognize a date (for example, due to inconsistent separators or localization issues), you can convert it using:

=DATEVALUE(C2)

Then format the result properly:

  • Right-click → Format Cells → Date → Select YYYY-MM-DD.

Tip: When dates are imported from CSVs or scraped from websites, they often lose their native format. Always check — never assume.
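The ISNUMBER/DATEVALUE workflow has a close analogue when dates arrive as text in code. This pandas sketch (library choice and sample strings are illustrative assumptions) shows how parsing the format explicitly removes the DD/MM vs MM/DD ambiguity discussed above:

```python
import pandas as pd

# Invented sample: dates imported as text in an ambiguous DD/MM layout
raw = pd.Series(["12/03/2024", "13/03/2024", "14/03/2024"])

# Equivalent of DATEVALUE plus formatting: state the format explicitly,
# so "12/03/2024" is never silently read as December 3rd
dates = pd.to_datetime(raw, format="%d/%m/%Y")

# Render in an unambiguous YYYY-MM-DD form for downstream use
print(dates.dt.strftime("%Y-%m-%d").tolist())
```

The second value, 13/03/2024, would fail under an MM/DD reading (there is no month 13) — a useful canary for detecting the wrong convention.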

 

7.2. Text Fields: Case, Spacing, and Inconsistencies

Several columns in rossmann-stores.xlsx — such as StoreType and Assortment — are categorical variables stored as text. At first glance, values like ‘a’, ‘A’, and ‘a ’ (with a trailing space) may seem interchangeable, but they are not: hidden spaces make Excel treat entries as different categories, and inconsistent capitalization clutters reports and breaks case-sensitive comparisons. This can fragment your analysis, causing a pivot table to show what look like duplicate categories, or returning errors in lookup formulas.

Let’s take the StoreType column as an example. As an analyst preparing this data for summary reporting, you need to make sure that all values follow a consistent format — for instance, all uppercase with no leading or trailing spaces.

To achieve this, Excel provides several helpful text-cleaning functions. You can apply them one by one, or combine them into a single cleaning formula.

  • =UPPER(A2) converts all characters to uppercase.
  • =LOWER(A2) converts them to lowercase.
  • =PROPER(A2) capitalizes the first letter of each word.
  • =TRIM(A2) removes any leading or trailing spaces.
  • =CLEAN(A2) removes non-printable characters.

In practice, you might combine these functions like this:

=UPPER(TRIM(CLEAN(A2)))

You can apply this formula in a new column, then copy the cleaned values over the originals using Paste Values. This ensures that every entry is formatted consistently, without accidentally deleting the original data before verifying your changes.

Always scan through your text fields after cleaning to confirm that the transformation achieved the intended result. Subtle differences — such as non-breaking spaces copied from other software — may not be visible on screen, but can still interfere with grouping or filtering.
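The UPPER(TRIM(CLEAN(...))) pattern translates directly to string methods in code. A minimal pandas sketch (library and the four invented entries, including a non-breaking space, are assumptions for illustration):

```python
import pandas as pd

# Invented sample: one category typed four inconsistent ways,
# including a trailing space and a non-breaking space (\xa0)
store_type = pd.Series(["a", "A ", " a", "A\xa0"])

# Equivalent of UPPER(TRIM(CLEAN(...))): normalize hidden whitespace,
# strip leading/trailing spaces, then unify the case
cleaned = store_type.str.replace("\xa0", " ").str.strip().str.upper()

print(cleaned.unique())  # a single consistent category remains
```

The non-breaking space is exactly the kind of invisible character that survives a visual scan but splits a category in two.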

 

7.3. Numbers: Are They Truly Numeric?

Some columns in rossmann-stores.xlsx — such as CompetitionDistance — may contain numbers that appear normal but are actually stored as text. Just like the dates discussed above, these values may look fine visually, but Excel will silently ignore them in calculations like AVERAGE, SUM, or even when sorting from smallest to largest.

To check whether Excel recognizes a number as numeric, use the formula:

=ISNUMBER(A2)

If the result is FALSE, Excel is treating the value as text — even if it contains digits. Another visual cue is alignment: by default, Excel aligns text to the left and numbers to the right.

To convert a text-based number into a real number, you can use:

=VALUE(A2)

Once you’ve confirmed that the numbers are properly recognized, you can format them consistently. Right-click the column, choose Format Cells, and select the appropriate format — such as Number with two decimal places or Currency, depending on context.

Tip: Text values often get pasted in accidentally from PDFs, web sources, or Excel files using different regional settings.
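The VALUE() conversion has an equivalent in code as well. In this pandas sketch (library and sample values are illustrative assumptions), anything that cannot be parsed is turned into an explicit missing value rather than silently remaining text:

```python
import pandas as pd

# Invented sample: distances pasted in as text, one entry unparseable
raw = pd.Series(["215", "4890", "n/a"])

# Equivalent of VALUE(): convert text to numbers; unparseable entries
# become NaN (visible, countable) instead of silently staying text
distance = pd.to_numeric(raw, errors="coerce")

print(distance)
```

Coercing failures to NaN is deliberate: it routes the bad entries back into the missing-value workflow from Section 5 instead of hiding them.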

 

7.4. Decimal and Currency Formats

If you are analyzing monetary values — such as average daily sales or promotional uplift — you will often work with decimal numbers. Be aware that some systems use a comma as the decimal separator (e.g., 3,14), while others use a period (e.g., 3.14).

Excel inherits these conventions from your operating system’s regional settings. If your dataset includes values copied from different sources (e.g., one in German settings, another in English), you may encounter inconsistencies.

To ensure proper numeric handling, first confirm that all values are being interpreted as numbers. Then standardize the decimal format and apply a consistent currency symbol, if appropriate.

  • Select the column
  • Right-click → Format Cells → Choose Number or Currency
  • Set the number of decimal places
  • Avoid mixing units (e.g., EUR and USD) unless your analysis accounts for it explicitly

You should also be cautious of hidden currency symbols or formatting that may have been pasted in from external sources — especially if your data was exported from a PDF or a web-based report.

 

7.5. Reflection

Take a moment to reflect on the following questions. Write down your responses in your journal or course notes — they will become part of your career story.

  • Have you ever worked with data that looked fine until you tried to filter, calculate, or merge?
  • What subtle formatting problems might go unnoticed in a dataset like Rossmann’s — and how could they affect a business decision?
  • As a professional analyst, how would you explain the importance of format consistency to a manager who doesn’t use Excel?

Your ability to notice what others overlook — and to explain it clearly — is part of what sets you apart in an analytics career.

8. Data Validation and Quality Checks

Before any dataset can be considered analysis-ready, it must pass a set of structured validation checks. These checks verify that your cleaning process has not introduced new problems, that the cleaned data respects business logic, and that it is suitable for decision-making.

This step reflects professional discipline. It ensures that the dataset is not only neat—but also trustworthy.

Let’s walk through two essential types of validation, using the Rossmann dataset as our context.

 

8.1. Logical Consistency Checks

Your data must make sense across fields. This means the relationships between columns should follow expected business rules.

Example 1: Store Status vs. Sales

In the rossmann-sales.xlsx file, a store marked as closed (Open = 0) should not report any sales. If it does, this indicates a logic error.

Validation Formula:
In a new column, enter:

=IF(AND(D2=0;F2>0);"Inconsistent";"OK")

where column D is Open and column F is Sales. (As before, use commas instead of semi-colons if your regional settings require them.)

Any row flagged as “Inconsistent” requires investigation.

Example 2: Customers and Sales

If Sales is positive but Customers is missing or zero, you must ask: how were those sales made? Did the Customer field fail to record footfall? Or is the sales figure incorrect?

Here, you are cross-validating plausibility, not just format.
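The same logical consistency check can be sketched outside Excel. This pandas example (library and the three invented rows are assumptions for illustration) flags closed stores that report sales:

```python
import pandas as pd

# Invented sample: Store 3 is marked closed (Open = 0) but reports sales
sales = pd.DataFrame({
    "Store": [1, 2, 3],
    "Open": [1, 0, 0],
    "Sales": [5230, 0, 410],
})

# Equivalent of the IF(AND(...)) validation column: closed stores must not sell
sales["Status"] = "OK"
sales.loc[(sales["Open"] == 0) & (sales["Sales"] > 0), "Status"] = "Inconsistent"

print(sales)
```

As in the Excel version, a flagged row is a prompt for investigation, not automatic deletion.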

 

8.2. Completeness Verification

You need to confirm that critical fields are fully populated—especially those that will be used in grouping, filtering, or aggregation.

Example: Missing Store Metadata

If a store in sales.xlsx does not have a corresponding entry in store.xlsx, your regional or assortment-based analysis will be incomplete or incorrect.

Validation Strategy:

  • Use VLOOKUP() or XLOOKUP() to match Store IDs across datasets (We will cover Lookups later in the module).
  • Flag rows in sales.xlsx where the store metadata returns an error.

Sample Formula:

=IF(ISNA(XLOOKUP([@Store],store!A:A,store!B:B)),"Missing Store Info","OK")

(where [@Store] refers to the Store column, column A, in sales.xlsx)

This confirms whether each transaction has an associated store type.
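The XLOOKUP completeness check corresponds to a left join with a match indicator in code. A minimal pandas sketch (library and the invented store 99 are assumptions for illustration):

```python
import pandas as pd

# Invented samples: store 99 appears in sales but not in the stores file
sales = pd.DataFrame({"Store": [1, 2, 99], "Sales": [5230, 4100, 390]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["a", "c"]})

# Equivalent of the XLOOKUP check: a left merge with indicator=True
# reveals which sales rows found no matching store metadata
merged = sales.merge(stores, on="Store", how="left", indicator=True)
merged["Check"] = merged["_merge"].map(
    {"both": "OK", "left_only": "Missing Store Info"}
)

print(merged[["Store", "Check"]])
```

Rows flagged "Missing Store Info" would break any regional or assortment-based grouping, so they must be resolved before aggregation.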

 

9. Documentation and Communication

Data cleaning is not merely a technical task. It is an analytical decision-making process. Each transformation—whether it’s filling a missing value, removing a duplicate, or resolving an inconsistency—is based on a series of judgments. Professional analysts do not keep these decisions in their heads. They make them visible.

In this section, we focus on two key outcomes:

  1. A clear, structured cleaning log that captures what was changed, why, and how.
  2. A concise, business-focused executive summary that communicates the implications of your cleaning work to non-technical stakeholders.

 

9.1. Why Documentation Matters

There are three primary reasons analysts document their cleaning work:

  1. Transparency
    Business leaders need to know what happened to the raw data. A cleaning log builds trust by showing that no shortcuts or distortions were introduced.
  2. Reproducibility
    Future team members should be able to retrace your steps. This is especially important in regulated industries, audits, or when datasets are refreshed periodically.
  3. Professionalism
    A well-maintained cleaning log is a mark of analytical discipline. It signals that you take your work seriously—not just in outcome, but in method.

 

9.2. The Cleaning Log

The cleaning log is a structured worksheet in your Excel file. It should follow a consistent format that captures five key elements for every action taken.

Date       | Issue Identified                     | Action Taken                      | Rationale                            | Rows / Fields Affected
2024-06-24 | Blank entries in CompetitionDistance | Replaced blanks with median value | Median better reflects distribution  | store.xlsx, Column D, 13 rows
2024-06-25 | Duplicate rows in sales.xlsx         | Removed exact duplicates          | Prevents double-counting revenue     | sales.xlsx, Rows 12434, 33255
2024-06-25 | Date format inconsistency            | Standardized to DD/MM/YYYY format | Ensures accurate sorting/filtering   | sales.xlsx, Column C, all rows
2024-06-26 | Sales outlier: €535,000              | Confirmed as error, removed       | Store closed on that day             | sales.xlsx, Row 813

Best Practice: Use consistent terminology in your log. For instance, always refer to “blanks” (not sometimes “nulls”), and use consistent date formats.
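For analysts who automate parts of their workflow, the same five-column log can be accumulated programmatically rather than typed by hand. The sketch below is hypothetical (the `log_action` helper and output file name are illustrative, not part of the lesson’s Excel workflow).

```python
import pandas as pd

# Accumulate cleaning-log entries in the five-column format shown above.
log_entries = []

def log_action(date, issue, action, rationale, affected):
    """Record one cleaning decision in a consistent structure."""
    log_entries.append({
        "Date": date,
        "Issue Identified": issue,
        "Action Taken": action,
        "Rationale": rationale,
        "Rows / Fields Affected": affected,
    })

log_action("2024-06-24", "Blank entries in CompetitionDistance",
           "Replaced blanks with median value",
           "Median better reflects distribution",
           "store.xlsx, Column D, 13 rows")

cleaning_log = pd.DataFrame(log_entries)
# cleaning_log.to_excel("cleaning_log.xlsx", index=False)  # requires openpyxl
print(cleaning_log)
```

Because every entry passes through one function, the column names and terminology stay consistent automatically, which is the same best practice the worksheet version demands.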

 

9.3. Guidelines for Writing “Rationale”

Your rationale column is the most important. This is not just a “what”—it is a “why.”

Some weak examples:

  • “Needed fixing”
  • “To make things consistent”
  • “Excel didn’t like the format”

Strong examples:

  • “Filled missing Promo2SinceYear using known promo launch dates from campaign archives”
  • “Deleted duplicate records after verifying that timestamps and transaction IDs were identical”
  • “Standardized all text case in Assortment to avoid grouping errors during pivot table analysis”

Clarity in your rationale helps you defend your decisions and invites constructive discussion with others.

 

9.4. Executive Summary for Stakeholders

The executive summary is a short written document (or slide) that accompanies your submission. It addresses non-technical stakeholders—typically a business manager, team lead, or senior decision-maker.

The purpose is not to explain how Excel formulas work. Instead, it focuses on:

  • What key issues were present in the original data
  • How those issues could have affected business decisions
  • What was done to address them
  • What limitations remain
  • What this cleaned dataset enables now

 

Structure of a Strong Executive Summary

Here is a sample 4-paragraph format:

  1. Data Quality Overview

The initial Rossmann dataset contained several quality issues that could have misled strategic decisions—most notably missing values in the store metadata, duplicate sales records, and inconsistencies in date formats across both files.

  2. Cleaning Strategy

We followed a structured five-step cleaning process: identifying issues, prioritizing based on business impact, applying appropriate solutions, validating changes, and documenting all decisions. All critical fields were checked for completeness, logical consistency, and range accuracy. Outliers were carefully investigated and resolved based on operational context.

  3. Business Implications

Without cleaning, several stores appeared to be over-performing due to duplicate sales rows. Some promotional analytics would have excluded stores with missing metadata. These corrections now enable more reliable insights into campaign performance, store clustering, and regional sales trends.

  4. Remaining Considerations

We flagged five stores with incomplete promotional history for further review. These are noted in the cleaning log. The dataset now provides a stable foundation for the upcoming Q3 retail performance analysis.

 

9.5. Versioning and Communication

When submitting cleaned data, always preserve version control:

  • rossmann_raw.xlsx – untouched original
  • rossmann_cleaned_v1.xlsx – cleaned and validated
  • rossmann_cleaned_v2.xlsx – cleaned with additional QA fixes

Never overwrite prior versions unless explicitly approved. Versioning not only helps with rollback but demonstrates analytical maturity.

Communicate clearly when handing off datasets:

  • What has been done
  • What can now be done with the data
  • What must still be done (e.g., manual confirmation, additional enrichment)

A cleaned dataset without documentation is like a building without blueprints—perhaps beautiful, but structurally unverifiable. As analysts, your goal is not just to prepare data—but to make your decisions visible, logical, and communicable.

This is what earns the trust of your team—and the confidence of decision-makers.

Summary

In this lesson, you’ve learned:

  • Why effective data cleaning is foundational for successful EDA, and how to systematically identify common data quality issues like missing values, duplicates, inconsistent formats, and outliers.
  • Practical techniques for addressing each type of data quality problem using Excel’s built-in tools and formulas.
  • The importance of documentation and validation in professional data cleaning processes.
  • How to balance technical cleaning requirements with business context and decision-making needs.
  • The role of AI tools in data cleaning, including their capabilities and limitations.
 

In the next lesson, you’ll dig deeper into outliers – when to clean and when to keep them.

Suggested Readings & References

Exercise

Estimated Time to Complete: ~2 hours

 

Context

You’ve been hired as a data analyst on a consulting team preparing Walmart’s upcoming quarterly operations review. Leadership will use your analysis to decide which store formats to prioritize for marketing investment.

Before any modeling or dashboarding can begin, they need assurance that the data you’re working with is clean, consistent, and trustworthy. Your role: prepare an analysis-ready dataset, provide clear documentation of your process, and deliver a concise business-facing summary of the decisions you made.

 

Objectives

You are expected to:

  1. Identify and assess data quality issues across the walmart-sales file
  2. Systematically clean these datasets in Excel using techniques covered in Lesson 4.
  3. Validate your cleaning with cross-field logic, range tests, and consistency checks.
  4. Document your decisions transparently using a cleaning log.
  5. Write a stakeholder-facing summary explaining the business impact of the data quality issues and how your cleaning enables accurate insights.
  6. Reflect on your analytical mindset, your communication clarity, and your development as a professional analyst.

 

Data

You will be working with the following dataset in this exercise:

  • walmart-sales.xlsx

 

You can download this file from Lumen’s GitHub repository.

 

You can ignore the other two files for now:

  • walmart-features.xlsx
  • walmart-stores.xlsx
 

Tasks

Task 1: Data Quality Assessment

Using Excel, conduct a structured audit of the sales dataset. Document your findings in a table or cleaning log with columns such as:

  • Field/Variable
  • Type of Issue (e.g., Missing, Duplicate, Format Inconsistency, Outlier)
  • Frequency or Affected Rows
  • Possible Business Impact
  • Initial Thoughts on Action

 

Focus questions:

  • Are there any missing values? Where and how might they affect business decisions?
  • Are there duplicated rows or transactions?
  • Are there inconsistent formats (dates, text capitalization, currencies)?
  • Do any values seem implausible (e.g., very large negative sales)?

 

Task 2: Implement Cleaning

Apply structured cleaning steps as per the workflow taught in this lesson.

Your cleaning should include:

  • Identifying and addressing missing data using strategies aligned with business meaning
  • Removing or flagging duplicate records logically.
  • Standardizing date formats (ensure all dates use one consistent format, converting to ISO YYYY-MM-DD where appropriate).
  • Fixing inconsistent column types (e.g., ensure prices are numeric).
  • Creating a structured cleaning log in a separate worksheet with justifications for each action taken.
  • Range validation: Are there any negative Weekly_Sales or illogical Store Sizes?
  • Completeness check: Are all stores in sales represented in the stores file? Are all dates covered?

 

Include summary statistics and screenshots as needed.

 

Task 3: Stakeholder Communication

Write a one-page business-facing summary. Your goal is to communicate your cleaning work to a non-technical stakeholder (e.g., Walmart’s regional ops manager).

You may use this structure:

  • What issues you found and why they mattered
  • Your decisions and rationale
  • Remaining risks or limitations
  • What your cleaning now enables (e.g., performance comparison by store type)

 

Be clear, concise, and use business-friendly language.

 
Task 4: Engage with the Professional Analytics Community

Read the LinkedIn post on Data Cleaning Lessons. Reflect on the role of subtle assumptions and technical flaws in your own cleaning process. Then leave an insightful comment on the post: it can draw directly on your Walmart reflection, share a learning in general, or pose a thoughtful question.

 

Submission Guidelines

Submit the following as a single Excel workbook and one Word/PDF document:

Workbook:
  1. raw_data: Original datasets copied as separate worksheets
  2. cleaned_data: Cleaned versions, clearly labeled
  3. cleaning_log: Your structured log of all issues + actions

 

Filename Format:
YourName_Lesson4_Walmart_Cleaning.xlsx

 

Document:
  • YourName_Lesson4.docx
    Include:
    • Business summary (max 1 page)
    • Reflection (1–2 pages)

 

When you’re ready, submit your completed exercise to the designated folder in OneDrive. Drop your mentor a note about submission.

Important: Please scan your files for viruses before uploading.

 
Submission & Resubmission Guidelines
  1. Initial Submission Format: YourName_Lesson4_…
  2. Resubmission Format:
    • YourName_Lesson4_…_v2
    • YourName_Lesson4_…_v3
  3. Rubric Updates:
    • Do not overwrite original evaluation entries
    • Add updated responses in new “v2” or “v3” columns
    • This allows mentors to track your improvement process

 

Evaluation Rubric

Each criterion is assessed at four levels: Exceeds Expectation, Meets Expectation, Needs Improvement, and Incomplete / Off-Track.

1. Data Quality Assessment

Exceeds Expectation:
  • Conducts a comprehensive, multi-layered audit using both Excel formulas and business reasoning.
  • Identifies subtle, cascading issues across datasets (e.g., missingness correlated with store type).
  • Shows outstanding attention to nuance and downstream impact.

Meets Expectation:
  • Accurately identifies all key data quality issues (missing, format, duplicates, outliers, inconsistencies).
  • Uses structured tools like COUNTBLANK(), filtering, conditional formatting.
  • Connects issues to potential business impact clearly and fully.
  • Audit is organized and complete.

Needs Improvement:
  • Identifies some but not all issues.
  • May miss deeper or systemic problems.
  • Business impact is mentioned but underdeveloped.
  • Audit is partial or loosely organized.

Incomplete / Off-Track:
  • Major issues missed.
  • Superficial or irrelevant assessment.
  • Lacks structure, clarity, or effort.
  • Submission is plagiarized or thoughtlessly generated by AI.

2. Cleaning Implementation

Exceeds Expectation:
  • Demonstrates precise control over Excel tools and formulas.
  • Tailors cleaning logic based on context (e.g., preserving blanks for holidays).
  • Inventive solutions backed by sound reasoning.
  • Decisions are annotated and reproducible.

Meets Expectation:
  • Executes all major cleaning steps accurately and fully: missing data handled contextually, duplicates removed or flagged, formats standardized, outliers evaluated with correct statistical logic.
  • Cleaning log is detailed, professional, and well-justified.

Needs Improvement:
  • Some cleaning actions incomplete or applied inconsistently.
  • Rationale unclear or generic.
  • May rely on manual steps without clear logic.
  • Cleaning log is vague or fragmented.

Incomplete / Off-Track:
  • Cleaning is minimal, incorrect, or irrelevant.
  • Key issues unresolved.
  • Log missing or meaningless.

3. Stakeholder Summary

Exceeds Expectation:
  • Clear, polished, and business-facing. Communicates technical work in high-level terms (e.g., risk reduction, confidence in store comparisons).
  • Offers framing relevant to business decisions.

Meets Expectation:
  • Concise and structured summary.
  • Accurately describes major cleaning actions and their purpose in language suitable for a regional manager.
  • Maintains focus on decision impact.

Needs Improvement:
  • Basic or overly technical language.
  • Key points are described but not well-framed for business.
  • Misses clarity or purpose.

Incomplete / Off-Track:
  • Clearly AI-generated report with no personal input.
  • Confusing, vague, off-topic, or irrelevant to stakeholders.
  • Thoughtlessly created by AI tools.

Task 4: Personal Branding

Exceeds Expectation:
  • Composes a polished, insightful comment that clearly contributes to the conversation.
  • Ties in personal experience or reflection with an original observation, lesson, or provocative question.
  • Tone is thoughtful and professional, showing an ability to engage constructively in public discourse.
  • Reposts with their own thoughts.

Meets Expectation:
  • Shares a relevant comment with clear connection to their own experience or the post theme.
  • Tone is respectful and message is coherent.
  • Adds value to the post (e.g., through reflection, agreement, or brief elaboration).

Needs Improvement:
  • Comment is brief or shallow.
  • Lacks clear tie-in to the reflection or LinkedIn post content.
  • Tone may be overly casual or unclear.

Incomplete / Off-Track:
  • No comment made, or comment is clearly off-topic or empty (e.g., “thanks for sharing / great lesson”) or AI copy-paste.
  • Demonstrates lack of effort or professionalism.

 

Student Submissions

Check out recently submitted work by other students to get an idea of what’s required for this Exercise:

Approved on Sep 10th

Got Feedback?

Drop us a line here.

Contact

Talk to us

Have questions or feedback about Lumen? We’d love to hear from you.