Session 2 - Data Cleanup
Introduction to Data Analytics for Beginners
Before you can extract meaningful insights from your data, it needs to be clean, consistent, and reliable. Messy data leads to incorrect analysis and misleading business decisions. In this session, you will learn how to clean a real-world dataset using Google Sheets, focusing on missing values, duplicates, formatting inconsistencies, and data validation.
We will use the Amazon Sales Dataset as our example, which you downloaded and imported into Google Sheets in the previous session. If you have not done so yet, download it from the link above and import it before continuing.
Why Is Data Cleanup Important?
Imagine you are a business owner trying to review your company’s sales performance. You open your sales data and immediately notice missing prices, inconsistent product details, duplicates, and strange formatting. Naturally, you will wonder, “How can I trust the insights I get from this?”
The answer is simple: you can’t.
This is why every data project must begin with thorough data cleaning. The popular phrase in analytics says it best:
“Garbage in, garbage out.”
If your raw data is flawed, your analysis will be flawed. On the other hand, clean, reliable data enables confident, accurate business decisions.
Finding and Handling Missing Values
Missing values are one of the most common data issues. Left unaddressed, they distort averages, totals, and other calculations.
How to Identify Missing Values
We will use the COUNTBLANK() function to count empty cells in each column.
Example:
To check for missing values in Column A (Product IDs):
Scroll to the bottom of your dataset to find a blank area for your formulas.
In a blank cell, type:
=COUNTBLANK(A2:A466)
A2 is the first row of data (excluding headers).
A466 is the last row in your dataset. Adjust this number if your dataset is longer.
Press Enter to see the result.
A result of 0 means there are no missing values in that column. You should repeat this process for each column, updating the column letter accordingly.
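For readers who want to see the same check outside a spreadsheet, here is a minimal Python sketch. It assumes a small made-up table rather than the actual dataset; the column names and the count_blanks helper are hypothetical and only illustrate what COUNTBLANK() does for one column at a time.

```python
# Hypothetical sample rows; the real dataset has more columns and rows.
sample_rows = [
    {"product_id": "B001", "rating_count": "24,269"},
    {"product_id": "B002", "rating_count": ""},
    {"product_id": "", "rating_count": "7,928"},
]


def count_blanks(rows, column):
    """Count cells in one column that are empty (None or an empty string)."""
    return sum(1 for row in rows if row.get(column) in (None, ""))


# Repeat the check for every column, as the steps above do by hand.
for column in ("product_id", "rating_count"):
    print(column, count_blanks(sample_rows, column))
```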
Handling Missing Values
Once you identify missing values, consider the following approaches:
Option 1: Fill with the Column Average
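For numerical columns, a simple, widely used option is to replace missing values with the average of that column. Keep in mind that this preserves the column average but makes the data look less varied than it really is, so use it when only a small share of values is missing.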
To calculate the average:
=AVERAGE(H2:H466)
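In our Amazon Sales Dataset, the Rating Count column may have missing values. After calculating the average (e.g., 11,937), copy this value into the missing cells. Note that AVERAGE() skips blank cells and cells stored as text, so if some numbers in the column are stored as text (see Numbers Stored as Text below), fix those first or the average will be based on only part of the column.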
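The fill-with-average idea can be sketched in a few lines of Python. This is an illustration under assumptions, not part of the Google Sheets workflow: the values are made up, None stands in for an empty cell, and fill_with_mean is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical rating counts; None marks a missing cell.
rating_counts = [100, None, 300, 200, None]


def fill_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    column_mean = mean(present)  # like AVERAGE(), blanks are ignored
    return [column_mean if v is None else v for v in values]


filled_counts = fill_with_mean(rating_counts)
print(filled_counts)
```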
Option 2: Fill with Minimum or Maximum Value
In some cases, you may prefer to replace missing values with the column’s minimum or maximum, especially if that aligns better with the business logic.
Option 3: Context-Specific Fill
For more precise imputation, you can calculate averages only for similar products (by category or brand) and use those values to fill missing data.
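A context-specific fill can be sketched as follows. The example assumes made-up (category, rating count) pairs and a hypothetical fill_by_category helper; it fills each missing value with the average of the known values in the same category.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (category, rating_count); None marks a missing value.
category_rows = [
    ("Cables", 100),
    ("Cables", None),
    ("Cables", 300),
    ("Headphones", 1000),
    ("Headphones", None),
]


def fill_by_category(rows):
    """Fill missing values with the mean of the same category."""
    known = defaultdict(list)
    for category, value in rows:
        if value is not None:
            known[category].append(value)
    # A category with no known values has no mean, so its gaps stay None.
    category_means = {c: mean(vs) for c, vs in known.items()}
    return [
        (c, category_means.get(c) if v is None else v)
        for c, v in rows
    ]


filled_rows = fill_by_category(category_rows)
print(filled_rows)
```

The design choice here is to leave a gap unfilled when a category has no known values at all, rather than falling back to the overall average silently.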
Option 4: Remove Rows (Last Resort)
Deleting rows with missing data should be a last resort. Only do this if:
The row has critical missing information that cannot be imputed.
The missing data makes the row unusable for your analysis.
Remember, data is valuable. Avoid discarding rows unnecessarily.
Detecting and Removing Duplicate Rows
Duplicate rows artificially inflate your dataset, leading to incorrect totals, averages, and conclusions.
Removing Duplicates with Google Sheets
Select your entire dataset, including headers.
Click on Data → Data Cleanup → Remove Duplicates.
Ensure Data has header row is checked.
Select all relevant columns to check for duplicates.
Click Remove Duplicates.
Google Sheets will tell you how many duplicates were found and removed.
Note: Datasets from sources like Kaggle are often pre-cleaned, so you may find no duplicates initially. For practice, you can manually add a duplicate row and repeat the removal process.
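To make the logic behind duplicate removal concrete, here is a minimal Python sketch. It assumes rows are plain lists of values and that a duplicate means every column matches; the sample rows and the remove_duplicates helper are hypothetical.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row and preserve the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # every column must match for a row to be a duplicate
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique, len(rows) - len(unique)


# Hypothetical data with one manually added duplicate, as suggested above.
duplicate_sample = [
    ["B001", "USB Cable", 399],
    ["B002", "Headphones", 1099],
    ["B001", "USB Cable", 399],
]
unique_rows, removed_count = remove_duplicates(duplicate_sample)
print(removed_count, "duplicate row(s) removed")
```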
Fixing Formatting Inconsistencies
Numbers stored as text or inconsistent use of symbols can break your calculations. It’s important to clean these issues.
Numbers Stored as Text
Look at the Rating Count column. Some numbers may be left-aligned (indicating text), while others are right-aligned (indicating true numbers).
To fix this:
Identify problematic cells. Left-aligned numbers are treated as text.
Often, commas or formatting symbols cause this issue.
Select the affected column.
Click Edit → Find and Replace.
In the Find field, type the unwanted character (such as a comma ,).
Leave the Replace with field blank.
Click Replace All.
Repeat this process for other unwanted symbols. Once complete, your numbers will align properly and be recognized as numerical data.
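The same cleanup can be expressed as a short Python sketch. It assumes rating counts are whole numbers stored as text with comma separators; to_number is a hypothetical helper, and cells that still cannot be read as a number are returned as None so they can be reviewed by hand.

```python
def to_number(cell):
    """Convert text such as '24,269' to an int; return None if it is not numeric."""
    cleaned = cell.replace(",", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        return None


# Hypothetical cell contents, including one blank and one non-numeric entry.
text_cells = ["24,269", "7,928", "", "n/a"]
print([to_number(cell) for cell in text_cells])
```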
Cleaning Price Columns
Price columns may contain currency symbols, preventing accurate calculations.
To clean price columns:
Select the Discounted Price or Actual Price column.
Open Edit → Find and Replace.
Enter the currency symbol (e.g., ₹, $, €) in the Find field.
Leave the Replace with field blank.
Click Replace All.
After this, your price columns should contain only numeric values. You can verify this by selecting a range and checking if the sum or average appears at the bottom of the screen.
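As a rough illustration of the price cleanup, the Python sketch below strips a leading currency symbol and thousands separators before converting to a number. The symbol list, sample prices, and parse_price helper are assumptions made for this example.

```python
# Currency symbols to strip; extend this if your data uses others.
CURRENCY_SYMBOLS = "₹$€"


def parse_price(cell):
    """Turn text such as '₹1,099' into the float 1099.0."""
    cleaned = cell.strip().lstrip(CURRENCY_SYMBOLS).replace(",", "")
    return float(cleaned)


# Hypothetical price cells.
price_cells = ["₹1,099", "₹399", "$12.50"]
prices = [parse_price(cell) for cell in price_cells]
print(prices, sum(prices))
```

Being able to sum the converted values is the code equivalent of checking that a sum or average appears for the selected range.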
Trimming Extra Spaces
Extra spaces, especially at the beginning or end of cells, can cause hidden errors in your analysis. For example, "Cables" and "Cables " look identical on screen but are treated as two different values.
To remove them:
Select your entire dataset.
Go to Data → Data Cleanup → Trim Whitespace.
Google Sheets will automatically remove leading and trailing spaces.
This ensures your data is consistent and clean.
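In code, the same idea is a one-line operation per cell. The Python sketch below assumes rows are lists that mix text and numbers; trim_cells is a hypothetical helper that strips leading and trailing whitespace from text cells and leaves other values alone.

```python
def trim_cells(rows):
    """Strip leading and trailing whitespace from every text cell."""
    return [
        [cell.strip() if isinstance(cell, str) else cell for cell in row]
        for row in rows
    ]


# Hypothetical rows with stray spaces around the text values.
messy_rows = [["  USB Cable", "Cables  "], [" Headphones ", 4.2]]
print(trim_cells(messy_rows))
```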
Data Validation for Future Accuracy
Data validation prevents incorrect or invalid data from being entered in the future.
Example: Validating Product Links
Select the Product Link column.
Go to Data → Data Validation.
Under criteria, select Text → Valid URL.
Decide whether to show a warning or reject invalid entries.
Click Done.
Invalid URLs will now be flagged automatically, protecting your dataset from future errors.
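To show what a URL check involves, here is a minimal Python sketch using the standard library. It is not the rule Google Sheets applies internally; it simply assumes a valid product link starts with http or https and names a host. The looks_like_url helper and the sample links are hypothetical.

```python
from urllib.parse import urlparse


def looks_like_url(text):
    """Return True if text has an http(s) scheme and a host name."""
    try:
        parts = urlparse(text.strip())
    except ValueError:
        return False  # malformed input that urlparse refuses to handle
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical entries for the Product Link column.
link_cells = ["https://www.amazon.in/dp/B001", "www.amazon.in/dp/B001", "not a link"]
for cell in link_cells:
    print(cell, looks_like_url(cell))
```

This sketch is deliberately strict: a link without a scheme is flagged, which mirrors the choice between warning about and rejecting borderline entries.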
Additional Considerations
Consistent Text Formatting
You may want to standardize capitalization or formatting using functions like:
UPPER() – Converts text to all uppercase.
LOWER() – Converts text to all lowercase.
PROPER() – Capitalizes the first letter of each word.
Apply these as needed to keep product names or categories consistent.
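For comparison, Python strings have methods that roughly correspond to these three functions. The product name below is made up for illustration.

```python
# Hypothetical product name with inconsistent capitalization.
product_name = "boAt rockerz HEADPHONES"

upper_name = product_name.upper()   # counterpart of UPPER()
lower_name = product_name.lower()   # counterpart of LOWER()
proper_name = product_name.title()  # rough counterpart of PROPER()

print(upper_name)
print(lower_name)
print(proper_name)
```

As with PROPER(), title-casing flattens intentional capitalization such as brand names, so check the result before overwriting the original column.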
Conclusion
You have now learned essential data cleaning techniques, including how to:
✔ Identify and handle missing values.
✔ Detect and remove duplicate rows.
✔ Correct formatting inconsistencies.
✔ Clean up price and numerical columns.
✔ Apply validation rules to maintain data quality.
With your Amazon Sales Dataset clean and prepared, you are ready to move forward with analysis.
Next Steps
In the next session, we will explore descriptive and statistical analysis, using your cleaned dataset to uncover insights and answer real business questions.
Action Items Before You Continue
- Complete the data cleanup steps on your own dataset.
- Reflect on how data quality impacts business decisions.
- Apply data validation where needed to prevent future issues.