Lesson 5 - Statistical Visualizations: Histograms & Box Plots

Estimated Read Time: 2 - 2.5 Hours

Learning Goals

In this lesson, you will learn to:

Create a visual to demonstrate the distribution of a variable
Discuss use cases for visualizing statistical findings

Welcome back! Just like part of your analysis is statistical in nature, some visualizations are, as well. Finding a statistically significant difference between two groups leads to a stronger result than saying group A looks or seems different than group B. Think of it as a way to quantify how certain you are that the groups are actually different. In that same way, statistical visualizations help quantify things—the difference between groups, the strength of relationships, and more. You’ll be learning all about these statistical visualization types over the course of the next two Lessons.

Some of the charts you’ve already encountered can also be used to display statistical information. Whether a chart is statistical in nature isn’t determined by the type of chart it is, but by the type and format of the data it’s visualizing. One such example is the bar chart. You already learned about bar charts as one way to display categories of data; however, when all of that data is numeric, bar charts can display distributions or frequencies of that data. In fact, a bar chart that displays frequencies is actually a specific type of statistical visualization: a histogram.

Some other charts are only used for displaying statistical information, meaning that you can’t use them for any other purpose. One such example is the box and whisker chart, or box plot. This chart provides a way of displaying the quartiles, median, and spread of a data set.

You’ll be taking a look at both of these types of statistical visualizations in this Lesson, not only how they work, but also how to create them in Tableau. Ready to go tit for tat with stats? Then, let’s get started!

1. Frequencies & Distributions

Frequency tables are composed of counts of quantitative data. You can create these counts by way of, for example, Excel’s pivot tables.

In the table in Figure 1 below, you can see the frequency of data elements (on the right) for a selection of given value ranges (on the left). There are 66 data elements with values between 1 and 10, 70 data elements with values between 11 and 20, and so on:

Range	Frequency
1–10	66
11–20	70
21–30	73
31–40	57
41–50	70
51–60	82
61–70	58
71–80	74
81–90	59
91–100	67

Figure 1

Frequency charts have a special name: histograms. A histogram looks like a bar chart but comes with a certain set of requirements that differentiate it from the composition charts you created in the previous Lesson. The main difference is that histograms display ranges while bar charts display groups. Take a look at the two charts in Figure 2, below. Notice how each column in the histogram represents a range of values, while each column in the bar chart represents a single group?

Figure 2. Histograms show frequencies along the y-axis and ranges of data along the x-axis.

These ranges of numbers in histograms are called bins, and the bins within a histogram are almost always (though not technically required to be) equal in size. Each bin in the above example has 10 values (e.g., the “1–10” bin includes the values 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10). In addition, the bins must be adjacent to one another and can’t skip any values: the histogram above includes every value between 1 and 100 within its bins.
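To make the idea of bins concrete, here is a minimal Python sketch that sorts a handful of values into the same equal-width ranges used in Figure 1. The sample values and the bin_label helper are hypothetical and aren't part of this Lesson's data.

```python
from collections import Counter

def bin_label(value, width=10):
    """Return the label of the equal-width bin that holds value (bins start at 1)."""
    low = ((value - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

# Hypothetical sample values between 1 and 100
values = [3, 10, 11, 15, 20, 27, 99, 100]

# Counting the values per bin produces a small frequency table
frequencies = Counter(bin_label(v) for v in values)

for label, count in frequencies.items():
    print(label, count)
```

Note that 10 lands in the “1-10” bin and 11 in the “11-20” bin: the bins sit side by side and no value is skipped.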

Because their x-axes are numeric in nature, histograms can’t be sorted according to frequency. With bar charts, you could sort the groups within the chart from, say, smallest to largest. Histograms, however, must be sorted according to the ranges themselves. Take the example above in Figure 2. You couldn’t reorder the bars in this histogram; otherwise, you’d have a strange procession of ranges (e.g., “61–70” followed by “41–50” followed by “11–20”, etc.). It simply wouldn’t make any sense! This is just another way in which they differ from bar charts.

Histograms are considered statistical visualizations for displaying the distribution of data within a data set. Returning to the above example, you can see that the data is spread fairly evenly throughout the ranges—there aren’t any dramatic peaks or valleys. What does this mean for the analyst? Well, suppose that frequency, here, was representing museum tickets purchased, and each bin corresponded to a certain range of minutes after the museum opened. A distribution like the one above would show that visitors purchase tickets at a consistent rate for the first 100 minutes of the museum opening. The museum could use this data to plan for staffing needs, ensuring they have a consistent level of staff on the ticket booth for the first 100 minutes of the museum opening each day.

Had the histogram looked like the following, however, where there are much higher frequencies for the 1–50 ranges and much lower frequencies for the 51–100 ranges, the museum might decide to temporarily place an additional staff member on the ticket booth when the museum first opens to account for the higher frequency of visitors, then reduce the number of staff after 50 minutes.

Figure 3. The first five bins (which represent the first 50 minutes of the museum opening) have high frequencies around 140 (representing, in this case, 140 tickets sold). That’s one popular museum!

Just like the frequency tables, histograms help you find data patterns. In the above example, you may have been analyzing ticket sales to help the museum adjust their staffing. The histogram demonstrated a very clear trend in ticket sales over time, which the museum manager could then use to determine staffing needs. Without this histogram (and subsequently without you, the analyst, making this histogram), they could experience staffing inefficiencies such as not having enough staff during busy times or having too many staff during down times.

At many organizations, data analysts are the ones who provide access to this kind of data. Not just anyone would know what to do with, for example, ticket sales data, which is why it’s your job to analyze and visualize data for the organization, helping them find patterns, locate inefficiencies, and support their everyday decision-making. The ticket sales scenario above is just one example of the type of work you might do. In other scenarios, histograms may only be one small part of a complete analysis in a more complex research project. In both cases, however, they’re a go-to tool for identifying patterns or trends.

2. Creating Histograms In Tableau

Now that you know a little more about histograms in general, let’s walk through how you can create your own histograms in Tableau! For this example, you’ll be using the same OECD enrollment and graduation data set you used in the previous Lesson. This time, you’ll be using it to create a histogram of the frequency of different enrollment rates across countries.

Open your workbook from the previous Lesson and start a new sheet. Rename the sheet “Enrollment Histogram.”
Next, change your Time variable to a variable type of Date and give it a more descriptive name: Year.
Finally, give your Value variable a more descriptive name: Enrollment Rate.

After having used this data in the previous Lesson, you should already be familiar with what each of its variables represents (secondary and tertiary education for students by age and country). Previously, you focused on a specific country and made forecasts about enrollment rates for future years. Now, you’re going to focus on typical enrollment rates across multiple countries. This might be something you’d do if you were working for an education company in Germany and wanted to know whether enrollment rates in your country were comparable to those in the rest of the world. As this involves analyzing the frequency of enrollment rates, and because enrollment rate is a quantitative variable that can range continuously from 0 to 100, a histogram would be the perfect choice of visualization. Let’s get started!

2.1. Selecting Your Variable

Start by telling Tableau what variable you want to use to calculate the frequency of enrollments by dragging the Enrollment Rate variable onto the Rows shelf. You’ll end up with a very simple-looking bar chart:

Figure 4. This data set isn’t very interesting as a simple bar chart.

Now, click the Show Me menu and select Histogram. Here, Tableau informs you that a histogram can have one measure. This is a great reminder that histograms require quantitative data.

Figure 5. Remember that you can always check the requirements for a type of chart at the bottom of the Show Me menu!

After selecting Histogram, your chart does, indeed, become a histogram:

Figure 6. Your first histogram in Tableau!

Along the y-axis is the frequency (or count) of the enrollment rate according to country. How many records in the data set correspond to each enrollment rate bin? Well, if you were looking at a single age group (to simplify interpretation), fewer than five countries would fall into each of the first four bins (two countries, two countries, one country, and zero countries for each bin respectively).

Along the x-axis are your bins, which Tableau has automatically calculated for you. These bins represent enrollment rate ranges (each of which, as you’ll learn in just a moment, spans 3.13 percentage points). The first bin contains two countries. This means that two countries have an enrollment rate between 0 and 3.13. Note that this number is a percentage rather than a count. You’ll fix this on your chart later to make it clearer.

2.2. Formatting the X-Axis

Your x-axis represents enrollment rate ranges; however, your x-axis is currently displaying single numbers. Something’s wrong here! Those single numbers you see now are actually tick marks rather than labels for your ranges. In fact, if you look closely, you’ll see that they don’t actually align with your bins at all! The first bin is between 0 and 5, the second bin is straddling the 5, the third bin is somewhere between 5 and 10, and so on:

Figure 7. Unaligned tick marks make for a messy histogram.

This is a bit confusing, as you might imagine, as most people looking at the chart would assume the tick marks represent the size of each bin. Let’s fix this!

In your variables list, you’ll notice a new variable has been created in your Dimensions area: Enrollment Rate (bin). This is automatically created whenever a histogram is selected:

Figure 8. This new variable will allow you to edit your histogram’s bins.

Click the down arrow to the right side of the variable name to Edit this variable:

Figure 9

Within the Edit Bins dialog, you can adjust aspects of the variable:

Figure 10. The Edit Bins dialog gives you various bits of information about your bins.

It also gives you a bit more information around your ranges. For instance, the smallest enrollment rate in your data is 0.09 percent (Min), and the largest is 100 percent (Max). This means your data has a range of 99.92 (Diff). Additionally, there are 891 distinct values (CntD).

To display this data in a histogram, Tableau used a bin size of 3.13 percent. This means that there are 32 bins (99.92 divided by 3.13 is roughly 31.9, which rounds up to 32). You can adjust the number of bins by adjusting the bin size. Try to aim for between 10 and 30 bins. You also want your bins to be easily understandable. A bin size of 3.13 percent, for instance, is hard to remember, calculate, and interpret. Aim for an integer-based bin size, instead. In this case, a bin size of 4 percent would work well and be much easier for end users to interpret. To change the bin size, click on the Size of bins dropdown menu and select Enter a Value. This will make the field editable. Enter a new bin size of “4” and select OK.
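The arithmetic behind the bin count can be sketched in a few lines of Python. This sketch assumes that bins are aligned to multiples of the bin size starting at 0 and that each bin includes its lower bound; the function names are hypothetical, and the minimum and maximum are the values reported in the Edit Bins dialog.

```python
import math

def bin_start(value, size):
    """Lower bound of the bin holding value, assuming bins start at multiples of size."""
    return math.floor(value / size) * size

def bin_count(min_value, max_value, size):
    """Number of bins needed to cover every value from min_value to max_value."""
    first = math.floor(min_value / size)
    last = math.floor(max_value / size)
    return last - first + 1

print(bin_count(0.09, 100, 3.13))  # bin size Tableau picked by default
print(bin_count(0.09, 100, 4))     # the easier-to-read bin size of 4
print(bin_start(100, 4))           # a rate of 100 lands in the bin that starts at 100
```

Under these assumptions, a bin size of 3.13 produces 32 bins and a bin size of 4 produces 26, which sits within the suggested 10 to 30.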

Great! With larger bins (and fewer of them!), your chart should be easier to interpret. You still have one thing to do, though, and that’s update your x-axis labels to align with your new bins. Right-click the x-axis and select Edit Axis:

Figure 11. If it’s changes to the axes you seek, upon the Edit Axis dialog must you tweak.

Navigate to the Tick Marks tab, then change Major Tick Marks to Fixed with an interval of “4.” Four was also the size of your bins, so this will ensure your x-axis labels align with your bins:

Figure 12. When your bin size aligns with your tick marks, all is right with the world.

Once finished, take a look at your updated x-axis. Each histogram bin should now align with a tick mark, increasing from 0 to 100 by increments of four. Strangely, the very last bin looks like it ranges from 100 to 104. If you hover over it, the tooltip will inform you that the bin is 100. Tableau labels each bin with its lower bound, so this last bin starts at 100. Because no enrollment rate in the data exceeds 100 percent, you can deduce that the bin only contains rates of exactly 100 percent (not up to 104 percent, as the axis suggests).

Figure 13. It doesn’t take a rocket scientist to know that 104 percent on a chart that only goes from 0 to 100 percent is simply illogical.

You can apply this same logic to the very first bin. Hovering over it will inform you that the bin is 0. Logically, however, you can deduce that this bin ranges from 0 to 4 percent.

Now, let’s edit the axis labels to include the percent sign. This will make it easier for viewers to understand at a glance that the values they’re looking at are percentages. You did this before in the previous Lesson, but for a recap, you can simply right-click the variable name in the Dimensions or Measures list and select Default Properties→Number Format. Remember that you don’t want to choose the Percentage option here; choose Number (Custom) instead (otherwise, you’ll multiply all your numbers by 100!). Go ahead and do this now for your Enrollment Rate variable, setting its Decimal places to “0” and adding a Suffix of “%”. Once finished, your x-axis should look something like this:

Figure 14
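The difference between the two number formats can also be illustrated outside Tableau. In this small Python sketch, the percent format multiplies the value by 100 first (much like the Percentage option), while a plain number with a “%” suffix leaves the value alone. The sample rate is hypothetical.

```python
rate = 80  # an enrollment rate that is already expressed in percent

as_percentage = f"{rate:.0%}"  # percent format: multiplies by 100 first
with_suffix = f"{rate:.0f}%"   # plain number, zero decimals, "%" suffix

print(as_percentage)  # 8000%
print(with_suffix)    # 80%
```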

2.3. Adjusting the Dimensions, Colors & Labels

Just like in your previous visualizations, color can be used to add an additional dimension to your histogram. Recall that the Subject variable in this data set contains age categories. To display these categories in your histogram, drag the Subject variable to the Color box on your Marks card. Your histogram should now be sporting some new colors, as well as a color legend informing you what each color represents:

Figure 15. The colors!

As always, you’ll need to make a few tweaks to your colors to ensure they follow the proper visualization style guidelines. Click the Color box on your Marks card to adjust the colors to something more visually intuitive. In this example, for instance, we’ve chosen a set of analogous colors on the cold side of the color wheel.

Because the data spans multiple years and multiple age groups, what the y-axis is actually representing is a count of countries with a particular enrollment rate for a particular age group by year. You can see the age groups as the different colors within each bin, but there’s no way to add the years to your histogram. Because of this, each country is actually being counted multiple times—once for each year. As such, the histogram is more of a total-records count than a country count.
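The difference between a record count and a country count is easy to see in a small sketch. The rows below are hypothetical stand-ins for the OECD data, and the bin boundaries assume a bin size of 4.

```python
# Hypothetical records: (country, year, age group, enrollment rate in percent)
records = [
    ("Germany", 2005, "17 years", 93.0),
    ("Germany", 2006, "17 years", 94.5),
    ("Germany", 2007, "17 years", 95.1),
    ("France", 2005, "17 years", 92.2),
]

# Keep the records whose rate falls into the bin that runs from 92 up to 96
in_bin = [r for r in records if 92 <= r[3] < 96]

record_count = len(in_bin)                   # what the histogram bar counts
country_count = len({r[0] for r in in_bin})  # distinct countries in the bin

print(record_count, country_count)  # 4 2
```

Germany appears three times (once per year), so the bar would show four records even though only two countries are involved.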

This can be confusing for potential viewers (and you, to be honest). Let’s adjust the title to something more descriptive to decrease the chance for any ambiguity—something like “Enrollment Rates Histogram of OECD Countries by Age (2005 – 2017)” would work well. Additionally, you can rename the y-axis to “Frequency” (all histograms must have a y-axis of frequency, after all). Finally, remove the “(bin)” from the x-axis title. This simply cleans it up and eliminates any potential confusion for your viewers:

Figure 16. Your chart is growing more intuitive by the minute!

The final step is to add some labels to your chart. Because there are so many bars—and because the frequency covers such a large range—it can be difficult to determine the exact count for a particular color block. Labels will help.

To fix this, turn on labels via the Label box on the Marks card (simply check the Show mark labels option). By default, Tableau will display the number of records in each category in each bin:

Figure 17

While this may be what you want, it’s more common to display these numbers as percentages (so you can easily see the overall proportion of each category per bin). Tableau makes it relatively simple to convert to percentages. A percentage is nothing more than the count of something over a total, or denominator. In this example, the denominator could be the total records count, the total records by year, or the total records by age group. You want to see the percentage of each age group within each individual bin, so your denominator will be the number of records in each bin.

To set this up in Tableau, start by making a copy of your CNT(Enrollment Rate) variable and dragging it from the Rows shelf to the Label box. Remember that to copy, rather than move, variables, you need to hold down the Command key (for Mac) or the Control key (for Windows) while you drag the variable. Do so now.

Figure 18

Tableau Tip
Dragging variables while holding down the Command or Control key creates a copy of that variable. While you could drag a new copy of the Enrollment Rate variable from the Measures list, this would give it a new aggregation. By copying it, you ensure it retains the same CNT() aggregation as the original variable.

Next, click the down arrow next to the variable name on your Marks card to bring up the variable menu and select Quick Table Calculation→Percent of Total.

Tableau builds in a selection of what it thinks are common calculations. Displaying a count as a percentage of a total, for instance, is a common calculation. While you could create a new variable and perform this calculation yourself, Tableau saves you time by providing common calculations in this Quick Table Calculation menu:

Figure 19. Quick Table Calculations save you time, and there’s nothing a data analyst loves more than saving time (besides numbers, that is!).

Once selected, the numbers on your chart will change to percentages; however, you’ll notice they have a lot of decimal places, are hard to read, and only display for some of the categories. This will require some reformatting to fix; once again, click on the down arrow next to your CNT(Enrollment Rate) variable on the Marks card, only this time, select Format:

Figure 20

A menu will appear to the left, displaying numerous formatting options for this variable. Under Default, click the Numbers box, choose Percentage, and change the Decimal places to “0”:

Figure 21

Already, your chart should look much cleaner!

Figure 22

Now, however, you may notice a different problem—the percentages aren’t what you’d expect. Logically, the percentages in each bin should add up to 100 percent, but they’re currently only adding up to a fraction of that!

This is because Tableau chose the wrong denominator. Rather than giving you the percent of each category in relation to its corresponding bin, it’s giving you the percent of each category in relation to the entire histogram.

Fortunately, this is an easy fix: you just need to change the scope of the calculation. To do so, bring up the CNT(Enrollment Rate) variable menu in your Marks card and select Compute Using→Table (down):

Figure 23

Tableau Tip
The different Compute Using options can be quite confusing, and the interpretation varies depending on the type of chart you’re using. The best way to ensure you choose the correct computation type is to calculate an example of the number you expect, then test the different Compute Using options one by one until the result onscreen matches the result you calculated. This is a great method of quality control you can use to systematically check that your chart matches your expectations.

The percentages for the categories within each bin now add up to 100 percent. Perfect!

Figure 24. Your finished histogram, complete with colors, labels, and correct axis titles!
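The effect of choosing the denominator can be sketched in Python. The counts below are hypothetical, not taken from the OECD data; the point is only to show how the same counts produce different percentages depending on whether they're divided by the whole table or by their own bin.

```python
# Hypothetical record counts per (bin, age group)
counts = {
    ("88", "17 years"): 30,
    ("88", "18 years"): 15,
    ("88", "19 years"): 5,
    ("92", "17 years"): 40,
    ("92", "18 years"): 8,
    ("92", "19 years"): 2,
}

grand_total = sum(counts.values())

# Total number of records in each bin
bin_totals = {}
for (bin_label, _), n in counts.items():
    bin_totals[bin_label] = bin_totals.get(bin_label, 0) + n

# Denominator = whole histogram: a bin's percentages only sum to its share
pct_of_table = {key: n / grand_total for key, n in counts.items()}

# Denominator = the bin itself: a bin's percentages sum to 100 percent
pct_of_bin = {key: n / bin_totals[key[0]] for key, n in counts.items()}

for key in counts:
    print(key, f"{pct_of_table[key]:.0%}", f"{pct_of_bin[key]:.0%}")
```

This mirrors the quality-control habit from the Tableau Tip: work out one expected number by hand, then check which option reproduces it.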

Congratulations! You just made a histogram that shows the range and frequency of enrollment rates across countries and years according to age group. You can see that the data skews to the left—this means that there are only a few countries and years with enrollment rates below 40 percent. Meanwhile, 17-year-olds have the highest enrollment rates, with the majority of the blue bars concentrated to the right side of the chart. Conversely, 19-year-olds have the lowest enrollment rates with purple being concentrated to the left side of the chart.

In this way, histograms help with identifying data patterns, similar to what you saw in the museum ticket example earlier. This histogram shows age patterns that align with what many policymakers probably hope: that the highest education enrollment happens early (at 17 years) rather than late (at 19 years). Histograms are particularly useful when there are logical bins or thresholds. National governments, for instance, may have a goal to ensure at least 50 percent of 17-year-olds are enrolled in higher education. A histogram would make it easy to see whether this goal had been accomplished or not. Broadly speaking, histograms are most useful when looking for patterns in quantitative data: When during opening hours do museum ticket sales peak? Are goals being reached for youth education enrollments? These are all questions you, as the analyst, can help organizations answer.

3. Box Plots

Box plots are so named because, well, they look like boxes. They’re also one of the most common types of statistical charts and one you’ll be working with often as a junior analyst. Box plots provide a way to visualize descriptive statistics, which means that they’re best suited for quantitative data. They allow you to see the median and data quartiles for a given variable:

Figure 25. A box plot consists of a box with two antennae called “whiskers.”

Note that the orientation of box plots doesn’t matter: they can be vertical or horizontal.

Data can be split into four quartiles. Each quartile contains one quarter of the data elements of that variable. A variable with 20 data elements, for example, would have quartiles that contain 5 elements each (20 divided by 4 is 5). In ascending order, this would look something like:

Quartile 1 { 1, 2, 3, 3, 4}
Quartile 2 { 5, 7, 9, 9, 10}
Quartile 3 {11, 12, 12, 14, 16}
Quartile 4 {17, 18, 18, 19, 20}

Let’s take a look at this in box plot form. The left-most dot, highlighted in red below, represents the minimum value of the data. Using the above data, this would be “1”:

Figure 26

A whisker, or line, extends from the minimum (red dot) to the end of the first quartile (the fifth element). Then comes the box, which extends from the end of the first quartile to the end of the third quartile (each half of the box representing one quartile). The box itself is divided by a line representing the median of the data set. In this example, the box is divided in half. This means the middle of the data is symmetrical: the values in the second quartile span about the same distance as the values in the third quartile. To use numeric terms, the mean and median of this data set are the same.

Remember that the median is the middle element in a data set. Because there are 20 data points (an even number), there’s no one number that can represent the middle element. In this case, you’d add the 10th and the 11th elements together and divide by 2: (10 + 11)/2 = a median of 10.5.

Figure 27
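The quartile groups, median, and mean of the example data can be checked with a short Python sketch. It simply splits the sorted data into four equal groups, as the example does; statistical software may calculate quartile boundaries slightly differently.

```python
# The 20 data elements from the quartile example, in ascending order
data = [1, 2, 3, 3, 4, 5, 7, 9, 9, 10,
        11, 12, 12, 14, 16, 17, 18, 18, 19, 20]

# Split the sorted data into four groups of equal size
size = len(data) // 4
quartiles = [data[i * size:(i + 1) * size] for i in range(4)]

# With an even number of elements, average the two middle elements
middle = len(data) // 2
median = (data[middle - 1] + data[middle]) / 2

mean = sum(data) / len(data)

print(quartiles[0], quartiles[3])  # first and fourth quartile
print(median, mean)                # 10.5 10.5
```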

On the right side of the box, the chart continues with another whisker connecting the end of the third quartile to the rightmost dot, which represents the maximum value. This constitutes the fourth quartile:

Figure 28

Alternative Box Plots
Sometimes, box plots show a more technical representation of the data. While the leftmost dot and rightmost dot always represent the minimum and maximum values, respectively, the whiskers don’t always extend all the way to those values. Instead, they may stop at a limit calculated from the interquartile range (the width of the box), commonly 1.5 times that range beyond each end of the box:

Figure 29. That box has lost its dots!

This use of the interquartile range changes the box plot representation so that you can easily identify outliers (i.e., any dots beyond the lines). In the example above, the dots on the line signify data points that fall within those limits, while the dots beyond the line signify outliers. Most software will perform these calculations for you; the Resources section at the bottom of this Lesson includes a link to a video walking through the calculations if you’re interested.
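As a rough illustration of the outlier calculation, the following Python sketch applies the common convention of placing the limits 1.5 times the interquartile range beyond the box. The data is hypothetical, and the quartile method used here (the “inclusive” method of Python’s statistics module) is an assumption; other tools may compute quartiles slightly differently.

```python
import statistics

# Hypothetical data with one unusually large value
data = sorted([2, 4, 4, 5, 6, 7, 8, 9, 30])

# Quartile cut points (first quartile, median, third quartile)
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Limits at 1.5 times the interquartile range beyond each end of the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = [x for x in data if lower_limit <= x <= upper_limit]
outliers = [x for x in data if x < lower_limit or x > upper_limit]

# The whiskers stop at the most extreme points that are still inside the limits
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3)             # 4.0 6.0 8.0
print(whisker_low, whisker_high)  # 2 9
print(outliers)                   # [30]
```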

While you can technically calculate all the descriptive statistics necessary for a box plot and display them in a normal table, the visual element of box plots helps you more easily identify trends in those statistics. Let’s take a look at an example.

Figure 30

The two box plots in Figure 30, above, represent the same number of data elements, both ranging from 1 to 100. On the left, half of the data (in the box) is between 42 and 98, which equates to a range of 56. The median line is towards the top (median = 79), meaning that the data isn’t symmetric. If the data were symmetric, the median would be the same as the mean, and the line would be in the middle of the box, at around 70.

On the right, half of the data (in the box) is between 2 and 68, which equates to a range of 66. The lowest quarter of the data is packed into a very narrow range, as evidenced by the incredibly short bottom whisker (between 2 and 1); the highest quarter, however, is spread over a much wider range, as evidenced by the very long top whisker (between 68 and 100).
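The numbers quoted for the two box plots can be reproduced with simple arithmetic. This sketch uses only the values stated in the text (minimum, first quartile, third quartile, and maximum); the describe helper is hypothetical.

```python
def describe(minimum, q1, q3, maximum):
    """Return the box width and the two whisker lengths of a box plot."""
    return {
        "box": q3 - q1,
        "lower_whisker": q1 - minimum,
        "upper_whisker": maximum - q3,
    }

left = describe(1, 42, 98, 100)
right = describe(1, 2, 68, 100)

# Midpoint of the left box; a symmetric box would have its median line here
left_midpoint = (42 + 98) / 2

print(left)
print(right)
print(left_midpoint)  # 70.0
```

Each whisker always covers a quarter of the data, so a short whisker means that quarter is tightly packed, while a long whisker means it’s spread out.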

In one visual, box plots display a great deal of statistical information: quartiles, medians, minimums, maximums, distribution, and skew. Few other visualizations show this much statistical information in one chart. Bar charts and histograms, for instance, can show data distribution, which includes skew, minimums, and maximums, but they don’t include quartiles and medians.

This makes box plots incredibly useful when it comes to comparing data. Say, for example, that you want to look at sales revenue between two different regions. You might create a box plot of each region as a way to visually interpret the differences. One region may have few, high-revenue deals, while the other region may have many, low-revenue deals. Box plots would show the differing ranges, distributions, and medians:

Figure 31

The region with high-revenue deals clearly has a higher median, but the spread of the data is also very small, from 1400 to 1600. The top part of the box is smaller, meaning that the deals above the median are packed more tightly than those below it, so the deals cluster toward the higher end of the chart. The region with many, low-revenue deals, on the other hand, has a lower median and a much larger spread of data, with whiskers ranging from 100 to 1000. The lower box is slightly larger than the upper box, meaning that the deals below the median are more spread out than those above it.

As you may have noticed, box plots aren’t the most intuitive of charts, nor can they be easily deciphered by non-data-savvy people who aren’t familiar with what all the pieces represent. Even for those who are data-savvy, box plots require practice to decipher and interpret. For this reason, while box plots can be helpful to you as an analyst, be wary about using them as a communication tool for stakeholders. If your stakeholders are data-savvy, these charts will be suitable. But if you’re presenting to non-data stakeholders, you may want to use an alternative option for visualizing these statistics.

4. Creating Box Plots In Tableau

Fortunately, making box plots in Tableau is quite simple—Tableau finds the minimum, maximum, quartiles, and median for you. Let’s practice making one now using the enrollment rates from your OECD data set.

With your data already loaded and prepped in Tableau, all you need to do is configure some titles and variables to set the foundation for your box plot:

Create a new sheet called “Enrollment Box Plot.”
Drag the Enrollment Rate variable onto the Rows shelf.

Once ready, head to the Show Me menu and find the image for Box-and-Whisker Plot. But wait! What’s this? The box and whisker icon is greyed out, meaning that it’s unavailable. If you hover over it, Tableau tells you to choose one or more measures. You’ve done this! Haven’t you?

Figure 32. But you have “1 or more Measures”, Tableau!

Well, if you read the text underneath the requirements, you’ll see that it also says: “Use at least 1 dimension or disaggregate.” Your variable is currently aggregated, which is why Tableau isn’t allowing you to turn it into a box plot. To continue, you’ll need to disaggregate.

When you dragged the Enrollment Rate variable onto the Rows shelf, Tableau automatically aggregated the data by wrapping it in a SUM() function. The result was a bar chart with a single bar representing one number. All the enrollment rates were summed, or aggregated. While this was what you wanted for your histogram, not so for your box plot. To change this, open the Analysis menu from the top toolbar and uncheck the Aggregate Measures option:

Figure 33

Once unchecked, your chart should look a little different. Now, each individual enrollment rate appears as a circle rather than all the enrollment rate values being summed into one number. If you’ll remember, this data set includes enrollment rates for multiple countries for multiple years. Each of those rows of data is now represented by a circle:

Figure 34. There are so many circles in this bar chart that you can’t even differentiate them!
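The difference between aggregated and disaggregated data can be pictured with a tiny Python sketch. The rates are hypothetical; the point is that an aggregate collapses every row into a single number, while a box plot needs the individual values.

```python
# Hypothetical enrollment rates, one per row of data
rates = [93.0, 94.5, 80.2, 61.7]

aggregated = sum(rates)      # a single number, like SUM(Enrollment Rate)
disaggregated = list(rates)  # one value per row, as a box plot requires

print(aggregated)
print(len(disaggregated), "individual marks")
```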

Now that your data has been disaggregated, the box and whisker plot option in the Show Me menu should be available. Give it a click! This should transform your circly bar chart into a more-useful box plot:

Figure 35

By default, Tableau has used interquartile range calculations to keep the whiskers from extending to the minimum and maximum values. This means that the whiskers in their current form extend only to the data points within 1.5 times the interquartile range of the box, and any dots beyond these lines designate outliers in the data.

From this box plot, you can glean that the median enrollment rate is 80 percent (the middle line in the box). The lower part of the box is slightly larger, and the lower whisker much longer, which means that the records below the 80 percent median are spread over a wider range of enrollment rates than the records above it.

You can change to a simpler box plot with whiskers that span to the minimum and maximum values by right-clicking the y-axis and selecting Edit Reference Line:

Figure 36

This will open up a dialog where you can edit the length of the whiskers. To extend them to the maximum extent of the data, set the Whiskers extend to field to Maximum extent of the data. While you’re at it, you can also select Hide underlying marks (except outliers) so that the dots designating individual data points disappear and turn into lines. If you want to, you can also change the style and color of the box and whiskers on this menu:

Figure 37. How long do your whiskers extend?

After clicking OK, your box plot should look something like this:

Figure 38

With the histogram, you were able to examine how enrollment rates differed according to age by way of color. With box plots, however, you can’t use color—the structure of the chart simply doesn’t allow for it. Instead, you can create multiple plots. Let’s create a few more now by dragging the Subject variable to the Columns shelf, which should transform your single box plot into three (one for each age category):

Figure 39

One thing you may notice is a little box at the bottom of the chart informing you that there are 334 null values. This means that not every data row has an enrollment rate. Rows without a rate can’t contribute anything to the box plots, so it’s safe to exclude them. To do so, simply drag the Enrollment Rate variable from your Measures list to the Filters box above your Marks card. On the resulting modal, you’ll notice a checkbox labeled Include Null Values. Don’t touch this! You actually want to leave it exactly as it is. Without touching anything, click OK, and the null indicator will be removed from your chart.
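The same two ideas (one summary per age category, with rows lacking a rate left out) can be sketched in Python. The rows below are hypothetical, with None standing in for a null value.

```python
import statistics
from collections import defaultdict

# Hypothetical rows: (age group, enrollment rate in percent, or None if missing)
rows = [
    ("17 years", 93.0), ("17 years", 96.0), ("17 years", None),
    ("18 years", 80.0), ("18 years", 70.0), ("18 years", 75.0),
    ("19 years", None), ("19 years", 55.0), ("19 years", 65.0),
]

# Group the rates by age, skipping rows without a rate (the null values)
by_age = defaultdict(list)
for age, rate in rows:
    if rate is not None:
        by_age[age].append(rate)

null_count = sum(1 for _, rate in rows if rate is None)

# One summary per group, comparable to one box plot per age category
for age, rates in by_age.items():
    print(age, min(rates), statistics.median(rates), max(rates))

print(null_count, "rows excluded")
```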

While you’re at it, let’s go ahead and change the title of your chart to something more descriptive. Something like “Enrollment Rates of OECD Countries by Age (2005 – 2017)” would work well. Finally, let’s remove the titles from your axes as they’re simply adding redundant information (a common occurrence with box plots). Right-click the y-axis, select Edit Axis, and delete the title text. You can also change the axis range from Automatic to a Fixed range of 0 to 100 percent (you can’t have enrollment rates greater than 100 percent, after all).

Figure 40

To get rid of that annoying Subject label at the top of the chart, right-click the label itself and select Hide Field Labels for Columns:

Figure 41. Bye-bye, subject line—hello, clean, lean box plot!

And with that, you’ve finished your first box plot in Tableau!

Figure 42

Congratulations! You’ve just finished creating your sixth type of chart in Tableau. You’re already building quite the arsenal of charts! From this box plot, it’s easy to see that 17-year-olds have a wide range of enrollment rates, but the middle 50 percent of records (shown in the box) have rates of around 90 percent or higher. The 17-year-olds have the highest rates, followed by the 18-year-olds, and then the 19-year-olds, who have the lowest rates. Only the 17-year-olds, however, have a minimum value of 0 percent enrollment.

Box plots are incredibly helpful when comparing distributions across groups, particularly if you want to examine the spread and median of the data. You can easily see that many 17-year-olds have enrollment rates between 89 and 96 percent, but the rates range from 0 to 100 percent. The lowest rate for 18-year-olds is 6 percent, and for 19-year-olds, 15 percent. The middle, or median, drops progressively lower with each group.

Compare the box plots with the histogram you created earlier in this Lesson. You can see the age skews clearly in both examples (17-year-olds have the highest rates); however, it’s not as easy to see the minimums, maximums, and medians using the histogram. This is where box plots truly shine. Broadly speaking, box plots are useful when looking for differences between groups in quantitative data (which age group has the highest enrollment?) and when looking for typical values (what is the median enrollment rate for 18-year-olds?).
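The center line of a box plot marks the median rather than the mean, and the two can differ noticeably when a group contains a few extreme values, such as the 0 percent minimum among the 17-year-olds. The short Python sketch below illustrates the gap; the rates are invented for the example and aren’t taken from the lesson’s data set.

```python
from statistics import mean, median

# Made-up enrollment rates (percent) with one extreme low value.
rates = [0, 85, 90, 92, 95, 96]

mean_rate = mean(rates)      # pulled down by the single 0 percent record
median_rate = median(rates)  # middle of the sorted values, barely affected

print(f"mean:   {mean_rate:.1f}")
print(f"median: {median_rate:.1f}")
```

Here the mean lands in the mid-70s while the median stays at 91, which is why the median line in a box plot is usually a better guide to the typical record when the data is skewed.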

Figure 43. Box plots and histograms—the newest tools on your analytical shelf.

Summary

Statistics isn’t just calculations; some statistics can actually be visual! In fact, by visualizing your statistics, you not only aid your own understanding but that of those you share your data with, as well. In this Lesson, you learned about two specific types of statistical visualizations. Histograms are a great way to visually explore the distribution of a variable. They show whether some values are more common and whether some values don’t exist at all. Box plots, on the other hand, are a great way to visually display summary statistics (median, quartiles, minimum, and maximum). Because they’re a bit technical, however, they’re not as useful when sharing data with non-data-savvy individuals; instead, you’d be better off using them as a helpful guide for yourself during your data profiling and exploration.

In the next Lesson, you’ll be learning how to create two more types of statistical visualizations: scatterplots and bubble charts. But before that, let’s put what you’ve learned into practice by creating a histogram and box plot for your Cliqz project!

Exercise

Estimated Time to Complete: 1-3 Hours

Let’s create some new visualizations for your Cliqz project: specifically, statistical visualizations that examine the length of queries. Do people type long or short queries? Does this vary by country? These are some of the questions you can try to answer using your visualizations.

Directions

Hint
For this task, you’ll need to use the “create” option. Simply click the query variable and create a new variable called “queryLength”.

1. Create a histogram of query lengths.
- The histogram should show the lengths of queries.
- Choose a bin size that makes the distribution easy to interpret.
- Examine the distribution by adding country as color. (If there are too many countries, pick the top five.)
- Do you see any valuable information?
- Is there any query length or country dominating the distribution?
2. Create a box and whisker plot of this same information.
3. Update the visualizations using the style guide checklist you created in Lesson 2.
4. Explain what the box plot tells you that the histogram can’t.
5. Copy your final charts and checklist into a Word document.
6. Include your written answers to steps 1 and 4 in the Word document together with your charts.
7. Export your final Word document as a PDF and submit it on OneDrive for your mentor to review.
8. Publish your workbook to Tableau Public in order to save your progress and submit the link along with your PDF.
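If it helps to see what the histogram in the steps above is counting, here is a rough Python sketch of the same idea outside Tableau. The function name query_length_bins, the bin_size parameter, and the sample queries are all hypothetical; your Cliqz data and your chosen bin size will differ.

```python
from collections import Counter


def query_length_bins(queries, bin_size=5):
    """Count how many queries fall into each length bin.

    Each bin is labeled by its lower bound, so with bin_size=5 the
    key 5 covers lengths 5 through 9.
    """
    counts = Counter()
    for query in queries:
        length = len(query)  # the "queryLength" idea: characters per query
        bin_start = (length // bin_size) * bin_size
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Invented example queries, not taken from the Cliqz data set.
sample_queries = [
    "weather",
    "news",
    "cheap flights to berlin",
    "python",
    "how to bake bread",
]

print(query_length_bins(sample_queries, bin_size=5))
```

Try a few values for bin_size: very small bins make the counts noisy, while very large bins hide the shape of the distribution, which is the same trade-off you face when setting the bin size in Tableau.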

Bonus Task

Find a histogram online and explain what works well and what doesn’t work well in terms of how it communicates data. You can also use your visualization style guide to critique its visual presentation. Include your critique along with your submission for this task.

Submission Guidelines

Filename Format:

YourName_Lesson5_StatisticalVisualization.docx

When you’re ready, submit your completed exercise to the designated folder in OneDrive. Drop your mentor a note about submission.

Important: Please scan your files for viruses before uploading.

Submission & Resubmission Guidelines

Initial Submission Format: YourName_Lesson#_…
Resubmission Format:
- YourName_Lesson#_…_v2
- YourName_Lesson#_…_v3
Rubric Updates:
- Do not overwrite original evaluation entries
- Add updated responses in new “v2” or “v3” columns
- This allows mentors to track your improvement process

Evaluation Rubric

Criteria: Statistical Charts

Exceeds Expectation

- Everything in “Meets Expectation”
- The bonus task is done satisfactorily

Meets Expectation

- Histogram displays the frequency of query lengths with country as colors, and box plot includes plot for query length
- Charts have been created according to the visualization checklist
- Written answers accurately address the query length distribution, whether some lengths are more used, and what the box plot tells a reader that a histogram doesn’t
- Work has been published to Tableau Public

Needs Improvement

Histogram and box plot are included in the document, but one of the following is true:

- The histogram doesn’t display the frequency of query lengths with country as colors
- The box plot doesn’t include plots for query length distribution
- Charts haven’t been created according to the visualization checklist
- Written answers don’t accurately address the query length distribution, whether some lengths are more used, and what the box plot tells a reader that a histogram doesn’t
- Work hasn’t been published to Tableau Public

Incomplete / Off-Track

- Document is plagiarized or isn’t relevant to the task instructions; OR
- The wrong charts have been created; OR
- One of the charts is missing; OR
- Histogram and box plot are included in the document, but two or more of the following are true:

- The histogram doesn’t display the frequency of query lengths with country as colors
- The box plot doesn’t include plots for query length distribution
- Charts haven’t been created according to the visualization checklist
- Written answers don’t accurately address the query length distribution, whether some lengths are more used, and what the box plot tells a reader that a histogram doesn’t
- Work hasn’t been published to Tableau Public