Lesson 8 - Textual Analysis

Estimated Read Time: 2 Hours

Learning Goals

In this lesson, you will learn to:

Discuss use cases for textual analysis and visualization
Visualize free text data

Welcome back to another venture into visualization vernacular! Previous Lessons in this Module have focused predominantly on quantitative data. With the exception of bar charts, most charts have used solely quantitative, or numerical, data. And for good reason! These charts will cover the majority of your work as a junior analyst. There will be times, however, when you’ll be asked to analyze and visualize qualitative data: data that deals with text rather than numbers.

Survey responses are a great example of the type of qualitative data you may encounter as an analyst. Or, take social media data, much of which will come in the form of “free text” written by users. The internet has introduced a whole new world of unstructured data, and it’ll take the proficiency and fortitude of a data analyst to bring sense to all that data. While a bit more complicated than analyzing quantitative data, analyzing this textual data can open direct doors into the minds of your users in a way that numbers can’t—and there are some pretty cool visualizations you can make to communicate this data, too!

Let’s get textual!

1. Textual Analysis

Textual analysis is, as you might guess, the analysis of textual data. Textual data can come from surveys, interviews, blog posts, social media posts, emails, online reviews, and all sorts of other qualitative data sources. This type of data is usually unstructured to offer users more flexibility in their responses. Twitter, for example, limits the number of characters its users can type but not the things its users can type. You aren’t forced to choose from a finite number of responses, nor are hashtags restricted to a few set categories. From an analytical perspective, this type of flexibility usually equates to unstructured data, which is harder to analyze.

Actual analysis of free text, or unstructured text, is usually completed by sociologists or other researchers. Because language is complex, analyzing it without some guidance in place is a complicated task. However, you can add some structure to this information with a few key techniques. While this won’t allow for anything overly complex, it will nonetheless allow you to draw some valuable insight as a first step toward a more detailed analysis.

Figure 1. Language is as complex as the people who speak it!

Textual analysis falls into two broad categories: frequency analysis and sentiment analysis. Frequency analysis is the simplest form of textual analysis and involves counting specific words and phrases to isolate broad trends. While this type of analysis can be done in table format, it can also be visualized in the form of word clouds.
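If you’re curious what frequency analysis looks like outside of a table or chart, here is a minimal Python sketch. The survey responses are invented for illustration, and the simple letters-only tokenizer is an assumption rather than a standard; real projects usually need more careful text cleaning.

```python
import re
from collections import Counter

# Hypothetical free-text survey responses, invented for illustration.
responses = [
    "The app is fast and easy to use",
    "Easy setup, but the app crashes",
    "Fast support and easy returns",
]

def word_frequencies(texts):
    """Count how often each lowercase word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters and apostrophes; punctuation and digits are dropped.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

frequencies = word_frequencies(responses)

# Show the three most frequent words with their counts.
for word, count in frequencies.most_common(3):
    print(word, count)
```

The resulting counts are exactly the kind of frequency table that a word cloud visualizes.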

The second main category of textual analysis is sentiment analysis. Sentiment analysis involves categorizing text into negative, neutral, and positive groupings. This type of analysis is a bit more complicated given language’s nuances and complexities, so you’ll only approach this briefly towards the end of this Lesson.

2. Frequency Analysis for Textual Data

Imagine that data on student enrolments is collected at the state level. You may have looked at a frequency table of states like this:

State	Frequency
Alambama	2
Alabama	1411
Alaska	1398
Arizona	1457

Figure 2

Even though “State” is a textual field, it has a limited number of values, because there’s a limited number of states (50). The table also contains a few misspellings (for instance, “Alambama”), which will push the number of distinct state values above the expected count.
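To see how a misspelling inflates the number of distinct values, consider this small Python sketch. The counts are copied from the table in Figure 2; the correction map is a hypothetical, hand-written lookup, not something the data set provides.

```python
# Counts copied from the frequency table in Figure 2.
state_counts = {"Alambama": 2, "Alabama": 1411, "Alaska": 1398, "Arizona": 1457}

# Hypothetical hand-written map from known misspellings to correct names.
corrections = {"Alambama": "Alabama"}

cleaned = {}
for state, count in state_counts.items():
    # Use the corrected name when one exists, otherwise keep the original.
    name = corrections.get(state, state)
    cleaned[name] = cleaned.get(name, 0) + count

print(len(state_counts))   # distinct values before cleaning
print(len(cleaned))        # distinct values after cleaning
print(cleaned["Alabama"])  # misspelled rows are now counted with Alabama
```

A lookup like this only works when the set of valid values is small and known, which is exactly what makes a field structured.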

You looked at frequencies when creating histograms in Lesson 5: Statistical Visualizations: Histograms & Box Plots. You may have looked at something similar to the following table, which shows how many states fall into each structured “Retired Population” category:

Retired Population	Frequency
0 to 1000	2
1001 to 2000	5
2001 to 3000	10
3001 to 4000	25

Figure 3

In this Lesson, you’ll once again be looking at frequencies, only this time in the context of qualitative data, which lacks that same structure. Rather than a state variable, you may have records with a general location variable with the specificity decided by the user. They can enter an exact address, a country, a region, a city, or even an intersection to designate their location instead. You have no control over what type of location will be entered, thus turning the location variable into unstructured, textual data.

Trying to display these results in a frequency table wouldn’t be very efficient or interesting. There would simply be too many possibilities, to the point where no single answer might ever be repeated. In these scenarios, it’s more common to use a word cloud.

3. Word Clouds

Word clouds provide a way to visualize frequency tables. At this point, the only visualization you’ve learned about that supports textual data (unstructured or not) is a bar chart. However, bar charts usually use structured data—textual data with a limited number of possibilities (or categories). This is because pure unstructured text is usually too varied for a bar chart to effectively display.

You’ve probably seen word clouds before without realizing it. Check out the example below, which is a word cloud produced from a Virginia politician’s social media feed. The font size of the word signifies its frequency (larger words have higher counts):

Figure 4. It’s no surprise that one of the most frequent words on this Virginia politician’s social media feed is “Virginia.” (Source: Blue Virginia)
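The link between frequency and font size can be sketched in a few lines of Python. This assumes a simple linear scale between a smallest and largest point size; actual word cloud tools may scale differently, and the counts and size limits here are invented for illustration.

```python
def font_sizes(counts, min_pt=10, max_pt=48):
    """Map each word's count to a font size on a linear scale (an assumption)."""
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for word, n in counts.items():
        # If every count is equal, fall back to the smallest size.
        share = (n - low) / span if span else 0.0
        sizes[word] = round(min_pt + share * (max_pt - min_pt))
    return sizes

# Hypothetical word counts.
counts = {"virginia": 40, "coastal": 25, "families": 10}
sizes = font_sizes(counts)
print(sizes)
```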

An important thing to remember about word clouds is that they work well for single words but not phrases. In the cloud above, for example, the most common text might actually be “Coastal Virginia.” Here, however, “coastal” and “Virginia” are treated as two different words. This is significant because language is rarely a series of single words, and treating each word as a discrete object can lead to actual messages being missed. Take the phrase “not happy,” for instance. A word cloud of individual words would treat “not” and “happy” as two separate words, leading viewers to assume users were “happy” about something, when, in fact, it’s the exact opposite.
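The “not happy” problem is easy to reproduce. The following Python sketch counts single words and two-word phrases (bigrams) side by side; the comments are invented for illustration, and the tokenizer is a deliberately simple assumption.

```python
import re
from collections import Counter

# Hypothetical feedback comments, invented for illustration.
comments = [
    "I am not happy with the delivery",
    "Not happy about the price",
    "Happy with the support team",
]

def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())

word_counts = Counter()
bigram_counts = Counter()
for comment in comments:
    words = tokenize(comment)
    word_counts.update(words)
    # zip pairs each word with the word that follows it.
    bigram_counts.update(zip(words, words[1:]))

print(word_counts["happy"])              # every mention of "happy"
print(bigram_counts[("not", "happy")])   # mentions that were actually "not happy"
```

A single-word cloud would show only the first number, hiding the fact that most of those mentions were negative.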

Another weakness of word clouds is that they often display words that aren’t particularly useful; for instance, adjectives and adverbs. Is the word “all” in the word cloud in Figure 4 important? Assuredly not, but there it is, anyway. Likewise, similar words (constituent, constituents) are counted separately—not exactly useful when it comes to interpreting the data.

To improve your word clouds, you can perform some data cleaning. This could include things such as combining singular and plural forms of words, looking for common phrases instead of single words, and retaining only nouns. However, these techniques are time consuming and somewhat dependent on the purpose of your analysis. Textual analysis is difficult. As a junior analyst, you won’t likely move beyond broad overviews like in the word cloud in Figure 4. For instance, you could see that this politician focuses on Virginians, families, and coastal areas, but it would take a deeper dive into the data to determine how they do so. In this way, word clouds can help when it comes to defining large trends (the “what”), but deeper analysis is required to answer the “why” and “how.”
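As a rough idea of what this cleaning can look like, here is a minimal Python sketch. The word counts, the short stop-word list, and the singular/plural map are all hypothetical and hand-written; real cleaning usually relies on much longer lists or dedicated language libraries.

```python
from collections import Counter

# Hypothetical raw word counts, loosely based on the words discussed above.
raw_counts = Counter({
    "virginia": 40, "constituent": 6, "constituents": 9,
    "all": 25, "families": 12, "the": 60,
})

# A tiny hand-picked list of words to ignore; real lists are much longer.
STOP_WORDS = {"all", "the", "and", "of", "to"}

# Hand-written map that folds plural forms into their singular form.
MERGE = {"constituents": "constituent"}

cleaned = Counter()
for word, count in raw_counts.items():
    if word in STOP_WORDS:
        continue  # drop words that carry little meaning
    cleaned[MERGE.get(word, word)] += count

print(cleaned.most_common())
```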

Another variation of the word cloud uses bubbles of different sizes to indicate frequency, similar to the bubble chart you created in Lesson 6: Statistical Visualizations: Scatterplots & Bubble Charts. Here, the size of the text and the size of the bubble are used to indicate the frequency of the word, making it even easier to determine which words have the highest frequencies. The following example (Figure 5) shows the frequency of words spoken by politicians in the two main political parties in the United States:

Figure 5. Not only are the issues discussed between the two parties different, but the way they discuss those issues is different, too. (Source: New York Times)

This version of textual visualization, referred to as a “packed” bubble chart, tends to be used in more technical arenas, when the audience will want to perform comparisons. In the word cloud in Figure 4, for instance, it’s easy to see that “coastal” and “Virginia” are both large, but are they equally large? And how much larger are they than the other words? It’s impossible to tell exactly.

With the added dimension of the circle size, however, these differences are easier to distinguish. For instance, in the packed bubble chart in Figure 5, you can tell that Democrats use the word “change” more than they do “energy”—and that they use the word “change” more than Republicans, for that matter. Still, the circles require more space, meaning there’s a stricter limit on the number of words you can display. There are considerably fewer words in Figure 5 than in Figure 4, for example.

5. Creating Word Clouds in Tableau

Word clouds, particularly their simpler version, have become so common in recent years that many online tools will generate them for free. All you have to do is present a list of words, and these tools will parse out the frequency and create word clouds of various colors and shapes. Some tools will create word clouds from your own social media accounts, like Facebook and Twitter. And even Google Docs has a “word cloud generator” add-on, which you can use to generate word clouds from the text in Google documents.

As an analyst, however, you’ve got analytical tools on your side, which means it’s back to your good friend Tableau. Let’s run through how to create a word cloud in Tableau. To follow along, download the Naukri job postings data set (CSV).

This data set comes from a 2019 scrape of the Naukri website, which is India’s most popular platform for posting job ads. Each entry represents a different job opening, while the columns contain details of the respective posting such as title, salary, required experience, key skills, role category, and more. However, due to the size of the original data set, the one linked here is a shortened version (if it weren’t, you’d risk running into a memory shortage in Tableau due to the number of data points).

Download the data and, like always, start by taking a look at the file. To view it in Excel, open a new, blank Excel file, head to your Data tab, then select the From Text option:

Figure 6. Note that this interface might look slightly different from your own depending on your OS and version of Excel.

Select the “marketing_sample_naukri.csv” file you just downloaded, and a modal will appear asking you how you want to import your file. Select Comma from the “Delimiter” menu, then hit the Load button:

Figure 7

Selecting Comma tells Excel to split each row into columns wherever a comma appears, since CSV files use commas as column separators. Once you’ve hit Load, the data will be imported into Excel in table format:

Figure 8

You can choose whether to keep the formatting of the table or to clear all the formats and go with the standard non-coloured view. To clear them, go to the Home tab, select the entire table, and click the eraser button in the Editing group:

Figure 9

Upon investigating the data in Excel, you can see that much of the information was entered as free text (as opposed to structured, dropdown options from a questionnaire). Under “job title,” for instance, there are five variations of the same title (“Assistant Manager”, “Assistant Manager – Credit Operations”, “Assistant Manager Accounts”, “Assistant Manager Commercial”, “Assistant Manager Finance”). The same is true for the column “Key skills,” which has entries containing details for different aspects of the positions.

Figure 10. Assistant managers are indeed en vogue!

If you wanted to use this data for a more complete, complex analysis, you’d need to spend some time cleaning it up. However, as you’ll only be analyzing one variable with your word cloud (and doing so in a broad sense), you don’t need to worry about these dirtier aspects of the set.

By now, you should be quite familiar with the initial steps for loading a file in Tableau. Open Tableau now and connect to your CSV file using Text File as your connection (as opposed to Microsoft Excel). Because CSV files can only have one sheet, Tableau will automatically pick the correct tab, so all you need to do is navigate to Sheet 1.

From your recent glance at the file in Excel, you determined that Job Title was an unstructured variable with many similar entries. Using this variable, you want to see what the most common job titles are and whether there are any similar titles that should be combined. In Excel, you could create a pivot table showing the frequencies of each value. In Tableau, you can do the same thing with a frequency table.

Start by renaming your sheet to “Frequency Table.”
Then, drag the Job Title variable from the Dimensions list to the Rows shelf.
Next, you want to tell Tableau to count the number of records for each value within the Job Title variable. To do so, drag the auto-generated Sheet1 (Count) variable from the Measures list to the Text box on your Marks card. This will generate a frequency table of all the values within the Job Title variable. (If you don’t see the Sheet1 (Count) variable, you might be using an old version of Tableau!)
Finally, click on the down arrow next to the Job Title variable (on the Rows shelf) and select Sort from the dropdown menu. On the modal that pops up, select Sort By Field, Sort Order: Descending, and Field Name: Sheet 1. This will order your table according to the job titles with the highest frequency:

Figure 11

You now have a frequency table! Though it’s quite the long frequency table—just look at how many different values there are! As mentioned earlier, data sets with a large number of values aren’t the best candidates for frequency tables for this exact reason. You can’t even see all the values in the table, let alone try to interpret anything from it! Creating a frequency visualization in the form of a word cloud would give you a better overview of the data.

If your Tableau has updated to version 2020.2 or higher, the Number of Records field will no longer appear in the layout. Please check out this article for a way to work around that. Instead, you may see a field such as Sheet1 (Count) in the Measures list, as in the example here.

Go ahead and create a new sheet called “Wordcloud.” This time, drag the Job Title variable from the Dimensions list to the Text box on your Marks card. Your screen should instantly be filled with an amassment of job titles:

Figure 12. Look at that wall of text!

Now, drag that same Job Title variable from the Dimensions list to the Size box on your Marks card. This will tell Tableau to change the size of each job title to match its frequency. However, as you’ll quickly notice, nothing changes! This is because Tableau is currently displaying every unique job title. This means that the frequency for each job title is the same: 1. Even if a title exists more than once in the data, Tableau is only showing you that one unique instance. You can more easily notice this if you click the down arrow next to the Job Title variable, select Sort, and tell Tableau to sort the text in Descending, Alphabetic order. Your list of job titles should update to look something like the following:

Figure 13

Having a plain text list of all the unique job titles in this data set isn’t going to do you much good. What you want is for Tableau to count how many values there are for each job title. Only then will it be able to change their sizes accordingly.

To do this, you need to adjust the aggregation of the Size version of your Job Title variable in the Marks card (the one with the double-circle icon next to it). Click the down arrow to the right of the variable name, go to Measure, then select the Count aggregation:

Figure 14

Tip!
Not seeing a word cloud? If your visualization looks like a series of rectangles, similar to the treemaps you made previously, change your marks to Text via the dropdown menu:

Figure 15

Immediately, you should notice some changes in your chart. Some of the job titles have gotten quite large! Perfect!

Figure 16. Software Developer is the clear winner in this word cloud.

A few job titles likely stand out: Accounts Executive, Android Developer, senior software engineer, and Accountant. Notice that these are phrases, not just words. This is because Tableau treats each row (not each word) in the CSV file as a different entry.
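The difference between counting rows and counting words can be sketched in Python. The job titles below are a hypothetical handful of values, not the actual contents of the Naukri file, and this only illustrates the idea rather than how Tableau works internally.

```python
import re
from collections import Counter

# Hypothetical values from a job-title column.
job_titles = [
    "Android Developer", "Accounts Executive", "Android Developer",
    "Software Developer", "Accountant",
]

# Row-level counting: each whole cell is treated as one entry.
title_counts = Counter(job_titles)

# Word-level counting: each cell is split into words first.
word_counts = Counter(
    word for title in job_titles for word in re.findall(r"[A-Za-z]+", title)
)

print(title_counts["Android Developer"])  # counts the full phrase
print(word_counts["Developer"])           # counts the word across all titles
```

Row-level counting keeps phrases intact, while word-level counting would reveal that “Developer” appears across several different titles.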

Let’s use color to add another dimension to your word cloud. It would be interesting to see which industry dominates the job market. To do so, drag the Industry variable from the Dimensions list to the Color box on your Marks card and opt for the “Add all members” option in the prompt that appears:

Figure 17

You’ve got some color now! Still, this isn’t a very informative data visualization, and what’s more, the word size is all distorted now. There are way too many entries that all have the same frequency (one), making it difficult to read. How can you make this word cloud more useful?

In many word clouds, some sort of threshold is put in place, ensuring that only entries over a certain value are shown. For instance, in the packed bubble chart back in Figure 5, only words spoken more than 25,000 times were shown. Filtering out low-frequency values makes your data more interpretable. In Tableau, you can set up this kind of filter with a Calculated Field. Open up your Analysis menu and select Create Calculated Field now:

Figure 18

A menu will appear, allowing you to name your new variable (call it “Low Frequency”) and write the logic that will populate it. This logic will be similar to the formulas you wrote back in Excel. Your filter will ideally remove any job titles with low frequency, so start typing “count” in the input field until you’re able to select the COUNT() function.

Figure 19

Tableau Calculation Logic
As you become more familiar with Tableau, you’ll begin to learn what functions it has available and what they do, similar to the functions in Excel. While you’re still learning, however, you can use Tableau’s guidance. Click the arrow to the right of the calculated field menu and a new screen will appear, letting you search for and learn more about functions. The image below shows that the COUNT() function takes one argument and returns the number, or frequency:

Figure 20. Now, you can see exactly what the COUNT() function does, along with an example.

Start writing COUNT([Job Title]) (or [job_title] if this is how it appears in your data) in the input field. Tableau will automatically bring up a list of the available variables and change the colors. Blue signifies a function. Orange signifies a variable in the data set. Tableau also adds square brackets around the variable name to further signify it as a variable in the data:

Figure 21

With the variable name added, click OK to create your new calculation variable. It will now be listed as a new variable in your Measures list with an equals sign (=) next to the data type symbol (the # sign in this example). This signifies it is a calculated field:

Figure 22

To use this new variable in your visualization, drag the variable name from the Measures list to the Filters card to bring up the Filter menu. You can use this menu to filter out all the job titles with a frequency of one by selecting At Least and setting it to 2 before hitting OK:

Figure 23

Your word cloud should look considerably more readable now—and more like the word clouds you’re used to, as well:

Figure 24

If your word cloud were still difficult to interpret, you could restrict the entries even more, for instance, by only including job titles with a frequency greater than 2 or 3.
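The same threshold idea can be sketched in Python. The job titles are hypothetical, and the at_least helper is an illustrative name chosen to mirror the At Least option in the Filter menu; it is not part of any library.

```python
from collections import Counter

# Hypothetical job-title column with a few repeated values.
job_titles = (
    ["Android Developer"] * 3
    + ["Accounts Executive"] * 2
    + ["Accountant", "Dot Net Developer"]
)
title_counts = Counter(job_titles)

def at_least(counts, minimum):
    """Keep only the entries whose count is at least `minimum`."""
    return {title: n for title, n in counts.items() if n >= minimum}

frequent = at_least(title_counts, 2)
print(frequent)
```

Raising the minimum shrinks the result further, which is the trade-off you tune when a cloud is too crowded.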

As you can see, word clouds aren’t very precise. However, they’re good at showing broad trends. In this example, while you can’t tell exactly how many values there are for each job title, you can tell that “Android Developer,” “Dot Net Developer,” and “Accounts Executive” are the clear winners. It’s also easy to see that the IT industry dominates the market in India, as the majority of the job titles are shown in red.

As a general rule, word clouds aren’t a good choice when decisions are being made using the data—they’re simply not precise enough. For instance, notice the red “BUSINESS DEVELOPMENT MANAGER” on the left side of the cloud? Is it the same size as the red “Software Developer” just above it? There’s simply no way to tell. For this reason, word clouds are best suited for first steps in the analytical process. They can be used to identify larger, overarching patterns and elicit further questions that require more-detailed analysis.

6. Creating Packed Bubble Charts in Tableau

Changing word clouds to bubble charts helps a bit with this problem of precision. These “packed” bubble charts aren’t the same as the bubble charts you created in Lesson 6. There, bubble charts were used to demonstrate frequency and correlation. Here, the size of the bubbles only refers to frequency—the same as the size of the text—and it doesn’t matter how they’re arranged.

To make your own in Tableau, duplicate your word cloud sheet, then rename your new sheet “Packed Bubble Chart.” In the Show Me menu, choose the Packed Bubbles option in the bottom right:

Figure 25. The packed bubbles option in the Show Me menu looks like a colorful clan of bubbles.

Your chart should transform into a bubble chart! Already, it’s a bit easier to more precisely compare the bubbles, but a way you can push this further is by adding the frequencies to each bubble as labels. Hold down the Command key (for Mac) or Control key (for PC) while dragging the CNT(Job Title) variable from the Marks card to the Label box to copy the variable (making it so the same variable is being represented by both size and labels):

Figure 26. Your chart should transform into what looks like an intense game of marbles.

Because of Tableau’s default settings, not every circle will include labels (Tableau doesn’t automatically show overlapping labels). In general, some circles will only show job titles, some will only show counts, and some won’t show anything at all. In fact, depending on your monitor and window sizes, the labels on your visualization probably look different from the labels in Figure 26! It may not show any labels at all if there are fewer bubbles.

Because of this, it’s still rather hard to interpret. There are just too many bubbles and not enough space. Let’s fix this by decreasing the number of bubbles to only show the top 10 most common job titles.

Start by removing the Low Frequency variable from the Filters card (right-click and select Remove). Then, drag the Job Title variable (the variable whose frequency is shown in the bubbles) from the Dimensions list to the Filters card. A menu will appear. Select Top from the tabs along the top of the modal, then under the By Field option, choose Top 10 by Job Title Count:

Figure 27

Hit OK, and your bubble chart will update. Now, only the top 10 job titles will be shown:

Figure 28. You’ve lost your marbles! But that’s ok.

These changes have made your visualization more precise. Notice how much easier it is to compare the bubbles when there are only ten categories?
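A top-N filter has a direct counterpart in Python’s Counter. This is a minimal sketch with hypothetical counts (and a top 2 rather than a top 10, to keep it short).

```python
from collections import Counter

# Hypothetical job-title frequencies.
title_counts = Counter({
    "Android Developer": 14,
    "Accounts Executive": 11,
    "Accountant": 9,
    "Software Developer": 7,
})

# most_common(n) returns the n highest counts, largest first.
top_titles = title_counts.most_common(2)
print(top_titles)
```

Unlike a minimum-frequency threshold, a top-N filter always returns the same number of entries, no matter how large or small the counts are.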

There’s still one small issue, though, that could cause some confusion: some bubbles show the industry as a label in addition to the job title, making it hard to distinguish whether the title or the industry is showing up. To clear up this confusion, you can remove the industry labels. Industry is already indicated with color anyway, making the labels redundant. To do so, simply remove the Industry text label variable from the Marks card. Your finished packed bubble chart should look something like this:

Figure 29. What a stack!

Bubble chart versions of word clouds allow for more-precise comparisons. After all, it’s easier to compare the sizes of bubbles than it is to compare the font sizes of different words. Many people, particularly non-analysts, enjoy using word clouds to quickly notice patterns in large amounts of text. After identifying these trends, they can come up with further questions that require additional, more detailed analysis.

7. Sentiment Analysis

Even more complicated than analyzing frequency within textual data is sentiment analysis, which involves analyzing the feelings behind textual data. These feelings are typically categorized into negative, neutral, and positive. When companies ask for feedback from their users, for example, they often ask for ratings on a five- or ten-star scale. They also allow users to leave unstructured textual feedback along with their rating. In order to get a better idea how their users are feeling about their product or service, companies can conduct sentiment analysis on this feedback, looking for key words and phrases that equate to negative, neutral, or positive feelings.

Figure 30. Two thumbs up. Way up!

Sentiment analysis is much more involved than the analysis conducted here. It most commonly occurs when monitoring a company’s brand, social media space, or customer service and is conducted by researchers or data scientists. Rather than learning the ins and outs of performing a full sentiment analysis here, you’ll explore it more broadly—what it is, why it’s difficult, and when it’s useful. If you want to know more, check out Sentiment Analysis: Concept, Analysis and Applications or do some further research on your own.

At its core, sentiment analysis is classifying text. While the most common classification categories are, as mentioned above, negative, neutral, and positive, language itself is contextual. There are no key words that can, without a doubt, signify positive or negative.

Let’s look at an example. The three sentences below all contain the word “like,” a seemingly positive sentiment:

I like that Tesla.
That looks like a Tesla.
I don’t like that Tesla.

While the first sentence is, indeed, positive towards Tesla, the second sentence is neutral, and the third sentence is actually negative (not like). You can quickly see how using the term “like” to assign a sentiment to these sentences would lead to inaccurate results.
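To make the problem concrete, here is a deliberately naive Python sketch that labels a sentence as positive whenever it contains the word “like.” The one-word keyword list and the naive_sentiment function are hypothetical, written only to show why this approach fails; this is not how real sentiment analysis tools work.

```python
import re

# A deliberately tiny keyword list, chosen to mirror the example above.
POSITIVE_WORDS = {"like"}

def naive_sentiment(sentence):
    """Label a sentence 'positive' if it contains any positive keyword."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return "positive" if words & POSITIVE_WORDS else "neutral"

sentences = [
    "I like that Tesla.",
    "That looks like a Tesla.",
    "I don't like that Tesla.",
]
labels = [naive_sentiment(s) for s in sentences]

for sentence, label in zip(sentences, labels):
    print(label, "-", sentence)
```

All three sentences receive the same label, even though only the first one is actually positive.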

In the same way, negative words can actually be an indicator of positive sentiment depending on the context. Take the two sentences below, for instance. While the second sentence uses the word “bad,” which is normally reserved for negative sentiment, the sentence as a whole indicates positive sentiment towards Starbucks:

That Starbucks is bad.
I’m craving Starbucks so bad.

As such, classifying text into sentiment requires more than just looking for a few key words or phrases. Many tools and complex algorithms and methods exist for text processing, but these are outside the scope of this course (and beyond your normal job expectations). All you need to know is that sentiment analysis exists and that it can be useful.

Summary

Textual data usually takes the form of unstructured, qualitative data. This data often comes from surveys, and many organizations have researchers who manage and analyze it. As an analyst, however, you can provide a few first steps for this analysis using a common technique—frequency counts. Text frequency can be visualized in the form of word clouds, where the size of the word corresponds to its frequency. These clouds can then be used to highlight major patterns. They aren’t, however, suited for more specific comparisons. One solution for this is to turn your cloud into a packed bubble chart.

This Lesson marks the end of your exploration of different chart types in Tableau. You now know how to make pie charts, bar charts, stacked bar charts, treemaps, line charts (complete with trend lines), histograms, box plots, scatterplots, bubble charts, point maps, heat maps, choropleth maps, graduated symbol maps, word clouds, and packed bubble charts. Whew! That’s a lot of visualizations! Which is why, in the next Lesson, you’re going to learn how to start bringing all of those visualizations together to tell a story—the story of your data!

Before that, though, let’s see if you can’t think of a way to integrate a word cloud into your Cliqz project!

Exercise

Estimated Time to Complete: 1-3 Hours

Directions

1. Create a word cloud for queries in your data set.
- The word cloud should use size to designate values with higher frequencies.
- Filter out any low-frequency values if appropriate.
2. Duplicate the chart in a new sheet and turn it into a packed bubble chart for the same data.
- Use color to add an additional data dimension to the chart.
3. Update both visualizations using the style guide you created in Lesson 2: Visual Design Basics & Tableau.
4. Explain what the bubble chart tells you that the word cloud can’t.
5. Include screenshots of your final charts, along with your answer to question 4, in a document.
6. Export your final Word document as a PDF and upload it on the drive for your mentor to review.
7. Publish your workbook to Tableau Public in order to save your progress and submit the link along with your PDF.

Bonus Task

Find a word cloud online and explain what works well and what doesn’t work well in terms of how it communicates data. You can also use your visualization style guide to critique its visual presentation. Include your critique along with your submission for this task.

Submission Guidelines

Filename Format:

YourName_Lesson8_TextualAnalysis.docx

When you’re ready, submit your completed exercise to the designated folder in OneDrive. Drop your mentor a note about submission.

Important: Please scan your files for viruses before uploading.

Submission & Resubmission Guidelines

Initial Submission Format: YourName_Lesson#_…
Resubmission Format:
- YourName_Lesson#_…_v2
- YourName_Lesson#_…_v3
Rubric Updates:
- Do not overwrite original evaluation entries
- Add updated responses in new “v2” or “v3” columns
- This allows mentors to track your improvement process

Evaluation Rubric

Criteria	Exceeds Expectation	Meets Expectation	Needs Improvement	Incomplete / Off-Track
Textual Analysis	Everything in “Meets Expectation,” and the bonus task is done satisfactorily.	Submission contains one word cloud and one packed bubble chart based on unstructured textual data, both visualizations have been created according to the visualization checklist, and written answer effectively discusses what the bubble chart tells a reader that the word cloud doesn’t.	Submission contains one word cloud and one packed bubble chart, but data used wasn’t appropriate, and written answer doesn’t effectively discuss what the bubble chart tells a reader that the word cloud doesn’t.	Submission is plagiarized or isn’t relevant to the task instructions, OR the wrong charts have been created.