Web Scraping and Text Analysis

How to Tag Words in Text

Introduction

With the structured list created in the last chapter, we can search the dataset for keywords to find themes in the text. Keyword search can be a more targeted way to identify themes than sentiment analysis.

Identifying Specific Types of Words

Wolfram Language has built-in functions such as TextCases that can identify specific types of words, like verbs or nouns, in a block of text. The function is applied here to a block of text rather than a structured list. The output is a list of the words that match the requested type:

In[]:=

paragraph="We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.";

In[]:=

findverbs=TextCases[paragraph,"Verb"]
findnouns=TextCases[paragraph,"Noun"]

Out[]=

{hold,be,are,created,are,are}

Out[]=

{truths,men,Rights,pursuit,Happiness}
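
When several word types are needed at once, TextCases also accepts a list of types and returns the results keyed by type, and Counts tallies repeated words. The following is a minimal sketch that reuses paragraph from above; the names wordTypes and verbCounts are illustrative and not part of the original notebook:

In[]:=

(* request both types in a single call; the result is an association keyed by type *)
wordTypes=TextCases[paragraph,{"Noun","Verb"}]
(* tally how often each verb occurs *)
verbCounts=Counts[TextCases[paragraph,"Verb"]]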

In the following calculation, StringCases is used instead of TextCases to find instances of specific words in a block of text. In this case, there are four occurrences of either “truths” or “that” in the original sentence. The function Length counts the elements in the resulting list, which is useful for counting results from larger datasets:

In[]:=

findWords=StringCases[paragraph,{"truths","that"}]
Length[findWords]

Out[]=

{truths,that,that,that}

Out[]=

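Because StringCases matches substrings, a term such as “that” would also be found inside a longer word. The sketch below, which assumes paragraph from above, restricts matches to whole words with WordBoundary, ignores capitalization with IgnoreCase, and uses StringCount to get the total without building the list first. The name wholeWords is illustrative. For this particular sentence, both expressions agree with the four matches found above:

In[]:=

(* match the search terms only when they stand alone as whole words *)
wholeWords=StringCases[paragraph,WordBoundary~~("truths"|"that")~~WordBoundary,IgnoreCase->True]
(* count occurrences of any of the terms directly *)
StringCount[paragraph,{"truths","that"}]
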
Earlier in the chapter, similar commands were used to identify profit-focused words, people-focused words or words related to numbers. The following calculation takes web3, stored in the last chapter, which is a much larger list of words, and runs the same type of analysis:

In[]:=

findPeople=Length[StringCases[web3,{"people","consumer","client","customer","user"}]]

Out[]=

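To compare several themes in one pass, the keyword lists can be stored in an association and counted with StringCount. This is a sketch only: sampleText is a made-up string standing in for web3, and the theme names and the profit keywords are illustrative assumptions rather than the lists used earlier:

In[]:=

(* hypothetical sample text, not the scraped data *)
sampleText="Our customers and clients come first. Every user matters, and people drive profit.";
(* each theme maps to its list of keywords *)
themes=<|"people"->{"people","consumer","client","customer","user"},"profit"->{"profit","revenue","margin"}|>;
(* count case-insensitive keyword matches for each theme *)
themeCounts=Map[StringCount[sampleText,#,IgnoreCase->True]&,themes]
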
In addition to nouns and verbs, TextCases can identify numeric values in a block of text. Other options include email addresses, dates or currency amounts:

In[]:=

TextCases[web3,"Number"]

Out[]=

{4,22,2022,1,one,1997,2,1997,1,2,3,4,one,1,2022,15,2,2022,10,2018,26,3,2021,16,4,5,132}
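
The other content types follow the same pattern. A brief sketch, using a made-up string (sampleNote is hypothetical, not data from the scraped site) and the content type names "Date", "EmailAddress" and "CurrencyAmount":

In[]:=

(* hypothetical sample text containing a date, an email address and a currency amount *)
sampleNote="Contact info@example.com before March 3, 2022 to claim the $25 discount.";
TextCases[sampleNote,"Date"]
TextCases[sampleNote,"EmailAddress"]
TextCases[sampleNote,"CurrencyAmount"]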

If the text is stored as a list of strings, as web4 is assumed to be here, StringRiffle first joins the elements into a single string, and Length then counts the numeric values that TextCases finds:

In[]:=

findNumber=Length[TextCases[StringRiffle[web4],"Number"]]

Out[]=

Summary

This is another case where a visual scan for keywords is possible but becomes impractical for larger bodies of text. This is especially true when you want to run this type of analysis on several websites and use the results to compare the websites against one another.
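
One way to set up such a comparison is to keep the text of each website in an association and apply the same count to every entry. In this sketch the site names and texts are invented placeholders; in practice each value would be the text scraped from a site:

In[]:=

(* placeholder texts standing in for scraped websites *)
siteTexts=<|"siteA"->"We help every customer and client succeed.","siteB"->"Revenue grew while users and customers stayed loyal."|>;
peopleWords={"people","consumer","client","customer","user"};
(* number of people-focused keyword matches per site *)
peopleCounts=Map[StringCount[#,peopleWords,IgnoreCase->True]&,siteTexts]
(* plot the counts side by side, labeled by site *)
BarChart[Values[peopleCounts],ChartLabels->Keys[peopleCounts]]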
