
Web Scraping and Sentiment Analysis

Sentiment Analysis and Keyword Search

Introduction

With the PDF text and the dictionary of keywords imported, we can start the analysis. Quantifying themes in PDF documents gives a more precise basis for comparing documents than reading them manually, which tends to give only a rough sense of overall trends.

Predicting Sentiment and Key Themes

First, since the imported list of keywords is all lowercase, we need to convert the text from the PDF to lowercase. The function ToLowerCase does just that. ToLowerCase works on a single string or on a list of strings; because the keyword comparison needs individual words rather than a continuous block of text, TextWords is used to split the full text into a list of words. As in the last chapter, the input ends with a semicolon so that the long list of words is not printed to the screen:

In[]:=

wordList=ToLowerCase[TextWords[fullText]];
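Both functions can be checked on a short string before running them on the whole document. This is a minimal sketch; sampleText and sampleWords are hypothetical names introduced for illustration and are not part of the course data:

```wolfram
(* sampleText is a hypothetical stand-in for fullText *)
sampleText = "Revenue Growth was STRONG. Costs increased.";

(* ToLowerCase applied to one continuous string *)
ToLowerCase[sampleText]

(* TextWords splits the string into words; ToLowerCase then lowercases each one *)
sampleWords = ToLowerCase[TextWords[sampleText]]
```

The first input ends with a semicolon, so only the last two inputs print a result.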

Wolfram Language contains thousands of built-in functions or commands for a wide variety of calculations (spanning much more than this course covers). It’s also possible for us to make our own user-defined functions. This can be done as a matter of convenience to give us helper functions stored with a symbol name that can be used in other calculations to save typing. Practically speaking, this also makes code easier to understand, with fewer functions wrapped around other functions.

The underscore and the colon-equals are the required syntax for a user-defined function: the underscore in w_ makes w a pattern that stands for any argument, and := denotes delayed assignment, so the right-hand side is evaluated each time the function is called. A definition on its own prints no result to the screen. Output appears once we apply the user-defined functions to our data:

In[]:=

posTest[w_]:=MemberQ[posWords,w]
negTest[w_]:=MemberQ[negWords,w]
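To see what such a test function returns, we can try the same construction on a tiny dictionary. In this sketch, samplePosWords and samplePosTest are hypothetical names chosen so that the real posWords and posTest are left untouched:

```wolfram
(* samplePosWords is a hypothetical three-word dictionary *)
samplePosWords = {"growth", "strong", "increase"};

(* w_ matches any single argument; := evaluates the right side on each call *)
samplePosTest[w_] := MemberQ[samplePosWords, w]

samplePosTest["growth"]                (* True *)
samplePosTest["decline"]               (* False *)
samplePosTest /@ {"strong", "costs"}   (* {True, False} *)
```

Because the test is an exact string match, "Growth" with a capital G would return False, which is why the document text is converted to lowercase first.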

Like Table, the Select function works through a list element by element; it keeps the elements for which a test function returns True. In this case, the test is whether a particular word from the PDF document is a member of the external dictionary of positive or negative words. The positive and negative words are tested separately, and each call returns the list of document words that also appear in the corresponding dictionary:

In[]:=

Select[wordList,posTest]

Out[]=

{increase,milestones,surpassing,growth,growth,growth,strong,growth,strength,increased,increased,increased,increased,increased,grew,increased,increased,increased,increase,achievement,achievement,growth,strong,strength,growth,momentum,strength,growth,convenience,engagement,increase,increased,increased,increased,increased,increase,increase}

In[]:=

Select[wordList,negTest]

Out[]=

{temporarily,temporarily,inflation,exclusions,changes,changes,decrease,changes,decrease}
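Both lists contain many repeated words, so a per-keyword breakdown can show which terms account for most of the matches. One possible sketch uses the built-in functions Counts and ReverseSort; posHits and negHits are hypothetical variable names:

```wolfram
(* keep the matched words so they can be reused *)
posHits = Select[wordList, posTest];
negHits = Select[wordList, negTest];

(* Counts returns an association of word -> number of occurrences;
   ReverseSort orders an association by its values, largest first *)
ReverseSort[Counts[posHits]]
ReverseSort[Counts[negHits]]
```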

For our final results, we want to count the number of words in each category, which we can do with the Length function:

In[]:=

posWordsLength=Length[Select[wordList,posTest]]
negWordsLength=Length[Select[wordList,negTest]]

Out[]=

Besides keyword-based themes, we can predict sentiment directly with the Wolfram Language function Classify. Classify can be used for many types of machine learning calculations, so the first argument is "Sentiment" to specify sentiment analysis. The first calculation below takes the full list of words and returns Positive, Negative, Neutral or Indeterminate for each word. The Tally function counts how often each tag occurs. Within the company report, there are 1,464 words tagged as Positive, 532 tagged as Neutral, 471 tagged as Negative and 19 tagged as Indeterminate:

In[]:=

sentiment=Tally[Classify["Sentiment",wordList]]

Out[]=

{{Neutral,532},{Positive,1464},{Negative,471},{Indeterminate,19}}
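Classifying single words ignores the context around them. As a variation on the calculation above, and not the method used in this course, the same classifier can be applied to whole sentences. This sketch assumes fullText still holds the imported PDF text; sentenceList and sentenceSentiment are hypothetical names, and the counts depend on the document and on the classifier version, so no output is shown:

```wolfram
(* split the document into sentences instead of words *)
sentenceList = TextSentences[fullText];

(* one predicted tag per sentence, then count each tag *)
sentenceSentiment = Tally[Classify["Sentiment", sentenceList]]

(* the "Probabilities" property shows how confident the classifier is
   for a single input, here the first sentence *)
Classify["Sentiment", First[sentenceList], "Probabilities"]
```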

The function ReverseSortBy sorts the list in descending order by the Last element of each pair, which in this case is the number of words assigned to each predicted sentiment:

In[]:=

mL=ReverseSortBy[sentiment,Last]

Out[]=

{{Positive,1464},{Neutral,532},{Negative,471},{Indeterminate,19}}
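Raw counts are hard to compare between documents of different lengths, so it can help to turn them into shares of the total. A minimal sketch, where total and shares are hypothetical names:

```wolfram
(* sum the second element of every {tag, count} pair *)
total = Total[mL[[All, 2]]];

(* replace each count with its fraction of the total *)
shares = {#[[1]], N[#[[2]]/total]} & /@ mL
```

For the tallies above, the total is 2,486 words, so Positive accounts for about 59% of them.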

Summary

Many users of Wolfram Language keep a set of their own favorite user-defined functions, which further speeds up the process of analyzing data. This is especially useful if data arrives daily or weekly in the same format: the very same user-defined functions can be used to quickly analyze the data for the new time frame.
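As an illustration of that idea, the keyword steps from this section could be collected into one reusable function. This is only a sketch: keywordCounts is a hypothetical name rather than a built-in function, and it assumes the keyword lists are already lowercase:

```wolfram
(* keywordCounts is a hypothetical helper, not a built-in function *)
keywordCounts[text_String, pos_List, neg_List] :=
 Module[{words},
  (* lowercase word list, as in the wordList step *)
  words = ToLowerCase[TextWords[text]];
  (* count the words found in each keyword list *)
  <|"Positive" -> Length[Select[words, MemberQ[pos, #] &]],
    "Negative" -> Length[Select[words, MemberQ[neg, #] &]]|>]

(* reuse the same definition on each new report *)
keywordCounts[fullText, posWords, negWords]
```

Passing the keyword lists as arguments, instead of reading them from global symbols, keeps the function usable with a different dictionary.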
