Web Scraping and Text Analysis
7 | How to Import Text from Websites
Introduction
Wolfram Language differs from many traditional programming languages in that common tasks take less code, because high-level functions such as Import and WordCloud are built in. This lets the user be creative with minimal time investment, and no background in computer science is required to write code and explore new ideas. In this section, we’ll create word clouds and analyze text using keyword searches to discover themes.

Importing the Text from a Website

The first step in analyzing the text on a website is to import it into a Mathematica notebook. The following calculation applies the function Import to the front page of google.com and stores the page's text under the symbol name web1. A second calculation uses the function WordCloud to summarize the words that appear most often on the page:
In[]:=
web1 = Import["https://www.google.com"];
WordCloud[web1]
Out[]=
The following set of commands reformats the text and stores the result under the symbol name web2. Wolfram Language lets you nest functions by wrapping one around another, and the innermost function is evaluated first. First, the function TextWords splits the block of text into a list of individual words. Next, the function ToLowerCase converts each word to lowercase, so that words such as “The” and “the” are counted together. Finally, the function DeleteStopwords removes words like “the” or “and” that are not typically of interest in an analysis of text. Once the website text is in the form of a list of lowercase words, WordCloud shows the most commonly used words:
In[]:=
web2 = DeleteStopwords[ToLowerCase[TextWords[web1]]];
WordCloud[web2]
Out[]=
The same process can be repeated for a webpage containing much more text. Here, the word cloud highlights common themes in an article about Apple’s organizational structure:
In[]:=
web3 = Import["https://julius-giron.medium.com/how-apples-organizational-structure-bred-both-innovation-and-silos-f6939f765151"];
web4 = DeleteStopwords[ToLowerCase[TextWords[web3]]];
WordCloud[web4]
Out[]=

Summary

With a few built-in functions, Wolfram Language can give structure to a block of text for visualization and analysis. The next chapter will take advantage of this dataset for further textual analysis.