Turning Webpages into Data with Wolfram Language
Things You’ll Learn
“What” and “Why” of Turning Webpages into Data
Sending HTTP Requests and Getting Text
Parsing Structured Responses
Scaling Up: Async Requests
Automating Web Browsers
Extracting Data
Extracting data from websites involves fetching a webpage’s content and parsing it to retrieve specific information, such as text, images or links. The process takes advantage of the fact that webpages typically have a consistent format. This consistency can be exploited to build automated tools that navigate a website’s structure and extract the desired information.
The main advantage of automated extraction is that it allows the collection of large amounts of data that would be impractical to collect manually. This information can be used on its own, or it can be combined with other programming methods, workflows and technologies.
Why Extract Data?
Typically, you would want to extract data from the web to do some further computation on it. While the functionality of Wolfram Language is extensive and easy to use, the main advantage it provides is that any data that you get is immediately available for computation with the rest of the functionality found in Wolfram Language. Once you get your data, it is in an environment where you can start doing image processing, machine learning, visualisation, LLM training and much more.
Throughout this webinar, we will be using the data we obtain with various visualisation techniques, dynamic elements, statistical analyses and AI integration, but the general point to keep in mind is that once your data is in Wolfram Language, you can immediately apply any of the 6000+ built-in functions and 3000+ community-contributed functions to it:
In[]:=
$VersionNumber->Length[EntityList["WolframLanguageSymbol"]]
Out[]=
14.3 -> 6692
Importing Webpages as Text
Aside: robots.txt
The robots.txt file is a standard used by websites to communicate with web crawlers about which parts of the site should not be accessed. It is typically located in the root directory of a website (e.g. https://example.com/robots.txt). This file contains directives that specify which user agents (crawlers) are allowed or disallowed from accessing certain sections of the site. Understanding and respecting the robots.txt file is crucial for responsible data extraction, as it helps ensure that extractors do not violate the site’s terms of service or overload the server by unintentionally crawling restricted areas. Adhering to these guidelines fosters responsible data collection practices and supports the site’s operational integrity.
Import
The most straightforward way of getting the contents of a webpage is to send an HTTP request to the server that is providing the webpage. Here, we will use a website that provides sample webpages specifically for training purposes. In this case, we will use their “e-commerce” subpage, which mimics a web store and is already filled in with some mock products. We can pass a URL to Import and specify "Text" as our desired output. With that, Import can handle the underlying HTTP requests and response parsing for us:
In[]:=
ecommerce=Import["https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops","Text"];
Once we have the string, we can start parsing it, just like any other string in Wolfram Language. With strings, the typical way we do that is by looking at the structure surrounding the information we are interested in and using StringExpression or RegularExpression to extract it. For example, if we are interested in the product descriptions, we can see that they are enclosed in <p class="description card-text">...</p>. With that, we can match all the products on the page with StringExpression:
In[]:=
StringCases[ecommerce,StringExpression["<p class=\"description card-text"~~Shortest[__]~~"</p>"]];
Or with RegularExpression:
In[]:=
Column[StringCases[ecommerce,RegularExpression["<p class=\"description card-text.*description\">(.*?)</p>"]->"$1"]];
String matching is a very powerful method of parsing string data, not only in web data but in many other data analysis workflows as well. If you are interested in learning more about StringExpression and RegularExpression, there is an excellent guide page: Working with String Patterns.
URLRead
The more manual way of sending requests is by using URLRead directly, which will preserve all the information about the response from the server, not just the text. Let’s test it with the same e-commerce webpage:
In[]:=
ecommerceurlread=URLRead["https://webscraper.io/test-sites/e-commerce/allinone"]
Out[]=
HTTPResponse
In[]:=
ecommerceurlread["Body"];
This is the same kind of text we got by using Import. As well as the body, the response normally also contains other information, such as the headers and status codes:
In[]:=
ecommerceurlread["Headers"];
This can give you more information about how the server handled your request, which can be very useful for debugging and troubleshooting. It can also be used for passing custom headers, such as user agents, which is a common practice when trying to bypass anti-automation restrictions.
Example: Import and String Cleaning
Collecting contact information from different webpages is a common workflow in producing mailing lists. Here, we will take some publicly available contact numbers from a few big businesses that have offices in the UK. First, we’ll collect their contact pages in a list of associations, where the key is the company name and the value is the webpage itself:
In[]:=
contactPages={<|"Apple"->"https://www.apple.com/uk/contact/"|>,<|"Barclays"->"https://www.barclays.co.uk/ways-to-bank/telephone-banking/"|>,<|"Monzo"->"https://monzo.com/help/"|>,<|"Evri"->"https://www.evri.com/contact-us"|>};
Then we will write a simple function which takes one association, opens the webpage (the value) and tries to find a phone number on the page. The regular expression below tries to capture common UK number formatting, and it can be read as “4 digits, followed by either a space or a dash, followed by 3 digits, followed by either a space or a dash, followed by 4 digits”:
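A minimal sketch of such a function; the helper name findPhoneNumber is illustrative, and the regular expression is just one way of encoding the pattern described above:
In[]:=
findPhoneNumber[assoc_Association]:=Module[{name,url,page,matches},
	{name,url}={First[Keys[assoc]],First[Values[assoc]]};
	page=Import[url,"Text"];
	(* 4 digits, space or dash, 3 digits, space or dash, 4 digits *)
	matches=StringCases[page,RegularExpression["\\d{4}[ -]\\d{3}[ -]\\d{4}"]];
	name->If[matches==={},Missing["NotFound"],First[matches]]]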
Finally, we’ll apply the function to our original dataset:
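A sketch, using the hypothetical findPhoneNumber helper from above:
In[]:=
phoneNumbers=findPhoneNumber/@contactPages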
Parsing Structured Responses
XML
XML is a markup language commonly used on websites to define their data structures. There is some degree of interoperability between HTML and XML, which means that you can typically import a webpage as an XMLObject and make use of Wolfram’s XML parsing tools to navigate its structure. However, XML is a stricter standard than HTML, and the HTML of a lot of modern webpages cannot be perfectly represented as XML. There are still places where XML is common and standard.
One common example of XML in webpages is sitemaps of news websites, which are intended to display the current contents of the website in an accessible way for automated tools (typically Google web scrapers). Let’s import the latest BBC sitemap as an XMLObject:
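A minimal sketch; the exact sitemap URL is an assumption and may change over time:
In[]:=
bbcSitemap=Import["https://www.bbc.co.uk/sitemaps/https-sitemap-uk-news-1.xml","XMLObject"];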
Now we can use pattern matching to extract the information that we are interested in. By inspecting the XML structure above, we can see that hyperlinks to the actual news articles are contained in the following structure:
And headlines are in this structure:
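A sketch of extracting both with Cases; depending on how namespaces are imported, a tag may appear as a {namespace, name} pair, so the patterns below allow for either form:
In[]:=
articleLinks=Cases[bbcSitemap,XMLElement["loc"|{_,"loc"},_,{url_String}]:>url,Infinity];
headlines=Cases[bbcSitemap,XMLElement["title"|{_,"title"},_,{t_String}]:>t,Infinity];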
Get a visual representation of the common themes:
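For example, a word cloud of the headline text with stopwords removed (a sketch using the headlines list from above):
In[]:=
WordCloud[DeleteStopwords[StringRiffle[headlines," "]]]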
And use some of the integrated LLM functionality with the headlines:
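A sketch using LLMSynthesize, which assumes an LLM service connection (e.g. an API key) has been configured:
In[]:=
LLMSynthesize["Summarise the main themes in these news headlines:\n"<>StringRiffle[headlines,"\n"]]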
Example: Scaling Up with Async Requests
Taking our BBC example further, we can expand it by getting the body of the articles and not just the headlines. First, let’s get all the hyperlinks from the BBC’s front page:
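A sketch using the built-in "Hyperlinks" import element; the front-page URL is an assumption:
In[]:=
frontPageLinks=Import["https://www.bbc.co.uk/news","Hyperlinks"];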
Then we can string match those that lead to articles:
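A sketch; the "/news/articles/" URL fragment is an assumption about how the article links are currently structured:
In[]:=
articleURLs=DeleteDuplicates[Select[frontPageLinks,StringContainsQ[#,"/news/articles/"]&]];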
Prepare an empty list that will hold all the article bodies:
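For example (the variable name articleBodies is illustrative):
In[]:=
articleBodies={};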
Now we have a few dozen articles. If we pass them to Import or URLRead, we will need to wait for them to be downloaded sequentially, which could take a long time. Wolfram Language is equipped with a lot of parallel computation functionality, which can greatly reduce the time needed for a list of independent computations. In the case of web fetching, parallel computation is often referred to as “asynchronous”, and the corresponding Wolfram Language function we can use for asynchronous web fetching is URLSubmit:
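A minimal sketch, assuming the articleURLs and articleBodies variables from above: each URL is submitted as an asynchronous task, a handler stores each response body as it arrives, and TaskWait blocks until every download has finished:
In[]:=
storeBody[assoc_]:=AppendTo[articleBodies,assoc["Body"]]; (* handler: save each response body *)
tasks=URLSubmit[#,
	HandlerFunctions-><|"BodyReceived"->storeBody|>,
	HandlerFunctionsKeys->{"Body"}]&/@articleURLs;
TaskWait[tasks]; (* block until all asynchronous downloads finish *)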
Check how many articles we have collected:
Produce a word cloud again to see at a glance what’s been talked about today:
Sort by the frequency of mentions:
Word cloud of the countries:
Use the article bodies to create a semantic search index:
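A sketch using CreateSemanticSearchIndex (available in recent Wolfram Language versions and requiring an embedding model or service to be set up):
In[]:=
newsIndex=CreateSemanticSearchIndex[articleBodies];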
Use the created semantic search index as a dynamic prompt for your LLM interactions:
Automated Browsers, Parsing HTML
Importing the text from webpages with Import or URLRead is often enough for simple, static webpages. However, as you start visiting more complex, dynamic webpages, you will find that the information you are looking for might not be contained in the server response, even though you see it when you open the webpage in a web browser. There are multiple reasons why this might happen, but typically the server responds differently based on who or what is requesting the information. In many cases, websites load content dynamically with JavaScript, which only runs in a web browser, so a plain HTTP request for the content is unlikely to give you the full webpage.
If the website only serves the information you need when it is requested by a web browser, the solution is to use an automated web browser. Wolfram Language has built-in functionality to start and manage web browser sessions programmatically.
Starting a Web Session
The first thing we need to do is start a web session. We will want to store that web session in a variable, since it is an object that we will be passing around every time we want to interact with it. We can start a web session in the following way:
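For example (this assumes a supported browser such as Chrome or Firefox is installed locally):
In[]:=
session=StartWebSession[]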
WebExecute
With the session now open and its object stored in a variable, we can start interacting with it programmatically. The main function we are going to be using is WebExecute, which takes the web session as the first argument and an operation to be performed on that session as the second argument. For example, if we want to go to a specific webpage in our session, we can do it with:
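For example, to open a page in the session (the URL here is just an illustration):
In[]:=
WebExecute[session,"OpenPage"->"https://www.wolfram.com"]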
There is a wide range of actions that WebExecute can take. Here, we will focus on locating and extracting elements. On a basic level, all webpages are a combination of HTML, CSS and script files. We can use the structure of each of those to locate the elements that we are interested in. Let’s start by looking at how to grab HTML elements that we are interested in with WebExecute:
Let’s navigate to a website with a slightly more complicated structure:
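Based on the quote/author structure described below, this looks like one of the “toscrape” training sites; the URL is an assumption:
In[]:=
WebExecute[session,"OpenPage"->"https://quotes.toscrape.com"]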
The easiest way to get an idea of the structure of the website, and therefore of the location of the elements you are interested in, is to use the web inspector option in your browser, which allows you to point at different parts of the current webpage you are on and inspect their underlying HTML structure. The current page we are on contains quotes located in <span> elements.
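A sketch of locating those elements and reading their text; here a CSS selector is used as the locator, though other locator specifications work too:
In[]:=
quoteElements=WebExecute[session,"LocateElements"->"CSSSelector"->"span"];
quotes=WebExecute[session,"ElementText"->quoteElements];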
Now we have all the quotes in a list. Let’s get the author names, which are located in <small> HTML elements:
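A sketch following the same approach:
In[]:=
authorElements=WebExecute[session,"LocateElements"->"CSSSelector"->"small"];
authors=WebExecute[session,"ElementText"->authorElements];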
Put everything together in an association:
And get a random quote from our list:
When you are done with your session, remember to delete it with DeleteObject to avoid having unused processes running in the background (especially easy to forget if you’re running the browser headlessly):
Example: Headless Browsers and String Cleaning
Sometimes, we want to transform the data we find online into a different format. In this example, we will take some tabular data from a Wikipedia page and present it as an interactive map. Specifically, we are interested in the blood type distribution in European countries.
Starting Up a Browser Session
Extracting the Relevant Elements (in This Case: The Second <table>)
Get the text of the second table (for this specific webpage, there is an empty element with the <table> tag that gets added to our list):
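A sketch, assuming the web session from the previous step is stored in session; which <table> element on the page holds the data is specific to this webpage, so the part index below is an assumption:
In[]:=
tableElements=WebExecute[session,"LocateElements"->"CSSSelector"->"table"];
tableText=WebExecute[session,"ElementText"->tableElements[[2]]];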
String Cleaning
Now, we do some data cleaning that is specific to the webpage. We are mixing StringExpression and RegularExpression instances to extract the country names from the table:
Then we create an association thread which associates the proportion of people with A blood type to the country name:
Finally, we take the names of the European countries and interpret them as entities:
This gives us the final association:
Final Plot
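A sketch of the final visualisation, assuming a hypothetical association bloodTypeByCountry of country entities to A-type proportions:
In[]:=
GeoRegionValuePlot[bloodTypeByCountry]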
Example: Headless Browsers, Manipulating Webpages
In many cases, the data that we are interested in exists in different places, making unified analysis difficult. In this example, we will look at the Association of Tennis Professionals (ATP), which provides a lot of data about the tennis players on tour. This data, however, is split between different pages (at least four). If we want to do any meaningful analysis, we need to visit each webpage, collect the data and finally put it in a single dataset.
Serve Stats
We’ll start with the statistics related to serves. Open up a web browser and navigate to the serve stats webpage:
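A sketch of starting the session and navigating there; the exact stats URL is an assumption and may change:
In[]:=
session=StartWebSession[];
WebExecute[session,"OpenPage"->"https://www.atptour.com/en/stats/serve"]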
We see our first challenge: when we open the webpage, we see only the first 50 players. To see all, we need to click a button labeled “Load More” at the bottom of the page. Since we don’t want to do this manually, we can use the methods in WebExecute to achieve that, but first we need to locate the button. By inspecting the webpage, we see that it is located in an <a> tag, so get those first:
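A sketch using a CSS selector over anchor tags:
In[]:=
anchors=WebExecute[session,"LocateElements"->"CSSSelector"->"a"];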
There are a lot of <a> elements on a webpage, and we only need one of them. For each <a> element, we check if its text is equal to “Load More”, and if it’s not, we discard it, which should leave us with only the <a> element that we care about:
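A sketch of filtering by visible text and clicking the remaining element; depending on the page, the click may need to be repeated until everything is loaded:
In[]:=
loadMore=Select[anchors,WebExecute[session,"ElementText"->#]==="Load More"&];
WebExecute[session,"ClickElement"->First[loadMore]]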
Once all the players are loaded, we can extract the information that we care about. In this case, each row in the table is in a <tr> element, so we get that:
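A sketch of grabbing every table row and its text:
In[]:=
rows=WebExecute[session,"LocateElements"->"CSSSelector"->"tr"];
rowText=WebExecute[session,"ElementText"->rows];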
And we clean it up with a RegularExpression:
Return Stats
Since this page is almost identical to the previous one, we follow the same procedure. The only difference is that the table contains fewer elements, so our final function that extracts the information from the table is slightly modified:
Under Pressure Stats
This is the same type of webpage as before, so copy/paste the procedure:
Rankings
The final page we care about is slightly different. First, all the players are listed automatically, so there is no need to look for a “Load More” button to click. We open up our browser and navigate to the webpage:
Getting the Countries
There are two things that we are interested in on this webpage: the players’ rankings and their countries. Since getting the countries is a bit more complicated, let’s start with that. All of the little flags beside the player’s image are located in <use> elements, which we can leverage:
Now we have a list of three-letter country codes that we need to convert to actual country entities. The built-in Interpreter function can do a lot of the heavy lifting even with three-letter inputs. We can test:
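For example (codes such as "ESP" are typically resolved correctly):
In[]:=
Interpreter["Country"]["ESP"]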
However, in some cases, it might fail:
Let’s create a helper function that will take our countryCodes and return country entities. In it, we want to try interpreting the country code with Interpreter; if that fails, we want to use a more flexible tool like an LLM:
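A minimal sketch of such a helper; the name countryFromCode is illustrative, and it assumes an LLM connection is configured:
In[]:=
countryFromCode[code_String]:=Module[{result},
	result=Interpreter["Country"][code];
	If[FailureQ[result],
		(* fall back to an LLM to expand the code, then interpret the returned name *)
		Interpreter["Country"][StringTrim[LLMSynthesize["Reply with only the English name of the country whose three-letter code is "<>code]]],
		result]]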
We can now map our new helper function to our country codes to get a list of countries:
Getting the Rankings
Get all <a> elements:
Check if there is a player contained in each <a> element. <a> elements containing players have a “href” attribute, which we can represent as a string expression. We check if each element in “anchors” matches that expression, and we discard it if it does not:
Now that we have the anchors that contain players, we can extract their text:
Since the player names are already ordered by the way they appear on the webpage, we don’t need to extract their rankings separately; we can simply pair the first 100 names with Range[100]:
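A sketch, assuming a hypothetical playerNames list ordered as on the page:
In[]:=
rankings=AssociationThread[Take[playerNames,100]->Range[100]];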
Putting Everything Together
Putting Everything Together
Once we have all the information we want from the different webpages, we can put it together in a single dataset:
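A sketch of combining the per-category associations; the variable names are hypothetical and assume each association is keyed by player name:
In[]:=
atpData=Dataset[Merge[{serveStats,returnStats,pressureStats,rankings},Identity]]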
Exploration of the Data
Exploration of the Data
Once we have the data in Wolfram Language, we can start using any of the thousands of built-in functions to get insights. For example, find out which country has the most representatives in the ATP Top 100:
As well as world rankings, players are ranked across three other categories: serves, returns and performance under pressure. Find the player with the biggest discrepancy between world ranking and serve ranking:
Find the players who can serve really well but can’t return:
Example: Headless Browsers, JavaScript Injections
In this example, we will see how we can load dynamic elements by scrolling to the bottom of the page to load more. Start our web session as usual:
Navigate to the resource repository. If you open this page separately, you will see that by default it loads only a few rows of elements. When the user scrolls down to the bottom of the page, more are loaded. If we were controlling the browser manually, we would keep scrolling to the bottom of the page until all elements are loaded. But how can we do that programmatically?
Since there is no “keep scrolling down” functionality baked into WebExecute, we can write a simple JavaScript loop to do what we need. Here, we get the height of the webpage before scrolling down to the bottom. If the new height of the page is larger than the previous one, this indicates that more content has loaded, so we repeat. We only stop when the height doesn’t change after the scroll:
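A minimal sketch of that loop; it assumes the "JavascriptExecute" command returns the script’s return value:
In[]:=
Module[{previous=-1,current=0},
	While[current>previous,
		previous=current;
		(* scroll to the bottom so the page loads more items *)
		WebExecute[session,"JavascriptExecute"->"window.scrollTo(0, document.body.scrollHeight);"];
		Pause[2]; (* give the new items time to load *)
		current=WebExecute[session,"JavascriptExecute"->"return document.body.scrollHeight;"]]]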
And clean them up as usual:
Write a small function to extract the authors of all the contributed examples:
Collect all the authors in a variable and clean up the web session:
Clean up the authors and tally them:
Conclusion
Between simple HTML requests with Import and URLRead, async requests with URLSubmit and automated web browsers with WebExecute, you have all the tools needed to build up web extraction workflows for virtually any website. The advantage of doing so in Wolfram Language, other than the accessibility and ease of use of these functions, is that once you get your data, you are able to immediately analyse, compute and visualise it with the thousands of built-in functions in Wolfram Language.