ECE101 Lab (Fall 2024)

Lab 5: Search Engines

NetID: <Please fill in>

Part 1: Crawling the Web to Build a Web Graph

Recall that when working with graphs, we use circles to represent nodes and lines between these circles (called "edges") to represent connections between nodes.
​
In a web graph, the nodes (also called vertices) represent URLs (specific web pages) and the edges represent the hyperlinks between them. The web graph has directed edges: each edge is an arrow going from the page containing the hyperlink to the page at the URL the hyperlink points to.
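As a tiny illustration (the three URLs below are made up for this sketch), a directed web graph in the Wolfram Language might look like this:
In[]:=
Graph[{"a.example" -> "b.example", "a.example" -> "c.example", "b.example" -> "c.example"}, VertexLabels -> "Name"]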
We will crawl a very small part of the World Wide Web and build a graph that represents the websites we crawled.
Let’s set a URL as a starting page for the graph:
In[]:=
startPageURL="https://code.org/";

Step 1

Import all the hyperlinks from this page:
In[]:=
startPageHyperlinks=Import[startPageURL,"Hyperlinks"]
Check how many hyperlinks are present on this page alone:
In[]:=
Length[startPageHyperlinks]

Problem 1

List any 3 URLs you find interesting among all the “hyperlinks” obtained from the code.org webpage.

Answer

​

Step 2

Pick one hyperlink at random from this list of hyperlinks on the page:
(Since we are selecting randomly, your choice will be different from the person sitting next to you)
In[]:=
selectedLink=RandomChoice[startPageHyperlinks]

Problem 2

List the URL that RandomChoice picked for you.

Answer

​

Step 3

Repeat steps 1 and 2 for the selected link.
Import all the hyperlinks from the page at this selected link. If it has no links, select another random link.
In[]:=
Import[selectedLink,"Hyperlinks"]
Import all the hyperlinks on the “selectedLink” page and pick 1 link at random from that list:
In[]:=
newLink=RandomChoice[Import[selectedLink,"Hyperlinks"]]

Problem 3

List the new URL RandomChoice picked for you on the second page.

Answer

​

Putting it together

We need to repeat steps 1 and 2 for each hyperlink we find on the starting page, i.e. https://code.org/.
​
Below is some code that combines steps 1 and 2 described above, given any URL as input.
Note: it is possible you will land on a page that has NO hyperlinks. If the page does not have a single hyperlink to follow, the code goes back to the starting page and picks another link at random to follow. This keeps it from getting stuck at a dead end.
In[]:=
startURL = "https://code.org/";

getLinkedURL[pageURL_] :=
 With[
  (* Import the hyperlinks on this page *)
  {listOfLinks = DeleteCases[
     (* Import all the hyperlinks from the page at "pageURL" *)
     Import[pageURL, "Hyperlinks"],
     (* Delete all the links to email addresses, phone numbers and map locations -- we can't use them to grow the web graph *)
     s_ /; StringMatchQ[s, ___ ~~ "mailto" | "tel" | "maps" | "mp3" | "@" ~~ ___]]},
  (* If the number of imported hyperlinks is more than zero *)
  If[Length[listOfLinks] != 0,
   (* Randomly select one hyperlink from among the imported hyperlinks *)
   RandomChoice[listOfLinks],
   (* Else - meaning when there are no hyperlinks to follow...
      go back to the starting page and select another random link from there *)
   RandomChoice[Import[startURL, "Hyperlinks"]]
   ]
  ]
We can “nest” this function so that it is applied repeatedly a certain number of times.
In this case, every time it finds a random hyperlink on a page, the function is called again with that link as input.
It does so 10 times.
(The code may take a little while to run, so please be patient.)
Use the getLinkedURL function to get a list of pages:
In[]:=
listOfLinks=NestList[getLinkedURL[#]&,startURL,10]
Create a web graph from these links:
In[]:=
simpleWebGraph =
 Graph[
  MapThread[#1 -> #2 &, {Most[listOfLinks], Rest[listOfLinks]}],
  VertexLabels -> Placed["Name", Tooltip]
  ]
The graph looks more like a chain than a web because we are exploring a very small part of the web.
We simply keep following a single path: starting at the startURL page, we follow a link to the next page, then to the next page, and so on.
Sometimes there may be loops leading back to a page we have already visited.
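A quick way to spot such loops (a minimal sketch, using the listOfLinks computed above): any URL that appears more than once in the walk closes a loop.
In[]:=
Keys[Select[Counts[listOfLinks], # > 1 &]]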

Putting it together: Advanced (Optional)

Instead of selecting just one hyperlink from each page, we can select more than one at a time and grow the graph in each direction. However, be aware that the graph becomes very big very quickly when you start picking multiple links from each page you land on.
​
For example, if you decide to pick 3 URLs from the starting page and follow the hyperlinks, in 5 hops you can end up with up to 3^5 = 243 new pages at the last hop (and up to 1 + 3 + 9 + 27 + 81 + 243 = 364 nodes in the whole web graph). It can take a long time for the code to evaluate, so I have already evaluated the following code for you to observe and examine.
Set the number of links you’d like to follow from a page to 3:
In[]:=
howManyLinksToPick=3
Out[]=
3
Set the number of hops you’d like to follow to grow the graph to 5 (any more and it will take a very loooooooooong time for the code to run):
In[]:=
howManyHopsToGrow=5
Out[]=
5
The following code will pick multiple hyperlinks from a page at random to follow further and grow the web graph:
In[]:=
(* Set the starting URL *)
startURL = "https://code.org/";

getMultipleLinks[pageURL_, numLinks_] :=
 With[{listOfLinks = DeleteCases[
     (* Import all the hyperlinks from the page at "pageURL" *)
     Import[pageURL, "Hyperlinks"],
     (* Delete all the links to email addresses, phone numbers and map locations -- we can't use them to grow the web graph *)
     s_ /; StringMatchQ[s, ___ ~~ "mailto" | "tel" | "maps" | "mp3" | "@" ~~ ___]]},
  (* Check if you got a non-zero number of links that you can actually follow *)
  If[Length[listOfLinks] > 0,
   (* If yes - pick up to a certain number of links randomly *)
   RandomSample[listOfLinks, UpTo[numLinks]],
   (* Else - meaning when you didn't find any hyperlinks to follow...
      go back to the starting page and select another random link from there *)
   RandomSample[Import[startURL, "Hyperlinks"], UpTo[numLinks]]
   ]]

(* Nest the function getMultipleLinks, starting at the startURL, a certain number of times, i.e. "howManyHopsToGrow", to create the graph *)
complexWebGraph = NestGraph[
  getMultipleLinks[#, howManyLinksToPick] &,
  startURL,
  howManyHopsToGrow,
  VertexLabels -> Placed["Name", Tooltip]
  ]
Count how many vertices there are in this graph:
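One way to do this (a one-line sketch, using the built-in VertexCount on the complexWebGraph created above):
In[]:=
VertexCount[complexWebGraph]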

Problem 4

You can hover over the nodes in the web graph you have created (simple or the advanced version) to see the URL of the page.
List a URL in this graph that seems interesting to you (maybe a page you didn’t quite expect to reach from code.org, a page you seem to keep coming back to, or really any other reason it seems interesting).

Answer

​

Problem 5

Change the startURL to a website of your choice and rerun the code above (in the “Putting it together” section) to create another graph. Copy and paste the graph into the answer cell below.
​
You can try either
(1) the simpler version of selecting 1 link at a time or
(2) the advanced version of 3 links at a time for extra credit

Answer

Selecting one link at a time:
​
​
​
​
​
​
​
​

Answer (Extra Credit)

Selecting 3 links at a time:
​
​
​
​
​
​
​
​
​

Problem 6

Looking at the graph, which website do you think is (roughly) the most important? Why?

Answer

​

PageRank of a Web Page in the Web Graph

Graph Centrality

Here is an example of a directed graph.
“Graph centrality” is a quantitative measure for the nodes in a directed graph that can be used to answer questions like:
* Which vertex is “the most important” in this graph?
* What do we even mean by important?
​
In real life this could help answer questions like:
* Who is important in a social network? (not just the person with lots of friends, but maybe one with “important” friends)
* Which academic publication has had the most impact (has been cited by other research papers)?
* Which are the most important and relevant webpages to be listed as a result of a web search? (as discussed in our web search lecture)

Why not just count the number of links to a page?

The “in-degree” of a node (number of links coming into it) may not be sufficient as a measure of importance.
Consider the following graph. Which is the most important node? Is it f, g or m?
The most important node here is probably “g”. You could say g is the most “central” vertex of the graph.
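If you want to check in-degrees in code, the Wolfram Language has a built-in for it. A one-line sketch, applied here to the simpleWebGraph from Part 1 (since the figure’s graph is not available as code):
In[]:=
VertexInDegree[simpleWebGraph]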

What about the probability of arriving at a node as you travel through the graph?

However you travel through this graph, you end up at “g” (sort of “all roads lead to Rome”).
If there is a node in the graph where you end up no matter where you start, then it is probably linked from multiple nodes, and it is probably an “important” node.

The PageRank Algorithm

PageRank is named after Google founder Larry Page. It is also a play on the words “page rank”, i.e. the ranking of a page by the search algorithm. It is a “graph centrality” measure that is used to determine the “importance” of a node.

Intuition

1. Imagine a person standing at each node/page of the web graph
2. Each person chooses an outgoing link at random (equal probability, independently of past/future decisions)
3. They follow that link to another node/page in the web graph
4. If they land on a page that has no outgoing links (a webpage with no hyperlinks), they ‘start over’ by jumping to a random new node in the graph, instead of staying stuck at the node with no outgoing links.
5. The process is repeated many times and then stopped.
6. Now, count how many people are at a particular node in the graph.
​
The number of people at a node is a measure of the node’s “importance” (rank). The more people at a node, the higher its “pagerank centrality”, i.e. the higher its importance.
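To make this intuition concrete, here is a minimal random-surfer sketch (illustration only: the tiny graph and the surferStep function are made up for this sketch and are not part of the lab’s graded code):
In[]:=
tinyGraph = Graph[{1 -> 2, 2 -> 3, 3 -> 1, 1 -> 3, 2 -> 1}];
(* follow a random outgoing link (steps 2-3); if there is none, jump to a random node (step 4) *)
surferStep[v_] := With[{outNeighbors = Cases[EdgeList[tinyGraph], DirectedEdge[v, t_] :> t]},
  If[outNeighbors === {}, RandomChoice[VertexList[tinyGraph]], RandomChoice[outNeighbors]]];
(* repeat many times (step 5) and count how often the surfer visited each node (step 6) *)
ReverseSort[Counts[NestList[surferStep, 1, 10000]]]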

Let the Computer Calculate PageRank for you

The Wolfram Language has a built-in function to compute the pagerank centrality of each node in a graph.
Remember, pagerank centralities represent the likelihood that a person randomly following links arrives at any particular page on the web graph.
Here is an example web graph:
The following code lists the nodes/vertices in the graph:
The following code computes the pagerank centralities of each node/webpage in the web graph:
(The pagerank centralities are given in the same order as the vertex names listed above)
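The original notebook cells are not reproduced here; a sketch of what such cells could look like, assuming the example graph is stored in a variable named exampleWebGraph (the variable name, and the conventional damping factor 0.85, are assumptions):
In[]:=
VertexList[exampleWebGraph]
In[]:=
PageRankCentrality[exampleWebGraph, 0.85]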

Problem 7

What is the pagerank of the page “resource.wolfram”? (Hint: resource.wolfram is the 5th node/vertex in the list of vertices)

Answer

​
You can also use the following code to find the pagerank of a vertex, if you know its index in the list of vertices:
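A sketch of such a line, under the same exampleWebGraph assumption as above ([[5]] picks the 5th entry):
In[]:=
PageRankCentrality[exampleWebGraph, 0.85][[5]]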
Run the following code to rank the webpages according to PageRank Centrality and list them neatly in a table:
(ReverseSortBy orders it from highest pagerank to lowest)
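A sketch of such ranking code, again under the exampleWebGraph assumption (Thread pairs each vertex with its pagerank, and ReverseSortBy orders the pairs from highest to lowest):
In[]:=
ReverseSortBy[Thread[VertexList[exampleWebGraph] -> PageRankCentrality[exampleWebGraph, 0.85]], Last] // TableForm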

Problem 8

Comment on the pagerank of the page “ko.wolfram”. Why do you think it is ranked at this position?

Answer

​

Analyze Another Web Graph

Here is a toy section of the UIUC web graph:

Problem 9

Just by looking at this graph (you can find the name of the pages by hovering over the nodes), which node would you guess has the highest pagerank?

Problem 10

Create a ranking of the pages in the web graph according to the pagerank centrality measure. Use the code from above.
Find the page that has the highest probability of a person arriving at it by randomly clicking on hyperlinks in this graph (i.e. state the page with the highest pagerank centrality).
Does your guess match the computed answer?
Run the following line of code to see the top 10 nodes with the highest pagerank in the toyGraph:
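A sketch of such a line (TakeLargestBy keeps the entries with the largest pagerank; the damping factor 0.85 is an assumption):
In[]:=
TakeLargestBy[Thread[VertexList[toyGraph] -> PageRankCentrality[toyGraph, 0.85]], Last, UpTo[10]]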

Submitting your work

1. Ensure you have filled in your NetID at the top of the notebook.
2. Save the notebook as a PDF file (alternatively, "Print to PDF", but please ensure the PDF looks OK and is not garbled).
3. Upload to Gradescope.
4. Just to be sure that your submission was received, maybe email your TA (sattwik2@illinois.edu) that you have submitted.