Lab 5: Search Engines
NetID:
Link to published notebook:
In this lab, we will try to build a web crawler and create a web graph from our results.
We will also calculate the page rank of a node in a web graph.
Crawling the Web
In this part, we will crawl (a small part of) the World Wide Web and build a graph to represent the websites we crawled.
Let’s set a starting page:
In[]:=
startURL="https://ece.illinois.edu/";
Step 1
Import all the hyperlinks from this page:
In[]:=
homeHyperlinks=Import[startURL,"Hyperlinks"]
Check how many hyperlinks are present on this page alone:
In[]:=
Length[homeHyperlinks]
Step 2
Let’s remove any link that is a phone number, an email address, or something on Google Maps:
linkedPages=DeleteCases[homeHyperlinks,s_/;StringMatchQ[s,___~~"mailto"|"tel"|"maps"~~___]]
Step 3
Let’s just pick 1 link at random from this list:
In[]:=
selectedLink=RandomChoice[linkedPages]
Step 4
Repeat steps 1, 2, and 3 for the selected link:
In[]:=
newLink=RandomChoice[DeleteCases[Import[selectedLink,"Hyperlinks"],s_/;StringMatchQ[s,___~~"mailto"|"tel"|"maps"~~___]]]
Putting it together
We need to repeat steps 1, 2 and 3 for each link we end up with.
Here is a function that will do steps 1, 2 and 3, when it is given any URL as an input:
In[]:=
getLinkedURL[link_]:=RandomChoice[DeleteCases[Import[link,"Hyperlinks"],s_/;StringMatchQ[s,___~~"mailto"|"tel"|"maps"~~___]]]
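Note (optional): Import can fail on some pages (PDF files, broken links, or pages that block crawlers), and after filtering, a page may have no usable hyperlinks left, in which case RandomChoice will complain. The following is only a sketch of a more defensive variant; safeGetLinkedURL is a hypothetical helper name, not part of the lab:
In[]:=
safeGetLinkedURL[link_]:=Module[{links},
	(*import the page's hyperlinks, falling back to an empty list if Import fails*)
	links=Quiet[Check[Import[link,"Hyperlinks"],{}]];
	If[!ListQ[links],links={}];
	(*drop phone numbers, email addresses, and map links, exactly as getLinkedURL does*)
	links=DeleteCases[links,s_/;StringMatchQ[s,___~~"mailto"|"tel"|"maps"~~___]];
	(*if nothing usable remains, stay on the current page instead of failing*)
	If[links==={},link,RandomChoice[links]]]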
In[]:=
listOfLinks=NestList[getLinkedURL[#]&,startURL,10]
Create a web graph from these links:
In[]:=
simpleG=Graph[MapThread[#1->#2&,{Most[listOfLinks],Rest[listOfLinks]}],VertexLabels->Placed["Name",Tooltip]]
Putting it together: Advanced (Optional - Extra credit)
Instead of selecting just one hyperlink from each page, we can simultaneously select more than one and grow the graph in each direction.
Set the number of links you’d like to follow from a page:
numLinks=3
Set the number of hops you’d like to go in the graph:
numHops=3
Note: The larger the values of numLinks and numHops, the longer it will take for the code to execute!!
In[]:=
g=NestGraph[Take[DeleteCases[Import[#,"Hyperlinks"],s_/;StringMatchQ[s,___~~"mailto"|"tel"|"maps"~~___]],UpTo[numLinks]]&,startURL,numHops,VertexLabels->Placed["Name",Tooltip]]
Problem 1
Change startURL to a website of your choice and rerun the code above to create another graph. Copy and paste the graph into the answer cell below.
Answer
Answer (Extra Credit)
Problem 2
Looking at the graph, which website do you think has (roughly) the highest PageRank? Why?
Answer
Page Rank of a Web Page in the Web Graph
The Wolfram Language has a built-in function to compute the page-rank centrality of a node in a graph. Page-rank centralities represent the likelihood that a person randomly following links arrives at any particular page in the web graph.
Here is an example web graph:
(In the original notebook, webGraphBasic is defined here by an embedded example web-graph object, which does not survive this export.)
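As a purely illustrative stand-in, so the cells below can be evaluated, one could define a small graph by hand; the vertices here are assumptions and may differ from the embedded graph, apart from "wolframalpha.com", which is used below:
In[]:=
webGraphBasic=Graph[
	{"wolfram.com"->"wolframalpha.com","wolfram.com"->"mathworld.wolfram.com",
	"mathworld.wolfram.com"->"wolfram.com","wolframalpha.com"->"wolfram.com",
	"reference.wolfram.com"->"wolfram.com","reference.wolfram.com"->"wolframalpha.com"},
	VertexLabels->Placed["Name",Tooltip]];(*stand-in graph; vertices are assumed for illustration*)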
List the pages in the graph and compute their page-rank centralities:
In[]:=
VertexList[webGraphBasic]
In[]:=
PageRankCentrality[webGraphBasic,0.85]
We can rank the webpages according to their page-rank centrality, the metric that represents PageRank:
In[]:=
Part[VertexList[webGraphBasic],Ordering[PageRankCentrality[webGraphBasic,0.85],All,Greater]]
We can compute the probability that a person randomly clicking on hyperlinks will arrive at a particular page:
In[]:=
Part[PageRankCentrality[webGraphBasic,0.85],VertexIndex[webGraphBasic,"wolframalpha.com"]]
Analyze the UIUC graph:
Here is a toy section of the UIUC web graph:
(In the original notebook, uiucGraph is defined here by an embedded toy web-graph object, which does not survive this export.)
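As a purely illustrative stand-in, so the cells below can be evaluated, one could define a small graph by hand; the URLs here are assumptions, chosen so that "https://illinois.edu" (used in Problem 3) is present and receives the most incoming links:
In[]:=
uiucGraph=Graph[
	{"https://ece.illinois.edu/"->"https://illinois.edu","https://cs.illinois.edu/"->"https://illinois.edu",
	"https://grainger.illinois.edu/"->"https://illinois.edu","https://illinois.edu"->"https://grainger.illinois.edu/",
	"https://grainger.illinois.edu/"->"https://ece.illinois.edu/","https://ece.illinois.edu/"->"https://grainger.illinois.edu/"},
	VertexLabels->Placed["Name",Tooltip]];(*stand-in graph; URLs are assumed for illustration*)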
Find the node with the maximum “In” degree, i.e., the maximum number of links pointing to the page:
In[]:=
GraphHub[uiucGraph,"In"]
What is the vertex in-degree of these pages?
In[]:=
VertexInDegree[uiucGraph,#]&/@GraphHub[uiucGraph,"In"]
Problem 3
Compute the probability that a person randomly clicking on hyperlinks will arrive at “https://illinois.edu” in the graph uiucGraph.
Part[PageRankCentrality[
Problem 4
PageRankCentrality[g, α] gives a list of page-rank scores for the vertices in the graph g for some weight α.
Rank the web pages, with the most visible pages first:
orderedNodes=
Look at the top 5 nodes:
In[]:=
Take[orderedNodes,5]
Look at the bottom 5 nodes:
In[]:=
Take[orderedNodes,-5]
Highlight the nodes that have high page rank:
In[]:=
HighlightGraph[g,Take[orderedNodes,5]]
Problem 5
How well does your answer in Problem 2 match up with the highlighted nodes above? Explain why there might be a difference (if there is a difference).
Submitting your work
1. Publish your notebook.
	1.1. From the cloud notebook, click on “Publish” at the top right corner.
	1.2. From the desktop notebook, use the menu option File -> Publish to Cloud.
2. Copy the published link.
3. Add it to the top of the notebook, below your NetID.
4. Print to PDF.
5. Upload to Gradescope.
6. Just to be sure, maybe ping your TA Sattwik on Slack that you have submitted.