You may be familiar with Google’s PageRank technology. Google considers many variables when calculating the PageRank of your website. This is a discussion of an extremely simplified version of PageRank.
Assume that we rank websites by the number and quality of their incoming links, where the quality of an incoming link is a function of the PageRank of the site the link comes from.
Let us take an example. The following figure shows how a small group of websites link to each other.
Note that website F does not have any incoming link while website G does not have any outgoing link.
Now, from the given graph of links, we have to find the (relative) PageRank of each website. Initially we assume that all pages have the same PageRank. We then count the incoming links to each site and update its PageRank accordingly.
We define PageRank of site A as:
PR(A) = Σ PR(x)/L(x)

where the sum runs over every site x that links to A, and L(x) is the number of outgoing links on site x.
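To make the formula concrete, here is a worked instance with made-up numbers (this tiny graph is hypothetical, not the one in the figure): sites X and Y both link to A, X has 2 outgoing links, Y has 4, and both currently have a PageRank of 0.25.

```python
def pagerank_of(site, links, ranks):
    """PR(site) = sum of PR(x)/L(x) over every site x linking to `site`."""
    return sum(ranks[x] / len(links[x])
               for x in links if site in links[x])

# Hypothetical graph: each site maps to its list of outgoing links.
links = {"X": ["A", "B"], "Y": ["A", "B", "C", "D"]}
ranks = {"X": 0.25, "Y": 0.25}
print(pagerank_of("A", links, ranks))  # 0.25/2 + 0.25/4 = 0.1875
```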
When you run this algorithm for the first time, the PageRank of every page gets updated. The problem is that since the PageRanks of all the incoming pages have now changed, we have to re-calculate the PageRank of each page again to take the new values into consideration. (You can predict this problem just by noticing that the function is a recursive one.) The same problem surfaces in every iteration of the algorithm.
The question is: if the PageRanks change in every iteration, how do we know when to stop iterating? Do the PageRanks ever stabilize? (The proper term is convergence.)
Here is a Python script that simulates the PageRank calculation over many iterations to find out whether the values converge. The output values are expressed as percentages. (Google interprets this value as the probability of a person visiting that particular website.)
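A minimal sketch of the kind of script described (the original script is not reproduced here, and the link graph below is illustrative, not the one in the figure): start with equal PageRanks, apply the update rule repeatedly, and stop once the values stop changing.

```python
def simulate(links, tol=1e-9, max_iter=1000):
    """links maps each site to the list of sites it links out to."""
    sites = list(links)
    ranks = {s: 1.0 / len(sites) for s in sites}   # equal initial PageRank
    for _ in range(max_iter):
        # New PR(s) = sum of PR(x)/L(x) over every site x linking to s
        new = {s: sum(ranks[x] / len(links[x])
                      for x in sites if s in links[x])
               for s in sites}
        delta = max(abs(new[s] - ranks[s]) for s in sites)
        ranks = new
        if delta < tol:            # values have stabilized: converged
            break
    return ranks

# Illustrative sink-free graph: A -> B, B -> A and C, C -> A.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
for site, pr in simulate(graph).items():
    print(f"{site}: {pr:.1%}")     # converges to A: 40.0%, B: 40.0%, C: 20.0%
```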
The chart below shows how the PageRank changes after each iteration:
As you can see, the PageRanks fluctuate sharply in the initial iterations and then stabilize. This means that the PageRank function converges.
Another thing to note is that adding more nodes to the graph did not seem to affect the convergence. Even if you double the number of sites in the collection, the number of iterations taken to converge stays almost the same. Others have also reached the same results (ppt). The PageRank function is analogous to electric current flowing through a mesh: even if there are many nodes and sources, the current flow stabilizes (and stabilizes really fast).
Also note that site D has the highest PageRank, which is to be expected as it has the most incoming links. Site F has the lowest PageRank because it does not have incoming links.
According to this algorithm, linking out to other sites does not reduce the PageRank of your website. There is a problem, though. Take the case of site G. It does not link to any other site, which means that no PageRank flows out of site G to any other site. If site G linked to other sites, it would have increased their PageRank by a tiny bit. (This case affects only the first link out of any site.) To solve this problem, Google distributes the PageRank of sites like these (called sinks) among all other sites. You may also want to read about the damping factor.
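The two fixes just mentioned can be sketched as follows (graph and parameter names are illustrative, and this is a simplification, not Google's actual implementation): the PageRank accumulated by sinks is shared equally among all sites, and a damping factor d (the commonly cited value is 0.85) models a surfer who occasionally jumps to a random page.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    sites = list(links)
    n = len(sites)
    ranks = {s: 1.0 / n for s in sites}
    for _ in range(max_iter):
        # PageRank currently sitting on sinks (sites with no outgoing links)
        sink_mass = sum(ranks[s] for s in sites if not links[s])
        new = {}
        for s in sites:
            incoming = sum(ranks[x] / len(links[x])
                           for x in sites if s in links[x])
            # random-jump term + damped link-following term,
            # with each sink's PageRank split evenly over all sites
            new[s] = (1 - d) / n + d * (incoming + sink_mass / n)
        delta = max(abs(new[s] - ranks[s]) for s in sites)
        ranks = new
        if delta < tol:
            break
    return ranks

# G is a sink here; its PageRank no longer leaks away, so the
# final PageRanks still sum to 1.
graph = {"A": ["B", "G"], "B": ["A"], "G": []}
print(pagerank(graph))
```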
Before leaving, can you explain why the PageRank of site A is greater than that of site B?