The flow of PageRank

Feb 02 2010

You may be familiar with Google’s PageRank technology. Google considers lots of variables to calculate the PageRank of your website. This is a discussion on an extremely simplified version of PageRank.

Assume that we rank the websites by the number and quality of incoming links. The quality of an incoming link is defined as a function of the PageRank of the site which link to the other.

Let us take an example. The following figure shows how a small group of websites link to each other.

Note that website F does not have any incoming link while website G does not have any outgoing link.

Now from the given graph of links we have to find out the (relative) PageRank of each of the websites. Initially we will assume that all the pages have the same PageRank. Now we count the number of incoming links to each site and change the PageRank according to the number of incoming links.

We define PageRank of site A as:

PR(A) = ? PR(x)/L(x)

where L(x) = number of outgoing links in site x

and x denotes the sites linking to A.

When you run this algorithm for the first time, the PageRank of all the pages get updated. Now the problem is that since the PageRank of all the incoming pages have been updated, we have to re-calculate the PageRank of the pages again to take the new PageRank values into consideration. (You can predict this problem just by noticing that the function is a recursive one.) The same problem surfaces in every iteration of the algorithm.

The question is that if the PageRanks change in every iteration, how do we know when to stop the iteration? Do the PageRanks ever stabilize? (The proper term is convergence).

Here is a python script to simulate the PageRank calculation many times over and over to find out whether the values converge or not. The output values are represented as percentages. (Google considers this value as the probability of a person visiting any particular website).

The chart below shows how the PageRank changes after each iteration:

As you can see the PageRanks fluctuate highly in the initial iterations and then they stabilize. This means that the PageRank function converges.

Another think to note is that adding more nodes to the graph did not seem to affect the convergence. Even if you double the number of sites in the collection, the number of iterations taken to converge stays almost the same. Others have also reached the same results (ppt). The PageRank function is analogous to electric current flowing through a mesh. Even if there are a lot of nodes and sources, the current flow stabilizes (and stabilizes really fast).

Also note that site D has the highest PageRank, which is to be expected as it has the most incoming links. Site F has the lowest PageRank because it does not have incoming links.

According to this algorithm, linking out to other sites do not reduce the PageRank of your website. There is a problem though. Take the case of site G. It does not link to any other site. This means that the PageRank is not flowing out of site G to any other site. If site G linked to other sites, it would have increased the PageRank of the other sites by a tiny bit. (This case affects only the first link out of any site). To solve this problem, Google divides the PageRank of sites like these (called sinks) to all other sites. You may also want to read about Damping factor.

Before leaving can you explain why the PageRank of site A is greater than that of site B?

16 responses so far

  • Kevin says:

    The algorithm shows how the number of incoming links can affect the page rank of a site. But what about the quality i.e the number of visitors to sites that have out links to the site. Where does that figure in this?

  • Niyaz PK says:

    As I said earlier, this is an extremely simplified form of the original PageRank algorithm. Google (and others who use similar algorithms) take many other factors (keyword density, traffic, domain age etc) into consideration for calculating the PageRank of a web page.

    Having said that, I would like to point out that our version of the algorithm does in fact take the quality of the links into consideration! In every iteration of the loop, the PageRank (from the previous iteration) of the incoming links are taken into consideration. PageRank (as per the definition given in the article) is correlated to the traffic to the site.

  • [...] Oh. Now I’d like to say that I was on the Digg front page, or that I’d been virally retweeted overnight, but I’m not that lucky. I’d be reorganising my office, and changed the structure of my network in the process. I forgot to check the state of my server afterwards which is a big mistake as I try to increase my pagerank. [...]

  • [...] The flow of PageRank | Diovo. [...]

  • Simon says:

    Very interesting.

    It seems as though the algorithm converges with A > B because G doesn’t link to anything. As a result, on each pass, some of the total Page Rank is lost (i.e. it is collected by G, but not redistributed back into the network – this is not solved by the normalization…in fact, if you remove the normalization you will notice that is actually converging on all zeros). If G is removed from the network or links somewhere else, the system converges correctly (with or without normalization, although some normalization does appear to significantly increase the rate of convergence).

    Problems also arise if two networks are connected in one direction only (i.e. G is connected to some node H, which is only connected back to G). All the Page Rank that flows into G gets trapped in G and H and never flows back out into the rest of the network. This is probably the idea that a lot of spammers use to try to get their page ranks up. Clearly, Google has solved this problem (I wonder how!).

  • [...] PageRank is used to rank webpages which are represented as a directed graph based on the concept of network flow, and may also be applied to other directed graphs including our [...]

  • vich says:

    Why google is not included in my site?

  • [...] Oh. Now I’d like to say that I was on the Digg front page, or that I’d been virally retweeted overnight, but I’m not that lucky. I’d been reorganising my office, and changed the structure of my network in the process. I forgot to check the state of my server afterwards which is a big mistake as I try to increase my pagerank. [...]

  • Dejan SEO says:

    This is one of the most intelligent studies of PageRank I have seen in a long time.

  • That’s a great explanation of Page Rank. Many thanks for explaining the mathematics behind it.

    • Mitch says:

      Great article amazing insight into page rank. I am very intere1sted in how this all works. Do you think if the link is no follow but high page rank it still gives value.

  • sydny says:

    hello,

    may i know what is pr flow in short??

  • outlr says:

    This post really cleared my few doubt.
    But can you explain that how G can increase page rank.

  • I blog frequently and I seriously appreciate your content.
    This great article has really peaked my interest.
    I will bookmark your blog and keep checking for
    new details about once a week. I subscribed to your Feed as well.

Leave a Reply