Ranking of search results is an art and a science. It is the most closely guarded secret of the major search engine companies. Google, Yahoo, and Microsoft have teams of people, sworn to secrecy, who focus on this every day. The algorithms typically evaluate 10 to 20 factors to help rank the results...all in less than a second, before presenting them to the user. It turns out that there is a Long Tail of Words. Amazingly, just 10 words account for 25% of the total words found in a typical document. As many of you know, I was director of engineering at AltaVista many years ago, so I have a special interest in this stuff.
Alex Barnett's blog contained a reference to the Oxford English Corpus, which is a collection of English words from novels, journals, newspapers, magazines, and even online content. The OEC is used by search engine companies as a stable test bed for new algorithms. The interesting point here is which words are most common, and how often they are used.
The most basic and important factor in ranking search results is the uniqueness of the words contained in the query. All known words are given a weighting factor with the most rare words given the highest ranking. The most common words are ignored, and some common words typically used in spam are given a negative weighting.
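This rare-words-score-higher idea is essentially inverse document frequency. Here is a minimal sketch of that weighting, assuming a toy corpus and made-up stopword and spam lists (none of this is any engine's actual implementation):

```python
import math

# Hypothetical mini-corpus: each document is a list of words.
docs = [
    ["the", "cat", "climbed", "the", "tall", "tree"],
    ["the", "dog", "sat", "by", "the", "tree"],
    ["a", "rare", "saboteur", "climbed", "the", "fence"],
]

# Words ignored entirely (the most common lemmas).
STOPWORDS = {"the", "a", "by"}
# Words that attract a penalty (spam-associated terms).
SPAM_WORDS = {"viagra"}

def word_weight(word, docs):
    """Inverse-document-frequency-style weight: the fewer documents
    a word appears in, the higher its weight. Stopwords get zero;
    spam terms get a negative weight."""
    if word in STOPWORDS:
        return 0.0
    if word in SPAM_WORDS:
        return -10.0
    containing = sum(1 for d in docs if word in d)
    # Add 1 to the denominator so unseen words don't divide by zero.
    return math.log(len(docs) / (1 + containing)) + 1

print(word_weight("the", docs))  # 0.0 — too common, ignored
print(word_weight("saboteur", docs) > word_weight("climbed", docs))  # True
```

A word like "saboteur" (one document) outweighs "climbed" (two documents), which is the whole point: rare query words do most of the discriminating.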
Hint #1 - Don't use common words, or words typically associated with spam. Viagra will get you a negative weighting unless you are from the drug industry and have reputable links to your page.
How many English words are there? The answer from AskOxford.com:
Although it may be impossible to know the number of words in English, the Oxford English Corpus can help us assess the number of words in current use.
It is most useful to count base words or lemmas rather than individual inflectional word-forms; for example, climbs, climbing, and climbed are counted as examples of the lemma climb. Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the one billion words used in the Oxford English Corpus. If you were to read through the corpus, one word in four would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas.
| Vocabulary size (no. of lemmas) | % of content in OEC | Example lemmas |
|---|---|---|
| 10 | 25% | the, of, and, to, that, have |
| 100 | 50% | from, because, go, me, our, well, way |
| 1,000 | 75% | girl, win, decide, huge, difficult, series |
| 7,000 | 90% | tackle, peak, crude, purely, dude, modest |
| 50,000 | 95% | saboteur, autocracy, calyx, conformist |
| >1,000,000 | 99% | laggardly, endobenthic, pomological |
The remaining 5% of the corpus consists of a very large number of lemmas which occur rarely: words like evidentialist or microhouse, which may occur only once every several million words. Like all natural languages, English consists of a small number of very common words, a larger number of intermediate ones, and then an indefinitely long 'tail' of rare terms.
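You can reproduce this kind of coverage curve for any text by counting word frequencies and accumulating them in rank order. A minimal sketch on a toy sentence (the OEC numbers above come from a billion words, of course, not from anything this small):

```python
from collections import Counter

def coverage(words, vocab_size):
    """Fraction of all tokens covered by the vocab_size most common words."""
    counts = Counter(words)
    top = counts.most_common(vocab_size)
    return sum(n for _, n in top) / len(words)

text = "the cat sat on the mat and the dog sat by the door".split()
print(coverage(text, 1))  # "the" alone covers 4 of 13 tokens, ~0.31
print(coverage(text, 3) > coverage(text, 1))  # True — coverage grows with vocabulary
```

Even in thirteen words, one word does a quarter of the work: the Long Tail shape shows up at every scale.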
What are the most common words? Based on the evidence of the billion-word Oxford English Corpus, the 100 commonest English words found in writing around the world are as follows:
[Table: the 100 commonest English words, beginning with the ten lemmas listed above (the, be, to, of, and, a, in, that, have, I). Only a fragment of the original table survived here.]
Hint #2 - The frequency of a word on a page, and the position of the word on the page, are very important factors in ranking results. However, use a word too often and it will be detected as spam, and given a huge negative weight. For example, in the early days webmasters would repeat a word over and over in the text, place it in the title and meta tags, and even put it on the page in white font so the user wouldn't see it but the search engine would. The search engines got wise to all these tricks and adjusted their algorithms accordingly.
Hint #3 - Words that appear in the title of a page are more important than the same word in the body of the text. Words that appear in bold are more important than other words.
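Hints #2 and #3 can be rolled into a simple per-page score. This is a hypothetical sketch; every weight and the spam cutoff are illustrative values I made up, not any engine's real numbers:

```python
import re

def page_score(query_word, page):
    """Score one page for one query word using frequency, position,
    title, and bold boosts. All weights here are invented."""
    body = re.findall(r"[a-z']+", page["body"].lower())
    freq = body.count(query_word)
    if freq == 0:
        return 0.0
    # Excessive repetition looks like keyword stuffing: huge negative weight.
    if freq / len(body) > 0.2:
        return -100.0
    score = float(freq)
    # Words that appear earlier on the page count a little more.
    first = body.index(query_word)
    score += 1.0 / (1 + first)
    # Title words matter more than body words.
    if query_word in page["title"].lower().split():
        score += 5.0
    # Bold words matter more than plain words.
    if query_word in (w.lower() for w in page["bold"]):
        score += 2.0
    return score

page = {
    "title": "Climbing Guide",
    "body": "A short guide to climbing. Climbing safely takes practice and patience.",
    "bold": ["climbing"],
}
print(page_score("climbing", page))  # 9.2: frequency + position + title + bold boosts
```

Note the cliff in the middle: past a repetition threshold the score flips from a boost to a big penalty, which is exactly the trap the white-font-keyword crowd fell into.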
The Search Engine Optimization companies are in a "cat and mouse" game with the search engines. The SEOs reverse engineer the algorithms to determine what is important to each search engine. The search engine guys are constantly tweaking their algorithms to beat the spammers. Google has recently given a much higher ranking to blog results than to regular web pages. This could, and will, change as time goes on.
Spammers and SEOs have had the hardest time beating the page link ranking. Most search engines have adopted the PageRank system made famous by Larry Page of Google. Other search engines call it link analysis, link authority, hub and spoke analysis, or some other name, but it all boils down to the same thing. The importance of a page is determined by the number, and quality, of links to it. The quality of a link is in turn determined by the number and quality of links to the linking page. As you can see, this is an iterative process and very compute-intensive.
Whole books have been written on this topic, and there are many companies (SEOs) who do nothing but consult on how to improve your search engine results. So, I will not spend any more time on it here. My main interest today (Saturday) was to share with you "The Long Tail of Words" and the effect it has on search results. I will be back to my normal topics on Monday :-)