Ranking of search results is an art and a science. It is the most closely guarded secret of the major search engine companies. Google, Yahoo, and Microsoft have teams of people, sworn to secrecy, who focus on this every day. The algorithms typically evaluate 10 to 20 factors to help rank the results, all in less than a second, before presenting them to the user. It turns out that there is a Long Tail of Words. Amazingly, just 10 words account for 25% of the total words found in a typical document. As many of you know, I was director of engineering at AltaVista many years ago, so I have a special interest in this stuff.
Alex Barnett's blog contained a reference to the Oxford English Corpus, which is a collection of English words from novels, journals, newspapers, magazines, and even online content. The OEC is used by search engine companies as a stable test bed for new algorithms. The interesting point here is which words are most common, and how often they are used.
The most basic and important factor in ranking search results is the uniqueness of the words contained in the query. All known words are given a weighting factor with the most rare words given the highest ranking. The most common words are ignored, and some common words typically used in spam are given a negative weighting.
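To make the idea concrete, here is a minimal sketch of rarity-based word weighting. It is not any search engine's actual formula. The frequency numbers, the stop-word and spam lists, the penalty value, and the function names are all made up for illustration; the only point is that rarer words get a higher weight, very common words get zero, and spam words go negative.

```python
import math

# Hypothetical occurrences per million words. Illustrative numbers only,
# not real Oxford English Corpus data.
FREQ_PER_MILLION = {
    "the": 50000.0, "i": 10000.0, "do": 5000.0,
    "where": 900.0, "find": 500.0,
    "capital": 60.0, "entrepreneur": 8.0,
}
STOP_WORDS = {"the", "i", "do"}   # assumed "too common to matter" list
SPAM_WORDS = {"viagra"}           # assumed spam list
SPAM_PENALTY = -5.0               # assumed negative weight
DEFAULT_FREQ = 1.0                # unseen words are treated as very rare


def word_weight(word):
    """Return a weight that grows as the word gets rarer."""
    w = word.lower()
    if w in SPAM_WORDS:
        return SPAM_PENALTY
    if w in STOP_WORDS:
        return 0.0
    freq = FREQ_PER_MILLION.get(w, DEFAULT_FREQ)
    return math.log(1_000_000 / freq)


def rank_query_terms(query):
    """Split a query into words and sort them from most to least important."""
    words = [w.strip("?.,!\"'").lower() for w in query.split()]
    weights = {w: word_weight(w) for w in words if w}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for word, weight in rank_query_terms("where do I find entrepreneur capital?"):
        print(f"{word:15s} {weight:6.2f}")
```

With these made-up numbers, "entrepreneur" comes out on top, "capital" second, and "do" and "I" contribute nothing.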
Hint #1 - Don't use common words, or words typically associated with spam. Viagra will get you a negative weighting unless you are from the drug industry and have reputable links to your page.
How many English words are there? The answer from AskOxford.com:
Although it may be impossible to know the number of words in English, the Oxford English Corpus can help us assess the number of words in current use.
It is most useful to count base words or lemmas rather than individual inflectional word-forms; for example, climbs, climbing, and climbed are counted as examples of the lemma climb. Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the one billion words used in the Oxford English Corpus. If you were to read through the corpus, one word in four would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas.
Vocabulary size (no. lemmas) | % of content in OEC | Example lemmas
10 | 25% | the, of, and, to, that, have
100 | 50% | from, because, go, me, our, well, way
1000 | 75% | girl, win, decide, huge, difficult, series
7000 | 90% | tackle, peak, crude, purely, dude, modest
50,000 | 95% | saboteur, autocracy, calyx, conformist
>1,000,000 | 99% | laggardly, endobenthic, pomological

The remaining 5% of the corpus consists of a very large number of lemmas which occur rarely: words like evidentialist or microhouse, which may occur only once every several million words. Like all natural languages, English consists of a small number of very common words, a larger number of intermediate ones, and then an indefinitely long 'tail' of rare terms.
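If you want to see the long tail in your own text, the coverage figures above can be approximated with a few lines of Python. This is only a sketch: the function name coverage_of_top_k is my own, and it counts raw word forms rather than lemmas (so "climbs" and "climbed" are counted separately), which means the numbers will not match Oxford's exactly.

```python
from collections import Counter


def coverage_of_top_k(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent words."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird".split()
    for k in (1, 2, 5):
        print(k, coverage_of_top_k(sample, k))
```

Run it on a long document with k set to 10, 100, and 1000 to compare against the table.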
What are the most common words? Based on the evidence of the billion-word Oxford English Corpus, the 100 commonest English words found in writing around the world are as follows:
1 the 2 be 3 to 4 of 5 and 6 a 7 in 8 that 9 have 10 I 11 it 12 for 13 not 14 on 15 with 16 he 17 as 18 you 19 do 20 at 21 this 22 but 23 his 24 by 25 from
26 they 27 we 28 say 29 her 30 she 31 or 32 an 33 will 34 my 35 one 36 all 37 would 38 there 39 their 40 what 41 so 42 up 43 out 44 if 45 about 46 who 47 get 48 which 49 go 50 me
51 when 52 make 53 can 54 like 55 time 56 no 57 just 58 him 59 know 60 take 61 people 62 into 63 year 64 your 65 good 66 some 67 could 68 them 69 see 70 other 71 than 72 then 73 now 74 look 75 only
76 come 77 its 78 over 79 think 80 also 81 back 82 after 83 use 84 two 85 how 86 our 87 work 88 first 89 well 90 way 91 even 92 new 93 want 94 because 95 any 96 these 97 give 98 day 99 most 100 us
Hint #2 - The frequency of a word on a page, and the position of the word on the page, are very important factors in ranking results. However, use a word too often and it will be detected as spam, and given a huge negative weight. For example, in the early days webmasters would repeat a word over and over in the text, place it in the title and meta tags, and even put it on the page in white font so the user wouldn't see it but the search engine would. The search engines got wise to all these tricks and adjusted their algorithms accordingly.
Hint #3 - Words that appear in the title of a page are more important than the same word in the body of the text. Words that appear in bold are more important than other words.
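Hints #2 and #3 can be sketched as a toy on-page score. Every number here (the title and bold bonuses, the 10% density threshold, the penalty) is an assumption I picked for illustration, and on_page_score is a hypothetical name. Real engines keep their actual values secret and tune them constantly.

```python
def on_page_score(term, title, body, bold_words=(), max_density=0.10):
    """Toy score for how well a page matches a single query term.

    Assumed rules: each occurrence counts, an early first occurrence
    earns a small bonus, title and bold matches earn larger bonuses,
    and a term that exceeds max_density is treated as keyword stuffing.
    """
    term = term.lower()
    body_words = body.lower().split()
    if not body_words:
        return 0.0

    count = body_words.count(term)
    density = count / len(body_words)
    if density > max_density:
        return -10.0  # assumed "huge negative weight" for stuffing

    score = float(count)
    if count:
        # Position bonus: 1.0 if the term is the first word, falling toward 0.
        first = body_words.index(term)
        score += 1.0 - first / len(body_words)
    if term in title.lower().split():
        score += 5.0  # assumed title bonus
    if term in {w.lower() for w in bold_words}:
        score += 2.0  # assumed bold bonus
    return score


if __name__ == "__main__":
    body = "venture capital helps an entrepreneur grow a young company quickly and safely"
    print(on_page_score("entrepreneur", "entrepreneur funding guide", body))
    print(on_page_score("entrepreneur", "funding guide", body))
    print(on_page_score("entrepreneur", "funding guide",
                        "entrepreneur entrepreneur entrepreneur capital"))
```

The same word scores higher when it is also in the title, and the stuffed page ends up negative.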
The Search Engine Optimization companies are in a "cat and mouse" game with the search engines. The SEOs reverse engineer the algorithms to determine what is important to each search engine. The search engine guys are constantly tweaking their algorithms to beat the spammers. Google has recently given a much higher ranking to blog results than to regular web pages. This could, and will, change as time goes on.
Spammers and SEOs have had the hardest time beating the page link ranking. Most search engines have adopted the PageRank system made famous by Larry Page of Google. Other search engines call it link analysis, link authority, hub and spoke analysis, or some other name, but it all boils down to the same thing. The importance of a page is determined by the number, and quality, of links to it. The quality of a link is in turn determined by the importance of the page it comes from, that is, by the number and quality of links to that page. As you can see, this is an iterative process and very compute-intensive.
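The iterative part is easier to see in code. Below is a minimal sketch of the textbook PageRank power iteration on a tiny hand-made link graph. It is not what any production engine runs. The damping factor of 0.85 is the value commonly quoted from the original PageRank paper; the fixed iteration count and the way dead-end pages are handled are simplifying assumptions of mine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank for a dict mapping each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In this graph, page c has two inbound links and ranks highest. Page a has only one inbound link, but it comes from the important page c, so a still ranks well above b, which nobody links to. Now imagine doing this for billions of pages.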
Whole books have been written on this topic, and there are many companies (SEOs) who do nothing but consult on how to improve your search engine results. So, I will not spend any more time on it here. My main interest today (Saturday) was to share with you "The Long Tail of Words" and the effect it has on search results. I will be back to my normal topics on Monday :-)
Pretty interesting post - I've tried digesting it and re-writing it on webmetricsguru.com. Most of what you're saying is already known, but the way you say it, and what the ranking factors for on-page attributes are based on, are somewhat new for me. Not sure how I'd actually apply it but it's something that should be thought about before writing copy.
Posted by: WebMetricsGuru | April 29, 2006 at 08:04 PM
I worry that striving for SEO placement ends up dumbing down headlines - they're what catch our attention - not an SEO optimised headline. Or story.
The thin end of bland?
Posted by: Dennis Howlett | May 02, 2006 at 01:14 PM
Very interesting.
Does it mean that using simple language will negatively influence page ranking? If so, webmasters will tend to use abstruse language to raise ranking - not good for common readers.
Posted by: Mivanov | May 06, 2006 at 04:53 AM
The word ranking has to do with how search engines rank results based on the words in the query. For example, if I entered "where do I find entrepreneur capital?" in a search engine, the word "entrepreneur" would be the most important word to match. The word "capital" would be second, and the rest of the words are almost irrelevant.
Web pages with the word "entrepreneur" in the title, headline, or in bold font, are likely to be ranked very high in the search results.
Word score is just one factor, the most basic, of many in ranking search results. The Oxford word list is a convenient way to illustrate how word scoring works. The SEOs have much more sophisticated methods to optimize search results.
Posted by: Don Dodge | May 06, 2006 at 05:44 PM