The Long Tail of Words - How search engines rank results

Ranking of search results is an art and a science. It is the most closely guarded secret of the major search engine companies. Google, Yahoo, and Microsoft have teams of people, sworn to secrecy, who focus on this every day. The algorithms typically evaluate 10 to 20 factors to help rank the results, all in less than a second, before presenting them to the user. It turns out that there is a Long Tail of Words. Amazingly, just 10 words account for 25% of the total words found in a typical document. As many of you know, I was director of engineering at AltaVista many years ago, so I have a special interest in this stuff.

Alex Barnett's blog contained a reference to the Oxford English Corpus (OEC), which is a collection of English words from novels, journals, newspapers, magazines and even online content. The OEC is used by search engine companies as a stable test bed for new algorithms. The interesting points here are the number of words, which of them are most common, and how often they are used.

The most basic and important factor in ranking search results is the uniqueness of the words contained in the query. All known words are given a weighting factor, with the rarest words given the highest weight. The most common words are ignored, and some common words typically used in spam are given a negative weighting.

Hint #1 - Don't use common words, or words typically associated with spam. Viagra will get you a negative weighting unless you are from the drug industry and have reputable links to your page.
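
To make the weighting idea concrete, here is a minimal sketch in Python. It is not how any particular search engine does it. The document counts, the stop-word list, the spam list and the penalty value are all made-up assumptions, and the log-of-rarity formula is the textbook inverse document frequency, swapped in as a stand-in for whatever the real engines use.

```python
import math
import re

# All values below are hypothetical, chosen only for illustration.
N_DOCS = 1_000_000                      # assumed size of the index
DOC_FREQ = {                            # assumed number of pages containing each word
    "where": 400_000,
    "find": 300_000,
    "capital": 20_000,
    "entrepreneur": 2_000,
}
STOP_WORDS = {"the", "do", "i", "of", "and", "a"}   # most common words: ignored
SPAM_WORDS = {"viagra"}                             # spam words: negative weight
SPAM_PENALTY = -5.0


def word_weight(word):
    """Return a weight for one query word: the rarer the word, the higher the weight."""
    w = word.lower()
    if w in SPAM_WORDS:
        return SPAM_PENALTY
    if w in STOP_WORDS:
        return 0.0
    # Words missing from the table are treated as appearing on a single page.
    doc_freq = DOC_FREQ.get(w, 1)
    return math.log(N_DOCS / doc_freq)


def rank_query_words(query):
    """Return the words of a query, most important first."""
    words = re.findall(r"[a-z]+", query.lower())
    return sorted(words, key=word_weight, reverse=True)
```

With these made-up counts, `rank_query_words("where do I find entrepreneur capital?")` puts "entrepreneur" first and "capital" second, and the stop words land at the bottom with a weight of zero.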

How many English words are there? The answer from AskOxford.com:

Although it may be impossible to know the number of words in English, the Oxford English Corpus can help us assess the number of words in current use.

It is most useful to count base words or lemmas rather than individual inflectional word-forms; for example, climbs, climbing, and climbed are counted as examples of the lemma climb. Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the one billion words used in the Oxford English Corpus. If you were to read through the corpus, one word in four would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas.

Vocabulary size (no. of lemmas)   % of content in OEC   Example lemmas
10                                25%                   the, of, and, to, that, have
100                               50%                   from, because, go, me, our, well, way
1,000                             75%                   girl, win, decide, huge, difficult, series
7,000                             90%                   tackle, peak, crude, purely, dude, modest
50,000                            95%                   saboteur, autocracy, calyx, conformist
>1,000,000                        99%                   laggardly, endobenthic, pomological

The remaining 5% of the corpus consists of a very large number of lemmas which occur rarely: words like evidentialist or microhouse, which may occur only once every several million words. Like all natural languages, English consists of a small number of very common words, a larger number of intermediate ones, and then an indefinitely long 'tail' of rare terms.
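
If you want to check the "few words cover most of the text" effect on your own documents, a rough sketch follows. One important assumption: it counts raw word forms, not lemmas, so "climbs" and "climbed" are counted separately and the numbers will not match the Oxford figures exactly. The function name `coverage` is my own.

```python
import re
from collections import Counter


def coverage(text, top_n):
    """Fraction of all words in `text` accounted for by its `top_n` most frequent words."""
    # Crude tokenizer: lowercase runs of letters and apostrophes, no lemmatization.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(words)
```

For example, in "the cat and the dog and the bird" the single most common word ("the") covers 3 of the 8 words, so `coverage(..., 1)` is 0.375, and adding "and" brings it to 0.625.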

What are the most common words? Based on the evidence of the billion-word Oxford English Corpus, the 100 commonest English words found in writing around the world are as follows:

1     the
2     be
3     to
4     of
5     and
6     a
7     in
8     that
9     have
10    I
11    it
12    for
13    not
14    on
15    with
16    he
17    as
18    you
19    do
20    at
21    this
22    but
23    his
24    by
25    from
26    they
27    we
28    say
29    her
30    she
31    or
32    an
33    will
34    my
35    one
36    all
37    would
38    there
39    their
40    what
41    so
42    up
43    out
44    if
45    about
46    who
47    get
48    which
49    go
50    me
51    when
52    make
53    can
54    like
55    time
56    no
57    just
58    him
59    know
60    take
61    people
62    into
63    year
64    your
65    good
66    some
67    could
68    them
69    see
70    other
71    than
72    then
73    now
74    look
75    only
76    come
77    its
78    over
79    think
80    also
81    back
82    after
83    use
84    two
85    how
86    our
87    work
88    first
89    well
90    way
91    even
92    new
93    want
94    because
95    any
96    these
97    give
98    day
99    most
100   us

Hint #2 - The frequency of a word on a page, and the position of the word on the page, are very important factors in ranking results. However, use a word too often and it will be detected as spam and given a huge negative weight. For example, in the early days webmasters would repeat a word over and over in the text, place it in the title and meta tags, and even put it on the page in white font so the user wouldn't see it but the search engine would. The search engines got wise to all these tricks and adjusted their algorithms accordingly.
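
Here is a toy version of Hint #2, just to show the shape of the idea. The 10% density threshold, the size of the penalty and the way the position bonus is calculated are all assumptions of mine. Real engines do not publish these numbers.

```python
import re

# Hypothetical tuning values, not taken from any real search engine.
STUFFING_THRESHOLD = 0.10   # a word making up more than 10% of the page looks like spam
STUFFING_PENALTY = -10.0


def on_page_score(text, term):
    """Score one term on one page from its frequency and the position of its first use."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    term = term.lower()
    positions = [i for i, w in enumerate(words) if w == term]
    if not positions:
        return 0.0
    density = len(positions) / len(words)
    if density > STUFFING_THRESHOLD:
        # Keyword stuffing: the page is penalized instead of rewarded.
        return STUFFING_PENALTY
    # A first occurrence near the top of the page earns a bonus close to 1,
    # a first occurrence near the bottom earns a bonus close to 0.
    position_bonus = 1.0 - positions[0] / len(words)
    return len(positions) + position_bonus
```

A page that opens with the term scores higher than one that mentions it once at the very end, and a page that is nothing but the term repeated gets the penalty.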

Hint #3 - Words that appear in the title of a page are more important than the same word in the body of the text. Words that appear in bold are more important than other words.
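
Hint #3 can be sketched as a set of per-field weights. The page is represented here as a plain dictionary with "title", "bold" and "body" text, which is my own simplification, and the weights 5, 2 and 1 are invented for illustration.

```python
# Hypothetical field weights: a title match counts more than a bold match,
# which counts more than a plain body match.
FIELD_WEIGHTS = {"title": 5.0, "bold": 2.0, "body": 1.0}


def field_score(page, term):
    """Sum the weighted occurrences of `term` across the fields of `page`.

    `page` is a dict mapping field names ("title", "bold", "body") to text.
    Missing fields are treated as empty.
    """
    term = term.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        score += weight * words.count(term)
    return score
```

With these weights, a page titled "Entrepreneur capital" scores 5.0 for "entrepreneur", while a page that only mentions the word once in the body scores 1.0.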

The Search Engine Optimization companies are in a "cat and mouse" game with the search engines. The SEOs reverse engineer the algorithms to determine what is important to each search engine. The search engine guys are constantly tweaking their algorithms to beat the spammers. Google has recently given a much higher ranking to blog results than to regular web pages. This could, and will, change as time goes on.

Spammers and SEOs have had the hardest time beating the page link ranking. Most search engines have adopted the PageRank system made famous by Larry Page of Google. Other search engines call it link analysis, link authority, hub and spoke analysis, or some other name, but it all boils down to the same thing. The importance of a page is determined by the number, and quality, of links to it. The quality of a link is, in turn, determined by the number and quality of links to the page it comes from. As you can see, this is an iterative process and very compute-intensive.
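
The iterative nature of link ranking is easier to see in code. What follows is a small sketch of the commonly published PageRank iteration, not the production algorithm of any search engine. The damping factor of 0.85, the fixed number of iterations and the handling of pages with no outgoing links are the usual textbook choices and should be read as assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate page importance from a link graph.

    `links` maps each page name to the list of pages it links to.
    Returns a dict mapping page name to a rank; the ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal importance
    for _ in range(iterations):
        # Every page gets a small baseline, then collects shares from its inbound links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

In a three-page web where "a" and "b" both link to "c" and "c" links back to "a", page "c" ends up with the highest rank, "a" comes second because its one inbound link is from an important page, and "b", with no inbound links, comes last. Each pass touches every link once, which is why this gets expensive on an index with billions of pages.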

Whole books have been written on this topic, and there are many companies (SEOs) that do nothing but consult on how to improve your search engine results. So, I will not spend any more time on it here. My main interest today (Saturday) was to share with you "The Long Tail of Words" and the effect it has on search results. I will be back to my normal topics on Monday :-)

Comments

Pretty interesting post - I've tried digesting it and re-writing it on webmetricsguru.com. Most of what you're saying is already known, but the way you say it, and what the ranking factors for on-page attributes are based on, are somewhat new to me. Not sure how I'd actually apply it, but it's something that should be thought about before writing copy.

I worry that striving for SEO placement ends up dumbing down headlines - they're what catch our attention - not an SEO optimised headline. Or story.

The thin end of bland?

Very interesting.
Does it mean that using simple language will negatively influence page ranking? If so, webmasters will tend to use abstruse language to raise ranking - not good for common readers.

The word ranking has to do with how search engines rank results based on the words in the query. For example, if I entered "where do I find entrepreneur capital?" in a search engine, the word "entrepreneur" would be the most important word to match. The word "capital" would be second, and the rest of the words are almost irrelevant.

Web pages with the word "entrepreneur" in the title, headline, or in bold font, are likely to be ranked very high in the search results.

Word score is just one factor, the most basic, of many in ranking search results. The Oxford word list is a convenient way to illustrate how word scoring works. The SEOs have much more sophisticated methods to optimize search results.
