Physicists Discover That English Has... 1 Million Words
There is some confusion about what "number of words" means for a language. The following collections of words suggest that vocabularies are in the big-data league. The sociobiologist E. O. Wilson has sponsored a project aimed at “documenting all 1.8 million named species of animals, plants, and other forms of life on Earth”. Thomson Scientific, an information publisher, has a dictionary of over 2 million items used in indexing and categorizing its stream of 100,000 documents a week. The GeoNames gazetteer includes over 8.5 million toponyms for 6.5 million geographical places, with 2 million alternate names in up to 200 languages. IBM had a database called GNR (Global Names Reference) for handling information on 1 billion names, such as spelling conventions and gender associations for names from several hundred language cultures. (Editor)
Christopher Shea writes in the WSJ that physicists studying Google's massive collection of scanned books claim to have identified universal laws governing the birth, life course and death of words. Their paper gives the best-yet estimate of the true number of words in English — a million, far more than any dictionary has recorded (the 2002 Webster's Third New International Dictionary has 348,000), with more than half of the language considered 'dark matter' that has evaded standard dictionaries.