64 messages over 8 pages: 1 2 3 4 5 6 7 8 Next >>
luke Diglot Senior Member United States Joined 7160 days ago 3133 posts - 4351 votes Speaks: English*, Spanish Studies: Esperanto, French
| Message 1 of 64 24 March 2005 at 5:05am | IP Logged |
Quote:
A major New York newspaper established they used only 600 words on average in their newspaper daily.
I've heard that, and other sources say you need only a relatively small number of words (depending on the source: 1500, 1000, 600, 500, even 100) to communicate. It doesn't ring true to me.
Michel's citation of a study saying the NY Times only used 600 words has to be wrong in some way. Maybe it's 6000. (I know the CD says 600).
I did this little experiment. I went to usatoday.com (since no subscription is necessary), and grabbed the first headline article, and counted the words in the article. There were 857 distinct words in that one article.
Methodology.
Copy/paste article into a file.
Convert spaces to newlines.
Convert to lower case.
Remove punctuation, other than apostrophes.
Remove numbers.
Do a unique sort on the file.
Count the words.
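The steps above can be sketched in Python. This is only a rough equivalent of the copy/paste-and-sort procedure, not the exact commands used; the function name `distinct_words` and the sample sentence are made up for illustration, and the pattern assumes plain English text (accented letters would be stripped too).

```python
import re

def distinct_words(text):
    """Return the set of distinct word forms in text.

    Mirrors the listed steps: split on whitespace, lower-case,
    strip punctuation (except apostrophes) and digits, then
    keep each remaining form once.
    """
    words = set()
    for token in text.split():                 # one token per whitespace-separated chunk
        token = token.lower()
        token = re.sub(r"[^a-z']", "", token)  # drop punctuation and numbers, keep apostrophes
        if token:                              # tokens that were only digits/punctuation vanish
            words.add(token)
    return words

# Made-up sample text, just to show the counting.
sample = "The cat's toy cost $5. The cat, the dog, and 2 birds played!"
print(len(distinct_words(sample)))  # 9
```

Note that this counts word forms, so "cat" and "cat's" are two different words here.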
For posterity, this was the headline article.
http://www.usatoday.com/news/health/2005-03-23-cover-compounding_x.htm
Unless the 600 word study did something like exclude all proper nouns, conjugations of verbs, plurals, numbers, compound words, abbreviations, and prefix and suffix word variations (words starting with "in", "un", "dis", "anti", etc., or ending in "ly", "ally", "ation", etc.), it's impossible the Sunday NY Times (a big fat newspaper) had only 600 unique words.
I checked "20,000 words in Spanish in 20 minutes" out of the library. It's one of those books that goes the other direction, saying you can have a huge vocabulary by understanding cognate suffix patterns. The truth about the 20,000 words book is you may not know some of the cognates in your native tongue because they are highly scientific or obscure. That said, knowing the magic of cognate suffixes is helpful. When you understand prefixes and suffixes in English, your vocabulary goes up like 10 times.
I guess it's important to define "what is a word?". If you know 15 moods/tenses for a verb for 6 types of people, (I, you, we, y'all, polite and informal), is that 1 word, or 15 * 6 = 90 words? If you know masculine/feminine/plural of an adjective, is that 1 word or 4? There's no question that if you know a lot of variations of a word, you can express ideas more precisely than someone who doesn't know any variations of the same word.
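To make the "1 word or 90?" question concrete, here is a small sketch that counts the same tokens two ways: once by distinct form and once by base word. The `BASE_WORDS` table is a hypothetical hand-made lookup for illustration only; a real count would need a proper dictionary or morphological tool.

```python
# Hypothetical lookup from inflected form to base word (illustration only).
BASE_WORDS = {
    "is": "be", "was": "be", "are": "be",
    "hills": "hill",
}

def count_forms_and_base_words(tokens):
    """Return (number of distinct forms, number of distinct base words)."""
    forms = set(tokens)
    # Forms missing from the table are treated as their own base word.
    base_words = {BASE_WORDS.get(form, form) for form in forms}
    return len(forms), len(base_words)

tokens = ["is", "was", "are", "hill", "hills", "music"]
print(count_forms_and_base_words(tokens))  # (6, 3)

# The verb example from above: 15 moods/tenses times 6 persons.
print(15 * 6)  # 90 forms, but a single base word
```

The same text can therefore yield very different "vocabulary sizes" depending on which of the two counts a study reports.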
Obviously a lot depends on which words you know. If you know the most commonly used words, they are more useful than knowing obscure words. What's obscure and what's not depends on where you use them. The word "hill" may be unusual in a medical book, but it's a very common word. "mandibula" may appear several times in a college nursing textbook, and "music" may never appear.
The word "thorax" is known by a typical 4th grader. Ask someone in their 40s when they last heard the word thorax, and they may say, "in grade school". They know the word though.
Edited by luke on 24 July 2006 at 7:32pm
7 persons have voted this message useful
| administrator Hexaglot Forum Admin Switzerland FXcuisine.com Joined 7331 days ago 3094 posts - 2987 votes 12 sounds Speaks: French*, EnglishC2, German, Italian, Spanish, Russian Personal Language Map
| Message 2 of 64 24 March 2005 at 7:04am | IP Logged |
I have a file of lexemes in Russian sorted by frequency. Lexemes are 'unique' words; that is, for instance, 'is', 'was' and 'are' all count as 'to be' instead of being counted as a different word each time.
With the files I was able to create a graph of frequency versus rank:
This is a very basic lexicographic analysis and only reproduces what you can find, I am sure, in many academic articles.
The result is that:
the 75 most common words make up 40% of occurrences
the 200 most common words make up 50% of occurrences
the 524 most common words make up 60% of occurrences
the 1257 most common words make up 70% of occurrences
the 2925 most common words make up 80% of occurrences
the 7444 most common words make up 90% of occurrences
the 13374 most common words make up 95% of occurrences
the 25508 most common words make up 99% of occurrences
This shows clearly that vocabulary frequency follows both the law of Pareto (80% of occurrences by only 20% of words) and the law of diminishing returns.
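For anyone who wants to build the same kind of table from their own frequency list, the calculation can be sketched as below. This is not the script used for the Russian figures above; the function name `words_needed` and the toy counts are assumptions for illustration. Targets are given as whole percentages so the comparison stays in integer arithmetic.

```python
from collections import Counter

def words_needed(counts, target_percents):
    """For each target percentage of occurrences, return how many of the
    most frequent words are needed to reach it."""
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)  # frequencies, most common first
    remaining = sorted(target_percents)
    result = {}
    covered = 0
    for rank, freq in enumerate(ranked, start=1):
        covered += freq
        # Integer comparison: covered / total >= percent / 100
        while remaining and covered * 100 >= remaining[0] * total:
            result[remaining.pop(0)] = rank
    return result

# Toy corpus: 10 occurrences of 4 distinct words.
toy = Counter("a a a a b b b c c d".split())
print(words_needed(toy, [40, 70, 90]))  # {40: 1, 70: 2, 90: 3}
```

Each extra step of coverage costs more words than the one before it, which is the diminishing-returns pattern in the list above.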
So yes you can probably read any text with only 3000 or 5000 words, but you will always miss some key words. You can't really say that all you need is 3000 words although this certainly gets you to a more or less autonomous stage in your learning, from which you can learn many words by their context.
I hope this helps!
Edited by administrator on 24 March 2005 at 8:31am
9 persons have voted this message useful
| Eric Senior Member Australia Joined 7183 days ago 102 posts - 105 votes Speaks: English* Studies: Spanish, French
| Message 3 of 64 24 March 2005 at 7:30am | IP Logged |
administrator wrote:
The result is that the 75 most common words make up 40% of occurrences.
That's amazing Francois, truly it is.
If a mere 75 words can account for such a high percentage of occurrences, then you can only imagine how, with 600, you could get by in most situations that aren't too specialized.
Luke, some interesting stuff there; unfortunately I am a language pleb and can't answer.
Edited by Eric on 24 March 2005 at 7:37am
1 person has voted this message useful
| ElComadreja Senior Member Philippines bibletranslatio Joined 7193 days ago 683 posts - 757 votes 2 sounds Speaks: English* Studies: Spanish, Portuguese, Latin, Ancient Greek, Biblical Hebrew, Cebuano, French, Tagalog
| Message 4 of 64 24 March 2005 at 11:42am | IP Logged |
Not exactly sure how Russian grammar works, but those 75 words have got to be mostly grammatical words: and, the, but, an, etc.
There's a joke about learning Biblical Greek that goes something like "When you learn the word for 'and', you can read the majority of the Bible."
The fact that I have an (I guess) above-average English vocabulary has been quite helpful. Like, I know that "odious" means hateful, so when I see something like the Spanish "odiar" it's not that big a deal. Oh, but dare I dive off into a language without a large influx of Latin?
Edited by ElComadreja on 24 March 2005 at 12:05pm
1 person has voted this message useful
| luke Diglot Senior Member United States Joined 7160 days ago 3133 posts - 4351 votes Speaks: English*, Spanish Studies: Esperanto, French
| Message 5 of 64 24 March 2005 at 5:00pm | IP Logged |
administrator wrote:
Lexemes are 'unique' words. I hope this helps!
You are awesome! I was unaware of the linguistic term lexemes.
I searched the fine web and found a paper by Mark Davies of Brigham Young University http://www.lingref.com/cpp/hls/7/paper1091.pdf which is about this topic for Spanish in particular.
The paper does some comparisons to earlier studies on English and German too.
Interestingly, the paper distinguishes between fiction, non-fiction, and oral vocabularies. Oral vocabularies are somewhat smaller than written. It suggests a vocabulary of the 4000 most popular word forms would cover 90% of Spanish speech, but you need the 8000 most popular word forms to cover 90% of written texts. He used a very broad source for his sample.
The paper also discusses frequency by word type, i.e. noun, verb, adjective, adverb, modifier, preposition, conjunction. The magic mix would be about 64% nouns, 24% adjectives, 6% verbs, 5% prepositions, conjunctions and modifiers, and 1% adverbs for spoken Spanish.
He also analyzes what percentage of the nouns, verbs, adjectives, etc. make up the most frequent lexemes. One would need to know 30% of all the adverbs, but only 10% of all the verbs, to understand 90% of spoken Spanish.
Some words are very popular in speech but not popular in non-fiction (gustar). Others are popular in non-fiction and not speech (denominar).
Mark Davies will publish a book in the summer of 2005 called "The Routledge Frequency Dictionary of Spanish" with a thorough analysis and the top 6000 words. It looks like it will be a great contribution. The paper is quite interesting.
Edited by luke on 30 August 2006 at 8:30pm
4 persons have voted this message useful
| ProfArguelles Moderator United States foreignlanguageexper Joined 7211 days ago 609 posts - 2102 votes
| Message 6 of 64 24 March 2005 at 7:46pm | IP Logged |
The maddening thing about these numbers and statistics is that they are impossible to pin down precisely and thus they vary from source to source. The rounded numbers that I use to explain this to my students I usually write in a bull's eye target on the whiteboard, but I don't have the computer skills to draw circles in this post, so I will just have to give a list:
250 words constitute the essential core of a language, those without which you cannot construct any sentence.
750 words constitute those that are used every single day by every person who speaks the language.
2500 words constitute those that should enable you to express everything you could possibly want to say, albeit often by awkward circumlocutions.
5000 words constitute the active vocabulary of native speakers without higher education.
10,000 words constitute the active vocabulary of native speakers with higher education.
20,000 words constitute what you need to recognize passively in order to read, understand, and enjoy a work of literature such as a novel by a notable author.
17 persons have voted this message useful
| heartburn Senior Member United States Joined 7162 days ago 355 posts - 350 votes Speaks: English* Studies: Spanish
| Message 7 of 64 24 March 2005 at 8:13pm | IP Logged |
Does anyone have, or know where I can get a lemmatized Spanish word frequency list? I don't want to wait 'til summer.
Edited by heartburn on 24 March 2005 at 8:14pm
1 person has voted this message useful
| heartburn Senior Member United States Joined 7162 days ago 355 posts - 350 votes Speaks: English* Studies: Spanish
| Message 8 of 64 25 March 2005 at 12:30am | IP Logged |
Ok. I spent too long looking, but I came up with some out-of-print titles that look very interesting. They all seem to be available at AbeBooks.
An English, French, German, Spanish Word Frequency Dictionary: A correlation of the first six thousand words in four single language semantic frequency lists
by Eaton, Helen S.
Spanish Key Words: The Basic Two Thousand Word Vocabulary Arranged by Frequency in a Hundred Units with Comprehensive Italian and English Indexes
by Pedro Casal
Arabic Key Words: The Basic Two Thousand-Word Vocabulary Transliterated and Arranged by Frequency in a Hundred Units
by David Quitregard
French Key Words: The Basic Two Thousand Word Vocabulary Arranged by Frequency in a Hundred Units with Comprehensive French and English Indexes
by Xavier-Yves Escande
Italian Key Words: The Basic Two Thousand Word Vocabulary Arranged by Frequency in a Hundred Units with Comprehensive Italian and English
by Gianpaolo Intronati
Edited by heartburn on 25 March 2005 at 1:59am
3 persons have voted this message useful