How many words to speak? (General discussion) Language Learning Forum

How many words to speak?
Tags: Number of words \| Speaking

emk
Diglot
Moderator
United States
Joined 5530 days ago
2615 posts - 8806 votes

Speaks: English*, French^B2
Studies: Spanish, Ancient Egyptian

Message 257 of 309

22 September 2014 at 8:49pm | IP Logged

s_allard wrote:

I have to say that I find the whole thing so interesting that I might continue reading the whole comic book.

Sounds like fun! Google Translate and popup dictionaries can work very well, provided:

1. You can figure out where the words begin and end (a problem for Chinese and Egyptian).
2. The grammar is at least vaguely analogous to a language you already know (largely within the IE family).

…

I've mentioned several times that going from a 300 word vocabulary to a 2000 word vocabulary makes a huge difference. I thought it would be amusing to illustrate this using a written example. This time I'm going to use the first two paragraphs of Harry Potter à l'école des sorciers, and frequency data from the Lexique book corpus, which is pretty well-tuned for fiction.

First, we'll try it with a 300 word vocabulary. If you want to estimate coverage, there are about 141 words in this text.

Quote:

Most frequent 300 words (Lexique book corpus)

Mr et Mrs Dursley, qui XXXXXXXXXX au 4, Privet Drive, avaient toujours XXXXXXX avec la plus grande XXXXXX qu’ils étaient XXXXXXXXXXXX XXXXXXX, XXXXX pour eux. Jamais XXXXXXXXX n’aurait XXXXXXX qu’ils puissent se trouver XXXXXXXXX dans quoi que ce XXXX d’XXXXXXX ou de XXXXXXXXXX. Ils n’avaient pas de temps à perdre avec des XXXXXXXXX.

Mr Dursley XXXXXXXXX la Grunnings, une XXXXXXXXXX qui XXXXXXXXXX des XXXXXXXXX. C’était un homme grand et XXXXXX, qui n’avait XXXXXXXXXXXX pas de XXX, mais XXXXXXXXX en XXXXXXXX une XXXXXXXXX de belle XXXXXX. Mrs Dursley, XXXXX à elle, était XXXXX et XXXXXX et XXXXXXXXX d’un XXX deux fois plus long que la XXXXXXX, ce qui lui était fort XXXXX pour XXXXXXXXX ses XXXXXXX en XXXXXXXXX par-XXXXXX les XXXXXXXX des XXXXXXX. Les Dursley avaient un petit XXXXXX XXXXXXXX Dudley et c’était à leurs yeux le plus XXX enfant du monde.

Unknown: affirmer, bel, blond, clôture, cou, cou, dessus, diriger, disposer, entreprendre, espionner, étrange, fabriquer, fierté, garçon, habiter, imaginer, impliquer, jardin, massif, merci, mince, moustache, moyenner, mystérieux, normal, parfaitement, perceur, posséder, pratiquement, prénommé, quant, quiconque, regardant, revanche, soit, sornette, tailler, utile, voisin

Even with a popup dictionary and English cognates, this is going to be pretty rough going. What if we jump up to a 1000 word vocabulary, or just a bit bigger than basic English?

Quote:

Most frequent 1000 words (Lexique book corpus)

Mr et Mrs Dursley, qui habitaient au 4, Privet Drive, avaient toujours XXXXXXX avec la plus grande XXXXXX qu’ils étaient XXXXXXXXXXXX XXXXXXX, merci pour eux. Jamais XXXXXXXXX n’aurait imaginé qu’ils puissent se trouver XXXXXXXXX dans quoi que ce XXXX d’étrange ou de XXXXXXXXXX. Ils n’avaient pas de temps à perdre avec des XXXXXXXXX.

Mr Dursley dirigeait la Grunnings, une XXXXXXXXXX qui XXXXXXXXXX des XXXXXXXXX. C’était un homme grand et XXXXXX, qui n’avait XXXXXXXXXXXX pas de cou, mais XXXXXXXXX en XXXXXXXX une XXXXXXXXX de belle XXXXXX. Mrs Dursley, XXXXX à elle, était XXXXX et blonde et XXXXXXXXX d’un cou deux fois plus long que la XXXXXXX, ce qui lui était fort XXXXX pour XXXXXXXXX ses voisins en XXXXXXXXX par-dessus les XXXXXXXX des jardins. Les Dursley avaient un petit garçon XXXXXXXX Dudley et c’était à leurs yeux le plus XXX enfant du monde.

Unknown: affirmer, bel, clôture, disposer, entreprendre, espionner, fabriquer, fierté, impliquer, massif, mince, moustache, moyenner, mystérieux, normal, parfaitement, perceur, posséder, pratiquement, prénommé, quant, quiconque, regardant, revanche, soit, sornette, tailler, utile

This is still pretty rough. Let's try doubling this, to get near a typical B1 passive vocabulary:

Quote:

Most frequent 2000 words (Lexique book corpus)

Mr et Mrs Dursley, qui habitaient au 4, Privet Drive, avaient toujours affirmé avec la plus grande XXXXXX qu’ils étaient parfaitement normaux, merci pour eux. Jamais XXXXXXXXX n’aurait imaginé qu’ils puissent se trouver XXXXXXXXX dans quoi que ce soit d’étrange ou de mystérieux. Ils n’avaient pas de temps à perdre avec des XXXXXXXXX.

Mr Dursley dirigeait la Grunnings, une entreprise qui fabriquait des XXXXXXXXX. C’était un homme grand et XXXXXX, qui n’avait XXXXXXXXXXXX pas de cou, mais possédait en revanche une moustache de belle XXXXXX. Mrs Dursley, XXXXX à elle, était mince et blonde et disposait d’un cou deux fois plus long que la XXXXXXX, ce qui lui était fort XXXXX pour XXXXXXXXX ses voisins en XXXXXXXXX par-dessus les XXXXXXXX des jardins. Les Dursley avaient un petit garçon XXXXXXXX Dudley et c’était à leurs yeux le plus XXX enfant du monde.

Unknown: bel, clôture, espionner, fierté, impliquer, massif, moyenner, perceur, pratiquement, prénommé, quant, quiconque, regardant, sornette, tailler, utile

If we throw in the English/Romance cognates, this is actually decipherable! And now that our list of unknown words is getting smaller, we can see bugs in my software: It thought that la moyenne was a form of moyenner, and it can't derive regardant from regarder. Oh well, maybe in version 2.0. But for now, let's jump up to 5,000 words, which is (IMO) a decent B2 passive vocabulary:

Quote:

Most frequent 5000 words (Lexique book corpus)

Mr et Mrs Dursley, qui habitaient au 4, Privet Drive, avaient toujours affirmé avec la plus grande fierté qu’ils étaient parfaitement normaux, merci pour eux. Jamais XXXXXXXXX n’aurait imaginé qu’ils puissent se trouver impliqués dans quoi que ce soit d’étrange ou de mystérieux. Ils n’avaient pas de temps à perdre avec des XXXXXXXXX.

Mr Dursley dirigeait la Grunnings, une entreprise qui fabriquait des XXXXXXXXX. C’était un homme grand et massif, qui n’avait pratiquement pas de cou, mais possédait en revanche une moustache de belle taille. Mrs Dursley, quant à elle, était mince et blonde et disposait d’un cou deux fois plus long que la XXXXXXX, ce qui lui était fort utile pour XXXXXXXXX ses voisins en XXXXXXXXX par-dessus les XXXXXXXX des jardins. Les Dursley avaient un petit garçon XXXXXXXX Dudley et c’était à leurs yeux le plus bel enfant du monde.

Unknown: clôture, espionner, moyenner, perceur, prénommé, quiconque, regardant, sornette

Here, we can see another bug: perceuse "drill" became the masculine perceur. And if we fix these bugs, and put back in the English cognates, then we can actually guess words like clôture "fence" and sornettes "poppycock, balderdash" from context. So what does this tell us? Using a completely generic vocabulary list based on a large corpus of books, we can plow our way through native fiction with a 2000 word passive vocabulary, and decipher nearly all of it with a 5,000 word passive vocabulary.
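
For anyone who wants to reproduce this kind of experiment, the masking in the examples above can be approximated in a few lines of Python. This is only a rough sketch, not the software used for the examples: RANKED_WORDS is a tiny made-up stand-in for a real frequency list such as Lexique, the function names are invented for illustration, and unlike the examples above it matches surface forms only (no lemmatization) and masks proper names too.

```python
import re

# Hypothetical stand-in for a frequency-ranked word list, most frequent first.
# A real run would load a few thousand entries from a corpus such as Lexique.
RANKED_WORDS = ["de", "la", "et", "un", "les", "était", "avaient", "petit", "garçon"]

# Runs of letters (accented letters included); digits and punctuation split words.
TOKEN = re.compile(r"[^\W\d_]+")


def top_n(ranked, n):
    """Return the n most frequent words as a set for fast lookup."""
    return set(ranked[:n])


def mask_unknown(text, known):
    """Replace every word not in `known` with X's of the same length."""
    def repl(match):
        word = match.group(0)
        return word if word.lower() in known else "X" * len(word)
    return TOKEN.sub(repl, text)


def coverage(text, known):
    """Fraction of running words (tokens) in `text` that are in `known`."""
    tokens = [w.lower() for w in TOKEN.findall(text)]
    if not tokens:
        return 0.0
    return sum(w in known for w in tokens) / len(tokens)


sample = "Les Dursley avaient un petit garçon prénommé Dudley"
known = top_n(RANKED_WORDS, 9)
print(mask_unknown(sample, known))
print(coverage(sample, known))
```

Swapping in a real frequency list and raising n from 300 to 5000 should give the same kind of progression as the quoted examples, though the exact words masked will depend on how the list handles inflected forms.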

It would also be fun to see what happened if you learned every unknown word in chapter one, and how that would affect your comprehension of later chapters. I bet things get much better after 100 pages or so.

Of course, this thread is about conversation. But some of the same issues apply, because unless you like engaging in long, one-sided monologues, you need to understand what other people say back to you. Fortunately native speakers use a smaller vocabulary than native authors, so your coverage ramps up faster.

Looking at the text examples above, and comparing them to the graph here, I can make some estimates for how big a passive vocabulary you need to carry on conversations. Unless otherwise specified, these numbers assume you learn your words before you know what the conversation will be about.

1. 300 to 500 words: You can establish communication if you learn subject-specific vocabulary in advance.
2. 1000 to 1500 words: You can probably manage to talk about a lot of concrete things, using workarounds (A2?).
3. 2000 to 3000 words: You can be broadly competent at real-world tasks in familiar domains (B1?).
4. 5000 to 7000 words: You can debate abstract subjects semi-intelligently if you take your time (B2?).

Of course, vocabulary is only one piece of the puzzle. You'll also need grammar, idioms, a comprehensible accent, and enough fluidity that your conversation partners don't get bored. And past B2, passive vocabulary will become a minor factor in your conversational competence: At that point, it's all about excellent listening comprehension and getting your grammatical accuracy and fluidity up.
5 persons have voted this message useful

s_allard
Triglot
Senior Member
Canada
Joined 5428 days ago
2704 posts - 5425 votes

Speaks: French*, English, Spanish
Studies: Polish

Message 258 of 309

23 September 2014 at 12:21am | IP Logged

How many words are required to read this Harry Potter excerpt? I say 141 words. That's the length of the text.
emk says you need over 5000 words. What gives? It's all a question of perspective.

emk's analysis is typical of these kinds of vocabulary studies. You take the aggregate vocabulary of many
samples to make a frequency list and then apply the frequency list against individual samples. I may be wrong
but I think that statistically, the vocabulary size needed for high coverage will always be higher than the word
count of the individual samples. These numbers do not reflect the vocabulary sizes of the individual works.

In this example, we see that to read a 141-word sample of Harry Potter, the reader has to learn over 5,000
words. In this case 4859 are useless. The interesting question, to which we really don't have an answer, is: How
many words are required to read this book?

The only way to answer that question is to count the words in the Harry Potter book. Let's say that there are 2500
unique words in the book. We thus have to learn over 5000 words to read the 2500 words here. If you want to
read only the Harry Potter series, you probably do not need more words.

If you want to read Harry Potter and Jules Verne and Victor Hugo and Marcel Proust and Michel Houellebecq,
then you need more words.

What is J.K. Rowling's active vocabulary? 2500 words. Her passive vocabulary is much higher of course.

I strongly disagree with emk's conclusions as to what can be done with the various bands of vocabulary sizes for
the reason given above. I think for example that one can have a very intelligent debate on abstract topics with
1000 to 1500 words if you use most of them. emk's numbers are so high because many of those words are
unnecessary in actual conversations.

Here for example is a philosophical debate on the work of Paul Ricoeur. A very high-level abstract debate that I
don't always understand:

Le tournant herméneutique dans l'oeuvre de
Ricoeur

If I get time, I'll transcribe a minute or two. How many unique words are used in this 61-minute debate? I didn't
count, but I doubt that it would be over 1000. According to emk's type of analysis it would probably take over
25000 words to get good coverage of this sample because many of these words are very rare.

One could object: suppose this panel decides to talk about the microbiology of the Ebola virus. Obviously, the
technical words would change. True, but the people here are philosophers. They're talking about Paul Ricoeur
and not Ebola. The vocabulary of Ebola is irrelevant here.

Edited by s_allard on 23 September 2014 at 12:23am
1 person has voted this message useful

emk

Message 259 of 309

23 September 2014 at 2:47am | IP Logged

s_allard wrote:

How many words are required to read this Harry Potter excerpt? I say 141 words. That's the length of the text. emk says you need over 5000 words. What gives?

Yes, if you happen to know exactly the ~89 words which appear in the first two paragraphs, you'll understand those two paragraphs completely. But let's imagine you really enjoy those two paragraphs, and you decide to read the third:

Quote:

Mr et Mrs Dursley, qui habitaient au 4, Privet Drive, avaient toujours affirmé avec la plus grande fierté qu’ils étaient parfaitement normaux, merci pour eux. Jamais quiconque n’aurait imaginé qu’ils puissent se trouver impliqués dans quoi que ce soit d’étrange ou de mystérieux. Ils n’avaient pas de temps à perdre avec des sornettes.

Mr Dursley dirigeait la Grunnings, une entreprise qui fabriquait des perceuses. C’était un homme grand et massif, qui n’avait pratiquement pas de cou, mais possédait en revanche une moustache de belle taille. Mrs Dursley, quant à elle, était mince et blonde et disposait d’un cou deux fois plus long que la moyenne, ce qui lui était fort utile pour espionner ses voisins en regardant par-dessus les clôtures des jardins. Les Dursley avaient un petit garçon prénommé Dudley et c’était à leurs yeux le plus bel enfant du monde.

Les Dursley avaient XXXX ce qu’ils XXXXXXXXX. La XXXXX XXXXX XXXXXXXXXXX qu’ils possédaient, c’était un XXXXXX XXXX ils XXXXXXXXXXX plus que XXXX qu’XX le XXXXXXXX un XXXX. Si XXXXXX quiconque XXXXXX à XXXXXXXX XXXXXX des Potter, ils étaient XXXXXXXXXX qu’ils XX XXen XXXXXXXXXXXX pas. Mrs Potter était la XXXX de Mrs Dursley, mais XXXXXX deux XX XXétaient plus XXXXXX XXXXXX des XXXXXX. En XXXX, Mrs Dursley XXXXXXX XXXXX XX elle était XXXXX XXXXXX, XXX XX XXXX et XXX XXX à XXXX de XXXX étaient XXXXX XXXXXXXX que XXXXXXXX de XXXX ce qui XXXXXXX un Dursley. Les Dursley XXXXXXXXXXX d’XXXXXXXXX à la XXXXXX de ce que XXXXXXXX les voisins XX par XXXXXXX les Potter se XXXXXXXXXX dans XXXX XXX. Ils XXXXXXXX que les Potter, eux XXXXX, avaient un petit garçon, mais ils XX XXavaient XXXXXX XX. Son XXXXXXXXX XXXXXXXXXXX une XXXXXX XXXXXXXXXXXXXX de XXXXX les Potter à XXXXXXXX : XX n’était pas XXXXXXXX que le petit Dudley se XXXXX à XXXXXXXXXX un enfant XXXXX XXXXX-XX.

I don't think we're going to enjoy this book very much.

What happened here? It's exactly the same book, by exactly the same author, discussing exactly the same subject. So why were our ~89 words so successful for the first two paragraphs, and why did they fail so hopelessly for the third paragraph?

The problem is that we focused too much on the paragraphs actually in front of us. To borrow a technical term, our vocabulary suffers from overfitting. We didn't actually learn French, we learned "First Two Paragraphs of Harry Potter French", which turns out to be a really poor tool for tackling the third paragraph of Harry Potter.

The same thing would happen with any of the conversations you posted. We could take a five-minute conversation, split it in half, learn every word in the first 2.5 minutes, and then try to understand the second 2.5 minutes. And even though those second 2.5 minutes are about exactly the same topic as the first 2.5 minutes, we'd see the same thing that we saw above: We'd see total clarity over our training set, followed by total disaster as soon as we stepped outside of it.

This is the central objection to your theory. You can't escape overfitting unless you have a solid base of a couple thousand common words. But once you have that, you can easily learn to talk about new subjects just by learning subject-specific vocabulary. And this is why B1 is such a nice level: You might still suck in an absolute sense, but you can stop sucking at any one particular subject with just a bit of practice.
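
The split-and-test experiment described above can be sketched as follows. This is a minimal illustration, assuming a crude letters-only tokenizer and no lemmatization; TRAIN and TEST are two sentences taken from the excerpt quoted earlier in this thread, and the function names are made up for the example.

```python
import re

# Runs of letters (accented letters included); apostrophes and punctuation split words.
TOKEN = re.compile(r"[^\W\d_]+")


def tokens(text):
    """Lowercased running words of a text."""
    return [w.lower() for w in TOKEN.findall(text)]


def held_out_coverage(train_text, test_text):
    """Learn every word in train_text, then measure token coverage on test_text."""
    known = set(tokens(train_text))
    test_tokens = tokens(test_text)
    if not test_tokens:
        return 0.0
    return sum(w in known for w in test_tokens) / len(test_tokens)


# "Training set": the words we decided to learn.
TRAIN = ("Les Dursley avaient un petit garçon prénommé Dudley "
         "et c’était à leurs yeux le plus bel enfant du monde.")
# "Test set": the next thing we try to read, same book, same author.
TEST = "Mr Dursley dirigeait la Grunnings, une entreprise qui fabriquait des perceuses."

print("coverage on the training text:", held_out_coverage(TRAIN, TRAIN))
print("coverage on the unseen text:  ", held_out_coverage(TRAIN, TEST))
```

With these two sentences the training text is fully covered, while only "Dursley" carries over to the unseen sentence. Real texts are longer and share more function words, so the drop will be less extreme, but the gap between the two numbers is the overfitting effect in question.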
6 persons have voted this message useful

s_allard

Message 260 of 309

23 September 2014 at 5:22am | IP Logged

I'm going to admit that I don't understand overfitting here. I don't see the problem. If a text contains 141 words,
why does it take over 5000 words to understand the text? If you increase the size of the text to 200 words, learn
200 words. I don't follow this idea of splitting a conversation and learning half and then not understanding half.
If a conversation contains 224 unique words, it seems to me that you need to learn 224 words.

I understand fully that if you want to understand all books written in French, you need a huge vocabulary, but if I
want to read a Harry Potter book, why do I need to learn thousands of words that are not in the Harry Potter
books? The same goes for the conversations that I gave. These are native speakers having short conversations
using between around 100 and 250 words. If I want to understand all the conversations, I need hundreds of
words. But to understand a specific conversation, I may only need 100.

Similarly, we saw a scientific discussion in English that required fewer than 300 words. There are many other
discussions on other scientific subjects at that website. It's the same principle. If I want to read all the
discussions, I need a very wide vocabulary. But for this specific discussion I need only 300 words. I'm not
interested in the other subjects. I just don't see how overfitting enters into this.

To me we all experience this phenomenon in our daily lives. We have a vocabulary profile. I do not own a car.
Therefore my automotive vocabulary is limited. I don't have young children, so I don't know the vocabulary
associated with having young children. My interests in professional sports are very limited. I couldn't explain the
basics of American football, basketball or baseball with the right terms. I know the very basics about personal
computers, but I would be hard-pressed to take my computer apart and name the various parts. But I like
cooking, so my cooking vocabulary in French is quite wide. My neighbour is a musician, so she knows a whole lot
of technical terms that escape me completely. Across the hall are three students at McGill University. They have
whole sets of vocabulary for things I know nothing about.

We all have different vocabulary configurations. Our usage depends a lot on our situation, especially work or
studies. Obviously, we share a lot of words, but at the same time, there are major differences.

I don't believe that I speak more than 500 unique words in a day, and even that is a lot. The words will vary from
day to day as I interact with different people in different situations, but so much of my daily activities are
repetitive. Now, winter will be arriving soon and I'll start skiing again. Another vocabulary set will be activated.
How many more different words? 100 or 200. And the words related to summer will disappear.

What I don't understand is this necessity of learning many words that I don't use. Why do I need a solid base of
thousands of words that I don't use?

Edited by s_allard on 23 September 2014 at 5:25am
1 person has voted this message useful

fiolmattias
Triglot
Groupie
Sweden
Joined 6687 days ago
62 posts - 129 votes

Speaks: Swedish*, English, Arabic (Written)

Message 261 of 309

23 September 2014 at 6:05am | IP Logged

s_allard wrote:

The only way to answer that question is to count the words in the Harry Potter book. Let's say that there are 2500
unique words in the book. We thus have to learn over 5000 words to read the 2500 words here. If you want to
read only the Harry Potter series, you probably do not need more words.

This page (http://blog.self.li/post/20854405575/how-to-understand-harry-potter-any-language)
says that the Spanish translation contains 12,000 different words (debatable, of course,
but way more than 2,500), and this book (http://www.amazon.com/Unofficial-Harry-Potter-Vocabulary-Builder-ebook/dp/B0032D97RO)
contains the 3,000 hardest, so Harry Potter must contain more than 2,500 words in total.

I have the feeling that you think it is hard to learn vocabulary, or at least hard for
students. I am lazy about studying vocab, but I spend about half an hour every day and I learn
about 30-40 new words every day (Iversen lists) and I do my spaced repetitions. With a
few lazy days that is still between 10,000 and 15,000 words in a year. I read from day 1
and I speak from day 1 (Hello/Good bye and so on), but I still believe that you need a
large vocabulary to be able to converse, i.e. speak AND understand. But that is not a
problem: learning vocabulary is easy compared to being able to use correct grammar and to
understand native, unscripted speech at full tempo. At least for me...
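
The yearly figure above is easy to check with a few lines of Python. This is just back-of-the-envelope arithmetic: the 30 days off per year is an assumed reading of "a few lazy days" (it is not stated in the post), the function name is made up, and forgetting is ignored entirely.

```python
def words_per_year(words_per_day, study_days=365):
    """New words met in a year at a steady daily rate (no forgetting modeled)."""
    return words_per_day * study_days


# lazy_days = 30 is an assumption for illustration, not a number from the post.
for lazy_days in (0, 30):
    study_days = 365 - lazy_days
    low = words_per_year(30, study_days)
    high = words_per_year(40, study_days)
    print(f"{lazy_days} lazy days: {low} to {high} words")
```

Both cases land inside the 10,000 to 15,000 range quoted above, so the estimate holds up as long as the daily rate does.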
1 person has voted this message useful

robarb
Nonaglot
Senior Member
United States
languagenpluson
Joined 5057 days ago
361 posts - 921 votes

Speaks: Portuguese, English*, German, Italian, Spanish, Dutch, Swedish, Esperanto, French
Studies: Mandarin, Danish, Russian, Norwegian, Cantonese, Japanese, Korean, Polish, Greek, Latin, Nepali, Modern Hebrew

Message 262 of 309

23 September 2014 at 6:31am | IP Logged

emk's analysis is spot-on. The idea is that, if you want to read a Harry Potter book, you could learn the words
that are used in the book and no others. However, despite this reasonably small number of words being enough
to understand a novel, you wouldn't be able to read much else without it being full of XXXX. That's because
the only novel ever written for which the set of words you learned is the set of useful ones is the first Harry
Potter book. All other novels ever written are full of words that don't occur in the book, so you didn't really learn
an adequate French vocabulary. You learned something more like a glossary of the first Harry Potter book, similar
to, say, a Bible scholar who only needs to learn the words of Hebrew that happen to be used in the Hebrew Bible.
Unfortunately, we here are not really trying to be first-two-paragraphs-of-Harry-Potter scholars or this-random-
conversation-between-two-French-guys scholars, so this is not desirable.

The same thing applies to speaking. If you want to prepare a speech about your favorite topic, you will perhaps
need 150 or 200 unique words. You could plan the content of your speech, look up the necessary words, and
convert it into a usable speech. You only need 200 words, and no others. This is awesome if you are a
motivational speaker, and go from town to town making the same monologue to audiences in perfect French. But
the 200 words you learned for your speech are "overfit" to that content. If you want to say anything else, you're
going to be missing lots of words. Even more words than you'd be missing if you learned an equal size set of very
frequent words rather than the set you needed for your speech.

For a more illustrative example of overfitting, suppose you're trying to learn to play Scrabble (the game where
you arrange tiles with letters on them to make words). To win a game, you only need to play about 25 unique
words. Based on that evidence, you might suppose that those 25 words will be enough to be a world champion at
Scrabble. But when you play the next game, you draw letters that don't spell any of the 25 words you know, so
you lose. What happened? You've estimated the usefulness of 25 words as 1, and the usefulness of all the other
words as 0, based on how useful they've been in your small sample. This is all based on the false assumption that
one game provides enough evidence about the usefulness of each word.

s_allard wrote:

To me we all experience this phenomenon in our daily lives. We have a vocabulary profile. I do not own a car.
Therefore my automotive vocabulary is limited. I don't have young children, so I don't know the vocabulary
associated with having young children. My interests in professional sports are very limited. I couldn't explain the
basics of American football, basketball or baseball with the the right terms.

Well OK, but even there you've used "car," "automotive," "children," "sports," "football," "basketball," and
"baseball." No more than three of those are even reasonable candidates for a 300-word shortlist (unless, as
above, you're overfitting on a planned speech about cars, children, and American sports).

emk wrote:

1. 300 to 500 words: You can establish communication if you learn subject-specific vocabulary in advance.
2. 1000 to 1500 words: You can probably manage to talk about a lot of concrete things, using workarounds
(A2?).
3. 2000 to 3000 words: You can be broadly competent at real-world tasks in familiar domains (B1?).
4. 5000 to 7000 words: You can debate abstract subjects semi-intelligently if you take your time (B2?).

This is a little harsh at the high levels, I'd say. Only 5000 unique words are fair game for the highest level of the
Chinese HSK, and those are passive, and you don't need all of them to pass. I would guess that if you had as
many as 7000 words you're comfortable using actively (and corresponding other skills), then a B2-level speaking
test would be absolutely a piece of cake, a C1 very doable, and a C2 not out of the question (but you'd need to
not make grammar mistakes, which is a separate question). Also, for what it's worth you are implying that
concrete language is less demanding than abstract language. While children do acquire them in that order, it
doesn't necessarily apply to adult learners: it's easy for me in French to say things like "there's simply no way to
demonstrate that a political system will work using theory alone; look at communism, it was impossible to
predict what would happen until people tried to carry it out." However, I have no idea how to say "pillowcase,"
"mop," "windowsill" or "headphone jack."

Edited by robarb on 23 September 2014 at 7:03am
2 persons have voted this message useful

luke
Diglot
Senior Member
United States
Joined 7203 days ago
3133 posts - 4351 votes

Speaks: English*, Spanish
Studies: Esperanto, French

Message 263 of 309

23 September 2014 at 8:45am | IP Logged

s_allard wrote:

I'm going to admit that I don't understand overfitting here. I don't see the problem. If a text contains 141 words, why does it take over 5000 words to understand the text? If you increase the size of the text to 200 words, learn 200 words. I don't follow this idea of splitting a conversation and learning half and then not understanding half. If a conversation contains 224 unique words, it seems to me that you need to learn 224 words.

What I don't understand is this necessity of learning many words that I don't use. Why do I need a solid base of thousands of words that I don't use?

Maybe if you were willing to learn a few more words you would understand.
1 person has voted this message useful

robarb

Message 264 of 309

23 September 2014 at 9:20am | IP Logged

Looking at the blog post, it looks like the figure of 12000+ unique words in HP1 was generated by a simple
program that counted unique sequences of letters, which means it counts proper names, magical nonce words,
transparent derivations, inflections, and sound effects. The 3,000 hard words book deals with the entire series,
of which the later books are longer and written at a slightly more advanced reading level. That suggests that the
number of dictionary words you'd need to recognize to get through the book is rather less than 12000, and I feel
it's important to stress that learners who know far fewer than 12000 words, and whose vocabulary is not
preselected to fit Harry Potter, can and should try to read and comprehend the book. After all, emk's 5000-word
frequency list covers enough of the text to render the gist comprehensible, probably enough to enjoy reading the
book even if a word in every thirty is unknown.
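
To make the difference between the two ways of counting concrete, here is a small Python sketch. It is only an illustration under loud assumptions: LEMMAS and PROPER_NAMES are tiny hand-written tables invented for this example (a real count would need a proper lemmatizer and name list), the sample text is made up from phrases in this thread, and none of this is the program used in the blog post.

```python
import re

# Runs of letters (accented letters included).
TOKEN = re.compile(r"[^\W\d_]+")

# Hypothetical hand-written tables, for illustration only.
LEMMAS = {"avaient": "avoir", "avait": "avoir"}
PROPER_NAMES = {"dursley", "potter"}


def naive_types(text):
    """Count the way a simple script would: every distinct letter sequence."""
    return {w.lower() for w in TOKEN.findall(text)}


def dictionary_words(text):
    """Drop proper names and fold inflected forms onto their lemma."""
    return {LEMMAS.get(w, w) for w in naive_types(text) if w not in PROPER_NAMES}


# Made-up sample built from phrases quoted in this thread.
text = ("Les Dursley avaient un petit garçon. Mr Dursley avait un cou. "
        "Les Potter avaient un petit garçon.")

print(len(naive_types(text)), "distinct letter sequences")
print(len(dictionary_words(text)), "dictionary words after cleanup")
```

Even on three short sentences the naive count comes out higher than the cleaned-up one; over a whole novel with hundreds of names and inflected verb forms the gap would presumably be much larger.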

But conversation is a different animal, both in the number of words required (fewer, but still many to be really
good in all situations) and the consequences of not knowing a word (skipping is not an option, but workarounds
are usually possible without resorting to a dictionary).

Edited by robarb on 23 September 2014 at 9:23am

2 persons have voted this message useful
