
Experimenting with French word frequency

Language Learning Forum : Specific Languages
55 messages over 7 pages
s_allard
Triglot
Senior Member
Canada
Joined 5433 days ago

2704 posts - 5425 votes 
Speaks: French*, English, Spanish
Studies: Polish

 
 Message 17 of 55
04 September 2014 at 10:21pm | IP Logged 
For those who have been following the narrative, I have the final stats for the text Un lycée pas comme les autres.
The total number of words is 224.

verbs     33 e.g. aller, arriver, avoir, blesser, changer, connaître
nouns     94 e.g. lycée, études, chose, 3e, niveau, envie, l'Armée, etc.
adjectives     29 e.g. militaire, fréquent, quelque, scolaire, admis, sage, etc.
adverbs     20 e.g. bon, très, là, fort, trop, tout, justement, etc.
pronouns     10 e.g. ça, ce, eux, il, ils, je, me, nous, on, vous
connecting words     38 e.g. bonjour, donc, de, au, mais, quand même, pas, quoi, pour, etc.

The numbers come in pretty much as expected. The big difference between this text and the previous one is in
the nouns, as I thought. The readers who are up on their French grammar may question why some words are in
certain categories. I looked at their function in the phrase and not at their spelling. For example, bon can appear
as an adverb as in bon alors or it can be an adjective. Depending on one's analysis the numbers can move
around a bit, but not much.

I haven't really done a comparative analysis of the words in this text relative to that of the previous text. We know
that the verbs are different and that 59 verbs will allow you to read both texts. The pronouns are nearly identical.
I suspect the adverbs and the connecting words are very similar.

I couldn't really guess how many nouns are shared by both texts, probably not that many. How many words of
French would you have to know to follow these two conversations? I would say around 275. This of course does
not change the fact that in one conversation 145 words were used and in the other 224.
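
As a rough check on these figures, the 275 estimate implies a specific amount of overlap between the two texts. The sketch below assumes that 145 and 224 are counts of distinct words and that 275 is the size of the combined vocabulary; the 275 is the estimate given above, not a measured value.

```python
# Inclusion-exclusion on the two conversation vocabularies.
# 145 and 224 are the word counts quoted in the post (assumed to be distinct words);
# 275 is the poster's estimate for the combined vocabulary, not a measurement.
words_text_1 = 145
words_text_2 = 224
estimated_union = 275

# |A ∩ B| = |A| + |B| - |A ∪ B|
implied_shared = words_text_1 + words_text_2 - estimated_union

print(f"Implied shared words: {implied_shared}")
print(f"Share of the first text's vocabulary reused: {implied_shared / words_text_1:.0%}")
```

If the estimate holds, roughly two thirds of the shorter conversation's vocabulary would also appear in the longer one.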
1 person has voted this message useful



s_allard
Triglot
Senior Member
Canada
Joined 5433 days ago

2704 posts - 5425 votes 
Speaks: French*, English, Spanish
Studies: Polish

 
 Message 18 of 55
04 September 2014 at 11:09pm | IP Logged 
emk wrote:
...
At the right hand edge of the graph, we know 1,000 verbs, and if we see a verb in a movie subtitle, there's a
97.64% chance that it's on our list. (We're counting text coverage here.) That's not bad, but I actually hoped that
1,000 verbs would give us better text coverage.

Interestingly, knowing 290 verbs will allow us to identify 9 out of every 10 verbs we encounter in movie
subtitles. That's low enough to be pretty annoying, actually.

These figures are very much in line with all the studies of vocabulary size coverage across all domains. Zipf spoke
about this many years ago. We know that a tiny number of words represent a very large percentage of the uses
and as text coverage increases vocabulary size increases geometrically.

This is all true, but we shouldn't conclude that you need more than 1,000 verbs to watch a movie in French.
Again, we come up to this basic issue of methodology that I call the aggregate word effect. If you look at a large
sample of different texts, the sum of all the individual vocabularies will generate a large number necessary to
provide coverage for all the texts.
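
The aggregate word effect can be illustrated with a toy example. The three mini-texts below are invented for the purpose; the point is only that each text has a small vocabulary while the vocabulary needed to cover all of them keeps growing.

```python
# Toy illustration of the "aggregate word effect": each text uses a small
# vocabulary, but the vocabulary needed to cover every text keeps growing.
# The three mini-texts are invented for the example.
texts = {
    "lycée": "je suis au lycée et je fais des études",
    "cuisine": "je fais la cuisine et je mange des légumes",
    "voyage": "je prends le train et je vais à la mer",
}

per_text = {}     # distinct words in each individual text
combined = set()  # distinct words needed to cover all texts seen so far
for name, text in texts.items():
    vocab = set(text.split())
    per_text[name] = len(vocab)
    combined |= vocab
    print(f"{name}: {len(vocab)} distinct words, {len(combined)} needed so far")
```

No single text needs more than a handful of words, yet the combined figure is about twice the size of any one of them.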

For example, if you want to read Harry Potter and the Philosopher's Stone, you may need 5,000 words. But a
study of the vocabulary of the major works of English literature from the 19th century to the present-day will
conclude that a vocabulary of 30,000 words is necessary to read all these works with enjoyment. Does that mean
that you need 30,000 words to read Harry Potter? Of course not. It's just that with the Harry Potter vocabulary
you won't be able to read all the 19th century literature and some other works.

It all depends on what you want to accomplish. How big a vocabulary do you need to write a book? Do you need
30,000 words? Hell no. Maybe 3,000 will do, depending on your audience and your style. Or if you are Dr. Seuss
you can write a book with 50 words.

This is what I have been trying to show by looking at real conversations by native speakers of French. What I'm
seeing is that within the confines of short conversations, people are using very small vocabularies. For Pete's sake
nobody is saying that all French-speakers use the same 300 words or less all the time. It's just that it is possible
to have a pretty sophisticated conversation with 240 words or even 145 words.
1 person has voted this message useful



Jeffers
Senior Member
United Kingdom
Joined 4912 days ago

2151 posts - 3960 votes 
Speaks: English*
Studies: Hindi, Ancient Greek, French, Sanskrit, German

 
 Message 19 of 55
05 September 2014 at 12:36am | IP Logged 
emk wrote:
The Routledge frequency dictionary uses a lot more newspaper articles (relative to books), and it places belge at only 2,795. This shows one of the central problems with frequency lists: you really need to specify what corpus you're using.

Unfortunately, it's hard to get a corpus of spoken French. They exist, but I think a lot of them are proprietary. This seems to be why the Université de Savoie went with movie subtitles: they might overemphasize crime and drama, but they're still pretty much natural speech.


I'm not sure where you read that about the Routledge dictionary, but the number of newspaper articles relative to books isn't too bad: 2,015,000 words from newspaper stories, 3,000,000 words from newswire stories, and 4,734,000 words from literature. So they are pretty close.

The Routledge dictionary source list is half from "spoken" sources, and the various sources they use are quite interesting. Here is the spoken half of their source table:
Quote:
175,000     Conversations
3,750,000     Canadian Hansard [transcripts from the Canadian parliament]
3,020,000     Misc. interviews/transcripts
1,000,000     European Union parliamentary debates
855,000     Telephone conversations [how did they get Big Brother's database?]
470,000     Theatre dialogue/monologue
2,230,000     Film subtitles


Of course, about 40% of the oral corpus comes from political sources, which would probably account for belge being higher than other sources, among other words such as parlement (756) and parlementaire (1214). But then, if you read or watch any news, these words do come up.
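
The "about 40%" figure can be checked directly from the table quoted above. A small sketch, assuming that only the two parliamentary sources count as "political":

```python
# Word counts copied from the spoken half of the source table quoted above.
spoken_sources = {
    "Conversations": 175_000,
    "Canadian Hansard": 3_750_000,
    "Misc. interviews/transcripts": 3_020_000,
    "European Union parliamentary debates": 1_000_000,
    "Telephone conversations": 855_000,
    "Theatre dialogue/monologue": 470_000,
    "Film subtitles": 2_230_000,
}
# Assumption: only the two parliamentary sources are treated as "political".
political = ("Canadian Hansard", "European Union parliamentary debates")

total = sum(spoken_sources.values())
political_words = sum(spoken_sources[name] for name in political)
political_share = political_words / total
print(f"{political_words:,} of {total:,} words = {political_share:.1%}")
```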

I think the Routledge dictionary has a good theoretical basis for a wide variety of uses of French, but if you just want to have pleasant conversations then maybe it isn't the right source. Unfortunately, it has several shortcomings as a dictionary (mainly: the definitions are too brief, and no female forms of adjectives are given).

Edited by Jeffers on 05 September 2014 at 12:37am

1 person has voted this message useful





emk
Diglot
Moderator
United States
Joined 5535 days ago

2615 posts - 8806 votes 
Speaks: English*, FrenchB2
Studies: Spanish, Ancient Egyptian

 
 Message 20 of 55
05 September 2014 at 12:49am | IP Logged 
s_allard wrote:
Using what we have seen so far, what would a 300-word vocabulary look like? I suggest it would be something
along these lines.

Verbs: 80
Nouns: 130
Adjectives: 20
Adverbs: 30
Pronouns: 15
Connecting words: 25

Now the question of course is what to put in those categories, especially verbs and nouns. For the verbs, the 59
verbs we saw here and 20 others of one's choosing should be a great start. Throw in "manger, boire, laver,
dormir, coucher, etc."

Sounds like a good breakdown to try! Is anyone interested in actually roughing in a few of these categories?

If somebody would like to try, I've multiplied the size of each category by two, and generated a list to use as a starting point. Please feel free to choose words that aren't on these lists, of course, and to remove any random garbage you find.

First, 160 verbs:

Quote:
être avoir aller faire dire pouvoir vouloir savoir voir devoir venir suivre parler prendre croire aimer falloir passer penser attendre trouver laisser arriver donner regarder appeler partir mettre rester arrêter connaître tuer mourir demander comprendre sortir entendre chercher aider essayer revenir plaire jouer finir perdre sentir rentrer vivre rendre tenir oublier travailler écouter manger entrer devenir commencer payer tirer ouvrir changer tomber foutre excuser dormir occuper marcher envoyer apprendre boire garder montrer asseoir porter souvenir prier servir écrire espérer désoler retrouver gagner acheter rappeler lire monter quitter emmener toucher continuer importer manquer raconter répondre sauver retourner rencontrer voler fermer valoir descendre suffire sembler compter marier poser inquiéter bouger apporter décider vendre cacher tourner expliquer battre agir imaginer adorer recevoir jeter pleurer amener promettre mentir utiliser coucher préférer offrir réveiller préparer permettre ramener enlever lâcher choisir conduire calmer chanter disparaître lever présenter accepter revoir casser frapper ignorer couper taire tromper ressembler jurer courir remettre refuser terminer amuser intéresser reconnaître rire pardonner

Here are 260 nouns from films. This list is weird! But I looked at all the high frequency nouns in Routledge's dictionary, and that list was pretty weird, too. Also, this list looks dirtier than the others: there are some adjectives that snuck in, probably when Lexique's part-of-speech tagger misfired. Not sure this will be much use to anybody.

Quote:
chose homme jour femme temps vie dieu fois père peu an fille monde besoin accord ami monsieur enfant heure mère maison gens nuit soir nom bonjour peur maman problème argent main air fils tête coup raison mort amour moment voiture oeil fait question affaire frère travail idée famille truc merci chance histoire minute type porte tout mal mec année mois madame putain part eau sang place personne ville terre semaine gars chambre côté cas mot salut police suite matin mari revoir papa train film garçon corps coeur feu docteur façon nouveau point chien petit guerre genre arme cause endroit ordre reste pied envie fin merde photo droit école chef boulot pays peine livre tour vérité bébé partie jeu instant parent service plaisir soeur lit lieu roi verre aide journée numéro chéri musique faute mariage bureau route café confiance bonsoir compte téléphone rêve copain attention rue lettre fête esprit seigneur flic capitaine âge force pièce cul bras plan prison premier vieux état carte paix président cours ciel âme patron visage médecin rapport avis retour lumière dollar hôpital voix honneur équipe cheval maître avion bout faim pas oncle prix retard cadeau face gueule chemin général bateau million sac seconde erreur soleil voyage con balle cheveu papier sujet table clé agent pouvoir sens message salle effet espèce bois propos camp sorte hôtel début jambe choix sécurité avocat client courant peuple dame journal or loi fond gosse fric situation accident doute scène soldat preuve mer bien silence télé victime pute calme garde meurtre groupe crime colonel secret parole honte seul soirée bon armée être

You suggested 20 adjectives. Here are 80, just in case:

Quote:
tout quoi bon petit sûr seul vrai autre beau juste grand dernier même premier fou prêt jeune vieux désolé gros mauvais heureux ok important joli meilleur nouveau plein cher possible gentil nouvelle dur content pauvre long propre prochain facile malade super difficile sale noir drôle grave simple différent génial libre pareil bizarre froid normal vivant impossible clair faux tranquille blanc passé sérieux amoureux humain triste rouge dangereux certain pire fort mort doux magnifique terrible exact étrange bas chaud spécial sympa

And 60 adverbs:

Quote:
ne pas bien plus non oui ici si là alors très aussi jamais pourquoi encore est-ce que toujours tout maintenant vraiment comment même peut-être trop déjà mieux beaucoup comme ouais vite moins mal demain assez combien tant aujourd'hui tard longtemps seulement enfin là-bas ensemble juste peu loin avant hier plutôt ainsi après bientôt tellement presque dehors d'abord en fort parfois autant

An extra-large helping of 60 pronouns, including some junk. It will probably be necessary to fill in the basic pronoun tables by hand.

Quote:
je tu vous il ça on qui ce me nous elle y t' que te mon moi en le l' ils se toi s' rien ma lui où votre son ton tout sa ta les mes la notre quoi ses tes vos quelqu'un personne un nos cela autre elles tous eux leur leurs celui autres dont une ceux celle ceci

For the final category of connecting words, I'm going to break it down into sets, and throw in all the remaining Lexique categories. There are way more than 25 here, to give list-makers a good selection. There's a little bit of overlap with the pronoun lists, and a random assortment of numbers. I think a vocabulary list should contain enough number words to count from 0 to 1000, at least if shopping or directions are the goal.

Quote:
merci oh ah bon hein eh hé merde tiens pardon salut comment attention allô euh bravo adieu là vive heu hum allo dehors chut hop dommage hélas bye bah ha ho amen aïe ô ciao mince halte yeah zut déjà gare ouah

Quote:
et mais comme quand ou si pourquoi donc comment "parce que" ni puis car sinon puisque soit lorsque or "tandis que" quoique afin que néanmoins cependant

Quote:
la le un les une des du au aux de

Quote:
mon ce ma cette votre son ton sa ta ces mes notre deux quelque ses tes quel vos nos quelle cet tout trois quelques chaque leurs aucune aucun cinq quatre d'autres toute dix six plusieurs tel sept huit certains quels mille telle certaines quelles un cent nulle neuf vingt nul cents la plupart des quinze douze trente toutes telles une tels onze cinquante quarante seize treize quatorze "la plupart de" mienne différents soixante divers tiens tienne diverses maintes certain mien vôtre tien maint aucuns

Looking through these lists, my first reactions are:

- Nouns will be a challenge, because they have a much flatter distribution than verbs, and you never know which ones you'll need. 130 may not be enough to cover basic shopping, transportation and small talk.

- 15 pronouns and 25 connectors seems awfully low, given the lists above.

- Almost all of the other categories are going to require lots of hard decisions. There are just so many important words in each category that I can't choose a short list without feeling like I'm leaving out something essential.

Very little of the vocabulary above is rare or specialized. We don't need to listen to thousands of hours of movies to hear pardonner "forgive", soirée "evening", sympa "nice, likable", autant "as much as" or ceci "this". But each of these words is at or near the end of one of the lists above, and each list is at least twice as large as we need to reach 300 words.

So yeah, any given short conversation requires a couple hundred words. But if we look at all the conversations which might appear on an A2 exam, I think we're going to need broader coverage. Overall, Milton's low-end number of 1,700 for the typical A2 student seems pretty plausible to me, actually. I could maybe believe it's possible to get there with 1,000 words if you chose them very carefully and used them well.

Anyway, I'm having a lot of fun with the Lexique database. This could be useful for quite a few language learning tools, actually. Many thanks to the researchers who released it under a Creative Commons license.

Edited by emk on 05 September 2014 at 12:56am

3 persons have voted this message useful



s_allard
Triglot
Senior Member
Canada
Joined 5433 days ago

2704 posts - 5425 votes 
Speaks: French*, English, Spanish
Studies: Polish

 
 Message 21 of 55
05 September 2014 at 6:16am | IP Logged 
emk wrote:
[...]
Looking through these lists, my first reactions are:

- Nouns will be a challenge, because they have a much flatter distribution than verbs, and you never know which ones you'll need. 130 may not be enough to cover basic shopping, transportation and small talk.

- 15 pronouns and 25 connectors seems awfully low, given the lists above.

- Almost all of the other categories are going to require lots of hard decisions. There are just so many important words in each category that I can't choose a short list without feeling like I'm leaving out something essential.

Very little of the vocabulary above is rare or specialized. We don't need to listen to thousands of hours of movies to hear pardonner "forgive", soirée "evening", sympa "nice, likable", autant "as much as" or ceci "this". But each of these words is at or near the end of one of the lists above, and each list is at least twice as large as we need to reach 300 words.

So yeah, any given short conversation requires a couple hundred words. But if we look at all the conversations which might appear on an A2 exam, I think we're going to need broader coverage. Overall, Milton's low-end number of 1,700 for the typical A2 student seems pretty plausible to me, actually. I could maybe believe it's possible to get there with 1,000 words if you chose them very carefully and used them well.

Anyway, I'm having a lot of fun with the Lexique database. This could be useful for quite a few language learning tools, actually. Many thanks to the researchers who released it under a Creative Commons license.


I have no objections to putting any number of possible words in the various categories that I outlined. The
difficulty with this approach is that outside a very tiny core of high-frequency words, most of the other words are
equally probable. If I look at the list of suggested nouns, I really don't know what to make of that. To me it's
pretty useless.

This is the kind of problem that stems from approaching the question in terms of what one needs for vocabulary
coverage. I prefer to approach the question by looking at what people actually use. I have to say that we have
very few studies of recordings of actual test candidates at the various CEFR levels. We have all kinds of estimates
of what candidates ought to know but nearly no studies of what candidates actually used. For example, the
widely-quoted study by James Milton,

The development of vocabulary breadth across CEFR levels


is not a study of actual test results but a study of estimated vocabulary sizes for each CEFR level, using a
vocabulary size test. To my knowledge, there is no systematic study of actual CEFR speaking-test results.

The only study that offers a glimmer of light in this direction is the well-known study by John Read and Paul
Nation of actual recordings of test results for the IELTS English test.

Lexical dimension of the IELTS speaking test

On page 10 of this study, we see that at the highest level of speaking proficiency, Band 8, the 15 candidates
produced a mean or average of 1491 words but only 408.1 different words or types. I have to say that I don't
understand how the figure in the last column was calculated. If this figure is correct, the most proficient
speakers, the equivalent of our C2 speakers, only used around 408 different words in their test. It
should be pointed out that the speaking test is relatively short and consists of a conversation with an examiner.
The whole thing takes less than 15 minutes.

Regardless of how many thousands of different words the candidates may know and have at their command, the
fact is that they only used 408 of them on average. This is exactly what I would expect. Consider that a typical
speaking rate is about 200 words per minute and the test lasts 15 minutes. If the candidate speaks for around half
this time, that's a little over 7 minutes × 200, or around 1,400 words. Of these words, around 400 are unique words.
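
The distinction between tokens (running words) and types (different words) is easy to compute on any transcript. A minimal sketch, using an invented English sentence and a deliberately naive tokenizer (lowercase, split on non-word characters); the last lines simply take the ratio of the two Band 8 means quoted above.

```python
import re
from typing import Tuple

def count_tokens_and_types(transcript: str) -> Tuple[int, int]:
    """Return (tokens, types) using a naive lowercase word tokenizer."""
    # \w+ matches runs of word characters, including accented letters.
    tokens = re.findall(r"\w+", transcript.lower())
    return len(tokens), len(set(tokens))

# Invented sample sentence, not taken from any test recording.
sample = "Well, I think the weather is nice today, and I think it is nice here."
tokens, types = count_tokens_and_types(sample)
print(f"{tokens} tokens, {types} types")

# Ratio of the Band 8 means quoted from the IELTS study (408.1 types, 1491 tokens).
band8_ratio = 408.1 / 1491
print(f"Band 8 types per token: {band8_ratio:.2f}")
```

A real count would also need a decision on whether inflected forms of the same word are counted separately, which this sketch does not address.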

Let's look at our C2 candidate at the speaking test. If that person speaks for 15 minutes, that's around 3,000
words or tokens. How many different words or types will we see? Can we say around 1,000 - 1,200 based on the idea
that the longer the speaking period, the more different words will be used?

What we see, if it needs repeating, is that the number of types or different words is quite small relative to the total
number of words or tokens spoken. In all likelihood, our C2 candidate will not use more than 1,300 different
words or, let's say, 1,500 to be generous.

This does not mean, and I insist, that the C2 candidate need only learn the first 1,500 words on a frequency list
and is guaranteed to ace the exam. What it does mean is that the C2 candidate does not have to know how to use
10,000 words. Instead, a good mastery of around 1,300 words, well chosen, will certainly do.

Again, the most common objection is that you don't know what words will come up in the test. Suppose you have
to discuss surrogacy in third-world countries and you don't know anything about surrogate mothers, in-vitro
fertilization, etc. What are you going to do with only 1,300 words? Consider first that you'll be given a text about
the subject with all the vocabulary you need on it. Does knowing only 1,300 words mean that you don't
understand a thing, automatically shut up and hang your head in shame? Well if you spent all your time learning
1,300 words and not how to use them, you are dead in the water. But if you have a good command of those
words, I see no reason why you can't give a good account of yourself.

The fundamental difference between myself and emk is that I believe that how one pronounces and connects
the words, fluently and accurately, is just as important as the number of words. As I have said a gazillion
times, I'm not against lots of vocabulary. The mistake I see time and time again is what I call lexical
discouragement. It's not the best term but the idea is that learners feel they need a huge vocabulary to
do anything in a language. I hear, "I can't speak French until I learn 1,000 verbs, so I have to learn to conjugate
three verbs a day for the next year if I want to speak French."

If you spend all your time reading great literature in French or reading Le Monde in order to speak French, you
will get discouraged because there are always new words to learn. Instead, you should concentrate on listening to
real conversations and actually practicing speaking, especially if you are in a French-speaking environment. You
can be up and speaking very quickly.

Edited by s_allard on 05 September 2014 at 6:20am

1 person has voted this message useful



s_allard
Triglot
Senior Member
Canada
Joined 5433 days ago

2704 posts - 5425 votes 
Speaks: French*, English, Spanish
Studies: Polish

 
 Message 22 of 55
05 September 2014 at 6:58am | IP Logged 
This debate makes me think about situations that I have observed quite regularly. A person has applied for a job
that requires bilingualism or, more often, trilingualism for international positions. As part of the preliminary
interview process, the candidate will be called by a language consultant to assess the candidate's foreign
language skills. At this stage, this will be in the form of a 10-15 minute phone call.

The interviewer will ask a series of questions to assess various aspects of proficiency. An appointment is set up
and on that day the consultant calls. The call starts with the usual pleasantries: identification, how are you? how's
the weather? what time is it where you live?, etc.

Right from the beginning, during all these pleasantries, the examiner is getting an idea of the candidate's level.
The way the person answers to "how are you today?" can tell the examiner how things are going to start off. Is
the person stuttering and stumbling or do they come across as articulate and fluent?

As Read and Nation point out in the article I mentioned above, high-proficiency speakers are able to provide
more nuanced and detailed answers in their discussions.

The examiner is not keeping track of the vocabulary being used but they notice mistakes. If the candidate can
talk about the weather in a detailed manner without making mistakes, then half the battle is won.

By the time the interview really begins, the examiner has a good idea where the whole thing is heading. This is
similar to what we see here in the real conversations that I have examined. These conversations are very short
and do not contain many words, but they demonstrate pretty high levels of proficiency. These are native speakers.
If a learner could come anywhere close to speaking like any of these people in the France Bienvenue
conversations, they would probably be at the C2 level.

What happens is that the examiner extrapolates. If someone can talk well about certain topics, this person will be
able to talk in similar fashion about anything. This is why a small vocabulary can work well in a test situation.
1 person has voted this message useful



Jeffers
Senior Member
United Kingdom
Joined 4912 days ago

2151 posts - 3960 votes 
Speaks: English*
Studies: Hindi, Ancient Greek, French, Sanskrit, German

 
 Message 23 of 55
05 September 2014 at 8:19am | IP Logged 
s_allard wrote:

Right from the beginning, during all these pleasantries, the examiner is getting an idea of the candidate's level.
The way the person answers to "how are you today?" can tell the examiner how things are going to start off. Is
the person stuttering and stumbling or do they come across as articulate and fluent?

As Read and Nation point out in the article I mentioned above, high-proficiency speakers are able to provide
more nuanced and detailed answers in their discussions.

The examiner is not keeping track of the vocabulary being used but they notice mistakes. If the candidate can
talk about the weather in a detailed manner without making mistakes, then half the battle is won.


I think it's ironic, but you're really mixing up fluency and proficiency in your argument. In the first paragraph I quoted, you're clearly writing about fluency. But then you write about "high-proficiency speakers" who give "nuanced and detailed answers", and then talking about the weather "in a detailed manner". I agree it doesn't require high levels of vocabulary to be fluent, but these latter two descriptions certainly involve increased vocabulary.

The difference between "talking about the weather" and "talking about the weather in a detailed manner" is vocabulary. A "more nuanced and detailed answer" implies more vocabulary.
6 persons have voted this message useful





emk
Diglot
Moderator
United States
Joined 5535 days ago

2615 posts - 8806 votes 
Speaks: English*, FrenchB2
Studies: Spanish, Ancient Egyptian

 
 Message 24 of 55
05 September 2014 at 1:09pm | IP Logged 
s_allard wrote:
emk wrote:
Looking through these lists, my first reactions are:

- Nouns will be a challenge, because they have a much flatter distribution than verbs, and you never know which ones you'll need. 130 may not be enough to cover basic shopping, transportation and small talk.

- 15 pronouns and 25 connectors seems awfully low, given the lists above.

- Almost all of the other categories are going to require lots of hard decisions. There are just so many important words in each category that I can't choose a short list without feeling like I'm leaving out something essential.


I have no objections to putting any number of possible words in the various categories that I outlined. The
difficulty with this approach is that outside a very tiny core of high-frequency words, most of the other words are
equally probable. If I look at the list of suggested nouns, I really don't know what to make of that. To me it's
pretty useless.

The easiest way for you to convince me that 300 words will take somebody a long way is to actually fill out the list. Claiming a small number of words is enough doesn't make sense unless you actually choose the words. Once the words are picked out, it becomes possible to compare them against a wide variety of typical A2 conversations and see whether they suffice. I'm perfectly willing to help with this project—I can provide lists of candidate words, I can build tools to count the unique words in a corpus, and so on.
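
A tool for comparing a candidate word list against a conversation could be as simple as the following sketch. The function name, the word list and the sample sentence are all invented for illustration, and the tokenizer is naive: a real version would need lemmatization (for example mapping "vais" to "aller"), which is skipped here.

```python
import re
from typing import Set

def token_coverage(word_list: Set[str], text: str) -> float:
    """Fraction of running words in `text` that appear in `word_list`."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in word_list)
    return known / len(tokens)

# Hypothetical mini word list and conversation, invented for the example.
candidate_words = {"je", "tu", "il", "et", "le", "la", "un", "est", "pas", "ne", "train", "à"}
conversation = "Je prends le train à midi et il ne part pas de la gare centrale."

coverage = token_coverage(candidate_words, conversation)
print(f"Coverage: {coverage:.0%}")
```

Running this over a set of typical A2 conversations, with a real 300-word list in place of the toy one, would give the comparison described above.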

I agree that the noun list will be the most difficult. It's too subject-specific and the distribution is too flat. The top 2000 nouns only get you 84% coverage of nouns in running text (I find this to be astonishingly low, compared to 90% coverage with 260 verbs):





Oh, and for what it's worth: I believe you need both a solid vocabulary and high speaking fluency and correctness. And as for picking up the vocab you need from documents during a CEFR exam, DELF B2 oral prompts are normally only a paragraph long, and the student is required to produce a 10 minute presentation and answer questions on the subject for 10 minutes. When the subjects are things like, "Is adolescent rebellion an inevitability?", "Does TV and movie violence cause children to be violent?" and (for writing) "Given a choice, would you rather pay twice as much for your magazine subscription, or have twice as many ads?", you can't get all the vocabulary you need from just a short paragraph. Maybe you can pick up 3 or 4 domain specific words at the most, and if you're going to talk about the subject for 20 minutes, you'd ideally like a dozen domain specific words. This is why I suggest building links between the exam topic and a familiar topic.

Vocabulary is not sufficient for broad proficiency, but it's certainly necessary.

Jeffers wrote:
emk wrote:
The Routledge frequency dictionary uses a lot more newspaper articles (relative to books), and it places belge at only 2,795.

I'm not sure where you read that about the Routledge dictionary, but the number of newspaper articles relative to books isn't too bad: 2,015,000 words from newspaper stories, 3,000,000 words from newswire stories, and 4,734,000 words from literature. So they are pretty close.

I should have been a bit more clear: The Routledge corpus is actually quite nicely balanced, though I still think it favors writing and political speech over conversation. My only major complaints about this corpus are:

1. It only contains 5000 words.
2. It doesn't provide inflectional forms.
3. It's not available as raw data under an open license.

As you can see from the charts above, 5000 words isn't enough to get anywhere near 99.5% of varied real-world text.



And for those who are here for numbers and graphs, here's a combined chart, covering all parts of speech in the Lexique database:

Code:
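-- Lots of ugly queries like this one, which gives the share of film
-- frequency covered by the 250 most frequent lemmas:
select sum(freqfilms2)/907213 from
(select * from
    -- collapse inflected forms into one row per lemma
    (select lemme,
             sum(freqfilms2) as freqfilms2,
             sum(freqlivres) as freqlivres
       from lexique group by lemme)
    order by freqfilms2 desc limit 250);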
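
For anyone who wants to reproduce this kind of number, here is a rough Python sketch that runs the same style of query through the standard `sqlite3` module for several cutoffs. The table contents are made up, the `film_coverage` helper and the `film_total` alias are hypothetical names, and the denominator is computed from the table rather than hard-coded, which is an assumption about what the constant in the query above represents.

```python
import sqlite3

# Toy stand-in for the Lexique table: several inflected forms per lemma,
# with made-up frequencies. Column names follow the query above.
rows = [
    ("être", 500.0, 400.0), ("être", 300.0, 250.0),
    ("avoir", 400.0, 350.0), ("aller", 150.0, 100.0),
    ("chose", 100.0, 120.0), ("lycée", 50.0, 30.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table lexique (lemme text, freqfilms2 real, freqlivres real)")
conn.executemany("insert into lexique values (?, ?, ?)", rows)

def film_coverage(top_n: int) -> float:
    """Share of total film frequency covered by the top_n most frequent lemmas."""
    # Assumption: the denominator is the sum over the whole table.
    (total,) = conn.execute("select sum(freqfilms2) from lexique").fetchone()
    (covered,) = conn.execute(
        """
        select sum(film_total) from
            (select lemme, sum(freqfilms2) as film_total
               from lexique group by lemme
              order by film_total desc limit ?)
        """,
        (top_n,),
    ).fetchone()
    return covered / total

for n in (1, 2, 4):
    print(f"top {n}: {film_coverage(n):.2%}")
```

Swapping `freqfilms2` for `freqlivres` would give the book column in the same way.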



Code:
Words    Film     Book
250      76.16%   68.56%
500      82.79%   75.53%
1000     88.39%   82.03%
2000     93.00%   88.16%
4000     96.41%   93.42%
8000     98.55%   97.30%
16000    99.67%   99.73%

As you can see, if you only need 90% coverage to enjoy extensive activities, you can have free run of the bookstore relatively early. If you need 98% coverage to enjoy extensive activities, you'll need to either work with children's books, or pursue a "narrow reading" strategy, by focusing on just a few authors or subjects at first.
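
The thresholds can be read off the table mechanically. A small sketch, using only the tabulated rows (no interpolation between them); the helper name is made up for the example.

```python
from typing import Optional

# Coverage table copied from the post: vocabulary size -> (film %, book %).
coverage_table = {
    250: (76.16, 68.56),
    500: (82.79, 75.53),
    1000: (88.39, 82.03),
    2000: (93.00, 88.16),
    4000: (96.41, 93.42),
    8000: (98.55, 97.30),
    16000: (99.67, 99.73),
}

def smallest_size_for(target_percent: float, column: int) -> Optional[int]:
    """Smallest tabulated vocabulary size whose coverage meets the target.

    column 0 is film subtitles, column 1 is books. Only the tabulated sizes
    are considered; no interpolation between rows is attempted.
    """
    for size in sorted(coverage_table):
        if coverage_table[size][column] >= target_percent:
            return size
    return None

for target in (90.0, 98.0):
    film = smallest_size_for(target, 0)
    book = smallest_size_for(target, 1)
    print(f"{target:.0f}% coverage: film {film}, book {book}")
```

Because the table only doubles at each step, the true thresholds could fall anywhere between a row and the one before it.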

Just as a sanity check, various online vocabulary size estimators claim that my French vocabulary is a bit over 20K words. No, I don't really believe this! But when reading in reasonably familiar genres, my transparent+decipherable rate can exceed 99.5%. This actually corresponds fairly well to the chart above. Please keep in mind that all word counts are fairly fuzzy.

Actually, now that I think about it, this supports my argument about "cheating" and consolidating. If a particular learner needs 98% vocabulary coverage to tackle native materials, then they're looking at learning ~8,000 words before they get free rein of the bookstore and the movie channel. This represents a truly horrifying number of Anki cards, or a completely implausible number of graded readers. Thus, a learner would be well served by all the tricks mentioned in the other thread: Narrow reading, familiar subjects, pop-up dictionaries, native materials with pictures, children's materials, and so on. By using these techniques to temporarily boost comprehension, and by then consolidating that comprehension through lots of exposure, it's possible to spend much more time watching TV and much less time drilling vocabulary.


3 persons have voted this message useful



Copyright 2024 FX Micheloud - All rights reserved
No part of this website may be copied by any means without my written authorization.