Experimenting with French word frequency (Specific Languages) Language Learning Forum

Experimenting with French word frequency
Tags: Word Frequency | French

55 messages over 7 pages

emk
Diglot
Moderator
United States
Joined 5320 days ago
2615 posts - 8806 votes

Speaks: English*, French^B2
Studies: Spanish, Ancient Egyptian

Message 1 of 55

04 September 2014 at 1:00pm | IP Logged

While searching for French frequency data, I found the rather cool Lexique project from the Université de Savoie. They provide an open source database of over 46,947 head words and 142,694 inflected forms. It includes two sets of frequency data: frequency in books, and frequency in movie subtitles. The latter makes for a pretty reasonable spoken corpus.

One drawback of this database is that it can't connect the various parts of speech—it treats aimer, aimant and aimé as separate words. In an effort to fix that, I fed it through the Snowball stemmer, which is often used to implement computerized word search. But that produced rather weird results, including:

Quote:

tu,tuer,tuant,tué
te,tee,tes
dan,dans
voir,voire,voirie
non,nonante,none,nonne
mon,mons
moi,moyer
comma,comme,commis
venir,venant
sur,suri,surir
bon,bonasse,boni,boniment,bonir,bonne,bonnement,bonnir
parler,parlant,parlante,parlement,parlé
par,para,parer,parait,parement,pars,paré

So, yeah, that's not going to work. Snowball may be the standard stemmer, but it's dumb. Let's stop using Snowball, because we don't want to treat tu "you" and tuer "to kill" as the same word—nor bonasse "sexy woman" and bonne "female servant" for that matter.
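
To make the failure concrete, here is a minimal sketch of a toy suffix stripper. This is not Snowball: the suffix list and the names `toy_stem` and `group_by_stem` are made up for illustration, and the real Snowball rules are far more elaborate. The failure mode looks similar, though. Once endings are stripped, tuer lands on the same stem as tu.

```python
from collections import defaultdict

# Hypothetical, deliberately crude suffix list (NOT the Snowball rules).
SUFFIXES = ("ant", "er", "é", "s", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Group words that share a stem, preserving input order within a group."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(["tu", "tuer", "tuant", "tué", "aimer", "aimant", "aimé"])
for stem, members in groups.items():
    print(stem, members)
```

The three forms of aimer are grouped together, which is what we wanted, but tu, tuer, tuant and tué end up in one group as well, which is exactly the problem.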

So let's try again. Without Snowball to collapse forms like aimer, aimant and aimé, here are the top 300 words of French, based on movie subtitles:

Quote:

être avoir je de ne pas le la tu vous il et un à l' aller les en ça faire on une ce d' pour des dire tout pouvoir qui vouloir mais me nous dans savoir elle du y bien t' voir que plus non te mon au avec moi si devoir s' oui ils comme se venir sur toi quoi ici rien ma lui bon où là suivre pourquoi parler prendre cette votre quand son alors ton chose par croire aimer falloir comment très ou passer penser aussi jamais même attendre petit trouver laisser merci sa autre ta arriver ces donner regarder encore appeler est-ce que peu homme partir mes toujours jour femme temps maintenant notre vie deux mettre rester sans seul arrêter vraiment connaître quelque sûr juste tuer mourir demander peut-être dieu fois oh père comprendre sortir personne an trop vrai chez fille aux monde ami mal après avant besoin accord ses beau monsieur grand enfant entendre voilà chercher heure mieux déjà tes aider mère essayer quel vos depuis quelqu'un beaucoup revenir donc plaire maison gens nuit ah soir nom bonjour jouer finir peur mort parce que perdre maman sentir ouais rentrer nos argent vivre premier problème quelle rendre dernier tenir cet main cela vite moins oublier air salut fils travailler demain tête manger coup écouter raison amour entrer fait devenir nouveau hein commencer merde moment voiture vieux revoir elles payer fou tirer ouvrir oeil changer question tomber assez foutre excuser affaire dormir combien frère travail idée eh puis famille tard truc trois tant souvenir ni tous occuper entre ok marcher chance envoyer aujourd'hui histoire prêt jeune apprendre minute boire garder quelques type porte montrer mec porter asseoir contre pendant attention droit année sous prier meilleur mois lire servir plein madame putain écrire part eau sang place espérer gros désoler

Interestingly, it counts le, la and l' separately. And it expects us to spend some of our top 300 word slots on ah, eh and ouais "yup" (which is only fair in spoken French, I suppose). You may also notice that words like peur and mort are quite high in the list, probably thanks to action movie subtitles. But it's still better than mixing up tu and tuer, or pulling our frequency distribution from Le Monde articles.
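
For anyone who wants to reproduce a ranking like this, here is a rough sketch of the aggregation step. It assumes the Lexique download is a tab-separated file with one row per inflected form, and that the columns are named `lemme`, `cgram` and `freqfilms2`; those names are my assumption, so check them against the header of the file you actually have. The sample data is invented.

```python
import csv
import io
from collections import defaultdict


def top_lemmas(tsv_file, n, lemma_col="lemme", pos_col="cgram", freq_col="freqfilms2"):
    """Sum per-form frequencies by (lemma, part of speech) and return the top n."""
    totals = defaultdict(float)
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        totals[(row[lemma_col], row[pos_col])] += float(row[freq_col])
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Tiny made-up sample standing in for the real file; the numbers are invented.
SAMPLE = (
    "ortho\tlemme\tcgram\tfreqfilms2\n"
    "suis\têtre\tVER\t50.0\n"
    "est\têtre\tVER\t70.0\n"
    "aimé\taimer\tVER\t5.0\n"
    "aime\taimer\tVER\t20.0\n"
    "tu\ttu\tPRO\t60.0\n"
)

print(top_lemmas(io.StringIO(SAMPLE), 2))
```

Keying on the part of speech as well as the lemma mirrors the behaviour described above, where the same spelling with a different part of speech counts as a separate entry.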

These are all terribly useful words, of course. We could probably make some very basic small talk, but we couldn't really go shopping or ask where to find the bathroom.

In the next episode: Improving s_allard's recommended list of 600 words of spoken French.
4 persons have voted this message useful

rdearman
Senior Member
United Kingdom
rdearman.org
Joined 5024 days ago
881 posts - 1812 votes

Speaks: English*
Studies: Italian, French, Mandarin

Message 2 of 55

04 September 2014 at 1:08pm | IP Logged

emk wrote:

It includes two sets of frequency data: frequency in books, and frequency in movie subtitles. The latter makes for a pretty reasonable spoken corpus.

So are you using the intersection of those two lists or are you looking at them independently? Or just looking at film in order to answer/reply/investigate s_allard's link?
1 person has voted this message useful

s_allard
Triglot
Senior Member
Canada
Joined 5218 days ago
2704 posts - 5425 votes

Speaks: French*, English, Spanish
Studies: Polish

Message 3 of 55

04 September 2014 at 1:42pm | IP Logged

I'm not sure that movie subtitles make the best sample set for spoken French or written French. Why go to all this
trouble when there are a number of frequency lists out there that one can use? Here, for example, is one such list:

A French word frequency list

When I get a minute I'll comment on the contents of the list suggested by emk, but I question the strategy of
looking at the top 300 words of frequency lists in general.

What I prefer to do is to look at actual conversations and see how many words are used. When we look at a vast
sample of texts, we always run into that lexical aggregate problem. The issue isn't what the sum of all movie
subtitles says; it's more what one movie says.
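
One rough way to measure this for a single conversation is to count the distinct words in its transcript. The sketch below is only an assumption about how one might do it: the `word_counts` name and the sample sentence are invented, and the tokenizer is deliberately naive (it splits on apostrophes and hyphens, so l'ami becomes two tokens).

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens made of letters only (a simplification)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


# Invented one-line "conversation" used only to show the call.
counts = word_counts("Bonjour, ça va ? Oui, ça va bien.")
print(len(counts), "distinct words out of", sum(counts.values()), "tokens")
```

Run on the transcript of one film or one dialogue, the first number is the vocabulary that conversation actually needed.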

Since I have a class in a few minutes, I'll come back later with results from looking at the words used in
conversations in France bienvenue

2 persons have voted this message useful

Doitsujin
Diglot
Senior Member
Germany
Joined 5108 days ago
1256 posts - 2363 votes

Speaks: German*, English

Message 4 of 55

04 September 2014 at 2:00pm | IP Logged

emk wrote:

One drawback of this database is that it can't connect the various parts of speech—it treats aimer, aimant and aimé as separate words.

Check out the dsl2mobi GitHub website. The website hosts inflection lists for 9 common languages, including French.

For example, the entry for aimer looks like this:

Quote:

aimer: n'aimeront, n'aiment, aimai, c'aimait, n'aimons, n'aimerons, n'aimassent, aimas, n'aimèrent, n'aimions, aimait, n'aimassions, aimais, n'aimerai, j'aimerai, n'aimasses, n'aimeras, j'aimais, c'aima, n'aimais, c'aime, n'aimait, aimions, n'aimâtes, aimasse, t'aimeras, n'aimas, n'aimai, aimerez, n'aimâmes, n'aimiez, s'aima, m'aimasse, s'aime, n'aimaient, aimera, aimassions, aimes, aimez, m'aimerai, m'aime, j'aime, t'aimasses, m'aimai, t'aimes, n'aimât, s'aimât, aimerons, aimeront, aimèrent, n'aimerez, t'aimas, s'aimera, s'aimait, qu'aimai, c'aimât, j'aimai, n'aimasse, aimasses, aimât, aime, aima, aimassent, t'aimais, qu'aimasse, qu'aimerai, aimâmes, aimiez, aimâtes, aimaient, aiment, c'aimera, n'aimes, n'aimez, aimons, n'aimassiez, n'aima, n'aime, m'aimais, qu'aimais, aimassiez, n'aimera, aimeras, aimerai, j'aimasse, qu'aime

This should be more than sufficient to reduce inflected forms to their canonical forms.
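
As a minimal sketch of that reduction, the function below turns lines in the `headword: form, form, ...` layout shown above into a lookup table from inflected form to headword. Whether the files in the repository use exactly this layout is an assumption, and `build_form_index` is a made-up name.

```python
def build_form_index(lines):
    """Map each inflected form to its headword from lines like 'aimer: aimai, aimait'."""
    index = {}
    for line in lines:
        head, sep, forms = line.partition(":")
        if not sep:
            continue  # skip lines that do not look like "headword: forms"
        head = head.strip()
        index.setdefault(head, head)  # the headword maps to itself
        for form in forms.split(","):
            form = form.strip()
            if form:
                # If a form appears under several headwords, keep the first one seen.
                index.setdefault(form, head)
    return index


index = build_form_index(["aimer: aimai, j'aime, aimait"])
print(index["aimait"])
```

A form that never appears in any list simply has no entry in the table, so a lookup with `index.get(form, form)` leaves it unchanged.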
3 persons have voted this message useful

Jeffers
Senior Member
United Kingdom
Joined 4697 days ago
2151 posts - 3960 votes

Speaks: English*
Studies: Hindi, Ancient Greek, French, Sanskrit, German

Message 5 of 55

04 September 2014 at 2:05pm | IP Logged

s_allard wrote:

The wiktionary list looks like a good idea, until you realize they didn't remove
capital letters, so not only does it count "le" and "la" as separate words, it also
counts "Le" and "La" separately.

EDIT:

s_allard wrote:

When I get a minute I'll comment on the contents of the list suggested by emk, but I question the strategy of
looking at the top 300 words of frequency lists in general.

What I prefer to do is to look at actual conversations and see how many words are used. When we look at a vast
sample of texts, we always run into that lexical aggregate problem. The issue isn't what the sum of all movie
subtitles says; it's more what one movie says.

I couldn't quite figure out this objection. You can't learn the words you need for an actual conversation
after having the conversation. So you learn the words most likely to come up in conversation, and then
fake it, or ask for clarification, for the odd words which will turn up in any given conversation.

Edited by Jeffers on 04 September 2014 at 2:12pm
4 persons have voted this message useful

rdearman
Senior Member
United Kingdom
rdearman.org
Joined 5024 days ago
881 posts - 1812 votes

Speaks: English*
Studies: Italian, French, Mandarin

Message 6 of 55

04 September 2014 at 2:26pm | IP Logged

Jeffers wrote:

The wiktionary list looks like a good idea, until you realize they didn't remove
capital letters, so not only does it count "le" and "la" as separate words, it also
counts "Le" and "La" separately.

Those are easy enough to strip out with a one-line command in Linux:
tr '[:upper:]' '[:lower:]' < FR_WORDS.txt | awk '!seen[$0]++'
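
If the list also carries a count next to each word, lowercasing alone is not enough, because the counts for "Le" and "le" need to be added together. Here is a small sketch of that, assuming the list has already been read into (word, count) pairs; the `merge_case` name and the numbers are invented.

```python
def merge_case(entries):
    """Merge (word, count) pairs that differ only by case, keeping first-seen order."""
    merged = {}
    for word, count in entries:
        key = word.lower()  # str.lower also handles accented capitals such as É
        merged[key] = merged.get(key, 0) + count
    return list(merged.items())


# Invented counts, just to show the merge.
print(merge_case([("le", 900), ("Le", 100), ("la", 700), ("La", 50)]))
```

After merging, the list may no longer be in strict frequency order, so it is worth sorting again by count.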

EDIT: Fixed bad grammar.

Edited by rdearman on 04 September 2014 at 2:26pm
1 person has voted this message useful

emk
Diglot
Moderator
United States
Joined 5320 days ago
2615 posts - 8806 votes

Speaks: English*, French^B2
Studies: Spanish, Ancient Egyptian

Message 7 of 55

04 September 2014 at 2:33pm | IP Logged

s_allard wrote:

I'm not sure that movie subtitles make the best sample set for spoken French or written French.

Well, virtually all of the frequency lists that I can find online are based on newspaper articles and books. We could use book data as a basis for these lists, but it actually gives pretty awful results for spoken French. The movie data is much more representative of ordinary dialog, even if mort and tuer and fric are a bit too frequent.

Anyway, you recently wrote:

s_allard wrote:

Since some people are interested in French vocabulary and like to throw around wild figures, it might be
interesting to know how French schools deal with this vocabulary question. The French school system uses
something called l'échelle Dubois-Buyse that specifies according to a series of steps or échelons what words a
student should know starting from l'école primaire (échelons 1 à 7) right up to the lycée (échelons 40 à 42).

Here is a very interesting web page that looks at these questions:

Les 600 mots français les plus utilisés

…

Anybody who masters those 600 words well has an excellent foundation in French.

Well, as you mentioned, this list includes no pronouns, adverbs, articles, demonstratives or other helper words. To help you improve your list, here are 350 words that aren't included in your "excellent foundation" list:

Quote:

je de ne pas le la tu vous il et un à les en ça on une ce pour des qui mais me nous dans elle du y bien que plus non te mon au avec moi si oui ils comme se sur toi ici rien ma lui bon où là pourquoi cette votre quand son alors ton par comment très ou aussi jamais sa ta ces encore peu mes toujours maintenant notre vie deux sans vraiment quelque sûr peut-être oh an trop chez aux ami mal après avant ses voilà mieux déjà tes quel vos depuis quelqu'un beaucoup donc ah nom "parce que" ouais nos quelle cet cela vite moins air demain hein elles fou assez combien eh puis tard trois tant ni tous entre ok aujourd'hui quelques mec contre pendant sous eau eux longtemps hé ensemble leur cas mot seulement voici devant enfin pardon là-bas vers feu car leurs celui autres loin aucun chaque fin coeur dehors aucune tiens hier dont près d'autres plutôt dur ainsi ceux nouvelle jeu derrière con bientôt lit tellement bas presque roi d'abord calme dîner dessus baiser sinon idiot cinq bonsoir rue parfois âge autant quatre cul surtout pire exactement dès partout âme celle pardonner souvent génial normal sûrement hôpital ceci or tôt ailleurs ensuite l'un déranger cadeau amoureux sac bête sauf grâce supposer épouser dix clé six message allô gamin remercier avocat attraper dépêcher vérifier loi américain fric euh accident preuve mer télé victime complètement pute meurtre salaud crime ferme doucement blesser honte dangereux certains dos visite bordel fumer signer ficher diable anniversaire mériter discuter pourtant incroyable connerie riche chacun fier supporter rendez-vous debout public dommage étranger allemand merveilleux absolument regretter mur enfer prouver santé magnifique obtenir vue terrible autour conseil plaisanter vin animal exact déjeuner est virer régler souhaiter mission créer mille chanson pote surveiller simplement recommencer adieu moitié billet spécial personnel sympa vaisseau moi-même anglais coûter intéressant surprise poste vol connard parier mme bière poisson selon 
lieutenant danger appartenir rater ben dedans directeur liberté règle inutile patient banque partager dossier ressentir mignon défendre stupide but bande fatiguer radio prévoir celui-là

Again, there are a few near-duplicates here, such as aucun and aucune. But it's a good starting point for improving your list.
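
For the curious, a comparison like this boils down to a filter over the frequency-ranked list. Here is a minimal sketch; the function name `missing_from` and the five-word sample are made up, and the real inputs would be the ranked subtitle list and the 600-word list.

```python
def missing_from(frequency_ranked, foundation_words, limit):
    """Return up to `limit` ranked words that the foundation list lacks, in rank order."""
    known = {word.lower() for word in foundation_words}
    missing = [word for word in frequency_ranked if word.lower() not in known]
    return missing[:limit]


# Invented mini-lists, just to show the call.
print(missing_from(["être", "je", "de", "aller", "et"], ["être", "aller"], 2))
```

Because the ranked list is walked in order, the result stays sorted from most to least frequent.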

Obviously, you want to include things like je "I" and bien "well" and et "and" and un "a" and mais "but" before you claim somebody has an "excellent foundation." And if you want to go shopping, you probably want coûter "to cost". And it would help to add dix voilà voici devant enfin pardon anglais hier and a whole bunch of other stuff from this list. You could get by without vaisseau. The profanity? It's probably actually pretty useful if you're younger and you want to make casual friends.

But even if we take your 600 word list as a base, we can almost certainly find 200 words on the list above that would be awfully handy even for small talk.

Do you have suggestions about how to take your 600 and the 350 above, and squash them down to 300 words? Ideally, I'd like to be able to satisfy the A2 spoken interaction criteria, which seem like a pretty good standard for basic survival:

Quote:

A2 Spoken Interaction

I can make simple transactions in shops, post offices or banks.
I can use public transport: buses, trains, and taxis, ask for basic information and buy tickets.
I can get simple information about travel.
I can order something to eat or drink.
I can make simple purchases by stating what I want and asking the price.
I can ask for and give directions referring to a map or plan.
I can ask how people are and react to news.
I can make and respond to invitations.
I can make and accept apologies.
I can say what I like and dislike.
I can discuss with other people what to do, where to go and make arrangements to meet.
I can ask people questions about what they do at work and in free time, and answer such questions addressed to me.

According to the Milton study I mentioned earlier, the typical A2 French student knows about 1700 of the most frequent 5000 words. Can we actually whittle that down to a list of 300 and still meet the A2 criteria? Honestly, I just don't see how you can fit everything necessary into 300 words: something's got to give.

s_allard wrote:

You make various claims about 300 words, but you don't seem to mean an actual list of 300 words that we could just go ahead and write down and make into an Anki deck. You mean a constantly changing list of 300 words that sometimes includes vrai and sometimes includes chez and sometimes includes coûter, but the actual makeup of the list can change with every conversation.

But this is sort of the whole point: Even if you try to limit yourself to A2 conversations (small talk, transportation and shopping, basically), you still need a lot of words even within that limited beginner domain.

The only way I'm going to believe that 300 words will get me anywhere useful is if somebody actually writes them down on a list, and forces themselves to make the hard choices. I can't make those choices for you, because whichever list I pick, you're going to see some obvious problems: a list based on newspapers won't be conversational enough. A list based on movies won't include enough shopping. And this is more or less exactly my point: the choices are just too hard when the list is that short, and you can't really even cover the A2 basics without leaving out something important.

So I'm going to put forth a challenge: If you believe that 300 words is enough to get by, please make a list. Make those hard decisions.
5 persons have voted this message useful

emk
Diglot
Moderator
United States
Joined 5320 days ago
2615 posts - 8806 votes

Speaks: English*, French^B2
Studies: Spanish, Ancient Egyptian

Message 8 of 55

04 September 2014 at 2:54pm | IP Logged

rdearman wrote:

Jeffers wrote:

The wiktionary list looks like a good idea, until you realize they didn't remove
capital letters, so not only does it count "le" and "la" as separate words, it also
counts "Le" and "La" separately.

Those are easy enough to strip out with a one-line command in Linux:
tr '[:upper:]' '[:lower:]' < FR_WORDS.txt | awk '!seen[$0]++'

EDIT: Fixed bad grammar.

The Wiktionary list is junk. It lists Belgique as the 93rd most common word in French.

Seriously, the Lexique data set is awesome. It has frequency data for both movies and books, it has parts of speech, it has genders, and it has the common inflectional forms:

Quote:

être: es,est,furent,fus,fusse,fussent,fusses,fussiez,fussions,fut,fûmes,fût,fûtes,sera,serai,seraient,serais,serait,seras,serez,seriez,serions,serons,seront,soient,sois,soit,sommes,sont,soyez,soyons,suis,étaient,étais,était,étant,étiez,étions,été,êtes,être,êtres

Like I said, the only peculiarity of this data set is that it distinguishes words by parts of speech: aimer "to love", aimé "loved" and aimant "loving" are treated as separate lexemes. It's open source, and it's far more useful than most of the garbage lists you'll find online.

Doitsujin wrote:

Check out the dsl2mobi GitHub website. The website hosts inflection lists for 9 common languages, including French.

Now that looks promising. At the very least, it will allow me to collapse related forms in the Lexique data set. Thank you for the link.

EDIT: Drat:

Quote:

No aimé, no aimant. Still, it was a good try!
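
For reference, the collapse I was hoping for looks roughly like the sketch below. It assumes a lookup table from form to headword has already been built from the inflection lists; `collapse` is a made-up name and the frequencies are invented. Any entry that has no row in the table stays separate, which is what happens to aimé and aimant.

```python
def collapse(frequencies, form_index):
    """Sum frequencies of entries that share a headword; unmapped entries stay as they are."""
    totals = {}
    for word, freq in frequencies.items():
        head = form_index.get(word, word)
        totals[head] = totals.get(head, 0) + freq
    return totals


# Invented numbers; the table has no entry for "aimé" or "aimant".
freqs = {"aimer": 100, "aime": 40, "aimé": 10, "aimant": 5}
form_index = {"aime": "aimer"}
print(collapse(freqs, form_index))
```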

Edited by emk on 04 September 2014 at 3:04pm

2 persons have voted this message useful
