
Experimenting with French word frequency

s_allard
Triglot
Senior Member
Canada
Joined 5433 days ago

2704 posts - 5425 votes 
Speaks: French*, English, Spanish
Studies: Polish

 
 Message 49 of 55
07 September 2014 at 8:43pm | IP Logged 
emk wrote:
...

One thing really stands out here: Nouns are just brutal. They may only make up 15% of a typical text, but
you need far more nouns than anything else. Seeing this, I'm actually prepared to believe that if you didn't count
the nouns, you could construct a 250–500 word beginner vocabulary that would take you surprisingly far. But
then you need to add over a thousand nouns to get any kind of reasonable coverage.

Or you need to spend a whole lot of time saying, C'est quoi, ce truc? "What's this thing?" Life's pretty miserable when you're missing two or three key nouns in a single conversation. If you don't have "train" or "station" or "ticket", for example, you're going to have a really vague and awkward conversation. "I want to buy a thing to go on a choo-choo! Where is the big place with the choo-choo?" Even if you've mastered the basics of sentence construction, missing nouns can reduce you to pantomime and sound effects in no time.

Let's visit this debate one more time. We see the results of a typical study of vocabulary coverage for a variety of datasets. The results are exactly what has been observed in all such studies across all languages. Because of the very nature of the sampling methodology, to get high coverage across all the samples you need progressively larger and larger vocabularies. Nothing new. One of the conclusions is that "you need over a thousand nouns to get any kind of reasonable coverage."

All this is very true, and French is no different from other languages. The conclusion is that with fewer than 1,000 nouns, you have the situation described above: "If you don't have 'train' or 'station' or 'ticket', for example, you're going to have a really vague and awkward conversation." This is certainly true. I agree.

Suppose we're not interested in buying a train ticket. Suppose we're interested in talking about life in a lycée militaire in France. By coincidence, we have a conversation between two native speakers right here in an earlier post. That conversation uses exactly 94 nouns. It's not about trains, of course.

Let's say we wanted to have a conversation about taking the train from Paris to Marseille. How many nouns would be necessary for that? It all depends on the length of the conversation and the level of detail. Take the same two speakers from the previous conversation, speaking for the same duration. There's a good chance that we would get approximately the same number of nouns.

emk's position is totally coherent and valid. If you take 1,000 nouns and try to talk about all the possible subjects under the sun in French, there is a good chance you'll miss some words. If you want to talk about the grand cru wines of Bordeaux, you'll miss some important words. Maybe you want to talk about the history of foie gras in southwest France, or the construction of French naval vessels in the 18th century. You'll be missing a ton of words.

If you want to combine all three domains, you will need three sets of terms. The more domains you add, the more terms you need. This is a given. We all know this. But do all people in France need to be able to talk about the history of foie gras, the Bordeaux grands crus and 18th-century French naval architecture? No. None of those interest me. I want to talk about the lycée militaire. Then I only need 94 nouns.

If you take the vocabulary set of 94 nouns from this conversation and compare it to emk's database, you will probably get very low coverage. Let's say around 40%. What does this prove? That this conversation is not real? That the people having the conversation can't understand each other? I'll let people draw the conclusions they want.

That pretty much sums up the debate. The number of nouns you need depends on the domains you will be
talking about. You can have a conversation about one domain or 100 domains or more. You then choose the
vocabulary necessary. There's no such thing as a one-size-fits-all 300-word set.

What is very interesting in the statistics emk provides is the coverage of verbs. Verbs are interesting to count
because they are a very stable category. Again, for teachers of French, there is nothing surprising in these
figures.

French has around 13,000 verbs. We see that 2,149 verbs, or around 20% of all verbs, provide 99.5% coverage. The remaining 10,851 or so verbs account for only the last 0.5% of verb occurrences. If you can tolerate slightly less coverage, the number of verbs needed goes down dramatically. For example, at 95%, which is not exactly shabby, the number of verbs needed is 583. I really like the fact that only 63 verbs give you 75%. Although the tables don't show it, if I remember correctly, four verbs account for around 30% of all verb occurrences.
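
For anyone who wants to check cut-offs like these against a frequency list, here is a rough sketch in Python. The frequencies in the example are invented placeholders; in practice you would feed it real lemma frequencies such as Lexique's freqfilms2 values.

Quote:
# Rough sketch: how many of the most frequent verbs are needed to reach a
# given coverage level. The input is a list of (lemma, frequency) pairs;
# the sample data below is invented, not taken from Lexique.

def verbs_needed_for_coverage(verb_freqs, target):
    """verb_freqs: list of (lemma, frequency); target: e.g. 0.75 for 75%."""
    ranked = sorted(verb_freqs, key=lambda pair: pair[1], reverse=True)
    total = sum(freq for _, freq in ranked)
    running = 0.0
    for count, (lemma, freq) in enumerate(ranked, start=1):
        running += freq
        if running / total >= target:
            return count
    return len(ranked)

# Invented frequencies for illustration only.
sample = [("être", 30000.0), ("avoir", 25000.0), ("aller", 12000.0),
          ("faire", 11000.0), ("dire", 9000.0), ("pouvoir", 8000.0)]
print(verbs_needed_for_coverage(sample, 0.75))   # -> 4 with this toy data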

Again, and I don't see why it is necessary to repeat this position, I say that a nice little 300-word set whose structure I've already given should allow us to talk about the lycée militaire, buying a train ticket and maybe even being stopped by the police for a contrôle d'identité, but not about foie gras and the grand cru wines of Bordeaux as well.

I actually would be willing to have a go at providing a dialogue for buying a Paris-Marseille ticket and combining
all three dialogues to see what kind of vocabulary range we get. The more I think about it, the more I might just
do exactly that.

Edited by s_allard on 07 September 2014 at 10:22pm

1 person has voted this message useful



Jeffers
Senior Member
United Kingdom
Joined 4912 days ago

2151 posts - 3960 votes 
Speaks: English*
Studies: Hindi, Ancient Greek, French, Sanskrit, German

 
 Message 50 of 55
07 September 2014 at 9:41pm | IP Logged 
emk wrote:

One thing really stands out here: Nouns are just brutal. They may only make up 15% of a typical text, but you need far more nouns than anything else. Seeing this, I'm actually prepared to believe that if you didn't count the nouns, you could construct a 250–500 word beginner vocabulary that would take you surprisingly far. But then you need to add over a thousand nouns to get any kind of reasonable coverage.


You wouldn't have to add them at once, of course. Nouns can be added as and when needed, and you can even have a bit of control over the conversations you have. You could say, "Tomorrow I'm going to buy a bus ticket, then visit a cafe. So today I need to learn the vocabulary related to taking a bus, and ordering a few items." Of course you can't control where any of these conversations might go, but you can always try to steer conversations back to familiar territory.
1 person has voted this message useful



s_allard
Triglot
Senior Member
Canada
Joined 5433 days ago

2704 posts - 5425 votes 
Speaks: French*, English, Spanish
Studies: Polish

 
 Message 51 of 55
07 September 2014 at 9:57pm | IP Logged 
Jeffers wrote:
emk wrote:

One thing really stands out here: Nouns are just brutal. They may only make up 15% of a typical text, but
you need far more nouns than anything else. Seeing this, I'm actually prepared to believe that if you didn't count
the nouns, you could construct a 250–500 word beginner vocabulary that would take you surprisingly far. But
then you need to add over a thousand nouns to get any kind of reasonable coverage.


You wouldn't have to add them at once, of course. Nouns can be added as and when needed, and you can even
have a bit of control over the conversations you have. You could say, "Tomorrow I'm going to buy a bus ticket,
then visit a cafe. So today I need to learn the vocabulary related to taking a bus, and ordering a few items." Of
course you can't control where any of these conversations might go, but you can always try to steer
conversations back to familiar territory.

That's exactly what I have been saying. Learn the words as you need them.

Edited by s_allard on 07 September 2014 at 10:19pm

1 person has voted this message useful



rdearman
Senior Member
United Kingdom
rdearman.org
Joined 5239 days ago

881 posts - 1812 votes 
Speaks: English*
Studies: Italian, French, Mandarin

 
 Message 52 of 55
01 October 2014 at 12:48pm | IP Logged 
emk wrote:
For those of you who'd like to play around with this data, all my code and data is now available on GitHub.

For those of you who like pretty graphs and data analysis, check out this iPython notebook page. Some highlights:

[graph omitted in this quote]

And the same data as a table:

[table omitted in this quote]

One thing really stands out here: Nouns are just brutal. They may only make up 15% of a typical text, but you need far more nouns than anything else. Seeing this, I'm actually prepared to believe that if you didn't count the nouns, you could construct a 250–500 word beginner vocabulary that would take you surprisingly far. But then you need to add over a thousand nouns to get any kind of reasonable coverage.

Or you need to spend a whole lot of time saying, C'est quoi, ce truc? "What's this thing?" Life's pretty miserable when you're missing two or three key nouns in a single conversation. If you don't have "train" or "station" or "ticket", for example, you're going to have a really vague and awkward conversation. "I want to buy a thing to go on a choo-choo! Where is the big place with the choo-choo?" Even if you've mastered the basics of sentence construction, missing nouns can reduce you to pantomime and sound effects in no time.



OK, after a bit of a learning curve with GitHub I've got the DB up and running. But I seem to have a difference in my figures compared to yours. I am looking for the 98% minus nouns, articles, and pronouns. Deducting from your 98% numbers I get a remaining value of 3947 words.

But when I run my SQL and do a count() I get 5397. So where have I gone wrong? I'm OK with SQL, although I've never used SQLite before.

Quote:


SELECT
--count(mt.lemme),                                                                            
    count(mt.lemme),
    mt.cgram,
    mt.freqfilms2,
    mt.freqfilms2 * 100 / agg.value_sum AS percentage
FROM
    lemme mt
      JOIN ( SELECT
                lemme, cgram, freqfilms2,
                sum(freqfilms2) AS value_sum
               FROM lemme
               -- WHERE freqfilms2 > 12                                                                                       
             GROUP BY lemme
      ) agg ON (mt.lemme = agg.lemme)
WHERE
percentage > 98
AND
mt.cgram != 'NOM' AND mt.cgram not like 'ART%' AND mt.cgram not like 'PRO%'
;

1 person has voted this message useful





emk
Diglot
Moderator
United States
Joined 5535 days ago

2615 posts - 8806 votes 
Speaks: English*, French B2
Studies: Spanish, Ancient Egyptian
Personal Language Map

 
 Message 53 of 55
01 October 2014 at 1:16pm | IP Logged 
rdearman wrote:
OK, after a bit of a learning curve with GitHub I've got the DB up and running. But I seem to have a difference in my figures compared to yours. I am looking for the 98% minus nouns, articles, and pronouns. Deducting from your 98% numbers I get a remaining value of 3947 words.

But when I run my SQL and do a count() I get 5397. So where have I gone wrong? I'm OK with SQL, although I've never used SQLite before.

I'm glad to see you got everything up and running! I hope the process wasn't too horrible.

My table was generated by analyzing each part of speech separately. You can see the whole process in the notebook, in the section titled "Text coverage by part of speech." Part way through the analysis, I switched from raw SQL to Pandas and custom Python code. But you can get the same results using raw SQL:

Quote:
SELECT lemme FROM lemme WHERE cgram = 'ADV' ORDER BY freqfilms2 DESC LIMIT 118;

You'll need to run this for each interesting value of "cgram" in order to reproduce my results. For some parts of speech, you'll need "cgram IN ('AUX', 'VER')" or "cgram LIKE 'PRO%'" to combine related categories. I think the fundamental reason your numbers don't line up with mine is that you took 98% coverage and tried to filter the nouns out, whereas I calculated coverage individually for each part of speech.

Also take a look at the difference between the tables lexique, lemme, and lemme_simple. The lexique table is raw data, the lemme table sums over lemma + part of speech, and the lemme_simple table discards parts of speech completely.
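
If it helps, here is a rough sketch of that per-part-of-speech approach in Python, using the standard sqlite3 module. This is not the notebook code, just an illustration: the table and column names (lemme, cgram, freqfilms2) are the ones discussed above, but the database filename and the 95% target are placeholders you would adjust.

Quote:
# Sketch of the per-part-of-speech idea: for each cgram group, rank lemmas by
# freqfilms2 and count how many are needed to reach a coverage target within
# that group. Illustration only; it assumes the lemme table lives in a local
# SQLite file (the filename below is a guess, adjust it to your copy).

import sqlite3

GROUPS = {
    "NOM": "cgram = 'NOM'",
    "VER+AUX": "cgram IN ('VER', 'AUX')",
    "PRO*": "cgram LIKE 'PRO%'",
    "ADV": "cgram = 'ADV'",
}

def lemmas_needed(conn, where_clause, target=0.95):
    rows = conn.execute(
        "SELECT freqfilms2 FROM lemme WHERE " + where_clause +
        " ORDER BY freqfilms2 DESC"
    ).fetchall()
    total = sum(freq for (freq,) in rows) or 1.0
    running, needed = 0.0, 0
    for (freq,) in rows:
        running += freq
        needed += 1
        if running / total >= target:
            break
    return needed

conn = sqlite3.connect("lexique.sqlite3")   # placeholder path
for name, clause in GROUPS.items():
    print(name, lemmas_needed(conn, clause, target=0.95))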

Also, don't hesitate to fire up the IPython notebook interface and mess directly with the Python code. The learning curve is pretty short, and you can do all kinds of cool stuff just by tweaking my code and running it interactively. Python has tons of awesome libraries for language processing, data analysis and graphing, and IPython is beyond cool.
1 person has voted this message useful



rdearman
Senior Member
United Kingdom
rdearman.org
Joined 5239 days ago

881 posts - 1812 votes 
Speaks: English*
Studies: Italian, French, Mandarin

 
 Message 54 of 55
01 October 2014 at 3:01pm | IP Logged 
It wasn't too bad. The worst part was trying to remember my github password. :)

Python is nice, but I'm an old man and my weapons of choice are C, Perl, and Assembler, in that order. :D

I noticed some odd data:

2e;ADJ;0.04;100.0
58e;ADJ;0.03;100.0

A quick grep through the original file (lexique.txt) showed these values there as well. Any idea what they are? Just an anomaly, perhaps?

EDIT: Referring to: 2e and 58e


Edited by rdearman on 01 October 2014 at 3:06pm

1 person has voted this message useful





emk
Diglot
Moderator
United States
Joined 5535 days ago

2615 posts - 8806 votes 
Speaks: English*, French B2
Studies: Spanish, Ancient Egyptian
Personal Language Map

 
 Message 55 of 55
01 October 2014 at 4:44pm | IP Logged 
rdearman wrote:
I noticed some odd data:

2e;ADJ;0.04;100.0
58e;ADJ;0.03;100.0

Welcome to the wonderful world of natural language processing!

When working with NLP data sets and corpora, there's always going to be some weird dirt in your data. In this case, I bet these tokens are abbreviations for deuxième and cinquante-huitième, which appeared frequently enough in the underlying corpus to qualify for inclusion.

In theory, the right way to deal with these issues is to treat your input data as a probability distribution. If you only see something once or twice, and it doesn't fit in nicely with the other data, maybe it's an artifact. If you only see something once, but it looks like a perfectly normal -er verb, then maybe you can believe it.

Before placing too much faith in any isolated data point, see Shirky's article on why reasoning via syllogisms doesn't work in the real world. The simplest model you can actually get away with is something like naive Bayes with additive smoothing to eliminate absolute probabilities of 0% and 100%. You don't need to dig through all the math: all you really need is an illustrated guide to Bayes' rule, plus a rule which says, "Nothing is ever 100% certain, so I'll change 100% to 99.9%, and 0% to 0.1%." For a purely linguistic justification of why you need to think this way, see Norvig's famous response to Chomsky.
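
To make the smoothing part concrete, here is a toy sketch of additive (Laplace) smoothing. The counts are invented and have nothing to do with the Lexique data; the point is just that adding a small alpha to every count keeps any probability from being exactly 0% or 100%.

Quote:
# Toy illustration of additive (Laplace) smoothing with invented counts.
# Without smoothing, an unseen outcome gets probability 0; with a small alpha
# added to every count, all probabilities stay strictly between 0 and 1.

def smoothed_probs(counts, alpha=1.0):
    """counts: dict of outcome -> observed count (0 is allowed)."""
    total = sum(counts.values()) + alpha * len(counts)
    return {outcome: (count + alpha) / total for outcome, count in counts.items()}

# A token seen 99 times vs. a weird token never seen at all.
observed = {"normal_er_verb": 99, "weird_token": 0}
print(smoothed_probs(observed, alpha=1.0))
# -> roughly {'normal_er_verb': 0.99, 'weird_token': 0.0099}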

Fortunately, you're working with frequency data here, which means that you don't need to deal with low-frequency garbage data explicitly. Instead, you can just ORDER BY freqfilms2 DESC LIMIT ___, which will discard most of your garbage data below the cutoff. Of course, you'll still have other sorts of garbage higher up in the list if the tokenizer thought that aujourd'hui was two words, or whatever.

In short, don't place any more faith in your data than it deserves.

Edited by emk on 01 October 2014 at 4:45pm



3 persons have voted this message useful


