
Estimating vocabulary size from novels



Iversen
Super Polyglot
Moderator
Denmark
berejst.dk
Joined 6706 days ago

9078 posts - 16473 votes 
Speaks: Danish*, French, English, German, Italian, Spanish, Portuguese, Dutch, Swedish, Esperanto, Romanian, Catalan
Studies: Afrikaans, Greek, Norwegian, Russian, Serbian, Icelandic, Latin, Irish, Lowland Scots, Indonesian, Polish, Croatian
Personal Language Map

 
 Message 9 of 14
24 November 2014 at 1:44pm | IP Logged 
patrickwilken wrote:
Ari wrote:
If you have a paper dictionary, you can just do it by counting the number of known words on a random page. Multiply the proportion of known words by the number of words contained in the dictionary. Do more pages for better accuracy.


Good point. Though I thought Iversen said you get different estimates depending on the dictionary you use (i.e., dictionaries with more headwords tend to give you lower estimates). Also, I am not quite sure how to relate words in dictionaries to the sort of word groups that Nation uses. But sure, if you use the same dictionary each time it would be a great way of keeping track of progress too.


I use paper dictionaries to estimate my vocabulary, but there are some caveats. The most obvious is the size of the dictionary: if there are only 5,000 words in the dictionary, then your estimate must obviously stay below 5,000 words no matter how advanced you are. On the other hand, a dictionary with 300,000 words will drown you in words not even an educated native speaker will know - and seeing mainly unknown words in a dictionary may even lull you into a 'fiasco mode' where you overlook words you actually have seen before.

Another problem is that not only languages, but also dictionaries differ in the way they treat compounds and derivations. For instance, some dictionaries separate homonymous nouns and verbs into separate entries, others don't. Some list fixed two-word expressions in the same way as single words, others mix them in with examples and expressions. Derivation patterns can also be used in a more or less predictable way, which raises the question: should you count the results as separate words or as inflections of one word?

For these (and a number of other) reasons I have come to the conclusion that it is more meaningful to look at percentages than at absolute numbers. If I know two thirds of whatever I count in a French dictionary, then that can be compared directly with the third of the items I know in an Indonesian one, even though I may have 70,000 words in my French dictionary and only 10,000 in my Indonesian one, and even though I count different things in the two dictionaries. The percentages seem to be fairly stable across small and midsized dictionaries; only very big dictionaries (or midsized dictionaries in weak languages) tend to yield lower percentages.

However, a given percentage doesn't guarantee that I also know an equal percentage of the expressions which are quoted in a dictionary - and even less that I know how to use those words actively.
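
A minimal sketch of the page-sampling arithmetic described above: count the known headwords on a few random pages, divide by the total headwords sampled, and scale by the dictionary's headword count. The tallies and the 70,000-headword figure below are invented for illustration; the percentage is the number that stays comparable across dictionaries of different sizes.

```python
# Sketch of the dictionary page-sampling estimate (all numbers invented).
def estimate_vocabulary(known_per_page, total_per_page, dictionary_headwords):
    """Return (known fraction, absolute estimate) from per-page tallies."""
    fraction = sum(known_per_page) / sum(total_per_page)
    return fraction, round(fraction * dictionary_headwords)

# Hypothetical tallies from five random pages of a 70,000-headword dictionary:
fraction, estimate = estimate_vocabulary(
    known_per_page=[21, 18, 25, 19, 22],
    total_per_page=[30, 28, 35, 27, 30],
    dictionary_headwords=70_000,
)
print(f"known: {fraction:.0%}  ->  roughly {estimate} words")
```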

The book-based statistics approach the vocabulary question from a totally different angle which is just as relevant, and I'm actually thinking about compiling some similar statistics myself - though in my case it would be more relevant to use texts about popular science, because that is what I mostly read.

If you take two texts in a given language, there is a limited number of very common words which are present in both texts, a limited number of less common words which are also shared (more or less by accident), and a LOT of words which aren't used in either. So the proportion of rare words in almost any kind of book or magazine will be much lower than the percentage they take up in a dictionary. Couldn't we then just skip the dictionary-based wordcounts and be happy with the higher percentages from book-based texts? No, because the dictionary-based wordcounts represent your ability to understand words across ALL books and texts in a language, and it is purely a mathematical quirk based on word frequencies that makes these numbers lower than the percentages you get with any specific text.

But strictly speaking there are two book-based statistics to take into account: the coverage by words actually used, and the proportion of known unique words (or word families) among those used in the same books. The latter number would be lower because it doesn't overrepresent the most common words, but it would probably still be higher than the dictionary-based figures, because there are lots of words in a dictionary which you wouldn't expect to find in any given book, no matter how big it is or how learned its author is.
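
A rough sketch of the two book-based statistics distinguished above - coverage of the running words (tokens) versus the share of unique words (types) that are known. It assumes a naive tokenizer and a plain set of known word forms; a real count would need lemmatisation or word-family grouping.

```python
# Token coverage vs. type coverage for a text, given a set of known words.
# Tokenisation is deliberately naive (letter runs, lower-cased).
import re

def text_statistics(text, known_words):
    tokens = re.findall(r"[^\W\d_]+", text.lower())   # running words (tokens)
    types = set(tokens)                                # unique words (types)
    token_coverage = sum(t in known_words for t in tokens) / len(tokens)
    type_coverage = len(types & known_words) / len(types)
    return token_coverage, type_coverage

# Tiny illustrative example:
known = {"the", "count", "castle", "i", "did", "not", "want", "to", "run"}
sample = "I did not want to run around the rooms of the castle."
print(text_statistics(sample, known))   # token coverage > type coverage
```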


Edited by Iversen on 24 November 2014 at 1:56pm

2 persons have voted this message useful



patrickwilken
Senior Member
Germany
radiant-flux.net
Joined 4536 days ago

1546 posts - 3200 votes 
Studies: German

 
 Message 10 of 14
26 November 2014 at 1:24pm | IP Logged 
Readlang vs Word Counts

I started using Readlang in the last week. It has the potentially nice feature that it gives a rough estimate of the reading difficulty of EPUB-formatted books you upload.

I thought I would compare the relative rankings from my word count estimates with the reading difficulty estimated by Readlang. The Readlang feature is still in beta and it was unable to parse two of the books, but here is the data for the eight I was able to upload:

[table comparing my word-count rankings with Readlang's difficulty rankings for the eight books]
By far the biggest outlier is the Murakami, which I personally felt was pretty easy - Readlang agrees (my ranking 8 vs Readlang's 2) - but in which I missed quite a few words.

So overall, while this isn't a perfect proof of Readlang's algorithm, it suggests to me that it is tapping into something real.

One problem with this estimate is that the relative difficulty of all the books I estimated is fairly similar. Out of curiosity I threw in the seven Harry Potter books, as the books are meant to get more difficult as the series progresses. I certainly remember feeling a big jump between the first four and the latter three.

[chart of Readlang difficulty estimates for the seven Harry Potter books]
Here the data is actually surprisingly clean, suggesting that indeed HP does get harder as the series progresses, and that Readlang is giving some sort of useful estimate.

As Readlang estimates are very quick to generate, it might be worth using them if you are trying to get a rough sense of the difficulty of books before reading them.
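
Readlang's actual scoring isn't described in this thread, so the following is only a home-made stand-in, not its algorithm: rank books by the share of running words that fall outside a chosen top-N frequency list. The function names and the frequency list are assumptions made for the sake of the sketch.

```python
# Crude difficulty proxy (NOT Readlang's algorithm): share of running words
# outside a frequency list; a higher score suggests a harder book.
import re

def difficulty_score(text, frequent_words):
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sum(t not in frequent_words for t in tokens) / len(tokens)

def rank_books(books, frequent_words):
    """books: {title: text}. Returns titles sorted from easiest to hardest."""
    return sorted(((title, difficulty_score(text, frequent_words))
                   for title, text in books.items()),
                  key=lambda pair: pair[1])
```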

Edited by patrickwilken on 26 November 2014 at 1:31pm

2 persons have voted this message useful



smallwhite
Pentaglot
Senior Member
Australia
Joined 5311 days ago

537 posts - 1045 votes 
Speaks: Cantonese*, English, Mandarin, French, Spanish

 
 Message 11 of 14
26 November 2014 at 2:11pm | IP Logged 
Earlier this year, I was reading 4 Spanish novels at the same time. Dracula had the highest number of unknown words, but I found it the easiest to read. That was because the sentences were often very detailed and long-winded, with many redundant words, such that missing a word or two didn't really affect understanding. For example, one of the sentences read, "Then, I had desire to read, because I did not want to run around the rooms of the castle without obtaining the corresponding permission of the Count". Half of that could be unknown and you'd still understand the sentence...
1 person has voted this message useful



patrickwilken
Senior Member
Germany
radiant-flux.net
Joined 4536 days ago

1546 posts - 3200 votes 
Studies: German

 
 Message 12 of 14
26 November 2014 at 2:37pm | IP Logged 
smallwhite wrote:
Earlier this year, I was reading 4 Spanish novels at the same time. Dracula had the highest number of unknown words, but I found it the easiest to read. That was because the sentences were often very detailed and long-winded, with many redundant words, such that missing a word or two didn't really affect understanding.


I have had the same experience. I read the German translation of Jack Kerouac's The Dharma Bums earlier in the year and was surprised how easy it was to read, basically because, as you say, there were a lot of words that were not mission-critical.


1 person has voted this message useful



s_allard
Triglot
Senior Member
Canada
Joined 5433 days ago

2704 posts - 5425 votes 
Speaks: French*, English, Spanish
Studies: Polish

 
 Message 13 of 14
26 November 2014 at 3:12pm | IP Logged 
With all due respect to Iversen, I must admit that I couldn't really grasp the full meaning of his last post without some concrete examples of how to use a dictionary to estimate one's receptive vocabulary size.

There are a number of fundamental methodological issues in play here. One issue is how one defines what it means to know a word. Here are some possible interpretations:

1. I recognize the word because I've seen it before.
2. I know at least one definition or use of the word.
3. I know what it means in the present context.
4. I feel confident that I could use the word if necessary.
5. I have actually used the word in my speaking and writing.

This point has recently come to my attention as I have been watching the British television series Downton Abbey. I am struck by how many words and expressions I have had to look up in the dictionary, not because I didn't recognize the words, but because I didn't really understand their meaning or use in context. Luckily, I could simply pause the video to use the dictionary.

For example, a character says, "It's a sprat to catch a mackerel", meaning to make a small expenditure in order to secure a large gain. I knew all the individual words but I had never heard the expression. Similarly, I had to look up the word "footman", which I thought I knew, but I realized that I didn't know its precise meaning in this context.




1 person has voted this message useful





Iversen
Super Polyglot
Moderator
Denmark
berejst.dk
Joined 6706 days ago

9078 posts - 16473 votes 
Speaks: Danish*, French, English, German, Italian, Spanish, Portuguese, Dutch, Swedish, Esperanto, Romanian, Catalan
Studies: Afrikaans, Greek, Norwegian, Russian, Serbian, Icelandic, Latin, Irish, Lowland Scots, Indonesian, Polish, Croatian
Personal Language Map

 
 Message 14 of 14
26 November 2014 at 4:26pm | IP Logged 
When I make my dictionary-based wordcounts I use a mild version of the definitions/translations test: can I give a translation of any given word, or at least define what it means? It may seem that this can't work when you look into a dictionary where there already is a translation (or an explanation, for monolingual dictionaries), but I do my tests by separating all the headwords on a page into known, 'something in the middle' and unknown words, and writing them down on a piece of paper in three different colours. Thanks to this system I can check the accuracy by running through the supposedly known words later and verifying that I actually do know them - even without their context in the dictionary. And as far as I can see the results are reliable when tested in this way.

I have a supplementary rule of thumb: if I would need a word to express something and my first guess would be the word in the dictionary exactly as it stands, then I count the word as known - even if I can't actually remember having seen or used it myself. If the correct word is similar, but not quite the same, then this word is counted as a borderline case and placed in the middle group. This test would obviously be harder if I started out from a Danish (or English) word and was asked to produce a word in Greek or Serbian, but my goal is to estimate my passive vocabulary, not the active one, so using a Target->Base language dictionary is the right thing to do.
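
As a sketch of the bookkeeping behind this three-way page test (known / 'something in the middle' / unknown): since the post doesn't say how the borderline words should enter the final figure, they are simply reported as their own share here, and the sample tallies are invented.

```python
# Tally known / borderline / unknown headwords over sampled dictionary pages.
from collections import Counter

def summarise(pages):
    """pages: list of per-page lists of 'known' / 'borderline' / 'unknown'."""
    totals = Counter()
    for page in pages:
        totals.update(page)
    n = sum(totals.values())
    return {label: totals[label] / n for label in ("known", "borderline", "unknown")}

# Hypothetical classifications for two dictionary pages:
pages = [
    ["known"] * 18 + ["borderline"] * 4 + ["unknown"] * 8,
    ["known"] * 20 + ["borderline"] * 3 + ["unknown"] * 7,
]
print(summarise(pages))   # known ~0.63, borderline ~0.12, unknown 0.25
```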

That's one case. Now for the other case, which is estimating the size of your vocabulary based on concrete texts.

If you test word comprehension in genuine texts you will probably ask yourself: do I know (or rather NOT know) the meaning of this word in this particular context? And mostly the context will be a help rather than a distraction, so in principle you should find fewer totally incomprehensible words when you use the context. In contrast, there isn't much context in a dictionary, and I do my best not to read the translations/explanations before I have classified each headword. So the results from text-based counts of unknown words can't be directly compared to the results from dictionary-based enquiries.

But that doesn't mean that it is a less valid estimate: if people report that they understand all but 15 words on page 437 of tome VII of Harry Potter in some language, then that's an extremely relevant claim because it is based on a realistic situation. The problem is that you have to know the total number of words on that page to get a percentage, and you need to go through many pages to get a reliable estimate of your unknown-words ratio across different text types. And unless you can get your results for free by using electronic resources, it is unlikely that you will ever bother to do those statistics. But if you can get the numbers, they will be a good indicator of your reading capabilities.
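
The page-by-page arithmetic itself is trivial once the per-page totals are known - which is exactly where electronic texts help. A sketch with invented counts:

```python
# Aggregate "unknown words per page" into an overall percentage.
def unknown_ratio(pages):
    """pages: list of (unknown_words, total_words) per sampled page."""
    unknown = sum(u for u, _ in pages)
    total = sum(t for _, t in pages)
    return unknown / total

sampled = [(15, 320), (9, 298), (22, 341), (11, 305)]   # invented counts
print(f"unknown words: {unknown_ratio(sampled):.1%} of the running text")
```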

Edited by Iversen on 26 November 2014 at 4:34pm



1 person has voted this message useful


