Estimating vocabulary size from novels

14 messages over 2 pages: 1 2  Next >>
patrickwilken
Senior Member
Germany
radiant-flux.net
Joined 4536 days ago

1546 posts - 3200 votes 
Studies: German

 
 Message 1 of 14
21 November 2014 at 11:19pm | IP Logged 
I haven't found any good vocabulary size tests for German online so I decided to waste a bit of time yesterday trying to estimate how many words I know. I took a somewhat random set of ten books I own and counted how many words I did not know in the first 1000 words of each book with the following results:

98.7% American Gods by Neil Gaiman
98.6% The Spy Who Came in from the Cold by John le Carré
98.3% Blicke windwärts by Iain M. Banks
98.0% Das fünfte Zeichen by Jo Nesbø
97.6% Die Vermessung der Welt by Daniel Kehlmann
97.5% Das Paradies ist anderswo by Mario Vargas Llosa
97.4% Bonita Avenue by Peter Buwalda
97.2% Just Kids by Patti Smith
97.1% Solar by Ian McEwan
97.1% Sputnik Sweetheart by Haruki Murakami

Average words known: 97.8%
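
The average can be reproduced from per-book counts. A minimal sketch in Python, assuming each percentage corresponds to a whole number of unknown words out of 1000 (so 98.7% known means 13 unknown words). The counts are back-computed from the percentages listed above, not recorded tallies, and the names are only illustrative.

```python
# Unknown words per 1000-word sample, back-computed from the listed
# percentages (e.g. 98.7% known -> 13 unknown). These are assumed values,
# keyed by author surname for brevity.
SAMPLE_SIZE = 1000
unknown_counts = {
    "Gaiman": 13,
    "le Carré": 14,
    "Banks": 17,
    "Nesbø": 20,
    "Kehlmann": 24,
    "Vargas Llosa": 25,
    "Buwalda": 26,
    "Smith": 28,
    "McEwan": 29,
    "Murakami": 29,
}


def coverage_percent(unknown: int, sample_size: int = SAMPLE_SIZE) -> float:
    """Percentage of words known in one sample."""
    return 100 * (sample_size - unknown) / sample_size


# Pool all samples: total known words divided by total words read.
total_words = SAMPLE_SIZE * len(unknown_counts)
total_unknown = sum(unknown_counts.values())
average_coverage = 100 * (total_words - total_unknown) / total_words

for author, unknown in unknown_counts.items():
    print(f"{coverage_percent(unknown):.1f}%  {author}")
print(f"Average words known: {average_coverage:.2f}%")
```

With these counts the pooled figure is 225 unknown words in 10,000, or 97.75%, which matches the rounded 97.8% above. Because every sample has the same length, pooling gives the same result as averaging the ten percentages.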

Assuming that German word frequencies are roughly the same as English word frequencies, Paul Nation's word frequency tables suggest that I am somewhere in the 7000-8000 word range.

I was surprised how consistent the numbers generated were across the books.
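
One way to put that consistency in context is to ask how much a 1000-word sample would vary by chance alone. A rough sketch, assuming each word is an independent draw with the same chance of being known. Real text does not behave this way, since unknown words cluster by topic and repeat, so this is only a ballpark figure.

```python
import math


def coverage_standard_error(coverage: float, sample_size: int) -> float:
    """Binomial standard error of a coverage proportion (0..1).

    Assumes independent words, which is a simplification.
    """
    return math.sqrt(coverage * (1 - coverage) / sample_size)


# 0.9775 is the pooled coverage across the ten samples; 1000 is the sample size.
se = coverage_standard_error(0.9775, 1000)
se_points = 100 * se  # convert from a proportion to percentage points
print(f"Standard error per sample: about {se_points:.2f} percentage points")
```

Under that simplification the standard error is roughly half a percentage point per sample, so a spread from about 97.1% to 98.7% across ten books is not far from what sampling noise alone could produce. Longer samples would be needed to separate small differences between books from noise.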

While I take with a grain of salt any absolute figure for numbers of words known, I think this is a useful method for testing relative improvements in whatever language you are learning over time. I'll come back to these books in a year or so and retest them again to see where I am.
4 persons have voted this message useful



Jeffers
Senior Member
United Kingdom
Joined 4912 days ago

2151 posts - 3960 votes 
Speaks: English*
Studies: Hindi, Ancient Greek, French, Sanskrit, German

 
 Message 2 of 14
22 November 2014 at 2:01pm | IP Logged 
That is quite interesting. Do you find that you can easily "read with pleasure", and figure out most of the unknown words by context?

I know of one online German vocabulary test: http://www.itt-leipzig.de/static/startseiteeng.html. It only goes up to 5000 words. I'd be curious to see how you do on the test in comparison to your estimate from books. I have to say, a lot of people on HTLAL complained that they knew a word but just didn't understand the definitions given for it. The tests are monolingual, and some of the circumlocutions used are pretty awkward.
2 persons have voted this message useful



Ari
Heptaglot
Senior Member
Norway
Joined 6585 days ago

2314 posts - 5695 votes 
Speaks: Swedish*, English, French, Spanish, Portuguese, Mandarin, Cantonese
Studies: Czech, Latin, German

 
 Message 3 of 14
22 November 2014 at 2:15pm | IP Logged 
If you have a paper dictionary, you can just do it by counting the number of known words on a random page. Multiply the proportion of known words by the number of words contained in the dictionary. Do more pages for better accuracy.

This is arguably the only thing paper dictionaries are good for. :)
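
The dictionary method above comes down to one proportion and one multiplication. A minimal sketch in Python; the page counts and the 50,000-headword dictionary are made-up numbers for illustration, and the function name is hypothetical.

```python
def estimate_vocabulary(known_per_page, entries_per_page, dictionary_headwords):
    """Scale the proportion of known entries on sampled pages
    up to the whole dictionary."""
    known = sum(known_per_page)
    sampled = sum(entries_per_page)
    if sampled == 0:
        raise ValueError("no dictionary entries were sampled")
    return round(dictionary_headwords * known / sampled)


# Hypothetical sample: three random pages from a 50,000-headword dictionary.
known_per_page = [12, 9, 15]      # entries recognised on each page
entries_per_page = [40, 38, 42]   # total entries on each page
estimate = estimate_vocabulary(known_per_page, entries_per_page, 50_000)
print(f"Estimated vocabulary: about {estimate} headwords")
```

Here 36 of 120 sampled entries are known, so the estimate is 30% of the dictionary. Sampling more pages shrinks the error, and the result depends on which dictionary is used, so comparisons only make sense against the same dictionary.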

Edited by Ari on 22 November 2014 at 2:16pm

4 persons have voted this message useful



patrickwilken
Senior Member
Germany
radiant-flux.net
Joined 4536 days ago

1546 posts - 3200 votes 
Studies: German

 
 Message 4 of 14
22 November 2014 at 2:38pm | IP Logged 
Jeffers wrote:
That is quite interesting. Do you find that you can easily "read with pleasure", and figure out most of the unknown words by context?


I could read any of these books without a dictionary and get most of the text now, which is a little surprising for me as I hadn't realized that my level had crept up this high. Having done this exercise I feel I could now pretty much read any standard novel in a bookstore.

However, some are still much easier than others. The Gaiman and the le Carré were a breeze to read. Nesbø (another crime novel) is also pretty easy. I generally find Murakami easy, so I was a little surprised by the lower rating, but I think that comes from one difficult paragraph (1% is equivalent to 10 unknown words). Of course, grammar/style also makes a big difference -- crime novels are just generally written in a much more straightforward style than more "literary" works. Murakami's style is very straightforward, which is perhaps why I find it so easy, despite the lower vocabulary estimate.

Keep in mind that while 98.5% vs 97.0% doesn't seem like a lot, it means that there are twice as many unknown words in the "97%" book. And I also think that even the "known" words are perhaps more difficult - there are after all words you really know and some that you know but aren't so comfortable with.
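
The doubling is easier to see when coverage is converted to unknown words per 1000. A small sketch of that arithmetic in Python (the function name is only illustrative):

```python
def unknown_per_thousand(coverage_percent: float) -> float:
    """Unknown words per 1000 words of text at a given coverage."""
    return (100 - coverage_percent) * 10


easier = unknown_per_thousand(98.5)  # 15 unknown words per 1000
harder = unknown_per_thousand(97.0)  # 30 unknown words per 1000
ratio = harder / easier

print(f"98.5% coverage: {easier:.0f} unknown words per 1000")
print(f"97.0% coverage: {harder:.0f} unknown words per 1000")
print(f"Ratio: {ratio:.1f}x")
# Average gap between unknown words, in words of running text.
print(f"One unknown word every {1000 / easier:.0f} vs every {1000 / harder:.0f} words")
```

So a 1.5-point drop in coverage means meeting an unknown word roughly every 33 words instead of every 67.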



Overall the numbers seemed to reasonably well track the difficulty of the books for me.

Jeffers wrote:

I know of one online German vocabulary test: http://www.itt-leipzig.de/static/startseiteeng.html. It only goes up to 5000 words.


I tried this and didn't do so well. I am not sure why there was a discrepancy. Perhaps in part because my vocabulary has largely grown from reading novels, not other sources like newspapers.

But like I said I don't really trust this as an accurate estimate of vocabulary, but I do think it would be a helpful way to keep track of progress. It's the sort of thing I would like to do once a year.

Ari wrote:
If you have a paper dictionary, you can just do it by counting the number of known words on a random page. Multiply the proportion of known words by the number of words contained in the dictionary. Do more pages for better accuracy.


Good point. Though I thought Iversen said you get different estimates depending on the dictionary you use (i.e., dictionaries with more headwords tend to give you lower estimates). Also I am not quite sure how to relate words in dictionaries with the sort of word-groups that Nation uses.

But sure, if you use the same dictionary each time it would be a great way of keeping track of progress too.

I do like actually getting a real estimate of how much of a real book I can read though. There is something much more interesting for me about knowing I can read 95% or 98% or 99% of a novel than knowing I know 8000 words, but that's just a personal preference obviously.

Edited by patrickwilken on 22 November 2014 at 3:13pm

1 person has voted this message useful



fiolmattias
Triglot
Groupie
Sweden
geocities.com/fiolma
Joined 6692 days ago

62 posts - 129 votes 
Speaks: Swedish*, English, Arabic (Written)

 
 Message 5 of 14
22 November 2014 at 2:46pm | IP Logged 
patrickwilken wrote:


Average words known: 97.8%

Assuming that German word frequencies are roughly the same as English word
frequencies, Paul Nation's word frequency tables suggest that I am somewhere
in the 7000-8000 word range.


While I found this both interesting and fascinating, one thing is uncertain
here: Paul Nation discusses books at a certain vocabulary level, and to compare
you also need to know how the vocabulary of these books compares to that of
other novels in German. If I take 10 children's books in German I might understand
some 98% as well, but that does not put my German vocabulary in the 7000-8000 word
range :)
Besides the problem you already mentioned about average vocabulary size in books
between German and English, of course.
Nevertheless a very interesting post!
2 persons have voted this message useful



patrickwilken
Senior Member
Germany
radiant-flux.net
Joined 4536 days ago

1546 posts - 3200 votes 
Studies: German

 
 Message 6 of 14
22 November 2014 at 3:07pm | IP Logged 
fiolmattias wrote:

While I found this both interesting and fascinating, one thing is uncertain
here: Paul Nation discusses books at a certain vocabulary level, and to compare
you also need to know how the vocabulary of these books compares to that of
other novels in German. If I take 10 children's books in German I might understand
some 98% as well, but that does not put my German vocabulary in the 7000-8000 word
range :)


The figures I was using were for adult novels. The figures Nation shows are pretty consistent across the adult novels, but they are old Gutenberg books. So there are probably some problems relating them to the books I am using.

This sort of estimate is obviously affected by the sorts of books you read too. If you are a heavy scifi fan you might find yourself understanding more of that genre than of others.

I just tried to pick a reasonable range of adult books that I hadn't read to get an estimate. I was pretty surprised how consistent the estimate was to be honest.

fiolmattias wrote:

Besides the problem you already mentioned about average vocabulary size in books
between German and English, of course.


I vaguely remember seeing a graph years ago that compared English and German word frequencies and the distributions seemed very similar, but I can't find it now.

Overall I think this method is best for estimating relative, not absolute, vocabularies. And it is obviously bound by the books you sample, but that can be a plus. I can say with some confidence now that, of the books I am likely to want to read (any standard novel in a bookstore), I could understand somewhere in the range of 97%-98.5% of the vocabulary, which is quite useful for me.

I think that, despite all the uncertainties around the vocabulary estimates, this indicates that my vocabulary is somewhere in the 7000-8000 range, because: (1) if it were much higher I would be getting closer to 99% in standard novels; (2) if it were much lower I would be getting closer to 95%; and (3) I can now follow movies with a very high level of understanding, which in English takes around 6000-7000 words.

Edited by patrickwilken on 22 November 2014 at 3:15pm

2 persons have voted this message useful



Ari
Heptaglot
Senior Member
Norway
Joined 6585 days ago

2314 posts - 5695 votes 
Speaks: Swedish*, English, French, Spanish, Portuguese, Mandarin, Cantonese
Studies: Czech, Latin, German

 
 Message 7 of 14
22 November 2014 at 4:42pm | IP Logged 
Sounds like you ought to pop that German up to Basic Fluency in your profile!

Basic Fluency - you understand at least 80% of a regular newspaper in your target language and can hold regular conversations about any topic, understanding what people say and getting your point across.
2 persons have voted this message useful



agta
Diglot
Groupie
Poland
Joined 5527 days ago

43 posts - 53 votes 
Speaks: Polish*, English
Studies: German, Italian

 
 Message 8 of 14
22 November 2014 at 6:53pm | IP Logged 
This is a very interesting idea, and I think from now on I will check the percentage for every novel I read. Not for any particular purpose but just out of curiosity.


2 persons have voted this message useful


