10 messages over 2 pages: 1 2 Next >>
Raistlin Majere Trilingual Hexaglot Senior Member Spain uciprotour-cycling.c Joined 7152 days ago 455 posts - 424 votes 7 sounds Speaks: English*, Spanish*, Catalan*, FrenchA1, Italian, German Studies: Swedish
| Message 1 of 10 23 August 2005 at 12:38pm | IP Logged |
Zipf's law is an empirical rule (we know that it works, but not why) that applies to all natural languages. It says that the ratio between the number of times a given word appears and the number of times the next-commonest word appears is always constant. It was formulated by the linguist George Kingsley Zipf.
That would mean, using random numbers as an example, that if we picked up a long book and analysed it...
the commonest word would appear 100,000 times
the second commonest word would appear 98,000 times
the third commonest word would appear 96,040 times
the fourth commonest word would appear 94,119 times
the fifth commonest word would appear 92,223 times
and so on; every word appears 98% as much as the previous word. This percentage changes from language to language and also amongst individual speakers of the same language, but there is always a constant proportion.
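The example above is a geometric progression. A minimal Python sketch, assuming a starting count of 100,000 and a ratio of 0.98 (both are just the illustrative figures from this post, and the function name is made up), generates such a list:

```python
def constant_ratio_counts(first_count, ratio, n_words):
    """Return n_words counts, each `ratio` times the previous one."""
    counts = []
    current = float(first_count)
    for _ in range(n_words):
        counts.append(round(current))  # round to whole occurrences
        current *= ratio
    return counts


print(constant_ratio_counts(100_000, 0.98, 5))
```

Computed this way, the fifth value comes out at about 92,237, slightly higher than the fifth figure quoted above; the principle being illustrated is unchanged.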
Zipf's law can be used, among other things, to tell "real" languages from artificial ones, and to check whether a piece of text was written by one author or by several.
What's amazing about it is that nobody knows why it works, but it does.
Edited by Raistlin Majere on 23 August 2005 at 12:40pm
| Nephilim Diglot Senior Member Poland Joined 7145 days ago 363 posts - 368 votes Speaks: English*, Polish
| Message 2 of 10 23 August 2005 at 2:02pm | IP Logged |
That would mean, using random numbers as an example, that if we picked up a long book and analysed it...
the commonest word would appear 100,000 times
the second commonest word would appear 98,000 times
the third commonest word would appear 96,040 times
the fourth commonest word would appear 94,119 times
the fifth commonest word would appear 92,223 times
and so on; every word appears 98% as much as the previous word. This percentage changes from language to language and also amongst individual speakers of the same language, but there is always a constant proportion.
Raistlin,
Have you really thought this through? How long would the ‘long book’ have to be to generate such numbers? Let's say the most common words in the book are as follows:
1. the
2. of
3. to
4. and
5. a
Note these words are taken from the link at the end of this post.
According to your rule, ‘the’ being the most common would occur 100,000 times; ‘of’ would appear 98,000 times; ‘to’ would appear 96,040 times; ‘and’ would appear 94,119 times and ‘a’ would appear 92,223 times. This means that these five words alone would come to 480,382 words. I guess a novel would probably have at least a couple of thousand different words, all of different frequencies. How long would the book have to be with, say, three thousand words at different frequencies? I just don’t get this at all. Do you mean percentages perhaps? Please explain what you mean here, as I'm well and truly lost on this one.
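As a rough check on the arithmetic in this post, the sketch below totals the five example counts and then, taking the 98% rule literally, uses the geometric-series sum to estimate the length of a book with 3,000 distinct words. The helper name is made up, and the 3,000-word vocabulary is only the figure suggested above.

```python
def total_words(first_count, ratio, n_distinct):
    """Sum of a geometric series: first_count * (1 - ratio**n) / (1 - ratio)."""
    return first_count * (1 - ratio ** n_distinct) / (1 - ratio)


# The five counts as quoted in this thread.
example_counts = [100_000, 98_000, 96_040, 94_119, 92_223]
print(sum(example_counts))

# Estimated book length for 3,000 distinct words at a constant 0.98 ratio.
print(round(total_words(100_000, 0.98, 3000)))
```

Under those assumptions the total levels off at roughly five million words, because the counts shrink so quickly that words beyond the first few hundred add almost nothing.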
http://esl.about.com/library/vocabulary/bl1000_list1.htm
Edited by Nephilim on 23 August 2005 at 2:05pm
| Raistlin Majere Trilingual Hexaglot Senior Member Spain uciprotour-cycling.c Joined 7152 days ago 455 posts - 424 votes 7 sounds Speaks: English*, Spanish*, Catalan*, FrenchA1, Italian, German Studies: Swedish
| Message 3 of 10 23 August 2005 at 2:31pm | IP Logged |
Sorry;
When I wrote "100,000", "98,000" and so on, I was only giving an example; the same principle applies to much smaller numbers such as, say, 600, 588, 576 and so on. I agree that the numbers I gave above seem quite large (indeed, how long would such a book be!).
However, you didn't have to take those values literally. What I meant is that there is always a constant proportion between the frequency of a word and the frequency of the next commonest word.
Following your example, if "the" appears X times as often as "of", then "of" will appear X times as often as "to", which will in turn appear X times as often as "and", and so on (X is always the same value).
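Written as a formula, the rule described in this post is a geometric progression. Here $f_k$ stands for the count of the $k$-th commonest word and $X$ for the constant ratio (the symbol $f_k$ is introduced only for illustration):

```latex
\[
  \frac{f_k}{f_{k+1}} = X
  \quad\Longrightarrow\quad
  f_k = \frac{f_1}{X^{\,k-1}}, \qquad k = 1, 2, 3, \dots
\]
```

For comparison, Zipf's law as it is usually stated says something different: the count is roughly inversely proportional to the rank itself,

```latex
\[
  f_k \approx \frac{f_1}{k^{s}}, \qquad s \approx 1,
\]
```

so the second commonest word appears about half as often as the first and the third about a third as often, and the ratio between neighbouring words is not constant.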
Edited by Raistlin Majere on 23 August 2005 at 2:33pm
| administrator Hexaglot Forum Admin Switzerland FXcuisine.com Joined 7376 days ago 3094 posts - 2987 votes 12 sounds Speaks: French*, EnglishC2, German, Italian, Spanish, Russian Personal Language Map
| Message 4 of 10 23 August 2005 at 3:33pm | IP Logged |
Raistlin, thanks for bringing up this most interesting topic. Does Zipf's law apply to lexemes (i.e. 'to be' as one entry instead of 'am', 'is', 'are', etc.) or to actual words?
I have an Excel spreadsheet with the 30,000 most common words and their number of occurrences, taken from a rather comprehensive Russian corpus of texts, if you feel like trying it and making a graph to test the law. It only lists lexemes, though; if actual words are needed, there are several other such lists on the web for other languages.
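One way to try such a list, sketched here in Python under some assumptions: the counts have already been exported from the spreadsheet into a plain list of positive numbers (the export step is not shown, and the function names are made up for this example). The sketch computes the ratio between each count and the next, which would be roughly constant under the rule described earlier in this thread, and the slope of log(count) against log(rank), which would be close to -1 for the rank-based form of Zipf's law usually found in the literature.

```python
import math


def neighbour_ratios(counts):
    """Ratio of each count to the one ranked just above it."""
    ordered = sorted(counts, reverse=True)
    return [later / earlier for earlier, later in zip(ordered, ordered[1:])]


def log_log_slope(counts):
    """Least-squares slope of log(count) against log(rank), ranks from 1.

    Needs at least two counts, all of them positive.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Made-up counts that fall off as 1/rank, only to show the calls.
sample = [12000 / rank for rank in range(1, 51)]
print(neighbour_ratios(sample)[:3])
print(log_log_slope(sample))
```

Plotting log(count) against log(rank) would show the same thing as the slope: a roughly straight line if the rank-based form holds.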
| Raistlin Majere Trilingual Hexaglot Senior Member Spain uciprotour-cycling.c Joined 7152 days ago 455 posts - 424 votes 7 sounds Speaks: English*, Spanish*, Catalan*, FrenchA1, Italian, German Studies: Swedish
| Message 5 of 10 23 August 2005 at 3:51pm | IP Logged |
It applies to actual words, not different lexemes ("am, is, be, been" count all as one).
And it is best seen in a text, not in the language as a whole, as the value of X may vary between one speaker and another.
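To look at a single text, the words first have to be counted. A minimal sketch using only the Python standard library; the tokenising rule (lower-cased runs of letters, so apostrophes and hyphens split words) is a simplifying assumption, and the function name is made up:

```python
import re
from collections import Counter


def ranked_word_counts(text):
    """Return (word, count) pairs, commonest first."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common()


sample = "the cat and the dog and the bird"
for rank, (word, count) in enumerate(ranked_word_counts(sample), start=1):
    print(rank, word, count)
```

This counts surface forms, so "am" and "is" would be separate entries; grouping them into one lexeme would need a lemmatiser, which is not shown here.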
Zipf's Law
Edited by Raistlin Majere on 23 August 2005 at 3:52pm
| fanatic Octoglot Senior Member Australia speedmathematics.com Joined 7146 days ago 1152 posts - 1818 votes Speaks: English*, German, French, Afrikaans, Italian, Spanish, Russian, Dutch Studies: Swedish, Norwegian, Polish, Modern Hebrew, Malay, Mandarin, Esperanto
| Message 6 of 10 03 September 2005 at 3:15am | IP Logged |
The above link doesn't work. Can you give a reference? I am intrigued by the concept.
This reminds me of another question I have. I believe men and women use a different vocabulary for both writing and speaking. Has this been covered anywhere? Should a man have a woman as his only teacher and a woman have a male teacher? Do we use different vocabularies?
| Raistlin Majere Trilingual Hexaglot Senior Member Spain uciprotour-cycling.c Joined 7152 days ago 455 posts - 424 votes 7 sounds Speaks: English*, Spanish*, Catalan*, FrenchA1, Italian, German Studies: Swedish
| Message 7 of 10 03 September 2005 at 4:21am | IP Logged |
The above link works for me :s Try this one:
Zipf's Law
| fanatic Octoglot Senior Member Australia speedmathematics.com Joined 7146 days ago 1152 posts - 1818 votes Speaks: English*, German, French, Afrikaans, Italian, Spanish, Russian, Dutch Studies: Swedish, Norwegian, Polish, Modern Hebrew, Malay, Mandarin, Esperanto
| Message 8 of 10 03 September 2005 at 6:06am | IP Logged |
Thank you very much. That link worked fine. I have downloaded the page. I am interested in the mathematics of language.