Zipf’s Law

 Language Learning Forum : Philological Room
Raistlin Majere
Trilingual Hexaglot
Senior Member
Spain
uciprotour-cycling.c
Joined 7152 days ago

455 posts - 424 votes 
7 sounds
Speaks: English*, Spanish*, Catalan*, FrenchA1, Italian, German
Studies: Swedish

 
 Message 1 of 10
23 August 2005 at 12:38pm
Zipf's law is an empirical rule (we know only that it works, not why) that applies to all natural languages. It says that the ratio between the number of times a given word appears and the number of times the next most common word appears is always constant. It was formulated by the linguist George Kingsley Zipf.

That would mean, using random numbers as an example, that if we picked up a long book and analysed it...

the commonest word would appear 100.000 times
the second commonest word would appear 98.000 times
the third commonest word would appear 96.040 times
the fourth commonest word would appear 94.119 times
the fifth commonest word would appear 92.223 times

and so on; every word appears 98% as often as the previous word. This percentage changes from language to language and also among individual speakers of the same language, but the proportion is always constant.
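As a rough sketch, the example figures can be reproduced in a few lines of Python. The function name and its parameters are made up for illustration; the starting count of 100,000 and the 98% ratio are simply the example values used here. Note that the fifth figure comes out at about 92,237 with this calculation rather than 92.223.

```python
def constant_ratio_counts(first_count, ratio, n_words):
    """Counts for the n_words most common words, assuming each word
    occurs `ratio` times as often as the word ranked just above it."""
    return [first_count * ratio ** k for k in range(n_words)]


# Example values from the post: 100,000 occurrences, 98% ratio, five words.
counts = constant_ratio_counts(100_000, 0.98, 5)
for rank, count in enumerate(counts, start=1):
    print(rank, round(count))
```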

Zipf's law, among other uses, can be used to tell "real" languages from artificial ones, and to check whether a piece of text was written by one author or by more than one.

What's amazing about it is that nobody knows why it works, but it does.

Edited by Raistlin Majere on 23 August 2005 at 12:40pm




Nephilim
Diglot
Senior Member
Poland
Joined 7145 days ago

363 posts - 368 votes 
Speaks: English*, Polish

 
 Message 2 of 10
23 August 2005 at 2:02pm
Raistlin,

Have you really thought this through? How long would the ‘long book’ have to be to generate such numbers? Let's say the most common words in the book are as follows:

1.     the
2.     of
3.     to
4.     and
5.     a

Note these words are taken from the link at the end of this post.


According to your rule, ‘the’, being the most common, would occur 100,000 times; ‘of’ would appear 98,000 times; ‘to’ would appear 96,040 times; ‘and’ would appear 94,119 times and ‘a’ would appear 92,223 times. This means that these five words alone would come to 480,382 words. I guess in a novel you would probably have at least a couple of thousand different words, each with a different frequency. How long would the book have to be with, say, three thousand words at different frequencies? I just don’t get this at all. Do you mean percentages perhaps? Please explain what you mean here, as I'm well and truly lost on this one.
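The "how long would the book be" question can be worked out as a geometric series. The sketch below assumes the example values of 100,000 occurrences for the top word and a 98% ratio; the function name is hypothetical. It uses unrounded counts, so the five-word total comes out slightly different from the sum of the rounded figures quoted above.

```python
FIRST_COUNT = 100_000  # occurrences of the most common word (example value)
RATIO = 0.98           # each word occurs 98% as often as the one before


def total_words(n_distinct):
    """Total length of a text with n_distinct different words, using the
    closed-form sum of a geometric series: a * (1 - r**n) / (1 - r)."""
    return FIRST_COUNT * (1 - RATIO ** n_distinct) / (1 - RATIO)


print(round(total_words(5)))     # the five most common words together
print(round(total_words(3000)))  # three thousand different words
```

With these numbers the total levels off at roughly five million words however many different words are added, because the later terms of the series shrink towards zero.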






http://esl.about.com/library/vocabulary/bl1000_list1.htm

Edited by Nephilim on 23 August 2005 at 2:05pm




Raistlin Majere
 Message 3 of 10
23 August 2005 at 2:31pm
Sorry;

When I wrote "100.000", "98.000" and so on, I was only giving an example; you can apply the same principle to much smaller numbers such as, say, 600, 588, 576 and so on. I agree that the numbers I gave above seem quite large (indeed, how long would such a book be!).

However, you didn't have to take those values literally. What I meant is that there is always a constant proportion between the frequency of a word and the frequency of the next commonest word.

Following your example, if "the" appears X times as often as "of", then "of" will appear X times as often as "to", which will in turn appear X times as often as "and", and so on (X is always the same value).
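One way to see what "X is always the same value" means is to divide each count by the next one down and compare the results. This is only a sketch with a made-up function name, run on the smaller example numbers 600, 588 and 576.

```python
def consecutive_ratios(counts):
    """For counts sorted from most to least common, return how many times
    as often each word appears as the next word down the list."""
    return [higher / lower for higher, lower in zip(counts, counts[1:])]


# Smaller example numbers from this post.
ratios = consecutive_ratios([600, 588, 576])
print(ratios)
```

Both ratios come out close to 1.02, though not exactly equal, since 98% of 588 is 576.24 rather than a whole number.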

Edited by Raistlin Majere on 23 August 2005 at 2:33pm




administrator
Hexaglot
Forum Admin
Switzerland
FXcuisine.com
Joined 7376 days ago

3094 posts - 2987 votes 
12 sounds
Speaks: French*, EnglishC2, German, Italian, Spanish, Russian
Personal Language Map

 
 Message 4 of 10
23 August 2005 at 3:33pm
Raistlin, thanks for bringing up this most interesting topic. Does Zipf's law apply to lexemes (i.e. 'to be' as one entry instead of 'am', 'is', 'are', etc.) or to actual words?

I have an Excel spreadsheet with the 30,000 most common words, with their number of occurrences, from a rather comprehensive corpus of Russian texts, if you feel like trying it and making a graph to test the law. It lists only lexemes, though; if actual words are needed, there are several other such lists on the web for other languages.
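For anyone who wants to try such a test without a spreadsheet, here is a minimal Python sketch. It counts surface words (not lexemes) in a text, ranks them, and fits a straight line to log(count) against log(rank). The function names and the sample sentence are made up for illustration, and the Russian list itself is not reproduced here. In the form in which Zipf's law is usually stated, a word's count is roughly proportional to 1 divided by its rank, which would give a slope near -1 on such a graph; a constant ratio between neighbouring words would instead give a straight line when log(count) is plotted against the rank itself.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Word counts sorted from most to least common (surface forms only)."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters
    return [count for _, count in Counter(words).most_common()]


def log_log_slope(counts):
    """Least-squares slope of log(count) against log(rank).
    Needs at least two counts, all greater than zero."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up sample text; a real test needs a long text or a corpus list.
sample = "the cat and the dog and the bird saw the cat"
counts = rank_frequency(sample)
print(counts)
print(log_log_slope(counts))
```

A sentence this short says nothing about the law itself; the point is only to show the steps that a long text or a ready-made frequency list would go through.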



Raistlin Majere
 Message 5 of 10
23 August 2005 at 3:51pm
It applies to actual words, not different lexemes ("am, is, be, been" all count as one).

And it is best seen in a text, not in the language as a whole, as the value of X may vary between one speaker and another.

Zipf's Law

Edited by Raistlin Majere on 23 August 2005 at 3:52pm




fanatic
Octoglot
Senior Member
Australia
speedmathematics.com
Joined 7146 days ago

1152 posts - 1818 votes 
Speaks: English*, German, French, Afrikaans, Italian, Spanish, Russian, Dutch
Studies: Swedish, Norwegian, Polish, Modern Hebrew, Malay, Mandarin, Esperanto

 
 Message 6 of 10
03 September 2005 at 3:15am
The above link doesn't work. Can you give a reference? I am intrigued by the concept.

This reminds me of another question I have. I believe men and women use a different vocabulary for both writing and speaking. Has this been covered anywhere? Should a man have a woman as his only teacher and a woman have a male teacher? Do we use different vocabularies?




Raistlin Majere
 Message 7 of 10
03 September 2005 at 4:21am
The above link works for me :s Try this one:

Zipf's Law



fanatic
 Message 8 of 10
03 September 2005 at 6:06am
Thank you very much. That link worked fine. I have downloaded the page. I am interested in the mathematics of language.





Copyright 2024 FX Micheloud - All rights reserved
No part of this website may be copied by any means without my written authorization.