Zipf's Law (Philological Room) Language Learning Forum

Zipf's Law
Tags: Law

10 messages over 2 pages

Raistlin Majere Trilingual Hexaglot Senior Member Spain uciprotour-cycling.c Joined 7152 days ago 455 posts - 424 votes 7 sounds Speaks: English, Spanish, Catalan*, French^A1, Italian, German Studies: Swedish	Message 1 of 10 23 August 2005 at 12:38pm \| IP Logged
	Zipf's law is an empirical rule (we only know that it works, not why it works) that applies to all natural languages. It says that the ratio between the number of times a given word appears and the number of times the next most common word appears is always constant. It was formulated by the linguist George Kingsley Zipf. That would mean, using random numbers as an example, that if we picked up a long book and analysed it...
	the commonest word would appear 100.000 times
	the second commonest word would appear 98.000 times
	the third commonest word would appear 96.040 times
	the fourth commonest word would appear 94.119 times
	the fifth commonest word would appear 92.223 times
	and so on; every word appears 98% as much as the previous word. This percentage changes from language to language and also amongst individual speakers of the same language, but there is always a constant proportion. Zipf's law, amongst other uses, can help to tell "real" languages from artificial ones, and to check whether a piece of text was written by one author or by more than one. What's amazing about it is that nobody knows why it works, but it does.
	Edited by Raistlin Majere on 23 August 2005 at 12:40pm
Nephilim Diglot Senior Member Poland Joined 7145 days ago 363 posts - 368 votes Speaks: English*, Polish	Message 2 of 10 23 August 2005 at 2:02pm \| IP Logged
	"That would mean, using random numbers as an example, that if we picked up a long book and analysed it... the commonest word would appear 100.000 times, the second commonest word would appear 98.000 times, the third commonest word would appear 96.040 times, the fourth commonest word would appear 94.119 times, the fifth commonest word would appear 92.223 times, and so on; every word appears 98% as much as the previous word. This percentage changes from language to language and also amongst individual speakers of the same language, but there is always a constant proportion."
	Raistlin, have you really thought this through? How long would the "long book" have to be to generate such numbers? Let's say the most common words in the book are as follows:
	1. the
	2. of
	3. to
	4. and
	5. a
	Note that these words are taken from the link at the end of this post. According to your rule, "the", being the most common, would occur 100,000 times; "of" would appear 98,000 times; "to" would appear 96,040 times; "and" would appear 94,119 times; and "a" would appear 92,223 times. This means that these five words alone would come to 480,382 words. I guess a novel would probably have at least a couple of thousand different words, all with different frequencies. How long would the book have to be with, say, three thousand words at different frequencies? I just don't get this at all. Do you mean percentages, perhaps? Please explain what you mean here, as I'm well and truly lost on this one.
	http://esl.about.com/library/vocabulary/bl1000_list1.htm
	Edited by Nephilim on 23 August 2005 at 2:05pm
Raistlin Majere Trilingual Hexaglot Senior Member Spain uciprotour-cycling.c Joined 7152 days ago 455 posts - 424 votes 7 sounds Speaks: English, Spanish, Catalan*, French^A1, Italian, German Studies: Swedish	Message 3 of 10 23 August 2005 at 2:31pm \| IP Logged
	Sorry; when I wrote "100.000", "98.000" and so on, I was only giving an example, and you may apply the same principle to much smaller numbers such as, say, 600, 588, 576 and so on. I agree that the numbers I gave above seem quite large (indeed, how long would such a book be!). However, you didn't have to take those values literally. What I meant is that there is always a constant proportion between the frequency of a word and the frequency of the next commonest word. Following your example, if "the" appears X times as often as "of", then "of" will appear X times as often as "to", which will in turn appear X times as often as "and", and so on (X is always the same value).
	Edited by Raistlin Majere on 23 August 2005 at 2:33pm
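The arithmetic in this exchange can be checked with a short sketch. It assumes only the constant-ratio rule exactly as described in the posts above (each word appears 98% as often as the previous one); the function names are made up for illustration. Under a strict 98% ratio the fifth count comes out at about 92,237, slightly above the 92.223 quoted, and three thousand distinct words would add up to roughly five million running words.

```python
def geometric_counts(first_count, ratio, n_words):
    """Counts of the n_words most common words when each word appears
    `ratio` times as often as the word ranked just above it."""
    return [first_count * ratio ** k for k in range(n_words)]


def total_words(first_count, ratio, n_words):
    """Closed-form sum of the geometric series produced above."""
    return first_count * (1 - ratio ** n_words) / (1 - ratio)


# The five-word example from the thread, rounded to whole occurrences.
top5 = [round(c) for c in geometric_counts(100_000, 0.98, 5)]
print(top5)

# Length of a book with three thousand distinct words under the same rule.
print(round(total_words(100_000, 0.98, 3000)))

# The smaller example: 600, 588, 576, ...
print([round(c) for c in geometric_counts(600, 0.98, 3)])
```

Because the counts shrink geometrically, the total can never exceed first_count / (1 - ratio), which is 5,000,000 for these numbers, however many distinct words are added.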
administrator Hexaglot Forum Admin Switzerland FXcuisine.com Joined 7376 days ago 3094 posts - 2987 votes 12 sounds Speaks: French*, English^C2, German, Italian, Spanish, Russian Personal Language Map	Message 4 of 10 23 August 2005 at 3:33pm \| IP Logged
	Raistlin, thanks for bringing up this most interesting topic. Does Zipf's law apply to lexemes (i.e. "to be" as one entry instead of "am", "is", "are", etc.) or to actual words? I have an Excel spreadsheet with the 30,000 most common words, with their number of occurrences, from a rather comprehensive corpus of Russian texts, if you feel like trying it and making a graph to test the law. It only lists lexemes, though; if actual words are needed, there are several other such lists on the web for other languages.
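The test suggested above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample sentence are made up here, it counts surface word forms rather than lexemes, and a real test would need a long text or a frequency list such as the spreadsheet mentioned. It prints the ratio between consecutive counts, which would be roughly constant under the rule as described in this thread, and also rank times count, which would be roughly constant under the more common statement of Zipf's law (frequency roughly proportional to 1/rank).

```python
import re
from collections import Counter


def ranked_counts(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # [^\W\d_]+ matches runs of letters in any alphabet, so it also
    # works for Russian text.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common()


def consecutive_ratios(counts):
    """Ratio of each count to the count one rank above it."""
    return [lower / upper for upper, lower in zip(counts, counts[1:])]


def rank_times_count(counts):
    """Rank multiplied by count, with ranks starting at 1."""
    return [rank * count for rank, count in enumerate(counts, start=1)]


# Toy sample, far too short to show anything about a real language.
sample = "the cat and the dog and the bird saw the cat"
pairs = ranked_counts(sample)
counts = [count for _, count in pairs]

print(pairs[:3])
print(consecutive_ratios(counts))
print(rank_times_count(counts))
```

To try it on a frequency list instead of raw text, pass the column of occurrence counts, sorted from largest to smallest, straight to consecutive_ratios and rank_times_count.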
Raistlin Majere Trilingual Hexaglot Senior Member Spain uciprotour-cycling.c Joined 7152 days ago 455 posts - 424 votes 7 sounds Speaks: English, Spanish, Catalan*, French^A1, Italian, German Studies: Swedish	Message 5 of 10 23 August 2005 at 3:51pm \| IP Logged
	It applies to actual words, not different lexemes ("am", "is", "be", "been" all count as one). And it is best seen in a single text, not in the language as a whole, as the value of X may vary between one speaker and another.
	Zipf's Law
	Edited by Raistlin Majere on 23 August 2005 at 3:52pm
fanatic Octoglot Senior Member Australia speedmathematics.com Joined 7146 days ago 1152 posts - 1818 votes Speaks: English*, German, French, Afrikaans, Italian, Spanish, Russian, Dutch Studies: Swedish, Norwegian, Polish, Modern Hebrew, Malay, Mandarin, Esperanto	Message 6 of 10 03 September 2005 at 3:15am \| IP Logged
	The above link doesn't work. Can you give a reference? I am intrigued by the concept. This reminds me of another question I have. I believe men and women use a different vocabulary for both writing and speaking. Has this been covered anywhere? Should a man have a woman as his only teacher and a woman have a male teacher? Do we use different vocabularies?
Raistlin Majere Trilingual Hexaglot Senior Member Spain uciprotour-cycling.c Joined 7152 days ago 455 posts - 424 votes 7 sounds Speaks: English, Spanish, Catalan*, French^A1, Italian, German Studies: Swedish	Message 7 of 10 03 September 2005 at 4:21am \| IP Logged
	The above link works for me :s Try this one: Zipf's Law
fanatic Octoglot Senior Member Australia speedmathematics.com Joined 7146 days ago 1152 posts - 1818 votes Speaks: English*, German, French, Afrikaans, Italian, Spanish, Russian, Dutch Studies: Swedish, Norwegian, Polish, Modern Hebrew, Malay, Mandarin, Esperanto	Message 8 of 10 03 September 2005 at 6:06am \| IP Logged
	Thank you very much. That link worked fine. I have downloaded the page. I am interested in the mathematics of language.
