nystagmatic Triglot Groupie Brazil Joined 4306 days ago 47 posts - 58 votes Speaks: Portuguese*, English, French Studies: German
Message 1 of 6 - 17 June 2014 at 6:52am
Hi,
Does anyone know of a program that could be used to analyze the lexical complexity of a number of text files based on an arbitrary corpus? What I was thinking of, more specifically, was a program to which I could feed a couple thousand .epub files that it would then analyze and list by order of average word complexity (more rare words = higher complexity). The corpus, in this case, could be just the books themselves. It sounds like it would be reasonably simple to make and very useful, so I figured there must already be something similar floating around the web — but I haven't found anything. If it comes to it, I could try to dust off my C++ and code it myself, but it won't be pretty. :|
Any ideas? Doesn't have to be exactly like what I described. Anything that could be used for the same purpose would be wonderful. Only thing is that it cannot be limited to English (thus the part about the corpus being arbitrary).
Thanks!
Edited by nystagmatic on 17 June 2014 at 6:54am
Doitsujin Diglot Senior Member Germany Joined 5317 days ago 1256 posts - 2363 votes Speaks: German*, English
Message 2 of 6 - 17 June 2014 at 7:44am
For English you could use standard readability tests, for example the Flesch-Kincaid test, which is integrated into the Count Pages Calibre plug-in.
Adapted Flesch-Kincaid versions also exist for other languages, but I've found them not very reliable. (Most English readability tests exploit the fact that words of non-Germanic origin tend to be longer than words of Germanic origin, so word length is a usable proxy for difficulty. The same length-based statistics obviously don't work for Romance languages, where most of the vocabulary is of Latin origin anyway.)
If you find a suitable algorithm for the languages that you're interested in, you might be able to integrate it into the Count Pages plug-in or develop your own plug-in.
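For anyone who wants to experiment, the core Flesch-Kincaid grade-level formula is easy to reproduce in a few lines of Python. This is only a rough sketch: the syllable counter is a crude vowel-group heuristic for English, not whatever the Calibre plug-in actually uses.

```python
import re

def count_syllables(word):
    # Very rough: one syllable per run of consecutive vowels (English only).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula.
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

print(flesch_kincaid_grade("The cat sat on the mat. It seemed quite happy there."))
```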
luke Diglot Senior Member United States Joined 7202 days ago 3133 posts - 4351 votes Speaks: English*, Spanish Studies: Esperanto, French
Message 3 of 6 - 17 June 2014 at 8:35am
I took a quick glance at http://flesh.sourceforge.net/, which uses the Flesch-Kincaid readability index. I'm unconvinced of its accuracy.
There is also https://lexile.com/, which I've looked at a lot more. As with any computerized lexical processor it can be fooled, but it did offer guidance (for English in particular).
Doitsujin Diglot Senior Member Germany Joined 5317 days ago 1256 posts - 2363 votes Speaks: German*, English
Message 4 of 6 - 17 June 2014 at 9:40am
I'm not convinced either. Some time ago I tested the German adaptation of the Flesch-Kincaid readability index, and it proved not to be very reliable.
luke wrote:
There is also https://lexile.com/, which I've looked at a lot more. As with any computerized lexical processor it can be fooled, but it did offer guidance (for English in particular).
AFAIK, Lexile's proprietary algorithm is mainly used for textbooks.
IMHO, any readability algorithm that takes only word length into account is unsuitable for languages other than English.
An ideal readability tool would perform the following tasks:
1. Reduce all word forms to their canonical forms with a stemmer and count them.
2. Load a language-specific word frequency list containing the 2,000-4,000 most frequently used words and compare it against the canonical word forms from the book.
A text that contains mostly high-frequency words is most likely easier to read than one with a high percentage of infrequently used words. However, since it's possible to write hard-to-read texts using only the most frequent words, this method isn't fully reliable either.
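A minimal Python sketch of those two steps (the frequency-list file name is a placeholder, and NLTK's Snowball stemmer stands in for whichever stemmer you would actually use):

```python
import re

from nltk.stem.snowball import SnowballStemmer  # pip install nltk

def difficulty_score(text, freq_list_path, language="german"):
    """Return the share of stems that fall outside the high-frequency list."""
    stemmer = SnowballStemmer(language)
    # Load the language-specific frequency list (one word per line,
    # hypothetical file) and reduce its entries to stems.
    with open(freq_list_path, encoding="utf-8") as f:
        common = {stemmer.stem(line.strip().lower()) for line in f if line.strip()}
    # Stem the book's words and count how many are not in the list.
    words = re.findall(r"\w+", text.lower())
    stems = [stemmer.stem(w) for w in words]
    if not stems:
        return 0.0
    rare = sum(1 for s in stems if s not in common)
    return rare / len(stems)  # higher = presumably harder to read
```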
Jeffers Senior Member United Kingdom Joined 4906 days ago 2151 posts - 3960 votes Speaks: English* Studies: Hindi, Ancient Greek, French, Sanskrit, German
Message 5 of 6 - 17 June 2014 at 3:15pm
It's funny, I was planning to write a program in Python to do just what the OP mentions. The textbook I'm using, Think Python, has a chapter which shows how to make a concordance of a text, showing frequencies. I'm pretty sure it wouldn't be too hard to then create a frequency list out of a large corpus of text, and compare the words in any text to that corpus.

I don't think, with a large base corpus, it would really be necessary to reduce all words to their canonical forms. In fact, it would be a better reflection of the difficulty of the text if it counted the frequency of forms, not the frequency of words (however defined).
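A minimal sketch of that approach (it assumes the books have already been converted to plain text; the directory name and the 4,000-form cutoff are just placeholders):

```python
import re
from collections import Counter
from pathlib import Path

def tokenize(text):
    return re.findall(r"\w+", text.lower())

# Build a frequency table of raw word forms from the whole corpus.
corpus_dir = Path("corpus_txt")  # plain-text versions of the books
counts = Counter()
for path in corpus_dir.glob("*.txt"):
    counts.update(tokenize(path.read_text(encoding="utf-8")))

top_forms = {w for w, _ in counts.most_common(4000)}

def rarity(path):
    # Share of tokens that fall outside the corpus's most common forms.
    words = tokenize(path.read_text(encoding="utf-8"))
    return sum(w not in top_forms for w in words) / len(words) if words else 0.0

# Rank the books from (presumably) easiest to hardest.
for path in sorted(corpus_dir.glob("*.txt"), key=rarity):
    print(f"{rarity(path):.3f}  {path.name}")
```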
Doitsujin Diglot Senior Member Germany Joined 5317 days ago 1256 posts - 2363 votes Speaks: German*, English
Message 6 of 6 - 17 June 2014 at 4:04pm
Jeffers wrote:
It's funny, I was planning to write a program in Python to do just what the OP mentions. The textbook I'm using, Think Python, has a chapter which shows how to make a concordance of a text, showing frequencies. I'm pretty sure it wouldn't be too hard to then create a frequency list out of a large corpus of text, and compare the words in any text to that corpus.
If you're interested in NLP, check out NLTK, a Python toolkit with lots of NLP modules, among them a concordance function.
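A quick taste of what that looks like ("book.txt" is a placeholder, and the "punkt" tokenizer data needs to be downloaded once):

```python
import nltk
from nltk import FreqDist, Text, word_tokenize

nltk.download("punkt")  # one-time download of the tokenizer model

raw = open("book.txt", encoding="utf-8").read()
tokens = word_tokenize(raw)

# Keyword-in-context concordance for a single word.
Text(tokens).concordance("language")

# The 20 most frequent alphabetic word forms.
print(FreqDist(w.lower() for w in tokens if w.isalpha()).most_common(20))
```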
Edited by Doitsujin on 18 June 2014 at 10:15am