nystagmatic Triglot Groupie Brazil Joined 4306 days ago 47 posts - 58 votes Speaks: Portuguese*, English, French Studies: German
Message 1 of 6 - 17 June 2014 at 6:52am
Hi,
Does anyone know of a program that could be used to analyze the lexical complexity of a number of text files based on an arbitrary corpus? What I was thinking of, more specifically, was a program to which I could feed a couple thousand .epub files that it would then analyze and list by order of average word complexity (more rare words = higher complexity). The corpus, in this case, could be just the books themselves. It sounds like it would be reasonably simple to make and very useful, so I figured there must already be something similar floating around the web — but I haven't found anything. If it comes to it, I could try to dust off my C++ and code it myself, but it won't be pretty. :|
Any ideas? Doesn't have to be exactly like what I described. Anything that could be used for the same purpose would be wonderful. Only thing is that it cannot be limited to English (thus the part about the corpus being arbitrary).
Thanks!
Edited by nystagmatic on 17 June 2014 at 6:54am
Doitsujin Diglot Senior Member Germany Joined 5317 days ago 1256 posts - 2363 votes Speaks: German*, English
Message 2 of 6 - 17 June 2014 at 7:44am
For English you could use standard readability tests, for example the Flesch-Kincaid test, which is integrated into the Count Pages Calibre plug-in.
Adapted Flesch-Kincaid versions also exist for other languages, but I've found them not very reliable. (Most English readability tests exploit the fact that words of non-Germanic origin tend to be longer than words of Germanic origin, so word length is a usable proxy for difficulty. The same length-based statistics obviously don't work for Romance languages, where most of the vocabulary is of Latin origin anyway.)
If you find a suitable algorithm for the languages that you're interested in, you might be able to integrate it into the Count Pages plug-in or develop your own plug-in.
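For anyone who wants to experiment, the core Flesch-Kincaid grade-level formula is easy to reproduce in a few lines of Python. This is only a rough sketch: the syllable counter is a crude vowel-group heuristic for English, not whatever the Calibre plug-in actually uses.

```python
import re

def count_syllables(word):
    # Very rough: one syllable per run of consecutive vowels (English only).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula.
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

print(flesch_kincaid_grade("The cat sat on the mat. It seemed quite happy there."))
```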
luke Diglot Senior Member United States Joined 7202 days ago 3133 posts - 4351 votes Speaks: English*, Spanish Studies: Esperanto, French
Message 3 of 6 - 17 June 2014 at 8:35am
I took a quick glance at http://flesh.sourceforge.net/, which uses the Flesch-Kincaid readability index. I'm unconvinced of its accuracy.
There is also https://lexile.com/, which I've looked at a lot more. As with any computerized lexical processor it can be fooled, but it did offer guidance (for English in particular).
Doitsujin Diglot Senior Member Germany Joined 5317 days ago 1256 posts - 2363 votes Speaks: German*, English
Message 4 of 6 - 17 June 2014 at 9:40am
I'm not convinced either. Some time ago I tested the German adaptation of the Flesch-Kincaid readability index, and it proved not to be very reliable.
luke wrote:
There is also https://lexile.com/, which I've looked at a lot more. As with any computerized lexical processor it can be fooled, but it did offer guidance (for English in particular).
AFAIK, Lexile's proprietary algorithm is mainly used for textbooks.
IMHO, any readability algorithm that takes only word length into account is unsuitable for languages other than English.
An ideal readability tool would perform the following tasks:
1. Reduce all word forms to their canonical forms with a stemmer and count them.
2. Load a language-specific word frequency list containing the 2,000-4,000 most frequently used words and compare it against the canonical word forms from the book.
A text that contains mostly high-frequency words is most likely easier to read than one with a high percentage of infrequently used words. However, since it's possible to write hard-to-read texts using only the most frequent words, this method isn't fully reliable either.
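A minimal Python sketch of those two steps (the frequency-list file name is a placeholder, and NLTK's Snowball stemmer stands in for whichever stemmer you would actually use):

```python
import re

from nltk.stem.snowball import SnowballStemmer  # pip install nltk

def difficulty_score(text, freq_list_path, language="german"):
    """Return the share of stems that fall outside the high-frequency list."""
    stemmer = SnowballStemmer(language)
    # Load the language-specific frequency list (one word per line,
    # hypothetical file) and reduce its entries to stems.
    with open(freq_list_path, encoding="utf-8") as f:
        common = {stemmer.stem(line.strip().lower()) for line in f if line.strip()}
    # Stem the book's words and count how many are not in the list.
    words = re.findall(r"\w+", text.lower())
    stems = [stemmer.stem(w) for w in words]
    if not stems:
        return 0.0
    rare = sum(1 for s in stems if s not in common)
    return rare / len(stems)  # higher = presumably harder to read
```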
Jeffers Senior Member United Kingdom Joined 4906 days ago 2151 posts - 3960 votes Speaks: English* Studies: Hindi, Ancient Greek, French, Sanskrit, German
Message 5 of 6 - 17 June 2014 at 3:15pm
It's funny, I was planning to write a program in Python to do just what the OP mentions. The textbook I'm using, Think Python, has a chapter which shows how to make a concordance of a text, showing frequencies. I'm pretty sure it wouldn't be too hard to then create a frequency list out of a large corpus of text, and compare the words in any text to that corpus.

I don't think, with a large base corpus, it would really be necessary to reduce all words to their canonical forms. In fact, it would be a better reflection of the difficulty of the text if it counted the frequency of forms, not the frequency of words (however defined).
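A minimal sketch of that approach (it assumes the books have already been converted to plain text; the directory name and the 4,000-form cutoff are just placeholders):

```python
import re
from collections import Counter
from pathlib import Path

def tokenize(text):
    return re.findall(r"\w+", text.lower())

# Build a frequency table of raw word forms from the whole corpus.
corpus_dir = Path("corpus_txt")  # plain-text versions of the books
counts = Counter()
for path in corpus_dir.glob("*.txt"):
    counts.update(tokenize(path.read_text(encoding="utf-8")))

top_forms = {w for w, _ in counts.most_common(4000)}

def rarity(path):
    # Share of tokens that fall outside the corpus's most common forms.
    words = tokenize(path.read_text(encoding="utf-8"))
    return sum(w not in top_forms for w in words) / len(words) if words else 0.0

# Rank the books from (presumably) easiest to hardest.
for path in sorted(corpus_dir.glob("*.txt"), key=rarity):
    print(f"{rarity(path):.3f}  {path.name}")
```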
Doitsujin Diglot Senior Member Germany Joined 5317 days ago 1256 posts - 2363 votes Speaks: German*, English
Message 6 of 6 - 17 June 2014 at 4:04pm
Jeffers wrote:
It's funny, I was planning to write a program in Python to do just what the OP mentions. The textbook I'm using, Think Python, has a chapter which shows how to make a concordance of a text, showing frequencies. I'm pretty sure it wouldn't be too hard to then create a frequency list out of a large corpus of text, and compare the words in any text to that corpus.
If you're interested in NLP, check out NLTK, a Python toolkit with lots of NLP modules, among them a concordance function.
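A quick taste of what that looks like ("book.txt" is a placeholder, and the "punkt" tokenizer data needs to be downloaded once):

```python
import nltk
from nltk import FreqDist, Text, word_tokenize

nltk.download("punkt")  # one-time download of the tokenizer model

raw = open("book.txt", encoding="utf-8").read()
tokens = word_tokenize(raw)

# Keyword-in-context concordance for a single word.
Text(tokens).concordance("language")

# The 20 most frequent alphabetic word forms.
print(FreqDist(w.lower() for w in tokens if w.isalpha()).most_common(20))
```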
Edited by Doitsujin on 18 June 2014 at 10:15am