
Software to analyze lexical complexity

Language Learning Forum : General discussion
nystagmatic
Triglot
Groupie
Brazil
Joined 4306 days ago

47 posts - 58 votes 
Speaks: Portuguese*, English, French
Studies: German

 
 Message 1 of 6
17 June 2014 at 6:52am | IP Logged 
Hi,

Does anyone know of a program that could be used to analyze the lexical complexity of a number of text files based on an arbitrary corpus? What I was thinking of, more specifically, was a program to which I could feed a couple thousand .epub files that it would then analyze and list by order of average word complexity (more rare words = higher complexity). The corpus, in this case, could be just the books themselves. It sounds like it would be reasonably simple to make and very useful, so I figured there must already be something similar floating around the web — but I haven't found anything. If it comes to it, I could try to dust off my C++ and code it myself, but it won't be pretty. :|

Any ideas? Doesn't have to be exactly like what I described. Anything that could be used for the same purpose would be wonderful. Only thing is that it cannot be limited to English (thus the part about the corpus being arbitrary).
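To make the idea concrete, here is a rough sketch of what I mean, assuming plain-text input (the .epub files would first need converting to text, e.g. with Calibre's ebook-convert). The function names are just placeholders:

```python
import math
import re
from collections import Counter

WORD_RE = re.compile(r"[^\W\d_]+", re.UNICODE)  # letters only, any language

def tokenize(text):
    return [w.lower() for w in WORD_RE.findall(text)]

def rank_by_rarity(texts):
    """Rank {name: text} by average word rarity, using the texts
    themselves as the corpus (rarer words => higher complexity)."""
    freq = Counter()
    for text in texts.values():
        freq.update(tokenize(text))
    total = sum(freq.values())

    def score(text):
        words = tokenize(text)
        # average negative log-frequency (surprisal) per word
        return sum(-math.log(freq[w] / total) for w in words) / len(words)

    return sorted(texts, key=lambda name: score(texts[name]), reverse=True)
```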

Thanks!

Edited by nystagmatic on 17 June 2014 at 6:54am

1 person has voted this message useful



Doitsujin
Diglot
Senior Member
Germany
Joined 5317 days ago

1256 posts - 2363 votes 
Speaks: German*, English

 
 Message 2 of 6
17 June 2014 at 7:44am | IP Logged 
For English you could use standard readability tests, for example the Flesch–Kincaid test, which is integrated into the Count Pages Calibre plug-in.
For other languages, adapted Flesch–Kincaid versions also exist, but I've found them to be not very reliable. (Most English readability tests take advantage of the fact that non-Germanic words tend to be longer than words of Germanic origin. However, the same statistics-based method obviously doesn't work for Romance languages.)
If you find a suitable algorithm for the languages that you're interested in you might be able to integrate it into the Count Pages plug-in or develop your own plug-in.
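For reference, the Flesch–Kincaid grade-level formula itself is simple enough to sketch in a few lines of Python; the syllable counter below is a crude vowel-group heuristic, which is one reason these scores should be taken with a grain of salt:

```python
import re

def count_syllables(word):
    # crude heuristic: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```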
1 person has voted this message useful



luke
Diglot
Senior Member
United States
Joined 7202 days ago

3133 posts - 4351 votes 
Speaks: English*, Spanish
Studies: Esperanto, French

 
 Message 3 of 6
17 June 2014 at 8:35am | IP Logged 
I took a quick glance at http://flesh.sourceforge.net/, which uses the Flesch-Kincaid readability index. I'm unconvinced of its accuracy.

There is https://lexile.com/. I've looked at that site a lot more. As with any computerized lexical processor, it can be fooled. It did offer guidance though (for English in particular).
1 person has voted this message useful



Doitsujin
Diglot
Senior Member
Germany
Joined 5317 days ago

1256 posts - 2363 votes 
Speaks: German*, English

 
 Message 4 of 6
17 June 2014 at 9:40am | IP Logged 
luke wrote:
I took a quick glance at http://flesh.sourceforge.net/, which uses the Flesch-Kincaid readability index. I'm unconvinced of its accuracy.

Neither am I. Some time ago I tested the German version of the Flesch-Kincaid readability index, which proved to be not very reliable.

luke wrote:
There is https://lexile.com/. I've looked at that site a lot more. As with any computerized lexical processor, it can be fooled. It did offer guidance though (for English in particular).

AFAIK, Lexile's proprietary algorithm is mainly used for textbooks.

IMHO, any readability algorithm that takes only the word length into account is unsuitable for languages other than English.

An ideal readability tool would perform the following tasks:

1. Reduce all word forms to their canonical forms with a stemmer and count them.
2. Load a language-specific word frequency list containing the 2000-4000 most frequently used words and compare it against the canonical word forms from the book.

Any text which contains mostly high-frequency words is most likely easier to read than a text that contains a high percentage of infrequently used words. However, since it's possible to write hard-to-read texts using only the most frequently used words, this method isn't fully reliable either.
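The two steps could be sketched roughly like this; note that the toy suffix-stripper below only stands in for a real stemmer (e.g. NLTK's SnowballStemmer), and the function names are my own invention:

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    # stand-in for a real stemmer such as NLTK's SnowballStemmer
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def frequent_word_coverage(text, frequency_list):
    """Share of a text's stemmed tokens found among the stems of a
    language-specific high-frequency word list (steps 1 and 2 above)."""
    frequent_stems = {toy_stem(w.lower()) for w in frequency_list}
    stems = [toy_stem(w.lower()) for w in text.split()]
    hits = sum(1 for s in stems if s in frequent_stems)
    return hits / len(stems)
```

A low coverage score would then flag a text as relatively hard.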

1 person has voted this message useful



Jeffers
Senior Member
United Kingdom
Joined 4906 days ago

2151 posts - 3960 votes 
Speaks: English*
Studies: Hindi, Ancient Greek, French, Sanskrit, German

 
 Message 5 of 6
17 June 2014 at 3:15pm | IP Logged 
It's funny, I was planning to write a program in Python to do just what the OP mentions. The textbook I'm using, Think Python, has a chapter which shows how to make a concordance of a text, showing frequencies. I'm pretty sure it wouldn't be too hard to then create a frequency list out of a large corpus of text, and compare the words in any text to that corpus.

I don't think, with a large base corpus, it would really be necessary to reduce all words to their canonical forms. In fact, it would be a better reflection of the difficulty of the text if it counted the frequency of forms, not the frequency of words (however defined).
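Counting surface forms directly keeps it simple; a minimal sketch (function names are just illustrative):

```python
from collections import Counter

def build_form_frequency_list(corpus_texts, top_n=2000):
    """Frequency list of surface forms (no stemming) from a corpus."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(w.lower() for w in text.split())
    return {w for w, _ in counts.most_common(top_n)}

def unfamiliar_ratio(text, top_forms):
    """Fraction of a text's tokens outside the top-N forms; higher means
    harder, with each inflected form counted as its own word."""
    tokens = [w.lower() for w in text.split()]
    return sum(1 for t in tokens if t not in top_forms) / len(tokens)
```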
1 person has voted this message useful



Doitsujin
Diglot
Senior Member
Germany
Joined 5317 days ago

1256 posts - 2363 votes 
Speaks: German*, English

 
 Message 6 of 6
17 June 2014 at 4:04pm | IP Logged 
Jeffers wrote:
It's funny, I was planning to write a program in Python to do just what the OP mentions. The textbook I'm using, Think Python, has a chapter which shows how to make a concordance of a text, showing frequencies. I'm pretty sure it wouldn't be too hard to then create a frequency list out of a large corpus of text, and compare the words in any text to that corpus.

If you're interested in NLP, check out NLTK, a Python toolkit with lots of NLP modules, among them a concordance function.
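The core idea of a concordance (keyword in context) fits in a few lines of plain Python; NLTK's nltk.Text.concordance offers a more polished version of the same thing:

```python
def concordance(tokens, target, width=3):
    """Minimal keyword-in-context concordance: for each occurrence of
    `target`, collect `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            lines.append(f"{left} [{target}] {right}".strip())
    return lines
```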

Edited by Doitsujin on 18 June 2014 at 10:15am



1 person has voted this message useful


