Readlang, my language reading site

Language Learning Forum : Learning Techniques, Methods & Strategies
SteveRidout
Diglot
Groupie
Spain
readlang.com
Joined 1722 days ago

65 posts - 55 votes 
Speaks: English*, Spanish

 
 Message 73 of 131
26 June 2013 at 2:14am | IP Logged 
Thanks Crush, great work!

The word counts on the Catalan one look on the low side but it's probably OK for our purposes.
There are some dates and numbers in there that I'll strip out but to be consistent with the other
lists I'll leave the apostrophes there for now*.

I'll have a shot at adding these Basque and Catalan lists to Readlang tomorrow, so you can play
around with them and see the results.

About the monolingual dictionaries - I think that copyright may be a problem with scraping sites
for data, I'd only want to use ones that use an appropriate licence, like one of the Creative
Commons ones. Wiktionary may be a good place to start although I hear it's a pain to parse since
all the languages are formatted differently. I'm not sure when/if I'll get around to this.

* The word lists I'm using have apostrophes. The English list I use has entries like "can't" and
"don't", and I count these as one word for the purposes of translating and difficulty grading,
which makes sense. But other contractions like the English 's and the Catalan l' feel like they
should be separated. Currently they aren't. The hard thing about separating them is that it would
require language specific rules. It may turn out that the grading algorithm will still make
reasonable predictions even with all the apostrophes, let's see.



Crush
Diglot
Senior Member
China
Joined 3305 days ago

1622 posts - 682 votes 
Speaks: English*, Spanish
Studies: Mandarin, Esperanto, Basque

 
 Message 74 of 131
26 June 2013 at 6:27am | IP Logged 
In the last file i uploaded, i separated things like l' and d', since really they shouldn't be counted as one word. For the next batch i've separated the other clitic pronouns, too. I'll continue to add more fairly recent books (unfortunately, only about half of them are original Catalan works, the rest are translations). Adding subtitles takes a bit longer but i'll download a few more and add them too. I'll also just remove all numbers.

I think it'd be worthwhile to separate them in the main reading interface, too, if possible. It might help with translations both from Google and the Wordreference dictionaries. At least for Catalan/French it's easy since clitics are separated by an apostrophe or dash, Spanish is probably much more difficult and maybe not worth the trouble separating them.

Here's another Catalan version with a corpus of 6,000,000 words and clitics (mostly) separated:
http://www.mirari.fr/9pU7
I deleted everything that just had 1 entry as most of those weren't even Catalan words. Numbers/dates should be gone, too.
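
A minimal sketch of the cleanup described above (dropping numbers/dates and entries that occur only once), assuming the corpus has already been split into tokens. The function name `build_frequency_list` and the `min_count` parameter are made up for illustration and are not from the actual scripts used:

```python
from collections import Counter

def build_frequency_list(tokens, min_count=2):
    """Count tokens, skipping numbers/dates and dropping rare entries."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        # Skip anything containing a digit (plain numbers, dates, etc.).
        if any(ch.isdigit() for ch in word):
            continue
        counts[word] += 1
    # Keep only words seen at least min_count times, most frequent first.
    return [(w, c) for w, c in counts.most_common() if c >= min_count]

sample = "la casa i la porta 1984 la casa xyzzy".split()
print(build_frequency_list(sample))
```

With `min_count=2`, anything seen only once is thrown away, which removes most stray non-Catalan tokens at the cost of also losing genuinely rare words.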

EDIT: I can't seem to translate words anymore. It just highlights a bunch of words starting from the word i clicked on and ending wherever my mouse is.

Edited by Crush on 26 June 2013 at 5:04pm



SteveRidout
Diglot
Groupie
Spain
readlang.com
Joined 1722 days ago

65 posts - 55 votes 
Speaks: English*, Spanish

 
 Message 75 of 131
26 June 2013 at 8:21pm | IP Logged 
Crush wrote:
In the last file i uploaded, i separated things like l' and d', since really
they shouldn't be counted as one word. For the next batch i've separated the other clitic
pronouns, too. I'll continue to add more fairly recent books (unfortunately, only about half
of them are original Catalan works, the rest are translations). Adding subtitles takes a bit
longer but i'll download a few more and add them too. I'll also just remove all numbers.

I think it'd be worthwhile to separate them in the main reading interface, too, if possible.
It might help with translations both from Google and the Wordreference dictionaries. At least
for Catalan/French it's easy since clitics are separated by an apostrophe or dash, Spanish is
probably much more difficult and maybe not worth the trouble separating them.


Thanks, I've used your word lists for Catalan and Basque on the site now. And for Catalan I'm
now separating the apostrophes in the texts before looking up in the word frequency list. I'm
thinking of open-sourcing this algorithm. It's currently very simple, but dealing correctly with
contractions and compound words for all the languages would be very tricky and
I'd need help from others.

Crush wrote:

EDIT: I can't seem to translate words anymore. It just highlights a bunch of words starting
from the word i clicked on and ending wherever my mouse is.


Sorry about this, it should be fixed now, please let me know if you spot any more problems. It
was related to a feature I've just added where you can click and drag (or touch and drag) to
translate multi-word phrases in one go.

Edited by SteveRidout on 26 June 2013 at 8:23pm



Crush
Diglot
Senior Member
China
Joined 3305 days ago

1622 posts - 682 votes 
Speaks: English*, Spanish
Studies: Mandarin, Esperanto, Basque

 
 Message 76 of 131
27 June 2013 at 1:40am | IP Logged 
I figured that's what it was about, which would be a nice feature, but unfortunately it seems to still not be working. I also am not noticing the separation of apostrophes (and hyphens, too, would be ideal), so i wonder if either the changes didn't quite make it online or if it's just a problem on my end not being able to load the latest version.

What exactly do you need help with as regards the contractions, etc.? LWT i believe asks you to input language specific delimiters, something like that could work. We could come up with a list of language-specific delimiters, maybe not as streamlined as you'd like but i don't see how else to handle it. Being able to select multiple words makes it easier, if mother-in-law gets separated into three words, we can just highlight all three words.
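
The per-language delimiter idea could be sketched roughly like this. The `DELIMITERS` table and its entries are purely illustrative assumptions (not LWT's or Readlang's actual configuration), and this version simply discards the delimiter rather than attaching it to either part:

```python
import re

# Hypothetical per-language delimiter table; entries are illustrative only.
DELIMITERS = {
    "ca": "'’-",  # Catalan: apostrophes and hyphens separate clitics
    "fr": "'’-",  # French: assumed here to behave similarly
    "en": "",     # English: keep "can't" and "mother-in-law" whole
}

def tokenize(text, lang):
    """Split text into words, then split again on the language's delimiters."""
    extra = DELIMITERS.get(lang, "")
    # First pass: runs of word characters, apostrophes and hyphens.
    words = re.findall(r"[\w'’-]+", text)
    if not extra:
        return words
    splitter = re.compile("[" + re.escape(extra) + "]")
    return [part for word in words for part in splitter.split(word) if part]

print(tokenize("l'altre dona-me", "ca"))
print(tokenize("can't stop", "en"))
```

A language with no entry in the table falls back to no extra splitting, so new languages can be added one at a time.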



SteveRidout
Diglot
Groupie
Spain
readlang.com
Joined 1722 days ago

65 posts - 55 votes 
Speaks: English*, Spanish

 
 Message 77 of 131
27 June 2013 at 3:15am | IP Logged 
Ah, sorry about this, I've been testing in Chrome and forgot to try it in Firefox, where I just saw the
problem I assume you have been getting. It's now fixed, hopefully for real this time!

I wasn't clear earlier, I'm splitting the words only for the purposes of counting the percentage of
high frequency words in the difficulty grading algorithm at the moment, not for click-to-translate, at
least not yet.
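
As a rough illustration of the counting step described above, here is a sketch that computes the share of tokens found in a high-frequency word set, with and without splitting on apostrophes. The function name, the tiny word set, and the assumption that the list stores "l" without its apostrophe are all hypothetical; how the percentage is then turned into a grade isn't covered here:

```python
import re

def high_frequency_share(text, frequent_words, split_apostrophes=True):
    """Return the fraction of tokens in text that appear in frequent_words."""
    tokens = re.findall(r"[\w'’]+", text.lower())
    if split_apostrophes:
        # Split "l'altre" into "l" and "altre" before the lookup.
        tokens = [p for t in tokens for p in re.split(r"['’]", t) if p]
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in frequent_words)
    return known / len(tokens)

# Hypothetical high-frequency set, stored without apostrophes.
frequent = {"l", "altre", "dia", "vaig", "veure"}
sentence = "L'altre dia vaig veure l'hipopòtam"
print(high_frequency_share(sentence, frequent))
print(high_frequency_share(sentence, frequent, split_apostrophes=False))
```

Without splitting, "l'altre" counts as one unknown token, so the same sentence looks harder than it is.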

Specifying delimiters for each language would be a good start. As well as this, I'd want to know
whether the apostrophe joins the start of the word or the end, e.g.

l’altre is "l'" and "altre": the apostrophe joins the first part. But it seems apostrophes can occur at
the ends of words too (ref: http://en.wiktionary.org/wiki/Category:Catalan_contractions), in which case
I suppose the apostrophe joins the last part? And are there some words which are exceptions,
containing the delimiter but not meant to be split?

And many languages have prefixes or suffixes or join words together with no delimiters, e.g. in Spanish
"dime" should be "di" "me" (tell me), and these seem to follow very predictable rules which could
perhaps be incorporated. This is one of the things I was imagining needing help for. Another thing
might be to create a training set of texts with the expected grade level to calibrate the algorithm to
give sensible grades for all the languages. Basically, I think there's a lot of potential for work here
which others may be interested in doing if there was an open source project.



Crush
Diglot
Senior Member
China
Joined 3305 days ago

1622 posts - 682 votes 
Speaks: English*, Spanish
Studies: Mandarin, Esperanto, Basque

 
 Message 78 of 131
27 June 2013 at 5:44am | IP Logged 
For Catalan i can't think of any exceptions. This is what i searched for:
d' l' m' t' s' n' as well as the reverse ('d, 'l, etc.). I also changed all dashes into spaces, as those are only used to attach full (non-shortened) clitics to verbs. It'd be like writing habla-me, sentir-lo, etc. in Spanish. Really, i don't think there are any words that include a dash or apostrophe, so i think just using them as delimiters would be fine. In this sense i think handling Catalan would be much easier than Spanish. For Spanish, you need to know verb forms to be able to tell if it's a verb with a pronoun attached or another word. For example, "¡Vente pronto!" vs. "Es muy potente." Though i don't think having the verb forms in the system would be a bad thing!
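
To illustrate the Spanish point, a sketch of splitting a trailing pronoun only when what remains is a known verb form. The `VERB_FORMS` set is a tiny hypothetical stand-in for a real list of verb forms, and the sketch ignores accent changes (e.g. "háblame") and stacked pronouns (e.g. "dímelo"):

```python
# Hypothetical, tiny set of known verb forms; a real list would be far larger.
VERB_FORMS = {"di", "ven", "habla", "sentir"}
ENCLITICS = ("me", "te", "se", "lo", "la", "le", "nos", "los", "las", "les")

def split_enclitic(word):
    """Split one trailing pronoun off word if the remainder is a known verb form."""
    lower = word.lower()
    # Try longer pronouns first.
    for pronoun in sorted(ENCLITICS, key=len, reverse=True):
        stem = lower[: -len(pronoun)]
        if lower.endswith(pronoun) and stem in VERB_FORMS:
            return [stem, pronoun]
    # No verb + pronoun reading found: leave the word whole.
    return [lower]

print(split_enclitic("Vente"))    # "ven" is a known verb form
print(split_enclitic("potente"))  # "poten" is not, so no split
```

The verb-form lookup is what keeps "potente" intact even though it ends in "te".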

Ah and yep, it's fixed now, thanks!

I'm all for open source, ultimately it's your choice but i think that benefits everyone. And i'd be glad to help with whatever i can. I'm still curious how to handle separable phrasal verbs in languages like English and German.

Also, thanks for adding the sorting options to the public library.



SteveRidout
Diglot
Groupie
Spain
readlang.com
Joined 1722 days ago

65 posts - 55 votes 
Speaks: English*, Spanish

 
 Message 79 of 131
27 June 2013 at 12:28pm | IP Logged 
It now splits the words by apostrophe for translations in the reader page too, only for
Catalan so far. It always attaches the apostrophe to the shortest bit of the word, or the
first bit if they are equal. Should I apply the same rule to French too?
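
The rule described above (attach the apostrophe to the shorter part, or to the first part on a tie) can be sketched as follows. This is an illustration of the stated rule, not the site's actual code, and the function name is made up:

```python
import re

def split_at_apostrophe(word):
    """Split on the first apostrophe, attaching it to the shorter part."""
    match = re.search(r"['’]", word)
    if match is None:
        return [word]
    first = word[: match.start()]
    mark = match.group()
    second = word[match.end():]
    if not first or not second:
        return [word]  # leading or trailing apostrophe: nothing to split
    if len(first) <= len(second):
        return [first + mark, second]  # shorter (or equal) first part keeps it
    return [first, mark + second]

for w in ["l’altre", "dona'm", "n'hi", "casa"]:
    print(w, split_at_apostrophe(w))
```

Under this rule "l’altre" keeps the apostrophe on "l’", while a word ending in a short clitic such as "dona'm" keeps it on "'m".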



tricoteuse
Pentaglot
Senior Member
Norway
littlang.blogspot.co
Joined 4118 days ago

746 posts - 101 votes 
Speaks: Swedish*, Norwegian, EnglishC1, Russian, French
Studies: Ukrainian, Bulgarian

 
 Message 80 of 131
27 June 2013 at 12:45pm | IP Logged 
I'm really glad I didn't discover this now just to realize it's been around for several
years, because that would make me a bit pissed that I'd been missing out ;) The site
looks great. I haven't read through the entire thread here, but how do you add new
languages? By request, by popularity?



Copyright 2017 FX Micheloud - All rights reserved
No part of this website may be copied by any means without my written authorization.