SteveRidout Diglot Groupie Spain readlang.com Joined 4280 days ago 65 posts - 121 votes Speaks: English*, Spanish
| Message 73 of 131 26 June 2013 at 2:14am | IP Logged |
Thanks Crush, great work!
The word counts on the Catalan one look on the low side but it's probably OK for our purposes.
There are some dates and numbers in there that I'll strip out but to be consistent with the other
lists I'll leave the apostrophes there for now*.
I'll have a shot at adding these Basque and Catalan lists to Readlang tomorrow, so you can play
around with them and see the results.
About the monolingual dictionaries: I think copyright may be a problem with scraping sites
for data, so I'd only want to use ones with an appropriate licence, like one of the Creative
Commons ones. Wiktionary may be a good place to start, although I hear it's a pain to parse since
all the languages are formatted differently. I'm not sure when/if I'll get around to this.
* The word lists I'm using have apostrophes. The English list I use has entries like "can't" and
"don't", and I count these as one word for the purposes of translating and difficulty grading,
which makes sense. But other contractions like the English 's and the Catalan l' feel like they
should be separated. Currently they aren't. The hard thing about separating them is that it would
require language specific rules. It may turn out that the grading algorithm will still make
reasonable predictions even with all the apostrophes, let's see.
1 person has voted this message useful
| Crush Tetraglot Senior Member China Joined 5863 days ago 1622 posts - 2299 votes Speaks: English*, Spanish, Mandarin, Esperanto Studies: Basque
| Message 74 of 131 26 June 2013 at 6:27am | IP Logged |
In the last file i uploaded, i separated things like l' and d', since really they shouldn't be counted as one word. For the next batch i've separated the other clitic pronouns, too. I'll continue to add more fairly recent books (unfortunately, only about half of them are original Catalan works, the rest are translations). Adding subtitles takes a bit longer but i'll download a few more and add them too. I'll also just remove all numbers.
I think it'd be worthwhile to separate them in the main reading interface, too, if possible. It might help with translations both from Google and the Wordreference dictionaries. At least for Catalan/French it's easy since clitics are separated by an apostrophe or dash, Spanish is probably much more difficult and maybe not worth the trouble separating them.
Here's another Catalan version with a corpus of 6,000,000 words and clitics (mostly) separated:
http://www.mirari.fr/9pU7
I deleted everything that just had 1 entry as most of those weren't even Catalan words. Numbers/dates should be gone, too.
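For anyone who wants to reproduce that kind of cleanup, here is a rough sketch in Python. It is not the script used for the list above; the function name `build_frequency_list` and the `min_count` parameter are made up for illustration. It treats any non-letter character (including apostrophes and hyphens) as a separator, skips tokens that contain digits, and drops words seen only once.

```python
import re
from collections import Counter


def build_frequency_list(text, min_count=2):
    """Return {word: count}, skipping numeric tokens and rare words.

    Hypothetical helper for illustration only. Apostrophes and hyphens
    act as separators here, so "l'home" is counted as "l" and "home".
    """
    # Runs of letters/digits; underscores and punctuation split tokens.
    tokens = re.findall(r"[^\W_]+", text.lower())
    # Drop anything containing a digit (years, dates, "2nd", ...).
    words = [t for t in tokens if not any(ch.isdigit() for ch in t)]
    counts = Counter(words)
    # Drop entries seen fewer than min_count times (default: hapaxes).
    return {w: c for w, c in counts.items() if c >= min_count}
```

With `min_count=2`, a word that appears only once in the corpus is left out, which matches the "delete everything with 1 entry" step described above.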
EDIT: I can't seem to translate words anymore. It just highlights a bunch of words starting from the word i clicked on and ending wherever my mouse is.
Edited by Crush on 26 June 2013 at 5:04pm
1 person has voted this message useful
| SteveRidout Diglot Groupie Spain readlang.com Joined 4280 days ago 65 posts - 121 votes Speaks: English*, Spanish
| Message 75 of 131 26 June 2013 at 8:21pm | IP Logged |
Crush wrote:
In the last file i uploaded, i separated things like l' and d', since really
they shouldn't be counted as one word. For the next batch i've separated the other clitic
pronouns, too. I'll continue to add more fairly recent books (unfortunately, only about half
of them are original Catalan works, the rest are translations). Adding subtitles takes a bit
longer but i'll download a few more and add them too. I'll also just remove all numbers.
I think it'd be worthwhile to separate them in the main reading interface, too, if possible.
It might help with translations both from Google and the Wordreference dictionaries. At least
for Catalan/French it's easy since clitics are separated by an apostrophe or dash, Spanish is
probably much more difficult and maybe not worth the trouble separating them.
Thanks, I've used your word lists for Catalan and Basque on the site now. For Catalan I'm
now splitting words at apostrophes in the texts before looking them up in the word frequency list. I'm
thinking of open-sourcing this algorithm. It's currently very simple, but dealing correctly with
contractions and compound words for all the languages would be very tricky and
I'd need help from others.
Crush wrote:
EDIT: I can't seem to translate words anymore. It just highlights a bunch of words starting
from the word i clicked on and ending wherever my mouse is.
Sorry about this, it should be fixed now, please let me know if you spot any more problems. It
was related to a feature I've just added where you can click and drag (or touch and drag) to
translate multi-word phrases in one go.
Edited by SteveRidout on 26 June 2013 at 8:23pm
1 person has voted this message useful
| Crush Tetraglot Senior Member China Joined 5863 days ago 1622 posts - 2299 votes Speaks: English*, Spanish, Mandarin, Esperanto Studies: Basque
| Message 76 of 131 27 June 2013 at 1:40am | IP Logged |
I figured that's what it was about, which would be a nice feature, but unfortunately it seems to still not be working. I also am not noticing the separation of apostrophes (and hyphens, too, would be ideal), so i wonder if either the changes didn't quite make it online or if it's just a problem on my end not being able to load the latest version.
What exactly do you need help with as regards the contractions, etc.? LWT i believe asks you to input language specific delimiters, something like that could work. We could come up with a list of language-specific delimiters, maybe not as streamlined as you'd like but i don't see how else to handle it. Being able to select multiple words makes it easier, if mother-in-law gets separated into three words, we can just highlight all three words.
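To make the delimiter idea concrete, a minimal sketch could look like the following. The table `WORD_DELIMITERS`, its language codes and its values are illustrative assumptions, not LWT's or Readlang's actual configuration, and `split_words` is a made-up name.

```python
import re

# Hypothetical per-language table of extra word delimiters (besides
# whitespace). The values are illustrative guesses, not a vetted list.
WORD_DELIMITERS = {
    "ca": "'’-",  # Catalan: apostrophes and hyphens separate clitics
    "fr": "'’-",  # French: assumed to behave the same way
    "en": "",     # English: keep "can't", "don't" as single words
}


def split_words(text, lang):
    """Split text on whitespace plus the language's extra delimiters."""
    extra = WORD_DELIMITERS.get(lang, "")
    pattern = r"[\s" + re.escape(extra) + r"]+"
    # re.split can yield empty strings at the edges, so filter them out.
    return [t for t in re.split(pattern, text) if t]
```

Note that this version throws the delimiter away, so "l'altre" becomes "l" and "altre". Whether the apostrophe should stay attached to one side is a separate decision.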
1 person has voted this message useful
| SteveRidout Diglot Groupie Spain readlang.com Joined 4280 days ago 65 posts - 121 votes Speaks: English*, Spanish
| Message 77 of 131 27 June 2013 at 3:15am | IP Logged |
Ah, sorry about this. I've been testing in Chrome and forgot to try it in Firefox, where I just saw the
problem I assume you have been getting. It's now fixed, hopefully for real this time!
I wasn't clear earlier: at the moment I'm splitting the words only for the purposes of counting the
percentage of high frequency words in the difficulty grading algorithm, not for click-to-translate, at
least not yet.
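As a rough illustration of that counting step (a sketch only, not the site's actual code; the function name, the `top_n` cutoff of 2000 and the idea of passing in a ranked word list are all assumptions):

```python
def high_frequency_share(tokens, ranked_words, top_n=2000):
    """Fraction of tokens found among the top_n most frequent words.

    ranked_words is assumed to be sorted from most to least frequent.
    How this fraction maps to a difficulty grade is not covered here.
    """
    if not tokens:
        return 0.0
    top = set(ranked_words[:top_n])
    hits = sum(1 for t in tokens if t.lower() in top)
    return hits / len(tokens)
```

A text where most tokens fall inside the top of the frequency list would get a share close to 1.0 and would presumably be graded as easier.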
Specifying delimiters for each language would be a good start. As well as this, I'd want to know
whether the apostrophe joins the start of the word or the end, e.g.
l’altre is "l'" and "altre", so the apostrophe joins the first part. But it seems apostrophes can occur at
the ends of words too (ref: http://en.wiktionary.org/wiki/Category:Catalan_contractions), in which case
I suppose the apostrophe joins the last part? And are there any words that are exceptions,
containing the delimiter but not meant to be split?
And many languages have prefixes or suffixes or join words together with no delimiters, e.g. in Spanish
"dime" should be "di" "me" (tell me), and these seem to follow very predictable rules which could
perhaps be incorporated. This is one of the things I was imagining needing help for. Another thing
might be to create a training set of texts with the expected grade level to calibrate the algorithm to
give sensible grades for all the languages. Basically, I think there's a lot of potential for work here
which others may be interested in doing if there was an open source project.
1 person has voted this message useful
| Crush Tetraglot Senior Member China Joined 5863 days ago 1622 posts - 2299 votes Speaks: English*, Spanish, Mandarin, Esperanto Studies: Basque
| Message 78 of 131 27 June 2013 at 5:44am | IP Logged |
For Catalan i can't think of any exceptions. This is what i searched for:
d' l' m' t' s' n' as well as the reverse ('d, 'l, etc.). I also changed all dashes into spaces, as those are only used to attach full (non-shortened) clitics to verbs. It'd be like writing habla-me, sentir-lo, etc. in Spanish. Really, i don't think there are any words that include a dash or apostrophe, so i think just using them as delimiters would be fine. In this sense i think handling Catalan would be much easier than Spanish. For Spanish, you need to know verb forms to be able to tell if it's a verb with a pronoun attached or another word. For example, "¡Vente pronto!" vs. "Es muy potente." Though i don't think having the verb forms in the system would be a bad thing!
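The Catalan rules described above could be sketched roughly like this. It is an approximation for illustration, not the exact search-and-replace used on the word list; `split_catalan_clitics` is a made-up name, and punctuation is not stripped.

```python
import re


def split_catalan_clitics(text):
    """Split off d' l' m' t' s' n' (and the reversed 'd, 'l, ...) forms.

    Illustrative sketch: hyphens become spaces, and the apostrophe
    stays attached to the single-letter clitic.
    """
    text = text.replace("’", "'")  # normalise curly apostrophes
    text = text.replace("-", " ")  # hyphens only attach full clitics
    # Leading clitic: "l'altre" -> "l' altre"
    text = re.sub(r"\b([dlmtsn])'(?=\w)", r"\1' ", text, flags=re.IGNORECASE)
    # Trailing clitic: "porta'l" -> "porta 'l"
    text = re.sub(r"(?<=\w)'([dlmtsn])\b", r" '\1", text, flags=re.IGNORECASE)
    return text.split()
```

Nothing like this works for Spanish without verb-form knowledge, since a purely string-based rule cannot tell "vente" (ven + te) from the ending of "potente".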
Ah and yep, it's fixed now, thanks!
I'm all for open source, ultimately it's your choice but i think that benefits everyone. And i'd be glad to help with whatever i can. I'm still curious how to handle separable phrasal verbs in languages like English and German.
Also, thanks for adding the sorting options to the public library.
1 person has voted this message useful
| SteveRidout Diglot Groupie Spain readlang.com Joined 4280 days ago 65 posts - 121 votes Speaks: English*, Spanish
| Message 79 of 131 27 June 2013 at 12:28pm | IP Logged |
It now splits the words by apostrophe for translations in the reader page too, only for
Catalan so far. It always attaches the apostrophe to the shortest bit of the word, or the
first bit if they are equal. Should I apply the same rule to French too?
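In Python terms, the rule above amounts to something like the sketch below. This is a paraphrase for discussion rather than the code running on the site; the function name is made up, and it only looks at the first apostrophe in a word.

```python
def split_at_apostrophe(word):
    """Split a word at its first apostrophe.

    The apostrophe is attached to the shorter part, or to the first
    part when both are the same length. Words without an apostrophe,
    or with one at the very start or end, are returned unsplit.
    """
    word = word.replace("’", "'")  # treat curly and straight alike
    left, sep, right = word.partition("'")
    if not sep or not left or not right:
        return [word]
    if len(left) <= len(right):
        return [left + "'", right]
    return [left, "'" + right]
```

So "l'altre" gives "l'" and "altre", while "porta'l" gives "porta" and "'l".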
1 person has voted this message useful
| tricoteuse Pentaglot Senior Member Norway littlang.blogspot.co Joined 6676 days ago 745 posts - 845 votes Speaks: Swedish*, Norwegian, EnglishC1, Russian, French Studies: Ukrainian, Bulgarian
| Message 80 of 131 27 June 2013 at 12:45pm | IP Logged |
I'm really glad I didn't discover this now just to realize it's been around for several
years, because that would make me a bit pissed that I'd been missing out ;) The site
looks great. I haven't read through the entire thread here, but how do you add new
languages? By request, by popularity?
1 person has voted this message useful