luhmann Senior Member Brazil Joined 5344 days ago 156 posts - 271 votes Speaks: Portuguese* Studies: Mandarin, French, English, Italian, Spanish, Persian, Arabic (classical)
| Message 9 of 14 21 March 2014 at 3:46pm | IP Logged |
For tagging gender in French, you could use the database from Lexique.org, which is a dictionary of inflected forms containing grammatical information.
You may also want to experiment with treebanks:
http://en.wikipedia.org/wiki/Treebank
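For the Lexique idea, a lookup along these lines might be a starting point. This is only a sketch: the inline sample stands in for the real download, and the column names (ortho, lemme, cgram, genre) and the "NOM" tag are assumptions to check against the header of the file you actually get.

```python
import csv
import io

# Tiny inline sample standing in for a Lexique export. The column names
# (ortho, lemme, cgram, genre) are an assumption; check the header of the
# file you actually download.
SAMPLE_TSV = (
    "ortho\tlemme\tcgram\tgenre\n"
    "maison\tmaison\tNOM\tf\n"
    "maisons\tmaison\tNOM\tf\n"
    "livre\tlivre\tNOM\tm\n"
    "livre\tlivre\tNOM\tf\n"
    "mange\tmanger\tVER\t\n"
)

def load_genders(handle):
    """Map each noun form to the set of genders listed for it."""
    genders = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["cgram"] == "NOM" and row["genre"]:
            genders.setdefault(row["ortho"], set()).add(row["genre"])
    return genders

def tag_gender(words, genders):
    """Pair each word with its sorted gender codes, or None if unknown."""
    return [(w, sorted(genders[w]) if w in genders else None) for w in words]

genders = load_genders(io.StringIO(SAMPLE_TSV))
tagged = tag_gender(["maison", "livre", "mange"], genders)
print(tagged)
```

Keeping a set per form matters because some nouns, such as "livre", exist in both genders with different meanings, so a plain lookup cannot always decide on its own.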
2 persons have voted this message useful
| chokofingrz Pentaglot Senior Member England Joined 5200 days ago 241 posts - 430 votes Speaks: English*, French, Spanish, German, Italian Studies: Russian, Japanese, Catalan, Luxembourgish
| Message 10 of 14 21 March 2014 at 11:34pm | IP Logged |
Yaan wrote:
There is a Python library called NLTK (Natural Language Toolkit). It has a lot of functionality, including a POS tagger; here is an example:
I find this library pretty interesting and I might have a play with it this weekend, because I am intrigued by the challenge of parsing Russian text to extract some useful "root" vocabulary words rather than all the genitives and datives that Lingocracy wants me to learn!
1 person has voted this message useful
| Yaan Triglot Groupie France Joined 4085 days ago 61 posts - 88 votes Speaks: French*, English, Mandarin Studies: Spanish, Esperanto
| Message 11 of 14 22 March 2014 at 3:05am | IP Logged |
chokofingrz wrote:
Yaan wrote:
There is a Python library called NLTK (Natural Language Toolkit). It has a lot of functionality, including a POS tagger; here is an example:
I find this library pretty interesting and I might have a play with it this weekend, because I am intrigued by the challenge of parsing Russian text to extract some useful "root" vocabulary words rather than all the genitives and datives that Lingocracy wants me to learn!
Here is the NLTK class that may help you find the stem: nltk.stem.snowball.RussianStemmer
It seems that NLTK's stemmers are based on another project called Snowball. You can try the stemmer here: http://text-processing.com/demo/stem/
However, the stem is not exactly what we want as language learners. For example, in English the stem of "removing" is "remov", but we want to learn "remove".
What you are looking for seems to be the lemma.
Wikipedia wrote:
In morphology and lexicography, a lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words (headword). In English, for example, run, runs, ran and running are forms of the same lexeme, with run as the lemma.
Wikipedia entry for Lemma: http://en.wikipedia.org/wiki/Lemma_(morphology)
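To make the stem/lemma difference concrete, here is a toy comparison. It is not Snowball and not a real lemmatizer: the suffix list and the lemma table are made up for illustration, and crude_stem is only a stand-in for what a proper stemmer does.

```python
# Toy data: both the suffix list and the lemma table are illustrative
# assumptions, not what Snowball or any real lemmatizer uses.
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {"removing": "remove", "removed": "remove", "removes": "remove",
          "ran": "run", "running": "run", "runs": "run"}

def crude_stem(word):
    """Strip the first matching suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemma(word):
    """Look the form up in the table, falling back to the word itself."""
    return LEMMAS.get(word, word)

for form in ("removing", "running", "ran"):
    print(form, crude_stem(form), lemma(form))
```

The stemmer only chops characters, so it can produce non-words and cannot connect "ran" to "run". The lemma lookup returns a dictionary form, but only for forms that are in its table.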
Edited by Yaan on 22 March 2014 at 3:05am
2 persons have voted this message useful
| Doitsujin Diglot Senior Member Germany Joined 5331 days ago 1256 posts - 2363 votes Speaks: German*, English
| Message 12 of 14 22 March 2014 at 8:31am | IP Logged |
chokofingrz wrote:
I find this library pretty interesting and I might have a play with it this weekend, because I am intrigued by the challenge of parsing Russian text to extract some useful "root" vocabulary words rather than all the genitives and datives that Lingocracy wants me to learn!
You might find this GitHub site helpful. It hosts English, French, Italian, German, Spanish, Portuguese, Polish and Russian inflection lists, which make reducing inflected forms to their canonical forms relatively easy.
The entries have the following format:
Code:
окно: окно, окна, окна, окон, окну, окнам, окно, окна, окном, окнами, окне, окнах
As you can see, the list contains some redundant entries. However, the author also provided an undocumented Ruby script that apparently removes these entries. Since I'm not familiar with Ruby, I'm not 100% sure, though.
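For anyone who would rather not rely on the Ruby script, entries in the format shown above can be parsed and de-duplicated in a few lines of Python. This is a sketch based only on the sample entry; the function names are my own, and I have not checked how the actual script behaves.

```python
def parse_entry(line):
    """Split 'lemma: form, form, ...' into (lemma, unique forms in order)."""
    lemma, _, rest = line.partition(":")
    forms = [f.strip() for f in rest.split(",") if f.strip()]
    unique = list(dict.fromkeys(forms))  # drops repeats, keeps first-seen order
    return lemma.strip(), unique

def build_index(lines):
    """Map every inflected form to the set of lemmas it can belong to."""
    index = {}
    for line in lines:
        if ":" not in line:
            continue  # skip blank or malformed lines
        lemma, forms = parse_entry(line)
        for form in forms:
            index.setdefault(form, set()).add(lemma)
    return index

entry = "окно: окно, окна, окна, окон, окну, окнам, окно, окна, окном, окнами, окне, окнах"
lemma, forms = parse_entry(entry)
index = build_index([entry])
print(lemma, len(forms), index["окнами"])
```

The index maps each form to a set of lemmas rather than a single one, because the same spelling can be an inflected form of more than one word.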
4 persons have voted this message useful
| Yaan Triglot Groupie France Joined 4085 days ago 61 posts - 88 votes Speaks: French*, English, Mandarin Studies: Spanish, Esperanto
| Message 13 of 14 22 March 2014 at 12:02pm | IP Logged |
Doitsujin wrote:
You might find this GitHub site helpful. It hosts English, French, Italian, German, Spanish, Portuguese, Polish and Russian inflection lists, which make reducing inflected forms to their canonical forms relatively easy.
Very interesting resource! Thank you for sharing that :)
I wonder what the source of those dictionaries is and what methodology was used to build them. It would be great if more languages were supported.
The method is a bit "brute force": listing every possible inflected form leads to huge dictionary files. The Polish and Russian dictionaries are 57 MB and 64 MB respectively.
1 person has voted this message useful
| chokofingrz Pentaglot Senior Member England Joined 5200 days ago 241 posts - 430 votes Speaks: English*, French, Spanish, German, Italian Studies: Russian, Japanese, Catalan, Luxembourgish
| Message 14 of 14 22 March 2014 at 4:10pm | IP Logged |
Doitsujin wrote:
You might find this GitHub site helpful. It hosts English, French, Italian, German, Spanish, Portuguese, Polish and Russian inflection lists, which make reducing inflected forms to their canonical forms relatively easy.
Great, this is precisely what I was attempting to piece together in a massive spreadsheet last night!
I also have a POS-tagged frequency list of Russian lemmas from here, so ultimately I'm hoping to parse some text, stem the words, and match them against lemmas to generate a frequency-sorted vocabulary list.
If I get anywhere I'll share the results here.
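One possible shape for that pipeline, as a rough sketch: tokenize, map each token to a lemma, count, and sort. The FORM_TO_LEMMA table here is hypothetical and hand-typed. In practice it would come from an inflection list or a stemmer plus lemma matching. Note that this version sorts by how often each lemma occurs in the text itself, not by an external frequency list.

```python
import re
from collections import Counter

# Hypothetical form -> lemma table; in practice it would come from an
# inflection list or a lemmatizer, not be typed in by hand.
FORM_TO_LEMMA = {"окна": "окно", "окном": "окно", "окно": "окно",
                 "дома": "дом", "дом": "дом"}

def vocabulary(text, form_to_lemma):
    """Count lemmas in text; return (lemma, count) pairs, most frequent first."""
    tokens = re.findall(r"\w+", text.lower())
    # Tokens missing from the table are silently dropped in this sketch.
    counts = Counter(form_to_lemma[t] for t in tokens if t in form_to_lemma)
    return counts.most_common()

text = "Окна дома. Дом с окном, окно."
vocab = vocabulary(text, FORM_TO_LEMMA)
print(vocab)
```

Dropping unknown tokens keeps the sketch short; collecting them in a separate list instead would show which words the lemma table is still missing.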
1 person has voted this message useful