Lemmatization, for GL or SRS any comments (Learning Techniques, Methods & Strategies) Language Learning Forum

Tags: Spaced Repetition (SRS)

Obogrew
Triglot
Newbie
United States
Joined 3275 days ago
6 posts - 7 votes
Speaks: Russian*, English, German
Studies: Modern Hebrew, French, Spanish, Serbian

Message 1 of 6

29 January 2016 at 12:03am

I've recently discovered GLM, Anki, HTLAL, etc., and have started trying all the methods
to find out which one is best.
I came up with the idea of creating a word list based on a text that I am going to read.

For example, I would like to read a book that I have in digital format. The next step
would be to convert every single word into its "base form". I found out that the base
form is called a "lemma" and the conversion process is called "lemmatization."

After lemmatization I will get a list of words and will do an initial distillation of
them, filtering out the known words. All the unknown ones I can learn with SRS, Goldlist
or some other method.

For the second book it will be faster: after lemmatization I could filter out all
known or "goldlisted" words automatically.
After the initial learning with SRS or Goldlist I could start reading the book.
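The workflow above (lemmatize, count, filter out known words) can be sketched roughly as follows. This is a minimal illustration, not a real lemmatizer: `TOY_LEMMAS`, `lemmatize` and `unknown_lemmas` are hypothetical names, and the lookup table is a stand-in for whatever lemmatization tool is actually available for the language.

```python
import re
from collections import Counter

# ASSUMPTION: a toy lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {
    "books": "book",
    "reading": "read",
    "read": "read",
    "words": "word",
}


def lemmatize(token):
    """Return the lemma for a token, or the token itself if it is not in the table."""
    return TOY_LEMMAS.get(token, token)


def unknown_lemmas(text, known):
    """Return (lemma, count) pairs for lemmas not in `known`, most frequent first."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(lemmatize(t) for t in tokens)
    return [(lemma, n) for lemma, n in counts.most_common() if lemma not in known]


# The set of known (or already "goldlisted") lemmas grows from book to book.
known = {"i", "book"}
print(unknown_lemmas("Reading books, I read words.", known))
```

For the second book, the same `known` set would simply be extended with whatever was learned from the first one.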

What do you think about that?

The question is: what is the easiest way to do lemmatization? I was able to find a tool
for Romance languages (google: meaningcloud), but I would have to develop a lemmatization
tool using their API, which is not completely free of charge.

Do you have an idea how to create a list of lemmas in German, Hebrew or Turkish? I was not
able to find lemmatizers for those languages. Or is there an easier way to do that, without
developing any software?

Splitting the text with a scripting language and calling an online/offline dictionary does
not really help, since all languages are more or less isolated. And the meaning of
phrasal verbs will be lost.
2 persons have voted this message useful

chaotic_thought
Diglot
Senior Member
United States
Joined 3541 days ago
129 posts - 274 votes

Speaks: English*, German
Studies: Dutch, French

Message 2 of 6

29 January 2016 at 9:01am

Automatic lemmatization is useful when doing statistical analysis of texts, i.e., for texts that you are not actually going to read. For learning, on the other hand, it is recommended that you actually read the texts, so I wouldn't worry about lemmatization.

As a simple example, consider learning about physics and you come across these novel terms in your reading (which you've noted manually):

extrapolated
particle fields
high density
wave function
higher brain functions
brain wave

The above sort of list is already what you need in order to remember the terms and how they're spelled. Lemmatization would likely give you something else like this:

BRAIN
DENSITY
EXTRAPOLAT
FIELD
FUNCTION
HIGH
PARTICLE
WAVE

Studying the lemmatized list in Anki or using flash cards is unlikely to help you remember how to correctly use a real term like "higher brain functions". In short, automatic lemmatization is useful for dumb computer algorithms that need to analyze billions of words of text.

For the human way to do this, my favorite method is the "highlighter method". You simply read the text and highlight the new terms. When you finish the book, you go back and make a list of the interesting ones (you could also list all of them if you want, but it is more work). In Anki, I normally define an "index" field for this, so for example, I would write the following for my above fictional example:

extrapolate_1,      extrapolate,      "someone extrapolated something"
particle_field_1,   particle field,   "high density particle fields"
high_density_1,     high density,     "high density particle fields"
wave_function_1,    wave function,    "this is the primary wave function"
brain_wave_1,       brain wave,       "what are brain waves?"
higher_brain_functions_1, higher brain functions, "use of this drug can impair one's higher brain functions"

I use the suffix "_1" to keep apart the different meanings of the same term. For example, wave_1 might be the physicist's notion of a wave, while wave_2 might be the motion you make with your hand to get someone's attention. And wave_3 might be the motion you make to a child to tell him to go away while you're working. Etc. Increment the number whenever you've found a new use of that word. The point is, you arrange the words in whatever way makes sense to you to learn to use them properly.

The nice thing about the index field in Anki is that the interface warns you if you've already got one, so if I much later add another "wave" entry, I'll go back and make sure my new example is actually a new instance and not one that I've already got.
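The numbering scheme for the index field could be automated along these lines. This is only a sketch of the naming convention described above, kept separate from Anki itself: `next_index_key` is a hypothetical helper, and the set of existing keys is assumed to have been collected by hand or from an export.

```python
import re


def next_index_key(term, existing_keys):
    """Build the next free key for `term`, e.g. 'wave' -> 'wave_3' if wave_1 and wave_2 exist."""
    # Normalize the term: lowercase, with spaces turned into underscores.
    base = re.sub(r"\s+", "_", term.strip().lower())
    pattern = re.compile(re.escape(base) + r"_(\d+)")
    used = []
    for key in existing_keys:
        # fullmatch ensures 'wave_function_1' is not counted as a 'wave' entry.
        match = pattern.fullmatch(key)
        if match:
            used.append(int(match.group(1)))
    return f"{base}_{max(used, default=0) + 1}"


keys = {"wave_1", "wave_2", "wave_function_1"}
print(next_index_key("wave", keys))
print(next_index_key("brain wave", keys))
```

Whether a new sentence really is a new use of the word still has to be judged by hand; the helper only picks the next number.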

Edited by chaotic_thought on 29 January 2016 at 9:07am
2 persons have voted this message useful

Speakeasy
Senior Member
Canada
Joined 4051 days ago
507 posts - 1098 votes

Studies: German

Message 3 of 6

29 January 2016 at 12:22pm

This is a TRULY FASCINATING discussion! I do not wish to appear disloyal to HTLAL; however, it seems that the "more active discussions" now take place on the new/replacement forum: A Language Learners' Forum. Obogrew, I suggest that you register and post your question there as well and, assuming that you do so, I wonder if chaotic_thought would repost his reply.
1 person has voted this message useful

Obogrew
Triglot
Newbie
United States
Joined 3275 days ago
6 posts - 7 votes
Speaks: Russian*, English, German
Studies: Modern Hebrew, French, Spanish, Serbian

Message 4 of 6

30 January 2016 at 7:20am

chaotic_thought wrote:

As a simple example, consider learning about physics and you come across these novel
terms in your reading (which you've noted manually):

extrapolated
particle fields
high density
wave function
higher brain functions
brain wave

True, but it depends on the goal. It looks like your goal is more to learn physics
than the language. In that case focusing on 'wave function' really makes sense.

My target is mostly the vocabulary. What I would like to do is just read a book
and be prepared for all the vocabulary that I will encounter. I don't want to come back
to the text and read it again, or look things up in a dictionary.

I would like to parse 10 books, make a first distillation, goldlist the new words for 1-2
months and then enjoy reading those 10 books.

chaotic_thought wrote:

The above sort of list is already what you need in order to remember the terms and how
they're spelled. Lemmatization would likely give you something else like this:

BRAIN
DENSITY
EXTRAPOLAT
FIELD
FUNCTION
HIGH
PARTICLE
WAVE

That looks to me like stemming. What I would actually expect from lemmatization:

Out of the sentence "I will be looking carefully after my children and take it over
from my wife that she takes advantage"

it creates a wordlist = {I, to look after, careful, child, to take over, wife}; getting
'to take advantage' as well would be a real advantage.

And in German, out of "Ich hatte vor in Stille etwas zu sagen und mach' mein Türchen schnell
zu."
wordlist = {ich, vorhaben, Stille, sagen, zumachen, Tür, schnell}

I am not sure whether such an intelligent lemmatizer really exists. That's why I am asking.
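The difference between stemming and lemmatization can be shown with a toy comparison. Both functions below are illustrative assumptions: `crude_stem` is a deliberately naive suffix stripper (not any published stemming algorithm), and `lookup_lemma` uses a tiny hand-made table in place of a real morphological dictionary.

```python
def crude_stem(word):
    """Chop off a common English suffix; the result need not be a real word."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# ASSUMPTION: hand-made table; a real lemmatizer needs a full dictionary
# (and usually part-of-speech information) for the language.
TOY_LEMMAS = {
    "extrapolated": "extrapolate",
    "children": "child",
    "takes": "take",
    "looking": "look",
}


def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ("extrapolated", "children", "takes", "looking"):
    print(w, crude_stem(w), lookup_lemma(w))
```

Neither function handles multi-word units such as "to look after" or separable German verbs such as "zumachen"; recognizing those requires looking at the whole sentence, which is what makes the wished-for lemmatizer hard.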

1 person has voted this message useful

Obogrew
Triglot
Newbie
United States
Joined 3275 days ago
6 posts - 7 votes
Speaks: Russian*, English, German
Studies: Modern Hebrew, French, Spanish, Serbian

Message 5 of 6

30 January 2016 at 7:21am

Speakeasy wrote:

This is a TRULY FASCINATING discussion! I do not wish to appear
disloyal to the HTLAL; however, it seems that the "more active discussions" now take
place on the new/replacement forum: A Language
Learners' Forum. Obogrew, I suggest that you register and post your
question there, as well and, assuming that you do so, I wonder if chaotic thought
would repost his reply.

Will do that. Looks like that forum is more alive.
1 person has voted this message useful

luhmann
Senior Member
Brazil
Joined 5332 days ago
156 posts - 271 votes

Speaks: Portuguese*
Studies: Mandarin, French, English, Italian, Spanish, Persian, Arabic (classical)

Message 6 of 6

31 January 2016 at 10:42pm

My efforts on a similar idea were frustrated by the lack of accuracy of the lemmatizers I was able to find for my languages. Afterwards I made an SRS system that would run on unique orthographic forms, regardless of lemmatization. It turned out to be a very good way to pick up a language from scratch, but it becomes increasingly inefficient as you progress. Ever since I've been able to read more advanced texts, I've only been manually plucking words from my reading.
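Working on unique orthographic forms, without any lemmatization, might look something like the sketch below. It is an assumption about the general approach, not the system described above: `new_form_cards` is a hypothetical name, and the sentence splitting is deliberately simplistic.

```python
import re


def new_form_cards(text, seen_forms):
    """Return (form, sentence) pairs for every written form not seen before.

    `seen_forms` is updated in place, so it can be reused for the next text.
    """
    cards = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for form in re.findall(r"[^\W\d_]+", sentence.lower()):
            if form not in seen_forms:
                seen_forms.add(form)
                cards.append((form, sentence))
    return cards


seen = {"the"}
for form, sentence in new_form_cards("The cat sleeps. The cats sleep!", seen):
    print(form, "|", sentence)
```

Here "cat" and "cats", and "sleeps" and "sleep", each get their own card, which hints at why the approach gets less efficient as the number of known words grows.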

2 persons have voted this message useful
