Obogrew Triglot Newbie United States Joined 3277 days ago 6 posts - 7 votes Speaks: Russian*, English, German Studies: Modern Hebrew, French, Spanish, Serbian
| Message 1 of 6 29 January 2016 at 12:03am | IP Logged |
I've recently discovered GLM, Anki, HTLAL, etc., and started trying all the methods to
find out which one works best.
I came up with the idea of creating a word list based on the text that I am going to read.
For example, I would like to read a book and I have it in digital format. The next step
would be to convert every single word into its "base form". I found out that the base form
is called a "lemma" and the conversion process is called "lemmatization".
After lemmatization I will get a list of words and do an initial distillation of
them, filtering out the known words. All the unknown ones I can learn with SRS, Goldlist or
some other method.
For the second book it will be faster: after lemmatization, I could filter out all
known or "goldlisted" words automatically.
After the initial learning with SRS or Goldlist I could start reading the book.
What do you think about that?
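To make the idea concrete, here is a minimal Python sketch of the workflow described above: tokenize, lemmatize, filter out known words, then reuse the grown list of known words for the second book. The `LEMMAS` table is a toy stand-in invented for illustration; a real setup would need an actual lemmatizer or morphological dictionary for the target language, and all function names here are hypothetical.

```python
import re

# Toy lemma table, invented for this example only. A real one would come
# from a lemmatizer or a morphological dictionary for the target language.
LEMMAS = {
    "children": "child",
    "books": "book",
    "reading": "read",
    "were": "be",
}

def tokenize(text):
    """Split text into lowercase word tokens (letters only, optional apostrophe part)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())

def lemmatize(token):
    """Look the token up in the toy table; fall back to the token itself."""
    return LEMMAS.get(token, token)

def unknown_lemmas(text, known):
    """Return lemmas not in `known`, in order of first appearance, without repeats."""
    seen = set()
    result = []
    for token in tokenize(text):
        lemma = lemmatize(token)
        if lemma not in known and lemma not in seen:
            seen.add(lemma)
            result.append(lemma)
    return result

# First book: start from a small set of already known lemmas.
known = {"the", "be", "book"}
first = unknown_lemmas("The children were reading books.", known)

# After learning them, add them to the known set so the second book is faster.
known.update(first)
second = unknown_lemmas("The children read the newspaper.", known)

print(first)   # lemmas to learn before book one
print(second)  # only the genuinely new lemmas from book two
```

The quality of the result depends entirely on the lemmatizer plugged into `lemmatize`; the filtering step itself is just set membership.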
The question is: what is the easiest way to do lemmatization? I was able to find a tool
for Romance languages (google: meaningcloud), but I would have to develop a lemmatization
tool using their APIs, which are not completely free of charge.
Do you have any idea how to create a list of lemmas in German, Hebrew, or Turkish? I was not
able to find lemmatizers for those languages. Or is there an easier way to do that, without
developing any software?
Splitting the text with a scripting language and looking each word up in an online/offline
dictionary does not really help, since all languages are more or less isolated. And the
meaning of phrasal verbs would be lost.
|
chaotic_thought Diglot Senior Member United States Joined 3543 days ago 129 posts - 274 votes Speaks: English*, German Studies: Dutch, French
| Message 2 of 6 29 January 2016 at 9:01am | IP Logged |
Automatic lemmatization is useful when doing statistical analysis of texts, i.e., for texts that you are not actually going to read. For learning, on the other hand, it is recommended that you actually read the texts, so I wouldn't worry about lemmatization.
As a simple example, consider learning about physics and you come across these novel terms in your reading (which you've noted manually):
extrapolated
particle fields
high density
wave function
higher brain functions
brain wave
The above sort of list is already what you need in order to remember the terms and how they're spelled. Lemmatization would likely give you something else like this:
BRAIN
DENSITY
EXTRAPOLAT
FIELD
FUNCTION
HIGH
PARTICLE
WAVE
Studying the lemmatized list in Anki or using flash cards is unlikely to help you remember how to correctly use a real term like "higher brain functions". In short, automatic lemmatization is useful for dumb computer algorithms that need to analyze billions of words of text.
For the human way to do this, my favorite method is the "highlighter method". You simply read the text and highlight the new terms. When you finish the book, you go back and make a list of the interesting ones (you could also list all of them if you want, but it is more work). In Anki, I normally define an "index" field for this, so for example, I would write the following for my above fictional example:
extrapolate_1, extrapolate, "someone extrapolated something"
particle_field_1, particle field, "high density particle fields"
high_density_1, high density, "high density particle fields"
wave_function_1, wave function, "this is the primary wave function"
brain_wave_1, brain wave, "what are brain waves?"
higher_brain_functions_1, higher brain functions, "use of this drug can impair one's higher brain functions"
I use the suffix "_1" to group uses of a term that mean the same thing and to keep different senses apart. For example, wave_1 might be the physicist's notion of a wave, while wave_2 might be the motion you make with your hand to get someone's attention. And wave_3 might be the motion you make to a child to tell him to go away while you're working. Etc. Increment the number whenever you've found a new use of that word. The point is, you arrange the words in whatever way makes sense to you to learn to use them properly.
The nice thing about the index field in Anki is that the interface warns you if you've already got one, so if I much later add another "wave" entry, I'll go back and make sure my new example is actually a new instance and not one that I've already got.
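The numbering scheme can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not Anki's API: the `TermIndex` class and its methods are hypothetical names, and the "check before adding" step stands in for the duplicate warning mentioned above.

```python
class TermIndex:
    """Hypothetical helper that hands out index keys like wave_1, wave_2."""

    def __init__(self):
        self._senses = {}  # term -> list of example sentences, one per sense

    @staticmethod
    def _base(term):
        """Turn 'brain wave' into 'brain_wave' for use in the key."""
        return term.replace(" ", "_")

    def existing(self, term):
        """Return (key, example) pairs already stored for this term."""
        examples = self._senses.get(term, [])
        return [(f"{self._base(term)}_{i}", ex)
                for i, ex in enumerate(examples, start=1)]

    def add(self, term, example):
        """Store a new sense of the term and return its key."""
        examples = self._senses.setdefault(term, [])
        examples.append(example)
        return f"{self._base(term)}_{len(examples)}"


index = TermIndex()
k1 = index.add("wave", "this is the primary wave function")

# Before adding another "wave", review what is already there and decide
# whether the new example is really a new sense.
already = index.existing("wave")
k2 = index.add("wave", "she gave him a wave from across the street")
k3 = index.add("brain wave", "what are brain waves?")

print(k1, k2, k3)
```

Whether a new example counts as a new sense is still a human decision; the helper only keeps the numbering consistent.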
Edited by chaotic_thought on 29 January 2016 at 9:07am
|
Speakeasy Senior Member Canada Joined 4053 days ago 507 posts - 1098 votes Studies: German
| Message 3 of 6 29 January 2016 at 12:22pm | IP Logged |
This is a TRULY FASCINATING discussion! I do not wish to appear disloyal to the HTLAL; however, it seems that the "more active discussions" now take place on the new/replacement forum: A Language Learners' Forum. Obogrew, I suggest that you register and post your question there as well and, assuming that you do so, I wonder if chaotic_thought would repost his reply.
|
Obogrew Triglot Newbie United States Joined 3277 days ago 6 posts - 7 votes Speaks: Russian*, English, German Studies: Modern Hebrew, French, Spanish, Serbian
| Message 4 of 6 30 January 2016 at 7:20am | IP Logged |
chaotic_thought wrote:
As a simple example, consider learning about physics and you come across these novel
terms in your reading (which you've noted manually):
extrapolated
particle fields
high density
wave function
higher brain functions
brain wave
|
|
|
True, but it depends on the goal. It looks like your goal is more to learn physics
than the language. In that case, focusing on 'wave function' really makes sense.
My target is mostly the vocabulary. What I would like to do is just read a book
and be prepared for all the vocabulary that I will encounter. I don't want to come back to
the text and read it again, or look things up in a dictionary.
I would like to parse 10 books, do a first distillation, goldlist the new words for 1-2
months, and then enjoy reading those 10 books.
chaotic_thought wrote:
The above sort of list is already what you need in order to remember the terms and how
they're spelled. Lemmatization would likely give you something else like this:
BRAIN
DENSITY
EXTRAPOLAT
FIELD
FUNCTION
HIGH
PARTICLE
WAVE
|
|
|
That looks to me like stemming. What I would actually expect from lemmatization is this:
from the sentence "I will be looking carefully after my children and take it over
from my wife that she takes advantage"
it creates a wordlist = {I, to look after, careful, child, to take over, wife}; getting 'to
take advantage' as well would be a real advantage.
And in German: "Ich hatte vor in Stille etwas zu sagen und mach' mein Türchen schnell
zu."
wordlist = {ich, vorhaben, Stille, sagen, zumachen, Tür, schnell}
I am not sure whether such an intelligent lemmatizer really exists. That's why I am asking.
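The difference between stemming and lemmatization can be shown with a small Python sketch. Both functions below are deliberately naive and invented for illustration: `crude_stem` is plain suffix stripping, not a real stemming algorithm, and `LEMMA_TABLE` is a hand-written toy lookup, not the output of an actual lemmatizer.

```python
# Suffixes tried in order; the first match is stripped.
SUFFIXES = ("ed", "ing", "s")

def crude_stem(word):
    """Naive suffix stripping: cut a known ending off, with no dictionary check."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so short words are not destroyed.
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

# Toy lookup table, written by hand for this example only.
LEMMA_TABLE = {
    "extrapolated": "extrapolate",
    "children": "child",
    "takes": "take",
    "looking": "look",
}

def lookup_lemma(word):
    """Dictionary-based lemma lookup; unknown words are returned unchanged."""
    w = word.lower()
    return LEMMA_TABLE.get(w, w)

words = ["extrapolated", "children", "takes", "looking"]
stems = [crude_stem(w) for w in words]
lemmas = [lookup_lemma(w) for w in words]

for word, stem, lemma in zip(words, stems, lemmas):
    print(f"{word:14} stem={stem:12} lemma={lemma}")
```

The stem of "extrapolated" comes out as the non-word "extrapolat", and "children" is left untouched, while the lookup returns dictionary forms. Note that neither approach recovers multi-word units such as "to look after" or separable verbs such as "zumachen"; that would need syntactic analysis on top of word-level lemmatization.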
|
Obogrew Triglot Newbie United States Joined 3277 days ago 6 posts - 7 votes Speaks: Russian*, English, German Studies: Modern Hebrew, French, Spanish, Serbian
| Message 5 of 6 30 January 2016 at 7:21am | IP Logged |
Speakeasy wrote:
This is a TRULY FASCINATING discussion! I do not wish to appear
disloyal to the HTLAL; however, it seems that the "more active discussions" now take
place on the new/replacement forum: A Language
Learners' Forum. Obogrew, I suggest that you register and post your
question there, as well and, assuming that you do so, I wonder if chaotic thought
would repost his reply. |
|
|
Will do that. Looks like that forum is more alive.
|
luhmann Senior Member Brazil Joined 5334 days ago 156 posts - 271 votes Speaks: Portuguese* Studies: Mandarin, French, English, Italian, Spanish, Persian, Arabic (classical)
| Message 6 of 6 31 January 2016 at 10:42pm | IP Logged |
My efforts on a similar idea were thwarted by the lack of accuracy of the lemmatizers I was able to find for my languages. Afterwards I built an SRS system that runs on unique orthographic forms, regardless of lemmatization. It turned out to be a very good way to pick up a language from scratch, but it becomes increasingly inefficient as you progress. Ever since I've been able to read more advanced texts, I've only been manually plucking words from my reading.
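For anyone curious what working on unique orthographic forms looks like in practice, here is a rough Python sketch. It is not the system described above, only an assumed reconstruction of the idea: count every distinct spelling, skip the ones already known, and present the rest most frequent first. The function names and the German sample sentence are made up for the example.

```python
import re
from collections import Counter

def surface_form_counts(text):
    """Count each distinct lowercase spelling in the text, with no lemmatization."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)

def new_forms(text, known_forms):
    """Unknown spellings, most frequent first; ties keep first-seen order."""
    counts = surface_form_counts(text)
    return [form for form, _ in counts.most_common() if form not in known_forms]

sample = "Der Hund sieht den Hund. Die Hunde sehen den Mann."
forms = new_forms(sample, {"der", "die", "den"})

print(forms)
```

Here "hund" and "hunde", and likewise "sieht" and "sehen", end up as separate items. That is cheap and robust for a beginner, but it also shows why the approach gets less efficient later: every inflected form of a word you already know becomes a new card.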
|