Albanian with natural language processing

Language Learning Forum : Language Learning Log
Eginhard
Pentaglot
Newbie
France
enno-hermann.blogspo
Joined 4492 days ago

12 posts - 23 votes
Speaks: German*, French, English, Croatian, Lithuanian
Studies: Albanian

 
 Message 1 of 10
27 October 2014 at 5:23pm
If you know Albanian, please don't correct me or give any hints, thanks!

Motivated by my university courses in natural language processing (NLP), I've
recently started a new project: learning Albanian. I chose Albanian deliberately,
mainly because I know (almost) nothing about the language. It is still distantly
related to the other Indo-European languages, though, so what I have in mind
should not be completely impossible. I also like the Balkans in general: I have
been there several times over the last 1.5 years and have also been learning Croatian.

I'm planning to learn Albanian using (almost) only Wikipedia.
Specifically, I'll use only target-language resources and will not consult any
dictionaries or grammars. I've compiled a corpus of about 10 million words in about
0.5 million sentences from the Albanian Wikipedia. I will not simply read all the
articles; instead I'll use NLP techniques to find, e.g., the most common words and
the collocations of specific words, in order to decipher vocabulary and grammar.
This is mainly an experiment. I know very well that this is not a very efficient
way to learn a language; I just want to find out what is possible this way.
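
As an illustration of the two techniques named above (most common words and
collocations of a specific word), here is a minimal Python sketch. It is not the
author's implementation, which the post does not show. The function names
`top_words` and `collocates`, the window size, and the toy English sentences
standing in for the Wikipedia corpus are all assumptions.

```python
import re
from collections import Counter

def tokenize(sentence):
    # Lowercase and keep runs of Unicode letters (so ë and ç survive);
    # digits and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", sentence.lower())

def top_words(sentences, n=10):
    """Return the n most frequent word forms with their counts."""
    counts = Counter()
    for s in sentences:
        counts.update(tokenize(s))
    return counts.most_common(n)

def collocates(sentences, target, window=2, n=10):
    """Count the words that occur within `window` tokens of `target`."""
    counts = Counter()
    for s in sentences:
        toks = tokenize(s)
        for i, tok in enumerate(toks):
            if tok == target:
                left = toks[max(0, i - window):i]
                right = toks[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts.most_common(n)

# Toy stand-in corpus (hypothetical, not taken from the Albanian Wikipedia).
toy = [
    "the city is in the north",
    "the city has a port",
    "a port is in the city",
]
print(top_words(toy, 3))
print(collocates(toy, "city", window=1))
```

On a real corpus the raw collocation counts would usually be normalised against
overall word frequency so that very common words do not dominate; raw counts
keep the sketch short.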

I have a blog (http://enno-hermann.blogspot.com/) where I'll post more general
discoveries and more technical details (I'm implementing the NLP algorithms myself
for practice). I will use this log to share smaller discoveries and to allow better
discussion of my experiment. There is also a Google Spreadsheet where I compile my
findings.

Edited by Eginhard on 19 November 2014 at 12:28pm

9 persons have voted this message useful



clmns01
Diglot
Newbie
Austria
Joined 3493 days ago

22 posts - 23 votes
Speaks: German*, English
Studies: Portuguese, Italian

 
 Message 2 of 10
27 October 2014 at 5:28pm
Wow, this sounds like a crazy adventure! Good luck with it!
1 person has voted this message useful



Eginhard

 
 Message 3 of 10
28 October 2014 at 2:27pm
The abundance of geography stubs on Wikipedia has already allowed me to identify
some basic sentence structures, several nouns and three cases, which I'll call
Nominative, Genitive and Locative. More details are in the spreadsheet linked
above.

For example, there are a lot of sentences like these:
X is a city in region Y in country Z.
X is the capital of Y.
X has an area of Y and Z inhabitants.
...

Cases:
Nominative: qyteti (city)
Genitive: e qytetit (of the city)
Locative: në qytetin (in the city)
(But there seem to be several declensions.)
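
One way such recurring sentence structures could be surfaced automatically is to
replace names and numbers with placeholders and count the resulting templates.
The sketch below is only an assumption about how this might be done, not the
method used in the post. The function name `template`, the slot labels `NAME`
and `NUM`, and the English toy sentences (including the made-up "Xtown") are
hypothetical stand-ins.

```python
import re
from collections import Counter

def template(sentence):
    """Reduce a sentence to a template: capitalised words become NAME,
    numbers become NUM, everything else is kept in lowercase."""
    slots = []
    for tok in sentence.split():
        core = tok.strip(".,;:()")
        if not core:
            continue
        if re.fullmatch(r"\d[\d.,]*", core):
            slot = "NUM"
        elif core[0].isupper():
            slot = "NAME"
        else:
            slot = core.lower()
        # Collapse multi-word names ("New York") into a single NAME slot.
        if slot == "NAME" and slots and slots[-1] == "NAME":
            continue
        slots.append(slot)
    return " ".join(slots)

# Toy stand-in sentences (hypothetical, English instead of Albanian).
toy = [
    "Paris is a city in France.",
    "Lyon is a city in France.",
    "Berlin is the capital of Germany.",
    "Xtown has 12.345 inhabitants.",
]
template_counts = Counter(template(s) for s in toy)
for tpl, count in template_counts.most_common():
    print(count, tpl)
```

Treating every capitalised word as a name is crude (it would also catch any
capitalised common noun), so the output is a list of candidates to inspect by
hand rather than a finished analysis.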

Edited by Eginhard on 28 October 2014 at 2:27pm

1 person has voted this message useful





Iversen
Super Polyglot
Moderator
Denmark
berejst.dk
Joined 6507 days ago

9078 posts - 16473 votes 
Speaks: Danish*, French, English, German, Italian, Spanish, Portuguese, Dutch, Swedish, Esperanto, Romanian, Catalan
Studies: Afrikaans, Greek, Norwegian, Russian, Serbian, Icelandic, Latin, Irish, Lowland Scots, Indonesian, Polish, Croatian

 
 Message 4 of 10
28 October 2014 at 5:18pm
Wow, a field linguist among us! Good luck with your experiment.
1 person has voted this message useful



Eginhard

 
 Message 5 of 10
29 October 2014 at 10:25am
Well, I have no training in field linguistics, although I find it very interesting,
and this is a very nice intellectual exercise. But I guess it's actually rather the
opposite of traditional field linguistics, because I'm not in the field at all and
already have a huge amount of data readily available.

As for Albanian, I have learned the names of 7 months: those that are similar to the
English names, the current one, and February, which was easy to identify as the one
that never occurs with 30 or 31. I've grouped the remaining ones into 30-day and
31-day months and will try to come up with some way to identify them other than
looking up events whose dates I know. This would be a lot easier with a literary
corpus, where it should be possible to at least link months to their seasons.
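
The month-length trick described above can be sketched roughly as follows: for
each candidate month word, record the largest day number seen directly in front
of it. This is an illustrative sketch, not the author's code. It assumes dates
are written as "day month" with the number immediately before the month word,
and it uses English month names and invented toy sentences as stand-ins; the
names `max_day_per_month` and `classify` are hypothetical.

```python
import re
from collections import defaultdict

def max_day_per_month(sentences, months):
    """For each month word, return the largest day number (1-31)
    that appears directly before it anywhere in the sentences."""
    alternatives = "|".join(map(re.escape, months))
    pattern = re.compile(r"\b(\d{1,2})\s+(" + alternatives + r")\b")
    seen = defaultdict(int)
    for s in sentences:
        for day, month in pattern.findall(s):
            d = int(day)
            if 1 <= d <= 31:
                seen[month] = max(seen[month], d)
    return dict(seen)

def classify(max_day):
    """Turn the largest observed day into a guess about the month length."""
    if max_day >= 31:
        return "31 days"
    if max_day == 30:
        return "30 days"
    return "short month (February?)"

# Toy stand-in data (hypothetical sentences, English month names).
months = ["January", "February", "April"]
toy = [
    "Born on 31 January 1900.",
    "Opened on 12 January 1901.",
    "Died on 30 April 1950.",
    "Founded on 28 February 1920.",
]
result = max_day_per_month(toy, months)
for month, day in result.items():
    print(month, day, classify(day))
```

The absence of "31" next to a month is only evidence, not proof, so the guess
becomes more reliable the more dates the corpus contains.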
1 person has voted this message useful





Iversen

 
 Message 6 of 10
29 October 2014 at 3:09pm
Your project is field linguistics insofar as you try to extract words and grammar directly from utterances in the language. And there is a lot of linguistic research into the best methods for doing that, though mainly from the pre-Chomsky period (e.g. the so-called 'Bloomfield school').

It differs from traditional field linguistics insofar as you can't pose questions to native speakers to clear up murky points.
2 persons have voted this message useful



tristano
Tetraglot
Senior Member
Netherlands
Joined 3851 days ago

905 posts - 1262 votes 
Speaks: Italian*, Spanish, French, English
Studies: Dutch

 
 Message 7 of 10
31 October 2014 at 7:55am
Wooow, looking forward to seeing your progress!
1 person has voted this message useful



Eginhard

 
 Message 8 of 10
02 November 2014 at 9:31pm
I wrote a new blog post about n-gram frequencies and how I use them to find
out more about Albanian. For example, they let me quickly list all the words of a
word family sorted by frequency and then find out which prepositions are used with
which form.
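
A rough sketch of how bigram counts could support this kind of lookup is shown
below. It is not the code from the blog post. It approximates a "word family" by
a shared prefix, which is an assumption, and the toy corpus is built only from
the three forms quoted earlier in this thread (qyteti, e qytetit, në qytetin).
The function names `word_family` and `preceding_words` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield all n-grams of a token list as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def word_family(sentences, prefix):
    """All word forms starting with `prefix`, sorted by frequency."""
    counts = Counter(t for s in sentences for t in s if t.startswith(prefix))
    return counts.most_common()

def preceding_words(sentences, prefix):
    """For each form in the family, count the words seen directly before it."""
    table = {}
    for s in sentences:
        for left, word in ngrams(s, 2):
            if word.startswith(prefix):
                table.setdefault(word, Counter())[left] += 1
    return table

# Toy corpus of pre-tokenised fragments (hypothetical counts).
corpus = [
    ["qyteti"],
    ["në", "qytetin"],
    ["në", "qytetin"],
    ["e", "qytetit"],
    ["qyteti"],
    ["qyteti"],
]
print(word_family(corpus, "qytet"))
for form, lefts in preceding_words(corpus, "qytet").items():
    print(form, lefts.most_common())
```

Prefix matching will miss forms whose stem changes and will pull in unrelated
words that happen to share the prefix, so the result still needs a manual check.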

I feel I have also made quite some progress in the last few days and worked out the
meanings of many words, mainly nouns (all added to the spreadsheet). Identifying the
Albanian equivalents of the USA, the UN, the EU and the like especially helped me
understand some structures. I can also follow some very simple sentences when I'm
familiar with the context, e.g. "Switzerland is a member of X, Y and Z, but not the EU".

Edited by Eginhard on 02 November 2014 at 9:38pm



3 persons have voted this message useful


