
Building your own corpus

Language Learning Forum : Learning Techniques, Methods & Strategies
27 messages over 4 pages
andras_farkas
Tetraglot
Groupie
Hungary
Joined 4843 days ago

56 posts - 165 votes 
Speaks: Hungarian*, Spanish, English, Italian

 
 Message 17 of 27
24 February 2014 at 6:32pm | IP Logged 
I don't think I was trying to shut down anything. I contrasted some of your assertions
and opinions with my own.
Lingoleng noted that I'm in the language business, which is true. How much authority that
gives me on these issues is debatable. I studied linguistics and translation at
university and I've been in this profession for 5+ years, so I have some basic
theoretical knowledge and a good amount of practical experience. That does not make me
the ultimate judge of linguistics & translation issues, and I never claimed to be any
such thing.
5 persons have voted this message useful





emk
Diglot
Moderator
United States
Joined 5475 days ago

2615 posts - 8806 votes 
Speaks: English*, French (B2)
Studies: Spanish, Ancient Egyptian
Personal Language Map

 
 Message 18 of 27
24 February 2014 at 7:28pm | IP Logged 
andras_farkas wrote:
If one were to use this tool for scientific research, one would obviously put the various types of texts (English original, French original) into separate databases. For language learning, this is hardly a vital issue. One tries to read texts that were originally written in the target language, but it's not a hard and fast rule.

I'm sure you know about it, but there's also a very handy online database which works on much the same principle: Linguee (compare anguish and angoisse). We can see that most instances of "anguish" correspond to "angoisse" in French, but going in the other direction, things aren't so tidy: "angoisse" corresponds to "anguish", and "anxiety", and a bunch of other words.

Of course, to really use a database like this effectively, I need to be able to read the surrounding contexts and think about the nuances. But if I'm prepared to do that, I find that a translation database is far more precise and trustworthy than a bilingual dictionary.

As for Bakunin's original idea proposed in this thread—building a searchable corpus of familiar materials—I think it's absolutely brilliant. What better way to get examples of words in familiar contexts? I'd do it myself, if I weren't so frightfully lazy.
3 persons have voted this message useful



andras_farkas
Tetraglot
Groupie
Hungary
Joined 4843 days ago

56 posts - 165 votes 
Speaks: Hungarian*, Spanish, English, Italian

 
 Message 19 of 27
24 February 2014 at 9:24pm | IP Logged 
emk wrote:

I'm sure you know about it, but there's also a very handy online database which works on much the same principle: Linguee (compare anguish and angoisse).

Yes, I know about Linguee. There are numerous other parallel corpora as well, some downloadable, others that can only be queried online. Mymemory.translated.net covers more languages than Linguee, but I believe it's mostly based on EU texts (they crawl the web and collect everything they find, but EU texts dominate in many language combinations). Hunglish.hu is fairly large and fairly varied, but it's English-Hungarian only. http://opus.lingfil.uu.se/bin/opuscqp.pl contains a wide range of material in many languages; the OpenSubtitles subcorpus is perhaps the best for language learners. There's also InterCorp, which I just found out about today. It's supposed to be a large, manually aligned multilingual corpus with many literary texts, so it should work really well for language learning.

There are numerous others, too, and many I'm sure I've never heard of. It seems that this field has exploded in the last five years or so. Software tools and source texts have become widely available, and people are compiling massive parallel corpora all over the place, mostly for training machine translation systems and supporting human translators. Most of them are auto-aligned, so there are alignment errors, but usually not many. There is no reason one couldn't make use of these resources for language learning, especially in language combinations for which no good dictionaries are available.
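To make the idea concrete, here is a minimal sketch (not from the thread; the tab-separated file format and the function name are my own assumptions) of how one might query a downloaded parallel corpus for a word and see its translations in context:

```python
def concordance(pairs_file, term):
    """Collect aligned sentence pairs whose source side contains a term.

    Assumes one pair per line: source sentence, a tab, target sentence.
    """
    hits = []
    with open(pairs_file, encoding="utf-8") as f:
        for line in f:
            if "\t" not in line:
                continue  # skip malformed lines (auto-alignment is imperfect)
            src, tgt = line.rstrip("\n").split("\t", 1)
            if term.lower() in src.lower():
                hits.append((src, tgt))
    return hits
```

Scanning the full list of hits, rather than a dictionary entry, is what lets you notice that one source word maps to several target words depending on context.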

Edited by andras_farkas on 24 February 2014 at 9:29pm

4 persons have voted this message useful



iguanamon
Pentaglot
Senior Member
Virgin Islands
Studies: Ladino
Joined 5205 days ago

2241 posts - 6731 votes 
Speaks: English*, Spanish, Portuguese, Haitian Creole, Creole (French)

 
 Message 20 of 27
24 February 2014 at 10:15pm | IP Logged 
While searching for more material in Ladino (Judeo-Spanish), I found the Collections de Corpus Oraux Numériques, which is French-based and includes a ton of rare languages. There are even four regional variations of metropolitan French and almost all the minority languages of France in the database. The corpus includes audio and text. For Ladino, I'll take anything I can get!

I am a huge fan of Linguee. I find it highly useful, although for Portuguese it doesn't do as good a job with the more colloquial/slang usage. For that I use the Dicionário Informal website. Linguee, to me, is much more effective than a bilingual dictionary.

By the way, andras_farkas, thank you very much for sharing your work on bilingual texts and your alignment software with the community. It is greatly appreciated.

Edited by iguanamon on 25 February 2014 at 12:09am

5 persons have voted this message useful



kujichagulia
Senior Member
Japan
Joined 4790 days ago

1031 posts - 1571 votes 
Speaks: English*
Studies: Japanese, Portuguese

 
 Message 21 of 27
25 February 2014 at 6:27am | IP Logged 
I've always thought that collecting a variety of texts that I've read or studied, or even a collection of my own corrected writings, would make for a fabulous and personally relevant corpus for me to use. But I can't figure out the best way to set it up.

It's easy to keep a collection of files (Word or LibreOffice documents, PDFs, or .txt files) in a folder. But how do you search all of those files at once? I'm not aware of any software that could do that.

It's easy to search for a word within a single document, but it seems a hassle to copy-and-paste everything into one document, and it's hard to retrieve a particular text when you have hundreds of texts/stories/etc. in one huge file.
2 persons have voted this message useful



andras_farkas
Tetraglot
Groupie
Hungary
Joined 4843 days ago

56 posts - 165 votes 
Speaks: Hungarian*, Spanish, English, Italian

 
 Message 22 of 27
25 February 2014 at 9:19am | IP Logged 
dtSearch, Copernic, and similar "desktop search" tools do just that. (They search any number of files of various file types saved in a specified folder and serve up a single list of hits.)
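For readers who would rather script it themselves, a rough Python equivalent of such a folder search might look like this (the function name and context size are illustrative, and it only handles plain-text files, not Word documents or PDFs):

```python
import os

def search_folder(folder, term, context=60):
    """Print every occurrence of term in the .txt files of a folder,
    with a little surrounding context, as one combined list of hits."""
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            text = f.read()
        i = text.find(term)
        while i != -1:
            snippet = text[max(0, i - context):i + len(term) + context]
            print(f"{name}: ...{snippet.strip()}...")
            i = text.find(term, i + len(term))
```

Each hit is printed with its filename, so you can go back to the source text, which is the part a single merged document makes hard.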

Edited by andras_farkas on 25 February 2014 at 9:21am

3 persons have voted this message useful



Lizzern
Diglot
Senior Member
Norway
Joined 5852 days ago

791 posts - 1053 votes 
Speaks: Norwegian*, English
Studies: Japanese

 
 Message 23 of 27
25 February 2014 at 5:44pm | IP Logged 
kujichagulia wrote:
It's easy to keep a collection of files (Word or LibreOffice documents, PDFs, or .txt files) in a folder. But how do you search all of those files at once? I'm not aware of any software that could do that.


The normal Windows search at least gives you a list of files (though it doesn't always search inside PDFs). Or if you keep them in the same folder, you can search the folder itself and narrow the list down to files containing a certain word, without having to open them. I'll probably use Word if I decide to build my own corpus, and just use the program's search along with Windows search.

Liz

Edited by Lizzern on 25 February 2014 at 5:44pm

2 persons have voted this message useful



Bakunin
Diglot
Senior Member
Switzerland
outerkhmer.blogspot.
Joined 5073 days ago

531 posts - 1126 votes 
Speaks: German*, Thai
Studies: Khmer

 
 Message 24 of 27
25 February 2014 at 6:21pm | IP Logged 
kujichagulia wrote:
I've always thought that collecting a variety of texts that I've read or studied, or even a collection of my own corrected writings, would make for a fabulous and personally relevant corpus for me to use. But I can't figure out the best way to set it up.

It's easy to keep a collection of files (Word or LibreOffice documents, PDFs, or .txt files) in a folder. But how do you search all of those files at once? I'm not aware of any software that could do that.

It's easy to search for a word within a single document, but it seems a hassle to copy-and-paste everything into one document, and it's hard to retrieve a particular text when you have hundreds of texts/stories/etc. in one huge file.


I can't really add much to the advice you've already got, but I can outline what I did. It involves some programming, but it's really pedestrian.

1. I've got a special folder for all my corpus text files; they are all .txt files with some standardized meta-information at the beginning (date, title, genre, source), then an empty line, and then the text. Note that I've worked through each and every text with FLTR, fixing typos etc. I also have a small routine that eliminates empty lines, unnecessary spaces etc., which I run before I put the text into FLTR (and of course a parser, but that's specific to Thai, which doesn't use spaces between words).
2. I've got a corpus search file where I specify my search request: the search term(s), how many characters apart the terms can be (if there are two or more), words/phrases to exclude from the results, how many characters of context to show around the search terms, etc. This is obviously tailored to my specific needs (e.g., to search for 'X … Y … Z' but exclude 'aX' and 'Zb' [remember the lack of spaces in Thai]).
3. Then I run my Python routine 'corpus'. It reads all the files in the corpus folder and assembles them into one huge text file. Then the search is performed, and the results are trimmed according to the search specifications.
4. Finally, I generate an output HTML file by writing the HTML markup automatically from the search results. I got a template from somebody and just copied the structure. The output also records some of the meta-information that comes with each result line, which is useful if I want to go back to the original text.
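The core of the steps above could be sketched roughly like this in Python (all names are my own placeholders; Bakunin's actual routine, the Thai-specific parser, and the HTML output stage are not reproduced here):

```python
import os

def load_corpus(folder):
    """Read every .txt file: metadata header lines, a blank line, then text."""
    corpus = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            raw = f.read()
        header, _, body = raw.partition("\n\n")
        # header lines look like "title: ...", "genre: ...", etc.
        meta = dict(line.split(": ", 1) for line in header.splitlines()
                    if ": " in line)
        corpus.append((meta, body))
    return corpus

def search(corpus, term, context=40):
    """Return (metadata, snippet) pairs for every occurrence of term."""
    results = []
    for meta, body in corpus:
        i = body.find(term)
        while i != -1:
            snippet = body[max(0, i - context):i + len(term) + context]
            results.append((meta, snippet))
            i = body.find(term, i + len(term))
    return results
```

Keeping the metadata attached to each hit is what lets the final output point back to the original text, as described in step 4.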

As pointed out earlier, the value of my corpus lies in the fact that the search results are personally relevant. But if there had been a good, easy-to-use Thai corpus on the web, I wouldn't have built my own. Building one's own corpus isn't something I would recommend to just anybody, but if you have basic programming skills or can find software that helps you organize the material, it can be a useful study aid for the serious learner.


3 persons have voted this message useful








