
Building your own corpus

Language Learning Forum : Learning Techniques, Methods & Strategies
27 messages over 4 pages
andras_farkas
Tetraglot
Groupie
Hungary
Joined 4843 days ago

56 posts - 165 votes 
Speaks: Hungarian*, Spanish, English, Italian

 
 Message 17 of 27
24 February 2014 at 6:32pm | IP Logged 
I don't think I was trying to shut down anything. I contrasted some of your assertions
and opinions with my own.
Lingoleng noted that I'm in the language business, which is true. How much authority that
gives me on these issues is debatable. I studied linguistics and translation at
university and I've been in this profession for 5+ years, so I have some basic
theoretical knowledge and a good amount of practical experience. That does not make me
the ultimate judge of linguistics & translation issues, and I never claimed to be any
such thing.
5 persons have voted this message useful





emk
Diglot
Moderator
United States
Joined 5475 days ago

2615 posts - 8806 votes 
Speaks: English*, French (B2)
Studies: Spanish, Ancient Egyptian
Personal Language Map

 
 Message 18 of 27
24 February 2014 at 7:28pm | IP Logged 
andras_farkas wrote:
If one were to use this tool for scientific research, one would obviously put the various types of texts (English original, French original) into separate databases. For language learning, this is hardly a vital issue. One tries to read texts that were originally written in the target language, but it's not a hard and fast rule.

I'm sure you know about it, but there's also a very handy online database which works on much the same principle: Linguee (compare anguish and angoisse). We can see that most instances of "anguish" correspond to "angoisse" in French, but going in the other direction, things aren't so tidy: "angoisse" corresponds to "anguish", and "anxiety", and a bunch of other words.

Of course, to really use a database like this effectively, I need to be able to read the surrounding contexts and think about the nuances. But if I'm prepared to do that, I find that a translation database is far more precise and trustworthy than a bilingual dictionary.

As for Bakunin's original idea proposed in this thread—building a searchable corpus of familiar materials—I think it's absolutely brilliant. What better way to get examples of words in familiar contexts? I'd do it myself, if I weren't so frightfully lazy.
3 persons have voted this message useful



andras_farkas
Tetraglot
Groupie
Hungary
Joined 4843 days ago

56 posts - 165 votes 
Speaks: Hungarian*, Spanish, English, Italian

 
 Message 19 of 27
24 February 2014 at 9:24pm | IP Logged 
emk wrote:

I'm sure you know about it, but there's also a very handy online database which works on much the same principle: Linguee (compare anguish and angoisse).

Yes, I know about Linguee. There are numerous other parallel corpora as well, some downloadable, others that can only be queried online. Mymemory.translated.net covers more languages than Linguee, but I believe it's mostly based on EU texts (they crawl the web and collect everything they find, but EU texts dominate in many language combinations). Hunglish.hu is fairly large and fairly varied, but it's English-Hungarian only. http://opus.lingfil.uu.se/bin/opuscqp.pl contains a wide range of material in many languages; the OpenSubtitles subcorpus is perhaps the best for language learners. There's also InterCorp, which I just found out about today. It's supposed to be a large, manually aligned multilingual corpus with many literary texts, so it should work really well for language learning.

There are numerous others, too, and many I'm sure I've never heard of. It seems that this field has exploded in the last five years or so. Software tools and source texts have become widely available, and people are compiling massive parallel corpora all over the place, mostly for training machine translation systems and supporting human translators. Most of them are auto-aligned, so there are alignment errors, but usually not many. There is no reason one couldn't make use of these resources for language learning, especially in language combinations for which no good dictionaries are available.
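To make the idea concrete, here is a minimal sketch (not from the thread; the tab-separated file format and the function name are my own assumptions) of how one might query a downloaded parallel corpus for a word and see its translations in context:

```python
def concordance(pairs_file, term):
    """Collect aligned sentence pairs whose source side contains a term.

    Assumes one pair per line: source sentence, a tab, target sentence.
    """
    hits = []
    with open(pairs_file, encoding="utf-8") as f:
        for line in f:
            if "\t" not in line:
                continue  # skip malformed lines (auto-alignment is imperfect)
            src, tgt = line.rstrip("\n").split("\t", 1)
            if term.lower() in src.lower():
                hits.append((src, tgt))
    return hits
```

Scanning the full list of hits, rather than a dictionary entry, is what lets you notice that one source word maps to several target words depending on context.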

Edited by andras_farkas on 24 February 2014 at 9:29pm

4 persons have voted this message useful



iguanamon
Pentaglot
Senior Member
Virgin Islands
Studies: Ladino
Joined 5205 days ago

2241 posts - 6731 votes 
Speaks: English*, Spanish, Portuguese, Haitian Creole, Creole (French)

 
 Message 20 of 27
24 February 2014 at 10:15pm | IP Logged 
While searching for more material in Ladino (Judeo-Spanish), I found the Collections de Corpus Oraux Numériques, which is French-based and includes a ton of rare languages. There are even four regional variations of metropolitan French and almost all the minority languages of France in the database. The corpus includes audio and text. For Ladino, I'll take anything I can get!

I am a huge fan of Linguee. I find it highly useful, although for Portuguese it doesn't do as good a job with the more colloquial/slang usage. For that I use the Dicionário Informal website. Linguee, to me, is much more effective than a bilingual dictionary.

By the way, andras_farkas, thank you very much for sharing your work on bilingual texts and your alignment software with the community. It is greatly appreciated.

Edited by iguanamon on 25 February 2014 at 12:09am

5 persons have voted this message useful



kujichagulia
Senior Member
Japan
Joined 4790 days ago

1031 posts - 1571 votes 
Speaks: English*
Studies: Japanese, Portuguese

 
 Message 21 of 27
25 February 2014 at 6:27am | IP Logged 
I've always thought that collecting a variety of texts that I've read or studied, or even a collection of my own corrected writings, would make for a fabulous and personally relevant corpus for me to use. But I can't figure out the best way to set it up.

It's easy to keep a collection of files (Word or LibreOffice documents, PDFs, or .txt files) in a folder. But how do you search all of those files at once? I'm not aware of any software that could do that.

It's easy to search for a word within a single document, but it seems a hassle to copy-and-paste everything into one document, and it's hard to retrieve a particular text when you have hundreds of texts/stories/etc. in one huge file.
2 persons have voted this message useful



andras_farkas
Tetraglot
Groupie
Hungary
Joined 4843 days ago

56 posts - 165 votes 
Speaks: Hungarian*, Spanish, English, Italian

 
 Message 22 of 27
25 February 2014 at 9:19am | IP Logged 
dtSearch, Copernic, and similar "desktop search" tools do just that. (They search any number of files of various file types saved in a specified folder and serve up a single list of hits.)
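For readers who would rather script it themselves, a rough Python equivalent of such a folder search might look like this (the function name and context size are illustrative, and it only handles plain-text files, not Word documents or PDFs):

```python
import os

def search_folder(folder, term, context=60):
    """Print every occurrence of term in the .txt files of a folder,
    with a little surrounding context, as one combined list of hits."""
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            text = f.read()
        i = text.find(term)
        while i != -1:
            snippet = text[max(0, i - context):i + len(term) + context]
            print(f"{name}: ...{snippet.strip()}...")
            i = text.find(term, i + len(term))
```

Each hit is printed with its filename, so you can go back to the source text, which is the part a single merged document makes hard.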

Edited by andras_farkas on 25 February 2014 at 9:21am

3 persons have voted this message useful



Lizzern
Diglot
Senior Member
Norway
Joined 5852 days ago

791 posts - 1053 votes 
Speaks: Norwegian*, English
Studies: Japanese

 
 Message 23 of 27
25 February 2014 at 5:44pm | IP Logged 
kujichagulia wrote:
It's easy to keep a collection of files (Word or LibreOffice documents, PDFs, or .txt files) in a folder. But how do you search all of those files at once? I'm not aware of any software that could do that.


The normal Windows search at least gives you a list of files (though it doesn't always search inside PDFs). Or if you keep them in the same folder, you can search the folder itself and narrow the list down to files containing a certain word, without having to open them. I'll probably use Word if I decide to build my own corpus, and just use the program's search along with Windows search.

Liz

Edited by Lizzern on 25 February 2014 at 5:44pm

2 persons have voted this message useful



Bakunin
Diglot
Senior Member
Switzerland
outerkhmer.blogspot.
Joined 5073 days ago

531 posts - 1126 votes 
Speaks: German*, Thai
Studies: Khmer

 
 Message 24 of 27
25 February 2014 at 6:21pm | IP Logged 
kujichagulia wrote:
I've always thought that collecting a variety of texts that I've read or studied, or even a collection of my own corrected writings, would make for a fabulous and personally relevant corpus for me to use. But I can't figure out the best way to set it up.

It's easy to keep a collection of files (Word or LibreOffice documents, PDFs, or .txt files) in a folder. But how do you search all of those files at once? I'm not aware of any software that could do that.

It's easy to search for a word within a single document, but it seems a hassle to copy-and-paste everything into one document, and it's hard to retrieve a particular text when you have hundreds of texts/stories/etc. in one huge file.


I can't really add much to the advice you've already got, but I can outline what I did. It involves some programming, but it's really pedestrian.

1. I've got a special folder for all my corpus text files; they are all .txt files with some standardized meta-information at the beginning (date, title, genre, source), then an empty line, and then the text. Note that I've worked through each and every text with FLTR, fixing typos etc. I also have a small routine that eliminates empty lines, unnecessary spaces etc., which I run before I put the text into FLTR (and of course a parser, but that's specific to Thai, which doesn't use spaces between words).
2. I've got a corpus search file where I specify my search request: the search term(s), how many characters apart the terms can be (if there are two or more), words/phrases to exclude from the results, how many characters of context to show around the search terms, etc. This is obviously tailored to my specific needs (e.g., to search for 'X … Y … Z' but exclude 'aX' and 'Zb' [remember the lack of spaces in Thai]).
3. Then I run my Python routine 'corpus'. It reads all the files in the corpus folder and assembles them into one huge text file. Then the search is performed, and the results are trimmed according to the search specifications.
4. Finally, I generate an output HTML file by writing the HTML markup automatically from the search results. I got a template from somebody and just copied the structure. The output also records some of the meta-information that comes with each result line, which is useful if I want to go back to the original text.
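The core of the steps above could be sketched roughly like this in Python (all names are my own placeholders; Bakunin's actual routine, the Thai-specific parser, and the HTML output stage are not reproduced here):

```python
import os

def load_corpus(folder):
    """Read every .txt file: metadata header lines, a blank line, then text."""
    corpus = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            raw = f.read()
        header, _, body = raw.partition("\n\n")
        # header lines look like "title: ...", "genre: ...", etc.
        meta = dict(line.split(": ", 1) for line in header.splitlines()
                    if ": " in line)
        corpus.append((meta, body))
    return corpus

def search(corpus, term, context=40):
    """Return (metadata, snippet) pairs for every occurrence of term."""
    results = []
    for meta, body in corpus:
        i = body.find(term)
        while i != -1:
            snippet = body[max(0, i - context):i + len(term) + context]
            results.append((meta, snippet))
            i = body.find(term, i + len(term))
    return results
```

Keeping the metadata attached to each hit is what lets the final output point back to the original text, as described in step 4.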

As pointed out earlier, the value of my corpus lies in the fact that the search results are personally relevant. But if there had been a good, easy-to-use Thai corpus on the web, I wouldn't have built my own. Building one's own corpus isn't something I would recommend to just anybody, but if you have basic programming skills or can find software that helps you organize the material, it can be a useful study aid for the serious learner.


3 persons have voted this message useful








