Building your own corpus

Language Learning Forum : Learning Techniques, Methods & Strategies
Lizzern
Diglot
Senior Member
Norway
Joined 5852 days ago

791 posts - 1053 votes 
Speaks: Norwegian*, English
Studies: Japanese

 
 Message 9 of 27
21 February 2014 at 11:22am
That's a great idea :-) I might start doing something similar for Japanese, but I have no programming skills at all, so I might need to just put my texts in a big Word file to make them searchable. Close enough. Thanks Tommus for the link to WebCorp!

Liz
1 person has voted this message useful



andras_farkas
Tetraglot
Groupie
Hungary
Joined 4843 days ago

56 posts - 165 votes 
Speaks: Hungarian*, Spanish, English, Italian

 
 Message 10 of 27
21 February 2014 at 3:42pm
Interesting. Corpora, especially bilingual and multilingual corpora, are kind of my
area, so here are some tips.
If you can find multilingual aligned corpora of interest to you, you can get a really
nice side-by-side lookup feature.
As an example, I grabbed some of the English-French books in my multilingual aligned
book collection. It ended up being 45,000 sentences, which is a fraction of all the
en-fr material I have. I dumped the texts into my lookup software and ran a few sample
searches to show you the interface:

65 English sentences contained the word 'anguish':



In 54 of those 65 sentences, the word was translated as 'angoisse', which shows that
the two terms are a good translation for each other.
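
The 54-out-of-65 count above can be reproduced outside TMLookup with a few lines of Python. This is only a minimal sketch, not how TMLookup works internally: the function name `translation_hit_rate` and the three sentence pairs are invented for illustration, and it assumes the corpus is already available as a list of (English, French) sentence pairs. It matches the exact word form only, so inflected forms such as 'angoisses' would be missed.

```python
import re


def translation_hit_rate(pairs, source_word, target_word):
    """Return (hits, matched): how many aligned pairs contain source_word
    on the source side, and how many of those also contain target_word
    on the target side."""
    src = re.compile(rf"\b{re.escape(source_word)}\b", re.IGNORECASE)
    tgt = re.compile(rf"\b{re.escape(target_word)}\b", re.IGNORECASE)
    hits = [(s, t) for s, t in pairs if src.search(s)]
    matched = sum(1 for _, t in hits if tgt.search(t))
    return len(hits), matched


# Invented toy data standing in for a real aligned corpus.
pairs = [
    ("He felt a deep anguish.", "Il ressentit une profonde angoisse."),
    ("Her anguish was visible.", "Sa douleur était visible."),
    ("The weather was fine.", "Il faisait beau."),
]

hits, matched = translation_hit_rate(pairs, "anguish", "angoisse")
print(f"{matched} of {hits} sentences with 'anguish' have 'angoisse' on the French side")
```

On a real collection you would load the pairs from whatever aligned format you have (TMX, tab-separated text and so on) and pass them in the same way.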




The books are here, and the lookup tool (TMLookup) is here. You can also use
TMLookup for monolingual texts, of course, or, alternatively, you can use ApSIC Xbench,
which works largely the same way.


Edited by andras_farkas on 21 February 2014 at 3:44pm

7 persons have voted this message useful



Jeffers
Senior Member
United Kingdom
Joined 4852 days ago

2151 posts - 3960 votes 
Speaks: English*
Studies: Hindi, Ancient Greek, French, Sanskrit, German

 
 Message 11 of 27
23 February 2014 at 1:51pm
andras_farkas wrote:
65 English sentences contained the word 'anguish'


You do realize that the English side is the translation, right? This might sound picky to you, but it matters because of what you wrote below:

andras_farkas wrote:
In 54 of those 65 sentences, the word was translated as 'angoisse', which shows that the two terms are a good translation for each other.


That is one possible interpretation. The other is that the English translator was lazy and since the words are cognates s/he just used the nearest English equivalent. Cognates are not always the best translation for each other, because the meaning could have shifted in one or both of the languages.

For example, you might read, "L'homme est sensible", and be tempted to translate it as "The man is sensible." There is a chance that would be the best translation, but probably not in this case.

This type of translation is also poor when the cognate in one language is common, but the cognate in the translation language is not used as much. This seems to be the case with this passage. I am not sure if 'angoisse' is so common in French, but 'anguish' is clearly overused in the English translation, and it seems awkward. My guess is that it has a slightly broader semantic range in French than in English, and doesn't sound awkward in the French.

Edited by Jeffers on 23 February 2014 at 1:53pm

2 persons have voted this message useful



andras_farkas
 Message 12 of 27
24 February 2014 at 9:35am
While some of your comments are interesting, you miss the point in two ways.
First of all, this was a random sample search to show the way the software works. The
content hardly matters.
Second, what on earth made you think that "the English side is the translation" in a
text collection you've never seen? In some cases, it is, in some cases, it isn't. There
are a couple of English novels translated into French in the dataset. For example, three
occurrences of the word come from Moll Flanders, and one comes from Robinson Crusoe.
In the grand scheme of things, if a word corresponds to X in literary translations in
54 cases out of 65, then it is generally a good idea for language learners to remember
that X is a pretty good translation for the word in question. Everybody with an
interest in language learning knows that there is no 100% correspondence between words
in different languages, so, in almost all cases, any translation is an approximation.

The more interesting point here is that collections of aligned texts and software tools
of this sort make it really easy to study things like this. You can take a million
words of original English text and a million words of original French text and see how
often 'anguish' and 'angoisse' occur. Then you can take the opposite-language
translations of the same texts and compare the word frequencies to see whether 'anguish'
creeps into English texts as a convenient translation for 'angoisse'. It would be
pretty easy to calculate a percentage for how often the similarity of the cognates leads
the translator astray. You could even use brute force to find anomalies automatically:
run searches on the 10,000 most frequent words in texts translated into English,
then run searches on the same words in original English texts and see where the biggest
discrepancies are and what they may be due to. If there is enough data, it wouldn't be
too hard to build up profiles of individual translators and see how their language
background affects their work. In fact, I expect references to farkastranslations.com to
show up in linguistics and translation studies theses a few years from now.
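
The brute-force comparison described above could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing tool: `word_counts` and `biggest_discrepancies` are hypothetical names, the tokenizer is a crude English-only regex, the add-one smoothing is an arbitrary choice to avoid dividing by zero, and the two sample strings are invented.

```python
import re
from collections import Counter


def word_counts(text):
    """Very rough English tokenizer: lowercase runs of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def biggest_discrepancies(translated_text, original_text, top_n=10, min_count=1):
    """Rank words by how much more frequent they are (relatively) in
    translated English than in original English."""
    trans = word_counts(translated_text)
    orig = word_counts(original_text)
    trans_total = sum(trans.values())
    orig_total = sum(orig.values())
    rows = []
    for word, count in trans.most_common():
        if count < min_count:
            continue
        # Crude add-one smoothing (an assumption of this sketch) so that
        # words missing from the original corpus do not divide by zero.
        trans_rate = (count + 1) / (trans_total + 1)
        orig_rate = (orig[word] + 1) / (orig_total + 1)
        rows.append((word, trans_rate / orig_rate))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top_n]


# Invented toy strings standing in for two million-word corpora.
translated = "anguish anguish anguish the man felt anguish"
original = "the man felt the pain the man"

for word, ratio in biggest_discrepancies(translated, original, top_n=3):
    print(f"{word}: {ratio:.2f}x more frequent in the translated text")
```

With real data you would raise `min_count` to filter out rare words, and a high ratio would only be a prompt to look at the sentences themselves, since subject matter can skew frequencies just as much as translation habits.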

Edited by andras_farkas on 24 February 2014 at 9:47am

5 persons have voted this message useful



Jeffers
 Message 13 of 27
24 February 2014 at 12:53pm
andras_farkas wrote:
First of all, this was a random sample search to show the way
the software works. The content hardly matters.


I would like to say the content always matters, but I take your point. Nevertheless,
you drew conclusions based on this sample. Perhaps this means it is a good example of
how the software could be misused?

andras_farkas wrote:
Second, what on earth made you think that "the English side is the
translation" in a text collection you've never seen? In some cases, it is, in some
cases, it isn't.


It seems pretty obvious that the original is the French side, since all but one of the
character names are French (and the one English name is identical in French so it could
be French as well).


Computer tools are wonderful things, I agree. I am a teacher of ICT and Computing, so I
value them greatly. But it is also too easy to make mistakes based on computer tools.
For example, if you unwittingly search a text translated into English, you might draw
conclusions about English word frequency or English syntax which may or may not be
correct. The problem is that a computer tool can't apply common sense, and so the user
needs to be smarter than the system in order to avoid false conclusions.

You mention the possibility of searching original English text separately from original
French text. That sort of distinction would go a long way towards avoiding mistakes
about things such as word frequency.
1 person has voted this message useful



andras_farkas
 Message 14 of 27
24 February 2014 at 1:27pm
Jeffers wrote:
andras_farkas wrote:
First of all, this was a random sample search to show the way
the software works. The content hardly matters.


I would like to say the content always matters, but I take your point. Nevertheless,
you drew conclusions based on this sample. Perhaps this means it is a good example of
how the software could be misused?

No, I still think it's an example of how the software could be used.

Jeffers wrote:

andras_farkas wrote:
Second, what on earth made you think that "the English side is the
translation" in a text collection you've never seen? In some cases, it is, in some
cases, it isn't.


It seems pretty obvious that the original is the French side, since all but one of the
character names are French (and the one English name is identical in French so it could
be French as well).


Talking about avoiding false conclusions, all but one of the character names are French
in the dozen sentences you can see in the screenshot. As I said, there are 65
hits in a random collection of texts totalling about 45,000 sentences. As it happens,
most of it is made up of French originals translated into English, but not all.
If one were to use this tool for scientific research, one would obviously put the
various types of texts (English original, French original) into separate databases. For
language learning, this is hardly a vital issue. One tries to read texts that were
originally written in the target language, but it's not a hard and fast rule.

Edited by andras_farkas on 24 February 2014 at 1:31pm

4 persons have voted this message useful



lingoleng
Senior Member
Germany
Joined 5241 days ago

605 posts - 1290 votes 

 
 Message 15 of 27
24 February 2014 at 4:32pm
Jeffers wrote:
andras_farkas wrote:
65 English sentences contained the word 'anguish'


You do realize that the English side is the translation, right? This might sound picky to you, but it matters because of what you wrote below:

andras_farkas wrote:
In 54 of those 65 sentences, the word was translated as 'angoisse', which shows that the two terms are a good translation for each other.


That is one possible interpretation. The other is that the English translator was lazy and since the words are cognates s/he just used the nearest English equivalent. Cognates are not always the best translation for each other, because the meaning could have shifted in one or both of the languages.

For example, you might read, "L'homme est sensible", and be tempted to translate it as "The man is sensible." There is a chance that would be the best translation, but probably not in this case.

This type of translation is also poor when the cognate in one language is common, but the cognate in the translation language is not used as much. This seems to be the case with this passage. I am not sure if 'angoisse' is so common in French, but 'anguish' is clearly overused in the English translation, and it seems awkward. My guess is that it has a slightly broader semantic range in French than in English, and doesn't sound awkward in the French.


If I am not completely mistaken, Andras is a professional translator and interpreter. Kind of funny that you feel the urge to lecture him about some elementary basics ...
1 person has voted this message useful



Jeffers
 Message 16 of 27
24 February 2014 at 5:53pm
Good answer: back off, we're professionals.

I'm sorry, but if a "professional" can't accept criticism of their work, then I have no respect for them as a professional. A true professional takes criticism on board and improves. Being defensive simply compounds errors. And professionals closing ranks against any criticism, well, that just makes the profession look weak.

I wasn't going to write anything further, because andras_farkas didn't want to hear it. However, I was disgusted at the attempt to shut down criticism based on credentials. Trust me, I will leave the rest of this discussion to the "professionals". Good day to you sir.


2 persons have voted this message useful



Copyright 2024 FX Micheloud - All rights reserved
No part of this website may be copied by any means without my written authorization.