
The ethology of Google Translate

  Tags: Google | Translation


Iversen
Super Polyglot
Moderator
Denmark
berejst.dk
Joined 6485 days ago

9078 posts - 16473 votes 
Speaks: Danish*, French, English, German, Italian, Spanish, Portuguese, Dutch, Swedish, Esperanto, Romanian, Catalan
Studies: Afrikaans, Greek, Norwegian, Russian, Serbian, Icelandic, Latin, Irish, Lowland Scots, Indonesian, Polish, Croatian
Personal Language Map

 
 Message 1 of 11
20 June 2015 at 9:45am
I have in other threads described my own use of Google Translate for the production of bilingual texts. After some experimentation I have settled on a format based on parallel columns, with original texts in a weak or mediocre source language to the left and translations in a stronger language to the right. This is often my native Danish, sometimes English and often something else - just for fun. If I could easily get reasonably literal translations of my texts I would use them, but with my preference for non-fiction that is not always possible - and besides, literary translations are often too 'free' to be trustworthy. Instead I use Google Translate.

Google Translate may produce clumsy translations with blatant errors, but often these errors are so obvious that you get suspicious - and with my setup the errors will always be in the right column, where you find the stronger language. Besides, I don't use it to make texts for distribution - the right column in my bilingual texts is for a quick check when something in the left column is incomprehensible, and if I don't trust the version given by Google I double-check with a dictionary. But even a faulty translation may give me a hint about the correct solution, and the procedure saves me a lot of dictionary lookups.

When speaking about Google T it is normal to focus on its blatant errors, but it learns languages faster than I do, and when you think about it, it is actually amazing that it can make recognizable translations at all. Just twenty years ago any serious linguist would have thought you needed a solid grammatical skeleton plus a dictionary plus something of a miracle to add some common sense and knowledge about the world. But in 2007 Google switched to a statistical model working on immense amounts of bilingual texts. The alternative method with rules and dictionaries is still used by some competitors, but it can't be automated as easily - you need a linguist to set up the rules for each language and a lexicographer to define word meanings.

When using statistics based on bilingual texts to build a translation system you would expect that pairs that include languages with few or unsuitable resources would suck big time. But surprise, surprise: as far as I can judge, the differences between big and small languages aren't nearly as striking as you might think (except in one case: Latin, where GT is so bad that it shouldn't be used at all). I can't judge the quality for most of the 90 or so languages offered by GT, but even within the range I cover there are language-specific constructions in some languages which GT doesn't really seem to have grasped - I'll illustrate some of these cases later on.


Iversen
Message 2 of 11
20 June 2015 at 9:47am
BACK TO THE ORIGINS

English as translation language has a special role because it serves as an intermediary for all other language pairs. So logically, translations to or from English ought to be twice as good as translations between other language pairs, which have to pass through English on the way, but the difference in quality is not nearly as clear-cut as you might expect.

Let's test that. The sentence ...

: Translations to or from English should always be twice as good as translations between other language pairs because English functions as intermediary language, but again this is not nearly as clearcut as you might expect.

is translated back to English as

: Translation to or from English should always be twice as good as translations functions as intermediary between other language pairs because English language, but again this is not nearly as clearcut as you might expect..

(NB: blue colour for originals, green for human-made translations and red for those made by Google Translate, from now on called GT)

Ouch! A closer test shows that translations which end up back in the language they started from are quite generally worse than translations into almost any other language. I have given an example of this in my log thread on June 3 2015 for French to French, using the beginning of the poem "Le Bateau Ivre" by monsieur Rimbaud.

: Comme je descendais des Fleuves impassibles,
Je ne me sentis plus guidé par les haleurs :
Des Peaux-Rouges criards les avaient pris pour cibles,
Les ayant cloués nus aux poteaux de couleurs


which becomes ..

: Comme je descendais des Fleuves impassibles,
Je me NE Sentis Plus Guide Par les haleurs:
Des Peaux-Rouges criards les avaient Prix verser CIBLES,
Les Ayant Cloues nus aux poteaux de couleurs


("verser" is a translation not of the French word "pour", but of its English homonym, which is a verb - and down the drain goes the grammatical coherence when such things happen.)
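
For those who like to tinker: the comparison between an original and its round-trip version can also be done mechanically. The following is only a rough sketch in Python using the standard difflib module. It compares the second line of the stanza with GT's output word by word. The function names word_changes and similarity are invented for this illustration, and the two strings are simply copied from the quotes above.

```python
import difflib


def word_changes(original, round_trip):
    """Return the word-level differences between two versions of a text.

    Each item is (operation, words_in_original, words_in_round_trip),
    where operation is 'replace', 'delete' or 'insert'.
    """
    a, b = original.split(), round_trip.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


def similarity(original, round_trip):
    """Similarity ratio of the two word sequences (1.0 means identical)."""
    a, b = original.split(), round_trip.split()
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Second line of the stanza as quoted above: original first, then GT's output.
ORIGINAL = "Je ne me sentis plus guidé par les haleurs :"
ROUND_TRIP = "Je me NE Sentis Plus Guide Par les haleurs:"

if __name__ == "__main__":
    for op, before, after in word_changes(ORIGINAL, ROUND_TRIP):
        print(op, before, "->", after)
    print("similarity:", round(similarity(ORIGINAL, ROUND_TRIP), 2))
```

The comparison is deliberately case-sensitive and accent-sensitive, since the stray capitals and the lost accents are exactly the damage of interest here.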

In Italian the same stanza certainly gets its share of translation errors, but it is not full of those ugly capitalized words (and eloped accents):

: Mentre camminavo lungo i fiumi impassibili,
Non mi sentivo più me guidare da autotrasportatori:
Redskins Gaudy li aveva preso per obiettivi,
L'aver inchiodato li nudo pali colorati


"autotrasportatori" - hehe. They were not even invented in Rimbaud's days. Gaudy redskins, well maybe. But back to the trail: this could be taken as a sign that the intermediary stage isn't just a simple English translation, but some composite with markers of different kinds attached. How can we test that? Well, possibly by translating a text from a language with a grammatical quirk that isn't relevant for English into another language where it is relevant. Like the exclusive first-person plural pronoun "kami" in Indonesian (meaning 'all of us, but not you'). Does it survive the trip through English to the closely related Malaysian?

Indonesian: Prinsip ini yang akan terus kami pegang di kota ini. (www.hijauku.com)

English: This principle will continue to be held in this city.

Malaysian: Prinsip ini yang akan terus kami pegang di bandar ini.

Well it did, and so does the inclusive "kita" ('we including you') if we put it in the original sentence instead of "kami". So the middle step can't just be the plain English sentence (or GT was just lucky this time).
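
To spell out the reasoning: if the intermediate step were nothing but a plain English sentence, the kami/kita distinction would have nowhere to live, since English only has "we". A toy sketch in Python follows. The two tiny lookup tables and the function name are made up for the illustration and say nothing about how GT really stores things.

```python
# Toy lexicons, invented for this illustration only.
# Indonesian "kami" (we, excluding you) and "kita" (we, including you)
# both end up as plain "we" in English.
INDONESIAN_TO_ENGLISH = {"kami": "we", "kita": "we"}

# Going from English "we" into Malay, both pronouns are possible.
ENGLISH_TO_MALAY = {"we": ["kami", "kita"]}


def pivot_through_english(indonesian_word):
    """Return (english_word, malay_candidates) for a naive word-level pivot."""
    english = INDONESIAN_TO_ENGLISH[indonesian_word]
    return english, ENGLISH_TO_MALAY[english]


if __name__ == "__main__":
    for pronoun in ("kami", "kita"):
        english, candidates = pivot_through_english(pronoun)
        print(pronoun, "->", english, "->", candidates)
```

Both pronouns give the same English word and the same list of Malay candidates, so a purely English middle step could at best guess. That GT keeps them apart suggests that something more than the bare English word is passed along (or, again, that it was lucky).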

Let's leave this mystery to the professional Googlologists and proceed to something faintly related: word order.


Edited by Iversen on 20 June 2015 at 10:19am


Iversen
Message 3 of 11
20 June 2015 at 9:48am
WORD ORDER IN SUBSTANTIVAL PHRASES

Let's take a simple example first: word order within a nominal syntagma. GT generally knows how to move adjectives from the position before the noun to the position after it if that's what a given language favors. Well, it isn't infallible, but there are worse problems elsewhere, and it seems that Google has improved on the mechanism. Though not in this German example (my italics):

:Die Sozialisierung von Eigentum von Bundesbürgern zu Gunsten einer EU-Staatskasse ist wirklich ein orgineller Vorschlag (meta.tagesschau.de)

: The socialization of the property of German citizens in favor of an EU treasury is really a proposal orgineller.

Maybe GT got it wrong because of the spelling error (it should of course be "origineller" - and "Sozialisierung" should have been rendered as 'nationalization' rather than 'socialization'). But it got the structure of "Die Sozialisierung von Eigentum von Bundesbürgern" right, and that's what counts.

About one year ago Google T translated

:...Castelul Huniade – cea mai veche clădire atestată în Timişoara. (www.timisoara-info.ro)

from its native Romanian into Danish like this:

: Huyadi Castle - den ældste bygning i Timisoara certificeret

OK, today it gives this English version, which is OK - apart from the untranslated preposition "în":

: Hunyadi Castle - the oldest documented building înTimişoara

But the uniquely Romanian possessive constructions with the a-pronoun later in the same text are still not quite under control:

: Printre obiectele expuse pot fi văzute (...) o machetă a aparatului de zbor al lui Traian Vuia, (..) machete ale primului troleibuz şi al primului tramvai construit la Timişoara şi multe altele.

: Among the exhibits can be seen(...) a model of the aircraft's Traian Vuia, models of the first trolleybus and tram of the first built in Timisoara and more. .

Me (friendly): Among the exhibits can be seen(...) a model of the aircraft of Traian Vuia, models of the first trolleybus and the first tram built in Timişoara and other things. .

Me (hyperliteral): Among the exhibits can be seen(...) a model of the apparatus-of-flight 'of-its' him's Traian Vuia, models 'of-their' first's trolleybus and 'of-their' firstThe's tram built in Timisoara and many more.

Let's dissect the passage "o machetă a aparatului de zbor al lui Traian Vuia". "Traian Vuia" is a personal name, and to avoid inflecting him the author has inserted the preceding "lui", which is a personal pronoun in the genitive - the hyperliteral translation would be something like "of-him Traian Vuia". And the preceding "al"? Well, you can think of it as a combination of "of" and some kind of pronoun, which mostly points back to the 'owned thing' - but even the Romanians sometimes make it concord with the 'owner'. Or just think of it as an inflated/flected 'of'.

Well, let's not judge Google T too harshly, these are complicated matters, but it has a weak point in complicated constructions involving the Romanian 'a' thing. Who wouldn't have?

Complicated genitival constructions with three or more layers of possession or other complications are generally problematic. Here is a Portuguese example:

:O caráter típico do ambiente de Burgess deveria excluir qualquer possibilidade de preservação de organismos de corpo mole. . (passeideito)

: The typical character Burgess environment should exclude any possibility of preservation of soft-bodied organisms. .

.. which of course should be

: The typical character of the environment of Burgess (...) .

"Burgess" here is the Burgess Shale, a paleontological treasure trove deposited during the Cambrian, a little over 500 million years ago. OK, I would also have accepted "The typical character of the Burgess environment" - but not a translation without any "of" at all.

A French example with the typical accumulation of "de"s:

:imprégnés d’une certaine idée de la France, de sa culture et de la République (gazetteort.com)

: imbued with a certain idea of France, its culture and the Republic

Without the last "the" the elimination of prepositions would have worked.

Let's take a bit more French (where some very common adjectives or synonyms for them are positioned to the left of the noun, everything else to the right):

The low notes --> Les notes basses
The short notes --> Les courtes notes
The short low notes --> Les courtes notes faibles
The short and low notes --> Les notes à court et bas

Ouch (or aïe)!

Edited by Iversen on 21 June 2015 at 12:23pm


Doitsujin
Diglot
Senior Member
Germany
Joined 5102 days ago

1256 posts - 2363 votes 
Speaks: German*, English

 
 Message 4 of 11
20 June 2015 at 4:36pm
Iversen wrote:
Let's take a simple example first [...]:
:Die Sozialisierung von Eigentum von Bundesbürgern zu Gunsten einer EU-Staatskasse ist wirklich ein orgineller Vorschlag (meta.tagesschau.de)
:The socialization of the property of German citizens in favor of an EU treasury is really a proposal orgineller.

Maybe GT got it wrong because of the spelling error (it should of course be "origineller"). But it got "Die Sozialisierung von Eigentum von Bundesbürgern" right, and that's what counts.

Actually, GT got it wrong, because Sozialisierung has two definitions in German and socialization is definitely the wrong translation for this sentence. Sozialisierung should have been translated as nationalization = Verstaatlichung. (I'm pretty sure that you correctly inferred this meaning from the context.)

However, since GT is based on statistics and Sozialisierung is predominantly used like its English cognate you can't really blame the algorithm.



OlafP
Triglot
Senior Member
Germany
Joined 5217 days ago

261 posts - 667 votes 
Speaks: German*, French, English

 
 Message 5 of 11
21 June 2015 at 12:42am
Iversen wrote:

English as translation language has a special role because it serves as an intermediary for all other language pairs.


Do you know this for a fact or is this just your hypothesis? There are a few research articles available at research.google.com/pubs/MachineTranslation.html, which are for the most part quite impenetrable for someone who is not an expert on machine translation. My impression is that GT does not use an intermediary language but maps vector spaces of language pairs directly onto each other. There must be some weighting going on when the vectors are mapped, so it should be expected that the path for translating L1 to L2 is in general not the same as from L2 to L1 for the same text. And this is supported by the fact that the reverse translation is not the same as the sentence that you started with.

We should expect that there is a lot of experimentation going on behind the scenes, so you might get different algorithms for different language pairs at different phases of the moon. This makes the task of guessing what GT really does even harder than it is for GT to understand languages without analysing the grammar.



Doitsujin
Message 6 of 11
21 June 2015 at 1:08am
OlafP wrote:
Iversen wrote:

English as translation language has a special role because it serves as an intermediary for all other language pairs.
Do you know this for a fact or is this just your hypothesis?

There's an older video by the head of Google Translate, Franz Och, that explains the process.

Apparently, Google occasionally uses "bridging languages" if the available bilingual corpora don't contain enough words. For example, Och mentions Yiddish.
(Since you speak German, you can find many other German GT-related videos by Och on YouTube.)


vonPeterhof
Tetraglot
Senior Member
Russian Federation
Joined 4554 days ago

715 posts - 1527 votes 
Speaks: Russian*, English (C2), Japanese, German
Studies: Kazakh, Korean, Norwegian, Turkish

 
 Message 7 of 11
21 June 2015 at 10:18am
From my experience with translating from Ukrainian and Belarusian into Russian, as well as from Korean to Japanese, these pairs of languages don't seem to go through English most of the time. Obvious calques from English and untranslated English words show up extremely rarely, at least compared to translating from most Western European languages into Russian.

Iversen
Message 8 of 11
21 June 2015 at 11:52am
OlafP wrote:
Iversen wrote:

English as translation language has a special role because it serves as an intermediary for all other language pairs.

Do you know this for a fact or is this just your hypothesis?


I would call it a testable hypothesis. If I can insert an English word into a text in another language and get it correctly translated into a third language along with the native words, then it is likely to have passed unchanged into the intermediary stage. And if this can only be done with English, then the intermediary stage must be either English or based on English. The word should of course be one that isn't commonly used as a loanword in the original source language, but you just have to choose your test words carefully to avoid this:

В Российской Федерации, согласно ст. 235 ГК РФ, национализация должна проводиться в порядке,(from Wikipedia)
В Российской Федерации, согласно ст. 235 ГК РФ, национализация must be проводиться в порядке


Both are translated into Esperanto as follows:

En la Rusa Federacio, laŭ Arto. 235 de la Civila Kodo, ŝtatigo devus esti efektivigitaj en ordo


I have seen one possible exception: when Esperanto was first introduced I discovered that inserted Spanish words were translated better than English ones, but that irregularity seems to have been eradicated:

Mi ne vidis televidon hodiaŭ
Mi ne vidis televidon today
Mi ne vidis televidon hoy


Danish:
Jeg har ikke set tv i dag
Jeg har ikke set tv i dag
Jeg har ikke set tv-hoy
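
The test procedure above can be written down as a small harness. This is only a sketch in Python: the translate argument stands for whatever translation service you use, and here it is replaced by a canned lookup table (CANNED_DANISH, a name invented for the example) that contains just the three Esperanto-to-Danish results quoted above, so nothing is actually sent to GT.

```python
# The three GT results quoted above (Esperanto input -> Danish output).
CANNED_DANISH = {
    "Mi ne vidis televidon hodiaŭ": "Jeg har ikke set tv i dag",
    "Mi ne vidis televidon today": "Jeg har ikke set tv i dag",
    "Mi ne vidis televidon hoy": "Jeg har ikke set tv-hoy",
}


def canned_translate(sentence):
    """Stand-in for a real translation call; it only knows the quoted examples."""
    return CANNED_DANISH[sentence]


def probe_foreign_words(translate, sentence, native_word, foreign_words):
    """Swap native_word for each foreign word and compare the translations.

    Returns a dict {foreign_word: True/False}; True means the sentence with
    the inserted word was translated exactly like the untouched sentence.
    """
    baseline = translate(sentence)
    results = {}
    for word in foreign_words:
        probe = sentence.replace(native_word, word)
        results[word] = translate(probe) == baseline
    return results


if __name__ == "__main__":
    outcome = probe_foreign_words(
        canned_translate, "Mi ne vidis televidon hodiaŭ", "hodiaŭ", ["today", "hoy"]
    )
    print(outcome)  # {'today': True, 'hoy': False}
```

With a real translation function plugged in instead of canned_translate, the same loop could be run over several candidate intermediary languages to see which inserted words pass through unharmed.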


Another clue is the treatment of compound words. English favors word combinations, while languages like Danish and Dutch (and Russian!) prefer long compounds:

In het grafveld van Rhenen werden ook veertien paardengraven aangetroffen.

GT: På kirkegården i Rhenen blev også fundet fjorten hest grave.
me: På kirkegården i Rhenen blev der også fundet fjorten hestegrave.

me: In the cemetery of Rhenen were also found fourteen horse graves.


If a compound is split midway through the process then it is likely that it has passed through a language that favors separate words, such as English.

The suggestion that GT uses mappings between other language combinations is interesting (and maybe correct, if it comes from a trustworthy insider), but it doesn't explain the behaviours I just mentioned.   


As for Die Sozialisierung (...) and The socialization of the property (...):

Doitsujin wrote:
Actually, GT got it wrong, because (...) should have been translated as nationalization = Verstaatlichung. (I'm pretty sure that you correctly inferred this meaning from the context.) However, since GT is based on statistics and Sozialisierung is predominantly used like its English cognate you can't really blame the algorithm.


This is of course correct. I was only thinking of the structure of the phrase, but should also have commented on the semantical error. I have changed the passage in my message to reflect this.


Edited by Iversen on 21 June 2015 at 12:38pm




Copyright 2024 FX Micheloud - All rights reserved
No part of this website may be copied by any means without my written authorization.