
The ethology of Google Translate

  Tags: Google | Translation


Iversen
Super Polyglot
Moderator
Denmark
berejst.dk
Joined 6463 days ago

9078 posts - 16473 votes 
Speaks: Danish*, French, English, German, Italian, Spanish, Portuguese, Dutch, Swedish, Esperanto, Romanian, Catalan
Studies: Afrikaans, Greek, Norwegian, Russian, Serbian, Icelandic, Latin, Irish, Lowland Scots, Indonesian, Polish, Croatian
Personal Language Map

 
 Message 9 of 11
21 June 2015 at 11:53am
At the sentence level the exclusively statistical methods of GT are put to a hard test, and it is no surprise that it has problems with word order at that level. Not only are the typical sentence patterns more varied (and often very far from the English ones) than the order of smaller units, but the rules in a grammar are based on the premise that you can identify the different parts of a sentence. And the builders of GT didn't leave rule-based translation systems aside just to reintroduce them now. But maybe that would be easier than guessing. In the video mentioned by Doitsujin, Thomas Ochs comments on this problem, but without being too precise about Google's strategy for conquering it. I'll discuss the problems caused by different word order schemes later, but first an example where I simply don't see any rational explanation:

Romanian: Cunoscut drept experimentul de gândire a alegerii întârziate propus de John Wheeler, acesta a fost încercat pentru prima dată în 1978 cu ajutorul unor raze de lumină reflectate de o oglindă.
(www.descopera.ro)

GT: Known as a thought experiment proposed by John Wheeler delayed election, he was tried for the first time in 1978 using rays of light reflected from a mirror

me: Known as a thought experiment (about) delayed election proposed by John Wheeler, this was tried for the first time in 1978 using rays of light reflected from a mirror

Maybe it is the Romanian 'a-thing' again that rocks the boat. Objectively the easiest solution would have been to leave "delayed election" in its original position and then leave it to the reader to make sense of it there. As it is, GT evokes a nasty picture of Mr. Wheeler being tried using light reflected from a mirror - I think I have heard about that beastly interrogation technique before. Poor man!

In some cases it seems that GT realizes that it has ventured onto thin ice after an unintended displacement and tries to repair a broken sentence - with disastrous results. Here is an Indonesian example:

Indonesian: Planet itu dinamai Kepler-10c dari nama teleskop Kepler yang digunakan Badan Antariksa Amerika Serikat NASA.
(teknologi.news.viva.co.id)

GT: The planet was named Kepler-10c of the names used Kepler telescope US space agency NASA.
me: This planet was called Kepler-10c from the name (of the) telescope Kepler which was-used (by the) space agency (of) USA, NASA

Maybe "the names used" has become some kind of attractor that distorts the space around it, but the calamity above seems to have arisen because GT attached "digunakan" (the perfect participle of 'be used') to the telescope instead of to NASA after the connector "yang".

Once upon a time I was looking for fresh materials about science in Irish, and I hit upon a now defunct site (zenews) with a lot of tempting articles. Alas, there was one little problem:

Irish: Dr Dalley moladh den chéad uair go poibli di smaoineamh go raib Nineveh, ní mBabylon, am suíomh tocsaineach n gairdíni (...)

GT: Dr. Dalley first proposed publicly rape her idea that Nineveh, not Babylon, site time their gardens (...)

The theme of this article certainly wasn't rape, but the suggestion that the wondrous Hanging Gardens of Babylon weren't located in Babylon. Unfortunately I couldn't use the texts at this site because they clearly were produced by Google Translate. The proof is that the verb in Irish generally stands at the beginning of the sentence, preceded only by conjunctions or a few verbal particles, and in the Irish sentences on this site it didn't. The retranslation of the sentence above into Irish (minus "rape") goes as follows, so the problem hasn't been solved:

GT: Dr Dalley beartaithe ar dtús go poiblí ar a smaoineamh go Nineveh, ní Babylon, am láithreáin a n-gairdíní

Can this be a general problem? In that case you wouldn't expect the verbs in German to land at the end in subordinate clauses, and it actually seems there is a problem:

(me): Unfortunately I couldn't use the texts at this site because they clearly were produced by GT

GT: Leider konnte ich nicht verwenden, die Texte auf dieser Seite, weil sie deutlich wurden von GT produziert

me: Leider konnte ich die Texte auf dieser Seite nicht verwenden, weil sie klar von Google Translate hergestellt wurden

In Danish - allegedly an SVO language - the subject normally stands before the verb, but if something else (like an adverbial or a conjunction) usurps that position, the subject jumps to the position right after the verb - but not in Google Translate:

(me): today the sentences are incorrect --->   dag sætningerne er forkerte

The word order should be one of these two:

: i dag er sætningerne forkerte or : sætningerne er forkerte i dag
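
The inversion rule described above can be written down in a few lines. What follows is only a toy sketch in Python: the function name `v2_order` and the idea of receiving the sentence already split into subject, verb and the rest are my own assumptions for illustration, and it says nothing about how GT works internally.

```python
def v2_order(subject, verb, rest, fronted=None):
    """Toy Danish main-clause order: the finite verb keeps the second slot."""
    if fronted is None:
        # Nothing else claims the first slot, so the subject takes it.
        parts = [subject, verb] + list(rest)
    else:
        # A fronted element takes the first slot, so the subject
        # moves to the position right after the verb.
        parts = [fronted, verb, subject] + list(rest)
    return " ".join(parts)


print(v2_order("sætningerne", "er", ["forkerte"], fronted="i dag"))
print(v2_order("sætningerne", "er", ["forkerte", "i dag"]))
```

The hard part is of course not the reordering itself but identifying the subject and the verb in the first place, which is exactly what a purely statistical system does not do.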

It is obviously more difficult to use statistical methods on the word order of whole sentences than it is within a nominal phrase - except in Latin, where it seems to have been the order of the day to distribute the parts of a nominal phrase all over the sentence (and this is one of the reasons that Latin translations suck in both directions). You might expect that GT then would give up and just keep the original word order in all translations, but then you can't explain cases like the ones above.

One 'strategy' of GT is apparently to skip the verb (!):

English: As you probably noticed, in Dutch the word order of the second sentence changes when we put these two together.
(www.ucl.ac.uk/dutchstudies)

German: Wie Sie wahrscheinlich bemerkt, in Holländisch das Wort Ordnung des zweiten Satzes ändert, wenn wir diese beiden zusammen.

Dutch: Zoals u waarschijnlijk al gemerkt, in het Nederlands de woordvolgorde van de tweede zin verandert als we deze twee samen.

However, that may be part of the general tendency to drop words almost randomly - a phenomenon that will be comprehensively illustrated later.

To test the word order in sentences thoroughly the best strategy would be to use English (supposedly an SVO language) and a host of languages with OVS or VSO or whatever, but my languages are not exotic enough. Luckily I have some language guides with examples and hyperliteral translations, and I have tried out some examples in Malagasy - with strange results:

Malagasy: tsy amidy ny trano. ("Le Malgache de poche", Assimil)
French: n'est-pas à-vendre la maison
GT: house for sale    (well, no)

Malagasy: tsara ny tranonareo
French: belle la maison-de-vous
GT: The houses    (yes, what about it)

Malagasy: tia sakafo malagasy aho.
French: j'aime la nourriture malgache
GT: I love Thai food    (well, that too)

I'll comment on lost words (including negations) and on statistically motivated, but faulty substitutions later, but let's just agree here that GT's grasp of Malagasy is somewhat shaky. The amount of bilingual text in this language and English is probably limited (French might be better served), but the very different word order must add to the difficulties with the identification of corresponding elements.

Inversion is common in many languages, and often it is used simply for emphasis. And GT systematically makes a mess of it:

me: jeg stoler ikke på ham --> I do not trust him
me: ham stoler jeg ikke på --> I trust him not to

What happened here? Well, it seems that GT misinterpreted the second sentence and got trapped in another idiomatic pattern, namely "trust somebody NOT to do something", but the crux of the matter is that "jeg" clearly is a nominative form and therefore it must be the subject - even if it comes after the verb. And GT can't draw that inference because it doesn't think in syntactical terms. Do children? Well, not from the beginning, but children can correct their errors because they can spot nonsense and then try another interpretation. And in the end they have learnt that the person that 'does something to somebody' sometimes stands after the verb and the 'somebody' at the start, and if they spot that pattern then the meaning suddenly becomes clear as crystal. GT can't make that reality check. In that situation the team behind GT will have to either invent a way to imitate common sense and worldly savvy or teach the beast some basic grammar.


Edited by Iversen on 21 June 2015 at 4:07pm






Iversen

 
 Message 10 of 11
21 June 2015 at 4:50pm
WORDS LOST AND WON

If you as a human reader hit upon an unknown word in a novel you will normally just skip over it and hope the meaning doesn't go down the drain. But sometimes it does. Google Translate sometimes does the same thing, and sometimes this has repercussions on the rest of the sentence in question. One particularly common error is dropping the negation in a sentence. For the purpose of the present message I have let GT do a new translation of an old text which it got wrong earlier ... but now it works. And I can see that a fair number of my other examples from old bilingual printouts now yield 'nots' galore. This may be irritating for me when I'm trying to illustrate a phenomenon, but it's excellent news for the users of GT. An example:

Dutch: Kunst en geschiedenis zijn niet van gisteren en het Rijksmuseum staat in het heden.
(www.rijksmuseum.nl)

before (Danish): Kunst og historie er i går og Rijksmuseum i nutiden
me: Art and history are yesterday and the Rijksmuseum is in the present.

now: Art and history are not of yesterday and the Rijksmuseum is in the present.

Nice to know.

But these Afrikaans examples are reasonably recent:

Afrikaans: ’n Halwe eier is beter as ’n leë dop — liewer net ’n deel van iets ontvang of besit as om heeltemal niks te kry of te hê nie.
(www.mieliestronk.com)

GT: A half egg is better than a pie - rather just a part of something received or owned as to completely get nothing or to have.
me: A half egg is better than a pie - better just get a part of something received or owned than to get absolutely nothing or to have nothing

and one more:

Afrikaans: (...) en in hanteerbare dele opgedeel sodat 'n enkele persoon nie alleen verantwoordelik is vir die vertaling van 'n reuse deel nie.
(blogs.sun.ac.za)

GT: broken down into manageable parts so that a single person alone is responsible for the translation of a huge share.
me: broken down into manageable parts so that a single person alone ISN'T ever responsible for the translation of a huge share
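
A lost negation is one of the few errors that even a crude automatic check could flag. Here is a minimal sketch in Python, assuming tiny hand-made lists of negation words; the lists, the language codes and the function names are invented for this illustration and have nothing to do with GT's internals.

```python
import re

# Hypothetical, deliberately tiny marker lists; real coverage would need far more.
NEGATION_MARKERS = {
    "af": {"nie", "niks", "geen", "nooit"},
    "nl": {"niet", "niets", "geen", "nooit"},
    "en": {"not", "no", "nothing", "never"},
}


def has_negation(text, lang):
    """Return True if any token of `text` looks like a negation in `lang`."""
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())
    for token in tokens:
        if token in NEGATION_MARKERS[lang]:
            return True
        # English contractions such as "isn't" or "doesn't".
        if lang == "en" and token.endswith("n't"):
            return True
    return False


def negation_dropped(source, target, src_lang, tgt_lang):
    """Flag a translation whose source is negated while the target is not."""
    return has_negation(source, src_lang) and not has_negation(target, tgt_lang)


source = "sodat 'n enkele persoon nie alleen verantwoordelik is vir die vertaling van 'n reuse deel nie"
gt = "so that a single person alone is responsible for the translation of a huge share"
print(negation_dropped(source, gt, "af", "en"))
```

Such a check only counts words and would be fooled by idioms and by negations expressed through prefixes, but it would at least have raised a flag for the two Afrikaans examples above.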

Missing negations are not confined to Afrikaans, but could it be that GT reasons that double negations cancel each other out? Unfortunately GT has not discovered the characteristic double negations of this language:

me: Unfortunately GT hasn't discovered the double negations of Afrikaans:
GT: Ongelukkig GT het nie ontdek die dubbele negations van Afrikaans
me: Ongelukkig GT het nie ontdek die dubbele negations van Afrikaans nie
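
The missing final "nie" is easy to spot mechanically. The sketch below is in Python and rests on a simplified rule of thumb that I am assuming here: a clause negated with "nie" should also end in "nie". Real Afrikaans has more cases than that (a short sentence like "Ek weet nie" needs only one), and the function name is made up for the illustration.

```python
import re


def lacks_closing_nie(sentence):
    """Return True if the sentence contains 'nie' but does not end with it."""
    tokens = re.findall(r"\w+", sentence.lower())
    return "nie" in tokens and tokens[-1] != "nie"


gt_version = "Ongelukkig GT het nie ontdek die dubbele negations van Afrikaans"
my_version = "Ongelukkig GT het nie ontdek die dubbele negations van Afrikaans nie"
print(lacks_closing_nie(gt_version))
print(lacks_closing_nie(my_version))
```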


The inverse process also happens sometimes: negations mysteriously appear out of nothing. Here a negation turns up in the translation of an Icelandic sentence:

Icelandic: Svarthvíta innrauða myndin vinstra megin sýnir manneskju í þykkum reykjarmekki.
(www.stjornufraedi.is)

GT: Black and white infrared image to the left shows the person in thick smoke Jarman not

me: Black and white infrared image to the left shows a person in thick smoke.

Where did "Jarman" and "not" come from? Well, the key is the word "reykjarmekki" ("reykjarmökkur" for "smoke cloud", in the dative singular), which gets translated as "Jarman not smoke" as if the Icelandic word were split into its components. It is understandable that Google hasn't learnt all irregular inflections in Icelandic yet - neither have I.

And speaking of Icelandic: here GT lost the verb and gained another with the opposite meaning:

Stjornufrædi: Rosetta á að endast í um tólf ar
GT: Rosetta to last for twelve years
me: Rosetta has to end [its life] in around twelve years

And here is one from Romania:

Romanian: Sonda europeană Rosetta este primul vehicul spaţial care a reuşit să ajungă pe orbita unei comete, aflată la o distanţă de 510 milioane de kilometri de Terra
(www.descopera.ro)

GT: European probe Rosetta is the first spacecraft that failed to reach the orbit of a comet at a distance of 510 million kilometers from Earth

me: European probe Rosetta is the first spacecraft that succeeded to reach the orbit of a comet, found at a distance of 510 million kilometers from Earth

What else can pop up from the eerie underworld of Google? Maybe a quote from a Serbian article about birds with four wings (or maybe just feathers on their feet):

Serbian: Pojedine ptice, kao što je zlatni orao, zadržali su perje na nogama, ali ono im ne služi kao pomagalo pri letenju već kao zaštita od sunca i hladnoće.
(http://www.blic.rs)

GT: Some birds, such as the golden eagle, kept the feathers on the legs, but what they do not serve as an aid flight but as protection from the sun and cold.

What's that "what" doing here? Where did it come from?

Finally it should be mentioned that GT becomes stubborn and/or extremely parsimonious when it is fed very short items in foreign languages:

Afrikaans: die voël --> bird
English: the bird --> die voël

Where did the article in the first example go? From English it seems, however, that articles are generally respected, but not from Afrikaans. Why this difference?


Edited by Iversen on 21 June 2015 at 4:55pm






Iversen

 
 Message 11 of 11
21 June 2015 at 6:58pm
Words can disappear or - occasionally - materialize out of nothing, but they can also be misinterpreted. In earlier days there was one special case, namely the substitution of something dear to speakers of language A by something else which fills the same role in language B, like capital cities, currencies or favorite actors (this last example was discussed by GT's own Mr. Ochs). This still occurs, but it seems to me that it has become less common. Here first an old example which apparently has been fixed - but not before I got it printed out:

Malay: Kediaman rasmi Yang di-Pertuan Agong, iaitu Istana Negara, juga terletak di Kuala Lumpur. Bandar raya ini juga merupakan pusat kebudayaan dan ekonomi Malaysia kerana kedudukannya selaku ibu negara dan bandar raya primat.

old (Danish): Den officielle residens kongen The, den National Palace, som også findes i Kuala Lumpur. Byen er også den kulturelle og økonomiske centrum i Malaysia på grund af sin position som hovedstad og primat by.

I didn't do an English translation at the time, but the Danish one above is appalling ("primat by" = 'monkey town'). It has only become slightly better - although "kongen The, den National Palace" has become "Yang di-Pertuan Agong, National Palace,".

new slick English translation: The official residence of the Yang di-Pertuan Agong, the National Palace, also located in Kuala Lumpur. This city is also the cultural and economic center of Malaysia due to its position as the capital and primate city.

But what happens if this is translated into the closely related Indonesian? In the good old days it became

old: Kediaman resmi Yang di-Pertuan Agong, iaitu Istana Negara, juga terletak di Jakarta. Kota ini juga merupakan pusat kebudayaan dan ekonomi Indonesia kerana kedudukannya selaku ibukota dan bandar raya primat.

Today this has become the more correct, but less entertaining

new: Kediaman resmi Yang di-Pertuan Agong, yaitu Istana Negara, juga terletak di Kuala Lumpur. Kota ini juga merupakan pusat kebudayaan dan ekonomi Malaysia karena kedudukannya selaku ibukota dan kota primata.

The Malaysians have got their capital and their country back, but the same phenomenon is still rearing its ugly head here:

Esperanto: Esperanto estas tiel bona lingvo, ke ĝi meritas esti parolata sed ne balbutata.
(Lernu)

GT: English is such a good language, that it deserves to be spoken but not balbutata.

Even if proper names aren't exchanged for others they can still cause havoc - for instance if GT tries to interpret them. As in this Icelandic example:

Icelandic: Sauðburður er hafinn. Nú þegar sólin yljar íbúum höfuðborgarsvæðisins fannst ám Fjölskyldu- og húsdýragarðsins óhætt að boða komu vorsins. Ærin Surtla bar tveimur hrútlömbum og ærin Melkorka bar þremur gimbrum snemma morguns þann 4.maí.
(www.mu.is)

GT: Lambing is underway. Now when the sun is warming the inhabitants capital's rivers Family Park and Zoo safe to proclaim the arrival of spring. Aerials acid bar two ram lambs, and there are clear Melkorka bar three ewe early on 4.maí.

(me): Lambing is underway. Now when the sun is warming the inhabitants of the capital region, the ewes of the Family and Domestic Animal Farm found it safe to proclaim the arrival of spring. The ewe Surtla bore two ram lambs, and the ewe Melkorka bore three female lambs early in the morning on 4 May.

OK, there are several issues here. First: "ám" in the second sentence does look like a form of the word "á" ('river'), but it is also a form of the irregular noun "ær", which means 'ewe'. This is an excusable error, but any sane human person would get suspicious when aerials and acid began to block the way of the ram lambs, and maybe also at the sight of clear Melkorkas blocking ewes on that fateful 4th of May. According to my dictionaries there simply isn't a word like "Surtla", but it has been used as a proper name both for persons and for a shortlived volcanic island near the more famous Surtsey. It is probably derived from the mythological name Surt, ruler of the realm of fire, but GT has clearly fixed its gaze on the word "súr", 'sour'. It should just have let Surtla be Surtla. And 'aerials' has precious little to do with sheep.

Occasionally you can see the opposite error: leaving a common word untranslated, which then gets reinterpreted and causes havoc:

Romanian: Prima replica din lume a unui chip preistoric apartine unei faimoase printese siberiene tatuate
(http://stirileprotv.ro)

GT: The first chip replica of a prehistoric world belongs to a famous Siberian princess tattooed,

me: The first replica in the world of a prehistoric face belongs to-a famous Siberian tattooed princess

If that really had been a 2500-year-old chip then somebody ought to start digging for Silicon Valley. But it was 'just' the face of a tattooed mummy princess from around 500 BC. And if you continue to other languages this error will be preserved - and maybe a few more will be added:

Spanish (GT): La primera réplica de chips de un mundo prehistórico pertenece a una famosa princesa siberiana tatuado,

We have to be strict: in the original it wasn't the world that was prehistoric (although it probably was), but the fair lady with the tats.

As I have written before, the thing that saves human learners from some of the worst errors is that we can see when something is patently absurd. Maybe we don't know how to repair it, but it gives us the push to try another interpretation. Or maybe to look a word or two up in a dictionary...






Copyright 2024 FX Micheloud - All rights reserved
No part of this website may be copied by any means without my written authorization.