Bilingual Text Request Thread (Language Programs, Books & Tapes) Language Learning Forum

Bilingual Text Request Thread
Tags: Bilingual texts


DavidW
Hexaglot
Senior Member
United Kingdom
Joined 6532 days ago
318 posts - 458 votes

Speaks: English*, Spanish, French, Italian, Persian, Malay
Studies: Russian, Arabic (Written), Portuguese, German, Urdu

Message 57 of 101

22 September 2011 at 11:00pm

Been a little busy. Will post an update tomorrow. David
1 person has voted this message useful

Crush
Tetraglot
Senior Member
China
Joined 5871 days ago
1622 posts - 2299 votes

Speaks: English*, Spanish, Mandarin, Esperanto
Studies: Basque

Message 58 of 101

23 September 2011 at 9:08am

I tried to get Bleualign running, but it gives me a file-access error (trying to read from an unopened file). It may be my version of Python, though; I'll try with 2.6. It really wouldn't be all that difficult to get Google translations of the texts: I split the file up into about ten parts and that worked for me.
1 person has voted this message useful

DavidW

Message 59 of 101

23 September 2011 at 8:06pm

Starting to put some information together. Would have put this in the first post, but it seems I cannot edit it anymore.

--Aim--

Make an organised library of bilingual texts for the study of many languages, with links to matching audiobooks.

--Current Process--

1. REQUEST: You post a request for a book in this thread. If you can, please find copies of the texts online, and post the links. If the book is not available online, it can be scanned and OCRed.

2. AUTOMATIC ALIGNMENT: Once the texts are available, I will run them through a number of scripts I use, which produce a fairly accurate sentence-by-sentence aligned text. The basic format is a table with two columns, one for each language, saved as an RTF file. Each sentence is placed in its own cell. Paragraph breaks are marked by '***' placed in its own cell. Alignment errors recognised by the scripts are marked by '###'.

3. EDIT: Next, the texts need to be edited by hand to clean up any alignment errors. In some cases, the translation can be edited to follow the meaning of the original text more closely, but this may create differences with existing audiobooks. If no audiobook yet exists for the translation, or the translation is not intended to be the object of study, this is less of a problem.

4. OUTPUT FILES: The edited files are imported into a desktop publishing program, and professional-quality PDFs are produced for printing on A4/letter paper. Other formats, such as EPUB or HTML, are also possible.
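For anyone who wants to post-process the aligned files from step 2, here is a rough Python sketch of how the '***' and '###' markers could be handled once the table has been read into memory. The list-of-pairs layout, the function names and the sample sentences are assumptions made for illustration; they are not the scripts used for this project, and reading the RTF table itself is not shown.

```python
PARAGRAPH_MARK = "***"  # paragraph boundary, as described in step 2
ERROR_MARK = "###"      # alignment error flagged by the scripts


def group_paragraphs(rows):
    """Group (source, target) sentence pairs into paragraphs.

    `rows` is a list of 2-tuples of strings. This in-memory layout is an
    assumption; the real files are two-column RTF tables.
    """
    paragraphs, current = [], []
    for source, target in rows:
        if PARAGRAPH_MARK in (source.strip(), target.strip()):
            # A marker row closes the paragraph collected so far.
            if current:
                paragraphs.append(current)
                current = []
        else:
            current.append((source, target))
    if current:
        paragraphs.append(current)
    return paragraphs


def rows_needing_review(rows):
    """Return the indices of rows where either cell carries the error mark."""
    return [
        i for i, (source, target) in enumerate(rows)
        if ERROR_MARK in source or ERROR_MARK in target
    ]


# Made-up sample rows, only to show the shape of the data.
rows = [
    ("Il avait trois fils.", "He had three sons."),
    ("Le roi était vieux.", "The king was old."),
    ("***", "***"),
    ("###", "The youngest was called John."),
    ("Il partit un matin.", "He set out one morning."),
]

print(len(group_paragraphs(rows)))  # 2
print(rows_needing_review(rows))    # [3]
```

The rows returned by rows_needing_review are the ones an editor would look at first in step 3.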

--What kinds of books can I request?--

Texts are hosted in Belarus, which has life+50 copyright laws. Texts whose author (or, for a translated text, both the original author and the translator) died before 1961 can be hosted without obtaining permissions. When downloading texts, you are responsible for ensuring that you are not violating the copyright laws of your country.

For works still under copyright, you can still make a request, and I will look into obtaining permissions.
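The life+50 rule above boils down to simple arithmetic (2011 - 50 = 1961). A minimal sketch, assuming the years of death are known; the function name and the use of None for "unknown or still living" are illustrative assumptions, and real copyright terms have details (such as running to the end of a calendar year) that this ignores:

```python
def is_hostable(death_years, current_year=2011, term_years=50):
    """Apply the thread's rule of thumb for a life+50 jurisdiction.

    `death_years` lists the year of death of every contributor: the author,
    and the translator too for a translated text. Use None when a year is
    unknown or the person is still living. This is a simplification, not
    legal advice.
    """
    cutoff = current_year - term_years  # 2011 - 50 = 1961
    return bool(death_years) and all(
        year is not None and year < cutoff for year in death_years
    )


# Karel Čapek died in 1938, so his original texts pass the check.
print(is_hostable([1938]))        # True
# A translation by a (hypothetical) translator who died in 1970 does not.
print(is_hostable([1938, 1970]))  # False
```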

--Immediate Plans--

Create guidelines/standards for the editing of texts.
Set up forums to discuss potential texts.
Put a message up in the Project Gutenberg forums to recruit a volunteer to scan books.
Put a message up in the Project Gutenberg forums to recruit volunteer readers, to produce audio for materials without audiobooks available.

--Future Plans--

Finish work on a specially-made database to hold information about potential books, translations, and audiobooks.

Edited by DavidW on 23 September 2011 at 8:41pm
3 persons have voted this message useful

DavidW

Message 60 of 101

23 September 2011 at 8:16pm

--Authors to be investigated--

Czech: Karel Čapek

--Books Currently awaiting scanning/OCR--

Markens grøde (Knut Hamsun)
Alice in Wonderland in Norwegian

--Currently available un-edited files--

TODO

--To Produce output files--

TODO

Edited by DavidW on 23 September 2011 at 8:21pm
1 person has voted this message useful

DavidW

Message 61 of 101

23 September 2011 at 8:26pm

Another problem with Bleualign is that it is basically designed to 'harvest' sentences for training a machine translation system. So it will sometimes delete sections of a text that are missing in the translation. This behaviour could probably be changed fairly easily if you were familiar with the code of the program.

You can split a text into small segments and use the Google Translation API, but there is a limit to the number of requests you can make to the API in 24 hours, and it wouldn't allow you to do more than a book or two a day. Perhaps OK for your own use, but not for offering a service to others.

Asking users to obtain the translation themselves is possible, but splitting up the text, then copying and pasting it into Google Translate ten times, is quite annoying.
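The splitting step, at least, is easy to script. Below is a minimal sketch that cuts a text into chunks at sentence boundaries so that each chunk stays under a size limit. The 5000-character default is an arbitrary placeholder, not a documented limit of any translation service, and the regular expression is only a crude sentence splitter.

```python
import re


def split_into_chunks(text, max_chars=5000):
    """Split `text` into chunks of at most `max_chars` characters,
    breaking only after sentence-ending punctuation.

    `max_chars` is a placeholder value. A single sentence longer than the
    limit is kept whole and becomes an oversized chunk of its own.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three. Four.", max_chars=10))
# ['One. Two.', 'Three.', 'Four.']
```

Each chunk could then be pasted (or sent) for translation separately and the results joined back together in order.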

Possibly an offline translation system could be used, like SYSTRAN. The quality of the translation wouldn't be as good as Google Translate, but this translation is only used to help with alignment, so it wouldn't matter too much. I can't remember now why I didn't try this approach.

The current method provides results that are just as good as Bleualign, although it requires about 40 minutes per bilingual text. About half of that time is spent locating and cleaning up the source texts.

Edited by DavidW on 23 September 2011 at 8:38pm
1 person has voted this message useful

montmorency
Diglot
Senior Member
United Kingdom
Joined 4834 days ago
2371 posts - 3676 votes

Speaks: English*, German
Studies: Danish, Welsh

Message 62 of 101

23 September 2011 at 9:54pm

David,

You are probably aware of this, but I was interested to read this in the Google Translate Wikipedia page:

Quote:

On May 26, 2011, Google announced that the Google Translate API had been deprecated and that it would cease functioning on December 1, 2011 "due to the substantial economic burden caused by extensive abuse."[4][5] The shutting down of the API, which is used by a number of websites, has led to criticism of Google and developers questioning the viability of using Google APIs in their products.[6][7]

On June 3, 2011, Google announced that they were canceling their plan to terminate the Translate API due to public pressure. In the same announcement, Google said that they will release a paid version of the Translate API. [4][8]

1 person has voted this message useful

Doitsujin
Diglot
Senior Member
Germany
Joined 5326 days ago
1256 posts - 2363 votes

Speaks: German*, English

Message 63 of 101

24 September 2011 at 8:35am

I'd like to request a bilingual version of The Count of Monte Cristo.

French source: Wikisource
English translation: Project Gutenberg

(The French source is also available at Project Gutenberg, but the French Wikisource version is better formatted.)

BTW, I looked into using Bleualign and Moses SMT myself, but I couldn't find any pre-trained Open Source French-English versions. What kinds of bilingual texts did you train Moses with?
2 persons have voted this message useful

DavidW

Message 64 of 101

26 September 2011 at 7:07pm

"On June 3, 2011, Google announced that they were canceling their plan to terminate the
Translate API due to public pressure. In the same announcement, Google said that they
will release a paid version of the Translate API."

--Interesting. Thanks for that.

There are free corpora available for many languages, such as Europarl: http://www.statmt.org/europarl/ (proceedings of the European Parliament). Unfortunately, very little freely available bilingual material is based on literature, which would be the best kind of material for preparing a system to translate literature. Setting up and training the system is not straightforward. I suggest using the scripts 'moses-for-mere-mortals' to get things going faster.

These last couple of days I set up a domain (omilia.org, hosted in Canada, life+50), and set up 'Google Apps.' This has some useful features, like Google 'Docs' (for online editing of the parallel texts), 'Groups' (for discussing possible texts), and 'Sites' (wiki-style pages with file storage etc., for sharing info and techniques).

Will do "The Count of Monte Cristo" tomorrow, hopefully.

Edited by DavidW on 26 September 2011 at 8:11pm

3 persons have voted this message useful
