DavidW Hexaglot Senior Member United Kingdom Joined 6532 days ago 318 posts - 458 votes Speaks: English*, Spanish, French, Italian, Persian, Malay Studies: Russian, Arabic (Written), Portuguese, German, Urdu
| Message 57 of 101 22 September 2011 at 11:00pm | IP Logged |
Been a little busy. Will post an update tomorrow. David
1 person has voted this message useful
| Crush Tetraglot Senior Member ChinaRegistered users can see my Skype Name Joined 5871 days ago 1622 posts - 2299 votes Speaks: English*, Spanish, Mandarin, Esperanto Studies: Basque
| Message 58 of 101 23 September 2011 at 9:08am | IP Logged |
I tried to get Bleualign running, but it gives me a file-access error (trying to read from an unopened file). It may be my version of Python, though, so I'll try with 2.6. It really wouldn't be all that difficult to get Google translations of the texts: I split the file up into about ten parts and that worked for me.
1 person has voted this message useful
| DavidW Hexaglot Senior Member United Kingdom Joined 6532 days ago 318 posts - 458 votes Speaks: English*, Spanish, French, Italian, Persian, Malay Studies: Russian, Arabic (Written), Portuguese, German, Urdu
| Message 59 of 101 23 September 2011 at 8:06pm | IP Logged |
Starting to put some information together. Would have put this in the first post, but it seems I cannot edit it anymore.
--Aim--
Make an organised library of bilingual texts for the study of many languages, with links to matching audiobooks.
--Current Process--
1. REQUEST: You post a request for a book in this thread. If you can, please find copies of the texts online, and post the links. If the book is not available online, it can be scanned and OCRed.
2. AUTOMATIC ALIGNMENT: Once the texts are available, I will run them through a number of scripts I use, which produce a fairly accurate sentence-by-sentence aligned text. The basic format is a table with two columns, one for each language, saved as an RTF file. Each sentence is placed in its own cell. Paragraphs are marked by '***' placed in its own cell. Alignment errors recognised by the scripts are represented by '###'.
3. EDIT: Next, the texts need to be edited by hand to clean up any alignment errors. In some cases, the translation can be edited to follow the meaning of the original text more closely, but this may create differences with existing audiobooks. If no audiobook yet exists for the translation, or the translation is not intended to be the object of study, this is less of a problem.
4. OUTPUT FILES: The edited files are imported into a desktop publishing program, and professional-quality PDFs are produced for printing on A4/letter paper. Other formats, such as EPUB or HTML, are also possible.
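To make the table format in step 2 concrete, here is a minimal Python sketch of how aligned sentence pairs could be turned into two-column rows. It is not the scripts actually used for the project. The function names are made up for illustration, and two details are assumptions: that the '***' paragraph marker goes in both columns, and that a sentence with no counterpart gets '###' in the empty cell.

```python
PARAGRAPH_MARK = "***"  # paragraph break, placed in its own row (assumed: both columns)
ERROR_MARK = "###"      # alignment error flagged for hand editing (placement assumed)


def build_rows(paragraphs):
    """Turn aligned paragraphs into two-column table rows.

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    (source_sentence, target_sentence) pairs. A missing side (None or "")
    is treated as an alignment error.
    """
    rows = []
    for index, paragraph in enumerate(paragraphs):
        if index > 0:
            # Separate consecutive paragraphs with a marker row.
            rows.append((PARAGRAPH_MARK, PARAGRAPH_MARK))
        for source, target in paragraph:
            rows.append((source or ERROR_MARK, target or ERROR_MARK))
    return rows


def rows_needing_edit(rows):
    """Return the indexes of rows that contain an alignment-error marker."""
    return [n for n, (source, target) in enumerate(rows)
            if ERROR_MARK in (source, target)]


if __name__ == "__main__":
    sample = [
        [("Hello.", "Bonjour."), ("How are you?", None)],
        [("Goodbye.", "Au revoir.")],
    ]
    table = build_rows(sample)
    for left, right in table:
        print(f"{left:<15} | {right}")
    print("Rows to fix by hand:", rows_needing_edit(table))
```

A list like the one returned by rows_needing_edit would give the person doing step 3 a place to start, instead of reading the whole table looking for '###'.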
--What kinds of books can I request?--
Texts are hosted in Belarus, which has life+50 copyright laws. Texts whose author (or, for a translation, both the original author and the translator) died before 1961 can be hosted without obtaining permission. When downloading texts, you are responsible for ensuring that you are not violating the copyright laws of your country.
For works still under copyright, you can still make a request, and I will look into obtaining permissions.
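The "died before 1961" rule can be written as a small check. This is only a sketch of the arithmetic. The function name is made up, and it assumes that protection runs to the end of the calendar year 50 years after death, which is what makes 1960 the last qualifying year in 2011. An unknown death year is treated as "not hostable".

```python
def is_hostable(death_years, current_year=2011, term=50):
    """Return True if every contributor's life+term protection has expired.

    `death_years` lists the year of death of each contributor (the author,
    plus the translator for a translated text). Use None for a contributor
    who is still alive or whose death year is unknown.
    Assumption: protection lasts until the end of the year `term` years
    after death, so the work is free from 1 January of the following year.
    """
    return all(
        year is not None and year + term < current_year
        for year in death_years
    )


if __name__ == "__main__":
    print(is_hostable([1870]))        # author died 1870
    print(is_hostable([1938, 1965]))  # translator died 1965
    print(is_hostable([1952, None]))  # translator's death year unknown
```

For a translated text, pass both death years; a single contributor still under protection is enough to make the check fail.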
--Immediate Plans--
Create guidelines/standards for the editing of texts.
Set up forums to discuss potential texts.
Put a message up in the Project Gutenberg forums to recruit a volunteer to scan books.
Put a message up in the Project Gutenberg forums to recruit volunteer readers, to produce audio for materials without audiobooks available.
--Future Plans--
Finish work on a specially-made database to hold information about potential books, translations, and audiobooks.
Edited by DavidW on 23 September 2011 at 8:41pm
3 persons have voted this message useful
| DavidW Hexaglot Senior Member United Kingdom Joined 6532 days ago 318 posts - 458 votes Speaks: English*, Spanish, French, Italian, Persian, Malay Studies: Russian, Arabic (Written), Portuguese, German, Urdu
| Message 60 of 101 23 September 2011 at 8:16pm | IP Logged |
--Authors to be investigated--
Czech: Karel Čapek
--Books Currently awaiting scanning/OCR--
Markens grøde (Knut Hamsun)
Alice in Wonderland in Norwegian
--Currently available un-edited files--
TODO
--To Produce output files--
TODO
Edited by DavidW on 23 September 2011 at 8:21pm
1 person has voted this message useful
| DavidW Hexaglot Senior Member United Kingdom Joined 6532 days ago 318 posts - 458 votes Speaks: English*, Spanish, French, Italian, Persian, Malay Studies: Russian, Arabic (Written), Portuguese, German, Urdu
| Message 61 of 101 23 September 2011 at 8:26pm | IP Logged |
Another problem with Bleualign is that it is basically designed to 'harvest' sentences for training a machine translation system, so it will sometimes delete sections of a text that are missing in the translation. This behaviour could probably be changed fairly easily by someone familiar with the program's code.
You can split a text into small segments and use the Google Translation API, but there is a limit to the number of requests you can make to the API in 24 hours, and it wouldn't allow you to do more than a book or two a day. Perhaps OK for your own use, but not for offering a service to others.
Asking the user to obtain the translation themselves is possible, but splitting up the text, copy/paste into google translate ten times etc. is quite annoying.
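For anyone who wants to do the splitting themselves, here is a rough Python sketch that cuts a text into pieces at paragraph breaks so each piece stays under a size limit. The function name and the 5000-character default are assumptions for illustration, not a limit documented by Google. It also assumes paragraphs are separated by blank lines. A single paragraph longer than the limit is kept whole rather than cut mid-sentence.

```python
def split_into_chunks(text, max_chars=5000):
    """Split `text` into chunks of at most `max_chars` characters.

    Chunks are cut only at blank lines (paragraph breaks), so sentences
    stay intact. A paragraph longer than `max_chars` becomes its own
    oversized chunk. The default limit is an arbitrary assumption.
    """
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    book = "\n\n".join(f"Paragraph {n}. " + "word " * 200 for n in range(1, 31))
    parts = split_into_chunks(book)
    print(len(parts), "chunks; longest is", max(len(p) for p in parts), "characters")
```

Each chunk can then be pasted into the translator (or sent to an API) in order, and the translated pieces joined back together with blank lines between them.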
Possibly an offline translation system could be used, like SYSTRAN. The quality of the translation wouldn't be as good as Google Translate, but this translation is only used to help with alignment, so it wouldn't matter too much. I can't remember now why I didn't try this approach.
The current method provides results that are just as good as Bleualign, although it requires about 40 minutes per bilingual text. About half of that time is spent locating and cleaning up the source texts.
Edited by DavidW on 23 September 2011 at 8:38pm
1 person has voted this message useful
| montmorency Diglot Senior Member United Kingdom Joined 4834 days ago 2371 posts - 3676 votes Speaks: English*, German Studies: Danish, Welsh
| Message 62 of 101 23 September 2011 at 9:54pm | IP Logged |
David,
You are probably aware of this, but I was interested to read this in the Google Translate Wikipedia page:
Quote:
On May 26, 2011, Google announced that the Google Translate API had been deprecated and that it would cease functioning on December 1, 2011 "due to the substantial economic burden caused by extensive abuse."[4][5] The shutting down of the API, which is used by a number of websites, has led to criticism of Google and developers questioning the viability of using Google APIs in their products.[6][7]
On June 3, 2011, Google announced that they were canceling their plan to terminate the Translate API due to public pressure. In the same announcement, Google said that they will release a paid version of the Translate API. [4][8]
1 person has voted this message useful
| Doitsujin Diglot Senior Member Germany Joined 5326 days ago 1256 posts - 2363 votes Speaks: German*, English
| Message 63 of 101 24 September 2011 at 8:35am | IP Logged |
I'd like to request a bilingual version of The Count of Monte Cristo.
French source: Wikisource
English translation: Project Gutenberg
(The French source is also available at Project Gutenberg, but the French Wikisource version is better formatted.)
BTW, I looked into using Bleualign and Moses SMT myself, but I couldn't find any pre-trained Open Source French-English versions. What kinds of bilingual texts did you train Moses with?
2 persons have voted this message useful
| DavidW Hexaglot Senior Member United Kingdom Joined 6532 days ago 318 posts - 458 votes Speaks: English*, Spanish, French, Italian, Persian, Malay Studies: Russian, Arabic (Written), Portuguese, German, Urdu
| Message 64 of 101 26 September 2011 at 7:07pm | IP Logged |
"On June 3, 2011, Google announced that they were canceling their plan to terminate the
Translate API due to public pressure. In the same announcement, Google said that they
will release a paid version of the Translate API."
--Interesting. Thanks for that.
There are free corpuses available for many languages, such as Europarl:
http://www.statmt.org/europarl/ (proceedings of the European parliament).
Unfortunately, there is very little (freely available) bilingual material based on
literature available, which would be the best kind of material for preparing a system
to translate literature. Setting up and training the system is not straightforward. I
suggest using the scripts 'moses-for-mere-mortals' to get things going faster.
These last couple of days I set up a domain (omilia.org, hosted in Canada, life+50),
and set up 'Google Apps.' This has some useful features, like Google 'Docs' (for
on-line editing of the parallel texts), 'Groups' (for discussing possible texts), 'Sites'
(wiki-style pages with file storage etc., for sharing info and techniques) etc.
Will do "The Count of Monte Cristo" tomorrow, hopefully.
Edited by DavidW on 26 September 2011 at 8:11pm
3 persons have voted this message useful