andras_farkas Tetraglot Groupie Hungary Joined 4904 days ago 56 posts - 165 votes Speaks: Hungarian*, Spanish, English, Italian
| Message 1 of 51 19 March 2013 at 12:30pm | IP Logged |
Hi everyone, a recent thread here inspired me to consider compiling a library of
bilingual books for language learners.
I am the author of a text
aligner and I have a fair bit of experience aligning large texts.
I don't plan to host audiobook mp3s to go with the reading material, but I can post
links to them on my site.
Would there be interest in something like this? Would anyone be willing to help with
collecting texts and reviewing sentence alignments?
The idea is to compile a library of aligned books and make it available online for
language learners to use. I would be willing to host them in HTML, txt and possibly
some ebook format. The key is finding good quality raw material (full texts with no
extraneous material, preferably not OCRed, not pdf).
Now, I don't want to duplicate other people's work, so I'd want to integrate into the
collection all the easily accessible bilingual books that other people have compiled.
I've found threads by DavidW ( http://how-to-learn-any-language.com/forum/forum_posts.asp?TID=29170 )
and MarcoDiAngelo ( http://how-to-learn-any-language.com/forum/forum_posts.asp?TID=19917 ) here, and a sizeable collection
here: http://lr.learnlangs.com/lrwiki/Complete_gratis_legal_LR_material
Is there any other resource I should consider? Are the bilingual books prepared by
DavidW and MarcoDiAngelo available anywhere? Would they be willing to share them?
To whet your appetites, here is Arthur Conan Doyle's Rodney Stone in English and
French.
And here's the Sign of Four from the same author in English and Dutch.
Edited by andras_farkas on 19 March 2013 at 12:34pm
21 persons have voted this message useful
| LangOfChildren Tetraglot Groupie Germany Joined 5431 days ago 82 posts - 141 votes Speaks: German*, English, French, Swedish Studies: Mandarin, Japanese, Thai, Russian
| Message 2 of 51 20 March 2013 at 9:21pm | IP Logged |
I don't understand how it's possible that this software turns even the horrible pdf's I have into a relatively good parallel text. The two source files didn't match up well at all, and yet, the output file has everything aligned properly — like magic.
Needless to say, I'm very happy to have found this. I was actually right in the middle of hand-crafting a parallel text of the first Harry Potter book, and it involved a lot of copy-pasting individual paragraphs. Very time-consuming and annoying.
So I'd like to thank you for this program. I'll be sure to recommend it to others who like to do listening-reading.
1 person has voted this message useful
| andras_farkas Tetraglot Groupie Hungary Joined 4904 days ago 56 posts - 165 votes Speaks: Hungarian*, Spanish, English, Italian
| Message 3 of 51 21 March 2013 at 11:35am | IP Logged |
^It is pretty remarkable. Most of the credit goes to Hunalign (the alignment engine that
pairs up sentences). Hunalign is not my work; I only added the rest of the functionality
(the conversion of documents to txt, preprocessing, postprocessing, conversion to xls,
multilingual alignments, GUI, alignment editor etc.)
3 persons have voted this message useful
| andras_farkas Tetraglot Groupie Hungary Joined 4904 days ago 56 posts - 165 votes Speaks: Hungarian*, Spanish, English, Italian
| Message 4 of 51 26 March 2013 at 10:53am | IP Logged |
No interest, then? The site will go ahead, with a couple of dozen books uploaded in the
next week or two, and hundreds more added in the next year.
If anyone is interested in participating or has book requests, speak up.
It'd be nice to contact DavidW, but his PM inbox is full.
Here's The Little Prince in English, French,
Italian, Spanish, Hungarian, German, Greek and Russian as a demonstration of what's
possible vis-a-vis multilingual alignments. This took a bit less than a full day's
work.
The Greek and Russian texts need a little more work, as does the German to a lesser
extent. I don't understand these languages well enough to do a decent job in a
reasonable amount of time.
Edited by andras_farkas on 26 March 2013 at 10:57am
7 persons have voted this message useful
| Bubblyworld Newbie South Africa Joined 4290 days ago 7 posts - 11 votes Speaks: English* Studies: Xhosa, French, Japanese
| Message 5 of 51 26 March 2013 at 12:17pm | IP Logged |
I'm afraid I can't be of much help checking sentence alignment, as I am not nearly skilled enough at any of the mentioned languages, but I think this would be of enormous value!
I would be interested in participating, though, if there are other tasks to be done, so let me know.
1 person has voted this message useful
| andras_farkas Tetraglot Groupie Hungary Joined 4904 days ago 56 posts - 165 votes Speaks: Hungarian*, Spanish, English, Italian
| Message 6 of 51 26 March 2013 at 12:36pm | IP Logged |
Well, there is always an infinite amount of work to do in something like this, mostly looking for new texts and manually reviewing/correcting alignments.
Right now, I have about 100 aligned books handed over to me by MarcoDiAngelo. I'll convert them to a uniform format, check the alignments and upload them. In the meantime, I'll also look for new texts to align and prepare new books. The most
convenient source is Gutenberg.org, but there are countless other sources as well for individual languages.
If you want to get stuck into it, download the software from http://sourceforge.net/projects/aligner/ and try it out (I hope you're on Windows). The two readme files should give you all the info you need. If you download txt files from
Gutenberg, use the pdf filetype. Gutenberg txt files are corrupted by superfluous line breaks.
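For anyone who ends up with a hard-wrapped txt anyway, here is a minimal Python sketch of one way to undo the superfluous line breaks. It is not how the aligner does it; the function name unwrap is made up for illustration, and it assumes that paragraphs are separated by blank lines and that every single line break inside a paragraph is just wrapping.

```python
import re

def unwrap(text):
    """Join hard-wrapped lines so that each paragraph ends up on one line.

    Assumption: a blank line marks a paragraph boundary, and any other
    line break is superfluous wrapping.
    """
    # Normalise Windows line endings before splitting.
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    # split() + join collapses the inner line breaks and repeated spaces.
    return "\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

Texts where verse or dialogue relies on single line breaks would need gentler handling than this.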
I'll soon upload a new version with improved editing shortcuts so keep an eye on the sourceforge page.
Then think about what books you might want to do. I'm sure a lot of people would be interested in English-French and English-Japanese books, which you might want to do for your own use as well. For copyright reasons I'm only interested
in authors who died before 1943.
Contact me at andras.farkas (yahoo) if you've chosen a book. I'll check if I have it already (unlikely). Let me know if you need help with something.
2 persons have voted this message useful
| andras_farkas Tetraglot Groupie Hungary Joined 4904 days ago 56 posts - 165 votes Speaks: Hungarian*, Spanish, English, Italian
| Message 7 of 51 26 March 2013 at 2:54pm | IP Logged |
Here's an example of the sort of scouting/legwork that would be useful:
I plan to release a large collection of multilingual aligned Jules Verne novels.
A lot of them are available in a couple of languages here:
http://www.gutenberg.org/browse/authors/v
and here:
http://manybooks.net/authors/vernejul.html (their files seem to be better than Gutenberg's)
And in Hungarian, here:
mek.oszk.hu
A couple more are available in various languages on Wikisource and on random sites all over the internet.
So, I'll have to grab the books from these various sources, record the source of each file, rename the files following my uniform file naming conventions that allow me to keep track of what's what, make backup copies of the originals, convert the files to
txt, strip any extra material I don't need, run the aligner on each novel and review the output.
If you're interested, you can get started right now. I will describe the file naming and documentation conventions if/when anyone decides to get their hands dirty.
6 persons have voted this message useful
| andras_farkas Tetraglot Groupie Hungary Joined 4904 days ago 56 posts - 165 votes Speaks: Hungarian*, Spanish, English, Italian
| Message 8 of 51 27 March 2013 at 3:11pm | IP Logged |
In response to a question by PM, here are the file naming conventions I use. If anyone wants to participate, please follow them so I can seamlessly integrate your work. Of course if you don't follow them, I can fix your files myself but it makes it
that much harder to keep on top of things.
I only use ASCII letters in file names. No í, no ö, no Cyrillic letters etc. No spaces (replaced by _), commas, apostrophes (left out) or other similar characters. File names start with the author's name with family name first (for alphabetic
ordering). The name is followed by a hyphen and the title, then another hyphen and the two-letter ISO language code (ISO 639-1, see http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes ): German is de, not ger, Spanish is es, Swedish is sv, Slovak
is sk, Slovene is sl, Greek is el etc. The title is listed in the original language, transliterated to ASCII. For languages with non-Latin scripts (Russian, Chinese) the English title is used. Definite articles are left out from the beginning
of titles to help alphabetic ordering.
Original (pdf, doc) files are kept with the naming convention above, and a UTF-8 txt is generated with the same name. ALL surplus material is stripped from the txt, just the text itself is left (Title in the first row, author in the second, then the
text itself, with forewords, prefaces, contents lists and similar stuff removed unless available in all languages). Metadata (full title with accented letters, translator name, comments etc.) is saved into a separate file named [filename]_info.txt.
Metadata files should have one row that says Title: [title], one that says Author: [author name] and one that says Translator: [name of translator(s)]. Aligned files should have a metadata row that says Aligned by: [name]
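As an illustration of the metadata rows described above, here is a minimal Python sketch that produces the contents of an _info.txt file. The function name info_text and its parameters are made up for this example; only the row labels (Title:, Author:, Translator:, Aligned by:) come from the convention itself.

```python
def info_text(title, author, translator, aligned_by=None):
    """Return the rows of a [filename]_info.txt file as one string.

    The 'Aligned by:' row is only added for aligned files.
    """
    lines = [
        f"Title: {title}",
        f"Author: {author}",
        f"Translator: {translator}",
    ]
    if aligned_by:
        lines.append(f"Aligned by: {aligned_by}")
    return "\n".join(lines) + "\n"
```

The result would be written out as UTF-8, for example with open(path, "w", encoding="utf-8"), so that accented letters in the full title survive.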
Aligned files have the same naming convention with the code of extra languages added on to the end.
Examples:
Bronte_Charlotte-Jane_Eyre-en.txt
Bronte_Charlotte-Jane_Eyre-en_info.txt
Bronte_Charlotte-Jane_Eyre-hu.txt
Bronte_Charlotte-Jane_Eyre-en-hu.txt
Bronte_Charlotte-Jane_Eyre-en-hu_info.txt
Hugo_Victor-Notre_Dame_de_Paris-en-hu.txt
SaintExupery_Antoine-Petit_Prince-en-hu.txt   ;<-- note that the French title is used despite the fact that it's an English-Hungarian alignment
SaintExupery_Antoine-Petit_Prince-en-hu_info.txt
SaintExupery_Antoine-Petit_Prince-en-en-fr-es-it-hu-de-el-ru.txt
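For anyone who would rather script the naming than do it by hand, here is a minimal Python sketch of the conventions above. The function names (to_ascii, slug, book_filename) are made up for illustration and are not part of the aligner. Two things are assumptions: hyphens inside a name or title are simply dropped, since the hyphen is the field separator, and leading definite articles are expected to be removed by the caller.

```python
import re
import unicodedata

def to_ascii(text):
    """Strip accents (í -> i, ö -> o) and drop any remaining non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def slug(text):
    """ASCII only, spaces replaced by _, commas/apostrophes/etc. left out."""
    text = to_ascii(text)
    text = re.sub(r"\s+", "_", text.strip())
    # Assumption: hyphens are dropped too, because '-' separates the fields.
    return re.sub(r"[^A-Za-z0-9_]", "", text)

def book_filename(family, given, title, langs, info=False):
    """Build Family_Given-Title-xx[-yy...][_info].txt from its parts."""
    stem = "-".join([f"{slug(family)}_{slug(given)}", slug(title), *langs])
    return stem + ("_info" if info else "") + ".txt"
```

For example, book_filename("Brontë", "Charlotte", "Jane Eyre", ["en", "hu"]) gives Bronte_Charlotte-Jane_Eyre-en-hu.txt. Titles in non-Latin scripts are not transliterated by this sketch; the convention is to pass in the English title for those.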
Aligned files are tab separated UTF-8 txts as generated by the aligner, from which I can auto-generate HTML files or whatever other file format is needed.
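To give a rough idea of what generating HTML from a tab separated file involves, here is a minimal Python sketch. It is not the script used for the site; the function name tsv_to_html is made up, and it assumes one aligned segment per line with one tab-separated cell per language.

```python
import html

def tsv_to_html(tsv_text):
    """Render tab-separated aligned text as an HTML table, one segment per row."""
    rows = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # Escape &, < and > so the book text cannot break the markup.
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in line.split("\t"))
        rows.append(f"<tr>{cells}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

A real page would also need a header row with the language names and some styling, which this sketch leaves out.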
Try to procure good quality source texts if possible, and try to include as many languages in the alignment as possible. It's easier to do an English-French-German-Italian four-way alignment and then generate individual language pairs from that than
to do the alignment separately for all the languages.
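Pulling a language pair out of a multi-way alignment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name extract_pair is made up, each row is assumed to hold one cell per language in a known column order, and rows where both chosen languages are empty are simply dropped.

```python
def extract_pair(rows, langs, first, second):
    """Pick two language columns out of a multi-way alignment.

    rows  : list of rows, each a list with one cell per language
    langs : language codes in column order, e.g. ["en", "fr", "de", "it"]
    """
    i, j = langs.index(first), langs.index(second)
    pair = []
    for row in rows:
        a, b = row[i].strip(), row[j].strip()
        if a or b:  # drop rows that are empty in both languages
            pair.append((a, b))
    return pair
```

Rows that are empty on one side only are kept as they are; whether to merge them into a neighbouring row is a judgment call best left to manual review.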
For copyright reasons, I'm only interested in the works of authors who died before 1943.
--------------------------------------------------------------------
This project is definitely going to go ahead now, so here's some extra information. The files will be hosted on my own website and made available free of charge, in HTML for online reading and in downloadable formats. There's nothing to stop anyone
from downloading and sharing them, but I don't plan to offer bulk downloads and I wouldn't be too happy if someone grabbed the texts and created a copycat site with them without attribution. Automated downloads will be prevented to limit traffic. The
files will contain my own copyright notice, which will say "Copy and distribute as you wish, but indicate where you got the file from". All files will list the names of the persons who aligned them/worked on them.
The site will be a site where I offer translation, interpreting and text alignment services, so the bilingual/multilingual books will be there in part to showcase the available text alignment services.
If I ever decide to take the site down or take the texts off, I will create a megapack of all the texts and dump it on Rapidshare/PirateBay so someone else can take it over.
4 persons have voted this message useful