Ericounet Senior Member France yojik.eu Joined 5435 days ago 157 posts - 414 votes Studies: English, German, Russian
| Message 17 of 24 31 July 2011 at 2:43pm | IP Logged |
Hi,
I just put part of the DLI Russian course, "Individual weapons training", on my website, The Yojik.
Some lessons from the Basic Course are already done, but I have to check them before publishing: I did the formatting some time ago, and I have learned a lot about DocBook since then while formatting FSI, FSI 1973 and Verbs of Motion, so I have to make some changes.
You'll find it in PDF and EPUB format.
I can generate the documents in .doc (Microsoft) format, but there will be a footer with "xmlmind" on every page, as I use their (free) tool to do the job. (They ask NOT to remove this footer.)
If you want it in another format, tell me. The Russian part has no stress marks for the moment (I will add them later).
If someone wants to work with me, I can put the DocBook .xml sources on my SVN server and give access to them.
Note: we could use these courses to build a perfect Russian course: for example, build Pimsleur-type lessons, Anki lists, grammar exercises and so on ... a lot of work, but interesting.
2 persons have voted this message useful
| DavidW Hexaglot Senior Member United Kingdom Joined 6526 days ago 318 posts - 458 votes Speaks: English*, Spanish, French, Italian, Persian, Malay Studies: Russian, Arabic (Written), Portuguese, German, Urdu
| Message 18 of 24 01 August 2011 at 2:50am | IP Logged |
It's nice work you're doing. It's a really great course and deserves some love.
I can do something with the audio. The original is poor, and only 32 kbps, but it can
be improved somewhat. Obviously, if someone could get a better copy, it would be better
to start with that. Here's an example of what can be done:
Original:
www.omilia.org/hosted_files/orig.mp3
Improved:
www.omilia.org/hosted_files/proc.mp3
This has noise reduction with iZotope RX2, some EQ (a slight boost above 4.5 kHz and a
cut below 80 Hz) and compression.
For the text, OCR gives the best appearance, but is also quite time-consuming. There's
a free program called 'Scan Tailor' that can do wonders to clean up the existing scans.
The images must first be extracted from the PDFs. Two tools which can do this are Adobe
Acrobat (not Reader), and 'somePDF' (buggy, but it sometimes manages when Acrobat fails
or gives strange results).
Extract the pages from the PDF as TIFF image files. These, unlike JPEG, are lossless.
Compare them to the original PDF to make sure no quality has been lost. Then import
them into Scan Tailor. Make sure the DPI setting is set correctly in Scan Tailor. We
should decide on a standard output format, perhaps A4 size, 600dpi black/white (not
grayscale).
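To put numbers on the proposed standard, here is a minimal sketch of the pixel dimensions a page would have at a given DPI. It assumes A4 is 210 x 297 mm and 25.4 mm per inch; the function name is only illustrative and is not part of Scan Tailor.

```python
def page_pixels(width_mm, height_mm, dpi):
    """Return (width, height) in pixels for a page of the given size at the given DPI."""
    mm_per_inch = 25.4
    width_px = round(width_mm / mm_per_inch * dpi)
    height_px = round(height_mm / mm_per_inch * dpi)
    return (width_px, height_px)


A4_MM = (210, 297)  # assumed ISO A4 page size in millimetres

# A4 at the suggested 600 dpi
print(page_pixels(*A4_MM, dpi=600))  # (4961, 7016)
```

At 1 bit per pixel (black/white), such a page is roughly 4.35 MB before compression, which is one reason to prefer black/white over grayscale for text-only pages.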
Cleaning up the scans is less work than OCR, but you won't get the advantages of XML,
or be able to reflow the text for smaller screens.
It would be good if someone could work out the structure of the course, so that we could
work out what's missing and prioritise the documents. We could also arrange readings of
the reading texts.
Edited by DavidW on 01 August 2011 at 5:08am
1 person has voted this message useful
| Ericounet Senior Member France yojik.eu Joined 5435 days ago 157 posts - 414 votes Studies: English, German, Russian
| Message 19 of 24 01 August 2011 at 8:53am | IP Logged |
Hi,
I think OCR (and XML) is better because it allows different output formats (PDF, EPUB, doc) and makes it possible to generate interesting things, like extracting the Russian vocabulary to get Anki lists (or lists for whatever program people want); here is an example of what could be done (it was only a test, and I still have to delete the duplicates): 1.pdf.
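As a rough illustration of the "delete the duplicates, then make an Anki list" step, here is a minimal sketch. It assumes the vocabulary has already been extracted as (Russian, English) pairs and that the target is a tab-separated text file, which Anki can import. The sample words and function names are invented for the example and do not come from the course files.

```python
import csv
import io


def dedupe_pairs(pairs):
    """Drop repeated entries, keeping the first occurrence and the original order.

    Two entries count as duplicates when their Russian side matches,
    ignoring case and surrounding whitespace.
    """
    seen = set()
    unique = []
    for russian, english in pairs:
        key = russian.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append((russian.strip(), english.strip()))
    return unique


def to_anki_tsv(pairs):
    """Render pairs as tab-separated text, one note per line (front<TAB>back)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
    return buffer.getvalue()


# Hypothetical sample data, not taken from the DLI material
sample = [("винтовка", "rifle"), ("патрон", "cartridge"), ("Винтовка", "rifle")]
print(to_anki_tsv(dedupe_pairs(sample)))
```

Saving the returned text as a UTF-8 .txt file gives something that can be imported as a two-field deck.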
I use FineReader, and OCRing the PDFs is quite fast. Most of the work is the formatting in DocBook and the checking ... but I'm used to it now and can do it fast. The only thing I lack is time :)
For the audio part, XML makes it possible to extract the phrases, and these phrases can be re-recorded with tools like shtooka-recorder or yazik-recorder. The audio parts are tagged with linguistic information and can be suitable for other uses. It's possible to record 400 words in 20 minutes with these tools, which is faster than dealing with the audio already done by DLI.
My wife will record the Russian parts (you can hear her voice on shtooka.net, wikidictionary and many other places). I'll record the French parts (translations of the English parts), but the recording of the English parts will still be missing.
Once these recordings are done, we could imagine some courses like Pimsleur, or use them with Anki, (learning) games, and so on. Scripts (in the IT sense) can be used to generate these.
Yesterday, I worked on "Better Russian" (DLI): the OCR is finished, and the first chapters are already formatted.
It will be finished by the end of the week.
P.S.: I have to create the missing pages in the 1st lesson of the DLI Basic Course. I'll be on holiday for 2 weeks starting next week, and will manage to make them then, so the course will be complete. Then I'll publish the first lessons.
Edited by Ericounet on 01 August 2011 at 11:36am
1 person has voted this message useful
| DavidW Hexaglot Senior Member United Kingdom Joined 6526 days ago 318 posts - 458 votes Speaks: English*, Spanish, French, Italian, Persian, Malay Studies: Russian, Arabic (Written), Portuguese, German, Urdu
| Message 20 of 24 01 August 2011 at 3:07pm | IP Logged |
Ah, you're a programmer, I thought so :-). Could I take a look at the XML files, to see
the structure? I'll make some time to do some thinking, maybe I can offer a couple of
ideas.
1 person has voted this message useful
| Ericounet Senior Member France yojik.eu Joined 5435 days ago 157 posts - 414 votes Studies: English, German, Russian
| Message 21 of 24 01 August 2011 at 4:26pm | IP Logged |
Hi,
I just put up the XML file Individual-weapons-training.xml; there were no images for this book.
If you want to view it clearly, use the free software from XMLmind: XML Editor.
It's written in Java, so it runs on every platform (Linux, Mac and Windows).
I use it with the help of "oXygen Author" (this one is not free).
Next week, I'll put all the sources of the documents on my site (tiff files, images, xml files, working files): everything that is needed to work with them.
P.S.: I'm not a programmer, but I wrote my first program (in assembly language) in 1976 ... and have never stopped since. I like languages ;)
1 person has voted this message useful
| DavidW Hexaglot Senior Member United Kingdom Joined 6526 days ago 318 posts - 458 votes Speaks: English*, Spanish, French, Italian, Persian, Malay Studies: Russian, Arabic (Written), Portuguese, German, Urdu
| Message 22 of 24 01 August 2011 at 7:43pm | IP Logged |
I took a look. I saw in the DocBook reference documentation that there are elements to
point to an external audio file. But can such a reference be tied to a specific piece of
text in the document? Are there any viewers that support this feature? The DAISY format
supports synchronised audio and text, as does the upcoming version of EPUB.
I thought you were storing the document in a special XML that includes special labels
like 'vocab,' 'grammar explanation,' 'text for reading,' 'drill' etc., so that software
could do stuff with the files automatically, like generate flashcards. I suppose this
would need an XML language to be extended with these special tags, which would take
some thought.
At the moment, the main benefits of DocBook are:
-copy and paste
-ability to be automatically reformatted for viewing on small-screen (less than 10")
devices
-improvement of appearance
To be honest, I wonder if it's worth it. The quality of the original PDFs isn't bad,
and they'll look a lot better when cleaned up. If I was doing it, I'd clean up the PDFs
and organise them on their own website. I'd try to get better audio recordings, or
otherwise clean up the existing ones. I'd then look into producing recordings for the
Russian texts in more advanced parts of the course, which are quite interesting. All
this I think would take much less time than OCRing the entire course, which can also
introduce errors if you're not careful. That's just my opinion.
Edit:
Here's what a cleaned up PDF looks like. It took about 15 mins work + 15 mins
processing time (once you know how...):
Original:
www.omilia.org/hosted_files/orig.pdf
Processed:
www.omilia.org/hosted_files/proc.pdf
Scan Tailor can correct skewed text, centre pages, despeckle etc.
Edited by DavidW on 01 August 2011 at 9:26pm
1 person has voted this message useful
| Ericounet Senior Member France yojik.eu Joined 5435 days ago 157 posts - 414 votes Studies: English, German, Russian
| Message 23 of 24 01 August 2011 at 9:49pm | IP Logged |
Hi,
DocBook is only an XML schema for formatting documents. Of course you can link to internal or external parts.
There is no need for a "viewer" for DocBook documents: the DocBook file is meant to be processed to produce either other XML files (e.g. the vocabulary file I gave you), or PDF, EPUB, doc or HTML, or whatever you want: just write the XSLT file to produce it (many already exist). One source, multiple targets.
For example, all the Russian parts are tagged with xml:lang="ru".
You don't have to mix in special labels: everything is already in DocBook. Elements have attributes, and you can invent the ones you want; after that, the processing works with these attributes. So you can tag a part as vocabulary, or drill, or ...
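The idea of processing by attribute can be sketched in a few lines. This is only an illustration under stated assumptions: the sample markup is invented (it is not taken from Individual-weapons-training.xml), it omits the DocBook namespace for brevity, and the role values "vocabulary" and "drill" are hypothetical labels of the kind described above.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace URI
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical DocBook-like fragment, written only for this example
SAMPLE = """<section>
  <para role="vocabulary"><phrase xml:lang="ru">винтовка</phrase> rifle</para>
  <para role="drill"><phrase xml:lang="ru">Это винтовка.</phrase> This is a rifle.</para>
  <para>No Russian here.</para>
</section>"""


def russian_text(root, role=None):
    """Collect the text of elements tagged xml:lang="ru".

    If role is given, only look inside elements carrying that role attribute.
    """
    if role is None:
        containers = [root]
    else:
        containers = [el for el in root.iter() if el.get("role") == role]
    found = []
    for container in containers:
        for el in container.iter():
            if el.get(XML_LANG) == "ru":
                found.append("".join(el.itertext()).strip())
    return found


root = ET.fromstring(SAMPLE)
print(russian_text(root))                     # every Russian phrase
print(russian_text(root, role="vocabulary"))  # only phrases inside vocabulary parts
```

A real DocBook 5 file declares a default namespace, so element names would need that namespace when matched by tag; matching on attributes alone, as here, sidesteps that.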
With images, as nice as they can be, you cannot do anything: no search, no sorting, no extracting, nothing; you can only keep the book as it is. It could be a way, but I chose another :)
I know it's more work, but I like doing it. Yes, it's possible to introduce some errors, but these are easily corrected: XML files are only text files. And the production chain is totally free (XML -> PDF or HTML or XML).
The other thing that is important to me is the possibility of translating the English text; with tools like OmegaT (free and multi-platform), it's an easy task (but a long one, as there are so many pages).
Your corrected pages are nice, but I prefer my way (I do it for free, in my free time).
In the coming weeks, I'll publish the vocabulary lists for FSI in many formats, so you'll see what is possible once the hard work (formatting in DocBook) is done. This vocabulary will also be recorded. With little stones, we can build castles.
I only hope it can be useful.
1 person has voted this message useful
| DavidW Hexaglot Senior Member United Kingdom Joined 6526 days ago 318 posts - 458 votes Speaks: English*, Spanish, French, Italian, Persian, Malay Studies: Russian, Arabic (Written), Portuguese, German, Urdu
| Message 24 of 24 01 August 2011 at 10:33pm | IP Logged |
I think if you have a vision about how the XML could allow the materials to be used in
new ways, it could be worthwhile. It would be necessary in this case to go through the
materials and think about the possibilities for each section. Doing a translation would
be a good reason, but there are thousands of pages: it would be many months' work,
working full time. If the only benefit is to provide the document in different formats,
I personally wouldn't bother. Sure, you can't search, sort or extract, but the materials
have already been carefully prepared and laid out for the student.
My intonation in English is a little odd, so I don't think I'd make a good reader. But
you should be able to recruit one here:
https://forum.librivox.org/viewtopic.php?t=21482
Best of luck.
Edited by DavidW on 01 August 2011 at 10:45pm
1 person has voted this message useful