
Anybody interested in OCR of FSI PDFs?

  Tags: FSI
 Language Learning Forum : Language Programs, Books & Tapes
17 messages over 3 pages: 1 2 3  Next >>
Rob Tickner
Senior Member
New Zealand
Joined 4490 days ago

126 posts - 158 votes 
Speaks: English*
Studies: GermanB1, French, Swedish

 
 Message 1 of 17
12 August 2012 at 12:45am
Hi,

(previous username tomsawyer, account got messed up with the recent hacking)

Having finished the German and French FSI courses in the past, and knowing their value,
I started the Swedish FSI course yesterday, only to find that the poor legibility of
the typewritten PDF hinders reading of the Swedish text (it's sometimes difficult to
tell apart å and ä and a, for example).

Having a search around, I found this fellow:
http://www.ielanguages.com/fsi/fsiproject.html who seems to be trying to convert
several of the FSI course PDFs into HTML. I emailed him, but haven't heard back yet.
While I appreciate his efforts, I can't help thinking that a .TXT file format would be
much better than HTML - e.g. if you wanted to make an eBook out of it.

I ran the FSI Swedish course through Tesseract (an OCR engine), and it will probably
require about 20 - 30 hours of reformatting, fixing spelling mistakes, etc. to get to a
useful state.
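
Part of the reformatting work mentioned above can usually be scripted before any manual proofreading. The following is only a minimal sketch, assuming the OCR output is plain text with hard line breaks; the function name `clean_ocr_text` is hypothetical and is not part of Tesseract. It rejoins words hyphenated across line breaks, collapses repeated spaces, and unwraps each paragraph onto one line.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR text (hypothetical helper, not a Tesseract API)."""
    # Rejoin words split across a line break, e.g. "sven-\nska" -> "svenska".
    # Caveat: this also joins genuine hyphenated compounds that happen to
    # break at the hyphen, so the result still needs a human check.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    # Unwrap each paragraph so it sits on a single line.
    unwrapped = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(p for p in unwrapped if p)


if __name__ == "__main__":
    sample = "Jag talar sven-\nska   varje\ndag.\n\n\nHej  då!"
    print(clean_ocr_text(sample))
```

A pass like this does not fix misread characters (å, ä and a still have to be checked by eye), but it removes much of the mechanical line-break cleanup.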

My question: have others encountered legibility issues on these courses, to the point
that a .TXT OCR'd file would indeed be useful?

If enough people are interested, I'll do the work required, and post the .TXT file in a
public place for download (on my web server). If other people are interested in OCR'ing
some of the other manuals, I would be happy to put them up on my server too.

Thanks,
Rob.

Edited by Rob Tickner on 12 August 2012 at 12:46am

2 persons have voted this message useful



Hampie
Diglot
Senior Member
Sweden
Joined 6661 days ago

625 posts - 1009 votes 
Speaks: Swedish*, English
Studies: Latin, German, Mandarin

 
 Message 2 of 17
12 August 2012 at 2:53am
An OCR'd PDF file is nice too: it's searchable and you can copy small chunks out of it. Making it into pure text will
take a long time and you'll have to reformat it... a lot...
1 person has voted this message useful



tarvos
Super Polyglot
Winner TAC 2012
Senior Member
China
likeapolyglot.wordpr
Joined 4709 days ago

5310 posts - 9399 votes 
Speaks: Dutch*, English, Swedish, French, Russian, German, Italian, Norwegian, Mandarin, Romanian, Afrikaans
Studies: Greek, Modern Hebrew, Spanish, Portuguese, Czech, Korean, Esperanto, Finnish

 
 Message 3 of 17
12 August 2012 at 7:53am
I don't find the Swedish FSI course that illegible. But if you can make it better, by all
means do.
1 person has voted this message useful



maydayayday
Pentaglot
Senior Member
United Kingdom
Joined 5221 days ago

564 posts - 839 votes 
Speaks: English*, German, Italian, SpanishB2, FrenchB2
Studies: Arabic (Egyptian), Russian, Swedish, Turkish, Polish, Persian, Vietnamese
Studies: Urdu

 
 Message 4 of 17
12 August 2012 at 9:47am
Thank you for volunteering. Go for it! I am sure there will be a lot of people interested. The sound quality of the FSI materials disappointed me. Do you plan to do anything with the sound?


1 person has voted this message useful



Rob Tickner
Senior Member
New Zealand
Joined 4490 days ago

126 posts - 158 votes 
Speaks: English*
Studies: GermanB1, French, Swedish

 
 Message 5 of 17
12 August 2012 at 10:34am
I don't mind the audio quality, as long as I scoop the bass out of it with my speakers,
else it sounds a little muffled. I know my way around Audacity, but don't really have the
skills to improve the audio at this time. The best I could probably do is coax a few
Swedish backpackers into re-recording it. Swedish backpackers, if you're out there, free
accommodation in outback Australia for a few chapters of FSI!
1 person has voted this message useful



maydayayday
Pentaglot
Senior Member
United Kingdom
Joined 5221 days ago

564 posts - 839 votes 
Speaks: English*, German, Italian, SpanishB2, FrenchB2
Studies: Arabic (Egyptian), Russian, Swedish, Turkish, Polish, Persian, Vietnamese
Studies: Urdu

 
 Message 6 of 17
12 August 2012 at 10:51am
I did some work on the FSI Spanish materials where I transcribed the text and cleaned up the sound, linking the two together. Was fun but work got busy so that fell off the to-do list.

1 person has voted this message useful



Majka
Triglot
Senior Member
Czech Republic
kofoholici.wordpress
Joined 4659 days ago

307 posts - 755 votes 
Speaks: Czech*, German, English
Studies: French
Studies: Russian

 
 Message 7 of 17
12 August 2012 at 12:39pm
I did think about converting the French course to epub for my reader (the new Pocketbook touch). But I decided against it - the pdf is not that bad and the reader can crop the white borders. And the French course is searchable (I hope I downloaded it as such).

One tip for you: the free PDF-XChange reader can do this work for you, and additional languages are available to download.

It works well, and if the source text has a clear layout, there is no need for Tesseract. One simply presses "ocr the text" and lets the program run.
The other piece of free software is STDU Viewer, which has some nifty features and allows you to export a text file. But with embedded text, one needs a lot of work with reformatting. I find it easier to copy and paste parts of the text from the PDF directly, opening both files next to each other.

Again, in case of FSI French I decided against it. But I did convert some textbooks (parts of them, without the exercises) to epub, mainly because I wanted to use text-to-speech.
1 person has voted this message useful



iguanamon
Pentaglot
Senior Member
Virgin Islands
Speaks: Ladino
Joined 5264 days ago

2241 posts - 6731 votes 
Speaks: English*, Spanish, Portuguese, Haitian Creole, Creole (French)

 
 Message 8 of 17
12 August 2012 at 1:11pm
FSI isn't the only public domain US government language learning resource out there. Many, if not most, of the DLI Courses are in desperate need of a rehabilitation project. I OCR'ed and de-skewed the Portuguese Basic Course with my Adobe 9 PDF software and improved it a lot, but the cleanup only goes so far. The old DLI courses were typewritten. They need to be re-transcribed. Perhaps if a project could be crowd-sourced with each person doing a few pages... whole volumes could be greatly improved.
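
One way the "few pages each" idea could be organised is to hand out consecutive page ranges in rotation. This is only an illustrative sketch: the function `assign_pages`, the volunteer names, and the page counts are all made up for the example and do not describe any existing project.

```python
def assign_pages(total_pages: int, volunteers: list[str],
                 pages_each: int) -> list[tuple[str, int, int]]:
    """Split pages 1..total_pages into chunks and assign them round-robin.

    Returns (volunteer, first_page, last_page) tuples. Hypothetical helper.
    """
    if pages_each < 1 or not volunteers:
        raise ValueError("need at least one volunteer and one page per chunk")
    assignments = []
    start = 1
    turn = 0
    while start <= total_pages:
        # The last chunk may be shorter than pages_each.
        end = min(start + pages_each - 1, total_pages)
        assignments.append((volunteers[turn % len(volunteers)], start, end))
        start = end + 1
        turn += 1
    return assignments


if __name__ == "__main__":
    # Made-up example: a 10-page unit shared between two volunteers.
    for name, first, last in assign_pages(10, ["anna", "ben"], 4):
        print(f"{name}: pages {first}-{last}")
```

Small fixed-size chunks keep each contribution manageable, and the rotation means nobody ends up holding a long contiguous stretch of a volume.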

Edited by iguanamon on 12 August 2012 at 2:43pm



2 persons have voted this message useful





Copyright 2024 FX Micheloud - All rights reserved
No part of this website may be copied by any means without my written authorization.