pentatonic Senior Member United States Joined 7247 days ago 221 posts - 245 votes
| Message 1 of 237 08 March 2005 at 8:03am | IP Logged |
The FSI recorded materials are on tape, so they would need to be digitized, the hiss removed or reduced, the dynamics compressed to even levels, and the result encoded to mono MP3. It would also be good to normalize the sound (decibel) levels to a standard, at least across each package. The printed materials would need to be scanned and converted to some format. PDF would be good, but HTML, or possibly both, would be another option. I personally would like the text processed with OCR software for import into flashcard/learning software.
I have the hardware and software to do all of this and would be willing to do it for the German and Spanish FSI courses (not the Barron's versions, as they have been retypeset, which introduced lots of errors). However, I would have to be provided with both, since I have neither.
Anyone who volunteers must realize that this represents hours and hours of work and cannot be done overnight.
Some sort of distribution system for the finished product would be necessary. BitTorrent is an option, but this site would be a good place.
administrator Hexaglot Forum Admin Switzerland FXcuisine.com Joined 7376 days ago 3094 posts - 2987 votes 12 sounds Speaks: French*, EnglishC2, German, Italian, Spanish, Russian Personal Language Map
| Message 2 of 237 08 March 2005 at 10:34am | IP Logged |
Pentatonic: you seem to know a great deal about sound. Is this your profession or a hobby? I agree that we would need to OCR the texts, both for use in other software and for faster downloads.
For file size, how many megabytes per hour of sound if we use MP3? I'd be willing to supply the space and hardware to distribute it, but I need to calculate the bandwidth, since this can escalate fast.
How long would you need per tape to transfer from tape to digital?
Would we do one file per lesson?
PDF is nice, but I think it's especially useful if you make a nice layout. If we just OCR the text and make it clear but no-frills, HTML would probably be faster and more portable.
pentatonic Senior Member United States Joined 7247 days ago 221 posts - 245 votes
| Message 3 of 237 08 March 2005 at 12:03pm | IP Logged |
administrator wrote:
Pentatonic: you seem to know a great deal about sound. Is this your profession or a hobby?

I'm an amateur musician and do some home recording so I've spent a lot of time tweaking digital audio and have acquired some good software.
administrator wrote:
For file size, how many megabytes per hour of sound if we use MP3? I'd be willing to supply the space and hardware to distribute it, but I need to calculate the bandwidth, since this can escalate fast.

Voice audio can be mono, and at a variable bitrate of 48-56 kbps (good quality) it would come to just under 400 KB per minute. I've heard acceptable-quality MP3s at lower bitrates that were just under 250 KB per minute, but I'd have to round up a codec for that, as mine all sound bad at that bitrate.
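As a rough check on these figures, file size follows directly from bitrate: bytes = bitrate / 8 × duration. The sketch below assumes 1 kbps = 1000 bits per second and ignores MP3 frame headers and tags. The 32 kbps row is an assumption added for comparison with the lower-quality figure, and `mp3_size_bytes` is a hypothetical helper name.

```python
def mp3_size_bytes(bitrate_kbps: float, seconds: float) -> float:
    """Approximate size of an audio stream at a given average bitrate."""
    # 1 kbps is taken as 1000 bits per second; 8 bits per byte.
    return bitrate_kbps * 1000 / 8 * seconds


for kbps in (32, 48, 56):  # 32 kbps is an assumed "lower bitrate" example
    per_min_kb = mp3_size_bytes(kbps, 60) / 1000
    per_hour_mb = mp3_size_bytes(kbps, 3600) / 1_000_000
    print(f"{kbps} kbps: ~{per_min_kb:.0f} kB/min, ~{per_hour_mb:.1f} MB/hour")
```

Under these assumptions an hour of mono speech comes to roughly 14 to 25 MB, which is the number needed for a bandwidth estimate.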
administrator wrote:
How long would you need per tape to transfer from tape to digital?

It's hard to say but obviously it would take an hour just to record an hour-long lesson. Then I'd have to process it and listen to it to make sure it's OK and find the points at which it should be split. I'd need to split the lesson up into several files and then encode it to MP3. It would probably be a little faster to encode the whole file and then split it up. Anyway, the review part will be time consuming. Most of the rest could be done while I do other things. Maybe I should do a lesson or two and see.
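The splitting step could be scripted once the cut points are known. Below is a minimal sketch, assuming the lesson has already been recorded to an uncompressed WAV file and the cut points (in seconds) were found by ear. The function name `split_wav` and any file names are hypothetical. It uses only Python's standard `wave` module and leaves MP3 encoding to a separate tool.

```python
import wave
from pathlib import Path


def split_wav(src: str, cuts_s: list[float], out_dir: str) -> list[Path]:
    """Split a WAV file at the given cut points (seconds) into numbered files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    with wave.open(src, "rb") as wf:
        params = wf.getparams()
        rate, total = wf.getframerate(), wf.getnframes()
        # Convert cut points to frame offsets; add the start and end of the file.
        bounds = [0] + [min(int(c * rate), total) for c in sorted(cuts_s)] + [total]
        for i, (start, end) in enumerate(zip(bounds, bounds[1:]), 1):
            wf.setpos(start)
            frames = wf.readframes(end - start)
            path = out / f"{i:02d}.wav"
            with wave.open(str(path), "wb") as seg:
                seg.setparams(params)  # frame count in the header is corrected on close
                seg.writeframes(frames)
            paths.append(path)
    return paths
```

A call such as `split_wav("lesson01.wav", [312.5, 847.0], "lesson01")` would write `01.wav`, `02.wav` and `03.wav`, ready to be encoded one by one.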
Of course the OCR part would be time consuming as well.
administrator wrote:
Would we do one file per lesson?

No, I think it would be better to split them up by drill, etc. That way it's easy to go to a drill you need specific practice on.
administrator wrote:
PDF is nice, but I think it's especially useful if you make a nice layout. If we just OCR the text and make it clear but no-frills, HTML would probably be faster and more portable.

I was thinking PDFs from OCRed material. I agree that PDFs from images are unnecessarily big. PDFs are easy to download and print, but HTML is good.
mahyar Newbie Canada Joined 7201 days ago 34 posts - 31 votes Speaks: English* Studies: French
| Message 4 of 237 08 March 2005 at 2:38pm | IP Logged |
To record from tape, you need a portable tape player and a cable with a male stereo plug on each end (pretty cheap). You plug the tape player into the microphone jack on your computer, start a sound recording from the microphone input, and press play on the tape player. When the tape finishes, you have your digitized recording.
If you're willing to unbind and rebind your books (or even better, if they came in a ring binder!), you can put the FSI sheets into a sheet-fed scanner. It would then take about 10 minutes of just waiting and maybe adding another stack of pages once in a while.
OCR is another passive process once you've scanned the sheets in. My 6-year-old computer can do a 400-page book overnight.
Book digitizing teams usually follow this process:
*Someone makes a request.
*The scanner person gets the book and scans it in.
*The scanner person then posts the scans somewhere for the general public to proofread and digitize.
*One or more proofreaders work with both the original scans and the OCRed text. If they encounter a non-obvious typo, they can check the scan to see what it should be. With multiple proofreaders, the job is divided up, or the OCRed text is put on a wiki or some other groupware or revision control system.
*After a good amount of proofreading has been done, the book is released in multiple formats, such as CHM, PDF, HTML, and plain text, to allow for maximum flexibility. (PDAs, for example, work best with plain text and HTML, PDF is great for printing, HTML is great for website publishing, and CHM is great when you're sitting at your computer.)
I've seen the entire Pimsleur Japanese course (all 90 of the 30-minute units, including their "readings" recordings) encoded clearly in 450 MB in the Ogg Vorbis format. If people want to convert it to MP3 or AAC for their portable players, we can also make a guide on how to do so.
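For scale, the average bitrate implied by that figure can be worked out from the numbers given. This is a back-of-the-envelope sketch, assuming 1 MB = 1,000,000 bytes and that the 450 MB covers exactly 90 units of 30 minutes; `average_bitrate_kbps` is a hypothetical helper name.

```python
def average_bitrate_kbps(size_mb: float, units: int, minutes_per_unit: float) -> float:
    """Average bitrate implied by a total size and total running time."""
    total_seconds = units * minutes_per_unit * 60
    total_bits = size_mb * 1_000_000 * 8  # assumes decimal megabytes
    return total_bits / total_seconds / 1000


print(round(average_bitrate_kbps(450, 90, 30), 1))  # 22.2
```

That works out to roughly 22 kbps, well below the 48-56 kbps range discussed earlier for MP3.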
You can also use BitTorrent to solve the bandwidth problem. So if the files do become really popular, it wouldn't be a problem.
We can also look to Project Gutenberg for help with copyright and scanning. Their FAQ doesn't cover the specific government case we are talking about with the FSI, so sending them an email at help_AT_pglaf.org (replace the _AT_ with an @) could help clear up the USA copyright issue.
Edited by mahyar on 08 March 2005 at 2:49pm
pentatonic Senior Member United States Joined 7247 days ago 221 posts - 245 votes
| Message 5 of 237 08 March 2005 at 3:33pm | IP Logged |
mahyar wrote:
To record from tape, you need a portable tape player and a cable with a male stereo plug on each end (pretty cheap). You plug the tape player into the microphone jack on your computer, start a sound recording from the microphone input, and press play on the tape player. When the tape finishes, you have your digitized recording.

It depends on what level of quality you are willing to accept. These tapes were recorded in the 1960s. They are uneven and have lots of hiss. You could certainly speed things up by dumping an unedited, 30-minute-long MP3 out there and letting users deal with the details, but I think it would be better to spare them that. These are not tapes you listen to once and put away.
mahyar wrote:
If you're willing to unbind and rebind your books (or even better, if they came in a ring binder!), you can put the FSI sheets into a sheet-fed scanner. It would then take about 10 minutes of just waiting and maybe adding another stack of pages once in a while.

That's a good idea and would make scanning easier. Unfortunately, I don't have a sheet feeder for my scanner.
mahyar wrote:
OCR is another passive process once you've scanned the sheets in. My 6-year-old computer can do a 400-page book overnight.

Sorry, but I think this is not a realistic view of the current state of OCR software. It has come a long way but there are still lots of errors and sometimes you can just type things by hand and be as productive. You still have to spell check and correct errors. We're talking about language courses so the text needs to be as error-free as possible. That's my main complaint with the Barron's series. They were retypeset and that introduced a lot of errors. How is someone who doesn't know the language supposed to catch such errors?
I like your suggestions/comments on book digitizing teams. That would be a good thing if we could team up on conversions.
As for audio formats, I think MP3 is the way to go. Ogg Vorbis and AAC are better formats, but the truth is that most mainstream users don't even know what they are, even though a lot of them are unknowingly using Apple's version of AAC when they download from iTunes. MP3s are playable on practically all portable players and computer media players, and converting from one compressed format to another results in quality degradation.
Edited by pentatonic on 08 March 2005 at 7:01pm
administrator Hexaglot Forum Admin Switzerland FXcuisine.com Joined 7376 days ago 3094 posts - 2987 votes 12 sounds Speaks: French*, EnglishC2, German, Italian, Spanish, Russian Personal Language Map
| Message 6 of 237 08 March 2005 at 4:57pm | IP Logged |
You do seem to know the ropes!
Format - I think MP3 is the way to go, most common, everybody can read it now.
Distribution - I checked out BitTorrent, it seems like a smart, non-profit, collaborative way of distributing files while minimizing server bandwidth drain.
OCR - I've done a 19th-century book (on this site) and can confirm it is far from automatic. I like the idea of the feeder, but most of the time is spent comparing each letter the OCR software is unsure of against the scan itself. It's not an impossible task, but you need to know both English and the target language.
Audio - I'd say Pentatonic knows his stuff, and a better source recording will make a nicer shared resource. After all, if we do it once, for all language enthusiasts to share, let's try to do it well if possible. The idea of breaking down each lesson is good. We could then have a directory structure such as FSI_German/I/lesson01/04drill.torrent etc., so that somebody who wishes to download the entire lesson 01 could do so, then play each track in the right order using an MP3 player.
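A naming scheme like this is easy to generate consistently. The sketch below builds paths in the FSI_German/I/lesson01/04drill.torrent pattern. The helper name `track_path`, its parameters, and the default `mp3` extension are assumptions for illustration. Zero-padding the lesson and track numbers is what makes plain alphabetical sorting match playing order.

```python
from pathlib import PurePosixPath


def track_path(course: str, volume: str, lesson: int, track: int,
               label: str, ext: str = "mp3") -> PurePosixPath:
    """Build a path such as FSI_German/I/lesson01/04drill.torrent."""
    # Two-digit padding keeps lesson10 after lesson09 when sorted by name.
    return PurePosixPath(course) / volume / f"lesson{lesson:02d}" / f"{track:02d}{label}.{ext}"


print(track_path("FSI_German", "I", 1, 4, "drill", "torrent"))
# FSI_German/I/lesson01/04drill.torrent
```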
heartburn Senior Member United States Joined 7207 days ago 355 posts - 350 votes Speaks: English* Studies: Spanish
| Message 7 of 237 08 March 2005 at 5:52pm | IP Logged |
I'm no expert, but I've recorded, edited and encoded lots of lessons and other audio material. I've also OCRed tons of Spanish articles and stories. I'm very comfortable with this stuff and I'd be willing to help.
ElComadreja Senior Member Philippines bibletranslatio Joined 7238 days ago 683 posts - 757 votes 2 sounds Speaks: English* Studies: Spanish, Portuguese, Latin, Ancient Greek, Biblical Hebrew, Cebuano, French, Tagalog
| Message 8 of 237 08 March 2005 at 9:46pm | IP Logged |
ooh, ooooh, get someone to read those articles already on the computer! In a mumbly, non distinct sort of way.