JTAS CD-ROM Issues, Take #1


Ok, I've done some searching through numerous threads, and while I see mentions of bad pages and scans on the JTAS CD, I canna find any specifics! This is the next Marc project, so let's hear about your problems...

The list of JTAS CD-ROM issues (updated 8/16/10):

JTAS Best of Volume 1.pdf: no identified issues.
JTAS Best of Volume 2.pdf: 2 page spread of pages 24/25 prevents booklet printing.
JTAS Best of Volume 3.pdf: no identified issues.
JTAS Best of Volume 4.pdf: no identified issues.
JTAS01.pdf: highlighter markings on pages 3, 6 and 15; double page 16s (first page bad, should be removed).
JTAS02.pdf: missing page 26; page 3 needs rescan.
JTAS03.pdf: no identified issues.
JTAS04.pdf: page 11 needs rescan.
JTAS05.pdf: no identified issues.
JTAS06.pdf: no identified issues.
JTAS07.pdf: inside cover is cropped, but missing elements are ads.
JTAS08.pdf: no identified issues.
JTAS09.pdf: no identified issues.
JTAS10.pdf: needs complete rescan.
JTAS11.pdf: needs complete rescan.
JTAS12.pdf: needs complete rescan.
JTAS13.pdf: page curling throughout; recommend complete rescan from clean original.
JTAS14.pdf: no identified issues.
JTAS15.pdf: significant page curling throughout; recommend complete rescan from clean original.
JTAS16.pdf: no identified issues.
JTAS17.pdf: no identified issues.
JTAS18.pdf: no identified issues.
JTAS19.pdf: no identified issues.
JTAS20.pdf: no identified issues.
JTAS21.pdf: cover of missile supplement has scaling issue.
JTAS22.pdf: no identified issues.
JTAS23.pdf: even number pages need to be rescanned.
JTAS24.pdf: missing pages 19 and 41.
Challenge25.pdf: blank page appended to end.
Challenge26.pdf: no identified issues.
Challenge27.pdf: no identified issues.
Challenge28.pdf: Amber Zone ("Behind the Scenes") and Contact ("The Sabmiqys") need to be rescanned.
Challenge29.pdf: Challenge pages 18-21 need to be rescanned (the "A Decade of Traveller" article). Pages 46 and 47 may also need to be redone.
Challenge30.pdf: Tables for Police ("There When You Need Them") aren't OCRed. Otherwise OK.
Challenge31.pdf: page 24 is blank. All else appears OK.
Challenge32.pdf: no identified issues.
Challenge33.pdf: no identified issues.
Challenge34.pdf: no identified issues.
Challenge35.pdf: no identified issues.
Challenge36.pdf: artwork in "The Green Hills of Earth" should be rescanned; blank page appended.

Many issues have pages sized differently throughout, and the formatting of pages is often inconsistent. Two-page spreads in several PDFs prevent them from being printed in booklet format.

I think you're a little bit crazy on this one. Not you personally, but anyone who was going to attempt this.

If we cataloged it all, it would be a monumental task. There are literally thousands of places it can go wrong.

I think it's safe to say: whoever did the last batch, don't use them, because they didn't do such a great job -- or if you use them, make sure they learned from their first batch of mistakes (leaving out pages, letting entire badly blurred books go through, that sort of thing). Make them sign off that the product has been checked visually, page by page.

Don't expect because one book goes easy they'll all be that way.

And use a newer version of whatever software you used last time, because hopefully it'll do a better job.

Basically the entire process is a complex, multi-faceted project that needs to be checked and verified.

Then have a 3rd party inspect it before it goes out.

Or, if that's too much trouble, just do what you did last time.

Here's an example:

Yesterday I scanned the T4 Central Supply Catalog (most of the relevant pages). For 150 DPI and black & white, my scanner flies and I can probably do a bit better than 1 page per minute, with all the mouseclicks and centering and filename changes.

So I copied those to my laptop, about 90 pages, and used a script I have to open an existing MDI file and insert the pages one after the other, in the proper order (takes about 1 minute sending keystrokes).

That gives me an MDI file: 90 pages, about 6 megs. I then run OCR on that file, which takes about 30 seconds to 1 minute.

Then I noticed some of my scans were a little ragged. The scanner got empty space (big black borders) on some since the book doesn't cover the entire scan deck.

So I dropped them into Paintshop Pro in batches of 10, cropped each page (a few keystrokes) and re-saved them.

I created a 2nd MDI file and inserted the new pages and they look much nicer. However I got an error on the OCR process. It finished but said it had problems.

I did some searches for text and found that it was missing the word "Densitometer" in big heading print but found it in the paragraph just below that. The original found them all. The TIFF files were the same, just cropped. There are undoubtedly others as well.

So the cropping caused some changes.

I'm sure there's some guy with the latest & greatest equipment/software/know how who's probably shaking his head, but that's the breaks.

Okay here's the results of the OCR from the two MDI files.

The first original found the following heading. It's supposed to be -13 for TL13


The second one, using the same original TIFF but cropped must have done something, or I just need to remake a third MDI file.

This one spelled/OCR'd Densitometer all wrong but got the 13 correct.


So that's the kind of process you're dealing with. If I get a chance later I'll try making a 3rd MDI and see if it works better or I get the same results.

Probably the only way to see how well your OCR turned out is to dump the entire file to text and run a spell-check on it, and hope you've got one of those spell-checkers that can "learn" words and ignore others.
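Here's a minimal sketch of that dump-and-spell-check idea in Python. It assumes the OCR text has already been exported to a plain string; the function name, the tiny word lists, and the sample misreading are all made up for illustration (the misreading is not the actual OCR output from my files). The "learned" list plays the part of a spell-checker that has been taught game terms.

```python
import re
from collections import Counter

def unknown_words(ocr_text, known_words, learned_words=()):
    """Return a Counter of words found in neither word list (case-insensitive)."""
    vocabulary = {w.lower() for w in known_words} | {w.lower() for w in learned_words}
    # Pull out alphabetic tokens, allowing one internal apostrophe (e.g. "doesn't").
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", ocr_text)
    return Counter(t.lower() for t in tokens if t.lower() not in vocabulary)

# Hypothetical sample: a heading that was misread, followed by clean body text.
sample = "Dcnsitorneter\nThe densitometer maps the density of an object."
known = ["the", "maps", "density", "of", "an", "object"]  # stand-in dictionary
learned = ["densitometer"]  # game term the checker has been "taught"

suspects = unknown_words(sample, known, learned)
for word, count in suspects.most_common():
    print(word, count)
```

With a real dictionary file in place of the stand-in list, the suspects that come back are the words worth eyeballing against the page image.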

However, doing that for each book is going to take quite a bit of time. The other thing is that now that we've found the above errors -- how do you correct them behind the scenes in Nitro or Adobe? My old Acrobat 5 Pro let you type in corrections but the fonts weren't the same (I'm sure that can be re-done too).

I'm sure there's someone out there that can give us some pointers, someone who does this type of work a lot.

Okay, I was able to grab those cropped files and build another MDI (kept eye on clock this time)

Build MDI of 90 pages = 5 minutes
OCR 90 pages = 3 minutes
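
Putting the numbers so far together gives a rough per-book estimate. This is only back-of-the-envelope arithmetic: it assumes scanning at about 1 page per minute (the 150 DPI figure from earlier), assumes the MDI build and OCR times scale linearly with page count, and leaves out cropping and proofreading entirely. The function name and parameters are just for illustration.

```python
def minutes_per_book(pages, scan_pages_per_min=1.0,
                     build_min_per_90=5.0, ocr_min_per_90=3.0):
    """Estimate total minutes for one book from the timings quoted above."""
    scan = pages / scan_pages_per_min        # hand-feeding the scanner
    build = build_min_per_90 * pages / 90    # assumed linear in page count
    ocr = ocr_min_per_90 * pages / 90        # assumed linear in page count
    return scan + build + ocr

total = minutes_per_book(90)
print(f"{total:.0f} minutes for a 90-page book")
```

The takeaway is that scanning dominates: the build and OCR steps together are under a tenth of the total.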

Still had errors, so I'm guessing the cropping process OR Paint Shop Pro is somehow skewing the scans. I still have "uncropped" copies of the files on the other computer (always make backups).

I'm going to try either the FASA Fate of the Skyraiders or a High Passage later today.

I'm going to see how much time doing 200 DPI adds to the process (scans take longer I believe) and how that works with OCR.

I'll probably re-do a few of the Central Supply Catalog's pages as well (just the densitometer section) and see if that solves it and how much time it adds to the equation. Then I'll make copies and see if Paint Shop Pro affects the files with cropping and saving too.

Okay, took the originals for Central Supply Catalog (just the one page) and re-did it...

Using as-is worked fine. Found "Densitometer" in the heading.
Using Paint Shop Pro to crop -- same issue
Using GIMP to crop -- no problem

So the issue is Paint Shop Pro.
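
That as-is versus cropped comparison can be turned into a repeatable check: count hits for the same search terms in the text from each OCR pass and flag any term that loses hits. A plain "is the word in there somewhere" test would have missed this problem, since the paragraph copy of the word was still found. This is a sketch only; the function names are mine and the two strings are placeholders standing in for text dumped from the two passes.

```python
import re

def term_counts(ocr_text, terms):
    """Count case-insensitive whole-word hits for each term."""
    return {
        t: len(re.findall(rf"\b{re.escape(t)}\b", ocr_text, flags=re.IGNORECASE))
        for t in terms
    }

def lost_hits(baseline_text, variant_text, terms):
    """Return {term: (baseline_count, variant_count)} where the variant found fewer."""
    base = term_counts(baseline_text, terms)
    var = term_counts(variant_text, terms)
    return {t: (base[t], var[t]) for t in terms if var[t] < base[t]}

# Placeholder strings, not real OCR output.
uncropped = "Densitometer\nThe densitometer maps density."
cropped = "Dcnsitorneter\nThe densitometer maps density."

report = lost_hits(uncropped, cropped, ["densitometer"])
print(report)
```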

Just finished scanning Fate of the Sky Raiders.

Initial problems: my scanner (HP PSC 500) doesn't scan the entire deck in BW when it senses the white page ends. So when I scan a single page by placing it in the middle (and let the spine overlap) it bizarrely skews everything the wrong way, despite being well within the boundaries.

So I ended up doing a double-page "centerfold" type scan. Using 200 DPI slowed things down a bit, but still fairly fast. Ended up with 30 pages (60 in the book).

Luckily the Microsoft MDI program can also rotate pages, so I placed them into an MDI file, rotated them and then OCR'd them.

Another bizarre side effect was that I did the Front/Back covers as one. No matter how I'd put it in, it would rotate this page after OCRing the back cover's blurb. After three attempts I put it in upside down and re-OCR'd it and that got it close. It still rotated it and messed up the cover, placing white streaks during its auto-rotate process.

The nice thing is on the regular print pages, it will auto-straighten during OCR.

Overall: It's an okay file. The cover is worn and didn't scan real nice, but that's all right.

I'm just getting into GIMP so that may take some time as well. I have a scripting language, but it doesn't work with TIFFs, so I was hoping to split the pages into singles (may not be possible).

Yeah, because no one has sat down and waded through it in the detail you're looking for.

I've only run into issues when I'm looking for something specific and found garbage.

It's much more complex than you realize.

I am interested

But time is always a factor during the spring/summer time (farmer).

I'm missing only a few of the JTAS originals and would like to get the rest either in hard copy or on CD-ROM, but money is a factor also. If I was running/playing RPGs or even had someone locally to BS with over games, I would probably go ahead and buy.

But with the economy and prices, I am hesitant to do so without having a good reason besides just wanting them.

Here's what I've turned up. When I use page numbers, I'm using the internal page numbers (those on the pages) not the logical page numbers from the PDF reader.

Challenge 25 has a blank page appended to the end. No big deal.

JTAS 1 has highlighter markings on pages 3, 6, and 15. They don't look like they're part of the original printing, but they may be. I don't have an original to check against. Also, page 16 is in there twice. The first scan of page 16 looks like there's something lying on top of it, so the second scan was probably intended to replace it.

JTAS 2 has a fuzzy page 3. Page 26 appears to be missing.
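
Missing and doubled pages like these can be caught mechanically once someone has jotted down the internal page number of each scanned page in order. A rough Python sketch, assuming that list exists; the function name and the sample list are hypothetical, not taken from any of the actual PDFs.

```python
from collections import Counter

def audit_pages(page_numbers, first, last):
    """Return (missing, duplicated) internal page numbers for the range first..last."""
    counts = Counter(page_numbers)
    missing = [p for p in range(first, last + 1) if counts[p] == 0]
    duplicated = sorted(p for p, n in counts.items() if n > 1)
    return missing, duplicated

# Hypothetical scan order: page 4 appears twice, page 6 was skipped.
pages = [1, 2, 3, 4, 4, 5, 7, 8]
missing, duplicated = audit_pages(pages, 1, 8)
print("missing:", missing, "duplicated:", duplicated)
```

This won't catch fuzzy or curled pages, which still need a visual check.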

JTAS 4 has a comic on page 11 that's difficult to read in the scan. Also, should pages 10 and 11 be formatted 2-up?

That's what I've found from a look at JTAS1-4 and Challenge 25-27 so far. I haven't done any extensive checking of the OCR, but I haven't turned up any problems with what I've searched on in these, either.

I'll post anything else I turn up as I get the opportunity.

Here's what I've found in the JTAS CD for issues 5-14:

JTAS 10 is scanned at a low resolution. It's hard to read throughout. I'd definitely say it needs a rescan.

JTAS 11 has some page curling on even numbered pages, which puts the right side of the page out of focus and distorts images/text on many pages. The halftones also appear very faded throughout issue 11 (are these colors poorly converted to grayscale, perhaps?)

JTAS 13 has the page curling problem on even numbered pages, too, though some of the pages look OK. Page 10 and some others are pretty bad, though. Page 32 is cropped at the top. Again, not a big deal--no text is lost.

JTAS 7 has the inside cover cropped top and bottom. No biggie, since it's ads.

Much of the text in JTAS 12 appears faded. It looks like it may have been a faded original, but it could be the scan settings, too.

Several issues have pages sized differently throughout (individual pages, I'm not concerned about pages done 2-up here) for no apparent reason. Issue 12 is most notable here.

Interesting. I was looking on the ImageMagick site and it looks like you can take a single image, like a 2-page side by side LBB scan, and break it up into left and right pages via their product.

So Don, that might save you time when scanning LBBs, if that's what you're up to.
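
For anyone who wants to see what the split amounts to, here is a small Python sketch that only computes the two crop rectangles for a side-by-side scan. It does not call ImageMagick itself; the boxes are in (left, upper, right, lower) pixel order, which is the form Pillow's Image.crop() takes, if you go that route. The function name, the optional gutter parameter, and the example dimensions are my own assumptions.

```python
def split_spread_boxes(width, height, gutter=0):
    """Return (left_box, right_box) crop rectangles for a two-page spread.

    Boxes are (left, upper, right, lower) in pixels. `gutter` is an optional
    strip around the spine to discard, split evenly between the two pages.
    """
    mid = width // 2
    half_gutter = gutter // 2
    left_box = (0, 0, mid - half_gutter, height)
    right_box = (mid + half_gutter, 0, width, height)
    return left_box, right_box

# Example dimensions are an assumption, not measured from a real scan.
left, right = split_spread_boxes(2200, 1700)
print(left, right)
```

This assumes the spine sits dead center in the scan; an off-center book would need the midpoint adjusted by hand.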

I've gone through JTAS 15-18 now, and they look fine except for some of the slight page curling on even numbered pages (resulting in some loss of focus and light print in places.) Issue 15 near the beginning and end of the issue was the worst.

The issue is pretty minor overall, however. They're all plenty readable.


I think in addressing the scanning problems you missed Don's point entirely! His initial post said:

OK, I've done some searching through numerous threads, and while I see mentions of bad pages and scans on the JTAS CD, I canna find any specifics! This is the next Marc project, so let's hear about your problems...

To me, this sounds less like an OCR issue, and more of a readability and organization issue. In other words, missing pages, bad pages, blurry unreadable pages, and the like. In other other words, problems easily corrected rather than the more intractable ones.

THAT SAID, your tests pitting scan quality and cropping software against OCR suites are valuable information that could yet be put to use. However, I just suspect that's not the top priority.

Just to add to the Geek Quotient: I thought 150dpi is a lousy scan quality for OCR. Back in the day, we needed 300dpi. But that was 10+ years ago.

Quick note: Thanks to everyone for your notes so far on details for the JTAS CD.