OCR not working, help me!!

Discussion in 'Software' started by Smiley, Feb 21, 2010.

Thread Status:
Not open for further replies.
  1. Smiley

    Smiley Pen Pal - Newbie

    Hello!

    I'm using Readiris Pro 10 Corporate for the first time to do this, along with PDF Revu and a Plustek OpticBook 4600.

    I appreciate any help and feel free to forward this to others.

    I am attempting to OCR a chapter of a textbook I have. At the moment, I am trying to OCR a pdf I created from a single page (as a test).

    I have no problem creating PDFs but am having issues with OCR recognition. I am given options to export the image as PDF text, PDF image, PDF text-image (text on top of image), or PDF image-text (image over text).

    I tried exporting as text-image, but the text in the resulting PDF came out with weird letters and characters in place of the text from the page. I also tried exporting as image-text. It came out well with no weird letters, but I was only able to find a few words after doing several searches. The software shows that text, charts, pictures, and diagrams have been recognized, but when I search for words that are within them, I come up with nothing.

    I know this is a lot. Let me know if this is unclear. Maybe some screenshots would help, or I could upload the PDFs. Let me know.

    Smiley :)

    Edit:
    I open Readiris Pro 10 Corporate Edition. Since this is my first time attempting OCR, I have left the settings at their defaults.

    The program begins with a wizard that walks me through the process. This includes:
    1. Choosing between text pages, business cards, etc.
    2. Selecting an image from my scanner or hard drive.
    3. Choosing which languages I want for the document.
    4. Choosing the format I want the text in. (There are plenty to choose from, but I want PDF.)

    I've been playing around with this for a while. I did get better recognition by importing a .tiff file instead of a pdf for OCR. I am still trying some things though.

    If you have done this before, how do you go about doing this? What file formats have you found to work best? As I get better at this, eventually I would want to improve my pdf files with better compression.
     
  2. ScubaX

    ScubaX Level 90 Mage Senior Member

    I use a Fujitsu ScanSnap S510 and scan directly to a PDF in Acrobat Pro, then use Acrobat to OCR it. I have not tested whether it catches everything my heart desires, but I have not seen any problems either. The one problem I have noticed and corrected was a scan that was not high enough quality. I also often print them to OneNote, and its search engine also finds them, though the links from the list of found words do not work well. This seems to be a OneNote bug, as it happens on all my machines.

    I would suggest trying Acrobat Pro if you have a copy. If not, increase the scan quality in the system you're using now.

    If you want, link to a PDF page and I will see if I can OCR it.
     
  3. Frank

    Frank Scribbler - Standard Member Senior Member

    I do it similarly to ScubaX, except that I scan the pages to image files so I can process them with Photoshop, then combine them into a PDF with Acrobat and use its OCR feature. It works perfectly, in my opinion.

    However, just as you wrote, it seems the same should be possible with your program. I haven't used it. I have sometimes used OmniPage in the past, but not Readiris.


    As ScubaX said, check the resolution you use. It should be at least 300 DPI; more isn't really necessary, and less is useless.
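    As a rough way to check an existing scan against that 300 DPI figure, here is a minimal Python sketch. It only does the arithmetic (pixels divided by inches); the function names and the 8.5-inch page width are assumptions for illustration, not anything taken from Readiris or Acrobat.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Return the scan resolution in dots per inch along one dimension."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def meets_ocr_minimum(width_px: int, page_width_in: float,
                      minimum_dpi: float = 300.0) -> bool:
    """Check a scan's width in pixels against a minimum DPI (300 by default)."""
    return effective_dpi(width_px, page_width_in) >= minimum_dpi


# Assumed example: a page 8.5 inches wide.
print(effective_dpi(2550, 8.5))      # 300.0
print(meets_ocr_minimum(1275, 8.5))  # False, since 1275 / 8.5 is only 150 DPI
```

    The pixel width can be read from the image properties in most viewers; the physical page width has to be measured or known.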

    I think the image-text option keeps the image and places an invisible text layer above it or, as you said, behind it. Later, in Acrobat, use the text tool to select and copy some text; then you should see how well it was recognized.
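    To put a number on "how well it recognized the text", one option is to paste the copied text into a small script and count how many words you know are on the page can actually be found. This is only a sketch under assumptions: `search_hit_rate` is a hypothetical helper, the sample string is made up, and it only handles single words made of letters and digits.

```python
import re


def search_hit_rate(extracted_text: str, expected_words: list[str]) -> float:
    """Fraction of expected words found as whole words, ignoring case."""
    if not expected_words:
        raise ValueError("expected_words must not be empty")
    # Split the OCR output into lowercase alphanumeric tokens.
    found_words = set(re.findall(r"[a-z0-9]+", extracted_text.lower()))
    hits = sum(1 for word in expected_words if word.lower() in found_words)
    return hits / len(expected_words)


# Made-up OCR output in which "cell" was misread as "ce11".
sample = "The mitochondria is the powerhouse of the ce11."
rate = search_hit_rate(sample, ["mitochondria", "powerhouse", "cell"])
print(f"{rate:.0%} of the expected words are searchable")  # 2 of 3 words found
```

    A low rate on a page that looks fine on screen would suggest the hidden text layer, not the image, is the problem.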
     
  4. adam.mt

    adam.mt Super Moderator Senior Member

    If you're having problems I would suggest either using the scan option in Readiris (option 2 of the wizard in your description) or the scanner software and creating image files, not pdfs, and passing these on to Readiris.

    Doing so will eliminate the possibility of the image-to-PDF conversion process removing detail required for accurate OCR (the scanner will natively create an image; if this is converted to a PDF before the OCR process, then the program has to turn it back into an image in order to perform the OCR - a wasteful and quality-losing step).

    Obviously, as mentioned by Scuba and Frank, if the pdf creation process is perfect and the OCR engine good enough then accurate results can still be obtained; but since you're having problems it's best to reduce the process to a minimum to find where it's failing.

    I also recommend experimenting with your scanner settings, i.e. DPI and tone (black & white, greyscale, etc.).
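    One way to keep such experiments organized is to list every combination of settings up front and give each test scan a distinct file name. The sketch below is only an illustration: the DPI values, mode names, and file-name pattern are assumptions, and it does not talk to the scanner or to Readiris.

```python
from itertools import product


def build_test_plan(dpi_values, color_modes):
    """Return one entry per (DPI, mode) combination with a suggested file name."""
    return [
        {"dpi": dpi, "mode": mode, "file": f"page_{dpi}dpi_{mode}.tif"}
        for dpi, mode in product(dpi_values, color_modes)
    ]


# Hypothetical settings to try; adjust to what the scanner software offers.
plan = build_test_plan([200, 300, 400], ["black_and_white", "greyscale", "color"])
for entry in plan:
    print(entry["dpi"], entry["mode"], entry["file"])
```

    Scanning the same page once per entry and running OCR on each file makes it easier to see which single setting changes the result.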
     
  5. Smiley

    Smiley Pen Pal - Newbie

    About increasing the image quality:

    So should I try:
    1. Scanning a page directly to PDF, then OCR?
    2. Scanning a page, processing it, running OCR, then converting to PDF?

    Also, does it matter what format the original image is in? Would the format (jpg, jpeg, tiff, png, etc.) affect how well the image is recognized by OCR?

    When processing an image (using Photoshop, other software, etc.), what are some things you all change? Or what are some things I should play around with? This is for a textbook that has color, graphics, text, charts, graphs, etc.

    I will try doing what you all suggest.
     