
Abstract

Instructions for scanning text documents to produce PDF files using optical character recognition.

This publication is available in Web form and also as a PDF document. Please forward any comments to tcc-doc@nmt.edu.

Table of Contents

1. Overview
2. Creating the PDF file
3. Character recognition

1. Overview

This document describes techniques for converting textual content from a paper document into machine-readable form. The computer attempts to “read,” or recognize, the characters on your page using a technique called optical character recognition (OCR).

If all you want to do is capture an exact image of a flat original, see Using the flatbed scanner. Use OCR if you want to extract the textual content, and especially if you need to modify the text.

Warning

Please do not expect miracles from this process. For best results, you will need an original document that is very crisply printed in a common font. Results will be poor or useless for originals with complex layouts, unusual fonts, or stray marks. If your original has multiple columns, the result may mix text from the columns together; single-column originals work best.
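
To see why multi-column originals can come out scrambled, here is a toy sketch in Python. It is only an illustration under simplifying assumptions: the page is modeled as rows of text strings, and the names page_rows, read_across, and read_by_column are made up for this example. It does not describe how Acrobat or any real OCR engine works internally.

```python
# Toy illustration only: real OCR works on scanned pixels, not strings.
# Each row of the page holds a piece of the left column and a piece of
# the right column, side by side, as they appear on the paper.
page_rows = [
    ("The quick brown", "Results will be"),
    ("fox jumps over", "poor for pages"),
    ("the lazy dog.", "with two columns."),
]


def read_across(rows):
    """Read each row left to right, ignoring the column boundary."""
    return " ".join(f"{left} {right}" for left, right in rows)


def read_by_column(rows):
    """Read the whole left column first, then the whole right column."""
    left_text = " ".join(left for left, _ in rows)
    right_text = " ".join(right for _, right in rows)
    return f"{left_text} {right_text}"


mixed = read_across(page_rows)
ordered = read_by_column(page_rows)

print("Reading across:   ", mixed)
print("Reading by column:", ordered)
```

Reading across the rows interleaves the two columns, which is the kind of mixing you may see in the recognized text. A single-column original has no such ambiguity.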

For this process, you will need to use one of the PC workstations that has a flatbed scanner attached. Most of these workstations are in Speare 5; ask the User Consultant there to help you find an appropriate system.

The necessary software package, Adobe Acrobat Professional 6.0, is available on all scanner-equipped systems. The process has two parts. First, you scan the document and convert it into an Adobe PDF (Portable Document Format) file. Second, Acrobat attempts to recognize the text and attach it to the document. If recognition succeeds, you can then save the textual content in other formats, such as Microsoft Word or plain text.