
Steps in forms processing

Just as paper documents must be prepared before they are fed into a scanner (staples removed, wrinkles smoothed, pages positioned for optimal registration), the image of a form must be prepared before it can be intelligently recognized. The process follows these steps:

Document scanning - Pages of forms are scanned and converted into bit-mapped (usually TIFF) images, which are either compressed and stored for later batch processing or passed immediately, uncompressed, to an ICR engine for recognition.

Image analysis - The document image is cleaned up. Character image quality is improved, using image enhancement techniques. Background "noise" is removed from the form.

Form alignment - The image is registered and deskewed by the ICR software, which aligns the form automatically, using special symbols printed on the document, called registration marks, as guides.
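As a rough illustration of the alignment step, the sketch below estimates skew from two registration marks that are assumed to lie on the same horizontal line of the printed form, then rotates a point back by that angle. The function names, the two-mark layout, and the (x, y) pixel coordinates are assumptions made for this example; a real ICR product may use more marks and a different method.

```python
import math

def estimate_skew_degrees(left_mark, right_mark):
    """Angle in degrees of the line through two marks that should be level."""
    dx = right_mark[0] - left_mark[0]
    dy = right_mark[1] - left_mark[1]
    return math.degrees(math.atan2(dy, dx))

def deskew_point(point, pivot, skew_degrees):
    """Rotate a point about the pivot by -skew to undo the measured skew."""
    theta = math.radians(-skew_degrees)
    x, y = point[0] - pivot[0], point[1] - pivot[1]
    return (pivot[0] + x * math.cos(theta) - y * math.sin(theta),
            pivot[1] + x * math.sin(theta) + y * math.cos(theta))

# Hypothetical mark positions found in a slightly rotated scan.
left, right = (100, 100), (1100, 110)
skew = estimate_skew_degrees(left, right)
corrected_right = deskew_point(right, left, skew)
```

After correction, the right-hand mark sits at the same height as the left-hand one, which is the condition the template needs before it can locate fields by position.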

Form identification - The document is identified by certain predefined characteristics that the ICR software is trained to look for, so that the zones containing the fields designated for recognition can be located by a customized, predefined ICR template. Form ID attributes can include form numbers, corporate logos, or the name of the form itself imprinted somewhere on the form.

Form background removal - This stage is not necessary if the document is a form that was originally printed in a colored ("drop out") ink that is invisible to the scanner being used. If colored ink is not used, the form image may contain lines, boxes, fine print, and other form attributes (the passive data) that tend to confuse the ICR engine. These form attributes must be extracted from the image of the form, so that only the character images (the active data) are left behind. Broken and fragmented characters are automatically repaired and restored to their original shapes.
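One simple way to picture background removal is to subtract a blank copy of the form from the filled-in image. The sketch below does this on tiny binary bitmaps (1 = ink, 0 = paper). It assumes the two images are already aligned pixel for pixel, and the function name and data layout are invented for illustration rather than taken from any particular ICR engine.

```python
def remove_form_background(filled, blank):
    """Clear every pixel that is also inked in the blank form template."""
    return [
        [1 if f == 1 and b == 0 else 0 for f, b in zip(filled_row, blank_row)]
        for filled_row, blank_row in zip(filled, blank)
    ]

# A box outline (top and bottom rules) with handwriting between the rules.
blank = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
filled = [
    [1, 1, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]
active_data = remove_form_background(filled, blank)
```

Plain subtraction also erases any part of a character that overlaps a form line, which is one reason the broken characters mentioned above need to be repaired afterwards.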

Character field location - The predefined ICR template automatically locates the fields that contain character data. The template identifies which individual fields on the form image require character recognition and what kind of data each field holds: hand print, machine print, numeric, alphabetic, alphanumeric, etc. The template also identifies which areas are barcodes or check box recognition zones.

Character segmentation - Sophisticated software routines analyze the character fields and break them down into isolated characters. If the form is "ICR-friendly," characters are segmented with the aid of graphic devices such as boxes, tick-marks, and connected boxes called "combs," which force the form user to separate the characters legibly from one another.
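For well-separated characters, such as those written in a comb, a basic segmentation approach is to look for empty columns between characters. The sketch below uses a vertical projection profile on a binary bitmap (1 = ink). This is a textbook technique chosen for illustration, not necessarily what a given ICR engine does, and it will not split characters that touch.

```python
def segment_columns(bitmap):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(bitmap[0]) if bitmap else 0
    # Count inked pixels in each column (the vertical projection profile).
    ink_per_column = [sum(row[x] for row in bitmap) for x in range(width)]

    segments, start = [], None
    for x, ink in enumerate(ink_per_column):
        if ink and start is None:
            start = x                      # a character begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends it
            start = None
    if start is not None:
        segments.append((start, width))    # character runs to the right edge
    return segments

# Two small "characters" separated by two blank columns.
field = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
]
pieces = segment_columns(field)
```

Each returned range can then be cropped out and passed to the classifier as a single character image.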

Character classification - Individual characters are classified by ICR algorithms according to their ASCII category and assigned a confidence value, which is an index of how "certain" the ICR engine "feels" about the selection it has made. Alternate character choices are ranked according to those values, so that they can be incorporated into editing procedures that improve ICR accuracy. For example, the alternate choice "1" might be used instead of the first-ranked choice "I" when contextual analysis reports that the field is all-numeric.
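The "1" versus "I" example can be sketched as follows. The candidate list, the confidence values, and the field-type names are hypothetical; the point is only that the highest-ranked choice permitted by the field's type wins, even when it was not ranked first overall.

```python
import string

# Characters permitted for each (assumed) field type.
ALLOWED = {
    "numeric": set(string.digits),
    "alphabetic": set(string.ascii_letters),
    "alphanumeric": set(string.digits + string.ascii_letters),
}

def pick_character(candidates, field_type):
    """Return the best (char, confidence) allowed by the field type, or None.

    candidates is a list of (char, confidence) pairs in any order.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for char, confidence in ranked:
        if char in ALLOWED[field_type]:
            return char, confidence
    return None  # nothing fits: the field should be treated as a reject

# The engine prefers "I", but the template says the field is numeric.
choices = [("I", 0.62), ("1", 0.35)]
numeric_pick = pick_character(choices, "numeric")
alphabetic_pick = pick_character(choices, "alphabetic")
```

Here the numeric field resolves to "1" and the alphabetic field keeps "I". Returning None when no candidate fits leaves the decision to a later stage rather than guessing.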

Post-processing - The initial or "raw" recognition results are validated using edit procedures such as grammatical rules, spell-checkers, dictionaries, check-sum routines, and look-up tables. Ambiguous and erroneous data fields (the "rejects") are identified and sent to data entry operators at workstations for manual correction.
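A minimal sketch of one such edit procedure is shown below. It assumes a numeric field protected by a Luhn check digit (the scheme used on payment card numbers) and an arbitrary confidence threshold of 0.80. Both are illustrative choices, and a real deployment would pick the checks and thresholds that suit each field.

```python
def luhn_valid(digits):
    """Check a string of digits against the Luhn check-sum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def route_field(text, confidences, min_confidence=0.80):
    """Return 'accept', or 'reject' if the field needs manual correction."""
    if not text.isdigit():
        return "reject"       # a non-digit in a numeric field
    if min(confidences) < min_confidence:
        return "reject"       # at least one character was a weak guess
    return "accept" if luhn_valid(text) else "reject"

good = route_field("79927398713", [0.95] * 11)
bad_checksum = route_field("79927398714", [0.95] * 11)
low_confidence = route_field("79927398713", [0.95] * 10 + [0.40])
```

Only the first field is accepted. The other two are rejects and would be queued for a data entry operator.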

Manual correction of rejected character fields - The manner in which the data entry operator is presented the rejected data for correction can dramatically impact both the speed and the accuracy of the reject repair process. In particular, the data entry GUI is important because the ergonomics of data entry are what enable a given data entry operator to reach his or her maximum correction speed.

What is interesting in forms processing is that only one of the steps, character classification, is specifically concerned with identifying character data. The rest of the steps have to do with either preparing the imaged characters for classification or interpreting the results of character classification. With the opportunity for error accumulating at each successive step, it is remarkable that ICR accuracy rates can attain (and sometimes exceed) human performance levels.
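To see how quickly errors can accumulate, consider a simplified model in which each of the n automated stages independently handles a character correctly with probability p_i. The per-stage figure of 0.99 below is purely illustrative and is not a measured value for any ICR system.

```latex
P_{\text{total}} = \prod_{i=1}^{n} p_i ,
\qquad
p_i = 0.99,\; n = 9
\;\Longrightarrow\;
P_{\text{total}} = 0.99^{9} \approx 0.914
```

Under these assumptions, nine stages that are each 99 percent reliable leave roughly one character in twelve mishandled somewhere along the way, which is why the validation and manual correction steps matter so much.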
 
