KT-TECH'S MORPHOLOGICAL (SHAPE-BASED) PREPROCESSING TECHNOLOGY FOR ROBUST OPTICAL CHARACTER RECOGNITION

A scanner is a remarkable device that gives the user the power to port any kind of printed material into the computer. The user can take scanned data and reformat it the way he wants it, catalog information in a way that makes sense to him, and create enormous databases that he can store on-line. To do so, however, the user needs Optical Character Recognition (OCR) technology, which lets him use the scanner efficiently to convert data from hardcopy format to digitized format.

A scanner creates an image of a document page. OCR software then converts the image into text for a word processor or spreadsheet. Text on a printed page appears in the word processor or spreadsheet, ready to edit and reformat without the need to retype. Thus, a scanner is not quite complete until it can "read" text into the word processing or spreadsheet software.

There are, however, limits to what OCR software can accomplish. When reading degraded documents, such as documents with small type and fuzzy characters, or documents with strong, colored, or textured background information and/or noise (dirt, smears, dropouts), the accuracy of most OCR software diminishes rapidly. In such cases, one needs an intermediate preprocessor module that enhances the scanned data prior to the OCR for better recognition and decreased rejection rates.

At present, commercially available "character recognition systems" that convert scanned documents to ASCII text files can de-skew and de-speckle the scanned document. However, these systems cannot remove background information or other extraneous content, such as logos, icons, dirt, and smears, that is sometimes present on the document. Such information, when not properly removed, can confuse the recognition engine, resulting in time delays and/or erroneous character recognition.

The basic premise of preprocessing the scanned data is to improve the signal-to-noise ratio of the data, i.e. to remove any features (noise) that, if left alone, may complicate the recognition process, while retaining those features (signal) that are necessary for recognition. Prior to the recognition stage, what counts as signal and what counts as noise is defined relative to the minimal data attributes that facilitate correct symbol detection and the attributes that complicate it. An image, icon, or rendered form of a character has many degradations, which constitute noise, as well as many redundant features, which overstate the signal. The optimal preprocessing scheme is therefore one that minimizes the mapping of both the degradations and the redundant signal information into the enhanced signal.

To address this problem, KT-Tech is developing a morphological (shape-based) preprocessing module that will improve the performance of OCR on scanned documents. Figure 1 illustrates the role of this morphological preprocessing module within the overall scanning, recognition, and storage system. The module is implemented in software and resides on the hard drive of the computer. We place it between the scanner and the OCR software, where its function is to enhance the scanned data prior to the OCR for better recognition and decreased rejection rates. The OCR module is followed by the word processor/spreadsheet software, where the data can be edited and/or reformatted. The resulting document is then saved on the hard drive of the computer and/or on an external storage medium.

For the case of scanned documents, noise (dirt, smears, dropouts, as well as background) and signal (alphanumeric characters rendered by dot matrix, laser, or line printers, as well as images, icons, and other characters) have different spatial and spectral signatures. By taking advantage of these differences, KT-Tech's morphological preprocessing module removes the noise and the background while retaining the information necessary for successful recognition.

Figure 2 illustrates an original scanned-in postal label with strong background noise. The wavy lines in the background constitute noise and have to be removed prior to compression. Figure 3 presents the result of applying a standard noise reduction algorithm, the Otsu threshold, to this data. Note the amplification of the background noise, which corrupts the data. Figure 4 illustrates the result of applying KT-Tech's unique morphological preprocessing algorithm and adaptive thresholding to the original data. Note the improvement in quality: KT-Tech's technique removes the background noise completely.
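
The contrast between a global Otsu threshold and shape-based cleanup can be sketched on a small synthetic image. The sketch below is not KT-Tech's algorithm, which this document does not specify. It assumes a generic 3x3 grayscale closing as the morphological step and reuses Otsu's method (rather than an adaptive threshold) for the final binarization. The gray levels, the image layout, and all function names are hypothetical and chosen only for illustration.

```python
# Illustrative only: synthetic "label" with thin wavy background lines and a
# thick letter T. All names, sizes, and gray levels are assumptions.
BG, WAVE_GRAY, INK = 200, 120, 30
H, W = 20, 32
WAVE = (0, 1, 1, 0, -1, -1)  # row offsets of one period of a 1-pixel-wide wave


def make_label():
    """Return (grayscale image, boolean mask of the true text pixels)."""
    img = [[BG] * W for _ in range(H)]
    for base in range(2, H - 1, 4):          # one wavy line every 4 rows
        for x in range(W):
            img[base + WAVE[x % 6]][x] = WAVE_GRAY
    text = [[False] * W for _ in range(H)]
    for y in range(3, 6):                    # horizontal bar of the T, 3 px thick
        for x in range(8, 23):
            text[y][x] = True
    for y in range(3, 17):                   # vertical bar of the T, 3 px thick
        for x in range(14, 17):
            text[y][x] = True
    for y in range(H):
        for x in range(W):
            if text[y][x]:
                img[y][x] = INK              # ink is printed over the background
    return img, text


def otsu_threshold(pixels):
    """Gray level t maximizing between-class variance (classes: <= t and > t)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        var_between = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def window_filter(img, pick):
    """Apply pick (max or min) over each 3x3 neighbourhood, clipped at borders."""
    h, w = len(img), len(img[0])
    return [[pick(img[j][i]
                  for j in range(max(0, y - 1), min(h, y + 2))
                  for i in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)] for y in range(h)]


def closing_3x3(img):
    """Grayscale closing (max filter, then min filter): erases dark detail
    thinner than 3 pixels while restoring thicker dark strokes."""
    return window_filter(window_filter(img, max), min)


def binarize(img):
    """Mark pixels at or below the Otsu threshold as foreground (dark ink)."""
    t = otsu_threshold([p for row in img for p in row])
    return [[p <= t for p in row] for row in img]


def count(mask):
    return sum(sum(row) for row in mask)


label, text_mask = make_label()
raw_mask = binarize(label)                   # Otsu alone
clean_mask = binarize(closing_3x3(label))    # closing first, then Otsu
print("true text pixels:            ", count(text_mask))
print("foreground, Otsu alone:      ", count(raw_mask))
print("foreground, closing + Otsu:  ", count(clean_mask))
```

In this toy layout the wavy lines are numerous enough that the global threshold groups them with the ink, whereas the closing removes them first because they are thinner than the character strokes. Real scans would need a structuring element matched to the actual stroke and noise widths, and thin character strokes would be at risk from the same operation.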