Saturday, October 1, 2011

Research Paper: A Survey of Telugu OCR Systems


Development of an Efficient OCR System for Telugu Script
 

INTRODUCTION

Optical character recognition is usually abbreviated as OCR.  The object of OCR is the automatic reading of optically sensed document text to translate human-readable characters into machine-readable codes.  Today, reasonably efficient and inexpensive OCR packages are commercially available to recognize printed text in widely used languages such as English, Chinese, and Japanese.  These systems can process documents that are typewritten or printed.  They can recognize characters of different fonts and sizes as well as different formats, including intermixed text and graphics.  While a large body of literature is available for the recognition of Roman, Chinese and Japanese characters, relatively little work has been reported on the recognition of Indian language scripts, and OCR for Indian scripts remains a challenging research problem.  Recently, work has been done on developing such packages for Indian languages, as reported in the Proceedings of the International Conference on Document Analysis and Recognition, 1999.

A script is a system of characters used for writing or printing a natural language.  The Latin script represents Western European languages; Devanagari is used for Hindi; and the Telugu, Kannada and Tamil scripts serve Dravidian languages.  Devanagari is the most widely used Indian script, serving as the writing system for over 28 languages including Sanskrit, Hindi, Kashmiri, Marathi and Nepali.  The Telugu script is the second most widely used script in India.  Indian scripts make no upper-case/lower-case distinction; the concept of capitalization does not exist.  Many names are also common nouns, and Indian names are more diverse.  Machine processing for document categorization demands establishing a relation between a coded sequence of characters and the human perception of the language.
 
Telugu is the oldest and principal language of the eastern part of the Indian Peninsula from Madras to Bengal, i.e., the southern Indian state of Andhra Pradesh.  Telugu is spoken by more than 100 million people, especially in South India.  It ranks somewhere between the 13th and 17th most widely spoken languages in the world, alongside Korean, Vietnamese, Tamil and Marathi.  The distribution of spoken languages in India is geographic: each state of the country usually speaks a different language (apart from a large number of Hindi-speaking states).  Andhra Pradesh, where Telugu is spoken, shares borders with five states that speak Tamil, Kannada, Marathi, Hindi and Oriya.  Thus, in regions along the borders with these states, the dialect of Telugu differs, although the script and the formal (written) language are the same.  The language has had a rich literature since 1120 AD and has been studied extensively by native and foreign linguists.  However, under pressure from the use of Hindi as a national language and English as a global language, chinks are beginning to appear in the Telugu fabric, and concerns are being expressed about the functional survival of Telugu.  The language has not benefited significantly from recent advances in computational approaches to linguistic or statistical processing of natural language texts.

The language is called Telugu or Tenugu.  Formerly, Europeans often called it Gentoo, a corruption of the Portuguese gentio.  Another name is Andhra, which is used in the Aitareya Brahmana to denote the Indian people.  The word Telugu is supposed to be a corruption of the Sanskrit Trilinga, explained as meaning "the country of the three lingas": a tradition is quoted according to which Siva, in the form of a linga, descended upon the three mountains Kaleswara, Srisaila and Bhimeswara, and these mountains marked the boundary of the Telugu country.  According to linguists, Telugu is a Dravidian language; it does not belong to the Indo-Aryan family to which Hindi, Sanskrit, Latin and Greek belong.  Telugu has a recorded history from the 6th century A.D. and a fine literature up to the end of the 19th century.  It is a syllabic language: each symbol in the Telugu script represents a complete syllable, so there is very little scope for confusion and spelling problems.  The language is usually written using the Telugu alphabet, a Brahmic script, and is phonetic in nature, written from left to right.

In the Telugu script, individual characters are written separately.  It represents one major class of Indian scripts: those without a shirorekha (a head bar that joins all the characters in a word), such as Malayalam, Kannada and Tamil.  Character separation prior to recognition is therefore not needed in Telugu.  These two features are common to all four south Indian languages.  In fact, the absence of the shirorekha is a feature of other scripts such as Oriya and Gujarati too.

Telugu has a complex orthography with a large number of distinct character shapes (estimated to be of the order of 10,000), composed of simple and compound characters formed from 52 letters: 16 vowels (called achchus) that represent basic vowel sounds and 36 consonants (called hallus).  In addition, several semi-vowel symbols, called maatras, are used in conjunction with hallus, and half-consonants, called voththus, are used in consonant clusters.  Achchus, hallus, maatras and voththus together provide roughly 100 basic orthographic units that are combined in different ways to represent all the frequently used syllables (estimated between 5,000 and 10,000) in the language.  Here, we refer to these basic orthographic units as glyphs (see Figure 1 for the complete listing).

A glyph refers to a single connected component within a character.  A simple character is one containing only a single glyph (the first two characters in Figure 2).  A compound character is one containing more than one glyph (the last three characters in Figure 2).

Differences from English text are strikingly apparent in the composition of the various characters.  While written words and sentences run linearly left to right, the maatras and voththus are placed above and below the hallus.  In Figure 2, the first two characters are examples of an achchu and a hallu.  The third character shows a maatra placed above the hallu (the second character), and the fourth shows a voththu in addition to the maatra.  The last is a more complex example, with a maatra and two voththus placed above and below a consonant.  The number of glyphs is the sum of the achchus (16), hallus (36), maatras (16) and voththus (36), which is 104.  In actual practice, not all voththus are used and the total number is less than 100.  Adding the glyphs for punctuation, special symbols and Telugu numerals, the total is approximately 120.

As a general rule, when a vowel appears at the beginning of a word, the full vowel character is displayed.  When a vowel appears inside a word, a vowel modifier is added to the preceding consonant, creating a composite character.  The number of combinations possible to form a composite character is very large.  A character can also have more than one consonant modifier (as in stree).  Word-processing with such a complex script on a QWERTY keyboard can be a harrowing experience, considering the cumbersome keyboard mappings the user has to learn.

One of the most difficult tasks in Telugu OCR is to put the disjoint glyphs back together into characters.  Relying on character spacing or distances between glyph locations is extremely error-prone because of wide variations.  Sometimes certain vowel or consonant modifiers are not picked up because they are too far from their base character; sometimes they are associated with the wrong base character; and in some cases they remain unconnected.

Optical character recognition is the recognition of printed or written text by a computer.  This involves photo scanning of the text, which converts the paper document into an image, and then translation of the text image into character codes such as ASCII.  Any OCR implementation consists of a number of preprocessing steps followed by the actual recognition.  Most document analysis systems can be visualized as consisting of three steps:  the preprocessor, the feature extractor and the recognizer.  In preprocessing, the raw image obtained by scanning a page of text is converted to a form acceptable to the recognizer by extracting individually recognizable symbols.  This step is also called symbol segmentation.  The number and types of preprocessing algorithms employed on the scanned image depend on many factors such as age of the document, paper quality, resolution of the scanned image, the amount of skew in the image, the format and layout of the images and text, the kind of script used and also on the type of characters - printed or handwritten (Anbumani & Subramanian 2000).  Typical preprocessing includes the following stages:
1. Binarization
2. Noise removal
3. Thinning
4. Skew detection and correction
5. Line segmentation
6. Word segmentation
7. Character segmentation



The preprocessed image of the symbol is further processed to obtain meaningful elements called features.  Recognition is completed by searching a database of stored feature vectors of all possible symbols for the one that matches the feature vector of the symbol to be recognized.  Recognition consists of:
1. Feature extraction
2. Feature selection
3. Classification



Binarization:  Binarization is a technique by which gray-scale images are converted to binary images.  In any image analysis or enhancement problem, it is essential to identify the objects of interest from the rest.  Binarization separates the foreground (text) from the background information.  The most common method for binarization is to select a proper threshold for the intensity of the image and then convert all intensity values above the threshold to one intensity value (for example, "white"), and all intensity values below the threshold to the other chosen intensity ("black").  Binarization is usually performed either globally or locally.  Global methods apply one threshold to the entire image.  Local or adaptive thresholding methods apply different thresholds to different regions of the image; these threshold values are determined by the neighborhood of the pixel to which the thresholding is being applied.  Several binarization techniques are discussed in (Anuradha & Koteswarrao 2006).
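
To make the distinction between global and local thresholding concrete, here is a minimal sketch, assuming the scanned page is held as a grayscale 2-D NumPy array; the window size and offset are arbitrary illustrative parameters, and a production system would more likely use a method such as Otsu's or Sauvola's.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def binarize_global(gray, threshold=128):
    """Global thresholding: a single threshold for the whole image."""
    return (gray > threshold).astype(np.uint8)   # 1 = background, 0 = text

def binarize_local(gray, window=25, offset=10):
    """Adaptive thresholding: each pixel is compared against the mean
    intensity of its own neighborhood, so dark regions of the page get
    a lower threshold than bright ones."""
    local_mean = uniform_filter(gray.astype(np.float64), size=window)
    return (gray > local_mean - offset).astype(np.uint8)
```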

Noise Removal:  Scanned documents often contain noise that arises due to printer, scanner, print quality, age of the document, etc.  Therefore, it is necessary to filter this noise before we process the image. The commonly used approach is to low-pass filter the image and to use it for later processing.  The objective in the design of a filter to reduce noise is that it should remove as much of the noise as possible while retaining the entire signal (Rangachar et al 2002).
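
As a small illustration, a median filter (a common choice against the salt-and-pepper noise typical of old scanned pages, alongside the low-pass filtering mentioned above) can be applied in one call:

```python
from scipy.ndimage import median_filter

def denoise(gray, size=3):
    """Replace each pixel by the median of its size x size neighborhood."""
    return median_filter(gray, size=size)
```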

Thinning:  Thinning or skeletonization is a process by which a one-pixel-width representation (or skeleton) of an object is obtained, preserving the connectedness of the object and its end points (Gonzalez & Woods 2002).  The purpose of thinning is to reduce the image components to their essential information so that further analysis and recognition are facilitated.  This enables easier subsequent detection of pertinent features.  A number of thinning algorithms have been proposed and are being used.  The most common is the classical Hilditch algorithm (Rangachar et al 2002) and its variants.
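
The sketch below uses scikit-image's skeletonize, which is not Hilditch's algorithm itself but produces the same kind of one-pixel-wide, connectivity-preserving skeleton:

```python
from skimage.morphology import skeletonize

def thin(binary_text):
    """binary_text: boolean 2-D array, True where there is ink.
    Returns a one-pixel-wide skeleton of the same shape."""
    return skeletonize(binary_text)
```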

Skew detection and correction:  When a document is fed to the scanner either mechanically or by a human operator, a few degrees of skew (tilt) are unavoidable.  Skew angle is the angle that the lines of text in the digital image make with the horizontal direction.

There exist many techniques for skew estimation.  One technique is based on the projection profile of the document; another class of approaches is based on nearest-neighbor clustering of connected components.  Techniques based on the Hough transform and the Fourier transform are also employed.  A survey of different skew correction techniques can be found in Chaudhuri & Pal (1997).  A popular method for skew detection employs the projection profile.  A horizontal projection profile is a one-dimensional array in which each element denotes the number of black pixels along a row of the image.  For a document whose text lines span horizontally, the horizontal projection profile has peaks whose widths equal the character height and valleys whose widths equal the spacing between lines.  At the correct skew angle, since scan lines are aligned to text lines, the projection profile has maximum-height peaks for text and valleys for line spacing.
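
This criterion suggests a simple search procedure.  The following is a minimal sketch, assuming a binarized image held as a NumPy array with 1 for ink: rotate the image over a range of candidate angles and keep the angle whose horizontal projection profile has the highest variance (i.e., the sharpest peaks and valleys).

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary, angles=np.arange(-5.0, 5.25, 0.25)):
    """Return the candidate angle (in degrees) that best aligns text lines."""
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        rotated = rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)   # ink pixels per row
        score = profile.var()           # sharp peaks -> high variance
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle                   # correct skew by rotating back by -best_angle
```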

Line, word, and character segmentation:  After the tilt is corrected, the text has to be segmented: first into lines, then each line into words, and finally each word into its constituent characters.  An error in segmentation may lead to wrong recognition of text and may render the system useless.  Recognition of text depends heavily on proper segmentation into lines, words and then individual characters or sub-characters for feature extraction and classification.

Horizontal projection of a document image is most commonly employed to extract the lines from the document.  If the lines are well separated, and are not tilted, the horizontal projection will have separated peaks and valleys, which serve as the separators of the text lines.  These valleys are easily detected and used to determine the location of boundaries between lines.

Similarly, a vertical projection profile gives the column sums.  One can separate lines by looking for minima in the horizontal projection profile of the page, and then separate words by looking for minima in the vertical projection profile of a single line.  Valleys in the vertical projection of a line image can be used to extract the words in a line, as well as to extract individual characters from a word.  For example, a line consisting of 4 words can be split into those words using its vertical projection profile, and a word can be segmented into its constituent characters the same way.  However, overlapping and adjacent characters in a word (called kerned characters) cannot be segmented using zero-valued valleys in the vertical projection profile; special techniques have to be employed to solve this problem.
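
A minimal sketch of this valley-based segmentation, assuming a deskewed binary image with 1 for ink: rows with zero projection separate lines, and zero-valued columns within a line separate words (or glyphs).

```python
import numpy as np

def split_on_gaps(profile):
    """Return (start, end) index pairs of runs where the profile is non-zero."""
    segments, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments

def segment_lines_and_words(binary):
    """For each text line, return its row range and the column ranges of its words."""
    lines = split_on_gaps(binary.sum(axis=1))            # horizontal profile
    return [(top, bottom, split_on_gaps(binary[top:bottom].sum(axis=0)))
            for top, bottom in lines]                    # vertical profile per line
```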

The problem that arises while doing character segmentation in Telugu script is more complex than in Roman scripts.  This is due to (i) the large number of complexes produced by neighboring characters touching each other; (ii) those produced by characters touching voththus and maatras; (iii) voththus being placed in a large number of relative positions with respect to their base characters; and (iv) voththus of an upper line touching maatras of a lower line.  All these lead to a number of situations where touching characters occur in a Telugu document.


Feature extraction and selection:  Feature extraction can be considered as finding a set of parameters (features) that define the shape of the underlying character as precisely and uniquely as possible.  The features have to be selected in such a way that they help in discriminating between characters.  Thinned data is analyzed to detect features such as straight lines, curves, and significant points along the curves.
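
As one concrete illustration (a generic choice, not a feature prescribed by any particular system surveyed here), zoning divides the character image into a grid and uses the ink density of each zone as a feature:

```python
import numpy as np

def zoning_features(binary_char, grid=(4, 4)):
    """Divide the character into grid zones; each feature is a zone's ink density."""
    h, w = binary_char.shape
    rows, cols = grid
    feats = []
    for i in range(rows):
        for j in range(cols):
            zone = binary_char[i * h // rows:(i + 1) * h // rows,
                               j * w // cols:(j + 1) * w // cols]
            feats.append(zone.mean())   # fraction of ink pixels in the zone
    return np.array(feats)
```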

Feature selection approaches try to find a subset of the original features.  The strategies used for OCR can be broadly classified into three categories:
1. Statistical approach
2. Syntactic/structural approach
3. Hybrid approach

In the statistical approach, a pattern is represented as a vector: an ordered, fixed-length list of numeric features.  Many samples of a pattern are used for collecting statistics; this phase is known as the training phase.  The objective is to expose the system to the natural variants of a character.  The recognition process then uses these statistics to identify an unknown character.  Features derived from the statistical distribution of points include geometrical moments and black-to-white crossings.
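
The black-to-white crossing feature mentioned above is easy to compute; a minimal sketch, counting per-row transitions from ink to background in a binary character image:

```python
import numpy as np

def row_crossings(binary_char):
    """Number of black-to-white transitions along each row (1 = ink)."""
    deltas = np.diff(binary_char.astype(np.int8), axis=1)
    return (deltas == -1).sum(axis=1)   # a 1 -> 0 step shows up as -1
```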

Structural classification methods utilize structural features and decision rules to classify characters.  Structural features may be defined in terms of character strokes, character holes, end points, loops or other character attributes such as concavities.  The classifier is expected to recognize the natural variants of a character but discriminate between similar looking characters such as O and Q, c and e, etc.

The statistical and structural approaches both have their advantages and disadvantages.  Statistical features are more tolerant to noise than structural descriptions, provided the sample space over which training has been performed is representative and realistic.  Variation due to font or writing style can be more easily abstracted in structural descriptions.

In the hybrid approach, these two approaches are combined at appropriate stages for the representation of characters and for the classification of unknown characters.

Classification:  The classification stage in an OCR process assigns labels to character images based on the features extracted and the relationships among the features.  In simple terms, it is this part of the OCR which finally recognizes individual characters and outputs them in machine editable form. 

Template matching is one of the oldest and most common classification methods.  In template matching, individual image pixels are used as features.  Classification is performed by comparing an input character image with a set of templates (or prototypes) from each character class.  The template that matches the unknown most closely provides the recognition.

Classification strategies following feature extraction are mostly based on nearest-neighbor identification: a distance measure between feature vectors is used as the similarity between the images.  Binary-tree classifiers and nearest-neighbor classifiers are the two most commonly used classifiers.  The most frequently used distance is the Euclidean distance measure.
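
A minimal nearest-neighbor classifier over such feature vectors, using the Euclidean distance described above, looks like this:

```python
import numpy as np

def nearest_neighbor(test_vec, train_vecs, train_labels):
    """train_vecs: (n, d) array of stored feature vectors;
    train_labels: sequence of n class labels."""
    distances = np.linalg.norm(train_vecs - test_vec, axis=1)
    return train_labels[int(np.argmin(distances))]
```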


The Challenges
Language-specific Issues:  Telugu, like most Indian languages, has a complex script, in which a consonant can be modified by a vowel, another consonant and/or a diacritic.  Due to this inherent complexity of the script and writing style, accurate segmentation and matching of words (and characters) is a very difficult task.

Issues in Scanning:  Scanned document images contain a large number of artifacts, which are cleaned on a large scale using a semi-automatic process involving various image processing operations.  Owing to the variation in quality across images, a single set of image processing parameters is not suitable for all of them.  Consequently, the overall quality of the processed images is poor, which makes matching and recognizing such words very difficult.

Scalability:  The massiveness of digital library collections is a serious challenge for automating these processes.  Owing to this magnitude, even quick image processing routines require large amounts of time.  Despite considerable optimization, the computation required is enormous, and the processing has to be distributed over a cluster of computers.  Managing such a cluster and transferring large amounts of data across the network were among the major bottlenecks in system development.


Applications and Extensions

Research in OCR is popular for its application potential in several areas.  Some practical applications of OCR are: reading aids for the blind, preserving old/historical documents in electronic format, desktop publishing, library cataloging, ledgering, automatic reading for sorting of postal mail, bank cheques and other documents, etc.  Research in printed Telugu script recognition was taken up to address many specific and generic applications that need an OCR engine.  Our strength is in extending the recognizer to many real-life applications.

Post Processor:  Component-level recognition results can be combined to get word-level output.  At this stage, one can use a post-processor to improve the recognition accuracy.  We can use two distinct types of post-processing schemes.  In an application such as a learning aid for illiterate people, we have a fixed and limited word set to be recognized, and we use post-processing based on geometric information of components for some confusing pairs.  At this stage, some of the frequent misclassifications are also corrected using approximate string matching techniques, as sketched below.
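
A minimal sketch of the approximate-string-matching idea: the OCR output word is replaced by the closest entry in the fixed, limited lexicon, measured by Levenshtein edit distance (the lexicon and the distance choice are generic illustrations, not the system's exact implementation).

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def correct(word, lexicon):
    """Return the lexicon entry closest to the recognized word."""
    return min(lexicon, key=lambda entry: edit_distance(word, entry))
```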

Speech Recognition System:  The speech recognizer has been developed by collecting speech samples of Telugu alphabets, including vowels, consonants, consonant-vowel combinations, consonant-vowel-consonants, two- and three-letter words, and simple sentences, from 16 speakers, both male and female.
 
Document Database System:  An important application of OCR is for converting printed documents into electronic form on which search and retrieval can be carried out.  Indian language document databases are non-existent today primarily because of the lack of good OCR systems.  We can integrate the OCR engine with a page segmentation procedure for the archival of printed pages.  Wherever OCR fails to recognize the text, the relevant area is stored as an image.  The rest of the text is converted into electronic form (ISCII) and the content and structure of the text block is stored.
 
Document Speaking System:  Another application is a reading aid for the visually impaired people.  With the scarcity of printed material in Braille, it is becoming more and more difficult for such people to access information in the printed medium.  The document reading system integrates Text-To-Speech (TTS) system with the OCR to achieve this.  A data driven approach using example-based learning is the basis for the TTS.

Human-Computer Speech Interactive system:  The human-computer speech interactive system consists of three modules, namely, the Speech Recognition System (SRS), the Search Engine (SE), and the Text-to-Speech System (TTS).  Each of these modules plays a very important role.  The HCSI module gives the speaker's data to the SRS module; the SRS module recognizes the letter, word or sentence, depending on the input, and replies with text to the HCSI module.  The text is then passed to the Speech Database (SD) search engine, which looks up the appropriate word or sentence and in turn passes the text to the TTS, which generates a waveform.  This is passed back to the HCSI, which gives speech output to the listener by playing the generated wave file.

User Interface Design:  The User Interface (UI) is designed to be user-friendly.  HTML pages are designed to display the Telugu alphabet.


Survey on Telugu OCR

Rajasekaran & Deekshatulu (1977) proposed 'Recognition of printed Telugu characters' [31] in Computer Graphics and Image Processing.  This was the first reported work on OCR of Telugu characters.  It identifies 50 primitive features and proposes a two-stage syntax-aided character recognition system.  In the first stage, a knowledge-based search is used to recognize and remove the primitive shapes.  In the second stage, the pattern obtained after the removal of primitives is coded by tracing along points on it.  Classification is done by a decision tree.  Primitives are joined and superimposed appropriately to define individual characters.

Rao & Ajitha (1995) proposed 'Telugu Script Recognition – a Feature Based Approach' [33].  The concept of Telugu characters being composed of circular segments of different radii is made use of in this work.  Recognition consists of segmenting the characters into their constituent components and identifying them.  The feature set is chosen as the circular segments, which preserve the canonical shapes of Telugu characters.  The recognition scores are reported as ranging from 78 to 90% across different subjects, and from 91 to 95% when the reference and test sets were from the same subject.

Sukhaswami et al (1995) proposed 'Recognition of Telugu characters using Neural Networks' [38].  A Hopfield model of a neural network working as an associative memory was initially chosen for recognition.  Due to the limited storage capacity of the Hopfield network, they later proposed a multiple neural network associative memory (MNNAM), in which the networks work on mutually disjoint sets of training patterns.  They demonstrated that the storage shortage could be overcome by this scheme.

Pujari et al (2002) proposed 'An Adaptive Character Recognizer for Telugu Scripts using Multi-resolution Analysis and Associative Memory' [30].  Gray-level input text images are line-segmented using horizontal projections, and vertical projections are used for word segmentation.  Images are uniformly scaled to 32x32 using zero-padding techniques.  A wavelet representation with three levels of downsampling reduces a 32x32 image into a set of four 8x8 images, of which only the average image is considered for further processing.  The 8x8 character images are converted to binary images using the mean grey level as the threshold.  The resulting 64-bit string is used as the signature of the input symbol.  A Hopfield-based Dynamic Neural Network is designed for recognition.  The performance across fonts and sizes is reported as varying from 93% to 95%.  The authors report that the same system, when applied to English characters, yielded a very low recognition rate, since the directional features prevalent in Latin scripts are not preserved during signature computation with the wavelet transform.
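
The signature computation can be sketched with PyWavelets.  Note the assumptions: a Haar wavelet and a 2-level decomposition (which is what takes a 32x32 image to an 8x8 approximation); these may differ from the paper's exact settings.

```python
import numpy as np
import pywt

def signature(gray32):
    """gray32: 32x32 gray-level character image (NumPy array)."""
    coeffs = pywt.wavedec2(gray32, 'haar', level=2)
    approx = coeffs[0]                                # 8x8 approximation image
    bits = (approx > approx.mean()).astype(np.uint8)  # binarize at the mean
    return bits.ravel()                               # 64-bit signature
```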

Negi et al (2001) proposed 'An OCR System for Telugu' [24].  Instead of segmenting words into characters as is usually done, words are split into connected components (glyphs).  The Run Length Smearing Algorithm (RLSA) (Wong et al 1982) and Recursive XY Cuts (Nagy et al 1992) are used to segment the input document image into words.  About 370 connected components (depending on the font) are identified as sufficient to compose all the characters, including punctuation marks and numerals.  Template matching based on the fringe distance (Brown 1994) is used to measure the similarity between the input and each template; the template with the minimum fringe distance is marked as the recognized character.  The template code of the recognized character is converted into ISCII, the Indian Standard Code for Information Interchange.  Raw OCR accuracy with no post-processing is reported as 92%.  Performance across fonts varied from 97.3% for the Hemalatha font to 70.1% for a newspaper font.
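
Fringe distance can be sketched using distance transforms; this formulation (each ink pixel of one image is charged the distance to the nearest ink pixel of the other, summed symmetrically) is an assumption for illustration and may differ in detail from Brown's original measure.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fringe_distance(a, b):
    """a, b: boolean arrays of equal shape, True where there is ink.
    Lower totals mean more similar shapes."""
    dist_to_a = distance_transform_edt(~a)   # distance of every pixel to a's ink
    dist_to_b = distance_transform_edt(~b)
    return dist_to_b[a].sum() + dist_to_a[b].sum()
```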

Negi et al (2002) proposed 'Non-linear Normalization to Improve Telugu OCR' [25], which selectively scales regions of low curvature in the glyphs, based on a dot-density feature normalization method.  The authors observed distortions in the shapes, but reported improvement in OCR recognition accuracy.  Performance across different fonts was not investigated.

Lakshmi & Patvardhan (2002) proposed 'A multi-font OCR system for printed Telugu text' [18].  Preprocessing stages such as binarization, noise removal, skew correction using the Hough transform, and line and word segmentation using horizontal and vertical projections are included in this work.  Basic symbols are obtained from each word using a connected-components approach.  After a preliminary classification as in the previous work, pixel gradient directions are chosen as the features.  Recognition is again done using the k-nearest neighbor algorithm on these feature vectors.  The training vectors are created with three different fonts and three different sizes: 25, 30 and 35.  Testing is done on characters of different sizes, and also with some different fonts.  A recognition accuracy of more than 92% for most of the images is claimed.

Negi and Nikhil (2003) proposed 'Localization and Extraction of Text in Telugu Document Images' [26].  The gradient magnitude of the image is computed to obtain contrasting regions.  After binarization and noise removal, a Hough transform for circles is applied to the gradient magnitude to detect the circular shapes that are a prominent feature of Telugu text.  Each detected circle is filled to obtain the regions of interest.  Recursive XY Cuts and projection profiles are used to segment the document image into paragraphs, lines, and words.

Bhagvati et al (2003) proposed 'On Developing High Accuracy OCR Systems for Telugu and other Indian Scripts' [9], suggesting several techniques for improving accuracy, such as identification of glyph position information and recognition of punctuation marks from width and height information.  The authors observed that handling confusion pairs of glyphs and touching characters would improve OCR recognition accuracy further.  Overall, a 25% improvement is expected from considering the above factors.

Lakshmi & Patvardhan (2003) proposed 'Optical Character Recognition of Basic Symbols in Printed Telugu Text' [17].  After obtaining the minimum bounding rectangle, each character (basic symbol) is resized to 36 columns while maintaining the original aspect ratio.  A preliminary classification is done by grouping all symbols of approximately the same height (number of rows).  The feature vector is computed from a set of seven invariant moments based on the second- and third-order moments.  Recognition is done using the k-nearest neighbor algorithm on these feature vectors.  A single font is used for both training and test data.  Testing is done on noisy character images with Gaussian, salt-and-pepper and speckle noise added.  Preprocessing such as line, word, and character segmentation is not addressed in this work.
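
The seven invariant moments referred to here are the classical Hu moments; a minimal sketch computing them with scikit-image (an illustrative toolchain, not necessarily the authors'):

```python
from skimage.measure import moments_central, moments_normalized, moments_hu

def hu_features(binary_char):
    """Seven translation-, scale- and rotation-invariant moment features."""
    mu = moments_central(binary_char.astype(float))
    nu = moments_normalized(mu)
    return moments_hu(nu)
```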

Lakshmi & Patvardhan (2003) proposed 'A High Accuracy OCR for Printed Telugu Text' [19].  Neural network classifiers and some additional logic are introduced in this work.  The feature vectors obtained from pixel gradient directions are used to train separate neural networks for each of the sets identified by the preliminary classification scheme.  Testing is done on the same three fonts used for training, but with different sizes.  A high recognition accuracy of 99% in most cases is reported for laser and deskjet quality prints.

Jawahar et al (2003) have proposed the ‘A Bilingual OCR for Hindi-Telugu Documents and its Applications’ [16].  It is based on Principal Component Analysis followed by support vector classification.  An overall accuracy of approximately 96.7% is reported. 

DRISHTI is a complete optical character recognition system for Telugu developed by the Resource Center for Indian Language Technology Solutions (RCILTS) at the University of Hyderabad (JLT, July 2003, pp. 110-113).  The techniques used in Drishti are as follows.  For binarization, three options are provided: global (the default), percentile-based, and an iterative method.  Skew detection and correction are done by maximizing the variance of the horizontal projection profile.  Text and graphics separation is done using the horizontal projection profile.  Multi-column text detection uses the Recursive X-Y Cuts technique, which recursively splits a document into rectangular regions using vertical and horizontal projection profiles alternately.  Word segmentation uses a combination of the Run-Length Smearing Algorithm (RLSA) and connected-component labeling.  Words are decomposed into glyphs by running the connected-component labeling algorithm again.  Recognition is based on template matching using fringe distance maps; the template with the best matching score is output as the recognized glyph.

Anuradha Srinivas et al (2007) proposed the 'Telugu Optical Character Recognition' [6] system for a single font.  Sauvola's algorithm is used for binarization; skew detection and correction are done by maximizing the variance of the horizontal projection profile.  Horizontal and vertical projection profiles are used to decompose the text document into lines, words and characters.  Zero-crossing features are computed, and Telugu characters are grouped into 11 groups based on these features.  A two-stage classifier is used: the first stage identifies the group number of the test character, and a minimum-distance classifier in the second stage identifies the character.  A recognition accuracy of 93.2% is reported.


During my literature survey, I identified the following areas of the recognition process that remain incomplete.  The main objectives of my work are (a) font-independent OCR, (b) easy adaptation across languages, and (c) scope for extension to handwritten documents.

·       The first is that some components are consistently misrecognized.  One such component, along with its incorrectly identified template, is shown in Figure 4.  It may be seen that the two shapes resemble each other, differing only in the region at the bottom-left corner.

By performing matching and/or spell-checking, we can improve recognition accuracy.  While performing these operations, thinning the input character prior to recognition is much more helpful than increasing its thickness.  Similarly, several errors result from voththus, which are smaller in size than other components.  Certain voththus are fairly complex (see Figure 1), and non-linear scaling techniques may handle this problem.

·       Results from many OCR systems show that recognition fails mainly for symbols that are very similar to each other.  Recognition rates are therefore lower on images in which such symbols abound.

·       To conclude, the design, approach and implementation will be driven by the need for a practical OCR system for Telugu.  A study of the OCR performance on various Telugu literary and cultural works is in progress.






References
[1] Aparna K G, Ramakrishnan A G 2002 A complete Tamil Optical Character Recognition System. 5th International Workshop on Document Analysis Systems (DAS 2002), Princeton, NJ, USA, pp. 53-57.
[2] Aparna K G, Ramakrishnan A G 2001 Tamil Gnani – an OCR on Windows. Proc. Tamil Internet 2001, Kuala Lumpur, pp. 60-63.
[3] Anbumani, Subramanian 2000 Optical Character Recognition of Printed Tamil Characters. Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg.
[4] Antani Sameer, Lalitha Agnihotri 1999 Gujarati Character Recognition. Fifth International Conference on Document Analysis and Recognition (ICDAR'99), p. 418.
[5] Anuradha B, Koteswarrao B 2006 An efficient binarization technique for old documents. Proc. of International Conference on Systemics, Cybernetics, and Informatics (ICSCI 2006), Hyderabad, pp. 771-775.
[6] Anuradha Srinivas, Arun Agarwal, Rao C R 2007 Telugu Character Recognition. Proc. of International Conference on Systemics, Cybernetics, and Informatics, Hyderabad, pp. 654-659.
[7] Ashwin T V, Sastry P S 2002 A font and size-independent OCR system for printed Kannada documents using support vector machines. Sadhana, Vol. 27, Part 1, February 2002, pp. 35-58.
[8] Brown R L 1994 The fringe distance measure: an easily calculated image distance measure with recognition results comparable to Gaussian blurring. IEEE Trans. Systems, Man and Cybernetics, 24(1): 111-116.
[9] Chakravarthy Bhagvati, Ravi T, Kumar S M, Atul Negi 2003 On Developing High Accuracy OCR Systems for Telugu and other Indian Scripts. Proc. of Language Engineering Conference, Hyderabad, IEEE Computer Society Press, pp. 18-23.
[10] Chaudhuri B B, Pal U 1997 Skew Angle Detection of Digitized Indian Script Documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, February 1997.
[11] Chaudhuri B B, Pal U 1998 A complete printed Bangla OCR system. Pattern Recognition, Vol. 31, pp. 531-549.
[12] Chinnuswamy P, Krishnamoorty S G 1980 Recognition of hand-printed Tamil characters. Pattern Recognition, Vol. 12, Issue 3, pp. 141-152.
[13] Garain U, Chaudhuri B B 2002 Segmentation of Touching Characters in Printed Devnagari and Bangla Scripts using Fuzzy Multifactorial Analysis. IEEE Transactions on Systems, Man and Cybernetics, Part C, Vol. 32, No. 4, pp. 449-459.
[14] Gonzalez R C, Woods R E 2002 Digital Image Processing. New Jersey: Prentice-Hall.
[15] Jalal Uddin Mahmud, Mohammed Feroz Raihan, Chowdhury Mofizur Rahman 2003 A Complete OCR System for Continuous Bengali Characters. Conference on Convergent Technologies for Asia-Pacific Region (TENCON), Vol. 4, pp. 1372-1376.
[16] Jawahar C V, Pavan Kumar M N S S K, Ravi Kiran S S 2003 A Bilingual OCR for Hindi-Telugu Documents and its Applications. International Conference on Document Analysis and Recognition.
Journal of Language Technology, Vishwabharat@tdil, July 2003.
Journal of Language Technology, Vishwabharat@tdil, October 2003.
Journal of Language Technology, Vishwabharat@tdil, July 2004, pp. 53-54.
[17] Lakshmi C V, Patvardhan C 2003 Optical Character Recognition of Basic Symbols in Printed Telugu Text. IE(I) Journal-CP, Vol. 84, pp. 66-71.
[18] Lakshmi C V, Patvardhan C 2002 A multi-font OCR system for printed Telugu text. Proc. of Language Engineering Conference (LEC), Hyderabad, pp. 7-17.
[19] Lakshmi C V, Patvardhan C 2003 A high accuracy OCR for printed Telugu text. Conference on Convergent Technologies for Asia-Pacific Region (TENCON), Vol. 2, pp. 725-729.
[20] Lehal G S, Chandan Singh 2000 A Gurmukhi Script Recognition System. 15th International Conference on Pattern Recognition (ICPR'00), Vol. 2, p. 2557.
[21] Lehal G S, Chandan Singh 2002 A post-processor for Gurmukhi OCR. Sadhana, Vol. 27, Part 1, February, pp. 99-111.
[22] Mohanty S, Behera H K 2004 A Complete OCR Development System for Oriya Script. Proceedings of the Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur.
[23] Nagy G, Seth S, Vishwanathan M 1992 A prototype document image analysis system for technical journals. Computer, 25(7).
[24] Negi Atul, Chakravarthy Bhagvati, Krishna B 2001 An OCR system for Telugu. Proc. of 6th Int. Conf. on Document Analysis and Recognition, IEEE Computer Society Press, USA, pp. 1110-1114.
[25] Negi Atul, Chakravarthy Bhagvati, Suresh Kumar V V 2002 Non-linear Normalization to Improve Telugu OCR. Proc. of Indo-European Conf. on Multilingual Communication Technologies, Tata McGraw Hill, New Delhi, pp. 45-57.
[26] Negi Atul, Nikhil Kasinadhuni 2003 Localization and Extraction of Text in Telugu Document Images. Conference on Convergent Technologies for Asia-Pacific Region (TENCON), pp. 749-752.
[27] Pal U, Chaudhuri B B 2004 Indian script character recognition: a survey. Pattern Recognition 37, pp. 1887-1899.
[28] Pal U, Chaudhuri B B 1997 Printed Devnagari Script OCR System. Vivek, Vol. 10, pp. 12-24.
[29] Parvati Iyer, Abhipsita Singh, Sanyal S 2005 Optical Character Recognition System for Noisy Images in Devnagari Script. UDL Workshop on Optical Character Recognition with Workflow and Document Summarization.
[30] Pujari Arun K, Dhanunjaya Naidu C, Jinaga B C 2002 An Adaptive Character Recognizer for Telugu Scripts using Multiresolution Analysis and Associative Memory. ICVGIP, Ahmedabad.
[31] Rajasekaran S N S, Deekshatulu B L 1977 Recognition of printed Telugu characters. Computer Graphics and Image Processing, 6, pp. 335-360.
[32] Rangachar Kasturi, Lawrence O'Gorman, Venu Govindaraju 2002 Document image analysis: A primer. Sadhana, Vol. 27, Part 1, pp. 3-22.
[33] Rao P V S, Ajitha T M 1995 Telugu Script Recognition - a Feature Based Approach. Proc. of ICDAR, IEEE, pp. 323-326.
[34] Ray K, Chatterjee B 1984 Design of a nearest neighbour classifier system for Bengali character recognition. Journal of Inst. Electronics Telecom. Eng., 30, pp. 226-229.
[35] Seethalakshmi R, Sreeranjani T R, Balachandar T, Abnikant Singh, Markandey Singh, Ritwaj Ratan, Sarvesh Kumar 2005 Optical Character Recognition for printed Tamil text using Unicode. Journal of Zhejiang University SCIENCE, 6A(11), pp. 1297-1305.
[36] Sinha R K, Mahabala 1979 Machine recognition of Devnagari script. IEEE Trans. Systems, Man and Cybernetics, pp. 435-441.
[37] Siromony G, Chandrasekaran R, Chandrasekaran M 1978 Computer recognition of printed Tamil characters. Pattern Recognition, Vol. 10, pp. 243-247.
[38] Sukhaswami R, Seetharamulu P, Pujari A K 1995 Recognition of Telugu characters using Neural Networks. Int. J. Neural Syst., Vol. 6, pp. 317-357.
[39] Veena Bansal 1999 Integrating knowledge sources in Devnagari text recognition. Ph.D. Thesis, IIT Kanpur.
[40] Wong K, Casey R, Wahl F 1982 Document analysis system. IBM Journal of Research and Development, Vol. 26(6).
[41] Vasantha Lakshmi C, Patvardhan C, Mohit Prasad 2004 A novel approach for improving recognition accuracies in OCR of printed Telugu text.
[42] Atul Negi, Sowri V S R, Mohan Rao K 2004 Document processing methods for Telugu and other South East Asian scripts.
[43] Anitha Jayaraman, Chandra Sekhar C, Srinivasa Chakravarthy V 2007 Modular Approach to Recognition of Strokes in Telugu Script.