Development of an
Efficient OCR System for Telugu Script

INTRODUCTION
Optical character recognition (OCR) is the automatic reading of optically sensed document text in order to translate human-readable characters into machine-readable codes. Today, reasonably efficient and inexpensive OCR packages are commercially available to recognize printed text in widely used languages such as English, Chinese and Japanese. These systems can process typewritten or printed documents, and they can recognize characters of different fonts and sizes as well as different formats, including intermixed text and graphics. While a large body of literature is available on the recognition of Roman, Chinese and Japanese characters, relatively little work has been reported on the recognition of Indian language scripts, and research on OCR for Indian scripts remains a challenging task. Recent work on the development of such packages for Indian languages is reported in the Proceedings of the International Conference on Document Analysis and Recognition, 1999.
A script is a system of characters used for writing or printing a natural language. The Latin script is used for Western European languages, Devanagari for Hindi, and the Telugu, Kannada and Tamil scripts for Dravidian languages. Devanagari is the most widely used Indian script; it serves as the writing system for over 28 languages including Sanskrit, Hindi, Kashmiri, Marathi and Nepali. The Telugu script is the second most widely used script in India. Indian scripts make no distinction between upper and lower case; the concept of capitalization does not exist. Many names are also common nouns, and Indian names are more diverse. Machine processing for document categorization demands that a relation be established between the coded sequence of characters and human perception of the language.
Telugu is the oldest and principal language of the eastern part of the Indian peninsula from Madras to Bengal, i.e., the southern Indian state of Andhra Pradesh. Telugu is spoken by more than 100 million people, mainly in South India. It ranks among the 13th to 17th most widely spoken languages in the world, alongside Korean, Vietnamese, Tamil and Marathi. The distribution of spoken languages in India is geographic, and each state of the country usually speaks a different language (apart from the large number of Hindi-speaking states). Andhra Pradesh, where Telugu is spoken, shares borders with five states in which Tamil, Kannada, Marathi, Hindi and Oriya are spoken. Thus, in regions along the borders with these states, the dialect of Telugu differs, although the script and the formal (written) language are the same. The language has had a rich literature since 1120 AD and has been studied extensively by native and foreign linguists. However, under pressure from the use of Hindi as a national language and English as a global language, chinks are beginning to appear in the Telugu fabric, and concerns are being expressed about the functional survival of Telugu. The language has not benefited significantly from recent advances in computational approaches to linguistic or statistical processing of natural language texts.
The language is called Telugu or Tenugu. Formerly, Europeans often called it Gentoo, a corruption of the Portuguese gentio. Another name is Andhra, which is used in the Aitareya Brahmana to denote Indian people. The word Telugu is supposed to be a corruption of the Sanskrit Trilinga, explained as meaning 'the country of the three lingas': a tradition is quoted according to which Siva, in the form of a linga, descended upon the three mountains Kaleswara, Srisaila and Bhimeswara, and these mountains marked the boundaries of the Telugu country. According to linguists, Telugu is a Dravidian language; it does not belong to the Indo-Aryan family to which Hindi, Sanskrit, Latin and Greek belong. Telugu has a recorded history from the 6th century A.D. and a fine literature up to the end of the 19th century. It is a syllabic language: each symbol in the Telugu script represents a complete syllable, leaving very little scope for confusion or spelling problems. The language is usually written using the Telugu alphabet, a Brahmic script; it is phonetic in nature and is written from left to right.
In Telugu script, individual characters are written separately. Telugu represents one major class of scripts in India: scripts without a shirorekha (the head bar that joins all the characters in a word), a class that also includes Malayalam, Kannada and Tamil. Character separation prior to recognition is therefore not needed in Telugu. These two features are common to all four south Indian languages; in fact, the absence of the shirorekha is a feature of other scripts such as Oriya and Gujarati too.
Telugu has a complex orthography with a large number of distinct character shapes (estimated to be of the order of 10,000) composed of simple and compound characters formed from 52 letters: 16 vowels (called achchus) that represent the basic vowel sounds and 36 consonants (called hallus). In addition, several semi-vowel symbols, called maatras, are used in conjunction with hallus, and half-consonants, called voththus, are used in consonant clusters. Achchus, hallus, maatras and voththus together provide roughly 100 basic orthographic units that are combined in different ways to represent all the frequently used syllables (estimated at between 5,000 and 10,000) in the language. Here, we refer to these basic orthographic units as glyphs (see Figure 1 for the complete listing).
A glyph refers to a single connected
component within a character. A simple
character is one containing only a single glyph (the first two characters in
Figure 2). A compound character is one
containing more than one glyph (the last three characters in Figure 2).
Differences from English text are strikingly apparent in the composition of the various characters. While written words and sentences run linearly left to right, the maatras and voththus are placed above and below the hallus. In Figure 2, the first two characters are examples of an achchu and a hallu. The third character shows a maatra placed above the hallu (the second character), and the fourth shows a voththu in addition to the maatra. The last is a more complex example, with a maatra and two voththus placed above and below a consonant. The number of glyphs is the sum of the achchus (16), the hallus (36), the maatras (16) and the voththus (36), i.e., 104. In actual practice not all voththus are used, so the total is under 100. Adding the glyphs for punctuation, special symbols and Telugu numerals, the total is approximately 120.
As a general rule, when a vowel appears at the beginning of a word, the full vowel character is displayed. When a vowel appears inside a word, a vowel modifier is added to the preceding consonant, creating a composite character. The number of combinations possible here to form a composite character is very large. A character can also have more than one consonant modifier (as in stree). Word-processing with such a complex script using the QWERTY keyboard can be a harrowing experience, given the cumbersome keyboard mappings the user has to learn.
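This composition can be seen directly at the encoding level. The systems surveyed later output ISCII, but Unicode encodes the same compositional structure with standard Telugu code points and is convenient for a short illustration; the following Python sketch is purely illustrative and not part of any surveyed system.

    # Telugu syllable composition at the encoding level (Unicode shown
    # for illustration; the surveyed systems output ISCII instead).
    VOWEL_II = "\u0C08"       # independent vowel (achchu) 'ii'
    KA = "\u0C15"             # consonant (hallu) 'ka'
    VOWEL_SIGN_II = "\u0C40"  # dependent vowel sign (maatra) 'ii'
    VIRAMA = "\u0C4D"         # suppresses the inherent vowel, forming clusters
    SA, TA, RA = "\u0C38", "\u0C24", "\u0C30"

    print(VOWEL_II)            # word-initial vowel: full vowel character
    print(KA + VOWEL_SIGN_II)  # vowel inside a word: maatra joins the consonant
    # consonant cluster 'stree': voththus are stacked via the virama
    print(SA + VIRAMA + TA + VIRAMA + RA + VOWEL_SIGN_II)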
One of the most difficult tasks in a Telugu OCR system is to put the disjoint glyphs back together into characters. Relying on character spacing, or the distances between glyph locations, is extremely error-prone because of wide variations. Sometimes certain vowel or consonant modifiers are not picked up because they lie too far from their base character; sometimes they are associated with the wrong base character; and in some cases they simply remain unconnected.
Optical
character recognition is the recognition of printed or written text by a
computer. This involves photo scanning
of the text, which converts the paper document into an image, and then
translation of the text image into character codes such as ASCII. Any OCR implementation consists of a number
of preprocessing steps followed by the actual recognition. Most document analysis systems can be visualized
as consisting of three steps: the
preprocessor, the feature extractor and the recognizer. In preprocessing, the raw image obtained by
scanning a page of text is converted to a form acceptable to the recognizer by
extracting individually recognizable symbols. This step is also called symbol segmentation. The number and types of preprocessing
algorithms employed on the scanned image depend on many factors such as age of
the document, paper quality, resolution of the scanned image, the amount of
skew in the image, the format and layout of the images and text, the kind of
script used and also on the type of characters - printed or handwritten
(Anbumani & Subramanian 2000). Typical
preprocessing includes the following stages:
1. Binarization
2. Noise removal
3. Thinning
4. Skew detection and correction
5. Line segmentation
6. Word segmentation
7. Character segmentation
The preprocessed image of the symbol is further processed to obtain meaningful elements called features. Recognition is completed by searching a database of stored feature vectors of all possible symbols for the one that matches the feature vector of the symbol to be recognized. Recognition consists of
1. Feature extraction
2. Feature selection
3. Classification
Binarization: Binarization is a technique by which a gray-scale image is converted to a binary image. In any image analysis or enhancement problem, it is essential to separate the objects of interest from the rest; binarization separates the foreground (text) from the background. The most common method is to select a proper threshold for the intensity of the image and then convert all intensity values above the threshold to one intensity value (for example, "white") and all intensity values below the threshold to the other chosen intensity ("black"). Binarization is performed either globally or locally. Global methods apply one threshold value to the entire image; local or adaptive thresholding methods apply different threshold values to different regions of the image, determined by the neighborhood of the pixel to which the thresholding is being applied. Several binarization techniques are discussed in Anuradha & Koteswarrao (2006).
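As a minimal sketch of the two families (with a mean threshold standing in for principled global methods such as Otsu's, and with illustrative window and offset parameters), global and local binarization might look as follows in Python:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def binarize_global(gray, threshold=None):
        # One threshold for the whole image; the mean intensity is a
        # crude default stand-in for methods such as Otsu's.
        if threshold is None:
            threshold = gray.mean()
        return (gray > threshold).astype(np.uint8)  # 1 = white, 0 = black

    def binarize_local(gray, window=15, offset=10):
        # Adaptive thresholding: each pixel is compared against the mean
        # of its own window x window neighborhood, so unevenly lit
        # regions receive different effective thresholds.
        local_mean = uniform_filter(gray.astype(float), size=window)
        return (gray > local_mean - offset).astype(np.uint8)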
Noise Removal: Scanned documents often contain noise arising from the printer, the scanner, print quality, the age of the document, etc. It is therefore necessary to filter this noise before processing the image. The common approach is to low-pass filter the image and use the result for later processing. The objective in designing a noise-reduction filter is to remove as much of the noise as possible while retaining the entire signal (Rangachar et al 2002).
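One common concrete choice, sketched below, is a median filter rather than plain averaging: it suppresses isolated salt-and-pepper speckle while preserving stroke edges (the 3x3 window size is an illustrative default):

    from scipy.ndimage import median_filter

    def denoise(image, size=3):
        # Replace each pixel with the median of its size x size
        # neighborhood; lone noise pixels vanish, edges survive.
        return median_filter(image, size=size)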
Thinning: Thinning, or skeletonization, is a process by which a one-pixel-wide representation (the skeleton) of an object is obtained while preserving the connectedness of the object and its end points (Gonzalez & Woods 2002). The purpose of thinning is to reduce the image components to their essential information so that further analysis and recognition are facilitated; it enables easier subsequent detection of pertinent features. A number of thinning algorithms have been proposed and are in use; the most common is the classical Hilditch algorithm (Rangachar et al 2002) and its variants.
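A sketch of this step, using scikit-image's skeletonize (a Zhang-Suen-style algorithm) as a readily available stand-in for the Hilditch algorithm mentioned above:

    import numpy as np
    from skimage.morphology import skeletonize

    def thin(binary_image):
        # Expects ink pixels as 1/True; returns a one-pixel-wide
        # skeleton that preserves connectivity and end points.
        return skeletonize(binary_image.astype(bool)).astype(np.uint8)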
Skew detection and correction: When a document is fed to the
scanner either mechanically or by a human operator, a few degrees of skew (tilt) are unavoidable. Skew angle is the angle that the lines of text
in the digital image make with the horizontal direction.
There exist many techniques for skew estimation. One technique is based on the projection profile of the document; another class of approaches is based on nearest-neighbor clustering of connected components. Techniques based on the Hough transform and the Fourier transform are also employed. A survey of different skew correction techniques can be found in Chaudhuri & Pal (1997). A popular method for skew detection employs the projection profile. A horizontal projection profile is a one-dimensional array in which each element holds the number of black pixels along a row of the image. For a document whose text lines span horizontally, the horizontal projection profile has peaks whose widths equal the character height and valleys whose widths equal the spacing between lines. At the correct skew angle, since scan lines are aligned to text lines, the projection profile has maximum-height peaks for text and deep valleys for line spacing.
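A minimal sketch of this projection-profile search: rotate the image through candidate angles and keep the angle at which the profile is sharpest, using the variance of the row sums as the sharpness measure (the angle range, step size and black-on-white convention are assumptions):

    import numpy as np
    from scipy.ndimage import rotate

    def estimate_skew(binary_image, max_angle=5.0, step=0.25):
        # At the true skew angle, scan lines align with text lines, so
        # row sums show tall peaks and deep valleys; profile variance
        # is a simple measure of that peakedness.
        ink = (binary_image == 0).astype(float)  # assume 0 = black text
        best_angle, best_score = 0.0, -1.0
        for angle in np.arange(-max_angle, max_angle + step, step):
            rotated = rotate(ink, angle, reshape=False, order=0)
            score = rotated.sum(axis=1).var()    # variance of row sums
            if score > best_score:
                best_angle, best_score = angle, score
        return best_angle  # deskew by rotating through -best_angle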
Line, word, and character segmentation: After the tilt is corrected, the text has to be segmented first into lines, each line into words, and finally each word into its constituent characters. An error in segmentation may lead to wrong recognition of the text and may render the system useless. Recognition of text depends heavily on proper segmentation into lines, words and then individual characters or sub-characters for feature extraction and classification.
Horizontal projection of a
document image is most commonly employed to extract the lines from the
document. If the lines are well
separated, and are not tilted, the horizontal projection will have separated
peaks and valleys, which serve as the separators of the text lines. These valleys are easily detected and used to
determine the location of boundaries between lines.
Similarly, a vertical projection profile gives the column sums. One can separate lines by looking for minima in the horizontal projection profile of the page, and then separate words by looking for minima in the vertical projection profile of a single line. Valleys in the vertical projection of a line image can be used to extract the words in the line, as well as to extract individual characters from a word. For example, a line consisting of four words can be split at the zero-valued valleys of its vertical projection profile, and each word can in turn be segmented into its constituent characters. However, overlapping and adjacent characters in a word (called kerned characters) cannot be separated by zero-valued valleys in the vertical projection profile, and special techniques have to be employed to solve this problem.
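A minimal sketch of this valley-based segmentation (assuming a clean, deskewed image with black text as 0, and without the special handling kerned characters require):

    import numpy as np

    def segment_runs(profile):
        # (start, end) pairs of consecutive non-zero runs in a
        # projection profile; the zero-valued valleys separate them.
        runs, start = [], None
        for i, value in enumerate(profile):
            if value > 0 and start is None:
                start = i
            elif value == 0 and start is not None:
                runs.append((start, i))
                start = None
        if start is not None:
            runs.append((start, len(profile)))
        return runs

    def segment_page(binary_image):
        # Lines from the horizontal profile of the page, then words
        # from the vertical profile of each line strip.
        ink = (binary_image == 0).astype(int)   # assume 0 = black text
        lines = segment_runs(ink.sum(axis=1))   # row sums -> text lines
        words = []
        for top, bottom in lines:
            strip = ink[top:bottom, :]
            for left, right in segment_runs(strip.sum(axis=0)):
                words.append((top, bottom, left, right))
        return lines, words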
The character segmentation problem in Telugu script is more complex than in Roman scripts. This is due to (i) the large number of composites produced by neighboring characters touching each other; (ii) those produced by characters touching voththus and maatras; (iii) the fact that voththus may be placed in a large number of positions relative to their base characters; and (iv) voththus of an upper line possibly touching maatras of the line below. All of these lead to many situations in which touching characters occur in a Telugu document.
Feature extraction and selection: Feature extraction can be
considered as finding a set of parameters (features) that define the shape of
the underlying character as precisely and uniquely as possible. The features have to be selected in such a
way that they help in discriminating between characters. Thinned data is analyzed to detect features
such as straight lines, curves, and significant points along the curves.
Feature selection approaches try to find a subset of the original features. The strategies used for OCR can be broadly classified into three categories:
1. Statistical approach
2. Syntactic/structural approach
3. Hybrid approach
In the statistical approach, a pattern is represented as a vector: an ordered, fixed-length list of numeric features. Many samples of a pattern are used to collect statistics; this phase is known as the training phase, and its objective is to expose the system to the natural variants of a character. The recognition process then uses these statistics to identify an unknown character. Features derived from the statistical distribution of points include geometrical moments and black-to-white crossings.
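Sketches of the two feature types just named, computed from a binary glyph image (1 = ink); the exact moment set and normalization vary between systems, so these are illustrative choices:

    import numpy as np

    def geometric_moments(glyph):
        # Mass, centroid and second-order central moments: a small,
        # fixed-length numeric feature vector.
        ys, xs = np.nonzero(glyph)
        cx, cy = xs.mean(), ys.mean()
        return np.array([len(xs), cx, cy,
                         ((xs - cx) ** 2).mean(),
                         ((ys - cy) ** 2).mean(),
                         ((xs - cx) * (ys - cy)).mean()])

    def crossing_counts(glyph):
        # Black-to-white crossings per row and per column: how often
        # the pixel value flips, a crude stroke-complexity feature.
        g = glyph.astype(np.int8)
        return np.concatenate([np.abs(np.diff(g, axis=1)).sum(axis=1),
                               np.abs(np.diff(g, axis=0)).sum(axis=0)])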
Structural classification methods
utilize structural features and decision rules to classify characters. Structural features may be defined in terms of
character strokes, character holes, end points, loops or other character
attributes such as concavities. The
classifier is expected to recognize the natural variants of a character but
discriminate between similar looking characters such as O and Q, c and e, etc.
The statistical and structural approaches both have their advantages and disadvantages. Statistical features are more tolerant to noise than structural descriptions, provided the sample space over which training is performed is representative and realistic; variation due to font or writing style, on the other hand, is more easily abstracted in structural descriptions. In the hybrid approach, the two are combined at appropriate stages, both for representing characters and for classifying unknown characters.
Classification: The classification stage in an OCR process assigns labels to character images based on the features extracted and the relationships among the features. In simple terms, it is this part of the OCR that finally recognizes individual characters and outputs them in machine-editable form.
Template matching is one of the most common and oldest classification methods. In template matching, individual image pixels are used as features. Classification is performed by comparing an input character image with a set of templates (or prototypes) from each character class; the template that matches the unknown most closely provides the recognition.
Classification strategies following feature extraction are mostly based on identifying the nearest neighbor, i.e., the stored feature vector at the smallest distance from the unknown. A distance measure between the vectors is used as the similarity between the images. Binary-tree classifiers and nearest-neighbor classifiers are the two most commonly used classifiers, and the most frequently used distance measure is the Euclidean distance.
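A minimal nearest-neighbor sketch of this step (the array layout and the label mapping are assumptions):

    import numpy as np

    def classify(feature_vec, templates, labels):
        # templates: (n_prototypes, n_features) array of stored feature
        # vectors; labels[i] is the character code of templates[i].
        # The prototype at the smallest Euclidean distance wins.
        dists = np.linalg.norm(templates - feature_vec, axis=1)
        return labels[int(np.argmin(dists))]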
The Challenges
Language-specific Issues: Telugu, like most Indian languages, has a complex script, in which a consonant can be modified by a vowel, a consonant and/or a diacritic. Due to this inherent complexity of the language's script and writing style, accurate segmentation and matching of words (and characters) is a very difficult task.
Issues in Scanning: Scanned document images contain a large number of artifacts, which are cleaned on a large scale in a semi-automatic process using various image processing operations. Owing to the variation in quality across the images, a single set of image processing parameters is not suitable for all of them. Consequently, the overall quality of the processed images is poor, making such words very difficult to match and recognize.
Scalability: The massiveness of digital library collections is a serious challenge for automating these processes. At this magnitude, even quick image processing routines require large amounts of time. Despite considerable optimization, the computation required is enormous, and the processing has to be distributed over a cluster of computers. Managing such a cluster and transferring large amounts of data across the network were some of the major bottlenecks in system development.
Applications and Extensions
Research in OCR is popular because of its application potential in many areas. Some practical applications of OCR are: reading aids for the blind, preservation of old/historical documents in electronic format, desktop publishing, library cataloging, ledgering, automatic reading for sorting of postal mail, bank cheques and other documents, etc. Research in the area of printed Telugu script recognition addresses many specific and generic applications that need an OCR engine. Our strength is in extending the recognizer to many real-life applications.
Post Processor: Component-level recognition results can be combined to obtain word-level output. At this stage, a post processor can be used to improve the recognition accuracy. We can use two distinct types of post-processing schemes. In an application such as a learning aid for illiterate people, there is a fixed and limited word set to be recognized; here we use post-processing based on the geometric information of components for certain confusing pairs. At this stage, some of the frequent misclassifications are also corrected using approximate string matching techniques.
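For the approximate string matching step, a standard choice is the Levenshtein edit distance; a minimal sketch follows (the fixed vocabulary is whatever word set the application defines):

    def edit_distance(a, b):
        # Minimum number of insertions, deletions and substitutions
        # turning string a into string b (dynamic programming, keeping
        # one row of the table at a time).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def correct(word, vocabulary):
        # Snap an OCR output word to the closest entry in the fixed,
        # limited word set of the application.
        return min(vocabulary, key=lambda v: edit_distance(word, v))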
Speech Recognition System: The speech recognizer has been developed by collecting speech samples of the Telugu alphabet, including vowels, consonants, consonant-vowel combinations, consonant-vowel-consonant units, two- and three-letter words, and simple sentences, from 16 speakers, both male and female.
Document Database System: An important application of OCR is converting printed documents into electronic form on which search and retrieval can be carried out. Indian-language document databases are practically non-existent today, primarily because of the lack of good OCR systems. We can integrate the OCR engine with a page segmentation procedure for the archival of printed pages. Wherever the OCR fails to recognize the text, the relevant area is stored as an image; the rest of the text is converted into electronic form (ISCII), and the content and structure of the text block are stored.
Document Speaking System: Another application is a reading aid for visually impaired people. With the scarcity of printed material in Braille, it is becoming more and more difficult for such people to access information in the printed medium. The document reading system integrates a text-to-speech (TTS) system with the OCR to achieve this; a data-driven approach using example-based learning is the basis for the TTS.
Human-Computer Speech Interactive system: The human-computer speech interactive (HCSI) system consists of three modules: the Speech Recognition System (SRS), the Search Engine (SE), and the Text-to-Speech System (TTS). Each module plays an important role. The HCSI module passes the speaker's data to the SRS module; the SRS module recognizes the letter, word or sentence, depending on the input, and returns it as text to the HCSI module. The text is then passed to the search engine, which looks up the appropriate word or sentence in the speech database (SD) and in turn passes the text to the TTS, which generates a waveform. This is returned to the HCSI, which plays the generated wave file as speech output to the listener.
User Interface Design: The user interface (UI) is designed to be user-friendly; HTML pages are designed to display the Telugu alphabet.
Survey on Telugu OCR
Rajasekaran & Deekshatulu (1977) proposed 'Recognition of printed Telugu characters' [31] in Computer Graphics and Image Processing. This was the first reported work on OCR of Telugu characters. It identifies 50 primitive features and proposes a two-stage syntax-aided character recognition system. In the first stage, a knowledge-based search is used to recognize and remove the primitive shapes; in the second stage, the pattern obtained after the removal of primitives is coded by tracing along points on it, and classification is done by a decision tree. Primitives are joined and superimposed appropriately to define individual characters.
Rao & Ajitha (1995) proposed 'Telugu Script Recognition – a Feature Based Approach' [33]. The work makes use of the concept that Telugu characters are composed of circular segments of different radii. Recognition consists in segmenting the characters into their constituent components and identifying them. The feature set is chosen as the circular segments, which preserve the canonical shapes of Telugu characters. Recognition scores are reported as ranging from 78 to 90% across different subjects, and from 91 to 95% when the reference and test sets were from the same subject.
Sukhaswami et al (1995) proposed 'Recognition of Telugu characters using Neural Networks' [38]. A Hopfield neural network working as an associative memory was initially chosen for recognition. Due to the limited storage capacity of the Hopfield network, they later proposed a multiple neural network associative memory (MNNAM), in which the networks work on mutually disjoint sets of training patterns. They demonstrated that the storage shortage could be overcome by this scheme.
Pujari et al (2002) proposed 'An Adaptive Character Recognizer for Telugu Scripts using Multi-resolution Analysis and Associative Memory' [30]. Gray-level input text images are segmented into lines using horizontal projections, and vertical projections are used for word segmentation. Images are uniformly scaled to 32x32 using zero-padding. A wavelet representation with three levels of downsampling reduces a 32x32 image to a set of four 8x8 images, of which only the average image is considered for further processing. The 8x8 character images are converted to binary using the mean grey level as the threshold, and the resulting 64-bit string is used as the signature of the input symbol. A Hopfield-based dynamic neural network is designed for recognition. Performance across fonts and sizes is reported as varying from 93% to 95%. The authors report that the same system, when applied to English characters, yielded a very low recognition rate, since the directional features prevalent in Latin scripts are not preserved during signature computation with the wavelet transformation.
Negi et al (2001) proposed 'An OCR System for Telugu' [24]. Instead of segmenting words into characters, as is usually done, words are split into connected components (glyphs). The Run-Length Smearing Algorithm (RLSA) (Wong et al 1982) and Recursive XY Cuts (Nagy et al 1992) are used to segment the input document image into words. About 370 connected components (depending on the font) are identified as sufficient to compose all the characters, including punctuation marks and numerals. Template matching based on the fringe distance (Brown 1994) is used to measure the similarity between the input and each template; the template with the minimum fringe distance is taken as the recognized character. The template code of the recognized character is converted into ISCII, the Indian Script Code for Information Interchange. Raw OCR accuracy with no post-processing is reported as 92%; performance across fonts varied from 97.3% for the Hemalatha font to 70.1% for a newspaper font.
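The fringe distance can be sketched as follows; this is one common reading of Brown's measure (a symmetric distance-transform formulation), not necessarily the exact variant used in [24]:

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def fringe_map(glyph):
        # Each pixel holds its distance to the nearest ink pixel of the
        # glyph (ink pixels themselves get 0); glyph uses 1 = ink.
        return distance_transform_edt(glyph == 0)

    def fringe_distance(a, b):
        # Ink pixels of each image are penalized by how far they fall
        # from the ink of the other, so small misalignments cost little
        # compared with pixel-wise XOR matching.
        return fringe_map(b)[a == 1].sum() + fringe_map(a)[b == 1].sum()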
Negi et al (2002) proposed 'Non-linear Normalization to Improve Telugu OCR' [25]. Regions of low curvature in the glyphs are selectively scaled, based on a dot-density feature normalization method. The authors observed distortions in the shapes but reported improvement in OCR recognition accuracy. Performance across different fonts is not investigated.
Lakshmi & Patvardhan (2002) proposed 'A multi-font OCR system for printed Telugu text' [18]. Preprocessing stages such as binarization, noise removal, skew correction using the Hough transform, and line and word segmentation using horizontal and vertical projections are included in this work. Basic symbols are extracted from each word using a connected-components approach. After a preliminary classification as in the previous work, pixel gradient directions are chosen as the features, and recognition is again done using the k-nearest-neighbor algorithm on these feature vectors. The training vectors are created with three different fonts and three different sizes: 25, 30 and 35. Testing is done on characters of different sizes, and also on some different fonts. A recognition accuracy of more than 92% for most of the images is claimed.
Negi and Nikhil (2003) proposed 'Localization and Extraction of Text in Telugu Document Images' [26]. The gradient magnitude of the image is computed to obtain contrasting regions. After binarization and noise removal, the Hough transform for circles is applied to the gradient magnitude image to detect the circular strokes that are a prominent feature of Telugu text. Each detected circle is filled to obtain the regions of interest. Recursive XY cuts and projection profiles are then used to segment the document image into paragraphs, lines, and words.
Bhagvati et al (2003) proposed 'On Developing High Accuracy OCR Systems for Telugu and other Indian Scripts' [9]. Several techniques are suggested, among them the use of glyph position information and the recognition of punctuation marks from width and height information. The authors observed that handling confusion pairs of glyphs and touching characters would further improve OCR recognition accuracy. Overall, a 25% improvement is expected from considering the above factors.
Lakshmi & Patvardhan (2003) proposed 'Optical Character Recognition of Basic Symbols in Printed Telugu Text' [17]. After obtaining the minimum bounding rectangle, each character (basic symbol) is resized to 36 columns while maintaining the original aspect ratio. A preliminary classification is done by grouping all symbols of approximately the same height (number of rows). A feature vector is computed from a set of seven invariant moments based on the second- and third-order moments, and recognition is done using the k-nearest-neighbor algorithm on these feature vectors. A single font is used for both training and test data. Testing is done on noisy character images with Gaussian, salt-and-pepper and speckle noise added. Preprocessing such as line, word and character segmentation is not addressed in this work.
Lakshmi & Patvardhan (2003) proposed 'A High Accuracy OCR for Printed Telugu Text' [19]. Neural network classifiers and some additional logic are introduced in this work. The feature vectors obtained from pixel gradient directions are used to train a separate neural network for each of the sets identified by the preliminary classification scheme. Testing is done on the same three fonts used for training, but with different sizes. A high recognition accuracy of 99% in most cases is reported for laser- and deskjet-quality prints.
Jawahar et al (2003) proposed 'A Bilingual OCR for Hindi-Telugu Documents and its Applications' [16]. It is based on principal component analysis followed by support vector classification. An overall accuracy of approximately 96.7% is reported.
DRISHTI is a complete optical character recognition system for Telugu developed by the Resource Centre for Indian Language Technology Solutions (RCILTS) at the University of Hyderabad (JLT, July 2003, pp. 110-113). The techniques used in DRISHTI are as follows. For binarization, three options are provided: global (the default), percentile-based, and an iterative method. Skew detection and correction are done by maximizing the variance of the horizontal projection profile. Text and graphics separation is done using the horizontal projection profile. Multi-column text detection uses the Recursive X-Y Cuts technique, which recursively splits a document into rectangular regions using vertical and horizontal projection profiles alternately. Word segmentation uses a combination of the Run-Length Smearing Algorithm (RLSA) and connected-component labeling, and words are decomposed into glyphs by running the connected-component labeling algorithm again. Recognition is based on template matching using fringe distance maps; the template with the best matching score is output as the recognized glyph.
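A minimal sketch of the Recursive X-Y Cuts idea as described above (splitting at zero-valleys of alternating projection profiles; the min_gap valley width is an illustrative parameter):

    import numpy as np

    def zero_valleys(profile, min_gap):
        # (start, end) spans where the profile is zero for >= min_gap.
        spans, start = [], None
        for i, v in enumerate(profile):
            if v == 0 and start is None:
                start = i
            elif v != 0 and start is not None:
                if i - start >= min_gap:
                    spans.append((start, i))
                start = None
        if start is not None and len(profile) - start >= min_gap:
            spans.append((start, len(profile)))
        return spans

    def xy_cut(ink, min_gap=10, horizontal=True, tried_other=False):
        # Split the block at wide valleys of one projection profile,
        # then recurse on each piece with the other profile; a block
        # with no wide valley in either direction is a leaf region.
        profile = ink.sum(axis=1 if horizontal else 0)
        valleys = zero_valleys(profile, min_gap)
        if not valleys:
            if tried_other:
                return [(0, ink.shape[0], 0, ink.shape[1])]
            return xy_cut(ink, min_gap, not horizontal, tried_other=True)
        starts = [0] + [end for _, end in valleys]
        ends = [start for start, _ in valleys] + [len(profile)]
        blocks = []
        for s, e in zip(starts, ends):
            if e <= s:
                continue
            piece = ink[s:e, :] if horizontal else ink[:, s:e]
            for r0, r1, c0, c1 in xy_cut(piece, min_gap, not horizontal):
                blocks.append((r0 + s, r1 + s, c0, c1) if horizontal
                              else (r0, r1, c0 + s, c1 + s))
        return blocks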
Anuradha Srinivas et al (2007) proposed the 'Telugu Optical Character Recognition' [6] system for a single font. Sauvola's algorithm is used for binarization; skew detection and correction are done by maximizing the variance of the horizontal projection profile. Horizontal and vertical projection profiles are used to decompose the text document into lines, words and characters. Zero-crossing features are computed, and Telugu characters are grouped into 11 groups based on these features. A two-stage classifier is used: the first stage identifies the group number of the test character, and a minimum-distance classifier in the second stage identifies the character. A recognition accuracy of 93.2% is reported.
During my literature survey, I identified the following areas in which the recognition process is still incomplete. The main objectives of my work are (a) font-independent OCR, (b) easy adaptation across languages, and (c) scope for extension to handwritten documents.
· The first is that some components are consistently mis-recognized. One such component, along with its incorrectly identified template, is shown in Figure 4. It may be seen that the two shapes resemble each other, differing only in the region at the bottom-left corner.
By performing matching and/or spell-checking, we can improve the accuracy of recognition. While performing these operations, thinning the input character prior to recognition will be much more helpful than increasing its thickness. Similarly, several errors result from voththus, which are smaller than other components. Certain voththus are fairly complex (see Figure 1), and non-linear scaling techniques may handle this problem.
· The results of many OCR systems show that recognition fails mainly on symbols that are very similar to each other. The recognition rate is therefore lower on images in which such symbols are abundant.
· To conclude, the design, approach and implementation will be driven by the need for a practical OCR system for Telugu. A study of OCR performance on various Telugu literary and cultural works is in progress.
References
[1] Aparna K G, Ramakrishnan A G 2002 A complete Tamil optical character recognition system. 5th International Workshop on Document Analysis Systems (DAS 2002), Princeton, NJ, USA, pp. 53-57.
[2] Aparna K G, Ramakrishnan A G 2001 Tamil Gnani – an OCR on Windows. Proc. Tamil Internet 2001, Kuala Lumpur, pp. 60-63.
[3] Anbumani, Subramanian 2000 Optical character recognition of printed Tamil characters. Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg.
[4] Antani Sameer, Lalitha Agnihotri 1999 Gujarati character recognition. Fifth International Conference on Document Analysis and Recognition (ICDAR'99), p. 418.
[5] Anuradha B, Koteswarrao B 2006 An efficient binarization technique for old documents. Proc. of International Conference on Systemics, Cybernetics and Informatics (ICSCI 2006), Hyderabad, pp. 771-775.
[6] Anuradha Srinivas, Arun Agarwal, Rao C R 2007 Telugu character recognition. Proc. of International Conference on Systemics, Cybernetics and Informatics, Hyderabad, pp. 654-659.
[7] Ashwin T V, Sastry P S 2002 A font and size-independent OCR system for printed Kannada documents using support vector machines. Sadhana, Vol. 27, Part 1, February 2002, pp. 35-58.
[8] Brown R L 1994 The fringe distance measure: an easily calculated image distance measure with recognition results comparable to Gaussian blurring. IEEE Trans. Systems, Man and Cybernetics, 24(1): 111-116.
[9] Chakravarthy Bhagvati, Ravi T, Kumar S M, Atul Negi 2003 On developing high accuracy OCR systems for Telugu and other Indian scripts. Proc. of Language Engineering Conference, Hyderabad, IEEE Computer Society Press, pp. 18-23.
[10] Chaudhuri B B, Pal U 1997 Skew angle detection of digitized Indian script documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, February 1997.
[11] Chaudhuri B B, Pal U 1998 A complete printed Bangla OCR system. Pattern Recognition, Vol. 31, pp. 531-549.
[12] Chinnuswamy P, Krishnamoorty S G 1980 Recognition of hand-printed Tamil characters. Pattern Recognition, Vol. 12, Issue 3, pp. 141-152.
[13] Garain U, Chaudhuri B B 2002 Segmentation of touching characters in printed Devnagari and Bangla scripts using fuzzy multifactorial analysis. IEEE Transactions on Systems, Man and Cybernetics, Part C, Vol. 32, No. 4, pp. 449-459.
[14] Gonzalez R C, Woods R E 2002 Digital image processing. (New Jersey: Prentice-Hall)
[15] Jalal Uddin Mahmud, Mohammed Feroz Raihan, Chowdhury Mofizur Rahman 2003 A complete OCR system for continuous Bengali characters. Conference on Convergent Technologies for Asia-Pacific Region (TENCON), Vol. 4, pp. 1372-1376.
[16] Jawahar C V, Pavan Kumar M N S S K, Ravi Kiran S S 2003 A bilingual OCR for Hindi-Telugu documents and its applications. International Conference on Document Analysis and Recognition.
Journal of Language Technology, Vishwabharat@tdil, July 2003.
Journal of Language Technology, Vishwabharat@tdil, October 2003.
Journal of Language Technology, Vishwabharat@tdil, July 2004.
Journal of Language Technology, Vishwabharat@tdil, July 2004, pp. 53-54.
[17] Lakshmi C V, Patvardhan C 2003 Optical character recognition of basic symbols in printed Telugu text. IE(I) Journal-CP, Vol. 84, pp. 66-71.
[18] Lakshmi C V, Patvardhan C 2002 A multi-font OCR system for printed Telugu text. Proc. of Language Engineering Conference (LEC), Hyderabad, pp. 7-17.
[19] Lakshmi C V, Patvardhan C 2003 A high accuracy OCR for printed Telugu text. Conference on Convergent Technologies for Asia-Pacific Region (TENCON), Vol. 2, pp. 725-729.
[20] Lehal G S, Chandan Singh 2000 A Gurmukhi script recognition system. 15th International Conference on Pattern Recognition (ICPR'00), Vol. 2, p. 2557.
[21] Lehal G S, Chandan Singh 2002 A post-processor for Gurmukhi OCR. Sadhana, Vol. 27, Part 1, February, pp. 99-111.
[22] Mohanty S, Behera H K 2004 A complete OCR development system for Oriya script. Proceedings of Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur.
[23] Nagy G, Seth S, Vishwanathan M 1992 A prototype document image analysis system for technical journals. Computer, 25(7).
[24] Negi Atul, Chakravarthy Bhagvati, Krishna B 2001 An OCR system for Telugu. Proc. of 6th Int. Conf. on Document Analysis and Recognition, IEEE Computer Society Press, USA, pp. 1110-1114.
[25] Negi Atul, Chakravarthy Bhagvati, Suresh Kumar V V 2002 Non-linear normalization to improve Telugu OCR. Proc. of Indo-European Conf. on Multilingual Communication Technologies, Tata McGraw-Hill, New Delhi, pp. 45-57.
[26] Negi Atul, Nikhil Kasinadhuni 2003 Localization and extraction of text in Telugu document images. Conference on Convergent Technologies for Asia-Pacific Region (TENCON), pp. 749-752.
[27] Pal U, Chaudhuri B B 2004 Indian script character recognition: a survey. Pattern Recognition, 37, pp. 1887-1899.
[28] Pal U, Chaudhuri B B 1997 Printed Devnagari script OCR system. Vivek, Vol. 10, pp. 12-24.
[29] Parvati Iyer, Abhipsita Singh, Sanyal S 2005 Optical character recognition system for noisy images in Devnagari script. UDL Workshop on Optical Character Recognition with Workflow and Document Summarization.
[30] Pujari Arun K, Dhanunjaya Naidu C, Jinaga B C 2002 An adaptive character recognizer for Telugu scripts using multiresolution analysis and associative memory. ICVGIP, Ahmedabad.
[31] Rajasekaran S N S, Deekshatulu B L 1977 Recognition of printed Telugu characters. Computer Graphics and Image Processing, 6, pp. 335-360.
[32] Rangachar Kasturi, Lawrence O'Gorman, Venu Govindaraju 2002 Document image analysis: a primer. Sadhana, Vol. 27, Part 1, pp. 3-22.
[33] Rao P V S, Ajitha T M 1995 Telugu script recognition – a feature based approach. Proc. of ICDAR, IEEE, pp. 323-326.
[34] Ray K, Chatterjee B 1984 Design of a nearest neighbour classifier system for Bengali character recognition. Journal of Inst. Electronics Telecom. Eng., 30, pp. 226-229.
[35] Seethalakshmi R, Sreeranjani T R, Balachandar T, Abnikant Singh, Markandey Singh, Ritwaj Ratan, Sarvesh Kumar 2005 Optical character recognition for printed Tamil text using Unicode. Journal of Zhejiang University SCI, 6A(11), pp. 1297-1305.
[36] Sinha R K, Mahabala 1979 Machine recognition of Devnagari script. IEEE Trans. Systems, Man and Cybernetics, pp. 435-441.
[37] Siromony G, Chandrasekaran R, Chandrasekaran M 1978 Computer recognition of printed Tamil characters. Pattern Recognition, Vol. 10, pp. 243-247.
[38] Sukhaswami R, Seetharamulu P, Pujari A K 1995 Recognition of Telugu characters using neural networks. Int. J. Neural Syst., Vol. 6, pp. 317-357.
[39] Veena Bansal 1999 Integrating knowledge sources in Devnagari text recognition. Ph.D. Thesis, IIT Kanpur.
[40] Wong K, Casey R, Wahl F 1982 Document analysis system. IBM Journal of Research and Development, Vol. 26(6).
[41] Vasantha Lakshmi C, Patvardhan C, Mohit Prasad 2004 A novel approach for improving recognition accuracies in OCR of printed Telugu text.
[42] Atul Negi, Sowri V S R, Mohan Rao K 2004 Document processing methods for Telugu and other South East Asian scripts.
[43] Anitha Jayaraman, Chandra Sekhar C, Srinivasa Chakravarthy V 2007 Modular approach to recognition of strokes in Telugu script.