Volume 14, Issue 2 (2023)                   LRR 2023, 14(2): 363-400



Zareian A, Mosavi Miangah T, Rovshan B, Fakhr Ahmad S M. Correction and Improvement of the Common Processes in Optical Character Recognition (OCR) of Persian Texts: Using the Features of the Persian Script and a Dimension Transference Algorithm. LRR 2023; 14(2): 363-400
URL: http://lrr.modares.ac.ir/article-14-53265-en.html
1- PhD Candidate in Linguistics, Payame Noor University, P.O. Box 19395-3697, Tehran, Iran
2- Associate Professor in Linguistics, Payame Noor University, P.O. Box 19395-3697, Tehran, Iran, mosavit@pnu.ac.ir
3- Professor in Linguistics, Payame Noor University, P.O. Box 19395-3697, Tehran, Iran
4- Associate Professor in Computer, Shiraz University, Shiraz, Iran
Abstract:
Because optical character recognition (OCR) technology was developed primarily for Latin script, almost all the algorithms and processes used in Persian OCR systems are built on the structure and scriptological features of the Latin alphabet. Applying the means and features of Latin script to the design of Persian OCR systems has not produced adequate recognition of Persian characters, and it has also caused confusion for both Persian-speaking users and the systems themselves. This paper therefore begins with a short review of the significance of language and linguistics in information technology as it relates to OCR systems. It then gives a short history of the Persian/Arabic script, focusing on the scribal features of the Persian writing system and how it differs from other scripts. Next, so that the formal elements of the Persian script can be used effectively, these elements are categorized according to their application and significance in the user's interaction with Persian OCR systems. A step-by-step discussion and analysis of the processes involved in optical character recognition, grounded in the scriptological features of the Persian script, then pinpoints the deficiencies and faults of current Latin-based OCR systems. It also presents a different aspect of the Persian writing system in connection with computer software, especially OCR systems, so that the reader can see in practice the potential and capabilities of this complex script in contrast to the simpler Latin writing system. Finally, to upgrade and improve the current algorithms employed in Persian OCR systems, the geometrical process of transferring two-dimensional specifications into one-dimensional ones is used.
The proposed algorithm, which is based on the scriptological features of the Persian script, simultaneously makes patterns easier to manipulate, reduces the size of the database, and speeds up data processing.

1. Introduction
Because optical character recognition (OCR) technology was developed primarily for Latin script, almost all the algorithms and processes used in Persian OCR systems are built on the structure and scriptological features of the Latin alphabet. Applying the means and features of Latin script to the design of Persian OCR systems has not produced adequate recognition of Persian characters, and it has also caused confusion for both Persian-speaking users and the systems themselves. Therefore, to present a different picture of the Persian writing system in computing, especially in OCR systems, this research describes and analyzes the processes involved in optical character recognition on the basis of the scriptological features of the Persian alphabet and elaborates on how they differ from the existing Latin-based systems. To this end, after reviewing the history and evolution of the Persian script through different periods, it gives a classified account of the scriptological features of the Persian writing system and its formal elements, with a special focus on OCR processes. The formal elements of the Persian script are accordingly categorized by their application and significance in the user's interaction with Persian OCR software. Furthermore, the effective use of these scriptological elements is set out both within the framework of the existing algorithms and in the form of a proposed algorithm.
On the one hand, the proposed algorithm practically eliminates the high sensitivity of the existing algorithms to the cursiveness and elongation of Persian letters, which previously increased the error rate of OCR processes. On the other hand, it largely prevents growth in the size of the database and in the computations related to the stored patterns, which previously degraded software performance.


2. Literature Review
The representation of Persian/Arabic characters has been studied since the 1970s (Bonyani & Jahangard, 2020), and the earliest algorithms for representing Arabic scripts were released in the 1990s (Margner & El-Abed, 2008). Many researchers, including Shafii (2014), abandoned holistic segmentation of Persian characters because of difficulties arising from certain special features of the Persian alphabet and worked only on the representation of sub-words instead. The algorithm proposed by Kiaei (2019), despite being restricted to printed, limited omni-font texts, did not yield acceptable results and was inefficient when faced with word sequences. Rhmati et al. (2020), the most recent study on character segmentation, treated the baseline connector as part of a character, as many other studies have done; their algorithm proposed a procedure for shortening overlong baseline connectors to facilitate character recognition by existing systems. Recent studies on optical character recognition set aside structural features in the recognition process and rely primarily on holistic algorithms based on neural networks to extract the distinctive features of characters (Bonyani & Jahangard, 2020).

3. Discussion
The proposed algorithm is built on the concept of the baseline connector (BC): all connected characters share an identical BC component, so every instance of the BC, regardless of its length, is identified as one and the same component. The BC component of each character, together with its variable extra stretches, is removed by algorithmic and mathematical processing and replaced by a single special code. This differs from the commonly used methods of character segmentation, in which the whole character, including the BC component, goes through the image processing stage. Here, in the pattern comparison stage, the system first recognizes the BC component and removes its extra stretches, and only then compares the remaining letter image with the stored patterns.
Removing the BC component from the text image and replacing it with a simple code, contrary to customary practice, has four consequences. 1) Letter segmentation occurs naturally and successfully. 2) Because the BC component code is present, the letter image (raw letter) is not compared with all existing patterns but only with the patterns belonging to the same set, since the position of the BC component divides the letters into four sets: a) letter + BC (= initial letters); b) BC + letter (= final letters); c) BC + letter + BC (= medial letters); d) isolated form without a BC on either side (= isolated letters). The raw letter is therefore matched only against the patterns in one of these four main groups. 3) The BC component occurs in various lengths and in practice has no effect on the reading of the word; removing it from the letter image eliminates the large number of patterns that differ only in the length of their BC component, which speeds up pattern recognition. 4) The BC component and its extension, i.e. the baseline, divide the letter components into two groups: a) above the baseline; b) below the baseline. Classifying components into upper and lower sets on this basis further simplifies pattern comparison, since a) upper elements are compared only with upper patterns and lower elements only with lower patterns; and b) instead of comparing the whole raw image frame with the patterns, the baseline of the raw element is first matched with the baseline of the pattern, and the comparison is then made separately in the upper and lower sections.
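The BC-based segmentation and four-way classification described above can be sketched in a few lines of Python. This is only an illustrative sketch under strong assumptions, not the authors' implementation: the image is a toy binary grid, the baseline rows are given rather than detected, a column is treated as BC whenever its ink lies only on the baseline, and the names `column_kind`, `tokenize`, and `positional_sets` are hypothetical. Real letter shapes with flat strokes on the baseline would need a more careful test.

```python
from typing import List


def column_kind(column: List[int], baseline_rows: range) -> str:
    """Label one pixel column as 'gap', 'bc', or 'letter' (toy criterion)."""
    ink_rows = [r for r, pixel in enumerate(column) if pixel]
    if not ink_rows:
        return "gap"                       # no ink at all
    if all(r in baseline_rows for r in ink_rows):
        return "bc"                        # ink only on the baseline
    return "letter"                        # ink above or below the baseline


def tokenize(image: List[List[int]], baseline_rows: range) -> List[str]:
    """Scan columns right to left (Persian reading order) and collapse each
    run of same-kind columns into one token, so that a BC stretch of any
    length becomes a single 'bc' code."""
    tokens: List[str] = []
    for c in range(len(image[0]) - 1, -1, -1):
        kind = column_kind([row[c] for row in image], baseline_rows)
        if not tokens or tokens[-1] != kind:
            tokens.append(kind)
    return tokens


def positional_sets(tokens: List[str]) -> List[str]:
    """Assign each letter token to one of the four sets by its BC neighbours."""
    names = {(False, True): "initial",     # letter + BC
             (True, False): "final",       # BC + letter
             (True, True): "medial",       # BC + letter + BC
             (False, False): "isolated"}   # no BC on either side
    sets = []
    for i, token in enumerate(tokens):
        if token != "letter":
            continue
        bc_before = i > 0 and tokens[i - 1] == "bc"
        bc_after = i + 1 < len(tokens) and tokens[i + 1] == "bc"
        sets.append(names[(bc_before, bc_after)])
    return sets


# Toy 5x9 image ('#' = ink); row 3 is the assumed baseline.
ROWS = ["......#..",
        "#.....#.#",
        "#...#.#.#",
        "#######.#",
        "........#"]
image = [[1 if ch == "#" else 0 for ch in row] for row in ROWS]

tokens = tokenize(image, baseline_rows=range(3, 4))
print(tokens)
print(positional_sets(tokens))
```

In this toy image the three-column BC stretch and the one-column BC stretch both collapse to the same `'bc'` token, and the four letter tokens fall into the isolated, initial, medial, and final sets in reading order. A matcher could then look up candidate patterns only within the set returned for each raw letter.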

4. Conclusion
The current performance of Persian OCR software indicates that research in this field faces two main challenges: one concerns solving problems specific to the Persian script, and the other concerns algorithm design and programming. This study found that the current challenges originate in programmers having ignored the original function and underlying rationale of the baseline connector and having treated it as part of Persian words. The study sought to improve sub-word segmentation in Persian OCR by exploiting a distinctive feature of the baseline connector while formally eliminating it. The baseline connector makes up a large part of Persian texts, and its formal deletion from the raw patterns has markedly reduced the volume of errors at the processing level. Furthermore, using the baseline connector as a criterion makes it possible to classify Persian scripts and patterns. Consequently, instead of comparing one raw element with all patterns, comparison is limited to homogeneous groups, and processing speed increases. Finally, to upgrade and improve the current algorithms employed in Persian OCR systems, the geometrical process of transferring two-dimensional specifications into one-dimensional ones has been used. The proposed algorithm, which is based on the scriptological features of the Persian script, simultaneously makes patterns easier to manipulate, reduces the size of the database, and speeds up data processing.
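As a rough illustration of reducing a two-dimensional image to one-dimensional data, the sketch below uses a horizontal projection profile (ink count per row), a standard OCR technique, to locate the baseline and split a sub-word image into upper and lower parts. The text above does not specify how the authors' dimension transference algorithm works, so this profile is a stand-in chosen for illustration, and the names `row_profile`, `find_baseline`, and `split_at_baseline` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]


def row_profile(image: Image) -> List[int]:
    """Reduce the 2D image to a 1D list holding the ink count of each row."""
    return [sum(row) for row in image]


def find_baseline(image: Image) -> int:
    """Return the index of the row with the most ink (first one on ties).
    Assumption: the baseline connector makes the baseline the densest row."""
    profile = row_profile(image)
    return max(range(len(profile)), key=lambda r: profile[r])


def split_at_baseline(image: Image) -> Tuple[Image, Image]:
    """Split the image into the rows above and the rows below the baseline."""
    b = find_baseline(image)
    return image[:b], image[b + 1:]


# Toy 5x9 image ('#' = ink).
ROWS = ["......#..",
        "#.....#.#",
        "#...#.#.#",
        "#######.#",
        "........#"]
image = [[1 if ch == "#" else 0 for ch in row] for row in ROWS]

upper, lower = split_at_baseline(image)
print(row_profile(image), find_baseline(image))
print(len(upper), "rows above,", len(lower), "rows below")
```

With the upper and lower parts separated, upper elements could be compared only with upper patterns and lower elements only with lower patterns, and each part could itself be summarized by a one-dimensional profile instead of a full pixel grid.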
Article Type: Scientific Research Article | Subject: Linguistics
Published: 2023/05/31



Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.