Offline printed Arabic character recognition

AbdelRaouf, Ashraf M. (2012) Offline printed Arabic character recognition. PhD thesis, University of Nottingham.

Abstract

Optical Character Recognition (OCR) shows great potential for rapid data entry, but has had limited success when applied to the Arabic language. Normal OCR problems are compounded by the right-to-left direction of Arabic and by its largely connected script. This research investigates current approaches to the Arabic character recognition problem and proposes a new approach.

The main work involves a Haar-Cascade Classifier (HCC) approach, modified for the first time for Arabic character recognition. This technique eliminates the problematic steps in the pre-processing and recognition phases in addition to the character segmentation stage. A classifier was produced for each of the 61 Arabic glyphs that exist after the removal of diacritical marks. These 61 classifiers were trained and tested on an average of about 2,000 images each.

A Multi-Modal Arabic Corpus (MMAC) has also been developed to support this work. MMAC makes innovative use of the new concept of connected segments of Arabic words (PAWs), with and without diacritical marks. These new tokens have significance for linguistic as well as OCR research and applications, and have been applied here in the post-processing phase.
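
To illustrate what a PAW is, the following is a minimal sketch, not the method used in the thesis. It assumes that a word breaks into a new connected segment after any letter that does not join to the letter following it (alef and its hamza/madda forms, hamza, dal, dhal, ra, zay, waw, teh marbuta, alef maksura). The function name `split_into_paws` and the letter set `NON_JOINING_AFTER` are hypothetical names chosen for this example, and the set covers basic Arabic letters only.

```python
import unicodedata

# Assumption: basic Arabic letters that do not connect to the following letter.
# alef, alef+hamza above, alef+hamza below, alef+madda, hamza,
# dal, dhal, ra, zay, waw, waw+hamza, teh marbuta, alef maksura
NON_JOINING_AFTER = set(
    "\u0627\u0623\u0625\u0622\u0621"
    "\u062f\u0630\u0631\u0632"
    "\u0648\u0624\u0629\u0649"
)


def split_into_paws(word: str) -> list[str]:
    """Split one Arabic word into its connected segments (PAWs)."""
    paws: list[str] = []
    current = ""
    for ch in word:
        # A diacritic that follows a segment-ending letter belongs to that segment.
        if unicodedata.combining(ch) and not current and paws:
            paws[-1] += ch
            continue
        current += ch
        if ch in NON_JOINING_AFTER:
            # The script breaks here, so the segment is complete.
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


if __name__ == "__main__":
    # "kitab" (book): kaf, teh, alef | beh
    print(split_into_paws("\u0643\u062a\u0627\u0628"))
```

Under these assumptions a four-letter word such as kaf-teh-alef-beh yields two segments, because alef does not connect to the beh that follows it. Passing the same word with or without diacritics gives the two PAW variants the abstract refers to.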

A complete Arabic OCR application has been developed to process scanned images and extract a list of detected words. It consists of the HCC to extract glyphs, systems for parsing and correcting these glyphs, and the MMAC to apply linguistic constraints. The HCC produces a recognition rate for Arabic glyphs of 87%. MMAC is based on 6 million words, is published on the web and has been applied and validated both in research and commercial use.

Item Type: Thesis (University of Nottingham only) (PhD)
Supervisors: Higgins, C.; Pridmore, T.
Keywords: arabic language, character recognition, printed arabic, mmac, multi-modal arabic corpus, haar-cascade classifier
Subjects: T Technology > TA Engineering (General). Civil engineering (General)
Faculties/Schools: UK Campuses > Faculty of Science > School of Computer Science
Item ID: 12601
Depositing User: EP, Services
Date Deposited: 04 Oct 2012 10:39
Last Modified: 16 Dec 2017 09:25
URI: https://eprints.nottingham.ac.uk/id/eprint/12601
