Offline printed Arabic character recognition

Optical Character Recognition (OCR) shows great potential for rapid data entry, but has limited success when applied to the Arabic language. Normal OCR problems are compounded by the right-to-left nature of Arabic and because the script is largely connected. This research investigates current approa...

Bibliographic Details
Main Author: AbdelRaouf, Ashraf M.
Format: Thesis (University of Nottingham only)
Language: English
Published: 2012
Subjects:
Online Access: https://eprints.nottingham.ac.uk/12601/
_version_ 1848791537683529728
author AbdelRaouf, Ashraf M.
building Nottingham Research Data Repository
collection Online Access
description Optical Character Recognition (OCR) shows great potential for rapid data entry, but has limited success when applied to the Arabic language. Normal OCR problems are compounded by the right-to-left nature of Arabic and because the script is largely connected. This research investigates current approaches to the Arabic character recognition problem and proposes a new approach. The main work involves a Haar-Cascade Classifier (HCC) approach modified for the first time for Arabic character recognition. This technique eliminates the problematic steps in the pre-processing and recognition phases, in addition to the character segmentation stage. A classifier was produced for each of the 61 Arabic glyphs that exist after the removal of diacritical marks. These 61 classifiers were trained and tested on an average of about 2,000 images each. A Multi-Modal Arabic Corpus (MMAC) has also been developed to support this work. MMAC makes innovative use of the new concept of connected segments of Arabic words (PAWs), with and without diacritical marks. These new tokens have significance for linguistic as well as OCR research and applications and have been applied here in the post-processing phase. A complete Arabic OCR application has been developed to manipulate the scanned images and extract a list of detected words. It consists of the HCC to extract glyphs, systems for parsing and correcting these glyphs, and the MMAC to apply linguistic constraints. The HCC produces a recognition rate for Arabic glyphs of 87%. MMAC is based on 6 million words, is published on the web and has been applied and validated both in research and commercial use.
first_indexed 2025-11-14T18:30:05Z
format Thesis (University of Nottingham only)
id nottingham-12601
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T18:30:05Z
publishDate 2012
recordtype eprints
repository_type Digital Repository
spelling nottingham-12601 2025-02-28T11:20:14Z https://eprints.nottingham.ac.uk/12601/ Offline printed Arabic character recognition AbdelRaouf, Ashraf M. 2012-07-19 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/12601/1/FinalSubmissionThesis.pdf AbdelRaouf, Ashraf M. (2012) Offline printed Arabic character recognition. PhD thesis, University of Nottingham.
title Offline printed Arabic character recognition
title_sort offline printed arabic character recognition
topic arabic language
character recognition
printed arabic
mmac
multi-modal arabic corpus
haar-cascade classifier
url https://eprints.nottingham.ac.uk/12601/