Arabic language script and encoding identification with support vector machines and rough set theory

Arabic ranks sixth among the world’s spoken languages, with more than 230 million speakers across the Arab world. Arabic has many varieties and dialects; the most widely used is Egyptian Arabic, with more than 50 million speakers. Although, only a sma...


Bibliographic Details
Main Author: Mohamed Sidya, Mohamed Ould
Format: Thesis
Language: English
Published: 2007
Subjects: QA75 Electronic computers. Computer science
Online Access: http://eprints.utm.my/6795/
http://eprints.utm.my/6795/1/MohamedOuldMohamedSidyaMFSKSM2007.pdf
author Mohamed Sidya, Mohamed Ould
building UTeM Institutional Repository
collection Online Access
description Arabic ranks sixth among the world’s spoken languages, with more than 230 million speakers across the Arab world. Arabic has many varieties and dialects; the most widely used is Egyptian Arabic, with more than 50 million speakers. Although only a small proportion of Arabic speakers use the internet, they still make up a considerable share of the internet community. So far, however, no research has addressed automatically distinguishing Arabic from the other languages that use the same script. This project addresses distinguishing Arabic from Persian, both of which are written in the Arabic script. The data was collected from the internet, in particular from the BBC website, and was preprocessed with several operations, including stop word removal and stemming. The project compares the performance of Support Vector Machines and Rough Set Theory in identifying the Arabic language. The results show that both methods perform well, but Support Vector Machines outperform Rough Set Theory.
first_indexed 2025-11-15T20:56:31Z
format Thesis
id utm-6795
institution Universiti Teknologi Malaysia
institution_category Local University
language English
last_indexed 2025-11-15T20:56:31Z
publishDate 2007
recordtype eprints
repository_type Digital Repository
spelling utm-6795 2018-08-03T08:49:15Z http://eprints.utm.my/6795/ Arabic language script and encoding identification with support vector machines and rough set theory Mohamed Sidya, Mohamed Ould QA75 Electronic computers. Computer science 2007-11 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/6795/1/MohamedOuldMohamedSidyaMFSKSM2007.pdf Mohamed Sidya, Mohamed Ould (2007) Arabic language script and encoding identification with support vector machines and rough set theory. Masters thesis, Universiti Teknologi Malaysia, Faculty of Computer Science and Information System. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:62506
title Arabic language script and encoding identification with support vector machines and rough set theory
topic QA75 Electronic computers. Computer science
url http://eprints.utm.my/6795/
http://eprints.utm.my/6795/1/MohamedOuldMohamedSidyaMFSKSM2007.pdf
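
The abstract mentions stop word removal as one of the preprocessing operations applied before classification. The following is a minimal sketch of that step, followed by one simple script-based feature that could be passed to a classifier. It rests on several assumptions: the stop-word lists are hypothetical and deliberately tiny (the record does not give the thesis's lists), the function names are invented for illustration, and the Persian-letter feature is not taken from the thesis. Stemming and the Support Vector Machine and Rough Set Theory classifiers themselves are not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word lists for illustration only;
# the lists actually used in the thesis are not given in this record.
ARABIC_STOP_WORDS = {"في", "من", "على", "إلى"}
PERSIAN_STOP_WORDS = {"از", "به", "در", "که"}
STOP_WORDS = ARABIC_STOP_WORDS | PERSIAN_STOP_WORDS

# Letters used in Persian that are absent from the standard Arabic alphabet.
PERSIAN_ONLY_LETTERS = set("پچژگ")

# Arabic base letters (U+0621 to U+064A) plus six Persian letter forms.
# Diacritics, digits, and punctuation act as separators in this sketch.
TOKEN_PATTERN = re.compile(r"[\u0621-\u064A\u067E\u0686\u0698\u06A9\u06AF\u06CC]+")


def tokenize(text):
    """Split text into runs of Arabic-script letters."""
    return TOKEN_PATTERN.findall(text)


def remove_stop_words(tokens):
    """Drop tokens that appear in the (hypothetical) stop-word lists."""
    return [token for token in tokens if token not in STOP_WORDS]


def persian_letter_ratio(tokens):
    """Fraction of letters that occur in Persian but not in standard Arabic."""
    letters = "".join(tokens)
    if not letters:
        return 0.0
    persian_count = sum(ch in PERSIAN_ONLY_LETTERS for ch in letters)
    return persian_count / len(letters)


if __name__ == "__main__":
    for sample in ("اللغة العربية في العالم", "زبان پارسی در ایران"):
        tokens = remove_stop_words(tokenize(sample))
        print(tokens, persian_letter_ratio(tokens))
```

A single hand-picked feature like this is only a rough baseline; short Persian texts may contain none of the four letters, which is one reason a trained classifier over richer features, as compared in the thesis, would be preferable.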