Malay manuscripts transliteration using statistical machine translation (SMT)

Natural Language Processing (NLP) is a vital field of artificial intelligence that automates the study of human language. However, Malay manuscripts (MM) written in old Jawi have had limited exposure to this field. Moreover, most studies relating MM to NLP have focused on rul...


Bibliographic Details
Main Authors: Abdul Razak, Sitti Munirah; Abu Seman, Muhamad Sadry; Wan Mamat, Wan Ali @ Wan Yusoff; Mohammad Noor, Noor Hasrul Nizan
Format: Proceeding Paper
Language: English
Published: IEEE 2020
Subjects: TK7800 Electronics. Computer engineering. Computer hardware. Photoelectronic devices
Online Access: http://irep.iium.edu.my/79736/
http://irep.iium.edu.my/79736/1/79736%20Malay%20manuscripts%20transliteration.pdf
http://irep.iium.edu.my/79736/2/79736%20Malay%20manuscripts%20transliteration%20SCOPUS.pdf
author Abdul Razak, Sitti Munirah
Abu Seman, Muhamad Sadry
Wan Mamat, Wan Ali @ Wan Yusoff
Mohammad Noor, Noor Hasrul Nizan
building IIUM Repository
collection Online Access
description Natural Language Processing (NLP) is a vital field of artificial intelligence that automates the study of human language. However, Malay manuscripts (MM) written in old Jawi have had limited exposure to this field. Moreover, most studies relating MM to NLP have focused on rule-based approaches, i.e. rule-based machine transliteration (RBMT). Hence, the objective of this study is to propose a statistical approach to old Jawi to modern Jawi transliteration of Malay manuscript contents, using Phrase-Based Statistical Machine Translation (PBSMT) as its model. To that end, the Word Error Rate (WER) quality score was computed on the transliteration output. In addition, the issues formerly encountered by the rule-based approach, such as vowel limitations and homographs, reduplication, letter errors and the combination of multiple words, were observed in the implementation. The paper used an exploratory approach as its research strategy and a mixed method as its research method. The data for the analysis were extracted from an MM titled Bidāyat al-Mubtadī biFaḍlillāh al-Muhdī. The WER quality score was computed to evaluate the SMT output, after which related issues were identified and assessed. The research found that the quality score of PBSMT for old Jawi to modern Jawi transliteration was high in terms of WER; the issues of the rule-based approach were generally addressed by PBSMT, except for homographs. The research is, however, limited to an SMT approach that focused solely on PBSMT as its model. Moreover, the corpus was limited to one manuscript, whereas SMT relies on corpus size. Nevertheless, the research contributes to wider coverage of the Malay language, one of the under-resourced languages in NLP, in the form of old and modern Jawi. To the best of the researchers' knowledge, it is also the first to apply an SMT (PBSMT) approach to old Jawi transliteration. Most importantly, the study contributes to the study of MM.
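The description names Word Error Rate (WER) as its quality score but does not spell out how it is computed. Under the conventional definition, WER is the word-level edit distance (substitutions, deletions and insertions) between a system output and a reference, divided by the number of reference words. The Python sketch below illustrates that standard definition only. It is not the authors' evaluation script, and the function name `word_error_rate` and the placeholder tokens are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for transliterated words (hypothetical data).
# One substitution (b -> x) and one deletion (d): 2 edits / 4 words = 0.5
print(word_error_rate("a b c d", "a x c"))
```

Because the score is normalized by the reference length, it can exceed 1.0 when the output contains many inserted words.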
first_indexed 2025-11-14T17:47:02Z
format Proceeding Paper
id iium-79736
institution International Islamic University Malaysia
institution_category Local University
language English
last_indexed 2025-11-14T17:47:02Z
publishDate 2020
publisher IEEE
recordtype eprints
repository_type Digital Repository
spelling iium-79736 2020-07-14T08:42:36Z http://irep.iium.edu.my/79736/ Malay manuscripts transliteration using statistical machine translation (SMT). Abdul Razak, Sitti Munirah; Abu Seman, Muhamad Sadry; Wan Mamat, Wan Ali @ Wan Yusoff; Mohammad Noor, Noor Hasrul Nizan. TK7800 Electronics. Computer engineering. Computer hardware. Photoelectronic devices. IEEE 2020-01-30 Proceeding Paper PeerReviewed application/pdf en http://irep.iium.edu.my/79736/1/79736%20Malay%20manuscripts%20transliteration.pdf application/pdf en http://irep.iium.edu.my/79736/2/79736%20Malay%20manuscripts%20transliteration%20SCOPUS.pdf
Abdul Razak, Sitti Munirah and Abu Seman, Muhamad Sadry and Wan Mamat, Wan Ali @ Wan Yusoff and Mohammad Noor, Noor Hasrul Nizan (2020) Malay manuscripts transliteration using statistical machine translation (SMT). In: 2019 1st International Conference on Artificial Intelligence and Data Sciences (AiDAS), 19 Sept 2019, Ipoh, Perak. https://ieeexplore.ieee.org/document/8970867 doi:10.1109/AiDAS47888.2019.8970867
title Malay manuscripts transliteration using statistical machine translation (SMT)
topic TK7800 Electronics. Computer engineering. Computer hardware. Photoelectronic devices
url http://irep.iium.edu.my/79736/
http://irep.iium.edu.my/79736/1/79736%20Malay%20manuscripts%20transliteration.pdf
http://irep.iium.edu.my/79736/2/79736%20Malay%20manuscripts%20transliteration%20SCOPUS.pdf