Malay manuscripts transliteration using statistical machine translation (SMT)
| Main Authors: | , , , |
|---|---|
| Format: | Proceeding Paper |
| Language: | English |
| Published: | IEEE 2020 |
| Subjects: | |
| Online Access: | http://irep.iium.edu.my/79736/ http://irep.iium.edu.my/79736/1/79736%20Malay%20manuscripts%20transliteration.pdf http://irep.iium.edu.my/79736/2/79736%20Malay%20manuscripts%20transliteration%20SCOPUS.pdf |
| Summary: | Natural Language Processing (NLP) is a vital field of artificial intelligence that automates the study of human language. However, Malay manuscripts (MM) written in old Jawi have had limited exposure in this field. Moreover, most studies relating MM to NLP have focused on rule-based approaches, or rule-based machine transliteration (RBMT). Hence, the objective of this study is to propose a statistical approach for old Jawi to modern Jawi transliteration of Malay manuscript contents, using Phrase-Based Statistical Machine Translation (PBSMT) as its model. To this end, the Word Error Rate (WER) quality score was computed on the transliteration output. In addition, the issues formerly encountered by the rule-based approach, such as vowel limitation and homographs, reduplication, letter errors and combinations of multiple words, were observed in the implementation. The paper used an exploratory approach as its research strategy and mixed methods as its research method. The data for the analysis were extracted from an MM titled Bidāyat al-Mubtadī bi-Faḍlillāh al-Muhdī. After the WER evaluation of the SMT output, related issues were identified and assessed. The research found that the quality of PBSMT for old Jawi to modern Jawi transliteration was high in terms of WER, and that the issues of the rule-based approach were generally addressed by PBSMT, except for homographs. The research is, however, limited to an SMT approach that focused solely on PBSMT as its model. Moreover, the corpus was limited to one manuscript, whereas SMT relies on corpus size. Nevertheless, the research contributes to wider coverage of Malay, one of the under-resourced languages in NLP, in the form of old and modern Jawi. To the best of the researcher’s knowledge, it is also the first to apply an SMT (PBSMT) approach to old Jawi transliteration. Most importantly, the study contributes to MM studies. |
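
The summary evaluates transliteration output with Word Error Rate (WER). As a rough illustration of that metric, the sketch below uses the standard definition: word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference. The function name `word_error_rate` and the sample tokens are hypothetical; the record does not say which tool or tokenization the authors used, so whitespace tokenization is an assumption here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: tokens are separated by whitespace. The paper's own
    tokenization and scoring tool are not stated in this record.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not data from the manuscript.
    print(word_error_rate("a b c d", "a x c"))
```

In the placeholder call, one word is substituted and one is dropped out of four reference words, so the score is 0.5. A lower WER means the output is closer to the reference transliteration.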