Generic named-entity recognition for indigenous languages of Sarawak (Nersil)

The aim of this research is to create the first Named Entity Recognition (NER) system for the Sarawak Indigenous Languages (SILs), hereinafter called NERSIL. The main goal of NERSIL is to achieve good accuracy in identifying and classifying named entities (NEs). The NEs...


Bibliographic Details
Main Author: Yong, Soo Fong
Format: Thesis
Language: English
Published: Universiti Malaysia Sarawak, (UNIMAS) 2013
Subjects: T Technology (General)
Online Access: http://ir.unimas.my/id/eprint/8340/
http://ir.unimas.my/id/eprint/8340/3/Generic%20Named-Entity%20Recognition%20For%20Indigenous%20Languages%20of%20Sarawak%20%28NERSIL%29%20%28full%29.pdf
_version_ 1848836361343205376
author Yong, Soo Fong
building UNIMAS Institutional Repository
collection Online Access
description The aim of this research is to create the first Named Entity Recognition (NER) system for the Sarawak Indigenous Languages (SILs), hereinafter called NERSIL. The main goal of NERSIL is to achieve good accuracy in identifying and classifying named entities (NEs). The NEs considered in this research are Person, Location, Organisation, Date, Time, Monetary and Percentage. These NEs generally carry important information about the text itself and are therefore targets for extraction. NER approaches can be broadly categorised as rule-based, machine learning-based, and hybrid. The rule-based approach relies on hand-crafted linguistic grammars. The machine learning-based approach needs a large amount of annotated training data, which is unavailable for SILs. The hybrid approach combines the rule-based and machine learning-based approaches. NERSIL requires special attention because existing NER approaches cannot be applied to SILs directly. This thesis presents an NER system built by extending and modifying existing NER approaches. There are three main processes: the non-modified ANNIE (A Nearly-New IE system) NER, ANNIE adapted to SILs, and finally context investigation. First, the input texts are submitted to an English NER system, in this case ANNIE, on the assumption that some NEs that appear in English texts will also occur in SIL texts. At that stage, the rules for unrecognised NEs are distinguished from the rules for recognised NEs. Next, new rules for the unrecognised NEs are written and new gazetteers for SILs are built in order to identify more NEs. However, the first two processes are not enough to provide good accuracy in recognising all NEs, so context investigation is needed. Context investigation includes frequency analysis, triggered words filtering, and concordance analysis; the context of an NE (the words to its left or right) is investigated. Finally, an NER system designed for SILs will be an advancement of world knowledge. The design can also be improved by incorporating machine translation and WordNet, and by adding more noise filtering (e.g. context filtering and morphological filtering). With more research and future studies, this NER system will reach a high level of performance, comparable to that of English NER systems.
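The context investigation step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, which builds on ANNIE. The gazetteer entries, the placeholder tokens, the window sizes, and the function names (find_entities, context_frequencies, trigger_candidates, concordance) are all assumptions introduced here for illustration only.

```python
from collections import Counter

# Hypothetical gazetteer; entries are illustrative, not taken from the thesis.
GAZETTEER = {"kuching": "Location", "sibu": "Location"}


def find_entities(tokens, gazetteer):
    """Return (index, token, label) for every token listed in the gazetteer."""
    return [
        (i, tok, gazetteer[tok.lower()])
        for i, tok in enumerate(tokens)
        if tok.lower() in gazetteer
    ]


def context_frequencies(tokens, hits, window=1):
    """Frequency analysis: count words within `window` tokens left/right of each hit."""
    left, right = Counter(), Counter()
    for i, _, _ in hits:
        left.update(t.lower() for t in tokens[max(0, i - window):i])
        right.update(t.lower() for t in tokens[i + 1:i + 1 + window])
    return left, right


def trigger_candidates(counts, min_count=2):
    """Keep context words seen at least `min_count` times as candidate trigger words."""
    return [word for word, n in counts.items() if n >= min_count]


def concordance(tokens, hits, window=2):
    """Concordance analysis: one keyword-in-context line per recognised entity."""
    lines = []
    for i, tok, label in hits:
        before = " ".join(tokens[max(0, i - window):i])
        after = " ".join(tokens[i + 1:i + 1 + window])
        lines.append(f"{before} [{tok}/{label}] {after}".strip())
    return lines


if __name__ == "__main__":
    # Placeholder tokens, not real SIL text; "di" stands in for a trigger word.
    tokens = "a di Kuching b c di Sibu d".split()
    hits = find_entities(tokens, GAZETTEER)
    left, right = context_frequencies(tokens, hits)
    print(trigger_candidates(left))
    for line in concordance(tokens, hits):
        print(line)
```

In this sketch, words that recur immediately to the left or right of known entities become candidate trigger words, which could then be reviewed by hand before being turned into new rules or gazetteer entries.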
first_indexed 2025-11-15T06:22:33Z
format Thesis
id unimas-8340
institution Universiti Malaysia Sarawak
institution_category Local University
language English
last_indexed 2025-11-15T06:22:33Z
publishDate 2013
publisher Universiti Malaysia Sarawak, (UNIMAS)
recordtype eprints
repository_type Digital Repository
spelling unimas-8340 2023-05-25T09:43:08Z http://ir.unimas.my/id/eprint/8340/ Generic named-entity recognition for indigenous languages of Sarawak (Nersil) Yong, Soo Fong T Technology (General) Universiti Malaysia Sarawak, (UNIMAS) 2013 Thesis NonPeerReviewed text en http://ir.unimas.my/id/eprint/8340/3/Generic%20Named-Entity%20Recognition%20For%20Indigenous%20Languages%20of%20Sarawak%20%28NERSIL%29%20%28full%29.pdf Yong, Soo Fong (2013) Generic named-entity recognition for indigenous languages of Sarawak (Nersil). Masters thesis, Universiti Malaysia Sarawak, (UNIMAS).
title Generic named-entity recognition for indigenous languages of Sarawak (Nersil)
topic T Technology (General)
url http://ir.unimas.my/id/eprint/8340/
http://ir.unimas.my/id/eprint/8340/3/Generic%20Named-Entity%20Recognition%20For%20Indigenous%20Languages%20of%20Sarawak%20%28NERSIL%29%20%28full%29.pdf