A hybrid model for low-resource language text classification and comparative analysis
Context: The growing digital content in many languages helps users share diverse information. However, classifying user reviews is time-consuming and biased. Transformers like BERT excel in NLP, but low-resource languages still face challenges due to limited data, computational resources, and lingui...
| Main Authors: | Salleh, Amran; Osman, Mohd Hafeez; Hassan, Sa'adah; Said, Mar Yah; Sharif, Khaironi Yatim; Wei, Koh Tieng |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier 2025 |
| Online Access: | http://psasir.upm.edu.my/id/eprint/119285/ http://psasir.upm.edu.my/id/eprint/119285/1/119285.pdf |
| _version_ | 1848867925386067968 |
|---|---|
| author | Salleh, Amran; Osman, Mohd Hafeez; Hassan, Sa'adah; Said, Mar Yah; Sharif, Khaironi Yatim; Wei, Koh Tieng |
| author_facet | Salleh, Amran; Osman, Mohd Hafeez; Hassan, Sa'adah; Said, Mar Yah; Sharif, Khaironi Yatim; Wei, Koh Tieng |
| author_sort | Salleh, Amran |
| building | UPM Institutional Repository |
| collection | Online Access |
| description | Context: The growing volume of digital content in many languages helps users share diverse information. However, classifying user reviews is time-consuming and prone to bias. Transformers such as BERT excel at NLP tasks, but low-resource languages still face challenges due to limited data, computational resources, and linguistic tools. Objective: This paper has two objectives: (1) to evaluate and compare existing text classification methods using a newly annotated dataset for Malay, a low-resource language; and (2) to propose a new hybrid model for classifying low-resource languages that combines rule-based linguistic features with transfer learning approaches. Methods: Five tools (LangDetect, spaCy, FastText, XLM-RoBERTa, and LLaMA) were applied and compared on a low-resource (Malay) dataset to identify gaps and limitations in performance. The research focuses on four main areas: (i) Challenges in Low-Resource Languages, (ii) Comparative Analysis, (iii) Proposed Model, and (iv) Empirical Evaluation. The dataset comprises 74,931 user reviews from Google Play Store apps (MyBayar, PDRM, MyJPJ, and MySejahtera). A subset of 2,621 reviews was selected and annotated by two independent coders, and Fleiss’ Kappa was used to verify inter-coder agreement for the ground-truth dataset. Results: The proposed hybrid model achieved an accuracy of 84% and statistically significant improvements in classification performance. Paired t-tests confirm these improvements, showing significant differences in F1-score compared to baseline methods (p < 0.05). Conclusion: The findings emphasize the need for tailored NLP approaches for underrepresented languages, show the importance of custom models that handle language diversity, and motivate further development for low-resource languages. |
| first_indexed | 2025-11-15T14:44:14Z |
| format | Article |
| id | upm-119285 |
| institution | Universiti Putra Malaysia |
| institution_category | Local University |
| language | English |
| last_indexed | 2025-11-15T14:44:14Z |
| publishDate | 2025 |
| publisher | Elsevier |
| recordtype | eprints |
| repository_type | Digital Repository |
| spelling | upm-119285 2025-08-14T07:29:30Z http://psasir.upm.edu.my/id/eprint/119285/ A hybrid model for low-resource language text classification and comparative analysis. Salleh, Amran; Osman, Mohd Hafeez; Hassan, Sa'adah; Said, Mar Yah; Sharif, Khaironi Yatim; Wei, Koh Tieng. Elsevier 2025 Article PeerReviewed text en http://psasir.upm.edu.my/id/eprint/119285/1/119285.pdf Salleh, Amran and Osman, Mohd Hafeez and Hassan, Sa'adah and Said, Mar Yah and Sharif, Khaironi Yatim and Wei, Koh Tieng (2025) A hybrid model for low-resource language text classification and comparative analysis. Knowledge-Based Systems, 326. art. no. 114068. pp. 1-9. ISSN 0950-7051 https://linkinghub.elsevier.com/retrieve/pii/S095070512501113X 10.1016/j.knosys.2025.114068 |
| title | A hybrid model for low-resource language text classification and comparative analysis |
| title_sort | hybrid model for low-resource language text classification and comparative analysis |
| url | http://psasir.upm.edu.my/id/eprint/119285/ http://psasir.upm.edu.my/id/eprint/119285/1/119285.pdf |
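
The description above reports that Fleiss’ Kappa was used to check agreement between the two coders. As a rough illustration of how that statistic is computed, the following minimal sketch implements the standard Fleiss' kappa formula in plain Python. It is not the authors' code: the function name `fleiss_kappa`, the labels `class_a` and `class_b`, and the four toy items are hypothetical placeholders, and no values here come from the paper's dataset.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labelled by the same number of raters.

    ratings:    list of per-item label sequences, e.g. [("a", "a"), ("a", "b")]
    categories: list of all possible labels
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who assigned category j to item i
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over items of the per-item agreement P_i
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Every rating falls in one category; kappa is undefined in this case.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy annotations from two coders (NOT data from the paper).
toy_labels = ["class_a", "class_b"]
toy_ratings = [
    ("class_a", "class_a"),
    ("class_b", "class_b"),
    ("class_a", "class_b"),
    ("class_b", "class_b"),
]

kappa = fleiss_kappa(toy_ratings, toy_labels)
print(f"Fleiss' kappa on the toy data: {kappa:.3f}")
```

On the toy data, three of the four items are labelled identically by both coders, and the resulting kappa is lower than the raw 75% agreement because the statistic discounts agreement expected by chance.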