A Mixed Malay–English Language COVID-19 Twitter Dataset: A Sentiment Analysis

Social media has evolved into a platform for the dissemination of information, including fake news. There is a lot of false information about the current situation of the Coronavirus Disease 2019 (COVID-19) pandemic, such as false information regarding vaccination. In this paper, we focus on sentime...

Bibliographic Details
Main Authors: Kong, Jeffery TH, Juwono, Filbert, Ngu, Ik Ying, Nugraha, I. Gde Dharma, Maraden, Yan, Wong, Wei Kitt
Format: Journal Article
Published: MDPI 2023
Online Access: http://hdl.handle.net/20.500.11937/95202
_version_ 1848765984017481728
author Kong, Jeffery TH
Juwono, Filbert
Ngu, Ik Ying
Nugraha, I. Gde Dharma
Maraden, Yan
Wong, Wei Kitt
author_sort Kong, Jeffery TH
building Curtin Institutional Repository
collection Online Access
description Social media has evolved into a platform for disseminating information, including fake news. False information about the Coronavirus Disease 2019 (COVID-19) pandemic is widespread, including false information regarding vaccination. In this paper, we focus on sentiment analysis for Malaysian COVID-19-related news on social media such as Twitter. Tweets in Malaysia often combine Malay, English, and Chinese, with plenty of short forms, symbols, emojis, and emoticons packed within the maximum length of a tweet. The contributions of this paper are twofold. Firstly, we built a multilingual COVID-19 Twitter dataset comprising tweets written from 1 September 2021 to 12 December 2021. In particular, we collected 108,246 tweets, with over (Formula presented.) in Malay, (Formula presented.) in English, (Formula presented.) in Chinese, and (Formula presented.) in other languages. We then manually annotated 11,568 tweets with one of three sentiment classes (positive, negative, or neutral) to develop a Malay-language sentiment analysis tool. For this purpose, we applied Byte-Pair Encoding (BPE), a data compression method, to the texts and used two deep learning approaches: Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) and a convolutional neural network (CNN). BPE tokenization encodes rare and unknown words as smaller meaningful subwords. For the CNN, we converted the labeled tweets into image files. Our experiments explored different BPE vocabulary sizes with our BPE-Text-to-Image-CNN and BPE-M-BERT models. The results show that the optimal BPE vocabulary size is 12,000; larger values contribute little to the F1-score. Overall, our results show that BPE-M-BERT slightly outperforms the CNN model, indicating that the pre-trained M-BERT network has an advantage on our multilingual dataset.
first_indexed 2025-11-14T11:43:55Z
format Journal Article
id curtin-20.500.11937-95202
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T11:43:55Z
publishDate 2023
publisher MDPI
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-95202 2024-07-03T00:52:34Z 2023 Journal Article http://hdl.handle.net/20.500.11937/95202 10.3390/bdcc7020061 http://creativecommons.org/licenses/by/4.0/ MDPI fulltext
title A Mixed Malay–English Language COVID-19 Twitter Dataset: A Sentiment Analysis
url http://hdl.handle.net/20.500.11937/95202