Using wordnet to enhance feature selection in automated text categorization

In the field of automated text categorization, the large dimensionality of the feature space is a major problem because it involves extensive computation. Feature selection is one approach to reducing the dimensionality of the feature space. This research explores the use of WordNet (Miller et al...

Bibliographic Details
Main Author: Chua, Stephanie Hui Li
Format: Thesis
Language: English
Published: Universiti Malaysia Sarawak, (UNIMAS) 2004
Subjects: T Technology (General)
Online Access: http://ir.unimas.my/id/eprint/12604/
http://ir.unimas.my/id/eprint/12604/2/Stephanie%20Chua%20Hui%20Li%20ft.pdf
author Chua, Stephanie Hui Li
building UNIMAS Institutional Repository
collection Online Access
description In the field of automated text categorization, the large dimensionality of the feature space is a major problem because it involves extensive computation. Feature selection is one approach to reducing the dimensionality of the feature space. This research explores the use of WordNet (Miller et al., 1990), a lexical database, for performing feature selection in an automated text categorization system. The WordNet-based approach employs lexical and semantic information for feature selection. WordNet allows the selection of terms that are lexically and semantically representative of a category of documents, as opposed to the statistical approaches traditionally used for feature selection. We proposed three WordNet-based approaches for feature selection. The first is the WordNet nouns approach, which selects all nouns in WordNet that occur in each category as features. The second is based on lexical semantics and selects synonymous terms that co-occur in a category. The third combines the lexical semantics approach with statistical feature selection methods. In experiments on the Reuters-21578 dataset, the lexical semantics approach performed better than the WordNet nouns approach, with more than 40% reduction in the feature space. The lexical semantics approach also outperformed popular statistical feature selection methods, namely Chi-Square (Chi2) and Information Gain (IG). The combined approach improved the performance of the statistical methods. WordNet has successfully been used to enhance feature selection, highlighting the possibility of determining semantic features automatically. The limitations of the lexical semantics approach are also highlighted, and an improved framework and an extension are proposed to overcome them.
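The lexical semantics approach described in the abstract (selecting synonymous terms that co-occur in a category) can be illustrated with a minimal sketch. This is not the thesis's implementation: the `TOY_SYNSETS` list is a hypothetical stand-in for WordNet synsets, the function names and sample documents are invented for illustration, and the selection rule (keep a term only if one of its synonyms also appears in the same category) is one assumed reading of the abstract.

```python
from collections import defaultdict

# Hypothetical stand-in for WordNet synsets: each set groups terms
# that are treated as synonyms. A real system would query WordNet.
TOY_SYNSETS = [
    {"profit", "earnings", "gain"},
    {"company", "firm"},
    {"wheat", "grain"},
]


def synonym_lookup(synsets):
    """Map each term to the set of its synonyms (excluding itself)."""
    lookup = defaultdict(set)
    for synset in synsets:
        for term in synset:
            lookup[term] |= synset - {term}
    return lookup


def select_features(docs_by_category, synsets):
    """For each category, keep the terms that have at least one
    synonym also occurring somewhere in that category's documents."""
    lookup = synonym_lookup(synsets)
    selected = {}
    for category, docs in docs_by_category.items():
        # Category vocabulary from naive whitespace tokenization (an assumption).
        vocab = {tok for doc in docs for tok in doc.lower().split()}
        selected[category] = sorted(t for t in vocab if lookup[t] & vocab)
    return selected


# Invented sample documents, grouped by category label.
docs = {
    "earn": ["company reports profit", "firm earnings rise"],
    "grain": ["wheat exports rise", "company ships wheat"],
}
features = select_features(docs, TOY_SYNSETS)
print(features)
```

In this toy data the "earn" category keeps four terms because "company"/"firm" and "profit"/"earnings" co-occur as synonym pairs, while the "grain" category keeps none because no synonym pair appears together. That shows how such a rule can shrink the feature space relative to keeping every term.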
first_indexed 2025-11-15T06:36:25Z
format Thesis
id unimas-12604
institution Universiti Malaysia Sarawak
institution_category Local University
language English
last_indexed 2025-11-15T06:36:25Z
publishDate 2004
publisher Universiti Malaysia Sarawak, (UNIMAS)
recordtype eprints
repository_type Digital Repository
spelling unimas-12604 2025-06-17T04:34:50Z Thesis NonPeerReviewed text en Chua, Stephanie Hui Li (2004) Using wordnet to enhance feature selection in automated text categorization. Masters thesis, Universiti Malaysia Sarawak, (UNIMAS).
title Using wordnet to enhance feature selection in automated text categorization
topic T Technology (General)
url http://ir.unimas.my/id/eprint/12604/
http://ir.unimas.my/id/eprint/12604/2/Stephanie%20Chua%20Hui%20Li%20ft.pdf