Classifying good and bad websites

Bibliographic Details
Main Author: Koo, Ee Woon
Format: Final Year Project Report / IMRAD
Language: English
Published: Universiti Malaysia Sarawak (UNIMAS), 2015
Subjects:
Online Access: http://ir.unimas.my/id/eprint/12117/
http://ir.unimas.my/id/eprint/12117/1/Koo.pdf
http://ir.unimas.my/id/eprint/12117/4/Koo%20full.pdf
Description
Summary: Website classification has become a vital subject as websites are increasingly used as platforms for various applications. Web pages often contain semi-structured content, which makes the classification process challenging. This paper addresses the use of machine learning techniques to classify good and bad websites. The classification process is simplified by using a set of features generated from HTML code. The performance of the 21 features was evaluated using three machine learning techniques: support vector machine (SVM), naïve Bayes, and nearest neighbor classifiers. The good and bad websites were distinguished by the set of features obtained by counting the HTML tags. A total of 200 websites were collected for the machine learning task. The results indicate that the features are useful for classification tasks, with an average accuracy of 80.50% for the SVM classifier, 77.00% for the naïve Bayes classifier, and 72.50% for the nearest neighbor classifier. The SVM classifier therefore achieved the highest accuracy of the three. This project illustrates that it is possible to classify websites as good or bad by using the underlying tags together with machine learning algorithms.
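
The tag-counting idea described in the summary can be illustrated with a minimal sketch. This is not the author's implementation: the tag list `TAGS` is hypothetical (the record does not name the 21 features), the two training pages are toy examples, and only a one-nearest-neighbor rule is shown because it needs nothing beyond the Python standard library. The SVM and naïve Bayes classifiers mentioned in the summary would typically come from a machine learning library.

```python
from collections import Counter
from html.parser import HTMLParser
import math

# Hypothetical tag list; the report's actual 21 features are not given in this record.
TAGS = ["a", "img", "script", "iframe", "form", "input", "div", "p"]


class TagCounter(HTMLParser):
    """Counts the opening tags seen while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        self.counts[tag] += 1


def tag_features(html):
    """Return one count per entry in TAGS, in the same order."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[t] for t in TAGS]


def nearest_neighbor(train, query):
    """Return the label of the training vector closest to query (Euclidean distance)."""
    closest = min(train, key=lambda item: math.dist(item[0], query))
    return closest[1]


# Toy pages standing in for labeled training data (assumed, not from the report).
good_page = '<div><p>Hello</p><a href="/about">About</a><img src="logo.png"></div>'
bad_page = ('<script></script><script></script><iframe src="x"></iframe>'
            '<form><input><input></form>')

train = [(tag_features(good_page), "good"), (tag_features(bad_page), "bad")]
query = tag_features('<iframe></iframe><script></script><form><input></form>')
print(nearest_neighbor(train, query))
```

With these toy inputs the query page shares its script, iframe, and form tags with the "bad" example, so the nearest-neighbor rule assigns it that label. A real experiment would use many labeled pages, such as the 200 collected in the report, and would hold some of them out to measure accuracy.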