Dropping down the maximum item set: improving the stylometric authorship attribution algorithm in the text mining for authorship investigation

Problem statement: Stylometric authorship attribution is a text mining approach that analyzes texts, e.g., novels and plays by famous authors, and tries to measure an author's style by choosing attributes that reflect that style, assuming that these writers...

Bibliographic Details
Main Authors: Mustafa, Tareef Kamil, Mustapha, Norwati, Azmi Murad, Masrah Azrifah, Sulaiman, Md. Nasir
Format: Article
Language: English
Published: Science Publications 2010
Subjects: Data mining; Text processing (Computer science)
Online Access: http://psasir.upm.edu.my/id/eprint/14139/
http://psasir.upm.edu.my/id/eprint/14139/1/Dropping%20down%20the%20maximum%20item%20set.pdf
_version_ 1848842310718062592
author Mustafa, Tareef Kamil
Mustapha, Norwati
Azmi Murad, Masrah Azrifah
Sulaiman, Md. Nasir
building UPM Institutional Repository
collection Online Access
description Problem statement: Stylometric authorship attribution is a text mining approach that analyzes texts, e.g., novels and plays by famous authors, and tries to measure an author's style by choosing attributes that reflect that style, on the assumption that each writer has a distinctive way of writing that no other writer shares; authorship attribution is thus the task of identifying the author of a given text. In this study, we propose an authorship attribution algorithm that improves the accuracy of the stylometric features of different professionals, so that they can be discriminated nearly as well as the fingerprints of different persons, using authorship attributes. Approach: The main aim of this study is to build an algorithm that supports decision making systems, enabling users to predict and choose the right author for an anonymous novel under consideration, by using a learning procedure that teaches the system the stylometric map of each author so that it behaves like an expert opinion. Stylometric Authorship Attribution (AA) usually relies on the frequent word as the best available attribute. Many studies have sought other useful attributes, but the frequent word still gives better results in research and experiments, and the best technique used so far is counting the bag-of-words with the maximum item set. Results: To improve AA techniques, we need a new pack of attributes and a new measurement tool. The first pack of attributes used in this study is the frequent pair, meaning a pair of words that always appear together. This attribute is clearly not new, but it has not been as successful as the frequent word when used with maximum item set counters. The word pair made some mistakes, as the experiment results show. Improving the winnow algorithm by combining it with the computational approach, using the CV statistical tool as a conditional threshold for attribute selection, reduced the frequent pair error from 50% to 0%, with a clearly higher score than the frequent word attribute. Conclusion/Recommendations: The improvement obtained with the new CV algorithm may allow several attributes that previously gave unsatisfactory results to be used, which might help in solving some hard cases that could not be solved until now.
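The attribute-selection idea in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the record does not confirm: "CV" is read here as the coefficient of variation, a "frequent pair" is taken to be two adjacent words, and the names pair_counts, stable_pairs and max_cv are hypothetical. The winnow step is not implemented, and this is not the authors' algorithm.

```python
from collections import Counter
from statistics import mean, pstdev


def pair_counts(text):
    """Count adjacent word pairs in a lowercased, whitespace-split text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def stable_pairs(chunks, max_cv=0.5):
    """Keep word pairs whose relative frequency is stable across text chunks.

    ASSUMPTION: "CV" is interpreted as the coefficient of variation
    (population standard deviation divided by mean); the threshold
    max_cv is a hypothetical parameter, not a value from the paper.
    """
    counts = [pair_counts(chunk) for chunk in chunks]
    # Total number of pairs per chunk (at least 1 to avoid division by zero).
    totals = [max(sum(c.values()), 1) for c in counts]
    candidates = set().union(*counts)  # every pair seen in any chunk
    selected = {}
    for pair in candidates:
        freqs = [c[pair] / t for c, t in zip(counts, totals)]
        mu = mean(freqs)
        if mu == 0:
            continue
        cv = pstdev(freqs) / mu
        if cv <= max_cv:  # conditional threshold for attribute selection
            selected[pair] = cv
    return selected


if __name__ == "__main__":
    chunks = ["the cat sat on the mat", "the cat ran to the door"]
    for pair, cv in sorted(stable_pairs(chunks).items()):
        print(pair, round(cv, 3))
```

Under these assumptions, a pair that occurs at a similar rate in every chunk of an author's text gets a low CV and is kept as a candidate attribute, while a pair that appears in only one chunk is dropped.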
first_indexed 2025-11-15T07:57:06Z
format Article
id upm-14139
institution Universiti Putra Malaysia
institution_category Local University
language English
last_indexed 2025-11-15T07:57:06Z
publishDate 2010
publisher Science Publications
recordtype eprints
repository_type Digital Repository
spelling upm-14139 2015-10-21T00:46:48Z http://psasir.upm.edu.my/id/eprint/14139/ Science Publications 2010 Article PeerReviewed application/pdf en Mustafa, Tareef Kamil and Mustapha, Norwati and Azmi Murad, Masrah Azrifah and Sulaiman, Md. Nasir (2010) Dropping down the maximum item set: improving the stylometric authorship attribution algorithm in the text mining for authorship investigation. Journal of Computer Science, 6 (3). pp. 235-243. ISSN 1549-3636
title Dropping down the maximum item set: improving the stylometric authorship attribution algorithm in the text mining for authorship investigation
topic Data mining
Text processing (Computer science)