Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology

Forums (or discussion boards) represent a huge information collection structured under different boards, threads and posts. The actual information entity of a forum is a post, which has the information about authors, date and time of post, actual content etc. This information is significant for a nu...

Bibliographic Details
Main Authors: Sarencheh, S., Potdar, Vidyasagar, Yeganeh, E., Firoozeh, N.
Other Authors: David Taniar
Format: Book Chapter
Published: Springer 2010
Subjects: Information extraction, Forum
Online Access: http://hdl.handle.net/20.500.11937/11240
_version_ 1848747752440201216
author Sarencheh, S.
Potdar, Vidyasagar
Yeganeh, E.
Firoozeh, N.
author2 David Taniar
author_facet David Taniar
Sarencheh, S.
Potdar, Vidyasagar
Yeganeh, E.
Firoozeh, N.
author_sort Sarencheh, S.
building Curtin Institutional Repository
collection Online Access
description Forums (or discussion boards) are large collections of information organized into boards, threads, and posts. The basic information unit of a forum is the post, which records the author, the date and time of posting, the content, and so on. This information is valuable for applications such as gathering market intelligence and analyzing customer perceptions. However, automatically extracting it from a forum is extremely challenging. Several customized parsers exist for extracting information from a particular forum platform with a specific template (e.g. SMF or phpBB), but these parsers depend on both the platform and the template used, which makes them impractical in real-world settings. Hence, in this paper we propose a semi-automatic, rule-based solution for extracting forum post information and inserting it into a database for analysis. The key challenge with this solution is identifying the extraction rules, which are normally specific to the forum platform and template. We therefore analyzed 100 forums to derive these rules and to test the performance of the algorithm. The results indicate that we were able to extract all the required information from the SMF and phpBB forum platforms, which represent the majority of forums on the web.
first_indexed 2025-11-14T06:54:08Z
format Book Chapter
id curtin-20.500.11937-11240
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T06:54:08Z
publishDate 2010
publisher Springer
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-112402022-12-09T07:12:37Z Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology Sarencheh, S. Potdar, Vidyasagar Yeganeh, E. Firoozeh, N. David Taniar Osvaldo Gervasi Beniamino Murgante Eric Pardede Bernady O Apduhan Information extraction Forum Forums (or discussion boards) represent a huge information collection structured under different boards, threads and posts. The actual information entity of a forum is a post, which has the information about authors, date and time of post, actual content etc. This information is significant for a number of applications like gathering market intelligence, analyzing customer perceptions etc. However automatically extracting this information from a forum is an extremely challenging task. There are several customized parsers designed for extracting information from a particular forum platform with a specific template (e.g. SMF or phpBB), however the problem with this approach is that these parsers are dependent upon the forum platform and the template used, which makes it unrealistic to use in practical situations. Hence, in this paper we propose a semi-automatic rule based solution for extracting forum post information and inserting the extracted information to a database for the purpose of analysis. The key challenge with this solution is identifying extraction rules, which are normally forum platform and forum template specific. As a result we analyzed 100 forums to derive these rules and test the performance of the algorithm. The results indicate that we were able to extract all the required information from SMF and phpBB forum platforms, which represent the majority of forums on the web. 2010 Book Chapter http://hdl.handle.net/20.500.11937/11240 Springer restricted
spellingShingle Information extraction
Forum
Sarencheh, S.
Potdar, Vidyasagar
Yeganeh, E.
Firoozeh, N.
Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology
title Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology
title_full Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology
title_fullStr Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology
title_full_unstemmed Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology
title_short Semi-Automatic Information Extraction from Discussion Boards with Applications for Anti-Spam Technology
title_sort semi-automatic information extraction from discussion boards with applications for anti-spam technology
topic Information extraction
Forum
url http://hdl.handle.net/20.500.11937/11240