Information extraction from hypertext mark-up language Web pages.

Bibliographic Details
Main Authors:	Shaker, Mahmoud; Ibrahim, Hamidah; Mustapha, Aida; Abdullah, Lili Nurliyana
Format:	Article
Language:	English
Published:	Science Publications 2009
Subjects:	Hypertext systems. HTML (Document markup language). Information Storage and Retrieval.
Online Access:	http://psasir.upm.edu.my/id/eprint/15226/

building	UPM Institutional Repository
collection	Online Access
description	Abstract: Problem statement: Many users now rely on web search engines to find and gather information, and they face a growing number of diverse HTML information sources. Correlating, integrating and presenting related information to users has therefore become important. When a user queries a search engine such as Yahoo or Google for specific information, the results report not only where the desired information is available but also other pages on which it is mentioned, so the number of returned pages is enormous. Consequently, the performance capabilities of web search engines, the overlap among their results for the same queries and their limitations form a large and important area of research. Extracting information from web pages has also become very important, because the massive and growing volume of diverse HTML information sources available on the internet, together with the variety of web pages, makes information extraction from the web a challenging problem. Approach: This study proposed an approach for extracting relevant information from different HTML web pages based on standard classifications. Results: The proposed approach was evaluated in experiments on a number of web pages from different domains and achieved an increase in precision and F-measure, along with a decrease in recall. Conclusion: The experiments demonstrated that the approach extracted the attributes, the sub-attributes that describe them and the values of those sub-attributes from various web pages. The approach was also able to extract attributes that appear under different names on some of the web pages.
first_indexed	2025-11-15T08:01:58Z
format	Article
id	upm-15226
institution	Universiti Putra Malaysia
institution_category	Local University
language	English
last_indexed	2025-11-15T08:01:58Z
publishDate	2009
publisher	Science Publications
recordtype	eprints
repository_type	Digital Repository
spelling	upm-15226 2013-06-25T04:13:41Z http://psasir.upm.edu.my/id/eprint/15226/ Science Publications 2009 Article PeerReviewed Shaker, Mahmoud and Ibrahim, Hamidah and Mustapha, Aida and Abdullah, Lili Nurliyana (2009) Information extraction from hypertext mark-up language Web pages. Journal of Computer Science, 5 (8). pp. 596-607. ISSN 1549-3636
title	Information extraction from hypertext mark-up language Web pages.
topic	Hypertext systems. HTML (Document markup language). Information Storage and Retrieval.
url	http://psasir.upm.edu.my/id/eprint/15226/

Similar Items