Term frequency with average term occurrences for textual information retrieval

In the context of Information Retrieval (IR) from text documents, the term-weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model (VSM). In this paper we propose a new TWS that is based on computing the average term occurrences of terms in documents and...


Bibliographic Details
Main Authors: Ibrahim, O., Landa-Silva, Dario
Format: Article
Published: Springer 2016
Subjects:
Online Access: https://eprints.nottingham.ac.uk/31296/
author Ibrahim, O.
Landa-Silva, Dario
building Nottingham Research Data Repository
collection Online Access
description In the context of Information Retrieval (IR) from text documents, the term-weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model (VSM). In this paper we propose a new TWS that is based on computing the average term occurrences of terms in documents; it also uses a discriminative approach based on the document centroid vector to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, as achieving that is expensive and may be infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposal is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring previous knowledge about relevance judgements. We compare the performance of the proposed TF-ATO to the well-known TF-IDF approach and show that using TF-ATO results in better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-words removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-words removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, it is shown that using the proposed discriminative approach is beneficial for improving IR effectiveness and performance with no information on the relevance judgements for the collection.
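As a rough illustration of the weighting idea described in the abstract, the sketch below computes each term weight as the raw term frequency divided by the average term occurrence in the document, then drops weights that fall below the matching entry of the collection centroid. This record does not give the exact formulas, so the definitions used here are assumptions for illustration only: the average is taken over the distinct terms of a document, the centroid is the mean weight vector over all documents, and the names `tf_ato_weights`, `centroid` and `apply_centroid_filter` are hypothetical.

```python
from collections import Counter


def tf_ato_weights(tokens):
    """Assumed TF-ATO weight: tf divided by the average occurrences per distinct term."""
    counts = Counter(tokens)
    if not counts:
        return {}
    ato = sum(counts.values()) / len(counts)  # average term occurrence in this document
    return {term: tf / ato for term, tf in counts.items()}


def centroid(weighted_docs):
    """Mean weight of each term over all documents (a missing term counts as 0)."""
    n = len(weighted_docs)
    totals = Counter()
    for doc in weighted_docs:
        for term, weight in doc.items():
            totals[term] += weight
    return {term: total / n for term, total in totals.items()}


def apply_centroid_filter(weighted_docs):
    """Assumed discriminative step: keep a weight only if it reaches the centroid weight."""
    c = centroid(weighted_docs)
    return [
        {term: weight for term, weight in doc.items() if weight >= c[term]}
        for doc in weighted_docs
    ]


# Toy collection of two tokenised documents (made-up data).
docs = [
    "retrieval retrieval model".split(),
    "model term weight".split(),
]
weights = [tf_ato_weights(d) for d in docs]
filtered = apply_centroid_filter(weights)
```

In this toy collection the term `model` is removed from the first document because its weight there is below the collection-wide mean, while `retrieval` is kept. Note that the filter needs no relevance judgements, which is the property the abstract emphasises.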
first_indexed 2025-11-14T19:11:56Z
format Article
id nottingham-31296
institution University of Nottingham Malaysia Campus
institution_category Local University
last_indexed 2025-11-14T19:11:56Z
publishDate 2016
publisher Springer
recordtype eprints
repository_type Digital Repository
spelling nottingham-31296 2020-05-04T20:01:43Z https://eprints.nottingham.ac.uk/31296/ Springer 2016-08 Article PeerReviewed Ibrahim, O. and Landa-Silva, Dario (2016) Term frequency with average term occurrences for textual information retrieval. Soft Computing, 20 (8). pp. 3045-3061. ISSN 1433-7479 http://link.springer.com/article/10.1007/s00500-015-1935-7 doi:10.1007/s00500-015-1935-7
title Term frequency with average term occurrences for textual information retrieval
topic Heuristic term-weighting scheme
Random term weights
Textual information retrieval
Discriminative approach
Stop-words removal
url https://eprints.nottingham.ac.uk/31296/