Increasing the effectiveness of system-based evaluation for information retrieval systems / Prabha Rajagopal
| Main Author: | Prabha Rajagopal |
|---|---|
| Format: | Thesis |
| Published: | 2018 |
| Subjects: | |
| Online Access: | http://studentsrepo.um.edu.my/8916/ http://studentsrepo.um.edu.my/8916/1/Prabha.pdf http://studentsrepo.um.edu.my/8916/9/prabha.pdf |
Summary: Evaluation of information retrieval systems is necessary to measure and quantify effectiveness, to assess user satisfaction with and acceptance of retrieval systems, and to compare the performance of retrieval systems. Relevance judgments, system rankings, and statistical significance testing are essential components of this evaluation. This thesis makes several contributions to test-collection-based evaluation of information retrieval systems across three experiments.

The first experiment explored the effort users need to retrieve relevant content from documents. Real users give up easily and do not invest as much effort as expert judges when retrieving relevant content, and it was unknown whether deeper evaluation and wider groups of systems show variation in system rankings due to effort. The experiment aimed to generate low-effort relevance judgments systematically, to determine how system rankings vary when evaluated at different depths and over different groups of systems, and to explore the effectiveness of evaluating retrieval systems using low-effort relevance judgments with reduced topic set sizes. Low-effort relevance judgments are generated using a boxplot approach and standardized readability grades. The findings reveal variation in system rankings across evaluation depths and groups of systems, while evaluation with reduced topic set sizes shows differing outcomes.
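The abstract does not spell out the boxplot procedure. One plausible reading is that an effort proxy, such as a document's readability grade, is screened with the standard 1.5 × IQR whisker rule, and relevant documents demanding too much effort are downgraded. The Python sketch below illustrates that reading only; the function name, data shapes, and whisker rule are assumptions, not the thesis's actual method.

```python
# Minimal sketch of a boxplot-style filter for deriving "low effort"
# relevance judgments. The effort proxy (a readability grade per document)
# and the 1.5 * IQR upper-whisker rule are assumptions: the abstract only
# names "boxplot approach and standardized readability grades".
from statistics import quantiles

def low_effort_qrels(qrels, readability):
    """qrels: {(topic, doc): relevance}; readability: {doc: grade level}."""
    grades = sorted(readability.values())
    q1, _, q3 = quantiles(grades, n=4, method="inclusive")  # quartiles
    upper_whisker = q3 + 1.5 * (q3 - q1)                    # boxplot threshold
    # A relevant document stays relevant only if reading it demands
    # "low" effort, i.e. its grade lies inside the boxplot whisker.
    return {(topic, doc): (rel if readability[doc] <= upper_whisker else 0)
            for (topic, doc), rel in qrels.items()}

# Hypothetical data: "d3" needs a far higher reading grade than its peers,
# so its positive judgment is downgraded in the low-effort qrels.
qrels = {("t1", "d1"): 1, ("t1", "d2"): 1, ("t1", "d3"): 1,
         ("t1", "d4"): 0, ("t1", "d5"): 1}
readability = {"d1": 8.0, "d2": 9.0, "d3": 16.0, "d4": 7.5, "d5": 8.5}
print(low_effort_qrels(qrels, readability))  # ("t1", "d3") drops to 0
```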
second experiment explored issues on reliability of system rankings. Evaluation of system
rankings set indicates the overall reliability but not for individual systems. Evaluation by
combination of metrics signifies its versatility in fulfilling different user models. The
experimentation aims to propose an approach to evaluate the reliability of individual
system rankings, determine suitable combination of metrics, understand generalization of
system ranking reliability to other similar metrics, identify the original systems with
reliable system rankings, and validate the proposed approach. The proposed intraclass correlation coefficient approach measures the reliability of individual system rankings
using relative topic ranks. The average precision and rank-biased precision metrics are
recommended for measuring reliability of individual system rankings. Most experimented
metrics combinations generalize well. Highly reliable systems comprise of top and mid
performing systems from the original systems ranking. Also, a strong correlation
coefficient between system rankings of original and proposed approach validates the
proposed reliability measurement of individual retrieval system rankings. The third
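The abstract names intraclass correlation over relative topic ranks but not the exact ICC form. The sketch below assumes systems are ranked within each topic by a per-topic metric score (e.g., average precision) and that a two-way consistency ICC, ICC(3,1), is computed over the resulting systems-by-topics rank matrix; both the function names and the choice of ICC variant are assumptions.

```python
# Sketch of an ICC over relative topic ranks. Assumes a two-way consistency
# ICC, ICC(3,1), with systems as "subjects" and topics as "raters"; the
# thesis's exact ICC variant and per-system formulation are not stated
# in the abstract.
import numpy as np

def relative_topic_ranks(scores):
    """scores: (n_systems, n_topics) per-topic metric scores (e.g. AP, RBP).
    Returns each system's rank within every topic, 1 = best (ties broken
    arbitrarily in this sketch)."""
    return (-scores).argsort(axis=0).argsort(axis=0) + 1

def icc_consistency(ranks):
    """ICC(3,1) from the two-way ANOVA mean squares of the rank matrix."""
    x = np.asarray(ranks, dtype=float)
    n, k = x.shape
    row_means, col_means, grand = x.mean(axis=1), x.mean(axis=0), x.mean()
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between systems
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))         # residual
    return (msr - mse) / (msr + (k - 1) * mse)

# Hypothetical scores: 5 systems over 10 topics. A high ICC means the topics
# agree on where each system stands, i.e. its ranking is reliable.
rng = np.random.default_rng(0)
ap_scores = rng.random((5, 10))
print(round(icc_consistency(relative_topic_ranks(ap_scores)), 3))
```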
The third experiment explored the use of averaged or cut-off topic scores for statistical significance testing. The precision-at-k metric produces varying user experiences, while the total count of relevant documents required by average precision is infeasible to obtain on the ever-changing Web. The experiment aimed to propose an approach that overcomes the inaccuracy of averaged or cut-off topic scores in statistical significance testing, to identify a suitable sample size, and to validate the effectiveness of the proposed approach. The approach uses indivisible document-level scores for statistical significance testing. Using document-level scores produced a higher number of statistically significant system pairs than the existing method. Suitable sample size selection is necessary for reliable results, and a high percentage of agreement between the proposed and existing methods demonstrates the effectiveness of the document-level approach.
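As an illustration of the document-level idea, the sketch below decomposes rank-biased precision into per-document contributions, pools them across topics, and runs a significance test on the pooled sample instead of on per-topic averages. Pairing scores by rank position and using a paired t-test are assumptions; the abstract does not say which test or pairing the thesis uses.

```python
# Sketch of significance testing on indivisible document-level scores.
# RBP is decomposed into per-document contributions (1 - p) * p**(rank - 1);
# pairing by rank position and the paired t-test are assumptions, since the
# abstract does not name the test used.
import numpy as np
from scipy.stats import ttest_rel

def rbp_doc_scores(rels, p=0.8):
    """Per-document RBP contributions for one ranked list (binary rels)."""
    rels = np.asarray(rels, dtype=float)
    return rels * (1 - p) * p ** np.arange(len(rels))

# Hypothetical runs: per-topic binary relevance at each rank, truncated to
# the same depth so document-level scores can be paired by (topic, rank).
run_a = {"t1": [1, 0, 1, 0, 0], "t2": [0, 1, 0, 0, 1]}
run_b = {"t1": [1, 1, 0, 0, 0], "t2": [1, 0, 0, 1, 0]}

topics = sorted(run_a)
a = np.concatenate([rbp_doc_scores(run_a[t]) for t in topics])
b = np.concatenate([rbp_doc_scores(run_b[t]) for t in topics])

# The pooled document-level sample (topics * depth points) is far larger
# than the topic count, which is what gives the approach its extra power.
t_stat, p_value = ttest_rel(a, b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```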