Relevance detection and summarizing strategies identification algorithm using linguistic measures / Seyed Asadollah Abdiesfandani
| Main Author: | Seyed Asadollah Abdiesfandani |
|---|---|
| Format: | Thesis |
| Published: | 2016 |
| Subjects: | |
| Online Access: | http://studentsrepo.um.edu.my/6400/ http://studentsrepo.um.edu.my/6400/4/seyed.pdf |
| Summary: | Summarization is the process of selecting important information from a source text.
Summarizing strategies are the core of the cognitive processes involved in the
summarization activity. They comprise a set of conscious tasks that are
used to determine important information and extract the main idea of a source text.
In this research project, we conducted a study of students’ summaries. The findings
show that there is a strong relationship between students’ summary writing
proficiency and the summarizing strategies that they use. We then developed
a new algorithm to address the summarizing strategies identification problem. The
algorithm simulates two important tasks that human experts frequently perform
to identify the summarizing strategies used to produce summary sentences: 1) sentence
relevance identification; and 2) summarizing strategies identification.
The sentence relevance identification module uses a statistical approach such as the
vector space model (VSM) to represent sentences and computes the similarity between the
source sentences and the summary sentences using the cosine similarity measure. It then
integrates semantic and syntactic similarity measures in a linear equation to
capture meaning when comparing two sentences. The aim is to tell apart
two sentences that have the same surface form or share a similar
bag-of-words (BOW) but differ in meaning. The module also employs a
word semantic similarity measure to overcome the vocabulary mismatch problem
in sentence comparison. This measure bridges the lexical gap between semantically similar
contexts that are expressed in different wording. In addition, the sentence relevance
identification module requires some linguistic pre-processing, including part-of-speech
(POS) tagging, word stemming and stop-word removal.
The summarizing strategies identification module relies on a set of heuristic rules
and on statistical and linguistic methods, such as the position-based, title-based,
cue-phrase and word-frequency methods, to identify the summarizing strategies
employed by students.
To evaluate the algorithm, we conducted two experiments. In the first experiment, we
examined the functionality of the system, that is, whether it is able to identify the
summarizing strategies used by students in summary writing. The results show
that the system is able to identify several summarizing strategies, namely
deletion, sentence combination, paraphrase and topic sentence selection. The
system is also able to detect the copy-verbatim strategy, the strategy most commonly used
by students. In addition, the system can identify the four methods used in the topic sentence
selection strategy: 1) cue method;
2) title method; 3) keyword method; and 4) location method. In the second experiment,
we measured the performance of the algorithm against human judgment in
identifying the summarizing strategies, using precision, recall, F-measure and
accuracy. The experimental results show that the proposed algorithm achieved
acceptable results in comparison to human judgment, with an
average of 87% precision, 83% recall, 85% F-score and 82% accuracy. |
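The sentence relevance step described in the summary can be illustrated with a minimal Python sketch. It is not the thesis implementation: the stop-word list, the function names (`tokenize`, `cosine_similarity`, `combined_similarity`, `most_relevant`) and the weight `alpha` are assumptions made for illustration, and stemming, POS tagging and the word semantic similarity measure are left out. The sketch represents each sentence as a bag-of-words term-frequency vector, scores sentence pairs with cosine similarity, and shows one possible form of a linear combination of two similarity scores.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the summary does not give one).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "are", "and", "that", "on"}


def tokenize(sentence):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def combined_similarity(semantic, syntactic, alpha=0.5):
    """Linear blend of two similarity scores; alpha is a hypothetical weight."""
    return alpha * semantic + (1.0 - alpha) * syntactic


def most_relevant(summary_sentence, source_sentences):
    """Return (index, score) of the source sentence closest to the summary sentence."""
    scores = [cosine_similarity(summary_sentence, s) for s in source_sentences]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


source = [
    "Summarization selects important information from a source text.",
    "Students often copy sentences verbatim.",
]
index, score = most_relevant("Students copy sentences word for word.", source)
```

Here `index` points at the second source sentence, which shares three content words with the summary sentence. A pure bag-of-words score like this cannot separate sentences that share words but differ in meaning, which is the gap the summary says the combined semantic and syntactic measure is meant to close.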