Generating summary documents for a variable-quality PDF document collection

Bibliographic Details
Main Authors: Hughes, Jacob, Brailsford, David F., Bagley, Steven R., Adams, Clive E.
Format: Conference or Workshop Item
Published: 2014
Subjects: Schizophrenia; PDF; OCR; document collections
Online Access: https://eprints.nottingham.ac.uk/28168/
building Nottingham Research Data Repository
collection Online Access
description The Cochrane Schizophrenia Group’s Register of studies details all aspects of the effects of treating people with schizophrenia. It has been gathered over the last 20 years and consists of around 20,000 documents, overwhelmingly in PDF. Document collections of this sort – on a given theme but gathered from a wide range of sources – will generally have huge variability in the quality of the PDF, particularly with respect to the key property of text searchability. Summarising the results from the best of these papers, to allow evidence-based health care decision making, has so far been done by manually creating a summary document, starting from a visual inspection of the relevant PDF file. This labour-intensive process has resulted, to date, in only 4,000 of the papers being summarised – with enormous duplication of effort and with many issues around the validity and reliability of the data extraction. This paper describes a pilot project to provide a computer-assisted framework in which any of the PDF documents could be searched for the occurrence of some 8,000 keywords and key phrases. Once keyword tagging has been completed, the framework assists in the generation of a standard summary document, thereby greatly speeding up the production of these summaries. Early examples of the framework are described and its capabilities illustrated.
first_indexed 2025-11-14T19:01:35Z
format Conference or Workshop Item
id nottingham-28168
institution University of Nottingham Malaysia Campus
institution_category Local University
last_indexed 2025-11-14T19:01:35Z
publishDate 2014
recordtype eprints
repository_type Digital Repository
spelling nottingham-28168 2020-05-04T20:13:26Z https://eprints.nottingham.ac.uk/28168/ 2014-09 Conference or Workshop Item PeerReviewed Hughes, Jacob, Brailsford, David F., Bagley, Steven R. and Adams, Clive E. (2014) Generating summary documents for a variable-quality PDF document collection. In: ACM Symposium on Document Engineering (DocEng '14), 16-19 Sept 2014, Fort Collins, Colorado, USA. Schizophrenia; PDF; OCR; document collections http://dx.doi.org/10.1145/2644866.2644892
title Generating summary documents for a variable-quality PDF document collection
topic Schizophrenia; PDF; OCR; document collections
url https://eprints.nottingham.ac.uk/28168/