Enhancing the searchability of page-image PDF documents using an aligned hidden layer from a truth text

The search accuracy achieved in a PDF image-plus-hidden-text (PDF-IT) document depends upon the accuracy of the optical character recognition (OCR) process that produced the searchable hidden text layer. In many cases recognising words in a blurred area of a PDF page image may exceed the capabiliti...


Bibliographic Details
Main Authors:	Knight, Ian A.; Brailsford, David F.
Format:	Conference or Workshop Item
Language:	English
Published:	2016
Subjects:	PDF; OCR; Tesseract; Searchability; truth text
Online Access:	https://eprints.nottingham.ac.uk/45753/

author	Knight, Ian A.; Brailsford, David F.
building	Nottingham Research Data Repository
collection	Online Access
description	The search accuracy achieved in a PDF image-plus-hidden-text (PDF-IT) document depends upon the accuracy of the optical character recognition (OCR) process that produced the searchable hidden text layer. In many cases recognising words in a blurred area of a PDF page image may exceed the capabilities of an OCR engine. This paper describes a project to replace an inadequate hidden textual layer of a PDF-IT file with a more accurate hidden layer produced from a 'truth text'. The alignment of the truth text with the image is guided by using OCR-provided page-image co-ordinates, for those glyphs that are correctly recognised, as a set of fixed location points between which other truth-text words can be inserted and aligned with blurred glyphs in the image. Results are presented to show the much-enhanced searchability of this new file when compared to that of the original file, which had an OCR-produced hidden layer with no truth-text enhancement.
first_indexed	2025-11-14T19:59:54Z
format	Conference or Workshop Item
id	nottingham-45753
institution	University of Nottingham Malaysia Campus
institution_category	Local University
language	English
last_indexed	2025-11-14T19:59:54Z
publishDate	2016
recordtype	eprints
repository_type	Digital Repository
spelling	nottingham-45753; 2017-10-13T04:16:08Z; https://eprints.nottingham.ac.uk/45753/; 2016-09-13; Conference or Workshop Item; PeerReviewed; application/pdf; en; https://eprints.nottingham.ac.uk/45753/1/final-shortpap.pdf; Knight, Ian A. and Brailsford, David F. (2016) Enhancing the searchability of page-image PDF documents using an aligned hidden layer from a truth text. In: DocEng '16 Proceedings of the 2016 ACM Symposium on Document Engineering, 13-16 September 2016, Vienna, Austria.; http://dl.acm.org/citation.cfm?id=2967157&CFID=983735475&CFTOKEN=75191762
title	Enhancing the searchability of page-image PDF documents using an aligned hidden layer from a truth text
topic	PDF; OCR; Tesseract; Searchability; truth text
url	https://eprints.nottingham.ac.uk/45753/
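
The description in this record outlines the alignment idea only in prose: words that the OCR engine recognised correctly act as fixed location points, and the remaining truth-text words are placed between them. The following is a minimal Python sketch of that idea, not the authors' implementation. The function name `align_truth_to_ocr`, the `(x0, y0, x1, y1)` box layout, the use of `difflib.SequenceMatcher` to find the matching words, and the proportional-width interpolation are all assumptions made for illustration, and the sketch treats the words as lying on a single text line.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in page-image co-ordinates.
Box = Tuple[float, float, float, float]


def align_truth_to_ocr(
    ocr_words: List[str],
    ocr_boxes: List[Box],
    truth_words: List[str],
) -> List[Optional[Box]]:
    """Return one box (or None) for each truth-text word.

    Truth words that match an OCR word exactly inherit its box and serve
    as anchors. Truth words lying between two anchors share the horizontal
    gap between them, in proportion to their character counts. Words
    before the first anchor or after the last one are left as None.
    """
    result: List[Optional[Box]] = [None] * len(truth_words)

    # Step 1: anchors are the word runs common to both sequences.
    matcher = SequenceMatcher(a=ocr_words, b=truth_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            result[block.b + k] = ocr_boxes[block.a + k]

    # Step 2: spread the unmatched truth words between neighbouring anchors.
    anchors = [i for i, box in enumerate(result) if box is not None]
    for left, right in zip(anchors, anchors[1:]):
        gap_words = truth_words[left + 1:right]
        if not gap_words:
            continue
        left_box, right_box = result[left], result[right]
        x_start, x_end = left_box[2], right_box[0]
        if x_end <= x_start:
            # Anchors are probably on different lines; this sketch skips them.
            continue
        y0 = min(left_box[1], right_box[1])
        y1 = max(left_box[3], right_box[3])
        total_chars = sum(len(w) for w in gap_words)
        cursor = x_start
        for offset, word in enumerate(gap_words, start=left + 1):
            width = (x_end - x_start) * len(word) / total_chars
            result[offset] = (cursor, y0, cursor + width, y1)
            cursor += width
    return result
```

A real pipeline would also have to handle line breaks, inter-word spacing and truth words that fall outside any pair of anchors, none of which this sketch attempts.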


Similar Items