Document analysis of PDF files: methods, results and implications

A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackbo...

Bibliographic Details
Main Authors: Lovegrove, William S., Brailsford, David F.
Format: Article
Published: John Wiley Ltd 1995
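Subjects: Document analysis; Document understanding; Blackboard methods; Geometric structure; Logical structure; PDF; PostScript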
Online Access: https://eprints.nottingham.ac.uk/300/
_version_ 1848790389952086016
author Lovegrove, William S.
Brailsford, David F.
author2 Brailsford, David F.
author_facet Brailsford, David F.
Lovegrove, William S.
Brailsford, David F.
author_sort Lovegrove, William S.
building Nottingham Research Data Repository
collection Online Access
description A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the semantic gap between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.
first_indexed 2025-11-14T18:11:51Z
format Article
id nottingham-300
institution University of Nottingham Malaysia Campus
institution_category Local University
last_indexed 2025-11-14T18:11:51Z
publishDate 1995
publisher John Wiley Ltd
recordtype eprints
repository_type Digital Repository
spelling nottingham-300 2020-05-04T20:33:36Z https://eprints.nottingham.ac.uk/300/ John Wiley Ltd Brailsford, David F. Furuta, Richard K. 1995 Article PeerReviewed Lovegrove, William S. and Brailsford, David F. (1995) Document analysis of PDF files: methods, results and implications. Electronic Publishing -- Origination, Dissemination and Design, 8 (3). pp. 207-220.
title Document analysis of PDF files: methods, results and implications
topic Document analysis
Document understanding
Blackboard methods
Geometric structure
Logical structure
PDF
PostScript
url https://eprints.nottingham.ac.uk/300/