Document analysis of PDF files: methods, results and implications
A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackbo...
| Main Authors: | Lovegrove, William S.; Brailsford, David F. |
|---|---|
| Format: | Article |
| Published: | John Wiley Ltd 1995 |
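| Subjects: | Document analysis; Document understanding; Blackboard methods; Geometric structure; Logical structure; PDF; PostScript |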
| Online Access: | https://eprints.nottingham.ac.uk/300/ |
| _version_ | 1848790389952086016 |
|---|---|
| author | Lovegrove, William S.; Brailsford, David F. |
| author2 | Brailsford, David F. |
| author_facet | Lovegrove, William S.; Brailsford, David F. |
| author_sort | Lovegrove, William S. |
| building | Nottingham Research Data Repository |
| collection | Online Access |
| description | A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the semantic gap between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding. |
| first_indexed | 2025-11-14T18:11:51Z |
| format | Article |
| id | nottingham-300 |
| institution | University of Nottingham Malaysia Campus |
| institution_category | Local University |
| last_indexed | 2025-11-14T18:11:51Z |
| publishDate | 1995 |
| publisher | John Wiley Ltd |
| recordtype | eprints |
| repository_type | Digital Repository |
| spelling | nottingham-300 2020-05-04T20:33:36Z https://eprints.nottingham.ac.uk/300/ Document analysis of PDF files: methods, results and implications Lovegrove, William S. Brailsford, David F. A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the semantic gap between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding. John Wiley Ltd Brailsford, David F. Furuta, Richard K. 1995 Article PeerReviewed Lovegrove, William S. and Brailsford, David F. (1995) Document analysis of PDF files: methods, results and implications. Electronic Publishing -- Origination, Dissemination and Design, 8 (3). pp. 207-220. Document analysis Document understanding Blackboard methods Geometric structure Logical structure PDF PostScript |
| spellingShingle | Document analysis Document understanding Blackboard methods Geometric structure Logical structure PostScript Lovegrove, William S. Brailsford, David F. Document analysis of PDF files: methods, results and implications |
| title | Document analysis of PDF files: methods, results and implications |
| title_full | Document analysis of PDF files: methods, results and implications |
| title_fullStr | Document analysis of PDF files: methods, results and implications |
| title_full_unstemmed | Document analysis of PDF files: methods, results and implications |
| title_short | Document analysis of PDF files: methods, results and implications |
| title_sort | document analysis of pdf files: methods, results and implications |
| topic | Document analysis; Document understanding; Blackboard methods; Geometric structure; Logical structure; PostScript |
| url | https://eprints.nottingham.ac.uk/300/ |