Advanced document analysis and automatic classification of PDF documents

This thesis explores the domain of document analysis and document classification within the PDF document environment. The main focus is the creation of a document classification technique which can identify the logical class of a PDF document and so provide necessary information to document class spe...

Bibliographic Details
Main Author: Lovegrove, Will.
Format: Thesis (University of Nottingham only)
Language: English
Published: 1996
Online Access: https://eprints.nottingham.ac.uk/13967/
_version_ 1848791848394424320
author Lovegrove, Will.
author_facet Lovegrove, Will.
author_sort Lovegrove, Will.
building Nottingham Research Data Repository
collection Online Access
description This thesis explores the domain of document analysis and document classification within the PDF document environment. The main focus is the creation of a document classification technique which can identify the logical class of a PDF document and so provide necessary information to document-class-specific algorithms (such as document understanding techniques). The thesis describes a page decomposition technique which is tailored to render the information contained in an unstructured PDF file into a set of blocks. The new technique is based on published research but contains many modifications which enable it to competently analyse the internal document model of PDF documents. A new level of document processing is presented: advanced document analysis. The aim of advanced document analysis is to extract information from the PDF file which can be used to help identify the logical class of that PDF file. A blackboard framework is used in a process of block labelling in which the blocks created by earlier segmentation techniques are classified into one of eight basic categories. The blackboard's knowledge sources are programmed to find recurring patterns amongst the document's blocks and formulate document-specific heuristics which can be used to tag those blocks. Meaningful document features are found from three information sources: a statistical evaluation of the document's aesthetic components, a logic-based evaluation of the labelled document blocks, and an appearance-based evaluation of the labelled document blocks. The features are used to train and test a neural net classification system which identifies the recurring patterns amongst these features for four basic document classes: newspapers, brochures, forms and academic documents. In summary, this thesis shows that it is possible to classify a PDF document (which is logically unstructured) into a basic logical document class. This has important ramifications for document processing systems, which have traditionally relied upon a priori knowledge of the logical class of the document they are processing.
first_indexed 2025-11-14T18:35:02Z
format Thesis (University of Nottingham only)
id nottingham-13967
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T18:35:02Z
publishDate 1996
recordtype eprints
repository_type Digital Repository
spelling nottingham-13967 2025-02-28T11:28:07Z https://eprints.nottingham.ac.uk/13967/ Advanced document analysis and automatic classification of PDF documents Lovegrove, Will. 1996 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/13967/1/336930.pdf Lovegrove, Will. (1996) Advanced document analysis and automatic classification of PDF documents. PhD thesis, University of Nottingham.
spellingShingle Lovegrove, Will.
Advanced document analysis and automatic classification of PDF documents
title Advanced document analysis and automatic classification of PDF documents
title_full Advanced document analysis and automatic classification of PDF documents
title_fullStr Advanced document analysis and automatic classification of PDF documents
title_full_unstemmed Advanced document analysis and automatic classification of PDF documents
title_short Advanced document analysis and automatic classification of PDF documents
title_sort advanced document analysis and automatic classification of pdf documents
url https://eprints.nottingham.ac.uk/13967/