Advanced document analysis and automatic classification of PDF documents

Lovegrove, Will. (1996) Advanced document analysis and automatic classification of PDF documents. PhD thesis, University of Nottingham.

Available under Licence All Rights Reserved.

Abstract

This thesis explores the domain of document analysis and document classification within the PDF document environment. The main focus is the creation of a document classification technique which can identify the logical class of a PDF document and so provide necessary information to document-class-specific algorithms (such as document understanding techniques).

The thesis describes a page decomposition technique which is tailored to render the information contained in an unstructured PDF file into a set of blocks. The new technique is based on published research but contains many modifications which enable it to competently analyse the internal document model of PDF documents.

A new level of document processing is presented: advanced document analysis. The aim of advanced document analysis is to extract information from the PDF file which can be used to help identify the logical class of that PDF file. A blackboard framework is used in a process of block labelling in which the blocks created from earlier segmentation techniques are classified into one of eight basic categories. The blackboard's knowledge sources are programmed to find recurring patterns amongst the document's blocks and formulate document-specific heuristics which can be used to tag those blocks.
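The block-labelling idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the eight block categories, the knowledge sources, or any data structures, so `Block`, `Blackboard`, the two knowledge sources, and the labels "heading" and "text" are hypothetical. The sketch only shows the general blackboard pattern, in which independent knowledge sources read a shared state, derive a document-specific heuristic (here, the most common font size is treated as body text), and use it to tag blocks.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    """A page block; the fields are hypothetical, not taken from the thesis."""
    text: str
    font_size: float
    label: Optional[str] = None


@dataclass
class Blackboard:
    """Shared state that every knowledge source reads and updates."""
    blocks: list
    heuristics: dict = field(default_factory=dict)


def body_font_source(bb):
    """Derive a document-specific heuristic: the most frequent font size
    in this document is assumed to be the body-text size."""
    if "body_size" in bb.heuristics:
        return False
    sizes = Counter(b.font_size for b in bb.blocks)
    bb.heuristics["body_size"] = sizes.most_common(1)[0][0]
    return True


def label_source(bb):
    """Tag unlabelled blocks once the body-size heuristic is available."""
    body = bb.heuristics.get("body_size")
    if body is None:
        return False
    changed = False
    for b in bb.blocks:
        if b.label is None:
            # Hypothetical labels; larger than body text counts as a heading.
            b.label = "heading" if b.font_size > body else "text"
            changed = True
    return changed


def run(bb, sources):
    """Keep invoking knowledge sources until none of them can contribute."""
    progress = True
    while progress:
        progress = False
        for source in sources:
            if source(bb):
                progress = True
    return bb
```

The control loop does not depend on the order of the knowledge sources: a source that cannot yet contribute simply returns `False` and is tried again on the next pass, which is the property that makes a blackboard arrangement convenient when heuristics build on one another.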

Meaningful document features are found from three information sources: a statistical evaluation of the document's aesthetic components, a logic-based evaluation of the labelled document blocks, and an appearance-based evaluation of the labelled document blocks. The features are used to train and test a neural net classification system which identifies the recurring patterns amongst these features for four basic document classes: newspapers, brochures, forms and academic documents.
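The classification step can be sketched as follows, under stated assumptions. The abstract does not give the network architecture, the feature definitions, or any data, so this example substitutes a single-layer softmax network (the simplest neural classifier) trained by batch gradient descent, and the four-element feature vectors in `X_TOY` are synthetic placeholders, not results from the thesis. Only the four class names come from the abstract.

```python
import math

# The four document classes named in the abstract.
CLASSES = ["newspaper", "brochure", "form", "academic"]

# Synthetic placeholder feature vectors (assumption: one per class).
X_TOY = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
Y_TOY = [0, 1, 2, 3]  # index into CLASSES for each row of X_TOY


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def forward(W, b, x):
    """Class probabilities for one feature vector."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk
              for row, bk in zip(W, b)]
    return softmax(logits)


def train(X, y, epochs=200, lr=0.5):
    """Batch gradient descent on the cross-entropy loss."""
    n_feat, n_cls = len(X[0]), len(CLASSES)
    W = [[0.0] * n_feat for _ in range(n_cls)]
    b = [0.0] * n_cls
    for _ in range(epochs):
        gW = [[0.0] * n_feat for _ in range(n_cls)]
        gb = [0.0] * n_cls
        for x, target in zip(X, y):
            p = forward(W, b, x)
            for k in range(n_cls):
                err = p[k] - (1.0 if k == target else 0.0)
                gb[k] += err
                for j in range(n_feat):
                    gW[k][j] += err * x[j]
        for k in range(n_cls):
            b[k] -= lr * gb[k] / len(X)
            for j in range(n_feat):
                W[k][j] -= lr * gW[k][j] / len(X)
    return W, b


def predict(W, b, x):
    """Return the name of the most probable class."""
    p = forward(W, b, x)
    return CLASSES[p.index(max(p))]
```

A call such as `W, b = train(X_TOY, Y_TOY)` followed by `predict(W, b, x)` maps a feature vector to one of the four class names. A real system would use far more features and training documents, and most likely hidden layers.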

In summary, this thesis shows that it is possible to classify a PDF document (which is logically unstructured) into a basic logical document class. This has important ramifications for document processing systems, which have traditionally relied upon a priori knowledge of the logical class of the document they are processing.

Item Type:	Thesis (University of Nottingham only) (PhD)
Supervisors:	Harrison, Leon Elliman, David
Subjects:	Q Science > QA Mathematics > QA 75 Electronic computers. Computer science
Faculties/Schools:	UK Campuses > Faculty of Science > School of Mathematical Sciences
Item ID:	13967
Depositing User:	EP, Services
Date Deposited:	07 Feb 2014 10:47
Last Modified:	28 Feb 2025 11:28
URI:	https://eprints.nottingham.ac.uk/id/eprint/13967
