Enhancing the searchability of page-image PDF documents using an aligned hidden layer from a truth text

Knight, Ian A. and Brailsford, David F. (2016) Enhancing the searchability of page-image PDF documents using an aligned hidden layer from a truth text. In: DocEng '16 Proceedings of the 2016 ACM Symposium on Document Engineering, 13-16 September 2016, Vienna, Austria.

Abstract

The search accuracy achieved in a PDF image-plus-hidden-text (PDF-IT) document depends upon the accuracy of the optical character recognition (OCR) process that produced the searchable hidden text layer. In many cases recognising words in a blurred area of a PDF page image may exceed the capabilities of an OCR engine. This paper describes a project to replace an inadequate hidden textual layer of a PDF-IT file with a more accurate hidden layer produced from a 'truth text'. The alignment of the truth text with the image is guided by using OCR-provided page-image co-ordinates, for those glyphs that are correctly recognised, as a set of fixed location points between which other truth-text words can be inserted and aligned with blurred glyphs in the image. Results are presented to show the much enhanced searchability of this new file when compared to that of the original file, which had an OCR-produced hidden layer with no truth-text enhancement.
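
The anchoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `align_truth_to_ocr`, the `(word, x0, x1)` tuple layout, the use of `difflib.SequenceMatcher` for matching, and the proportional-width rule for placing unmatched words are all assumptions made for illustration. It also works on a single text line and on horizontal co-ordinates only.

```python
from difflib import SequenceMatcher


def align_truth_to_ocr(truth_words, ocr_words):
    """Place every truth-text word on the page image (one text line).

    truth_words: list of str, the correct words in reading order.
    ocr_words:   non-empty list of (word, x0, x1) tuples from the OCR engine,
                 where x0/x1 are the left/right page-image co-ordinates.
    Returns a list of (truth_word, x0, x1) tuples.
    """
    ocr_text = [word for word, _, _ in ocr_words]
    matcher = SequenceMatcher(a=truth_words, b=ocr_text, autojunk=False)

    # Step 1: words the OCR engine got right become fixed anchor points
    # and simply inherit the OCR co-ordinates.
    placed = [None] * len(truth_words)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, x0, x1 = ocr_words[block.b + k]
            placed[block.a + k] = (x0, x1)

    # Step 2: each run of unanchored truth words is spread across the gap
    # between its neighbouring anchors (or the line edges), with widths
    # proportional to word length. This rule is an illustrative assumption.
    line_left = min(x0 for _, x0, _ in ocr_words)
    line_right = max(x1 for _, _, x1 in ocr_words)
    i, n = 0, len(truth_words)
    while i < n:
        if placed[i] is not None:
            i += 1
            continue
        j = i
        while j < n and placed[j] is None:
            j += 1
        left = placed[i - 1][1] if i > 0 else line_left
        right = placed[j][0] if j < n else line_right
        run = truth_words[i:j]
        total_chars = sum(len(word) for word in run)
        cursor = left
        for k, word in enumerate(run):
            width = (right - left) * len(word) / total_chars
            placed[i + k] = (cursor, cursor + width)
            cursor += width
        i = j

    return [(word, x0, x1) for word, (x0, x1) in zip(truth_words, placed)]
```

For example, given the truth words `the quick brown fox` and an OCR line in which only `the` and `fox` were recognised correctly, the two misread words are placed in the horizontal gap between those anchors. A real pipeline would also need vertical co-ordinates, line and page segmentation, and tolerance for near-matches, none of which this sketch attempts.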

Item Type: Conference or Workshop Item (Paper)
Additional Information: Paper published in DocEng '16 Proceedings of the 2016 ACM Symposium on Document Engineering. ISBN: 9781450344388. DOI: 10.1145/2960811.2967157
Keywords: PDF, OCR, Tesseract, Searchability, truth text
Schools/Departments: University of Nottingham, UK > Faculty of Science > School of Computer Science
Depositing User: Brailsford, Prof David
Date Deposited: 12 Sep 2017 13:06
Last Modified: 13 Oct 2017 04:16
URI: https://eprints.nottingham.ac.uk/id/eprint/45753
