Automated re-typesetting, indexing and content enhancement for scanned marriage registers

Brailsford, David F. (2009) Automated re-typesetting, indexing and content enhancement for scanned marriage registers. In: ACM Symposium on Document Engineering (DocEng '09), 15-18 Sept 2009, Munich, Germany.



For much of England and Wales, marriage registers began to be kept in 1537. Marriage details were recorded locally, in longhand, until 1st July 1837, when central records began. All registers were kept in the local parish church.

Between 1896 and 1922 the Phillimore company of London, using volunteer help, attempted to transcribe marriage registers for as many English parishes as possible and to have them printed.

This paper describes an experiment in the automated re-typesetting of Volume 2 of the 15-volume Phillimore series relating to the county of Derbyshire. The source material was plain text derived from running Optical Character Recognition (OCR) on a set of page scans taken from the original printed volume.

The aim of the experiment was to avoid labour-intensive, page-by-page rebuilding with tools such as Acrobat Capture. Instead, it proved possible to capitalise on the regular, tabular structure of the Register pages as a means of automating the re-typesetting process, using UNIX troff software and its tbl preprocessor. A series of simple software tools helped to bring about the OCR-to-troff transformation.
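
As an illustration only, the kind of OCR-to-troff step described above might be sketched as follows. This is not the paper's tooling: the line layout ("groom & bride .. date"), the regular expression and the function names parse_entry and to_tbl are all assumptions made for the example. Only the tbl markup itself (.TS, an options line, a format line, data rows, .TE) is standard.

```python
import re

# ASSUMED layout of one OCR'd register line: "<groom> & <bride> .. <date>".
# The real Phillimore page layout may differ; this pattern is hypothetical.
ENTRY = re.compile(
    r"^(?P<groom>.+?)\s*&\s*(?P<bride>.+?)\s*\.{2,}\s*(?P<date>.+)$"
)

def parse_entry(line):
    """Return (groom, bride, date), or None if the line does not fit the layout."""
    m = ENTRY.match(line.strip())
    return (m["groom"], m["bride"], m["date"]) if m else None

def to_tbl(lines):
    """Wrap every recognisable entry in a three-column tbl table."""
    rows = [entry for entry in map(parse_entry, lines) if entry]
    out = [".TS", "tab(|);", "l l r."]      # options line, then column formats
    out += ["|".join(row) for row in rows]  # one data row per marriage entry
    out.append(".TE")
    return "\n".join(out)

if __name__ == "__main__":
    sample = ["John Smith & Mary Jones .... 12 May 1701", "page header noise"]
    print(to_tbl(sample))
```

Lines that do not match the assumed layout are simply skipped here; a real pipeline would more plausibly log them for manual correction of OCR errors.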

However, re-typesetting the text was not an end in itself but also a step on the way to content enhancement and content repurposing. This included the indexing of the marriage entries and their potential transformation into XML and GEDCOM notations. The experiment has shown that, for highly regular material, the efforts of one programmer with suitable low-level tools can be far more effective than attempting to recreate the printed material using WYSIWYG software.
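
To make the repurposing idea concrete, the following is a minimal sketch of how already-parsed marriage entries could be turned into GEDCOM records and a surname index. It assumes each entry is a (groom, bride, date) tuple and that the surname is the last word of a name; the helper names split_name, gedcom_records and surname_index are hypothetical and not taken from the paper. The output is a fragment of INDI and FAM records, without the HEAD and TRLR records a complete GEDCOM file needs.

```python
from collections import defaultdict

def split_name(full):
    """ASSUMPTION: the surname is the last whitespace-separated word."""
    given, _, surname = full.rpartition(" ")
    return given, surname

def gedcom_records(entries):
    """Emit INDI and FAM records for (groom, bride, date) tuples."""
    out = []
    for n, (groom, bride, date) in enumerate(entries, start=1):
        husb, wife = f"@I{2 * n - 1}@", f"@I{2 * n}@"
        for xref, name, sex in ((husb, groom, "M"), (wife, bride, "F")):
            given, surname = split_name(name)
            name_value = f"{given} /{surname}/".strip()
            out += [f"0 {xref} INDI", f"1 NAME {name_value}", f"1 SEX {sex}"]
        out += [
            f"0 @F{n}@ FAM",
            f"1 HUSB {husb}",
            f"1 WIFE {wife}",
            "1 MARR",
            f"2 DATE {date.upper()}",  # GEDCOM month names are upper case
        ]
    return out

def surname_index(entries):
    """Map each surname to the (1-based) entry numbers in which it occurs."""
    index = defaultdict(list)
    for n, (groom, bride, _date) in enumerate(entries, start=1):
        for name in (groom, bride):
            index[split_name(name)[1]].append(n)
    return dict(sorted(index.items()))

if __name__ == "__main__":
    entries = [("John Smith", "Mary Jones", "12 May 1701")]
    print("\n".join(gedcom_records(entries)))
    print(surname_index(entries))
```

The same tuples could equally feed an XML serialiser; GEDCOM is shown because its line-oriented format maps directly onto one marriage entry per family record.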

Item Type: Conference or Workshop Item (Paper)
Additional Information: Published in: DocEng '09: proceedings of the 9th ACM Symposium on Document Engineering. New York : ACM, 2009, ISBN: 978-1-60558-575-8. pp. 29-38, doi: 10.1145/1600193.1600202
Keywords: Re-typesetting, GEDCOM, OCR, troff, genealogy, hyperlinking, indexing.
Schools/Departments: University of Nottingham UK Campus > Faculty of Science > School of Computer Science
Depositing User: Brailsford, Prof David
Date Deposited: 24 Feb 2015 12:10
Last Modified: 14 Sep 2016 15:00
