Programmatic extraction of information from unstructured clinical data and the assessment of potential impacts on epidemiological research

Cochrane, Nicholas J.K. (2015) Programmatic extraction of information from unstructured clinical data and the assessment of potential impacts on epidemiological research. PhD thesis, University of Nottingham.




For epidemiological research, structured data provide identifiable and immediate access to the information that has been recorded. However, many quantitative recordings in electronic medical records are unstructured, so researchers have to identify and extract the information of interest manually. This is costly in time and money, and as larger volumes of electronically stored data become accessible the approach is becoming increasingly impractical.


Two programmatic methods were developed to extract and classify numeric quantities and identify attributes from unstructured dosage instructions and clinical comments in The Health Improvement Network (THIN) database. Both methods are based on models formed from frequently occurring patterns of recording. Dosage instructions: Automated coding was achieved through the interpretation of a representative set of language phrases with identifiable traits. The dosage data table was automatically recoded and assessed for the accuracy and coverage of a daily dosage value, then assessed in the context of 146 commonly prescribed medications. Clinical comments: Automated coding was achieved through the identification of a representative set of text and/or Read code qualifications. The model was initially trained on THIN data for a wide range of numeric health indicators, then tested for generalizability using comments from an alternative source and assessed for accuracy, sensitivity, and specificity using a subset of 12 commonly recorded health indicators.
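The idea of coding a dosage instruction from recurring phrases can be illustrated with a minimal Python sketch. It is not the method developed in the thesis: the frequency vocabulary, the `daily_dose` function, and the rule of taking the first number as the quantity per dose are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical frequency vocabulary (an assumption, not taken from the thesis):
# maps a recurring phrase to the number of doses per day.
FREQUENCY_PER_DAY = {
    "od": 1, "daily": 1, "once a day": 1, "once daily": 1,
    "bd": 2, "twice a day": 2, "twice daily": 2,
    "tds": 3, "three times a day": 3,
    "qds": 4, "four times a day": 4,
}

# First integer or decimal number in the text is treated as the quantity per dose.
QUANTITY = re.compile(r"\b(\d+(?:\.\d+)?)\b")


def daily_dose(instruction: str) -> Optional[float]:
    """Return quantity per dose times doses per day, or None if not codable."""
    text = " ".join(instruction.lower().split())  # normalise case and spacing
    quantity = QUANTITY.search(text)
    if quantity is None:
        return None
    # Try longer phrases first so "twice daily" is not read as plain "daily".
    for phrase in sorted(FREQUENCY_PER_DAY, key=len, reverse=True):
        if re.search(rf"\b{re.escape(phrase)}\b", text):
            return float(quantity.group(1)) * FREQUENCY_PER_DAY[phrase]
    return None  # no recognised frequency phrase: leave the instruction uncoded
```

In this sketch an instruction such as "Take 2 tablets twice a day" is coded as a daily dose of 4.0, while free text such as "as directed" is left uncoded. Instructions left uncoded are what reduce coverage; instructions coded to the wrong value are what reduce accuracy.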


Dosage instructions: The coverage of a daily dosage value within the dosage data table was increased from 42.1% to 84.8%, with an accuracy of 84.6%. For the 146 medications assessed, on a per-unique-instruction basis, the coverage was 79.7% on average with an accuracy of 95.4%. On an all-recorded-instructions basis the weighted coverage was 65.9% on average with an accuracy of 99.3%. Clinical comments: For all 12 of the health indicators assessed the automated extraction achieved a specificity of >98% and an accuracy of >99%. The sensitivity was >96% for 8 of the indicators and between 52% and 88% for the other 4 indicators.
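The difference between per-unique-instruction and all-recorded-instructions (weighted) coverage, and the definitions of sensitivity, specificity, and accuracy, can be sketched as follows. The function names and the toy counts are assumptions for illustration only; none of the numbers are thesis data.

```python
from typing import Dict, Set, Tuple


def coverage(instruction_counts: Dict[str, int], coded: Set[str]) -> Tuple[float, float]:
    """Return (per-unique coverage, weighted coverage).

    instruction_counts maps each distinct instruction text to how often it was
    recorded; coded is the set of instruction texts given a daily dose value.
    """
    unique_coded = sum(1 for text in instruction_counts if text in coded)
    records_coded = sum(n for text, n in instruction_counts.items() if text in coded)
    per_unique = unique_coded / len(instruction_counts)
    weighted = records_coded / sum(instruction_counts.values())
    return per_unique, weighted


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard confusion-matrix measures for an extraction task."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true values that were extracted
        "specificity": tn / (tn + fp),  # share of non-values correctly ignored
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Toy counts, invented for illustration.
counts = {"1 bd": 6, "2 tds": 3, "as directed": 1}
per_unique, weighted = coverage(counts, coded={"1 bd", "2 tds"})
metrics = diagnostic_metrics(tp=8, fp=1, tn=90, fn=1)
```

With these toy counts, two of three distinct instructions are coded but they account for nine of ten records, so the two coverage figures differ. The same mechanism works in either direction, depending on whether the uncoded instructions are rare or frequent.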


Dosage instructions: The automated coding has improved the quantitative and qualitative summary for dosage instructions within THIN, resulting in a substantial increase in the quantity of data available for pharmaco-epidemiological research. Clinical comments: The sensitivity of the extraction method depends on the consistency of recording patterns, which in turn depends on the ability to identify the differing patterns of qualification during training.

Item Type: Thesis (University of Nottingham only) (PhD)
Supervisors: Gibson, Jack
Hubbard, R.
Keywords: Epidemiological research, Automated coding, Structured data, The Health Improvement Network database, Electronic medical records, Dosage instructions
Subjects: W Medicine and related subjects (NLM Classification) > W Health professions
Faculties/Schools: UK Campuses > Faculty of Medicine and Health Sciences > School of Medicine
Item ID: 30582
Depositing User: Cochrane, Nicholas
Date Deposited: 18 Jan 2016 08:57
Last Modified: 02 Nov 2017 18:12
