Automatic detection of protected health information from clinic narratives

Yang, Hui and Garibaldi, Jonathan M. (2015) Automatic detection of protected health information from clinic narratives. Journal of Biomedical Informatics, 58 (Suppl.). S30-S38. ISSN 1532-0480

Full text not available from this repository.


This paper presents a natural language processing (NLP) system designed for the 2014 i2b2 de-identification challenge. The challenge task is to identify and classify seven main Protected Health Information (PHI) categories and 25 associated subcategories. We propose a hybrid model that combines machine learning techniques with keyword-based and rule-based approaches to deal with the complexity inherent in PHI categories. The proposed approaches exploit a rich set of linguistic features, both syntactic and word surface-oriented, which are further enriched by task-specific features and regular expression template patterns that characterize the semantics of the various PHI categories. On the challenge test data, our system achieved an overall micro-averaged F-measure of 93.6%, ranking first in this de-identification challenge.
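To make the abstract's two technical ideas concrete, the following is a minimal Python sketch of regular expression template matching for a few PHI-like categories, together with a micro-averaged F-measure computed by pooling counts across categories. It is not the authors' system: the pattern set, the category labels (`DATE`, `PHONE`, `EMAIL`), the function names `find_phi` and `micro_f1`, the sample note, and the exact-match span scoring are all assumptions chosen for illustration, and the machine learning component of the hybrid model is not shown.

```python
import re

# Hypothetical regular expression templates, for illustration only.
# These are NOT the patterns used in the paper.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_phi(text):
    """Return sorted (start, end, category) spans matched by the templates."""
    spans = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), category))
    return sorted(spans)


def micro_f1(gold, predicted):
    """Micro-averaged F-measure over exact-match spans.

    True positives, false positives, and false negatives are pooled
    across all categories before precision and recall are computed.
    Exact span-and-category matching is an assumption of this sketch.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented sample text; it contains no real patient data.
note = "Seen on 03/14/2014. Call 555-123-4567 or mail jdoe@example.org."
predicted = find_phi(note)
```

In a hybrid setup of the kind the abstract describes, spans produced by rules like these would typically be merged with the output of a trained sequence labeller; how the authors combine the two is not stated in this record.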

Item Type: Article
Additional Information: Proceedings of the 2014 i2b2/UTHealth Shared-Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data
Keywords: Protected Health Information (PHI); De-identification; Hybrid model; Natural language processing; Clinical text mining
Schools/Departments: University of Nottingham, UK > Faculty of Science > School of Computer Science
Depositing User: Garibaldi, Prof Jon
Date Deposited: 14 Oct 2016 09:17
Last Modified: 04 May 2020 17:12
