Automatic detection of protected health information from clinic narratives

This paper presents a natural language processing (NLP) system designed to participate in the 2014 i2b2 de-identification challenge. The challenge task is to identify and classify seven main Protected Health Information (PHI) categories and 25 associated subcategories. A hybrid model was...

Bibliographic Details
Main Authors: Yang, Hui; Garibaldi, Jonathan M.
Format: Article
Published: Elsevier 2015
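Subjects: Protected Health Information (PHI); De-identification; Hybrid model; Natural language processing; Clinical text mining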
Online Access: https://eprints.nottingham.ac.uk/37551/
_version_ 1848795482515570688
author Yang, Hui
Garibaldi, Jonathan M.
building Nottingham Research Data Repository
collection Online Access
description This paper presents a natural language processing (NLP) system designed to participate in the 2014 i2b2 de-identification challenge. The challenge task is to identify and classify seven main Protected Health Information (PHI) categories and 25 associated subcategories. A hybrid model is proposed that combines machine learning techniques with keyword-based and rule-based approaches to deal with the complexity inherent in the PHI categories. The proposed approaches exploit a rich set of linguistic features, both syntactic and word-surface-oriented, which are further enriched by task-specific features and regular expression template patterns to characterize the semantics of the various PHI categories. The system achieved an overall micro-averaged F-measure of 93.6% on the challenge test data, making it the winning system of this de-identification challenge.
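As a rough illustration of two ideas named in the description (regular expression template patterns and the micro-averaged F-measure), the following Python sketch may help. It is not the authors' system: the two patterns, the sample note, and the function names `find_phi` and `micro_f1` are hypothetical and chosen only for this example, and the record does not give the templates or features the paper actually used.

```python
import re

# Hypothetical template patterns for two PHI categories (assumptions for
# illustration; the paper's real templates are not listed in this record).
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_phi(text):
    """Return the set of (category, start, end) spans matched by PATTERNS."""
    return {
        (category, match.start(), match.end())
        for category, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    }


def micro_f1(gold, predicted):
    """Micro-averaged F-measure over sets of (category, start, end) spans.

    Counts are pooled across all categories before precision and recall
    are computed, so frequent categories weigh more than rare ones.
    """
    tp = len(gold & predicted)   # spans with matching category and offsets
    fp = len(predicted - gold)   # predicted spans absent from the gold set
    fn = len(gold - predicted)   # gold spans the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up sample text, not taken from the i2b2 data.
note = "Seen on 03/14/2012; call 555-123-4567 to reschedule."
predicted = find_phi(note)
```

Comparing `predicted` against a hand-annotated gold set with `micro_f1(gold, predicted)` gives a single pooled score. A hybrid system of the kind described would merge spans like these with the output of a trained sequence model before scoring.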
first_indexed 2025-11-14T19:32:47Z
format Article
id nottingham-37551
institution University of Nottingham Malaysia Campus
institution_category Local University
last_indexed 2025-11-14T19:32:47Z
publishDate 2015
publisher Elsevier
recordtype eprints
repository_type Digital Repository
spelling nottingham-37551 2020-05-04T17:12:19Z
Elsevier 2015-07-29 Article PeerReviewed
Yang, Hui and Garibaldi, Jonathan M. (2015) Automatic detection of protected health information from clinic narratives. Journal of Biomedical Informatics, 58 (Suppl.). S30-S38. ISSN 1532-0480
http://www.sciencedirect.com/science/article/pii/S1532046415001252
doi:10.1016/j.jbi.2015.06.015
title Automatic detection of protected health information from clinic narratives
topic Protected Health Information (PHI); De-identification; Hybrid model; Natural language processing; Clinical text mining
url https://eprints.nottingham.ac.uk/37551/