Evaluation of Machine Learning Models for Patient Data De-identification in Clinical Records

Date

2018-08-29T16:24:47Z

Authors

Kakarla, Yamani

Abstract

In research that involves medical records, patient-identifiable details must be removed before the records are made available, a requirement enforced by the HIPAA Privacy Rule and Public Law 104-191. De-identification is the redaction or masking of individually identifiable pieces of protected health information (PHI) in clinical notes so that the patient's identity is not exposed. With the increasing adoption of electronic health records (EHRs) in the healthcare industry, a growing amount of medical information is available in digital format, and de-identifying such large collections of records manually is impractical. Automated de-identification systems address this issue by automatically tagging PHI in free-text medical records. The primary objective of this research is to explore automated natural language processing (NLP) techniques for de-identifying unstructured health records. To facilitate studies in automatic de-identification using statistical models, my work provides an overview of the evaluation results of a core NLP-based de-identification model. My thesis describes the complexities of learning variants of the model in the parameter space, explains the performance metrics of the models (precision, recall, and F1 measure), compares the results with a rule-based de-identification system, and finally provides directions for future research. The data used for evaluation consisted of three types of medical notes: discharge summaries, longitudinal medical records, and nursing notes. Through model-specific feature engineering and the introduction of hidden neural gates (a model parameter) to the core model, the highest tag-level F1 measure achieved was 0.967, on discharge summaries. For this task, in cases where more importance should be given to precision, the F1 measure, which weights precision and recall equally, can over-weight recall.
The performance results from all models are encouraging and provide scope for future work. Overall, this thesis intends to increase practitioners' understanding of the nature of de-identification models and how they are trained, to help preserve medical information without compromising the privacy of individuals.
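
The tag-level metrics named in the abstract can be illustrated with a minimal sketch. This is not the evaluation code used in the thesis: the function name `tag_level_scores`, the tag names (`NAME`, `DATE`, `O`), and the toy sequences are all assumptions made for illustration. The sketch also computes the general F-beta measure, where a beta below 1 shifts weight toward precision and a beta above 1 shifts it toward recall, which is one common way to handle the weighting concern raised above.

```python
from typing import Sequence, Tuple


def tag_level_scores(
    gold: Sequence[str],
    pred: Sequence[str],
    beta: float = 1.0,
    outside: str = "O",
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-beta) computed per token.

    A token is a true positive only when it carries a PHI tag in the gold
    annotation and the predicted tag is identical. The `outside` label
    marks non-PHI tokens (a hypothetical labelling convention).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    true_pos = sum(1 for g, p in zip(gold, pred) if g != outside and g == p)
    n_predicted = sum(1 for p in pred if p != outside)  # tokens tagged as PHI
    n_actual = sum(1 for g in gold if g != outside)     # tokens that are PHI

    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Toy example (invented data): the tagger finds every PHI token
# but also flags two non-PHI tokens as dates.
gold = ["O", "NAME", "NAME", "O", "DATE", "O"]
pred = ["O", "NAME", "NAME", "DATE", "DATE", "DATE"]

precision, recall, f1 = tag_level_scores(gold, pred)
_, _, f_half = tag_level_scores(gold, pred, beta=0.5)  # favours precision
_, _, f_two = tag_level_scores(gold, pred, beta=2.0)   # favours recall
```

In this toy case precision is 0.6 and recall is 1.0, so the F-beta score falls as beta moves toward precision and rises as it moves toward recall; F1 sits between the two.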

Keywords

Privacy, De-identification, Protected Health Information, HIPAA, Natural Language Processing, Sequence Labelling, Conditional Random Fields, Semi-CRF, Neural-CRF
