Programmatic extraction of information from unstructured clinical data and the assessment of potential impacts on epidemiological research

Background: For epidemiological research purposes, structured data provide identifiable and immediate access to the information that has been recorded; however, many quantitative recordings in electronic medical records are unstructured. This means researchers have to manually identify and extract...

Bibliographic Details
Main Author: Cochrane, Nicholas J.K.
Format: Thesis (University of Nottingham only)
Language: English
Published: 2015
Subjects:
Online Access: https://eprints.nottingham.ac.uk/30582/
_version_ 1848794016416530432
author Cochrane, Nicholas J.K.
building Nottingham Research Data Repository
collection Online Access
description Background: For epidemiological research purposes, structured data provide identifiable and immediate access to the information that has been recorded; however, many quantitative recordings in electronic medical records are unstructured. This means researchers have to manually identify and extract information of interest. This is costly in terms of time and money, and with access to larger amounts of electronically stored data this approach is becoming increasingly impractical.
Method: Two programmatic methods were developed to extract and classify numeric quantities and identify attributes from unstructured dosage instructions and clinical comments from The Health Improvement Network (THIN) database. Both methods are based on frequently occurring patterns of recording from which models were formed. Dosage instructions: Automated coding was achieved through the interpretation of a representative set of language phrases with identifiable traits. The dosage data table was automatically recoded and assessed for accuracy and coverage of a daily dosage value, then assessed in the context of 146 commonly prescribed medications. Clinical comments: Automated coding was achieved through the identification of a representative set of text and/or Read code qualifications. The model was initially trained on THIN data for a wide range of numeric health indicators, then tested for generalizability using comments from an alternative source and assessed for accuracy, sensitivity, and specificity using a subset of 12 commonly recorded health indicators.
Results: Dosage instructions: The coverage of a daily dosage value within the dosage data table was increased from 42.1% to 84.8%, with an accuracy of 84.6%. For the 146 medications assessed, on a per-unique-instruction basis, the coverage was 79.7% on average with an accuracy of 95.4%. On an all-recorded-instructions basis, the weighted coverage was 65.9% on average with an accuracy of 99.3%. Clinical comments: For all 12 of the health indicators assessed, the automated extraction achieved a specificity of >98% and an accuracy of >99%. The sensitivity was >96% for 8 of the indicators and between 52% and 88% for the remaining 4 indicators.
Conclusion: Dosage instructions: The automated coding has improved the quantitative and qualitative summary for dosage instructions within THIN, resulting in a substantial increase in the quantity of data available for pharmaco-epidemiological research. Clinical comments: The sensitivity of the extraction method is dependent on the consistency of recording patterns, which in turn was dependent on the ability to identify the differing patterns of qualification during training.
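Illustrative sketch (not part of the thesis record): the abstract describes coding dosage instructions from frequently occurring phrase patterns and reporting coverage both per unique instruction and weighted by how often each instruction was recorded. The Python below shows one way such a pattern-based coder and the two coverage figures could be computed. The phrase tables, the function names daily_dose and coverage, and the sample instructions are all hypothetical and chosen only for illustration; they are not the author's model or THIN data.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical vocabularies, for illustration only (not taken from the thesis).
NUMBER_WORDS = {"half": 0.5, "one": 1.0, "two": 2.0, "three": 3.0, "four": 4.0}
FREQUENCY_PER_DAY = {
    "once a day": 1, "once daily": 1, "daily": 1, "od": 1,
    "twice a day": 2, "twice daily": 2, "bd": 2,
    "three times a day": 3, "tds": 3,
    "four times a day": 4, "qds": 4,
}

# A quantity is either a decimal number or one of the number words above.
QUANTITY = re.compile(r"\b(\d+(?:\.\d+)?|" + "|".join(NUMBER_WORDS) + r")\b")


def daily_dose(instruction: str) -> Optional[float]:
    """Return quantity x frequency per day, or None if no pattern matches."""
    text = instruction.lower().strip()
    match = QUANTITY.search(text)
    if match is None:
        return None
    token = match.group(1)
    quantity = NUMBER_WORDS[token] if token in NUMBER_WORDS else float(token)
    # Look for a frequency phrase after the quantity; try longer phrases
    # first so that "twice daily" is not read as plain "daily".
    rest = text[match.end():]
    for phrase in sorted(FREQUENCY_PER_DAY, key=len, reverse=True):
        if re.search(r"\b" + re.escape(phrase) + r"\b", rest):
            return quantity * FREQUENCY_PER_DAY[phrase]
    return None


def coverage(instructions: List[str]) -> Tuple[float, float]:
    """Return (per-unique-instruction coverage, occurrence-weighted coverage)."""
    counts = Counter(s.lower().strip() for s in instructions)
    if not counts:
        return 0.0, 0.0
    coded = {text for text in counts if daily_dose(text) is not None}
    per_unique = len(coded) / len(counts)
    weighted = sum(counts[t] for t in coded) / sum(counts.values())
    return per_unique, weighted


if __name__ == "__main__":
    sample = ["Take one tablet twice a day"] * 3 + ["2 tabs tds", "apply as directed"]
    for text in sorted(set(sample)):
        print(repr(text), "->", daily_dose(text))
    print("per-unique, weighted coverage:", coverage(sample))
```

The two coverage figures differ whenever the instructions that can be coded are recorded more or less often than those that cannot, which is why the abstract reports them separately.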
first_indexed 2025-11-14T19:09:29Z
format Thesis (University of Nottingham only)
id nottingham-30582
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T19:09:29Z
publishDate 2015
recordtype eprints
repository_type Digital Repository
spelling nottingham-30582 2025-02-28T11:36:59Z https://eprints.nottingham.ac.uk/30582/ 2015-12-09 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/30582/1/NicholasJKCochranePhDTheisFinal.pdf Cochrane, Nicholas J.K. (2015) Programmatic extraction of information from unstructured clinical data and the assessment of potential impacts on epidemiological research. PhD thesis, University of Nottingham.
title Programmatic extraction of information from unstructured clinical data and the assessment of potential impacts on epidemiological research
topic Epidemiological research
Automated coding
Structured data
The Health Improvement Network database
Electronic medical records
Dosage instructions
url https://eprints.nottingham.ac.uk/30582/