Joseph W. Sirrianni1*, Jin Peng1, Yungui Huang1 and Homa Amini2
Background: Cohort identification is a crucial task for performing retrospective clinical analysis. Natural language processing, particularly modern deep learning approaches, may improve this task by classifying patients by cohort status more accurately. However, these approaches have not yet been applied in the dental domain.
Objective: We aim to identify patients who suffered trampoline-associated traumatic dental injuries among all patients with trampoline-associated injuries.
Methods: We develop and apply a natural language processing cohort identification pipeline, consisting of text filtering rules and a machine learning model trained using historical data. The pipeline processes a patient’s clinical notes for a series of temporally related encounters and produces a binary prediction of whether or not the patient has suffered a trampoline-associated injury. We experimented with six different machine learning models: logistic regression, random forest, decision tree, linear SVM, naïve Bayes, and a fine-tuned ClinicalBERT model.
Results: The fine-tuned ClinicalBERT model had the best performance of the models on our evaluation data, with a PPV of 0.836 and a sensitivity of 0.898. The application of the pipeline on our data increased the cohort size for all trampoline injuries from an initial 7,454 patients to 15,010 patients and the trampoline-associated traumatic dental injuries cohort from an initial 102 patients to 140 patients.
Conclusion: We present a novel natural language processing powered pipeline for identifying a trampoline-associated injury cohort for dental research. Our results demonstrate the superiority of deep learning over traditional machine learning models on our specific task. Our process for identifying patient encounters by activity type is generalizable to several different types of injuries and applicable to other research cohorts.
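For readers less familiar with the two metrics reported in the Results, the following is a minimal sketch of how PPV (positive predictive value, also called precision) and sensitivity (recall) are derived from binary predictions. The function name and the toy labels are hypothetical and chosen only for illustration; they are not the study's data or evaluation code.

```python
def ppv_and_sensitivity(y_true, y_pred):
    """Return (PPV, sensitivity) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # PPV: fraction of predicted positives that are truly positive.
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    # Sensitivity: fraction of true positives that the model recovers.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return ppv, sensitivity


# Toy labels (illustrative only): 1 = trampoline-associated injury, 0 = not.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
ppv, sensitivity = ppv_and_sensitivity(y_true, y_pred)
print(f"PPV={ppv:.3f}, sensitivity={sensitivity:.3f}")
```

In a cohort identification setting, a high PPV limits the number of wrongly included patients that need manual review, while a high sensitivity limits the number of eligible patients that are missed.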