Language
English
Publication Date
5-1-2025
Journal
Research and Practice in Thrombosis and Haemostasis
DOI
10.1016/j.rpth.2025.102896
PMID
40606764
PMCID
PMC12213262
PubMedCentral® Posted Date
5-21-2025
PubMedCentral® Full Text Version
Post-print
Abstract
Background: Pulmonary embolism (PE) is a leading cause of preventable in-hospital mortality. Advances in diagnosis, risk stratification, and prevention can improve outcomes. Large, publicly available datasets are needed to move research forward, but are lacking in the field of hemostasis and thrombosis.
Objectives: In this study, we test whether a machine learning language model can automatically add PE labels to a large dataset.
Methods: We extracted all computed tomography pulmonary angiography radiology reports (N = 19,942) from the Medical Information Mart for Intensive Care IV, a database of adult patients who presented to the emergency room or were admitted to the intensive care unit at one tertiary care center between 2008 and 2019. Two physicians manually labeled each report result as PE positive (acute PE) or PE negative. Using these labels as the gold standard, we compared a fine-tuned Bio_ClinicalBERT (bidirectional encoder representations from transformers) language model, known as venous thromboembolism-BERT, with diagnosis codes in their ability to classify reports as PE positive or PE negative.
Results: Venous thromboembolism-BERT had a sensitivity of 92.4% and a positive predictive value of 87.8% in all 19,942 computed tomography pulmonary angiography reports. Diagnosis codes had a sensitivity of 95.4% and a positive predictive value of 83.8% in the subset of 11,990 reports with an associated discharge diagnosis code.
Conclusion: We successfully added nearly 20,000 PE labels to the publicly available Medical Information Mart for Intensive Care IV database and demonstrated how a transformer language model can automate and accelerate hematologic research.
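The two metrics reported in the Results, sensitivity and positive predictive value, both follow from comparing predicted labels against the gold-standard labels. The sketch below is a minimal illustration under stated assumptions: the function name `sensitivity_and_ppv` and the toy label lists are hypothetical and are not taken from the study's data or code.

```python
def sensitivity_and_ppv(gold, predicted):
    """Return (sensitivity, PPV) for binary labels, where 1 = PE positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fn = sum(1 for g, p in pairs if g and not p)    # missed PE cases
    fp = sum(1 for g, p in pairs if not g and p)    # false alarms
    # Sensitivity: share of true PE reports that were flagged as PE.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # PPV: share of flagged reports that truly describe a PE.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Toy labels for illustration only (hypothetical, not study data).
gold = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 1, 0, 0, 1, 0, 0, 1]
sens, ppv = sensitivity_and_ppv(gold, pred)
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")
```

The same function could be applied to either classifier (the language model or the diagnosis codes), provided each is scored only on the reports for which it produced a label.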
Keywords
machine learning, natural language processing, pulmonary embolism, venous thromboembolism, databases
Published Open-Access
yes
Recommended Citation
Lam, Barbara D; Ma, Shengling; Kovalenko, Iuliia; et al., "Using a Transformer Language Model To Curate a Pulmonary Embolism Dataset From the Medical Information Mart for Intensive Care IV: MIMIC-IV-Ext-PE" (2025). Faculty and Staff Publications. 4625.
https://digitalcommons.library.tmc.edu/baylor_docs/4625