Language

English

Publication Date

8-5-2025

Journal

Journal of Thrombosis and Haemostasis

DOI

10.1016/j.jtha.2025.07.021

PMID

40754035

PMCID

PMC12360494

PubMedCentral® Full Text Version

Post-print

Abstract

Background: Accurate and rapid phenotyping of venous thromboembolism (VTE) in longitudinal studies is important. A natural language processing (NLP) tool externally validated in representative patients is lacking.

Objectives: To train and validate an efficient NLP model to detect incident VTE events.

Methods: We designed a novel NLP platform, NLPMed, to assist thrombosis researchers with data preprocessing, phenotype annotation, language model finetuning, and NLP application. Using clinical notes, discharge summaries, and radiology reports from patients with cancer at 2 healthcare institutions, we finetuned Bio_Clinical Bidirectional Encoder Representations from Transformers (BERT) to develop VTE-BERT. The new model was trained to detect acute VTE events and their anatomical locations longitudinally. We internally and externally validated the model's performance in 2 randomly sampled cohorts of patients with advanced cancer.

Results: The training cohort consisted of 715 patients and 14 013 annotated notes with ≥1 VTE keyword from the Harris Health System. The internal validation cohort included 400 additional patients with 7190 VTE keyword-containing notes from Harris Health System. The external validation cohort included 400 patients with 7371 VTE keyword-containing notes from the national Veterans Affairs healthcare system. VTE-BERT was trained until it reached a precision of 95% and recall of 98% on the patient level. Using independent datasets, the model achieved precision and recall of 95% and 91% in internal validation and of 85% and 92% in external validation.
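The patient-level precision and recall reported above can be illustrated with a minimal Python sketch. It assumes that a patient counts as VTE-positive if any of their notes is labeled positive; the abstract does not state how note-level predictions were aggregated, so this rule, the function names, and the toy data are all hypothetical and are not taken from NLPMed or VTE-BERT.

```python
from collections import defaultdict


def patient_level_labels(note_labels):
    """Collapse (patient_id, label) note pairs into one label per patient.

    Assumption: a patient is positive if at least one note is positive.
    """
    by_patient = defaultdict(bool)
    for patient_id, is_vte in note_labels:
        by_patient[patient_id] = by_patient[patient_id] or bool(is_vte)
    return dict(by_patient)


def precision_recall(truth, predicted):
    """Compute precision and recall over patients, not over notes."""
    patients = truth.keys() | predicted.keys()
    tp = sum(1 for p in patients if truth.get(p, False) and predicted.get(p, False))
    fp = sum(1 for p in patients if not truth.get(p, False) and predicted.get(p, False))
    fn = sum(1 for p in patients if truth.get(p, False) and not predicted.get(p, False))
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Toy data (hypothetical): (patient_id, note-level label) pairs.
truth_notes = [("A", 1), ("A", 0), ("B", 0), ("C", 1), ("D", 0)]
pred_notes = [("A", 1), ("B", 1), ("C", 0), ("C", 0), ("D", 0)]

truth = patient_level_labels(truth_notes)
predicted = patient_level_labels(pred_notes)
precision, recall = precision_recall(truth, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example, patient A is a true positive, B a false positive, and C a false negative, so both metrics come out to 0.50. Evaluating per patient rather than per note keeps patients with many notes from dominating the metric.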

Conclusion: We trained and externally validated an efficient NLP model to detect incident VTE events longitudinally. We believe its adoption will accelerate thrombosis research by improving VTE detection at scale and decreasing the time and expense involved with manual chart review in big data epidemiological studies.

Keywords

thromboembolism, neoplasms, large language model, natural language processing, artificial intelligence

Published Open-Access

yes
