
Center for Medical Ethics and Health Policy Staff Publications
Publication Date
1-1-2025
Journal
Health Informatics Journal
PMID
40528690
PubMedCentral® Posted Date
4-1-2025
PubMedCentral® Full Text Version
Post-print
Published Open-Access
yes
Keywords
Cloud Computing; Humans; Genomics; Information Storage and Retrieval; S3 genomic data lake; clinical-omics data; hybrid cloud architecture; oncology clinical decision making
Abstract
Objective: Cancer centers must quickly integrate clinical genomics data from different vendors for oncology operations and research. Clinical data warehouse architectures are costly to construct and brittle, and they are not readily amenable to the rapid changes in oncology research. We introduce a cost-effective hybrid cloud Data Lake architecture for storing clinical genomic data from different vendors, aiding both clinical and research workflows.
Methods: We created a Data Lake architecture based on the zone architecture, with four layers: ingestion, storage, transformation, and interaction. The layers are implemented with a hybrid cloud architecture. Rich metadata created from patient and genomic data enables patient-based queries, with access to data controlled through a data governance workflow.
Results: Genomic data are stored in the cloud, synchronized with vendors' storage, and managed by a governance committee. The architecture implementation includes genomic test results from two vendors and supports independent clinical sites. The implementation serves 149 clinicians across 31 disease groups and stores 240 TB of data on 5800 patients at a monthly cost of approximately $350.
Conclusion: The Data Lake architecture offers flexibility and scalability, making it suitable for organizations of all sizes to integrate clinical and genomic data efficiently for clinical and research purposes.
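The patient-based query and governance-controlled access described in the Methods might be sketched as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not specify a metadata schema, so every class, field, and method name here (`GenomicObject`, `Catalog`, `approve`, `query_patient`), the example vendor and user names, and the `s3://` URIs are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GenomicObject:
    """Hypothetical metadata record for one vendor file held in cloud storage."""
    patient_id: str
    vendor: str
    disease_group: str
    uri: str  # assumed object-storage location; not taken from the paper


@dataclass
class Catalog:
    """Hypothetical metadata catalog linking patients to stored genomic objects."""
    records: list = field(default_factory=list)
    # (user, disease_group) pairs assumed to be granted by a governance committee
    approvals: set = field(default_factory=set)

    def ingest(self, obj: GenomicObject) -> None:
        # Ingestion step: register metadata for an object already in storage.
        self.records.append(obj)

    def approve(self, user: str, disease_group: str) -> None:
        # Governance step: record that a user may see one disease group.
        self.approvals.add((user, disease_group))

    def query_patient(self, user: str, patient_id: str) -> list:
        # Interaction step: return only URIs the user is approved to access.
        return [
            r.uri
            for r in self.records
            if r.patient_id == patient_id
            and (user, r.disease_group) in self.approvals
        ]


catalog = Catalog()
catalog.ingest(GenomicObject("P001", "vendor_a", "breast", "s3://example-lake/raw/vendor_a/P001.vcf"))
catalog.ingest(GenomicObject("P001", "vendor_b", "lung", "s3://example-lake/raw/vendor_b/P001.bam"))
catalog.approve("dr_lee", "breast")

visible = catalog.query_patient("dr_lee", "P001")
hidden = catalog.query_patient("dr_kim", "P001")
```

In this sketch the query is answered from the metadata alone, and the governance check filters results before any storage location is returned. That is one plausible reading of "access to data controlled through a data governance workflow", not a description of the published system.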