Faculty, Staff and Student Publications
Publication Date
1-1-2026
Journal
SN Computer Science
DOI
10.1007/s42979-025-04540-x
PMID
41523798
PMCID
PMC12779700
PubMedCentral® Posted Date
1-7-2026
PubMedCentral® Full Text Version
Post-print
Abstract
Datasets used in machine learning often contain sensitive information, including personally identifiable health and financial details. A common challenge faced by organizations and researchers is the risk of privacy breaches when using real-world data. Synthetic data can be used as an alternative to real-world data. In existing synthetic data generation techniques, an encoder maps the real-world data into a lower-dimensional latent space, random sampling is performed in this latent space, and a decoder network then generates synthetic data from the sampled points. Such approaches typically require generating a large number of synthetic samples to approximate the performance of real-world data, which in turn slows down downstream machine learning tasks. To address this, we introduce a combinatorial approach to sampling the latent space, motivated by our empirical finding in this study that most model predictions are driven largely by interactions among a few features. In some cases, using only a small number of features yields better accuracy than using all features. With this approach, we generate samples that cover the t-way interactions among t of the n latent dimensions. Our experimental results indicate that our approach requires fewer samples than traditional random sampling to achieve comparable model performance on real-world data sets. We also show that, when integrated with a differentially private mechanism, our approach incurs a smaller decline in model performance than the existing random sampling approach.
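The idea of sampling a latent space so that t-way interactions are covered can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how latent dimensions are discretized or how the sample set is constructed. The sketch assumes a standard normal prior on each latent dimension, k representative levels per dimension taken at the midpoints of equal-probability bins, and a simple greedy construction. The names `level_values`, `t_way_samples`, and `to_latent` are hypothetical.

```python
import itertools
import random
from statistics import NormalDist


def level_values(k):
    """Assumption: k levels per latent dimension, placed at the midpoints
    of k equal-probability bins of a standard normal prior."""
    nd = NormalDist()
    return [nd.inv_cdf((i + 0.5) / k) for i in range(k)]


def interactions(row, t):
    """All t-way (dimensions, levels) combinations that one row contains."""
    return {
        (dims, tuple(row[d] for d in dims))
        for dims in itertools.combinations(range(len(row)), t)
    }


def t_way_samples(n, t, k, candidates=50, seed=0):
    """Greedily build level vectors over n dimensions until every
    combination of levels on every choice of t dimensions appears."""
    rng = random.Random(seed)
    uncovered = {
        (dims, levels)
        for dims in itertools.combinations(range(n), t)
        for levels in itertools.product(range(k), repeat=t)
    }
    rows = []
    while uncovered:
        # Force one still-uncovered interaction into every candidate,
        # so each accepted row makes progress and the loop terminates.
        dims, levels = next(iter(uncovered))
        best_row, best_gain = None, -1
        for _ in range(candidates):
            row = [rng.randrange(k) for _ in range(n)]
            for d, level in zip(dims, levels):
                row[d] = level
            gain = len(interactions(row, t) & uncovered)
            if gain > best_gain:
                best_row, best_gain = row, gain
        rows.append(best_row)
        uncovered -= interactions(best_row, t)
    return rows


def to_latent(rows, k):
    """Map level indices to latent coordinates; a trained decoder
    (not shown here) would turn each vector into a synthetic record."""
    values = level_values(k)
    return [[values[level] for level in row] for row in rows]


rows = t_way_samples(n=6, t=2, k=3)
latent_points = to_latent(rows, k=3)
print(len(latent_points), "latent points cover all 2-way level combinations")
```

Under these assumptions, the number of latent points grows with the number of t-way level combinations rather than with the full k^n grid, which is the kind of sample-size reduction the abstract describes relative to random sampling.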
Keywords
Synthetic data generation, Combinatorial testing, Variational autoencoder, Differential privacy
Published Open-Access
yes
Recommended Citation
Khadka, Krishna; Chandrasekaran, Jaganmohan; Lei, Yu; et al., "A Combinatorial Approach to Synthetic Data Generation for Machine Learning" (2026). Faculty, Staff and Student Publications. 6087.
https://digitalcommons.library.tmc.edu/uthgsbs_docs/6087
Included in
Bioinformatics Commons, Biomedical Informatics Commons, Genetic Phenomena Commons, Medical Genetics Commons, Oncology Commons