Faculty, Staff and Student Publications

Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Estimating Sample Size and Reducing Overfitting

Authors

Hamzeh Ghasemzadeh
Robert E Hillman
Daryush D Mehta

Publication Date

3-11-2024

Journal

Journal of Speech, Language, and Hearing Research

Abstract

PURPOSE: Many studies using machine learning (ML) in speech, language, and hearing sciences rely upon cross-validations with single data splitting. This study's first purpose is to provide quantitative evidence that would incentivize researchers to instead use the more robust data splitting method of nested

METHOD: First, the significant impact of different cross-validations on ML outcomes was demonstrated using real-world clinical data. Then, Monte Carlo simulations were used to quantify the interactions among the employed cross-validation method, the discriminative power of features, the dimensionality of the feature space, the dimensionality of the model, and the sample size. Four different cross-validation methods (single holdout, 10-fold, train-validation-test, and nested 10-fold) were compared based on the statistical power and confidence of the resulting ML models. Distributions of the null and alternative hypotheses were used to determine the minimum required sample size for obtaining a statistically significant outcome (5% significance) with 80% power. Statistical confidence of the model was defined as the probability of correct features being selected for inclusion in the final model.
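The power and sample-size logic described in the METHOD paragraph can be illustrated with a minimal sketch. It is not the authors' code and makes strong simplifying assumptions: the classifier is treated as fixed, test accuracy under the null hypothesis is simulated as chance-level (0.5) correct/incorrect draws, and accuracy under the alternative is simulated at an assumed true accuracy. The function names, the candidate sample sizes, and the number of trials are all hypothetical choices made for illustration; the effects of cross-validation method, feature selection, and dimensionality studied in the paper are not modeled here.

```python
import random


def simulate_accuracies(n_test, true_accuracy, n_trials, rng):
    """Draw n_trials test-set accuracies for a classifier with a fixed true accuracy."""
    return [
        sum(rng.random() < true_accuracy for _ in range(n_test)) / n_test
        for _ in range(n_trials)
    ]


def estimate_power(n_test, alt_accuracy, alpha=0.05, n_trials=2000, seed=0):
    """Monte Carlo estimate of power for detecting above-chance accuracy.

    Assumption: the null hypothesis is chance-level accuracy (0.5) on a
    balanced two-class problem.
    """
    rng = random.Random(seed)
    null = sorted(simulate_accuracies(n_test, 0.5, n_trials, rng))
    alt = simulate_accuracies(n_test, alt_accuracy, n_trials, rng)
    # Critical value: the (1 - alpha) empirical quantile of the null distribution.
    critical = null[min(n_trials - 1, int((1 - alpha) * n_trials))]
    # Power: fraction of alternative-hypothesis outcomes above the critical value.
    return sum(a > critical for a in alt) / n_trials


def minimum_sample_size(alt_accuracy, target_power=0.8, candidates=range(10, 501, 10)):
    """Return the first candidate sample size that reaches the target power, or None."""
    for n in candidates:
        if estimate_power(n, alt_accuracy) >= target_power:
            return n
    return None


if __name__ == "__main__":
    for acc in (0.7, 0.9):
        print(f"assumed true accuracy {acc}: minimum n = {minimum_sample_size(acc)}")
```

Because the critical value comes from the simulated null distribution rather than a closed-form test, the same structure could in principle be reused with a more realistic simulator, for example one that reruns a full cross-validation pipeline in each trial.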

RESULTS: ML models generated based on the single holdout method had very low statistical power and confidence, leading to overestimation of classification accuracy. Conversely, the nested 10-fold cross-validation method resulted in the highest statistical confidence and power while also providing an unbiased estimate of accuracy. The required sample size using the single holdout method could be 50% higher than what would be needed if nested

CONCLUSION: The adoption of nested

Keywords

Humans, Sample Size, Speech, Machine Learning, Language, Hearing

DOI

10.1044/2023_JSLHR-23-00273

PMID

38386017

PMCID

PMC11005022

PubMedCentral® Posted Date

February 2024

PubMedCentral® Full Text Version

Post-print

Published Open-Access

yes

Included in

Bioinformatics Commons, Biomedical Informatics Commons, Medical Sciences Commons, Medical Specialties Commons, Speech and Hearing Science Commons, Speech Pathology and Audiology Commons
