Interview with Dr Gerard Dumancas

Posted Thu, Mar 03, 2016

This author interview is with Dr. Gerard Dumancas of Oklahoma Baptist University. Dr. Dumancas' full paper, "Development and Application of a Genetic Algorithm for Variable Optimization and Predictive Modeling of 5-year Mortality Using Questionnaire Data," is available for download in Bioinformatics and Biology Insights.

Please summarize for readers the content of your article.
This manuscript presents a novel method that uses a genetic algorithm (GA) for variable selection in questionnaire data. The selected variables are then used to construct predictive models of five-year mortality with various machine learning techniques. Parametric and non-parametric machine learning algorithms are emerging computational methods with increasing applications in bioinformatics and computational biology. We examined 123 questions (variables) answered by 5444 individuals in the National Health and Nutrition Examination Survey (NHANES). The GA iterations selected the top 24 variables, including questions related to stroke, emphysema, and general health problems requiring the use of special equipment, for use in predictive modeling by various parametric and non-parametric machine learning techniques. Using these top 24 variables, gradient boosting yielded the best performance (AUC=0.7654), although other techniques had lower but not significantly different AUCs. The results of this study will provide novel insights for computational biologists and bioinformaticians who wish to use GA in conjunction with machine learning techniques to efficiently select important variables and determine their predictive accuracy.
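To make the idea of GA-based variable selection concrete, here is a minimal sketch in Python. It is not the procedure used in the paper: the synthetic yes/no data, the function names (make_toy_data, fitness, select_variables), the vote-count scoring rule, and all settings (population size, number of generations, mutation rate, size penalty) are assumptions chosen only for illustration. Each candidate solution is a 0/1 mask over the questions, and the GA evolves masks that predict the outcome well while using few questions.

```python
import random


def make_toy_data(n_rows=200, n_vars=12, informative=(0, 3, 7), seed=0):
    """Synthetic yes/no answers; only the `informative` columns drive the outcome."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_rows):
        row = [rng.randint(0, 1) for _ in range(n_vars)]
        # Outcome is 1 when at least two informative answers are "yes".
        label = 1 if sum(row[j] for j in informative) >= 2 else 0
        if rng.random() < 0.1:  # 10% label noise
            label = 1 - label
        X.append(row)
        y.append(label)
    return X, y


def fitness(mask, X, y, penalty=0.01):
    """Best accuracy of a 'count the yes answers' rule on the selected
    columns, minus a small penalty per selected variable (toy stand-in
    for a real predictive model)."""
    selected = [j for j, keep in enumerate(mask) if keep]
    if not selected:
        return 0.0
    counts = [sum(row[j] for j in selected) for row in X]
    best_accuracy = 0.0
    for threshold in range(1, len(selected) + 1):
        correct = sum(
            1 for c, label in zip(counts, y) if (c >= threshold) == (label == 1)
        )
        best_accuracy = max(best_accuracy, correct / len(y))
    return best_accuracy - penalty * len(selected)


def select_variables(X, y, pop_size=20, generations=30, mutation_rate=0.05, seed=1):
    """Evolve 0/1 masks over the columns; return the selected column indices."""
    rng = random.Random(seed)
    n_vars = len(X[0])
    population = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(m, X, y), reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0][:]]           # elitism: carry over the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_vars - 1)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability `mutation_rate`.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    best = max(population, key=lambda m: fitness(m, X, y))
    return [j for j, keep in enumerate(best) if keep]


if __name__ == "__main__":
    X, y = make_toy_data()
    print("Selected question indices:", select_variables(X, y))
```

In a realistic setting the toy scoring rule would be replaced by the cross-validated performance of an actual model fitted on the selected questions, which is considerably more expensive to evaluate.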

How did you come to be involved in your area of study?
I obtained my PhD in Analytical Chemistry with a specialization in Chemometrics (statistics and computer applications in analytical chemistry). I found this area very interesting because it has a wide array of applications and overlaps with several fields including computer science and biology. I then pursued a Postdoctoral Fellowship in genetic epidemiology, which further honed my interests in the field of machine learning techniques, chemometrics, and molecular genetics.

What was previously known about the topic of your article?
Surveys and questionnaires are widely used in various areas of research, especially in health-related fields, as they provide a relatively efficient method of sampling many individuals in an inexpensive and less obtrusive manner. Questionnaires have been implemented as effective means to study or aid in the diagnosis of musculoskeletal, psychological, cardiovascular, and other disorders, and as the popularity of questionnaires has grown, so has the potential for new understanding via advanced statistical techniques like GA. Although GA has been successfully applied for optimizing selection of questionnaire data in the context of family medicine, stressful life events, and sleep apnea diagnosis, its application to the selection of questionnaire data for predictive modeling of disease outcome is relatively novel. Outside of biomedical research, Madden and colleagues utilized GAs in the analysis of questionnaire data to ascertain students' attitudes toward their schoolwork, showing that GAs may be used to generate logical rules that predict one variable in relation to others. Additionally, Yukselturk and colleagues applied GA in predicting student dropout utilizing only 10 variables.

How has your work in this area advanced understanding of the topic?
The optimization of questionnaire variable selection and the ability to construct predictive models using selected variables represent a promising enterprise for researchers and clinicians alike. With techniques like those employed in our study, interesting questions may be posed regarding the importance of variables for understanding a certain outcome, enabling rational questionnaire design and improved diagnostic or prognostic capabilities. However, independent validation is needed before such methods are integrated into everyday clinical practice. Additionally, to optimize predictive reliability, machine learning techniques must be chosen according to the characteristics of the population and variables in question, as illustrated in our study.

What do you regard as being the most important aspect of the results reported in the article?
This study provided a novel examination of GA as a useful tool for variable selection in the context of questionnaire data. From an initial set of 123 variables, GA selected 24 variables from the NHANES for use in predictive modeling of 5-year mortality with machine learning techniques. This study was uniquely comprehensive in its consideration of such techniques, and gradient boosting performed best (AUC=0.7654), significantly outperforming random forest, LASSO, partial least squares-discriminant analysis (PLS-DA), and recursive partitioning and regression trees (RPART) techniques (P<0.05). Its performance, however, was not significantly different (P>0.05) from that of artificial neural network (ANN), elastic net, support vector machine (SVM), ridge regression (RR), or logistic regression. Insights obtained from this study can be used to design automated methods for variable selection and outcome prediction in a clinical setting.
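For readers less familiar with the AUC figures quoted above, the short Python sketch below shows one standard way to compute an AUC from predicted risk scores and observed outcomes. It uses the pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The function name auc and the example labels and scores are made up for illustration and are not data from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical outcomes (1 = died within 5 years) and model risk scores.
    labels = [1, 0, 1, 0, 0]
    scores = [0.9, 0.4, 0.35, 0.2, 0.8]
    print(round(auc(labels, scores), 4))  # 4 of 6 pairs ranked correctly -> 0.6667
```

An AUC of 0.5 corresponds to ranking no better than chance and 1.0 to perfect ranking. Whether two models' AUCs differ significantly is a separate statistical question that this sketch does not address.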


Posted in: Supplements
