Objectives : To identify genetic and dietary factors, and their interactions, that contribute to type 2 diabetes (T2D), and to predict an individual’s risk in order to design more precise prevention and treatment strategies.
Methods : A genome-wide scan for up to three-way interactions among 717,275 single nucleotide polymorphisms (SNPs) and 139 dietary and lifestyle factors was conducted in 1,380 participants of the Boston Puerto Rican Health Study using the Generalized Multifactor Dimensionality Reduction (GMDR) method. Based on the identified genetic and dietary factors, we then used machine learning (ML) to predict T2D risk, and assessed prediction accuracy using the area under the Receiver Operating Characteristic curve (ROC-AUC).
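As a rough illustration of the idea behind multifactor dimensionality reduction, the sketch below pools multi-locus genotype combinations ("cells") into a high-risk group whenever their case-to-control ratio exceeds the overall ratio in the sample. This is the classic, simplified MDR step only, not the GMDR procedure used in the study (GMDR works with score-based residuals and allows covariate adjustment). The function name `mdr_high_risk_cells` and the toy genotypes are hypothetical.

```python
from collections import defaultdict


def mdr_high_risk_cells(genotypes, labels):
    """Return the set of genotype combinations labeled high-risk.

    genotypes: list of tuples, one per participant, e.g. (snp1, snp2)
               with each SNP coded 0/1/2 (assumed coding).
    labels:    list of 1 (case) or 0 (control), same length.
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    threshold = n_cases / n_controls  # overall case:control ratio

    # cell -> [number of controls, number of cases]
    counts = defaultdict(lambda: [0, 0])
    for cell, y in zip(genotypes, labels):
        counts[cell][y] += 1

    high_risk = set()
    for cell, (controls, cases) in counts.items():
        # A cell with cases but no controls is treated as high-risk.
        if controls == 0 or cases / controls > threshold:
            high_risk.add(cell)
    return high_risk


# Toy two-SNP example (made-up data, 5 cases and 5 controls).
genotypes = [(0, 0), (0, 0), (0, 0), (1, 2), (1, 2),
             (1, 2), (2, 1), (2, 1), (2, 1), (0, 2)]
labels = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
high_risk = mdr_high_risk_cells(genotypes, labels)
print(sorted(high_risk))
```

Collapsing many genotype combinations into a single high-risk/low-risk variable is what makes scanning two-way and three-way interactions tractable; in practice the labeling would be cross-validated rather than fit once on all participants.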
Results : A genome-wide scan for main effects and up to three-way interactions between SNPs and dietary factors using GMDR identified a set of 818 SNPs and 12 dietary factors, which were selected for the prediction of T2D incidence. Comparing several ML algorithms, we found that stochastic gradient boosting provided the best prediction accuracy for T2D incidence, with an ROC-AUC of 0.93 in the training set and an overall accuracy of 85% based on test set validation. This approach identified 52 SNPs in 37 genes, three food groups of high sugar content, and age as key predictors in the best-fit model.
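For readers less familiar with the two metrics reported above, the following minimal sketch computes ROC-AUC as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control (ties count as one half), and overall accuracy at an assumed 0.5 cutoff. The function names, the cutoff, and the toy scores are illustrative assumptions and are not taken from the study.

```python
def roc_auc(y_true, scores):
    """ROC-AUC via pairwise comparison of case and control scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, cutoff=0.5):
    """Fraction of participants classified correctly at the given cutoff."""
    correct = sum(int(s >= cutoff) == y for y, s in zip(y_true, scores))
    return correct / len(y_true)


# Toy predicted risks for two controls (0) and two cases (1).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
acc = accuracy(y_true, scores)
print(auc, acc)
```

ROC-AUC summarizes ranking quality across all cutoffs, whereas accuracy depends on the single cutoff chosen, which is one reason the two numbers need not agree.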
Conclusions : This study illustrates a methodology that combines the detection of gene-gene and gene-environment interactions with machine learning to predict the incidence of T2D. This genome-wide approach allows identification of the diet and lifestyle factors that interact with genotype, and can inform personalized nutrition strategies for the prevention and treatment of T2D.
Funding Sources :
This work was funded by the US Department of Agriculture, under agreement no. 8050-51000-098-00D, and NIH grants P01 AG023394, P50 HL105185, and R01 AG027087.