Prediction of inherited genomic susceptibility to 20 common cancer types by a supervised machine-learning method.

Authors: Kim Byung-Ju; Kim Sung-Hou
Publication venue: eScholarship, University of California
Publication date: 01/02/2018

Prevention and early intervention are the most effective ways of avoiding or minimizing psychological, physical, and financial suffering from cancer. However, such proactive action requires the ability to predict the individual's susceptibility to cancer with a measure of probability. Of the triad of cancer-causing factors (inherited genomic susceptibility, environmental factors, and lifestyle factors), the inherited genomic component may be derivable from the recent public availability of a large body of whole-genome variation data. However, genome-wide association studies have so far shown limited success in predicting the inherited susceptibility to common cancers. We present here a multiple classification approach for predicting individuals' inherited genomic susceptibility to acquire the most likely phenotype among a panel of 20 major common cancer types plus 1 "healthy" type by application of a supervised machine-learning method under competing conditions among the cohorts of the 21 types. This approach suggests that, depending on the phenotypes of 5,919 individuals of the "white" ethnic population in this study, (i) the portion of the cohort of a cancer type who acquired the observed type due to mostly inherited genomic susceptibility factors ranges from about 33 to 88% (or its corollary: the portion due to mostly environmental and lifestyle factors ranges from 12 to 67%), and (ii) on an individual level, the method also predicts individuals' inherited genomic susceptibility to acquire the other types ranked with associated probabilities. These probabilities may provide practical information for individuals, health professionals, and health policymakers related to prevention and/or early intervention of cancer.

eScholarship - University of California

Stratification bias in low signal microarray studies

Authors: Bedo Justin; Guenter Simon; Parker Brian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 09/12/2015

BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. 
As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.

The Australian National University
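
The pooling effect described in the stratification-bias abstract above can be reproduced with a small, self-contained sketch. It is not the authors' code. The no-signal "classifier" (`prior_score`, which scores every test sample with the positive-class fraction of its training set), the toy labels, and the fold layout are all assumptions chosen for illustration. Because the scores carry no information about individual samples, an unbiased AUC estimate should be 0.5.

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney statistic; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prior_score(train_labels):
    """Hypothetical no-signal classifier: the training-set positive fraction."""
    return sum(train_labels) / len(train_labels)


def kfold_aucs(labels, folds):
    """Return (pooled AUC, mean per-fold AUC) for the no-signal classifier."""
    pooled_scores = [0.0] * len(labels)
    per_fold = []
    for fold in folds:
        held_out = set(fold)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        score = prior_score(train)
        for i in fold:
            pooled_scores[i] = score
        # Every fold must contain both classes for a per-fold AUC to exist.
        per_fold.append(auc([score] * len(fold), [labels[i] for i in fold]))
    return auc(pooled_scores, labels), sum(per_fold) / len(per_fold)


def loo_pooled_auc(labels):
    """Pooled AUC under ordinary (unbalanced) leave-one-out cross-validation."""
    scores = [prior_score(labels[:i] + labels[i + 1:]) for i in range(len(labels))]
    return auc(scores, labels)


# Two imperfectly stratified folds: 3 positives + 1 negative, then 1 + 3.
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_folds = [[0, 1, 2, 3], [4, 5, 6, 7]]
pooled, averaged = kfold_aucs(toy_labels, toy_folds)
print(pooled, averaged)  # 0.25 0.5

# Leave-one-out on a balanced set of 10 positives and 10 negatives.
print(loo_pooled_auc([1] * 10 + [0] * 10))  # 0.0
```

In this sketch a fold that happens to be rich in positives leaves a training set poor in positives, so its samples all receive a low score. Pooling scores across folds then ranks positives below negatives, while the per-fold average stays at 0.5 because scores are only compared within a fold.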

Combining genome-wide association mapping and transcriptional networks to identify novel genes controlling glucosinolates in Arabidopsis thaliana.

Authors: Chan Eva KF; Corwin Jason A; Joseph Bindu; Kliebenstein Daniel J; Rowe Heather C
Publication venue: eScholarship, University of California
Publication date: 01/08/2011

Background: Genome-wide association (GWA) is gaining popularity as a means to study the architecture of complex quantitative traits, partially due to the improvement of high-throughput low-cost genotyping and phenotyping technologies. Glucosinolate (GSL) secondary metabolites within Arabidopsis spp. can serve as a model system to understand the genomic architecture of adaptive quantitative traits. GSL are key anti-herbivory defenses that impart adaptive advantages within field trials. While little is known about how variation in the external or internal environment of an organism may influence the efficiency of GWA, GSL variation is known to be highly dependent upon the external stresses and developmental processes of the plant, making it an excellent model for studying conditional GWA. Methodology/principal findings: To understand how development and environment can influence GWA, we conducted a study using 96 Arabidopsis thaliana accessions, >40 GSL phenotypes across three conditions (one developmental comparison and one environmental comparison) and ∼230,000 SNPs. Developmental stage had dramatic effects on the outcome of GWA, with each stage identifying different loci associated with GSL traits. Further, while the molecular bases of numerous quantitative trait loci (QTL) controlling GSL traits have been identified, there is currently no estimate of how many additional genes may control natural variation in these traits. We developed a novel co-expression network approach to prioritize the thousands of GWA candidates and successfully validated a large number of these genes as influencing GSL accumulation within A. thaliana using single gene isogenic lines. Conclusions/significance: Together, these results suggest that complex traits imparting environmentally contingent adaptive advantages are likely influenced by up to thousands of loci that are sensitive to fluctuations in the environment or developmental state of the organism.
Additionally, while GWA is highly conditional upon genetics, the use of additional genomic information can rapidly identify causal loci en masse.

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

LightGWAS: A Novel Genome-Wide Association Study Procedure

Author: Ambrozio Bruno
Publication venue: Dublin Institute of Technology
Publication date: 01/01/2020

This dissertation proposes LightGWAS, a novel machine learning procedure for genome-wide association study (GWAS) based on LightGBM and k-fold cross-validation. The conducted literature review identified that the currently available GWAS implementations rely on extensive manual quality control steps to address statistical issues, such as controlling for false-positive inflation and power reduction. It also showed that they demand a specific GWAS method for each type of genomic dataset morphology, which increases dependence on human intervention and leaves room for misleading results. LightGWAS is a potential single, resilient, autonomous and scalable solution to address such concerns. In this research, LightGWAS was contrasted against the current state of the art for GWAS through secondary research. It was compared with a GWAS implementation based on a general linear model (GLM) with support for Firth regularisation. Quantitative empirical tests and deductive reasoning were employed to reach and evaluate the results. The models were applied to balanced (case:control=1:1), imbalanced (case:control=1:10), and high-imbalanced (case:control=1:100) genomic datasets of binary phenotypes. The results from statistical tests indicated that LightGWAS performs equivalently to the compared GLM method for balanced dataset scenarios, and outperforms it for imbalanced and high-imbalanced datasets. The assessed metrics were the harmonic mean of precision and recall (F1), recall, average precision score (APS), area under the receiver operating characteristic curve (ROC AUC), accuracy, and precision.

Arrow@TUDublin
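
Why the LightGWAS abstract reports F1, precision, and recall alongside accuracy can be seen from a short calculation on a 1:100 case:control split. The sketch below is illustrative only: the helper `binary_metrics`, the two sets of predictions, and the convention of returning 0.0 when a ratio is undefined are assumptions, not part of the dissertation.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = case, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical high-imbalance dataset: 10 cases, 1000 controls (1:100).
y_true = [1] * 10 + [0] * 1000

# Predictor A calls everyone a control.
always_control = [0] * 1010

# Predictor B finds 8 of the 10 cases at the cost of 20 false positives.
finds_cases = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 980

for name, y_pred in [("always control", always_control), ("finds cases", finds_cases)]:
    m = binary_metrics(y_true, y_pred)
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these made-up predictions, the trivial predictor scores about 0.990 on accuracy but 0 on recall and F1, whereas the predictor that actually finds cases has slightly lower accuracy (about 0.978) and an F1 of about 0.421. Accuracy alone would rank them the wrong way round.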