
    Scalable Feature Selection Applications for Genome-Wide Association Studies of Complex Diseases

    Personalized medicine will revolutionize our capabilities to combat disease. Working toward this goal, a fundamental task is the deciphering of genetic variants that are predictive of complex diseases. Modern studies, in the form of genome-wide association studies (GWAS), have afforded researchers the opportunity to reveal new genotype-phenotype relationships through the extensive scanning of genetic variants. These studies typically contain over half a million genetic features for thousands of individuals. Examining such data with methods other than univariate statistics is a challenging task requiring advanced algorithms that scale to the genome-wide level. In the future, next-generation sequencing (NGS) studies will contain an even larger number of common and rare variants. Machine learning-based feature selection algorithms have been shown to effectively create predictive models for various genotype-phenotype relationships. This work explores the problem of selecting genetic variant subsets that are the most predictive of complex disease phenotypes through various feature selection methodologies, including filter, wrapper and embedded algorithms. The examined machine learning algorithms were demonstrated not only to be effective at predicting the disease phenotypes, but also to do so efficiently through the use of computational shortcuts. While much of the work could be run on high-end desktops, some of it was further extended for implementation on parallel computers, helping to ensure that the methods will also scale to NGS data sets. Further, these studies analyzed the relationships between various feature selection methods and demonstrated the need for careful testing when selecting an algorithm. It was shown that there is no universally optimal algorithm for variant selection in GWAS; rather, methodologies need to be selected based on the desired outcome, such as the number of features to be included in the prediction model. It was also demonstrated that without proper model validation, for example using nested cross-validation, the models can yield overly optimistic prediction accuracies and decreased generalization ability. It is through the implementation and application of machine learning methods that one can extract predictive genotype-phenotype relationships and biological insights from genetic data sets.
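    As a hedged illustration of the three feature selection families named in the abstract (filter, wrapper and embedded), the sketch below applies scikit-learn to a small synthetic genotype matrix; the data sizes, models and parameters are illustrative assumptions, not the ones used in the thesis.

```python
# Minimal sketch of filter, wrapper and embedded feature selection on
# simulated genotype data (assumed setup, not the thesis code).
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_individuals, n_variants = 500, 2000
X = rng.integers(0, 3, size=(n_individuals, n_variants))  # minor allele dosages 0/1/2
y = rng.integers(0, 2, size=n_individuals)                # case/control phenotype

# Filter: rank variants by a univariate chi-squared statistic, keep the top k.
filt = SelectKBest(chi2, k=50).fit(X, y)
filter_idx = filt.get_support(indices=True)

# Wrapper: recursively eliminate variants based on a classifier's coefficients.
wrapper = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=50, step=0.25).fit(X, y)
wrapper_idx = wrapper.get_support(indices=True)

# Embedded: L1-penalized logistic regression selects variants while fitting.
embedded = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
embedded_idx = np.flatnonzero(embedded.coef_[0])

print(len(filter_idx), len(wrapper_idx), len(embedded_idx))
```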

    Regularized Machine Learning in the Genetic Prediction of Complex Traits

    Compared to univariate analysis of genome-wide association (GWA) studies, machine learning–based models have been shown to provide improved means of learning such multilocus panels of genetic variants and their interactions that are most predictive of complex phenotypic traits. Many applications of predictive modeling rely on effective variable selection, often implemented through model regularization, which penalizes the model complexity and enables predictions in individuals outside of the training dataset. However, the different regularization approaches may also lead to considerable differences, especially in the number of genetic variants needed for maximal predictive accuracy, as illustrated here in examples from both disease classification and quantitative trait prediction. We also highlight the potential pitfalls of regularized machine learning models, related to issues such as model overfitting to the training data, which may lead to over-optimistic prediction results, as well as identifiability of the predictive variants, which is important in many medical applications. While genetic risk prediction for human diseases is used as a motivating use case, we argue that these models are also widely applicable in nonhuman applications, such as animal and plant breeding, where accurate genotype-to-phenotype modeling is needed. Finally, we discuss some key future advances, open questions and challenges in this developing field, when moving toward low-frequency variants and cross-phenotype interactions.
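    To illustrate the point that different regularization approaches can need very different numbers of variants to reach their best accuracy, here is a minimal sketch on simulated data using scikit-learn's Lasso and Elastic Net; the simulation, the penalty grid and the l1_ratio value are assumptions, not the setup of the article.

```python
# Sketch: how regularization strength and penalty type change the number
# of variants kept in the model (assumed simulation, not the article's data).
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
n, p, causal = 300, 1000, 20
X = rng.integers(0, 3, size=(n, p)).astype(float)   # allele dosages 0/1/2
beta = np.zeros(p)
beta[:causal] = rng.normal(0.0, 1.0, causal)         # a few truly causal variants
y = X @ beta + rng.normal(0.0, 1.0, n)                # simulated quantitative trait

for alpha in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    enet = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=10000).fit(X, y)
    print(f"alpha={alpha}: Lasso keeps {np.sum(lasso.coef_ != 0)} variants, "
          f"Elastic Net keeps {np.sum(enet.coef_ != 0)} variants")
```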

    Penalty terms and loss functions.

    (A) Penalty terms: the L0 norm imposes the most explicit constraint on the model complexity, as it effectively counts the number of nonzero entries in the model parameter vector. While it is possible to train prediction models with an L0 penalty using, e.g., greedy or other types of discrete optimization methods, the problem becomes mathematically challenging due to the nonconvexity of the constraint, especially when a loss function other than the squared loss is used. The convexity of the L1 and L2 norms makes them easier to optimize. While the L2 norm has good regularization properties, it must be used together with either the L0 or L1 norm to perform feature selection. (B) Loss functions: the plain classification error is difficult to minimize due to its nonconvex and discontinuous nature, and therefore one often resorts to its better-behaved surrogates, including the hinge loss used with SVMs, the cross-entropy used with logistic regression, or the squared error used with regularized least-squares classification and regression. These surrogates in turn differ both in their quality of approximating the classification error and in terms of the optimization machinery with which they can be minimized (Text S1).
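    The penalties and surrogate losses described in the caption can be written down compactly; the short sketch below is an assumed plain-NumPy illustration for a model weight vector w, a label y in {-1, +1} and a real-valued prediction f, not code from the article.

```python
# Penalty terms and surrogate loss functions from the caption (illustrative).
import numpy as np

def l0_norm(w):      # counts nonzero coefficients; nonconvex, hard to optimize
    return np.count_nonzero(w)

def l1_norm(w):      # convex, induces sparsity (feature selection)
    return np.sum(np.abs(w))

def l2_norm_sq(w):   # convex, shrinks coefficients but does not zero them out
    return np.sum(w ** 2)

def zero_one_loss(y, f):       # plain classification error; nonconvex, discontinuous
    return float(np.sign(f) != y)

def hinge_loss(y, f):          # surrogate used with SVMs
    return max(0.0, 1.0 - y * f)

def cross_entropy_loss(y, f):  # logistic loss, used with logistic regression
    return np.log1p(np.exp(-y * f))

def squared_loss(y, f):        # used with regularized least-squares models
    return (y - f) ** 2
```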

    Performance of regularized machine learning models.

    Upper panel: behavior of the learning approaches in terms of their predictive accuracy (y-axis) as a function of the number of selected variants (x-axis). Differences can be attributed to genotypic and phenotypic heterogeneity as well as genotyping density and quality. (A) The area under the receiver operating characteristic curve (AUC) for the prediction of Type 1 diabetes (T1D) cases in SNP data from WTCCC [118], representing ca. one million genetic features and ca. 5,000 individuals in a case-control setup. (B) Coefficient of determination (R²) for the prediction of a continuous trait (Tunicamycin) in SNP data from a cross between two yeast strains (Y2C) [44], representing ca. 12,000 variants and ca. 1,000 segregants in a controlled laboratory setup. The peak prediction accuracy and the number of most predictive variants are listed in the legend.

    Model validation was implemented using nested 3-fold cross-validation (CV) [5]. Prior to any analysis, the data was split into three folds. On each outer round of CV, two of the folds were combined to form a training set, and the remaining one was used as an independent test set. On each round, all feature and parameter selection was done using a further internal 3-fold CV on the training set, and the predictive performance of the learned models was evaluated on the independent test set. The final performance estimates were calculated as the average over these three iterations of the experiment. In learning approaches where internal CV was not needed to select model parameters (e.g., log odds), this is equivalent to a standard 3-fold CV.

    T1D data: the L2-regularized (ridge) regression was based on selecting the top 500 variants according to the χ² filter. For wrappers, we used our greedy L2-regularized least squares (RLS) implementation [30], while the embedded methods, Lasso, Elastic Net and L1-logistic regression, were implemented through Scikit-Learn [119], interpolated across various regularization parameters up to the maximal number of variants (500 or 1,000). As a baseline model, we implemented a log odds-ratio weighted sum of the minor allele dosages in the 500 selected variants within each individual [25]. Y2C data: the filter method was based on the top 1,000 variants selected according to R², followed by L2-regularization within greedy RLS using nested CV. As a baseline model, we implemented a greedy version of least squares (LS), which is similar to the stepwise forward regression used in the original work [44]; greedy LS differs from greedy RLS in that it implements regularization through optimization of the L0 norm instead of the L2 norm. It was noted that the accuracy of the greedy LS method drops around the point where the number of selected variants exceeds the number of training examples (here, 400).

    Lower panel: overlap in the genetic features selected by the different approaches. (C) The numbers of selected variants within the major histocompatibility complex (MHC) are shown in parentheses for the T1D data. (D) The overlap among the maximally predictive variants in the Y2C data. Note: these results should be considered merely as illustrative examples. Differing results may be obtained when other prediction models are implemented in other genetic datasets or other prediction applications.
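    The nested 3-fold cross-validation scheme described above can be approximated with scikit-learn's Pipeline, GridSearchCV and cross_val_score: all filtering and parameter tuning happen inside the training folds, so the outer estimate is not optimistically biased. The sketch below uses simulated case-control data and an assumed L1-logistic model behind a χ² top-500 filter, so it mirrors the T1D setup only loosely.

```python
# Sketch of nested 3-fold CV with feature selection inside the loop
# (assumed simulation and model choices, not the figure's exact setup).
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(600, 2000)).astype(float)  # allele dosages 0/1/2
y = rng.integers(0, 2, size=600)                        # case/control labels

pipe = Pipeline([
    ("filter", SelectKBest(chi2, k=500)),  # variant filter fitted on training folds only
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)),
])

# Inner 3-fold CV tunes the regularization strength C.
inner = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0]},
                     cv=StratifiedKFold(3), scoring="roc_auc")

# Outer 3-fold CV evaluates the tuned model on held-out folds.
auc = cross_val_score(inner, X, y, cv=StratifiedKFold(3), scoring="roc_auc")
print("nested-CV AUC: %.3f +/- %.3f" % (auc.mean(), auc.std()))
```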