
    Performance of regularized machine learning models.

    <p>Upper panel: Behavior of the learning approaches in terms of their predictive accuracy (<i>y</i>-axis) as a function of the number of selected variants (<i>x</i>-axis). Differences can be attributed to the genotypic and phenotypic heterogeneity as well as genotyping density and quality. (A) The area under the receiver operating characteristic curve (AUC) for the prediction of Type 1 diabetes (T1D) cases in SNP data from WTCCC <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004754#pgen.1004754-Wellcome1" target="_blank">[118]</a>, representing ca. one million genetic features and ca. 5,000 individuals in a case-control setup. (B) Coefficient of determination (<i>R</i><sup>2</sup>) for the prediction of a continuous trait (Tunicamycin) in SNP data from a cross between two yeast strains (Y2C) <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004754#pgen.1004754-Bloom1" target="_blank">[44]</a>, representing ca. 12,000 variants and ca. 1,000 segregants in a controlled laboratory setup. The peak prediction accuracy and the number of most predictive variants are listed in the legend. The model validation was implemented using nested 3-fold cross-validation (CV) <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004754#pgen.1004754-Okser2" target="_blank">[5]</a>. Prior to any analysis, the data were split into three folds. On each outer round of CV, two of the folds were combined to form a training set, and the remaining one was used as an independent test set. On each round, all feature and parameter selection was done using a further internal 3-fold CV on the training set, and the predictive performance of the learned models was evaluated on the independent test set. The final performance estimates were calculated as the average over these three iterations of the experiment.
In learning approaches where internal CV was not needed to select model parameters (e.g., log odds), this is equivalent to a standard 3-fold CV. T1D data: the <i>L</i><sub>2</sub>-regularized (ridge) regression was based on selecting the top 500 variants according to the <i>χ</i><sup>2</sup> filter. For wrappers, we used our greedy <i>L</i><sub>2</sub>-regularized least squares (RLS) implementation <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004754#pgen.1004754-Pahikkala1" target="_blank">[30]</a>, while the embedded methods, Lasso, Elastic Net, and <i>L</i><sub>1</sub>-logistic regression, were implemented through Scikit-Learn <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004754#pgen.1004754-Pedregosa1" target="_blank">[119]</a>, interpolated across various regularization parameters up to the maximal number of variants (500 or 1,000). As a baseline model, we implemented a log odds-ratio weighted sum of the minor allele dosage in the 500 selected variants within each individual <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004754#pgen.1004754-Evans1" target="_blank">[25]</a>. Y2C: the filter method was based on the top 1,000 variants selected according to <i>R</i><sup>2</sup>, followed by <i>L</i><sub>2</sub>-regularization within greedy RLS using nested CV. As a baseline model, we implemented a greedy version of least squares (LS), which is similar to the stepwise forward regression used in the original work <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004754#pgen.1004754-Bloom1" target="_blank">[44]</a>; the greedy LS differs from the greedy RLS in that it implements regularization by optimizing the <i>L</i><sub>0</sub> norm instead of the <i>L</i><sub>2</sub> norm. Note that the performance of the greedy LS method drops around the point where the number of selected variants exceeds the number of training examples (here, 400).
Lower panel: Overlap in the genetic features selected by the different approaches. (C) The numbers of selected variants within the major histocompatibility complex (MHC) are shown in parentheses for the T1D data. (D) The overlap among the maximally predictive variants in the Y2C data. Note: these results should be considered merely as illustrative examples; differing results may be obtained with other prediction models, other genetic datasets, or other prediction applications.</p>
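The nested cross-validation scheme described in the caption above can be sketched in code. The following is a minimal, self-contained illustration using a closed-form ridge (<i>L</i><sub>2</sub>-regularized least squares) solver on synthetic data; it is not the authors' actual pipeline, and the function names, fold counts, and candidate regularization parameters are illustrative only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form L2-regularized least squares: w = (X'X + lam*I)^-1 X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    # Mean squared prediction error of weights w on (X, y)
    r = X @ w - y
    return float(r @ r / len(y))

def nested_cv(X, y, lambdas, n_folds=3, seed=0):
    # Outer CV estimates performance; inner CV (on the training folds
    # only) selects the regularization parameter, so the test fold is
    # never used for model selection.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    outer_scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Inner 3-fold CV on the training set to pick lambda
        inner = np.array_split(train, n_folds)
        best_lam, best_err = None, np.inf
        for lam in lambdas:
            errs = []
            for m in range(n_folds):
                val = inner[m]
                tr = np.concatenate([inner[j] for j in range(n_folds) if j != m])
                errs.append(mse(X[val], y[val], ridge_fit(X[tr], y[tr], lam)))
            if np.mean(errs) < best_err:
                best_err, best_lam = np.mean(errs), lam
        # Refit on the full outer training set, evaluate on the held-out fold
        w = ridge_fit(X[train], y[train], best_lam)
        outer_scores.append(mse(X[test], y[test], w))
    # Final estimate: average over the outer iterations
    return float(np.mean(outer_scores))
```

The same structure extends to the feature-selection step described above: any filtering or wrapper search would run inside the inner loop, on the training folds only.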

    Penalty terms and loss functions.

    <p>(A) Penalty terms: the <i>L</i><sub>0</sub>-norm imposes the most explicit constraint on model complexity, as it effectively counts the number of nonzero entries in the model parameter vector. While it is possible to train prediction models with an <i>L</i><sub>0</sub>-penalty using, e.g., greedy or other types of discrete optimization methods, the problem becomes mathematically challenging due to the nonconvexity of the constraint, especially when a loss function other than the squared loss is used. The convexity of the <i>L</i><sub>1</sub> and <i>L</i><sub>2</sub> norms makes them easier to optimize. While the <i>L</i><sub>2</sub> norm has good regularization properties, it must be used together with either the <i>L</i><sub>0</sub> or <i>L</i><sub>1</sub> norm to perform feature selection. (B) Loss functions: the plain classification error is difficult to minimize due to its nonconvex and discontinuous nature, and therefore one often resorts to its better-behaved surrogates, including the hinge loss used with SVMs, the cross-entropy used with logistic regression, or the squared error used with regularized least-squares classification and regression. These surrogates in turn differ both in how well they approximate the classification error and in the optimization machinery with which they can be minimized (<a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004754#pgen.1004754.s001" target="_blank">Text S1</a>).</p>
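The relationship between the plain classification error and its convex surrogates can be made concrete in a few lines. Below is a small illustrative sketch (not from the article) writing each loss as a function of the margin <i>m</i> = <i>y</i>·<i>f</i>(<i>x</i>); each surrogate is convex in <i>m</i> and upper-bounds the 0/1 error, which is what makes minimizing it a sensible proxy.

```python
import math

def zero_one(m):
    # Plain classification error: 1 if the margin is nonpositive, else 0.
    # Nonconvex and discontinuous, hence hard to minimize directly.
    return 1.0 if m <= 0 else 0.0

def hinge(m):
    # Hinge loss used with SVMs: convex, piecewise linear.
    return max(0.0, 1.0 - m)

def logistic(m):
    # Cross-entropy (logistic) loss, rescaled by log 2 so that it
    # passes through 1 at m = 0 and upper-bounds the 0/1 error.
    return math.log1p(math.exp(-m)) / math.log(2.0)

def squared(m):
    # Squared error in margin form, used by regularized least squares;
    # note it also penalizes margins larger than 1.
    return (1.0 - m) ** 2
```

Evaluating these on a grid of margins shows each surrogate sitting on or above the 0/1 step function, with the surrogates diverging from one another mainly in how strongly they penalize large negative (and, for the squared loss, large positive) margins.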

    Association between level of work-related exhaustion and relative leukocyte telomere length.

    <p><sup>a</sup>Model 1 is adjusted for sex and age.</p><p><sup>b</sup>Model 2 is adjusted for sex, age, marital status, occupational grade, daily smoking, body mass index, physical illness, and common mental disorders.</p>

    sj-docx-1-sjp-10.1177_14034948221119634 – Supplemental material for Marital status and genetic liability independently predict coronary heart disease incidence

    Supplemental material, sj-docx-1-sjp-10.1177_14034948221119634 for Marital status and genetic liability independently predict coronary heart disease incidence by Karri Silventoinen, Hannu Lahtinen, Kaarina Korhonen, George Davey Smith, Samuli Ripatti, Tim Morris and Pekka Martikainen in Scandinavian Journal of Public Health