
    Statistical Integration of Heterogeneous Data with PO2PLS

    The availability of multi-omics data has revolutionized the life sciences by creating avenues for integrated system-level approaches. Data integration links the information across datasets to better understand the underlying biological processes. However, high-dimensionality, correlations and heterogeneity pose statistical and computational challenges. We propose a general framework, probabilistic two-way partial least squares (PO2PLS), which addresses these challenges. PO2PLS models the relationship between two datasets using joint and data-specific latent variables. For maximum likelihood estimation of the parameters, we implement a fast EM algorithm and show that the estimator is asymptotically normally distributed. A global test for testing the relationship between two datasets is proposed, and its asymptotic distribution is derived. Notably, several existing omics integration methods are special cases of PO2PLS. Via extensive simulations, we show that PO2PLS performs better than alternatives in feature selection and prediction performance. In addition, the asymptotic distribution appears to hold when the sample size is sufficiently large. We illustrate PO2PLS with two examples from commonly used study designs: a large population cohort and a small case-control study. Besides recovering known relationships, PO2PLS also identified novel findings. The methods are implemented in our R-package PO2PLS. Supplementary materials for this article are available online. Comment: 36 pages, 4 figures, submitted to the Journal of the American Statistical Association.

    Discussion on the paper ‘Statistical contributions to bioinformatics: Design, modelling, structure learning and integration’ by Jeffrey S. Morris and Veerabhadran Baladandayuthapani

    Bioinformatics is an important research area for statisticians. This discussion adds some topics to the paper, namely statistical contributions to detecting differentially expressed genes, to protein structure prediction, and to the analysis of highly correlated features in glycomics datasets.

    Gene analysis for longitudinal family data using random-effects models

    We have extended our recently developed 2-step approach for gene-based analysis to the family design and to the analysis of rare variants. The goal of this approach is to study the joint effect of multiple single-nucleotide polymorphisms that belong to a gene. First, the information in a gene is summarized by 2 variables, namely the empirical Bayes estimate capturing common variation and the number of rare variants. By using random effects for the common variants, our approach acknowledges the within-gene correlations. In the second step, the 2 summaries were included as covariates in linear mixed models. To test the null hypothesis of no association, a multivariate Wald test was applied. We analyzed the simulated data sets to assess the performance of the method. Then we applied the method to the real data set and identified a significant association between FRMD4B and diastolic blood pressure (p-value = 8.3 × 10⁻¹²).
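    The multivariate Wald test mentioned in this abstract can be illustrated with a minimal sketch. It assumes exactly two tested coefficients (one per gene summary) and a known 2 × 2 covariance matrix of their estimates; the function name and inputs are hypothetical and are not taken from the authors' software. With 2 degrees of freedom the chi-square tail probability has the closed form exp(-W/2), so no statistics library is needed.

```python
import math


def wald_test_2df(beta, cov):
    """Wald statistic W = beta' V^{-1} beta and its chi-square(2) p-value.

    beta: the two estimated coefficients (hypothetical inputs).
    cov:  their 2x2 covariance matrix as nested pairs.
    """
    b1, b2 = beta
    (v11, v12), (v21, v22) = cov
    det = v11 * v22 - v12 * v21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Inverse of a 2x2 matrix: (1/det) * [[v22, -v12], [-v21, v11]].
    stat = (b1 * b1 * v22 - b1 * b2 * (v12 + v21) + b2 * b2 * v11) / det
    # A chi-square variable with 2 df has survival function exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


stat, p_value = wald_test_2df((2.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
```

    In practice the coefficients and their covariance would come from the fitted linear mixed model; the sketch only shows how the joint test combines them.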

    Pathway analysis for family data using nested random-effects models

    Recently we proposed a novel two-step approach to test for pathway effects in disease progression. The goal of this approach is to study the joint effect of multiple single-nucleotide polymorphisms that belong to certain genes. By using random effects, our approach acknowledges the correlations within and between genes when testing for pathway effects. Gene-gene and gene-environment interactions can be included in the model. The method can be implemented with standard software, and the distribution of the test statistics under the null hypothesis can be approximated by using standard chi-square distributions. Hence no extensive permutations are needed to compute the p-value. In this paper we adapt and apply the method to family data, and we study its performance for sequence data from Genetic Analysis Workshop 17. For the set of unrelated subjects, the performance of the new test was disappointing. We found a power of 6% for the binary outcome and of 18% for the quantitative trait Q1. For family data the new approach appears to perform well, especially for the quantitative outcome. We found a power of 39% for the binary outcome and a power of 89% for the quantitative trait Q1.

    Locally weighted transmission/disequilibrium test for genetic association analysis

    The transmission/disequilibrium test statistic has been used for assessing genetic association in affected-parent trios. In the presence of multiple tightly linked marker loci where local dependency may exist, haplotypes are reconstructed statistically to estimate the joint effects of these markers. In this manuscript, we propose an alternative to the haplotype approach by taking a weighted average of multiple loci, where the weight is proportional to the product of (1 − 2 × recombination fraction) and the linkage disequilibrium between markers. As an illustration, we applied the method to the simulated Aipotu data.
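    The weighting rule stated in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the recombination fraction and a non-negative linkage disequilibrium measure between the target locus and each neighbouring marker are already available, and that the quantity being averaged is a per-locus score. The function names and inputs are hypothetical.

```python
def local_weights(recomb_fractions, ld_values):
    """Weights proportional to (1 - 2 * theta) * LD, normalized to sum to 1."""
    if len(recomb_fractions) != len(ld_values):
        raise ValueError("inputs must have the same length")
    raw = []
    for theta, ld in zip(recomb_fractions, ld_values):
        if not 0.0 <= theta <= 0.5:
            raise ValueError("recombination fraction must lie in [0, 0.5]")
        if ld < 0.0:
            raise ValueError("LD measure is assumed to be non-negative")
        raw.append((1.0 - 2.0 * theta) * ld)
    total = sum(raw)
    if total <= 0.0:
        raise ValueError("all weights are zero")
    return [r / total for r in raw]


def weighted_average(scores, weights):
    """Weighted average of per-locus scores (what is averaged is an assumption)."""
    return sum(s * w for s, w in zip(scores, weights))


weights = local_weights([0.0, 0.25], [1.0, 1.0])
combined = weighted_average([2.0, 4.0], weights)
```

    A marker that is unlinked to the target locus (recombination fraction 0.5) or in linkage equilibrium with it (LD of 0) receives zero weight, so only locally dependent markers contribute.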

    Haplotype Estimation from Fuzzy Genotypes Using Penalized Likelihood

    The Composite Link Model is a generalization of the generalized linear model in which expected values of observed counts are constructed as a sum of generalized linear components. When combined with penalized likelihood, it provides a powerful and elegant way to estimate haplotype probabilities from observed genotypes. Uncertain (“fuzzy”) genotypes, like those resulting from AFLP scores, can be handled by adding an extra layer to the model. We describe the model and the estimation algorithm. We apply it to a data set of accurate human single nucleotide polymorphism (SNP) genotypes and to a data set of fuzzy tomato AFLP scores.

    Decreased Levels of Bisecting GlcNAc Glycoforms of IgG Are Associated with Human Longevity

    BACKGROUND: Markers for longevity that reflect the health condition and predict healthy aging are extremely scarce. Such markers are, however, valuable in aging research. It has been shown previously that the N-glycosylation pattern of human immunoglobulin G (IgG) is age-dependent. Here we investigate whether N-linked glycans reflect early features of human longevity. METHODOLOGY/PRINCIPAL FINDINGS: The Leiden Longevity Study (LLS) consists of nonagenarian sibling pairs, their offspring, and partners of the offspring serving as controls. IgG subclass specific glycosylation patterns were obtained from 1967 participants in the LLS by MALDI-TOF-MS analysis of tryptic IgG Fc glycopeptides. Several regression strategies were applied to evaluate the association of IgG glycosylation with age, sex, and longevity. The degree of galactosylation of IgG decreased with increasing age. For the galactosylated glycoforms the incidence of bisecting GlcNAc increased as a function of age. Sex-related differences were observed at ages below 60 years. Compared to males, younger females had higher galactosylation, which decreased more strongly with increasing age, resulting in similar galactosylation for both sexes from 60 onwards. In younger participants (<60 years of age), but not in the older age group (>60 years), decreased levels of non-galactosylated glycoforms containing a bisecting GlcNAc reflected early features of longevity. CONCLUSIONS/SIGNIFICANCE: We here describe IgG glycoforms associated with calendar age at all ages and the propensity for longevity before middle age. As modulation of IgG effector functions has been described for various IgG glycosylation features, a modulatory effect may be expected for the longevity marker described in this study.