470 research outputs found

    A whitening approach to probabilistic canonical correlation analysis for omics data integration

    Get PDF
    ackground Canonical correlation analysis (CCA) is a classic statistical tool for investigating complex multivariate data. Correspondingly, it has found many diverse applications, ranging from molecular biology and medicine to social science and finance. Intriguingly, despite the importance and pervasiveness of CCA, only recently a probabilistic understanding of CCA is developing, moving from an algorithmic to a model-based perspective and enabling its application to large-scale settings. Results Here, we revisit CCA from the perspective of statistical whitening of random variables and propose a simple yet flexible probabilistic model for CCA in the form of a two-layer latent variable generative model. The advantages of this variant of probabilistic CCA include non-ambiguity of the latent variables, provisions for negative canonical correlations, possibility of non-normal generative variables, as well as ease of interpretation on all levels of the model. In addition, we show that it lends itself to computationally efficient estimation in high-dimensional settings using regularized inference. We test our approach to CCA analysis in simulations and apply it to two omics data sets illustrating the integration of gene expression data, lipid concentrations and methylation levels. Conclusions Our whitening approach to CCA provides a unifying perspective on CCA, linking together sphering procedures, multivariate regression and corresponding probabilistic generative models. Furthermore, we offer an efficient computer implementation in the “whitening” R package available at https://CRAN.R-project.org/package=whitening

    Inference of demographic history from genealogical trees using reversible jump Markov chain Monte Carlo

    Get PDF
    Background: Coalescent theory is a general framework to model genetic variation in a population. Specifically, it allows inference about population parameters from sampled DNA sequences. However, most currently employed variants of coalescent theory only consider very simple demographic scenarios of population size changes, such as exponential growth. Results: Here we develop a coalescent approach that allows Bayesian non-parametric estimation of the demographic history using genealogies reconstructed from sampled DNA sequences. In this framework inference and model selection is done using reversible jump Markov chain Monte Carlo (MCMC). This method is computationally efficient and overcomes the limitations of related non-parametric approaches such as the skyline plot. We validate the approach using simulated data. Subsequently, we reanalyze HIV-1 sequence data from Central Africa and Hepatitis C virus (HCV) data from Egypt. Conclusions: The new method provides a Bayesian procedure for non-parametric estimation of the demographic history. By construction it additionally provides confidence limits and may be used jointly with other MCMC-based coalescent approaches

    Reconstructing phylogenetic level-1 networks from nondense binet and trinet sets

    Get PDF
    Binets and trinets are phylogenetic networks with two and three leaves, respectively. Here we consider the problem of deciding if there exists a binary level-1 phylogenetic network displaying a given set T of binary binets or trinets over a taxon set X, and constructing such a network whenever it exists. We show that this is NP-hard for trinets but polynomial-time solvable for binets. Moreover, we show that the problem is still polynomial-time solvable for inputs consisting of binets and trinets as long as the cycles in the trinets have size three. Finally, we present an O(3^{|X|} poly(|X|)) time algorithm for general sets of binets and trinets. The latter two algorithms generalise to instances containing level-1 networks with arbitrarily many leaves, and thus provide some of the first supernetwork algorithms for computing networks from a set of rooted 1 phylogenetic networks

    Evolution of genes and repeats in the Nimrod superfamily

    Get PDF
    The recently identified Nimrod superfamily is characterized by the presence of a special type of EGF repeat, the NIM repeat, located right after a typical CCXGY/W amino acid motif. On the basis of structural features, nimrod genes can be divided into three types. The proteins encoded by Draper-type genes have an EMI domain at the N-terminal part and only one copy of the NIM motif, followed by a variable number of EGF-like repeats. The products of Nimrod B-type and Nimrod C-type genes (including the eater gene) have different kinds of N-terminal domains, and lack EGF-like repeats but contain a variable number of NIM repeats. Draper and Nimrod C-type (but not Nimrod B-type) proteins carry a transmembrane domain. Several members of the superfamily were claimed to function as receptors in phagocytosis and/or binding of bacteria, which indicates an important role in the cellular immunity and the elimination of apoptotic cells. In this paper, the evolution of the Nimrod superfamily is studied with various methods on the level of genes and repeats. A hypothesis is presented in which the NIM repeat, along with the EMI domain, emerged by structural reorganizations at the end of an EGF-like repeat chain, suggesting a mechanism for the formation of novel types of repeats. The analyses revealed diverse evolutionary patterns in the sequences containing multiple NIM repeats. Although in the Nimrod B and Nimrod C proteins show characteristics of independent evolution, many internal NIM repeats in Eater sequences seem to have undergone concerted evolution. An analysis of the nimrod genes has been performed using phylogenetic and other methods and an evolutionary scenario of the origin and diversification of the Nimrod superfamily is proposed. Our study presents an intriguing example how the evolution of multigene families may contribute to the complexity of the innate immune response

    Definition of the σW regulon of Bacillus subtilis in the absence of stress

    Get PDF
    Bacteria employ extracytoplasmic function (ECF) sigma factors for their responses to environmental stresses. Despite intensive research, the molecular dissection of ECF sigma factor regulons has remained a major challenge due to overlaps in the ECF sigma factor-regulated genes and the stimuli that activate the different ECF sigma factors. Here we have employed tiling arrays to single out the ECF σW regulon of the Gram-positive bacterium Bacillus subtilis from the overlapping ECF σX, σY, and σM regulons. For this purpose, we profiled the transcriptome of a B. subtilis sigW mutant under non-stress conditions to select candidate genes that are strictly σW-regulated. Under these conditions, σW exhibits a basal level of activity. Subsequently, we verified the σW-dependency of candidate genes by comparing their transcript profiles to transcriptome data obtained with the parental B. subtilis strain 168 grown under 104 different conditions, including relevant stress conditions, such as salt shock. In addition, we investigated the transcriptomes of rasP or prsW mutant strains that lack the proteases involved in the degradation of the σW anti-sigma factor RsiW and subsequent activation of the σW-regulon. Taken together, our studies identify 89 genes as being strictly σW-regulated, including several genes for non-coding RNAs. The effects of rasP or prsW mutations on the expression of σW-dependent genes were relatively mild, which implies that σW-dependent transcription under non-stress conditions is not strictly related to RasP and PrsW. Lastly, we show that the pleiotropic phenotype of rasP mutant cells, which have defects in competence development, protein secretion and membrane protein production, is not mirrored in the transcript profile of these cells. This implies that RasP is not only important for transcriptional regulation via σW, but that this membrane protease also exerts other important post-transcriptional regulatory functions

    NetDiff – Bayesian model selection for differential gene regulatory network inference

    Get PDF
    Differential networks allow us to better understand the changes in cellular processes that are exhibited in conditions of interest, identifying variations in gene regulation or protein interaction between, for example, cases and controls, or in response to external stimuli. Here we present a novel methodology for the inference of differential gene regulatory networks from gene expression microarray data. Specifically we apply a Bayesian model selection approach to compare models of conserved and varying network structure, and use Gaussian graphical models to represent the network structures. We apply a variational inference approach to the learning of Gaussian graphical models of gene regulatory networks, that enables us to perform Bayesian model selection that is significantly more computationally efficient than Markov Chain Monte Carlo approaches. Our method is demonstrated to be more robust than independent analysis of data from multiple conditions when applied to synthetic network data, generating fewer false positive predictions of differential edges. We demonstrate the utility of our approach on real world gene expression microarray data by applying it to existing data from amyotrophic lateral sclerosis cases with and without mutations in C9orf72, and controls, where we are able to identify differential network interactions for further investigation

    The highly rearranged mitochondrial genomes of the crabs Maja crispata and Maja squinado (Majidae) and gene order evolution in Brachyura

    Get PDF
    Abstract We sequenced the mitochondrial genomes of the spider crabs Maja crispata and Maja squinado (Majidae, Brachyura). Both genomes contain the whole set of 37 genes characteristic of Bilaterian genomes, encoded on both \u3b1- and \u3b2-strands. Both species exhibit the same gene order, which is unique among known animal genomes. In particular, all the genes located on the \u3b2-strand form a single block. This gene order was analysed together with the other nine gene orders known for the Brachyura. Our study confirms that the most widespread gene order (BraGO) represents the plesiomorphic condition for Brachyura and was established at the onset of this clade. All other gene orders are the result of transformational pathways originating from BraGO. The different gene orders exhibit variable levels of genes rearrangements, which involve only tRNAs or all types of genes. Local homoplastic arrangements were identified, while complete gene orders remain unique and represent signatures that can have a diagnostic value. Brachyura appear to be a hot-spot of gene order diversity within the phylum Arthropoda. Our analysis, allowed to track, for the first time, the fully evolutionary pathways producing the Brachyuran gene orders. This goal was achieved by coupling sophisticated bioinformatic tools with phylogenetic analysis

    Cumulants and the moment algebra: tools for analysing weak measurements

    Full text link
    Recently it has been shown that cumulants significantly simplify the analysis of multipartite weak measurements. Here we consider the mathematical structure that underlies this, and find that it can be formulated in terms of what we call the moment algebra. Apart from resulting in simpler proofs, the flexibility of this structure allows generalizations of the original results to a number of weak measurement scenarios, including one where the weakly interacting pointers reach thermal equilibrium with the probed system.Comment: Journal reference added, minor correction

    A standardized framework for accurate, high-throughput genotyping of recombinant and non-recombinant viral sequences

    Get PDF
    Human immunodeficiency virus type-1 (HIV-1), hepatitis B and C and other rapidly evolving viruses are characterized by extremely high levels of genetic diversity. To facilitate diagnosis and the development of prevention and treatment strategies that efficiently target the diversity of these viruses, and other pathogens such as human T-lymphotropic virus type-1 (HTLV-1), human herpes virus type-8 (HHV8) and human papillomavirus (HPV), we developed a rapid high-throughput-genotyping system. The method involves the alignment of a query sequence with a carefully selected set of pre-defined reference strains, followed by phylogenetic analysis of multiple overlapping segments of the alignment using a sliding window. Each segment of the query sequence is assigned the genotype and sub-genotype of the reference strain with the highest bootstrap (>70%) and bootscanning (>90%) scores. Results from all windows are combined and displayed graphically using color-coded genotypes. The new Virus-Genotyping Tools provide accurate classification of recombinant and non-recombinant viruses and are currently being assessed for their diagnostic utility. They have incorporated into several HIV drug resistance algorithms including the Stanford (http://hivdb.stanford.edu) and two European databases (http://www.umcutrecht.nl/subsite/spread-programme/ and http://www.hivrdb.org.uk/) and have been successfully used to genotype a large number of sequences in these and other databases. The tools are a PHP/JAVA web application and are freely accessible on a number of servers including
    corecore