
    Inference from binary gene expression data

    No full text
    Microarrays provide a practical method for measuring the mRNA abundances of thousands of genes in a single experiment. Analysing such high-dimensional data is a challenge that attracts researchers from many different fields, machine learning among them. However, biological properties of mRNA, such as its low stability and the fact that measurements are taken from a population of cells rather than from a single cell, should make researchers sceptical about the high numerical precision reported and thus about the reproducibility of these measurements. In this study we explore data representation at lower numerical precision, down to binary (retaining only the information of whether a gene is expressed or not), thereby improving the quality of inferences drawn from microarray studies. With the binary representation, we also propose a solution that reduces the effect of algorithmic choice in the pre-processing stages. First, we compare the information loss when inferences are made from quantized transcriptome data rather than continuous values, considering classification, clustering, periodicity detection and the analysis of developmental time series data. Our results show that little information is lost with binary data. Then, focusing on the two most widely used inference tools, classification and clustering, we show that inferences drawn from transcriptome data can actually be improved with a metric suitable for binary data; this is explained by the uncertainties of the probe-level data. We also show that binary transcriptome data can be used in cross-platform studies and that, when combined with the Tanimoto kernel, it increases inference performance compared to individual datasets. In the last part of this work we show that binary transcriptome data reduces the effect of algorithm choice when pre-processing raw data. While there are many different algorithms for the pre-processing stages, there are few guidelines as to which one to choose. Many studies have shown that the choice of algorithms has a significant impact on the overall results of microarray studies. Here we show, for classification, that binarizing transcriptome data after pre-processing with any combination of algorithms reduces the variability of the results while simultaneously increasing the performance of the classifier
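The binarization-plus-Tanimoto idea described above can be sketched in a few lines. This is an illustrative toy, not the study's pipeline: the expression threshold of 1.0 and the small matrix are made-up values.

```python
import numpy as np

def binarize(expr, threshold=1.0):
    """Binarize a (samples x genes) expression matrix: 1 if expressed, else 0.
    The cutoff of 1.0 is a hypothetical stand-in for a real expressed/not-expressed call."""
    return (expr > threshold).astype(int)

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two binary vectors:
    |a AND b| / |a OR b|."""
    both = np.sum(a & b)
    either = np.sum(a | b)
    return both / either if either else 1.0

# Toy data: 3 samples x 4 genes
expr = np.array([[5.1, 0.2, 3.3, 4.0],
                 [4.8, 0.3, 2.9, 3.8],
                 [0.1, 6.0, 0.2, 0.3]])
b = binarize(expr)
print(tanimoto(b[0], b[1]))  # samples with the same on/off pattern -> 1.0
print(tanimoto(b[0], b[2]))  # disjoint expression patterns -> 0.0
```

A Tanimoto kernel for classifiers or clustering would simply evaluate this similarity for every pair of binarized samples.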

    Efficient parametric inference for stochastic biological systems with measured variability

    Full text link
    Stochastic systems in biology often exhibit substantial variability within and between cells. This variability, as well as having dramatic functional consequences, provides information about the underlying details of the system's behaviour. It is often desirable to infer properties of the parameters governing such systems given experimental observations of the mean and variance of observed quantities. In some circumstances, analytic forms for the likelihood of these observations allow very efficient inference: we present these forms and demonstrate their usage. When likelihood functions are unavailable or difficult to calculate, we show that an implementation of approximate Bayesian computation (ABC) is a powerful tool for parametric inference in these systems. However, the calculations required to apply ABC to these systems can also be computationally expensive, relying on repeated stochastic simulations. We propose an ABC approach that cheaply eliminates unimportant regions of parameter space, by addressing computationally simple mean behaviour before explicitly simulating the more computationally demanding variance behaviour. We show that this approach leads to a substantial increase in speed when applied to synthetic and experimental datasets. Comment: 11 pages, 4 figures
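The two-stage screening idea, rejecting on cheap analytic means before running expensive variance simulations, can be sketched as rejection ABC on a toy birth-death model. Everything here is an illustrative assumption (the priors, tolerances, target statistics, and the Poisson stand-in for the stochastic simulator), not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: observed steady-state mean and variance of copy numbers.
# For a birth-death process with birth rate k and decay rate g, the
# steady-state mean is k/g (and is Poisson, so variance is also k/g).
obs_mean, obs_var = 50.0, 50.0

def simulate_variance(k, g, n=200):
    """Expensive step stood in by cheap Poisson sampling: in practice this
    would be many full stochastic simulations of the reaction network."""
    return np.var(rng.poisson(k / g, size=n))

accepted = []
for _ in range(5000):
    k = rng.uniform(1, 200)   # prior on birth rate (illustrative)
    g = rng.uniform(0.5, 5)   # prior on decay rate (illustrative)
    # Stage 1: cheap analytic mean check eliminates most of parameter space.
    if abs(k / g - obs_mean) > 5:
        continue
    # Stage 2: simulation-based variance check, run only for survivors.
    if abs(simulate_variance(k, g) - obs_var) < 15:
        accepted.append((k, g))

print(len(accepted), "parameter sets accepted")
```

The speed-up comes from stage 1 being a single arithmetic comparison, so the simulator is invoked only for the small fraction of draws whose mean behaviour is already plausible.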

    A Novel and Fast Approach for Population Structure Inference Using Kernel-PCA and Optimization (PSIKO)

    Get PDF
    Population structure is a confounding factor in Genome Wide Association Studies, increasing the rate of false positive associations. In order to correct for it, several model-based algorithms such as ADMIXTURE and STRUCTURE have been proposed. These tend to suffer from the fact that they have a considerable computational burden, limiting their applicability when used with large datasets, such as those produced by Next Generation Sequencing (NGS) techniques. To address this, non-model based approaches such as SNMF and EIGENSTRAT have been proposed, which scale better with larger data. Here we present a novel non-model based approach, PSIKO, which is based on a unique combination of linear kernel-PCA and least-squares optimization and allows for the inference of admixture coefficients, principal components, and number of founder populations of a dataset. PSIKO has been compared against existing leading methods on a variety of simulation scenarios, as well as on real biological data. We found that in addition to producing results of the same quality as other tested methods, PSIKO scales extremely well with dataset size, being considerably (up to 30 times) faster for longer sequences than even state of the art methods such as SNMF. PSIKO and accompanying manual are freely available at https://www.uea.ac.uk/computing/psiko
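The linear kernel-PCA ingredient can be illustrated on a toy genotype matrix. This is a generic sketch, not PSIKO itself: the 0/1/2 SNP coding, matrix sizes and centring choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical genotype matrix: n individuals x m SNPs, coded 0/1/2
# (number of copies of the minor allele).
X = rng.integers(0, 3, size=(6, 100)).astype(float)

# Linear kernel PCA: centre the data, form the n x n kernel matrix
# K = Xc Xc^T, and eigendecompose. With a linear kernel this recovers
# ordinary sample PCA, but works on the small n x n matrix rather than
# the (potentially huge) n x m genotype matrix.
Xc = X - X.mean(axis=0)
K = Xc @ Xc.T
vals, vecs = np.linalg.eigh(K)          # eigenvalues in ascending order
order = np.argsort(vals)[::-1]          # largest first
pcs = vecs[:, order[:2]] * np.sqrt(vals[order[:2]])  # top 2 PCs per individual

print(pcs.shape)  # (6, 2)
```

In PSIKO-style usage, such principal components would feed a downstream least-squares step that estimates admixture coefficients; that step is not reproduced here.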

    Undisclosed, unmet and neglected challenges in multi-omics studies

    Full text link
    [EN] Multi-omics approaches have become a reality in both large genomics projects and small laboratories. However, the multi-omics research community still faces a number of issues that have either not been sufficiently discussed or for which current solutions are still limited. In this Perspective, we elaborate on these limitations and suggest points of attention for future research. We finally discuss new opportunities and challenges brought to the field by the rapid development of single-cell high-throughput molecular technologies. This work has been funded by the Spanish Ministry of Science and Innovation with grant number BES-2016-076994 to A.A.-L. Tarazona, S.; Arzalluz-Luque, Á.; Conesa, A. (2021). Undisclosed, unmet and neglected challenges in multi-omics studies. Nature Computational Science. 1(6):395-402. https://doi.org/10.1038/s43588-021-00086-z

    Simulation and inference algorithms for stochastic biochemical reaction networks: from basic concepts to state-of-the-art

    Full text link
    Stochasticity is a key characteristic of intracellular processes such as gene regulation and chemical signalling. Therefore, characterising stochastic effects in biochemical systems is essential to understand the complex dynamics of living things. Mathematical idealisations of biochemically reacting systems must be able to capture stochastic phenomena. While robust theory exists to describe such stochastic models, the computational challenges in exploring these models can be a significant burden in practice since realistic models are analytically intractable. Determining the expected behaviour and variability of a stochastic biochemical reaction network requires many probabilistic simulations of its evolution. Using a biochemical reaction network model to assist in the interpretation of time course data from a biological experiment is an even greater challenge due to the intractability of the likelihood function for determining observation probabilities. These computational challenges have been subjects of active research for over four decades. In this review, we present an accessible discussion of the major historical developments and state-of-the-art computational techniques relevant to simulation and inference problems for stochastic biochemical reaction network models. Detailed algorithms for particularly important methods are described and complemented with MATLAB implementations. As a result, this review provides a practical and accessible introduction to computational methods for stochastic models within the life sciences community
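The review's detailed algorithms come with MATLAB implementations; as a minimal flavour of the central simulation method it covers, Gillespie's stochastic simulation algorithm can be sketched in Python for a birth-death process. The rate constants and horizon here are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def gillespie_birth_death(k=10.0, g=0.1, x0=0, t_end=50.0):
    """Gillespie SSA for the network 0 -> X (rate k), X -> 0 (rate g*x)."""
    t, x = 0.0, x0
    times, states = [t], [x]
    while t < t_end:
        rates = np.array([k, g * x])      # propensities of the two reactions
        total = rates.sum()
        if total == 0:
            break
        t += rng.exponential(1.0 / total)  # time to next reaction event
        if rng.random() < rates[0] / total:
            x += 1                         # birth fired
        else:
            x -= 1                         # death fired
        times.append(t)
        states.append(x)
    return np.array(times), np.array(states)

times, states = gillespie_birth_death()
print(states[-1])  # fluctuates around the steady-state mean k/g = 100
```

Repeating such simulations many times yields the expected behaviour and variability discussed in the abstract; the likelihood-free inference methods the review surveys build on exactly this kind of forward simulator.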

    Probabilistic analysis of the human transcriptome with side information

    Get PDF
    Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new views to the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources and the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected based on individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community. Comment: Doctoral thesis. 103 pages, 11 figures

    Computational Methods for Sequencing and Analysis of Heterogeneous RNA Populations

    Get PDF
    Next-generation sequencing (NGS) and mass spectrometry technologies bring unprecedented throughput, scalability and speed, facilitating the study of biological systems. These technologies make it possible to sequence and analyze heterogeneous RNA populations rather than single sequences. In particular, they provide the opportunity to implement massive viral surveillance and transcriptome quantification. However, in order to fully exploit the capabilities of NGS technology we need to develop computational methods able to analyze billions of reads for the assembly and characterization of sampled RNA populations. In this work we present novel computational methods for cost- and time-effective analysis of sequencing data from viral and RNA samples. In particular, we describe: i) computational methods for transcriptome reconstruction and quantification; ii) a method for mass spectrometry data analysis; iii) a combinatorial pooling method; iv) computational methods for analysis of intra-host viral populations

    How to Predict Molecular Interactions between Species?

    Get PDF
    Organisms constantly interact with other species through physical contact, which leads to changes on the molecular level, for example in the transcriptome. These changes can be monitored for all genes with the help of high-throughput experiments such as RNA-seq or microarrays. The adaptation of gene expression to environmental changes within cells is mediated through complex gene regulatory networks. Often, our knowledge of these networks is incomplete. Network inference predicts gene regulatory interactions based on transcriptome data. An emerging application of high-throughput transcriptome studies is dual transcriptomics experiments, in which the transcriptome of two or more interacting species is measured simultaneously. Based on a dual RNA-seq data set of murine dendritic cells infected with the fungal pathogen Candida albicans, the software tool NetGenerator was applied to predict an inter-species gene regulatory network. To promote further investigations of molecular inter-species interactions, we recently discussed dual RNA-seq experiments for host-pathogen interactions and extended the applied tool NetGenerator (Schulze et al., 2015). The updated version of NetGenerator makes use of measurement variances in the algorithmic procedure and accepts gene expression time series data with missing values. Additionally, we tested multiple modeling scenarios regarding the stimuli functions of the gene regulatory network. Here, we summarize the work by Schulze et al. (2015) and put it into a broader context. We review various studies making use of the dual transcriptomics approach to investigate the molecular basis of interacting species. Besides the application to host-pathogen interactions, dual transcriptomics data are also utilized to study mutualistic and commensalistic interactions. Furthermore, we give a short introduction into additional approaches for the prediction of gene regulatory networks and discuss their application to dual transcriptomics data. We conclude that the application of network inference to dual transcriptomics data is a promising approach to predict molecular inter-species interactions
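The general flavour of regression-based network inference from expression time series can be sketched briefly. This is a generic illustration, not NetGenerator (which fits ODE models with measurement variances): the random toy data, time step and ridge penalty are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical time series: expression of 4 genes at 10 time points.
T, n_genes = 10, 4
expr = rng.random((T, n_genes))

# Model each gene's rate of change as a linear function of all genes'
# expression, dx_i/dt ~ W[i] @ x, and read candidate regulatory
# interactions off the fitted weight matrix W.
dt = 1.0                                   # assumed uniform sampling interval
dxdt = np.diff(expr, axis=0) / dt          # finite-difference derivatives, (T-1, n_genes)
X = expr[:-1]                              # expression at the preceding time points
lam = 0.1                                  # ridge penalty for numerical stability
W = np.linalg.solve(X.T @ X + lam * np.eye(n_genes), X.T @ dxdt).T

print(W.shape)  # (n_genes, n_genes): W[i, j] = inferred effect of gene j on gene i
```

For a dual-transcriptomics application, the gene set would simply span both species, so cross-species entries of W correspond to candidate inter-species interactions.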

    Re-using public RNA-Seq data

    Get PDF
    Next Generation Sequencing (NGS) methods are rapidly becoming the most popular paradigm for exploring genomic data. RNA-Seq is an NGS method that enables gene expression analyses. Raw sequencing data generated by researchers is actively submitted to public databases, as this is part of the requirements for publishing in academic journals. Raw sequencing data is quite large in size and the analysis of each experiment is time consuming, so published raw files are currently not re-used much. Repeated analysis of uploaded data is also complicated by negligent experiment set-up write-ups and the lack of clear standards for the analysis process. Publicly available analysis results have been obtained using varying sets of tools and parameters, and in the absence of gold-standard analysis pipelines the biases introduced by algorithmic differences between tools greatly decrease the comparability of results between experiments. Simply aggregating existing collections therefore does not work: the analysis must be repeated from the raw data for every experiment. Comprehensive collections of expression data also have to account for computational expense and time limits, so building a collection needs an effective pipeline implementation with automatic parameter estimation, a defined subset of tools and a robust handling mechanism that ensures minimal required user input. Aggregating expression data from individual experiments with varying experimental conditions creates many new opportunities for data mining: pattern discovery over larger collections generalises local tendencies, and patterns that appear only locally may turn out to be signal against the broader background. One such analysis sub-field is assessing gene co-expression over a broader set of experiments.
    In this thesis, we have designed and implemented a framework for performing large-scale analysis of publicly available RNA-Seq experiments. No separate configuration file for analysis is required; instead a pre-built database is employed. User intervention is minimal and the process is self-guiding: all parameters within the analysis process are determined automatically. This enables unsupervised sequential analysis of numerous experiments. Analysed datasets can be used as input for the co-expression analysis tool MEM, which was developed by the BIIT research group and was originally designed for public microarray data; RNA-Seq expression data adds a new application field for the tool. Beyond co-expression analysis with MEM, the data can also be used in other downstream analysis applications