
    An eScience-Bayes strategy for analyzing omics data

    Abstract

    Background: The omics fields promise to revolutionize our understanding of biology and biomedicine. However, their potential is compromised by the challenge of analyzing the huge datasets they produce. Analysis of omics data is plagued by the curse of dimensionality, resulting in imprecise estimates of model parameters and performance. Moreover, the integration of omics data with other data sources is difficult to shoehorn into classical statistical models. This has resulted in ad hoc approaches to address specific problems.

    Results: We present a general approach to omics data analysis that alleviates these problems. By combining eScience and Bayesian methods, we retrieve scientific information and data from multiple sources and coherently incorporate them into large models. These models improve the accuracy of predictions and offer new insights into the underlying mechanisms. This "eScience-Bayes" approach is demonstrated in two proof-of-principle applications: one for breast cancer prognosis prediction from transcriptomic data, and one for protein-protein interaction studies based on proteomic data.

    Conclusions: Bayesian statistics provides the flexibility to tailor statistical models to the complex data structures of omics biology, and it permits coherent integration of multiple data sources. However, Bayesian methods are in general computationally demanding and require the specification of possibly thousands of prior distributions. eScience can help us overcome these difficulties. The eScience-Bayes approach thus permits us to fully leverage the advantages of Bayesian methods, resulting in models with improved predictive performance that give more information about the underlying biological system.
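    The paper's actual models are not reproduced here, but the core idea of encoding externally retrieved knowledge as informative priors can be sketched in a few lines. Below is a minimal, hypothetical illustration using PyMC: a Bayesian logistic regression for prognosis prediction in which per-gene prior scales are derived from an invented external "relevance" score, so coefficients of genes with no supporting evidence are shrunk toward zero.

```python
# Hypothetical sketch (not the paper's model): informative per-gene priors
# retrieved from external sources regularize a high-dimensional classifier.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 200
X = rng.normal(size=(n_samples, n_genes))   # toy expression matrix
y = rng.integers(0, 2, size=n_samples)      # toy relapse labels (0/1)

# Invented stand-in for knowledge retrieved from literature/databases,
# rescaled to per-gene prior standard deviations: genes with more external
# support are allowed larger effects, the rest are shrunk toward zero.
relevance = rng.uniform(0.0, 1.0, size=n_genes)
prior_sd = 0.01 + relevance

with pm.Model():
    beta = pm.Normal("beta", mu=0.0, sigma=prior_sd, shape=n_genes)
    intercept = pm.Normal("intercept", mu=0.0, sigma=1.0)
    p = pm.math.sigmoid(intercept + pm.math.dot(X, beta))
    pm.Bernoulli("y", p=p, observed=y)
    idata = pm.sample(500, tune=500, chains=2, progressbar=False)
```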

    Challenges in funding and developing genomic software: roots and remedies

    The computer software used for genomic analysis has become a crucial component of the infrastructure of the life sciences. However, genomic software is still typically developed in an ad hoc manner, with inadequate funding, and by academic researchers not trained in software development, at substantial cost to the research community. I examine the roots of the incongruity between the importance of genomic software and the degree of investment in it, and I suggest several potential remedies for current problems. As genomics continues to grow, new strategies for funding and developing the software that powers the field will become increasingly essential.

    Linking the Resource Description Framework to cheminformatics and proteochemometrics

    Abstract

    Background: Semantic web technologies are finding their way into the life sciences. Ontologies and semantic markup have been used in the molecular sciences for more than a decade, but have not yet found widespread use. The semantic web technology Resource Description Framework (RDF) and related methods appear sufficiently versatile to change that situation.

    Results: The work presented here focuses on linking RDF approaches to existing molecular chemometrics fields, including cheminformatics, QSAR modeling, and proteochemometrics. Applications are presented that link RDF technologies to methods from statistics and cheminformatics, including data aggregation, visualization, chemical identification, and property prediction, and that demonstrate how this can be done using various existing RDF standards and cheminformatics libraries. For example, we show how IC50 and Ki values are modeled for a number of biological targets using data from the ChEMBL database.

    Conclusions: We have shown that existing RDF standards can suitably be integrated into existing molecular chemometrics methods. Platforms that unite these technologies, like Bioclipse, make this even simpler and more transparent. Being able to create and share workflows that integrate data aggregation and analysis (visual and statistical) is beneficial to interoperability and reproducibility. The current work shows that RDF approaches are sufficiently powerful to support molecular chemometrics workflows.
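    As a rough illustration of what such a workflow looks like in practice, the sketch below queries an RDF store for IC50 activities with SPARQL and hands the results to ordinary Python code. The endpoint URL and the cco vocabulary terms are assumptions modeled on the public ChEMBL RDF distribution, not details taken from the paper.

```python
# Hedged sketch: pull bioactivity values from an RDF store via SPARQL,
# then process them with ordinary Python tooling. Endpoint and vocabulary
# terms below are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://www.ebi.ac.uk/rdf/services/sparql")  # assumed endpoint
endpoint.setQuery("""
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
SELECT ?activity ?value
WHERE {
  ?activity a cco:Activity ;
            cco:standardType "IC50" ;
            cco:standardValue ?value .
}
LIMIT 10
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

# Once the triples are in plain Python structures, any statistics or
# cheminformatics library can take over (data aggregation, modeling, ...).
for row in results["results"]["bindings"]:
    print(row["activity"]["value"], row["value"]["value"])
```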

    Statistical strategies for avoiding false discoveries in metabolomics and related experiments

    Computational methods to explore hierarchical and modular structure of biological networks

    Networks have been widely used to understand the structure of complex systems. From studying biological networks of protein-protein, genetic, and other types of interactions, we gain insights into the functional organization of static biological systems that could hardly be measured experimentally with current state-of-the-art technology. Biological networks also serve as a principled framework for integrating multiple sources of genome-wide data, such as gene expression arrays and sequencing. Yet a large-scale network is often intractable for intuitive visualization and computation. We developed novel network clustering algorithms to harness the power of genome-scale biological networks of all genes/proteins. In particular, our algorithms are capable of finding hidden modular structures under the hierarchical stochastic block model. Since the modules are organized hierarchically, our algorithms facilitate downstream analysis and the design of in-depth validation experiments in a "divide-and-conquer" strategy. Moreover, we present empirical evidence that a hierarchical and modular structure best explains observed biological networks. We used the static clustering methods in two ways. First, we extended the static methods to dynamic clustering problems and observed general patterns in the dynamics of network modules; for example, we demonstrate the dynamics of the yeast metabolic cycle and of the Arabidopsis root developmental process. Second, we propose a prioritization scheme that sorts identified network modules in order of discriminative power. In the course of this research we conclude that biological networks are best understood as hierarchically organized modules, and that the modules remain stable in unperturbed biological processes but can respond differently to abnormal or external perturbations such as the knock-down of key enzymes.
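    The thesis's own algorithms are not shown here; the toy sketch below only illustrates the divide-and-conquer idea using off-the-shelf modularity clustering from networkx: detect modules, then recurse into each sufficiently large module to expose a hierarchy of sub-modules.

```python
# Minimal illustration (not the thesis's algorithm) of hierarchical,
# divide-and-conquer module detection in a network.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def hierarchical_modules(graph, min_size=8, depth=0):
    """Recursively split a graph into modules and print the hierarchy."""
    communities = greedy_modularity_communities(graph)
    if len(communities) <= 1:  # no further modular structure found
        return
    for i, nodes in enumerate(communities):
        print("  " * depth + f"module {i}: {len(nodes)} nodes")
        if len(nodes) > min_size:
            hierarchical_modules(graph.subgraph(nodes), min_size, depth + 1)

# Toy stand-in for a genome-scale interaction network: 6 densely
# connected cliques of 10 nodes each, sparsely linked to one another.
g = nx.connected_caveman_graph(6, 10)
hierarchical_modules(g)
```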

    Scientific Workflows for Metabolic Flux Analysis

    Metabolic engineering is a highly interdisciplinary research domain that interfaces biology, mathematics, computer science, and engineering. Metabolic flux analysis with carbon tracer experiments (13C-MFA) is a particularly challenging metabolic engineering application that consists of several tightly interwoven building blocks such as modeling, simulation, and experimental design. While several general-purpose workflow solutions have emerged in recent years to support the realization of complex scientific applications, these approaches transfer only partially to 13C-MFA workflows. Whereas problems in other research fields (e.g., bioinformatics) are primarily centered around scientific data processing, 13C-MFA workflows have more in common with business workflows. For instance, many bioinformatics workflows are designed to identify, compare, and annotate genomic sequences by "pipelining" them through standard tools like BLAST; typically, the next workflow task in the pipeline can be determined automatically from the outcome of the previous step.

    Five computational challenges have been identified in the endeavor of conducting 13C-MFA studies: organization of heterogeneous data, standardization of processes and the unification of tools and data, interactive workflow steering, distributed computing, and service orientation. The outcome of this thesis is a scientific workflow framework (SWF) that is custom-tailored to the specific requirements of 13C-MFA applications. The proposed approach, namely designing the SWF as a collection of loosely coupled modules that are glued together with web services, eases the realization of 13C-MFA workflows by offering several features. By design, existing tools are integrated into the SWF using web service interfaces and foreign programming language bindings (e.g., Java or Python). Although the attributes "easy-to-use" and "general-purpose" are rarely associated with distributed computing software, the presented use cases show that the proposed Hadoop MapReduce framework eases the deployment of computationally demanding simulations on cloud and cluster computing resources.

    An important building block for enabling interactive, researcher-driven workflows is the ability to track all data needed to understand and reproduce a workflow. The standardization of 13C-MFA studies using a folder structure template and the corresponding services and web interfaces improves the exchange of information within a group of researchers. Finally, several auxiliary tools were developed in the course of this work to complement the SWF modules, ranging from simple helper scripts to visualization and data conversion programs. This solution distinguishes itself from other scientific workflow approaches by offering a system of loosely coupled components that can be flexibly arranged to match the typical requirements of the metabolic engineering domain. Because the framework is modern and service-oriented, new applications are easily composed by reusing existing components.
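    To give a flavor of the loose-coupling idea, the sketch below wraps a stub simulation step behind a small web service using Flask, so a workflow client can invoke it over HTTP instead of being hard-wired to the tool. The route name and payload fields are invented for illustration and do not reflect the thesis's actual interfaces.

```python
# Hedged sketch: a workflow module exposed as a web service, so steps can
# be composed over HTTP. The simulation itself is a stand-in stub.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/simulate", methods=["POST"])
def simulate():
    """Accept a model description and return (stub) flux estimates."""
    model = request.get_json()
    # Stand-in for a real, computationally demanding 13C-MFA simulation.
    fluxes = {reaction: 1.0 for reaction in model.get("reactions", [])}
    return jsonify({"fluxes": fluxes})

if __name__ == "__main__":
    app.run(port=5000)
```

    A workflow driver could then compose such steps with plain HTTP calls, e.g. requests.post("http://localhost:5000/simulate", json={"reactions": ["v1", "v2"]}), which is what makes the modules loosely coupled and reusable across workflows.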