    DIAMIN: a software library for the distributed analysis of large-scale molecular interaction networks

    Background Huge amounts of molecular interaction data are continuously produced and stored in public databases. Although many bioinformatics tools have been proposed in the literature for their analysis, based on modeling them through different types of biological networks, several problems remain unsolved when the analysis must be carried out at large scale. Results We propose DIAMIN, a high-level software library that facilitates the development of applications for the efficient analysis of large-scale molecular interaction networks. DIAMIN relies on distributed computing and is implemented in Java on top of Apache Spark. It delivers a set of functionalities implementing different tasks on an abstract representation of very large graphs, providing built-in support for methods and algorithms commonly used to analyze these networks. DIAMIN has been tested on data retrieved from two of the most widely used molecular interaction databases and proved highly efficient and scalable. As shown by the provided examples, DIAMIN can be used by researchers without any distributed-programming experience to perform various types of data analysis and to implement new algorithms based on its primitives. Conclusions DIAMIN allows users to solve specific biological problems that can be modeled as biological networks. The software is freely available, which will hopefully foster its rapid adoption by the scientific community for both specific data analyses and more complex tasks.
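DIAMIN's actual primitives are Java methods over Spark RDDs and are not reproduced here; as a rough single-machine sketch of the kind of neighborhood primitive such a library exposes, consider extracting all proteins within k interaction hops of a seed protein (the function name and edge-list format below are invented for the example):

```python
from collections import deque

def k_neighborhood(edges, seed, k):
    """Return all proteins within k interaction hops of `seed`.

    `edges` is an iterable of (protein_a, protein_b) pairs, as exported
    by molecular-interaction databases; the graph is treated as undirected.
    """
    # Build an adjacency list from the edge list.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    # Breadth-first search, bounded at depth k.
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for neighbor in adj.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

interactions = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
print(sorted(k_neighborhood(interactions, "A", 2)))  # ['A', 'B', 'C']
```

In a distributed setting the same operation is expressed as iterated joins over a partitioned edge list, which is what lets libraries of this kind scale to interactomes that do not fit in one machine's memory.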

    Controllability Analysis and Control Design of Biological Systems Modeled by Boolean Networks

    Cell signaling networks are often modeled using ordinary differential equations (ODEs), which represent network components with continuous variables. However, parameters such as the reaction rate constants needed for ODEs are not always available or known, and discrete approaches such as Boolean networks (BNs) are used in such cases. BNs have been applied in the past, in particular, as a means to determine network steady states. The goal of this work is to explore the use of BNs from a control theory point of view, that is, to help manipulate biological systems more efficiently. In this thesis, we propose two methods to analyze and design control strategies for BNs. The first method, based on the algebraic state-space representation of BNs, consists of defining control strategies to reach predetermined states: given a desired output, find all possible system state-transition trajectories to that output, and design an input sequence leading to it. The second method is both an alternative to and an extension of the first, in the sense that it offers broader possibilities for the representation of time and scales to larger BNs. It is based on binary decision diagrams (BDDs), a data structure that represents logical functions very efficiently and allows us to relate the outputs of a network to its inputs, no matter how many layers it contains and whether or not it has a cyclic structure.
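The thesis' first method operates on the algebraic state-space representation of the BN; as a minimal illustration of the underlying idea only (searching state-transition trajectories for an input sequence that reaches a desired state), here is a brute-force sketch on an assumed toy network with one control input and two state variables. The update rules are invented for the example:

```python
from itertools import product

def step(state, u):
    """One synchronous update of a toy Boolean network.
    Assumed rules: x1' = u OR x2;  x2' = x1 AND NOT u."""
    x1, x2 = state
    return (u or x2, x1 and not u)

def find_input_sequence(start, target, horizon=4):
    """Exhaustively search input sequences (up to `horizon` steps)
    that drive the network from `start` to `target`."""
    for length in range(1, horizon + 1):
        for inputs in product((False, True), repeat=length):
            state = start
            for u in inputs:
                state = step(state, u)
            if state == target:
                return inputs
    return None  # target unreachable within the horizon

seq = find_input_sequence(start=(False, False), target=(True, False))
print(seq)  # (True,)
```

Enumerating all 2^n states is exactly what stops scaling for larger BNs; the BDD-based second method avoids this by representing the transition relation symbolically instead of state by state.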

    Novel methods for comparing and evaluating single and metagenomic assemblies

    The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments "read" by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still heavily relies on the availability of independently determined standards, such as manually curated genome sequences or independently produced mapping data. The focus of this work is to develop reference-free computational methods to accurately compare and evaluate genome assemblies. We introduce a reference-free likelihood-based measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads given the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. Despite the unresolved challenges of single genome assembly, the decreasing costs of sequencing technology have led to a sharp increase in metagenomics projects over the past decade. These projects allow us to better understand the diversity and function of microbial communities found in the environment, including the ocean, Arctic regions, other living organisms, and the human body. We extend our likelihood-based framework and show that we can accurately compare assemblies of these complex bacterial communities. After an assembly has been produced, it is not easy to determine which parts of the underlying genome are missing, which parts are mistakes, and which parts are due to experimental artifacts from the sequencing machine. Here we introduce VALET, the first reference-free pipeline that flags regions in metagenomic assemblies that are statistically inconsistent with the data generation process. VALET detects mis-assemblies in publicly available datasets and highlights the current shortcomings in available metagenomic assemblers. By providing the computational methods for researchers to accurately evaluate their assemblies, we decrease the chance of incorrect biological conclusions and misguided future studies.
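The statistical models actually used in the thesis and in VALET are considerably more elaborate; purely as a sketch of the core idea, the snippet below scores an assembly by the average log-probability of its reads under an assumed error-free, uniform-sampling model, where a read's probability is its number of occurrences (on either strand) divided by the number of possible sampling positions. The pseudocount for unmapped reads is our own simplification:

```python
import math

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of `pattern` in `text`."""
    n, i = 0, text.find(pattern)
    while i != -1:
        n += 1
        i = text.find(pattern, i + 1)
    return n

def log_likelihood(assembly, reads):
    """Average per-read log-probability of the reads given the assembly,
    under an error-free, uniform-sampling model."""
    total = 0.0
    for r in reads:
        sites = 2 * (len(assembly) - len(r) + 1)  # start positions, both strands
        hits = count_occurrences(assembly, r) + count_occurrences(assembly, revcomp(r))
        # Small pseudocount so an unmapped read does not send the score to -inf.
        p = max(hits, 0.1) / sites
        total += math.log(p)
    return total / len(reads)

print(log_likelihood("ACGTACGT", ["ACGT", "CGTA"]))
```

Even this toy version has the key property described above: inserting errors into the assembly removes read occurrences and can only lower the score, so the true sequence scores at least as well as a corrupted one.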

    Annual Report


    Identifying the topology of protein complexes from affinity purification assays

    Motivation: Recent advances in high-throughput technologies have made it possible to investigate not only individual protein interactions but also the association of these proteins in complexes. So far, the focus has been on predicting complexes as sets of proteins from the experimental results; the modular substructure and the physical interactions within protein complexes have been mostly ignored.

    Analysis of High-Throughput Data - Protein-Protein Interactions, Protein Complexes and RNA Half-life

    The development of high-throughput techniques has led to a paradigm change in biology from the small-scale analysis of individual genes and proteins to a genome-scale analysis of biological systems. Proteins and genes can now be studied in their interaction with each other, and the cooperation within multi-subunit protein complexes can be investigated. Moreover, time-dependent dynamics and regulation of these processes and associations can now be explored by monitoring mRNA changes and turnover. The in-depth analysis of these large and complex data sets would not be possible without sophisticated algorithms for integrating different data sources, identifying interesting patterns in the data and addressing the high variability and error rates in biological measurements. In this thesis, we developed such methods for the investigation of protein interactions and complexes and the corresponding regulatory processes. In the first part, we analyze networks of physical protein-protein interactions measured in large-scale experiments. We show that the topology of the complete interactomes can be confidently extrapolated from only partial measurements of interaction networks, despite high numbers of missing and wrong interactions. Furthermore, we find that the structure and stability of protein interaction networks are influenced not only by the degree distribution of the network but also considerably by the suppression or propagation of interactions between highly connected proteins. As analysis of network topology is generally focused on large eukaryotic networks, we developed new methods to analyze smaller networks of intraviral and virus-host interactions. By comparing interactomes of related herpesviral species, we could detect a conserved core of protein interactions and could address the low coverage of the yeast two-hybrid system. In addition, common strategies in the interaction of the viruses with the host cell were identified. 
New affinity purification methods now make it possible to directly study associations of proteins in complexes. Due to experimental errors, the individual protein complexes have to be predicted with computational methods from these purification results. As previously published methods relied more or less heavily on existing knowledge of complexes, we developed an unsupervised prediction algorithm which is independent of such additional data. Using this approach, high-quality protein complexes can be identified from the raw purification data alone for any species for which purification experiments are performed. To identify the direct, physical interactions within these predicted complexes and their subcomponent structure, we describe a new approach to extract the highest-scoring subnetwork connecting the complex and the interactions not explained by alternative paths of indirect interactions. In this way, important interactions within the complexes can be identified and their substructure can be resolved in a straightforward way. To explore the regulation of proteins and complexes, we analyzed microarray measurements of mRNA abundance, de novo transcription and decay. Based on the relationship between newly transcribed, pre-existing and total RNA, transcript half-life can be estimated for individual genes using a new microarray normalization method, and quality control can be applied. We show that precise measurements of RNA half-life can be obtained from de novo transcription which are of superior accuracy to previously published results from RNA decay. Using such precise measurements, we studied RNA half-lives in human B-cells and mouse fibroblasts to identify conserved patterns governing RNA turnover. Our results show that transcript half-lives are strongly conserved and specifically correlated to gene function. 
Although transcript half-life is highly similar in protein complexes and families, individual proteins may deviate significantly from the remaining complex subunits or family members to efficiently support the regulation of protein complexes or to create non-redundant roles of functionally similar proteins. These results illustrate several of the many ways in which high-throughput measurements lead to a better understanding of biological systems. By studying large-scale measurements in this thesis, the structure of protein interaction networks and protein complexes could be better characterized, important interactions and conserved strategies for herpesviral infection could be identified and interesting insights could be gained into the regulation of important biological processes and protein complexes. This was made possible by the development of novel algorithms and analysis approaches which will also be valuable for further research on these topics.
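The normalization and quality-control steps described above are omitted here, but the core half-life estimate from the pre-existing/total RNA ratio follows from first-order decay kinetics alone. A minimal sketch, assuming first-order decay and an error-free ratio measurement:

```python
import math

def half_life(preexisting, total, labeling_time):
    """Estimate RNA half-life from the pre-existing/total ratio.

    Assumes first-order decay: after a labeling time t, the fraction of
    pre-existing (unlabeled) transcripts remaining is P/T = exp(-k*t),
    so k = -ln(P/T)/t and t_half = ln(2)/k.
    """
    ratio = preexisting / total
    if not 0.0 < ratio < 1.0:
        raise ValueError("pre-existing/total ratio must lie in (0, 1)")
    k = -math.log(ratio) / labeling_time   # decay rate per unit time
    return math.log(2) / k

# If half of the transcripts are still unlabeled after 60 min of labeling,
# the half-life equals the labeling time.
print(half_life(preexisting=50.0, total=100.0, labeling_time=60.0))  # 60.0
```

Because the estimate depends only on a ratio measured within one labeling window, it avoids the long time-course measurements needed by classical RNA decay experiments, which is one intuition for the accuracy gain reported above.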

    LC-MSsim – a simulation software for liquid chromatography mass spectrometry data

    Background Mass spectrometry coupled to liquid chromatography (LC-MS) is commonly used to analyze the protein content of biological samples in large-scale studies. The data resulting from an LC-MS experiment are huge, highly complex and noisy. Accordingly, LC-MS has sparked new developments in bioinformatics, especially in the fields of algorithm development, statistics and software engineering. In a quantitative label-free mass spectrometry experiment, crucial steps are the detection of peptide features in the mass spectra and the alignment of samples by correcting for shifts in retention time. At the moment, it is difficult to compare the plethora of algorithms for these tasks. So far, curated benchmark data exist only for peptide identification algorithms, but no data represent a ground truth for the evaluation of feature detection, alignment and filtering algorithms. Results We present LC-MSsim, a simulation software for LC-ESI-MS experiments. It simulates ESI spectra at the MS level. It reads a list of proteins from a FASTA file and digests the protein mixture using a user-defined enzyme. The software creates an LC-MS data set using a predictor for the retention time of the peptides and a model for the peak shapes and elution profiles of the mass spectral peaks. Our software also offers the possibility to add contaminants and to change the background noise level, and it includes a model for the detectability of peptides in mass spectra. After the simulation, LC-MSsim writes the simulated data to mzData, a public XML format. The software also stores the positions (monoisotopic m/z and retention time) and ion counts of the simulated ions in separate files. Conclusion LC-MSsim generates simulated LC-MS data sets and incorporates models for peak shapes and contaminations. Algorithm developers can match the results of feature detection and alignment algorithms against the simulated ion lists, and meaningful error rates can be computed. We anticipate that LC-MSsim will be useful to the wider community to perform benchmark studies and comparisons between computational tools.
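LC-MSsim's own digestion step supports user-defined enzymes; the snippet below is not its actual code, only a sketch of the standard simplified trypsin rule (cleave after K or R, unless the next residue is P) that such a digestion step typically implements:

```python
import re

def digest(sequence):
    """In-silico tryptic digestion of a protein sequence (simplified rule):
    cleave after K or R, but not when the next residue is P."""
    # Zero-width split: after [KR], provided the next character is not P.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if p]  # drop the empty trailing piece, if any

print(digest("MKWVTFISLLLLFSSAYSRGVFRRDTHK"))
# ['MK', 'WVTFISLLLLFSSAYSR', 'GVFR', 'R', 'DTHK']
```

Each resulting peptide would then be assigned a predicted retention time, an elution profile and an isotopic peak pattern to build up the simulated LC-MS map.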