16,868 research outputs found

    Integration of Biological Sources: Exploring the Case of Protein Homology

    Get PDF
    Data integration is a key issue in the domain of bioin- formatics, which deals with huge amounts of heteroge- neous biological data that grows and changes rapidly. This paper serves as an introduction in the field of bioinformatics and the biological concepts it deals with, and an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioin- formatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Un- certain databases are able to contain several possi- ble worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration

    Hot-spot analysis for drug discovery targeting protein-protein interactions

    Get PDF
    Introduction: Protein-protein interactions are important for biological processes and pathological situations, and are attractive targets for drug discovery. However, rational drug design targeting protein-protein interactions is still highly challenging. Hot-spot residues are seen as the best option to target such interactions, but their identification requires detailed structural and energetic characterization, which is only available for a tiny fraction of protein interactions. Areas covered: In this review, the authors cover a variety of computational methods that have been reported for the energetic analysis of protein-protein interfaces in search of hot-spots, and the structural modeling of protein-protein complexes by docking. This can help to rationalize the discovery of small-molecule inhibitors of protein-protein interfaces of therapeutic interest. Computational analysis and docking can help to locate the interface, molecular dynamics can be used to find suitable cavities, and hot-spot predictions can focus the search for inhibitors of protein-protein interactions. Expert opinion: A major difficulty for applying rational drug design methods to protein-protein interactions is that in the majority of cases the complex structure is not available. Fortunately, computational docking can complement experimental data. An interesting aspect to explore in the future is the integration of these strategies for targeting PPIs with large-scale mutational analysis.This work has been funded by grants BIO2016-79930-R and SEV-2015-0493 from the Spanish Ministry of Economy, Industry and Competitiveness, and grant EFA086/15 from EU Interreg V POCTEFA. M Rosell is supported by an FPI fellowship from the Severo Ochoa program. The authors are grateful for the support of the the Joint BSC-CRG-IRB Programme in Computational Biology.Peer ReviewedPostprint (author's final draft

    Data-based fault detection in chemical processes: Managing records with operator intervention and uncertain labels

    Get PDF
    Developing data-driven fault detection systems for chemical plants requires managing uncertain data labels and dynamic attributes due to operator-process interactions. Mislabeled data is a known problem in computer science that has received scarce attention from the process systems community. This work introduces and examines the effects of operator actions in records and labels, and the consequences in the development of detection models. Using a state space model, this work proposes an iterative relabeling scheme for retraining classifiers that continuously refines dynamic attributes and labels. Three case studies are presented: a reactor as a motivating example, flooding in a simulated de-Butanizer column, as a complex case, and foaming in an absorber as an industrial challenge. For the first case, detection accuracy is shown to increase by 14% while operating costs are reduced by 20%. Moreover, regarding the de-Butanizer column, the performance of the proposed strategy is shown to be 10% higher than the filtering strategy. Promising results are finally reported in regard of efficient strategies to deal with the presented problemPeer ReviewedPostprint (author's final draft

    Emergence of switch-like behavior in a large family of simple biochemical networks

    Get PDF
    Bistability plays a central role in the gene regulatory networks (GRNs) controlling many essential biological functions, including cellular differentiation and cell cycle control. However, establishing the network topologies that can exhibit bistability remains a challenge, in part due to the exceedingly large variety of GRNs that exist for even a small number of components. We begin to address this problem by employing chemical reaction network theory in a comprehensive in silico survey to determine the capacity for bistability of more than 40,000 simple networks that can be formed by two transcription factor-coding genes and their associated proteins (assuming only the most elementary biochemical processes). We find that there exist reaction rate constants leading to bistability in ~90% of these GRN models, including several circuits that do not contain any of the TF cooperativity commonly associated with bistable systems, and the majority of which could only be identified as bistable through an original subnetwork-based analysis. A topological sorting of the two-gene family of networks based on the presence or absence of biochemical reactions reveals eleven minimal bistable networks (i.e., bistable networks that do not contain within them a smaller bistable subnetwork). The large number of previously unknown bistable network topologies suggests that the capacity for switch-like behavior in GRNs arises with relative ease and is not easily lost through network evolution. To highlight the relevance of the systematic application of CRNT to bistable network identification in real biological systems, we integrated publicly available protein-protein interaction, protein-DNA interaction, and gene expression data from Saccharomyces cerevisiae, and identified several GRNs predicted to behave in a bistable fashion.Comment: accepted to PLoS Computational Biolog

    Cronobacter, the emergent bacterial pathogen Enterobacter sakazakii comes of age; MLST and whole genome sequence analysis

    Get PDF
    Background: Following the association of Cronobacter spp. to several publicized fatal outbreaks in neonatal intensive care units of meningitis and necrotising enterocolitis, the World Health Organization (WHO) in 2004 requested the establishment of a molecular typing scheme to enable the international control of the organism. This paper presents the application of Next Generation Sequencing (NGS) to Cronobacter which has led to the establishment of the Cronobacter PubMLST genome and sequence definition database (http://pubmlst.org/ cronobacter/) containing over 1000 isolates with metadata along with the recognition of specific clonal lineages linked to neonatal meningitis and adult infections Results: Whole genome sequencing and multilocus sequence typing (MLST) has supports the formal recognition of the genus Cronobacter composed of seven species to replace the former single species Enterobacter sakazakii. Applyingthe 7-loci MLST scheme to 1007 strains revealed 298 definable sequence types, yet only C. sakazakii clonal complex 4 (CC4) was principally associated with neonatal meningitis. This clonal lineage has been confirmed using ribosomal-MLST (51-loci) and whole genome-MLST (1865 loci) to analyse 107 whole genomes via the Cronobacter PubMLST database. This database has enabled the retrospective analysis of historic cases and outbreaks following re-identification of those strains. Conclusions: The Cronobacter PubMLST database offers a central, open access, reliable sequence-based repository for researchers. It has the capacity to create new analysis schemes 'on the fly', and to integrate metadata (source, geographic distribution, clinical presentation). It is also expandable and adaptable to changes in taxonomy, and able to support the development of reliable detection methods of use to industry and regulatory authorities. Therefore it meets the WHO (2004) request for the establishment of a typing scheme for this emergent bacterial pathogen. Whole genome sequencing has additionally shown a range of potential virulence and environmental fitness traits which may account for the association of C. sakazakii CC4 pathogenicity, and propensity for neonatal CNS

    Mixed membership stochastic blockmodels

    Full text link
    Observations consisting of measurements on relationships for pairs of objects arise in many settings, such as protein interaction and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilisic models can be delicate because the simple exchangeability assumptions underlying many boilerplate models no longer hold. In this paper, we describe a latent variable model of such data called the mixed membership stochastic blockmodel. This model extends blockmodels for relational data to ones which capture mixed membership latent relational structure, thus providing an object-specific low-dimensional representation. We develop a general variational inference algorithm for fast approximate posterior inference. We explore applications to social and protein interaction networks.Comment: 46 pages, 14 figures, 3 table

    Learning and Designing Stochastic Processes from Logical Constraints

    Get PDF
    Stochastic processes offer a flexible mathematical formalism to model and reason about systems. Most analysis tools, however, start from the premises that models are fully specified, so that any parameters controlling the system's dynamics must be known exactly. As this is seldom the case, many methods have been devised over the last decade to infer (learn) such parameters from observations of the state of the system. In this paper, we depart from this approach by assuming that our observations are {\it qualitative} properties encoded as satisfaction of linear temporal logic formulae, as opposed to quantitative observations of the state of the system. An important feature of this approach is that it unifies naturally the system identification and the system design problems, where the properties, instead of observations, represent requirements to be satisfied. We develop a principled statistical estimation procedure based on maximising the likelihood of the system's parameters, using recent ideas from statistical machine learning. We demonstrate the efficacy and broad applicability of our method on a range of simple but non-trivial examples, including rumour spreading in social networks and hybrid models of gene regulation

    Exploiting network topology for large-scale inference of nonlinear reaction models

    Full text link
    The development of chemical reaction models aids understanding and prediction in areas ranging from biology to electrochemistry and combustion. A systematic approach to building reaction network models uses observational data not only to estimate unknown parameters, but also to learn model structure. Bayesian inference provides a natural approach to this data-driven construction of models. Yet traditional Bayesian model inference methodologies that numerically evaluate the evidence for each model are often infeasible for nonlinear reaction network inference, as the number of plausible models can be combinatorially large. Alternative approaches based on model-space sampling can enable large-scale network inference, but their realization presents many challenges. In this paper, we present new computational methods that make large-scale nonlinear network inference tractable. First, we exploit the topology of networks describing potential interactions among chemical species to design improved "between-model" proposals for reversible-jump Markov chain Monte Carlo. Second, we introduce a sensitivity-based determination of move types which, when combined with network-aware proposals, yields significant additional gains in sampling performance. These algorithms are demonstrated on inference problems drawn from systems biology, with nonlinear differential equation models of species interactions

    Engineering simulations for cancer systems biology

    Get PDF
    Computer simulation can be used to inform in vivo and in vitro experimentation, enabling rapid, low-cost hypothesis generation and directing experimental design in order to test those hypotheses. In this way, in silico models become a scientific instrument for investigation, and so should be developed to high standards, be carefully calibrated and their findings presented in such that they may be reproduced. Here, we outline a framework that supports developing simulations as scientific instruments, and we select cancer systems biology as an exemplar domain, with a particular focus on cellular signalling models. We consider the challenges of lack of data, incomplete knowledge and modelling in the context of a rapidly changing knowledge base. Our framework comprises a process to clearly separate scientific and engineering concerns in model and simulation development, and an argumentation approach to documenting models for rigorous way of recording assumptions and knowledge gaps. We propose interactive, dynamic visualisation tools to enable the biological community to interact with cellular signalling models directly for experimental design. There is a mismatch in scale between these cellular models and tissue structures that are affected by tumours, and bridging this gap requires substantial computational resource. We present concurrent programming as a technology to link scales without losing important details through model simplification. We discuss the value of combining this technology, interactive visualisation, argumentation and model separation to support development of multi-scale models that represent biologically plausible cells arranged in biologically plausible structures that model cell behaviour, interactions and response to therapeutic interventions
    corecore