
    IgTM: An algorithm to predict transmembrane domains and topology in proteins

    Background: Because they act as receptors or transporters, membrane proteins play a key role in many important biological functions. In our work we used Grammatical Inference (GI) to localize transmembrane segments. Our GI process is based specifically on the inference of Even Linear Languages. Results: We obtained values close to 80% in both specificity and sensitivity. Six datasets were used for the experiments, considering different encodings for the input sequences. An encoding that includes the topology changes in the sequence (from inside and outside the membrane to it and vice versa) gave the best results. This software is publicly available at: http://www.dsic.upv.es/users/tlcc/bio/bio.html Conclusion: We compared our results with other well-known methods, which obtain slightly better precision. However, this work shows that Grammatical Inference techniques can be applied effectively to bioinformatics problems.
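    The sensitivity and specificity figures quoted in this abstract can be illustrated with a minimal sketch. It assumes the standard per-residue definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) and a hypothetical labelling in which `M` marks a membrane residue and any other character a non-membrane residue. The abstract does not state which definitions or encodings the authors used, so the function name and label alphabet below are illustrative only.

```python
def sensitivity_specificity(true_labels: str, predicted_labels: str,
                            positive: str = "M") -> tuple[float, float]:
    """Per-residue sensitivity and specificity for one labelled sequence.

    Assumption: `positive` marks transmembrane residues; every other
    character counts as non-membrane. This is not the paper's own code.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")

    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive and guess == positive:
            tp += 1          # membrane residue correctly predicted
        elif truth == positive:
            fn += 1          # membrane residue missed
        elif guess == positive:
            fp += 1          # non-membrane residue predicted as membrane
        else:
            tn += 1          # non-membrane residue correctly predicted

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy example: the predicted segment is shifted one residue to the left.
    print(sensitivity_specificity("iiMMMMoo", "iMMMMooo"))
```

    Note that some transmembrane-prediction papers use "specificity" to mean precision (TP / (TP + FP)), so the definition should be checked against the full text before comparing numbers.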

    Category Theoretic Analysis of Hierarchical Protein Materials and Social Networks

    Materials in biology span all the scales from Angstroms to meters and typically consist of complex hierarchical assemblies of simple building blocks. Here we describe an application of category theory to describe structural and resulting functional properties of biological protein materials by developing so-called ologs. An olog is like a “concept web” or “semantic network” except that it follows a rigorous mathematical formulation based on category theory. This key difference ensures that an olog is unambiguous, highly adaptable to evolution and change, and suitable for sharing concepts with other ologs. We consider simple cases of beta-helical and amyloid-like protein filaments subjected to axial extension and develop an olog representation of their structural and resulting mechanical properties. We also construct a representation of a social network in which people send text messages to their nearest neighbors and act as a team to perform a task. We show that the olog for the protein and the olog for the social network feature identical category-theoretic representations, and we proceed to precisely explicate the analogy or isomorphism between them. The examples presented here demonstrate that the intrinsic nature of a complex system, which in particular includes a precise relationship between structure and function at different hierarchical levels, can be effectively represented by an olog. This, in turn, allows for comparative studies between disparate materials or fields of application, and results in novel approaches to derive functionality in the design of de novo hierarchical systems. 
We discuss opportunities and challenges associated with the description of complex biological materials by using ologs as a powerful tool for analysis and design in the context of materiomics, and we present the potential impact of this approach for engineering, life sciences, and medicine. Presidential Early Career Award for Scientists and Engineers (N000141010562); United States. Army Research Office. Multidisciplinary University Research Initiative (W911NF0910541); United States. Office of Naval Research (grant N000141010841); Massachusetts Institute of Technology. Dept. of Mathematics; Studienstiftung des deutschen Volkes; Clark Barwick; Jacob Luri

    Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases

    Microorganisms that cause infectious diseases present critical issues of national security, public health, and economic welfare. For example, in recent years, highly pathogenic strains of avian influenza have emerged in Asia, spread through Eastern Europe, and threaten to become pandemic. As demonstrated by the coordinated response to Severe Acute Respiratory Syndrome (SARS) and influenza, agents of infectious disease are being addressed via large-scale genomic sequencing. The goal of genomic sequencing projects is to rapidly put large amounts of data in the public domain to accelerate research on disease surveillance, treatment, and prevention. However, our ability to derive information from large comparative genomic datasets lags far behind acquisition. Here we review the computational challenges of comparative genomic analyses, specifically sequence alignment and reconstruction of phylogenetic trees. We present novel analytical results from two important infectious diseases, SARS and influenza. SARS and influenza have similarities and important differences both as biological and comparative genomic analysis problems. Influenza viruses (Orthomyxoviridae) are RNA based. Current evidence indicates that influenza viruses originate in wild populations of aquatic birds. Influenza has been studied for decades via well-coordinated international efforts. These efforts center on surveillance via antibody characterization of the hemagglutinin (HA) and neuraminidase (N) proteins of the circulating strains to inform vaccine design. However, we still do not have a clear understanding of: 1) various transmission pathways, such as the role of intermediate hosts like swine and domestic birds, and 2) the key mutation and genomic recombination events that underlie periodic pandemics of influenza. In the past 30 years, sequence data from HA and N loci have become an important data type. 
In the past year, full genomic data have become prominent. These data present exciting opportunities to address unanswered questions in influenza pandemics. SARS is caused by a previously unrecognized lineage of coronavirus, SARS-CoV, which like influenza has an RNA-based genome. Although SARS-CoV is widely believed to have originated in animals, there remains disagreement over the candidate animal source that led to the original outbreak of SARS. In contrast to the long history of the study of influenza, SARS was only recognized in late 2002, and the virus that causes SARS has been documented primarily by genomic sequencing. In the past, most studies of influenza were performed on a limited number of isolates and genes suited to a particular problem. Major goals in science today are to understand emerging diseases in broad geographic, environmental, societal, biological, and genomic contexts. Synthesizing diverse information brought together by various researchers is important to find out what can be done to prevent future outbreaks {JON03}. Thus comprehensive means to organize and analyze large amounts of diverse information are critical. For example, the relationships of isolates and patterns of genomic change observed in large datasets might not be consistent with hypotheses formed on partial data. Moreover, when researchers rely on partial datasets, they restrict the range of possible discoveries. Phylogenetics is well suited to the complex task of understanding emerging infectious disease. Phylogenetic analyses can test many hypotheses by comparing diverse isolates collected from various hosts, environments, and points in time and organizing these data into various evolutionary scenarios. The products of a phylogenetic analysis are a graphical tree of ancestor-descendant relationships and an inferred summary of mutations, recombination events, host shifts, and the geographic and temporal spread of the viruses. However, this synthesis comes at a price. 
The computational cost of phylogenetic analysis grows combinatorially as the number of isolates considered increases. Thus, large datasets like those currently produced are commonly considered intractable. We address this problem with synergistic development of heuristic tree search strategies and parallel computing. Affiliation: Janies, D., Ohio State University; United States. Affiliation: Pol, Diego, Ohio State University; United States. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentin
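
    The combinatorial growth mentioned in this abstract can be made concrete with a small sketch. It uses the standard result that the number of distinct unrooted, fully bifurcating tree topologies on n labelled taxa is (2n - 5)!! for n >= 3. The abstract does not give this formula, and the function name below is hypothetical; this illustrates why exhaustive search becomes infeasible and is not the authors' method.

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating topologies on n labelled taxa.

    Uses the double factorial (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5),
    valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted binary tree")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n - 5 (the factor 1 is implicit).
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, num_unrooted_binary_trees(n))
```

    Even at a few dozen isolates the count exceeds what any exhaustive enumeration could visit, which is consistent with the abstract's emphasis on heuristic tree search and parallel computing.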

    A stochastic context free grammar based framework for analysis of protein sequences

    Background: In the last decade, there have been many applications of formal language theory in bioinformatics, such as RNA structure prediction and detection of patterns in DNA. However, in the field of proteomics, the size of the protein alphabet and the complexity of the relationships between amino acids have mainly limited the application of formal language theory to the production of grammars whose expressive power is not higher than that of stochastic regular grammars. These grammars, like other state-of-the-art methods, cannot cover higher-order dependencies such as the nested and crossing relationships that are common in proteins. In order to overcome some of these limitations, we propose a Stochastic Context Free Grammar based framework for the analysis of protein sequences where grammars are induced using a genetic algorithm. Results: This framework was implemented in a system aiming at the production of binding site descriptors. These descriptors not only allow detection of protein regions that are involved in these sites, but also provide insight into their structure. Grammars were induced using quantitative properties of amino acids to deal with the size of the protein alphabet. Moreover, we imposed some structural constraints on grammars to reduce the extent of the rule search space. Finally, grammars based on different properties were combined to convey as much information as possible. Evaluation was performed on sites of various sizes and complexity described either by PROSITE patterns, domain profiles or a set of patterns. Results show the produced binding site descriptors are human-readable and, hence, highlight biologically meaningful features. Moreover, they achieve good accuracy in both annotation and detection. 
In addition, findings suggest that, unlike current state-of-the-art methods, our system may be particularly suited to deal with patterns shared by non-homologous proteins. Conclusion: A new Stochastic Context Free Grammar based framework has been introduced allowing the production of binding site descriptors for analysis of protein sequences. Experiments have shown that not only is this new approach valid, but it also produces human-readable descriptors for binding sites, which have been beyond the capability of current machine learning techniques.

    Timing of oceans on Mars from shoreline deformation.

    Widespread evidence points to the existence of an ancient Martian ocean. Most compelling are the putative ancient shorelines in the northern plains. However, these shorelines fail to follow an equipotential surface, and this has been used to challenge the notion that they formed via an early ocean and hence to question the existence of such an ocean. The shorelines' deviation from a constant elevation can be explained by true polar wander occurring after the formation of Tharsis, a volcanic province that dominates the gravity and topography of Mars. However, surface loading from the oceans can drive polar wander only if Tharsis formed far from the equator, and most evidence indicates that Tharsis formed near the equator, meaning that there is no current explanation for the shorelines' deviation from an equipotential that is consistent with our geophysical understanding of Mars. Here we show that variations in shoreline topography can be explained by deformation caused by the emplacement of Tharsis. We find that the shorelines must have formed before and during the emplacement of Tharsis, instead of afterwards, as previously assumed. Our results imply that oceans on Mars formed early, concurrent with the valley networks, and point to a close relationship between the evolution of oceans on Mars and the initiation and decline of Tharsis volcanism, with broad implications for the geology, hydrological cycle and climate of early Mars.