18 research outputs found

    Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space

    The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement while also adding security measures for active threat detection and monitoring and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consists of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types

    Piecemeal Buildup of the Genetic Code, Ribosomes, and Genomes from Primordial tRNA Building Blocks

    The origin of biomolecular machinery likely centered around an ancient and central molecule capable of interacting with emergent macromolecular complexity. tRNA is the oldest and most central nucleic acid molecule of the cell. Its co-evolutionary interactions with aminoacyl-tRNA synthetase protein enzymes define the specificities of the genetic code and those with the ribosome their accurate biosynthetic interpretation. Phylogenetic approaches that focus on molecular structure allow reconstruction of evolutionary timelines that describe the history of RNA and protein structural domains. Here we review phylogenomic analyses that reconstruct the early history of the synthetase enzymes and the ribosome, their interactions with RNA, and the inception of amino acid charging and codon specificities in tRNA that are responsible for the genetic code. We also trace the age of domains and tRNA onto ancient tRNA homologies that were recently identified in rRNA. Our findings reveal a timeline of recruitment of tRNA building blocks for the formation of a functional ribosome, which holds both the biocatalytic functions of protein biosynthesis and the ability to store genetic memory in primordial RNA genomic templates

    An Evolutionarily Structured Universe of Protein Architecture

    Protein structural diversity encompasses a finite set of architectural designs. Embedded in these topologies are evolutionary histories that we here uncover using cladistic principles and measurements of protein-fold usage and sharing. The reconstructed phylogenies are inherently rooted and depict histories of protein and proteome diversification. Proteome phylogenies showed two monophyletic sister-groups delimiting Bacteria and Archaea, and a topology rooted in Eucarya. This suggests three dramatic evolutionary events and a common ancestor with a eukaryotic-like, gene-rich, and relatively modern organization. Conversely, a general phylogeny of protein architectures showed that structural classes of globular proteins appeared early in evolution and in defined order, the α/β class being the first. Although most ancestral folds shared a common architecture of barrels or interleaved β-sheets and α-helices, many were clearly derived, such as polyhedral folds in the all-α class and β-sandwiches, β-propellers, and β-prisms in all-β proteins. We also describe transformation pathways of architectures that are prevalently used in nature. For example, β-barrels with increased curl and stagger were favored evolutionary outcomes in the all-β class. Interestingly, we found cases where structural change followed the α-to-β tendency uncovered in the tree of architectures. Lastly, we traced the total number of enzymatic functions associated with folds in the trees and show that there is a general link between structure and enzymatic function.

    Computing the origin and evolution of the ribosome from its structure — Uncovering processes of macromolecular accretion benefiting synthetic biology

    Accretion occurs pervasively in nature at widely different timeframes. The process also manifests in the evolution of macromolecules. Here we review recent computational and structural biology studies of evolutionary accretion that make use of the ideographic (historical, retrodictive) and nomothetic (universal, predictive) scientific frameworks. Computational studies uncover explicit timelines of accretion of structural parts in molecular repertoires and molecules. Phylogenetic trees of protein structural domains and proteomes and their molecular functions were built from a genomic census of millions of encoded proteins and associated terminal Gene Ontology terms. Trees reveal a ‘metabolic-first’ origin of proteins, the late development of translation, and a patchwork distribution of proteins in biological networks mediated by molecular recruitment. Similarly, the natural history of ancient RNA molecules inferred from trees of molecular substructures built from a census of molecular features shows patchwork-like accretion patterns. Ideographic analyses of ribosomal history uncover the early appearance of structures supporting mRNA decoding and tRNA translocation, the coevolution of ribosomal proteins and RNA, and a first evolutionary transition that brings ribosomal subunits together into a processive protein biosynthetic complex. Nomothetic structural biology studies of tertiary interactions and ancient insertions in rRNA complement these findings, once concentric layering assumptions are removed. Patterns of coaxial helical stacking reveal a frustrated dynamics of outward and inward ribosomal growth possibly mediated by structural grafting. The early rise of the ribosomal ‘turnstile’ suggests an evolutionary transition in natural biological computation. Results make explicit the need to understand processes of molecular growth and information transfer of macromolecules

    Rooting Phylogenies and the Tree of Life While Minimizing Ad Hoc and Auxiliary Assumptions

    Phylogenetic methods unearth evolutionary history when supported by three starting points of reason: (1) the continuity axiom begs the existence of a “model” of evolutionary change, (2) the singularity axiom defines the historical ground plan (phylogeny) in which biological entities (taxa) evolve, and (3) the memory axiom demands identification of biological attributes (characters) with historical information. Axiom consequences are interlinked, making the retrodiction enterprise an endeavor of reciprocal fulfillment. In particular, establishing direction of evolutionary change (character polarization) roots phylogenies and enables testing the existence of historical memory (homology). Unfortunately, the rooting of phylogenies, especially the “tree of life,” generally follows narratives instead of integrating empirical and theoretical knowledge of retrodictive exploration. This stems mostly from a focus on molecular sequence analysis and uncertainties about rooting methods. Here, we review available rooting criteria, highlighting the need to minimize both ad hoc and auxiliary assumptions, especially argumentative ad hocness. We show that while the outgroup comparison method has been widely adopted, the generality criterion of nesting and additive phylogenetic change embodied in the Weston rule offers the most powerful rooting approach. We also propose a change of focus, from phylogenies that describe the evolution of biological systems to those that describe the evolution of parts of those systems. This weakens violation of character independence, helps formalize the generality criterion of rooting, and provides new ways to study the problem of evolution.

    Structural Phylogenomics Retrodicts the Origin of the Genetic Code and Uncovers the Evolutionary Impact of Protein Flexibility

    The genetic code shapes the genetic repository. Its origin has puzzled molecular scientists for over half a century and remains a long-standing mystery. Here we show that the origin of the genetic code is tightly coupled to the history of aminoacyl-tRNA synthetase enzymes and their interactions with tRNA. A timeline of evolutionary appearance of protein domain families derived from a structural census in hundreds of genomes reveals the early emergence of the ‘operational’ RNA code and the late implementation of the standard genetic code. The emergence of codon specificities and amino acid charging involved tight coevolution of aminoacyl-tRNA synthetases and tRNA structures as well as episodes of structural recruitment. Remarkably, amino acid and dipeptide compositions of single-domain proteins appearing before the standard code suggest archaic synthetases with structures homologous to catalytic domains of tyrosyl-tRNA and seryl-tRNA synthetases were capable of peptide bond formation and aminoacylation. Results reveal that genetics arose through coevolutionary interactions between polypeptides and nucleic acid cofactors as an exacting mechanism that favored flexibility and folding of the emergent proteins. These enhancements of phenotypic robustness were likely internalized into the emerging genetic system with the early rise of modern protein structure.

    Evolutionary heat maps describing the amino acid and dipeptide compositions of FF domain structures of different age.

    A. Frequency of amino acids in FFs. The color array of 29,480 cells (1,475 rows × 20 columns) describes the amino acid composition of 1,475 FFs along the evolutionary timeline. Columns represent the 20 standard amino acids ordered (from left to right) according to average amino acid frequency and rows represent FFs ordered (from top to bottom) according to domain age (nd_FF = 0 to 1). B. Frequency of dipeptides in FFs. The color array of 589,600 cells (1,475 rows × 400 columns) describes the 400-dipeptide composition of FFs along the timeline. Columns represent dipeptide types ordered (from left to right) according to average frequency (from LL to WW) and rows represent FFs ordered according to age. The heat maps confirm the existence of non-random patterns of amino acid and dipeptide compositions along the evolutionary timeline of FFs and reveal unique signatures of amino acid and dipeptide use in FFs. Amino acids are described with single-letter codes.

    Phylogenomic analyses of protein domains and tRNA structures and functions.

    A. Flow diagram showing the reconstruction of trees of protein domain structures. A census of domain structures in proteomes of hundreds of completely sequenced organisms is used to compose data matrices, which are then used to build phylogenomic trees describing the evolution of individual protein structures. Elements of the matrix (g) represent genomic abundances of domains in proteomes, defined at different levels of classification of domain structure (e.g. SCOP F, FSF, and FF). They are converted into multi-state phylogenetic characters with character states transforming according to linearly ordered and reversible pathways. Trees of proteomes can be generated from the matrices of phylogenetic characters. They are not used in this paper but are largely congruent with traditional classification. B. Evolution of tRNA structure and function. The ancient ‘top half’ of tRNA embeds an ‘operational code’ in the identity elements of the acceptor arm that interact with the catalytic domain of aaRSs through class I and II modes of tRNA recognition. The evolutionarily recent ‘bottom half’ of tRNA holds the standard code in identity elements of the anticodon loop that interact with anticodon-binding domains of aaRSs. The flow diagram below describes the phylogenetic reconstruction of trees of tRNA substructures (ToSs). The structures of tRNA molecules were first decomposed into substructures. Structural features (e.g., length, Shannon entropic descriptors) of substructures such as helical stem tracts and unpaired regions are coded as phylogenetic characters and assigned character states according to an evolutionary model that polarizes character transformation towards an increase in conformational order (character argumentation). Coded characters (s) are arranged in data matrices, which can be transposed for further cladistic analyses (e.g., to produce trees of substructures). Phylogenetic analysis using maximum parsimony optimality criteria generates rooted phylogenetic trees of tRNA molecules. Embedded in trees of domains and trees of tRNAs are timelines that assign age to molecular structures and associated functions. C. Culling of PDB sequences for calculation of amino acid frequencies and dipeptide counts. Dipeptides define concatenated 2-mer amino acid sequences.
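
    The amino acid frequencies and dipeptide counts mentioned in panel C can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, and it assumes that dipeptides are counted as overlapping 2-mers and that residues outside the 20 standard single-letter codes are skipped, neither of which is stated in the caption.

```python
from collections import Counter

# The 20 standard amino acids as single-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(seq):
    """Relative frequency of each standard amino acid in a sequence."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def dipeptide_counts(seq):
    """Counts of overlapping 2-mers (assumption: overlapping windows).

    Pairs containing a non-standard residue are skipped rather than bridged.
    """
    seq = seq.upper()
    counts = Counter()
    for first, second in zip(seq, seq[1:]):
        if first in AMINO_ACIDS and second in AMINO_ACIDS:
            counts[first + second] += 1
    return counts


if __name__ == "__main__":
    example = "MLLAWW"  # hypothetical sequence, for illustration only
    print(amino_acid_frequencies(example)["L"])
    print(dipeptide_counts(example))
```

    Stacking one such frequency vector per domain (20 values for amino acids, 400 for dipeptides) would give the row-by-column layout that the heat maps describe.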

    Dipeptide makeup of ancient proteins.

    A. The distribution of dipeptide compositions in proteins shows remarkable conservation along the FF timeline. Stacked column charts describe the 400 possible dipeptides (combinations of two amino acids) corresponding to 9 sets specified by Groups 1, 2 and 3 aaRS structures (1-1, 1-2, 2-1, etc). The stacked columns on the right display the general distribution pattern of dipeptides in the dipeptide sets for all 2,384 sequences and the expectation of dipeptide set distributions calculated by free permutation. Circles and asterisks represent groups that are over- or underrepresented, respectively, following χ-square statistical contrasts. B. Ancient FFs appearing before anticodon-binding domains (nd_FF ≤ 0.2) were significantly enriched (P < 0.01) in dipeptides composed of amino acids specified by the ancient editing domains (Group 1 and 2). The bar plot shows the amino acid frequencies of the 33 enriched dipeptides, the doughnut chart describes enriched dipeptide set compositions, and the network displays dipeptide makeup, with peptide bonds (edges, weighted by number of dipeptide types) connecting participating amino acids (nodes, with size proportional to connections). C. Mapping of enriched dipeptides in protein structures. Box-and-whisker plots describe the distribution of the 33 dipeptides that are significantly enriched in early FFs (nd_FF ≤ 0.2) versus that of all dipeptides in regular and non-regular structural regions of the 2,384 protein sequences analyzed. Regular structures include helical regions (H) with α-helix (h), 3₁₀-helix (g) and π-helix (i) elements, strand regions (E) with β-strand (e) and β-bridge (b) elements, and turn/bend regions (T) with turns (t) and bends (b). Non-regular (unstructured) regions include loops (Ω). PBT amino acids can span different regions. Statistical differences between PBT were defined by p-values of Mann-Whitney non-parametric tests. Increases and decreases in central tendencies for the ancestral proteins are indicated with + and − signs, respectively, for structural sets with significant associations.