17 research outputs found

    Unraveling the functional dark matter through global metagenomics

    30 pages, 4 figures, 1 table, supplementary information. https://doi.org/10.1038/s41586-023-06583-7
    Data availability: All of the analysed datasets along with their corresponding sequences are available from the IMG system (http://img.jgi.doe.gov/). A list of the datasets used in this study is provided in Supplementary Data 8. All data from the protein clusters, including sequences, multiple alignments, HMM profiles, 3D structure models, and taxonomic and ecosystem annotation, are available through NMPFamsDB, publicly accessible at www.nmpfamsdb.org. The 3D models are also available at ModelArchive under accession code ma-nmpfamsdb.
    Code availability: Sequence analysis was performed using Tantan (https://gitlab.com/mcfrith/tantan), BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi), LAST (https://gitlab.com/mcfrith/last), HMMER (http://hmmer.org/) and HH-suite3 (https://github.com/soedinglab/hh-suite). Clustering was performed using HipMCL (https://bitbucket.org/azadcse/hipmcl/src/master/). Additional taxonomic annotation was performed using Whokaryote (https://github.com/LottePronk/whokaryote), EukRep (https://github.com/patrickwest/EukRep), DeepVirFinder (https://github.com/jessieren/DeepVirFinder) and MMseqs2 (https://github.com/soedinglab/MMseqs2). 3D modelling was performed using AlphaFold2 (https://github.com/deepmind/alphafold) and TrRosetta2 (https://github.com/RosettaCommons/trRosetta2). Structural alignments were performed using TMalign (https://zhanggroup.org/TM-align/) and MMalign (https://zhanggroup.org/MM-align/). All custom scripts used for the generation and analysis of the data are available at Zenodo (https://doi.org/10.5281/zenodo.8097349).
    Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities [1,2]. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database [3]. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
    With the institutional support of the ‘Severo Ochoa Centre of Excellence’ accreditation (CEX2019-000928-S).
    Peer reviewed.

    Proteome-Wide Detection and Annotation of Receptor Tyrosine Kinases (RTKs): RTK-PRED and the TyReK Database

    Receptor tyrosine kinases (RTKs) form a highly important group of protein receptors of the eukaryotic cell membrane. They control many vital cellular functions and are involved in the regulation of complex signaling networks. Mutations in RTKs have been associated with different types of cancers and other diseases. Although they are very important for proper cell function, they have been experimentally studied in a limited range of eukaryotic species. Currently, there is no available database for RTKs providing information about their function, expression, and interactions. Therefore, the identification of RTKs in multiple organisms, the documentation of their characteristics, and the collection of related information would be very useful. In this paper, we present a novel RTK detection pipeline (RTK-PRED) and the Receptor Tyrosine Kinases Database (TyReK-DB). RTK-PRED combines profile HMMs with transmembrane topology prediction to identify and classify potential RTKs. Proteins of all eukaryotic reference proteomes of the UniProt database were used as input in RTK-PRED leading to a filtered dataset of 20,478 RTKs. Based on the information collected for these RTKs from multiple databases, the relational TyReK database was created

    SCALA: A complete solution for multimodal analysis of single-cell Next Generation Sequencing data

    Analysis and interpretation of high-throughput transcriptional and chromatin accessibility data at single-cell (sc) resolution are still open challenges in the biomedical field. The existence of countless bioinformatics tools for the different analytical steps increases the complexity of data interpretation and the difficulty of deriving biological insights. In this article, we present SCALA, a bioinformatics tool for analysis and visualization of single-cell RNA sequencing (scRNA-seq) and Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) datasets, enabling either independent or integrative analysis of the two modalities. SCALA combines standard types of analysis by integrating multiple software packages, ranging from quality control to the identification of distinct cell populations and cell states. Additional analysis options enable functional enrichment, cellular trajectory inference, ligand-receptor analysis, and regulatory network reconstruction. SCALA is fully parameterizable, presenting data in tabular format and producing publication-ready visualizations. The different available analysis modules can aid biomedical researchers in exploring, analyzing, and visualizing their data without any prior experience in coding. We demonstrate the functionality of SCALA through two use-cases related to TNF-driven arthritic mice, handling both scRNA-seq and scATAC-seq datasets. SCALA is developed in R, Shiny and JavaScript and is mainly available as a standalone version, while an online service of more limited capacity can be found at http://scala.pavlopouloslab.info or https://scala.fleming.gr

    NucEnvDB: A Database of Nuclear Envelope Proteins and Their Interactions

    The nuclear envelope (NE) is a double-membrane system surrounding the nucleus of eukaryotic cells. A large number of proteins are localized in the NE, performing a wide variety of functions, from the bidirectional exchange of molecules between the cytoplasm and the nucleus to chromatin tethering, genome organization, regulation of signaling cascades, and many others. Despite its importance, several aspects of the NE, including its protein–protein interactions, remain understudied. In this work, we present NucEnvDB, a publicly available database of NE proteins and their interactions. Each database entry contains useful annotation including a description of its position in the NE, its interactions with other proteins, and cross-references to major biological repositories. In addition, the database provides users with a number of visualization and analysis tools, including the ability to construct and visualize protein–protein interaction networks and perform functional enrichment analysis for clusters of NE proteins and their interaction partners. The capabilities of NucEnvDB and its analysis tools are showcased by two informative case studies, exploring protein–protein interactions in Hutchinson–Gilford progeria and during SARS-CoV-2 infection at the level of the nuclear envelope

    Hidden Aggregation Hot-Spots on Human Apolipoprotein E: A Structural Study

    Human apolipoprotein E (apoE) is a major component of lipoprotein particles and, under physiological conditions, is involved in plasma cholesterol transport. Human apolipoprotein E, found in three isoforms (E2, E3 and E4), is a member of a family of apolipoproteins that under pathological conditions are detected in extracellular amyloid depositions in several amyloidoses. Interestingly, the lipid-free apoE form has been shown to be co-localized with the amyloidogenic Aβ peptide in amyloid plaques in Alzheimer’s disease, whereas the apoE4 isoform in particular is a crucial risk factor for late-onset Alzheimer’s disease. Experimental evidence shows that apoE self-assembles into amyloid fibrils in vitro, although the misfolding mechanism has not yet been clarified. Here, we explored the mechanism of apoE misfolding by testing short apoE stretches predicted as amyloidogenic determinants by AMYLPRED, and we computationally investigated the dynamics of apoE and an apoE–Aβ complex. Our in vitro biophysical results indicate that apoE peptide analogues may act as the driving force needed to trigger apoE aggregation and are supported by the computational results for apoE. Additional computational work on the apoE–Aβ complex also designates apoE amyloidogenic regions as important binding sites for oligomeric Aβ, taking an important step forward in the field of Alzheimer’s anti-aggregation drug development.

    FLAME: A Web Tool for Functional and Literature Enrichment Analysis of Multiple Gene Lists

    Functional enrichment is a widely used method for interpreting experimental results by identifying classes of proteins/genes associated with certain biological functions, pathways, diseases, or phenotypes. Despite the variety of existing tools, most can process only a single list at a time, which makes combinatorial analysis more complicated and error-prone. In this article, we present FLAME, a web tool for combining multiple lists prior to enrichment analysis. Users can upload several lists and use interactive UpSet plots, as an alternative to Venn diagrams, to handle unions or intersections among the given input files. Functional and literature enrichment, along with gene conversions, are offered by the g:Profiler and aGOtool applications for 197 organisms. FLAME can analyze genes/proteins for related articles, Gene Ontologies, pathways, annotations, regulatory motifs, domains, diseases, and phenotypes, and can also generate protein–protein interactions derived from STRING. We have validated FLAME by interrogating gene expression data associated with the sensitivity of the distal part of the large intestine to experimental colitis-propelled colon cancer. FLAME comes with an interactive user-friendly interface for easy list manipulation and exploration, while results can be visualized as interactive and parameterizable heatmaps, barcharts, Manhattan plots, networks, and tables.
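
The list-combination step described in the FLAME abstract can be illustrated with plain set operations. The following is a minimal sketch, not FLAME's implementation: the function names (`upset_regions`, `union_of`, `intersection_of`) and the example gene lists are hypothetical, and it only shows the kind of union, intersection and exclusive-membership groupings that an UpSet plot summarizes.

```python
def upset_regions(gene_lists):
    """Group genes by the exact set of lists they appear in (UpSet-style regions)."""
    membership = {}
    for name, genes in gene_lists.items():
        for gene in set(genes):
            membership.setdefault(gene, set()).add(name)
    regions = {}
    for gene, names in membership.items():
        regions.setdefault(frozenset(names), set()).add(gene)
    return regions


def union_of(gene_lists, names):
    """Genes present in at least one of the named lists."""
    return set().union(*(gene_lists[n] for n in names))


def intersection_of(gene_lists, names):
    """Genes present in every one of the named lists."""
    return set.intersection(*(set(gene_lists[n]) for n in names))


# Hypothetical input lists, for illustration only.
example_lists = {
    "A": ["TNF", "IL6", "TP53"],
    "B": ["TNF", "TP53", "MYC"],
    "C": ["TNF", "KRAS"],
}
example_regions = upset_regions(example_lists)
```

The combined set returned by `union_of` or `intersection_of` is what would then be passed on to an enrichment service as a single query list.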

    Darling: A Web Application for Detecting Disease-Related Biomedical Entity Associations with Literature Mining

    Finding, exploring and filtering frequent sentence-based associations between a disease and a biomedical entity, co-mentioned in disease-related PubMed literature, becomes more challenging as the volume of publications increases. Darling is a web application that uses Named Entity Recognition to identify human-related biomedical terms in PubMed articles mentioned in OMIM, DisGeNET and Human Phenotype Ontology (HPO) disease records, and generates an interactive biomedical entity association network. Nodes in this network represent genes, proteins, chemicals, functions, tissues, diseases, environments and phenotypes. Users can search by identifiers, terms/entities or free text and explore the relevant abstracts in an annotated format.

    VICTOR: A visual analytics web application for comparing cluster sets

    Clustering is the process of grouping different data objects based on similar properties. Clustering has applications in various case studies from several fields such as graph theory, image analysis, pattern recognition, statistics and others. Nowadays, there are numerous algorithms and tools able to generate clustering results. However, different algorithms or parameterizations may produce quite dissimilar cluster sets. As a result, users are often forced to manually filter and compare these results in order to decide which of them generates the ideal clusters. To automate this process, in this study, we present VICTOR, the first fully interactive and dependency-free visual analytics web application which allows the visual comparison of the results of various clustering algorithms. VICTOR can handle multiple cluster set results simultaneously and compare them using ten different metrics. Clustering results can be filtered and compared to each other with the use of data tables or interactive heatmaps, bar plots, correlation networks, Sankey and Circos plots. We demonstrate VICTOR's functionality using three examples. In the first case, we compare five different network clustering algorithms on a yeast protein-protein interaction dataset, whereas in the second example, we test four different parameters of the MCL clustering algorithm on the same dataset. Finally, as a third example, we compare four different meta-analyses with hierarchically clustered differentially expressed genes found to be involved in myocardial infarction. VICTOR is available at http://victor.pavlopouloslab.info or http://bib.fleming.gr:3838/VICTOR
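
To make the idea of comparing two cluster sets with a metric concrete, here is a minimal sketch. The abstract does not name VICTOR's ten metrics, so the pair-based Jaccard index below is an assumed, commonly used example rather than a description of the tool. The function names and the two example cluster sets are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return every unordered pair of items that share a cluster."""
    pairs = set()
    for members in clusters:
        # Sorting gives each pair a canonical (a, b) order.
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs


def pair_jaccard(clusters_a, clusters_b):
    """Jaccard index over co-clustered pairs: 1.0 means identical groupings."""
    pairs_a = co_clustered_pairs(clusters_a)
    pairs_b = co_clustered_pairs(clusters_b)
    union = pairs_a | pairs_b
    if not union:
        return 1.0  # neither clustering groups any two items together
    return len(pairs_a & pairs_b) / len(union)


# Two hypothetical clusterings of the same five proteins.
result_a = [["p1", "p2", "p3"], ["p4", "p5"]]
result_b = [["p1", "p2"], ["p3", "p4", "p5"]]
similarity = pair_jaccard(result_a, result_b)
```

Computing such a score for every pair of clustering results yields a similarity matrix, which is the kind of data a heatmap comparison of cluster sets would display.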