21 research outputs found
Cross-phyla protein annotation by structural prediction and alignment
Background
Protein annotation is a major goal in molecular biology, yet experimentally determined knowledge is typically limited to a few model organisms. In non-model species, the sequence-based prediction of gene orthology can be used to infer protein identity; however, this approach loses predictive power at longer evolutionary distances. Here we propose a workflow for protein annotation using structural similarity, exploiting the fact that similar protein structures often reflect homology and are more conserved than protein sequences.
Results
We propose a workflow of openly available tools for the functional annotation of proteins via structural similarity (MorF: MorphologFinder) and use it to annotate the complete proteome of a sponge. Sponges are highly relevant for inferring the early history of animals, yet their proteomes remain sparsely annotated. MorF accurately predicts the functions of proteins with known homology in >90% of cases and annotates an additional 50% of the proteome beyond standard sequence-based methods. We uncover new functions for sponge cell types, including extensive FGF, TGF, and Ephrin signaling in sponge epithelia, and redox metabolism and control in myopeptidocytes. Notably, we also annotate genes specific to the enigmatic sponge mesocytes, proposing they function to digest cell walls.
Conclusions
Our work demonstrates that structural similarity is a powerful approach that complements and extends sequence similarity searches to identify homologous proteins over long evolutionary distances. We anticipate that it will boost discovery in numerous -omics datasets, especially for non-model organisms.
Cloud Prediction of Protein Structure and Function with PredictProtein for Debian
We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome.
DescribePROT: database of amino acid-level protein structure and function predictions
We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accessed via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can also be downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included in DescribePROT are useful for a broad range of studies, including investigations of protein function, applied projects focusing on therapeutics and diseases, and the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/
An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12
Every two years, groups worldwide participate in the Critical Assessment of Protein Structure Prediction (CASP) experiment to blindly test the strengths and weaknesses of their computational methods. CASP has significantly advanced the field, but many hurdles remain, which may require new ideas and collaborations. In 2012, a web-based effort called WeFold was initiated to promote collaboration within the CASP community and attract researchers from other fields to contribute new ideas to CASP. Members of the WeFold coopetition (cooperation and competition) participated in CASP as individual teams, but also shared components of their methods to create hybrid pipelines and actively contributed to this effort. We assert that the scale and diversity of integrative prediction pipelines could not have been achieved by any individual lab or even by any collaboration among a few partners. The models contributed by the participating groups and generated by the pipelines are publicly available at the WeFold website, providing a wealth of data that remains to be tapped. Here, we analyze the results of the 2014 and 2016 pipelines, showing improvements according to the CASP assessment as well as areas that require further adjustments and research.
Fast methods for metagenomic sequence search and annotation
The past two decades have seen the development of metagenomics, the study of genes and genomes of multiple organisms simultaneously. In contrast to traditional genomic techniques, which require isolating and growing individual organisms in the lab, in metagenomics, samples are directly taken from the environment, sequenced and then analyzed in silico. Modern sequencing techniques have enabled high throughput read-out of DNA and RNA of microorganism communities in marine, soil, gut and many other environments.
The plethora of data generated using these techniques poses a major challenge for existing computational techniques. This burden translates directly to computational run times and the cost of resources required to carry out metagenomic analyses. Thus, computational methods developed for metagenomic analysis require exceptional efficiency and speed. At the same time, metagenomic studies become relevant for more and more fields of research, requiring that techniques be suited for a wide range of scientific disciplines.
In this work, I present three methods I developed to address the throughput bottlenecks of data analysis in metagenomics. (1) The MMseqs2 webserver is a user-friendly extension of the popular homology search method MMseqs2 designed for non-expert bioinformaticians. I accelerated MMseqs2 to process single queries much more quickly and introduced an API to enable MMseqs2's use in web applications. (2) MMseqs2 taxonomy is a method for fast and accurate taxonomy assignment of metagenomic contigs. (3) ColabFold is a method to make the groundbreaking AlphaFold2 protein structure predictions widely accessible, accelerating its input sequence alignment generation and improving its accuracy by assembling a novel database enriched with metagenomic sequences from a multitude of datasets.
These methods improve upon the state-of-the-art by introducing novel algorithms and accelerating previous ones, such that previously infeasible analyses become possible, and by making our metagenomic toolbox accessible to users of a wide range of skill levels.
Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold
The open-source de novo protein-level assembler, Plass (https://plass.mmseqs.com), assembles six-frame-translated sequencing reads into protein sequences. It recovers 2-10 times more protein sequences from complex metagenomes and can assemble huge datasets. We assembled two redundancy-filtered reference protein catalogs, 2 billion sequences from 640 soil samples (soil reference protein catalog) and 292 million sequences from 775 marine eukaryotic metatranscriptomes (marine eukaryotic reference catalog), the largest free collections of protein sequences.
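To make the term "six-frame-translated" concrete, the following is a minimal Python sketch of six-frame translation of a single DNA read. It is not taken from Plass: the function names are hypothetical, the standard genetic code is assumed, and incomplete trailing codons are simply dropped. Plass's own read handling may differ.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(read):
    """Hypothetical helper: three forward frames, then three reverse frames."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


if __name__ == "__main__":
    for frame in six_frame_translate("ATGGCCTAA"):
        print(frame)
```

Each read therefore yields up to six candidate protein fragments (stop codons appear as `*`). An assembler working at the protein level would take fragments like these as its input.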
ColabFold: making protein folding accessible to all
ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold's 40-60-fold faster search and optimized model utilization enable prediction of close to 1,000 structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/colabfold and its novel environmental databases are available at https://colabfold.mmseqs.com.
SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts
Summary
SpacePHARER (CRISPR Spacer Phage–Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage–host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes.
Availability and implementation
SpacePHARER is available as an open-source (GPLv3), user-friendly command-line software for Linux and macOS: https://github.com/soedinglab/spacepharer.
Supplementary information
Supplementary data are available at Bioinformatics online.
Uniclust databases of clustered and deeply annotated protein sequences and alignments
We present three clustered protein sequence databases, Uniclust90, Uniclust50 and Uniclust30, and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Due to its high sensitivity, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.
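As an illustration of what clustering "at the level of" a given pairwise sequence identity means, the sketch below groups sequences greedily around representatives. It is a toy under stated assumptions, not the Uniclust or MMseqs2 pipeline: `toy_identity` is a hypothetical stand-in that compares positions without any alignment, whereas real pipelines compute identity from sequence alignments and apply further criteria.

```python
def toy_identity(a, b):
    """Fraction of identical positions over the longer sequence.

    Assumption: no alignment is performed, so this is only meaningful for
    sequences that already line up position by position.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, min_identity):
    """Assign each sequence to the first representative it matches.

    Sequences are visited longest first; a sequence that matches no
    existing representative becomes the representative of a new cluster.
    Returns a dict mapping representative -> list of members.
    """
    clusters = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for representative in clusters:
            if toy_identity(representative, seq) >= min_identity:
                clusters[representative].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


if __name__ == "__main__":
    toy_sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTGYLAKQK", "GGGGGGGGGG"]
    for threshold in (0.9, 0.5, 0.3):
        result = greedy_cluster(toy_sequences, threshold)
        print(threshold, len(result), "clusters")
```

Lowering the threshold merges more sequences into each cluster, which is why a 30% identity database has fewer, broader clusters than a 90% one.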