
21 research outputs found

Cross-phyla protein annotation by structural prediction and alignment

Author: Arendt Detlev
Mirdita Milot
Musser Jacob M.
Papadopoulos Nikolaos
Ruperti Fabian
Steinegger Martin
Publication venue: BMC
Publication date: 01/05/2023

Background: Protein annotation is a major goal in molecular biology, yet experimentally determined knowledge is typically limited to a few model organisms. In non-model species, the sequence-based prediction of gene orthology can be used to infer protein identity; however, this approach loses predictive power at longer evolutionary distances. Here we propose a workflow for protein annotation using structural similarity, exploiting the fact that similar protein structures often reflect homology and are more conserved than protein sequences. Results: We propose a workflow of openly available tools for the functional annotation of proteins via structural similarity (MorF: MorphologFinder) and use it to annotate the complete proteome of a sponge. Sponges are highly relevant for inferring the early history of animals, yet their proteomes remain sparsely annotated. MorF accurately predicts the functions of proteins with known homology in >90% of cases and annotates an additional 50% of the proteome beyond standard sequence-based methods. We uncover new functions for sponge cell types, including extensive FGF, TGF, and Ephrin signaling in sponge epithelia, and redox metabolism and control in myopeptidocytes. Notably, we also annotate genes specific to the enigmatic sponge mesocytes, proposing they function to digest cell walls. Conclusions: Our work demonstrates that structural similarity is a powerful approach that complements and extends sequence similarity searches to identify homologous proteins over long evolutionary distances. We anticipate this will be a powerful approach that boosts discovery in numerous -omics datasets, especially for non-model organisms.

SNU Open Repository and Archive

Cloud Prediction of Protein Structure and Function with PredictProtein for Debian

Author: Angermüller Christof
Böhm Ariane
Domke Simon
Ertl Julia
Kaján László
Mertes Christian
Mirdita Milot
Reisinger Eva
Rost Burkhard
Staniewski Cedric
Steinegger Martin
Vicedo Esmeralda
Yachdav Guy
Publication venue: Hindawi Limited
Publication date: 01/01/2013

We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome.

Crossref

SNU Open Repository and Archive

Directory of Open Access Journals

PubMed Central

DescribePROT: database of amino acid-level protein structure and function predictions

Author: Dunker A. Keith
Faraggi Eshel
Gsponer Jörg
Katuwawala Akila
Kloczkowski Andrzej
Kurgan Lukasz
Malhis Nawar
Mirdita Milot
Obradovic Zoran
Oldfield Christopher J.
Steinegger Martin
Söding Johannes
Zhao Bi
Zhou Yaoqi
Publication venue: Oxford University Press (OUP)
Publication date: 08/01/2021

We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accessed via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can also be downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included in DescribePROT are useful for a broad range of studies, including investigations of protein function, applied projects focusing on therapeutics and diseases, and the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/

Crossref

IUPUIScholarWorks

MPG.PuRe


An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12

Author: Adhikari Badri
Bacardit Jaume
Baker David
Baranowski Maciej
Bhattacharya Debswapna
Blake Lauren
Bortot Leandro Oliveira
Cao Renzhi
Chapman Nicholas
Cheng Jianlin
Chopra Gaurav
Cooper Seth
Crivelli Silvia N
Czaplewski Cezary
Defelicibus Alexandre
Delbem Alexandre Cláudio Botazzo
Dhanasekaran BK
Dimas Itzhel
Faccioli Rodrigo Antonio
Faraggi Eshel
Flatten Jeff
Floudas Christodoulos
Foldit Players consortium
Ganzynkowicz Robert
Ghosh Sambit
Ghosh Soma
Giełdoń Artur
Golon Lukasz
He Yi
Heo Lim
Hou Jie
Keasar Chen
Khan Main
Khatib Firas
Khoury George A
Kieslich Chris
Kim David E
Kloczkowski Andrzej
Koepnick Brian
Krupa Pawel
Lee Gyu Rie
Levitt Michael
Li Hongbo
Li Jilong
Lipska Agnieszka
Liwo Adam
Maghrabi Ali Hassan A
McGuffin Liam J
Mirdita Milot
Mirzaei Shokoufeh
Mozolewska Magdalena A
Onel Melis
Ovchinnikov Sergey
Ołdziej Stanislaw
Popović Zoran
Scheraga Harold
Seok Chaok
Shah Anand
Shah Utkarsh
Sidi Tomer
Sieradzan Adam K
Smadbeck James
Söding Johannes
Tamamis Phanourios
Trieber Nicholas
Vishveshwara Saraswathi
Wallner Björn
Wirecki Tomasz
Xu Dong
Yin Yanping
Zaborowski Bartlomiej
Zhang Yang
Ślusarz Magdalena
Ślusarz Rafal
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2018

Every two years, groups worldwide participate in the Critical Assessment of Protein Structure Prediction (CASP) experiment to blindly test the strengths and weaknesses of their computational methods. CASP has significantly advanced the field, but many hurdles remain, which may require new ideas and collaborations. In 2012, a web-based effort called WeFold was initiated to promote collaboration within the CASP community and attract researchers from other fields to contribute new ideas to CASP. Members of the WeFold coopetition (cooperation and competition) participated in CASP as individual teams, but also shared components of their methods to create hybrid pipelines and actively contributed to this effort. We assert that the scale and diversity of integrative prediction pipelines could not have been achieved by any individual lab or even by any collaboration among a few partners. The models contributed by the participating groups and generated by the pipelines are publicly available at the WeFold website, providing a wealth of data that remains to be tapped. Here, we analyze the results of the 2014 and 2016 pipelines, showing improvements according to the CASP assessment as well as areas that require further adjustments and research.

Central Archive at the University of Reading

Publikationer från Linköpings universitet

eScholarship - University of California

Digitala Vetenskapliga Arkivet - Academic Archive On-line

MPG.PuRe

Fast methods for metagenomic sequence search and annotation

Author: Mirdita Milot
Publication venue: University Goettingen Repository
Publication date: 21/02/2022

The past two decades have seen the development of metagenomics, the study of genes and genomes of multiple organisms simultaneously. In contrast to traditional genomic techniques, which require isolating and growing individual organisms in the lab, in metagenomics, samples are directly taken from the environment, sequenced and then analyzed in silico. Modern sequencing techniques have enabled high throughput read-out of DNA and RNA of microorganism communities in marine, soil, gut and many other environments. The plethora of data generated using these techniques poses a major challenge for existing computational techniques. This burden translates directly to computational run times and the cost of resources required to carry out metagenomic analyses. Thus, computational methods developed for metagenomic analysis require exceptional efficiency and speed. At the same time, metagenomic studies become relevant for more and more fields of research, requiring that techniques be suited for a wide range of scientific disciplines. In this work, I present three methods I developed to address the throughput bottlenecks of data analysis in metagenomics. (1) The MMseqs2 webserver is a user-friendly extension of the popular homology search method MMseqs2 designed for non-expert bioinformaticians. I accelerated MMseqs2 to process single queries much more quickly and introduced an API to enable MMseqs2's use in web applications. (2) MMseqs2 taxonomy is a method for fast and accurate taxonomy assignment of metagenomic contigs. (3) ColabFold is a method to make the groundbreaking AlphaFold2 protein structure predictions widely accessible, accelerating its input sequence alignment generation and improving its accuracy by assembling a novel database enriched with metagenomic sequences from a multitude of datasets. 
These methods improve upon the state of the art by introducing novel algorithms and accelerating previous ones, such that previously infeasible analyses become possible, and by making our metagenomic toolbox accessible to users of a wide range of skill levels.

eDiss Georg-August-University Göttingen

Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold

Author: Mirdita Milot
Soeding Johannes
Steinegger Martin
Publication venue: Nature Publishing Group
Publication date: 01/07/2019

The open-source de novo protein-level assembler, Plass (https://plass.mmseqs.com), assembles six-frame-translated sequencing reads into protein sequences. It recovers 2-10 times more protein sequences from complex metagenomes and can assemble huge datasets. We assembled two redundancy-filtered reference protein catalogs, 2 billion sequences from 640 soil samples (soil reference protein catalog) and 292 million sequences from 775 marine eukaryotic metatranscriptomes (marine eukaryotic reference catalog), the largest free collections of protein sequences.
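
As an illustration of the six-frame translation mentioned in this abstract, the following is a minimal Python sketch. It is not Plass's implementation: the function names are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are simply rendered as "X".

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(read):
    """Translate a read in three forward and three reverse-strand frames."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translate("ATGGCCTAA").items():
        print(name, protein)
```

Each read yields six candidate protein fragments, with stop codons shown as "*". A real assembler would additionally have to decide which fragments to keep, which this sketch does not attempt.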

SNU Open Repository and Archive

ColabFold: making protein folding accessible to all

Author: Heo Lim
Mirdita Milot
Moriwaki Yoshitaka
Ovchinnikov Sergey
Schutze Konstantin
Steinegger Martin
Publication venue: Nature Publishing Group
Publication date: 01/01/2022

ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold's 40-60-fold faster search and optimized model utilization enables prediction of close to 1,000 structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/colabfold and its novel environmental databases are available at https://colab-fold.mmseqs.com.

SNU Open Repository and Archive

PubMed Central

MPG.PuRe

SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts

Author: Galiez Clovis
Levy Karin Eli
Mirdita Milot
Norroy Clovis
Söding Johannes
Zhang Ruoshi
Publication venue: Oxford University Press (OUP)
Publication date: 01/04/2021

Summary: SpacePHARER (CRISPR Spacer Phage–Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage–host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes. Availability and implementation: SpacePHARER is available as an open-source (GPLv3), user-friendly command-line software for Linux and macOS: https://github.com/soedinglab/spacepharer. Supplementary information: Supplementary data are available at Bioinformatics online.

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

PubMed Central

MPG.PuRe

Hal-Diderot

Uniclust databases of clustered and deeply annotated protein sequences and alignments

Author: Galiez Clovis
Martin Maria J.
Mirdita Milot
Soeding Johannes
Steinegger Martin
von den Driesch Lars
Publication venue: Oxford University Press
Publication date: 01/01/2017

We present three clustered protein sequence databases, Uniclust90, Uniclust50, Uniclust30 and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Due to its high sensitivity, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.

SNU Open Repository and Archive

MMseqs2 profile/profile: fast and ultra-sensitive searches beyond the twilight zone

Author: Galiez Clovis
Ji Hyunjoo
Karin Eli Levy
Mirdita Milot
Soeding Johannes
Sommer Hans-Georg
Steinegger Martin
Publication venue: BMC
Publication date: 01/12/2021

SNU Open Repository and Archive