22 research outputs found

An Expanded Evaluation of Protein Function Prediction Methods Shows an Improvement In Accuracy

Author: Almeida-e-Silva Danillo C.
Altenhoff Adrian
Babbitt Patricia C.
Bankapur Asma R.
Bargsten Joachim W.
Ben-Hur Asa
Benso Alfredo
Bhat Prajwal
BKC Dukka
Bonneau Richard
Brenner Steven E.
Bryson Kevin
Cao Renzhi
Casadio Rita
Cejuela Juan M.
Chapman Samuel
Chen Ching-Tai
Cheng Jianlin
Cibrian-Uhalte Elena
Clark Wyatt T.
Cozzetto Domenico
D'Andrea Daniel
Das Sayoni
Dawson Natalie L.
del Pozo Angela
Denny Paul
Dessimoz Christophe
Di Carlo Stefano
Dogan Tunca
ElShal Sarah
Falda Marco
Fang Hai
Feng Shou
Fernández José M.
Ferrari Carlo
Fontana Paolo
Foulger Rebecca E.
Friedberg Iddo
Funk Christopher S.
Gabaldon Toni
Gemovic Branislava
Gillis Jesse
Ginter Filip
Giollo Manuel
Glisic Sanja
Goldberg Tatyana
Gong Qingtian
Gough Julian
Greene Casey S.
Hakala Kai
Hamp Tobias
Hieta Reija
Holm Liisa
Hsu Wen-Lian
Huntley Rachael P.
Jiang Yuxiang
Jones David T.
Kaewphan Suwisa
Kahanda Indika
Kansakar Lakesh
Khan Ishita K.
Kihara Daisuke
Koo Da Chen Emily
Koskinen Patrik
Lavezzo Enrico
Lee David
Lees Jonathan G.
Legge Duncan
Lepore Rosalba
Li Biao
Lin Alexandra
Linial Michal
Lovering Ruth C.
Magrane Michele
Maietta Paolo
Marcet-Houben Marina
Martelli Pier Luigi
Martin Maria J.
Mehryary Farrokh
Melidoni Anna N.
Mesiti Marco
Minneci Federico
Mooney Sean D.
Moreau Yves
Mutowo-Meullenet Prudence
Nepusz Tamás
Ning Wei
O'Donovan Claire
Oates Matt
Ofer Dan
Orengo Christine A.
Oron Tal Ronnen
Paccanaro Alberto
Pavlidis Paul
Penfold-Brown Duncan
Perovic Vladimir
Pichler Klemens
Piovesan Damiano
Politano Gianfranco
Profiti Giuseppe
Radivojac Predrag
Rappoport Nadav
Re Matteo
Rehman Hafeez Ur
Richter Lothar
Robinson Peter N.
Romero Alfonso E.
Rost Burkhard
Sahraeian Sayed M.E.
Salakoski Tapio
Salamov Asaf
Sasidharan Rajkumar
Savino Alessandro
Sedeño-Cortés Adriana E.
Sharan Malvika
Shasha Dennis
Shypitsyna Aleksandra
Skunca Nives
Smithers Ben
Stern Amos
Sternberg Michael J.E.
Sillitoe Ian
Supek Fran
Tian Weidong
Toppo Stefano
Tosatto Silvio C.E.
Tramontano Anna
Tranchevent Léon-Charles
Tress Michael L.
Törönen Petri
Valencia Alfonso
Valentini Giorgio
van Dijk Aalt D.J.
Veljkovic Nevena
Veljkovic Veljko
Vencio Ricardo Z.N.
Verspoor Karin M.
Vogel Jörg
Vucetic Slobodan
Wang Zheng
Wass Mark N.
Yang Haixuan
Youngs Noah
Zakeri Pooya
Zhang Shanshan
Zhong Zhaolong
Zhou Yuanpeng
Publication venue: The Aquila Digital Community
Publication date: 07/09/2016

Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology-specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.

Aquila Digital Community (University of Southern Mississippi, USM)

An expanded evaluation of protein function prediction methods shows an improvement in accuracy

Author: Almeida-e-Silva Danillo C.
Altenhoff Adrian
Babbitt Patricia C.
Bankapur Asma R.
Bargsten Joachim W.
Ben-Hur Asa
Benso Alfredo
Bhat Prajwal
Bkc Dukka
Bonneau Richard
Brenner Steven E.
Bryson Kevin
Cao Renzhi
Casadio Rita
Cejuela Juan M.
Chapman Samuel
Chen Ching-Tai
Cheng Jianlin
Cibrian-Uhalte Elena
Clark Wyatt T.
Cozzetto Domenico
D'Andrea Daniel
Das Sayoni
Dawson Natalie L.
del Pozo Angela
Denny Paul
Dessimoz Christophe
Di Carlo Stefano
Dogan Tunca
ElShal Sarah
Falda Marco
Fang Hai
Feng Shou
Fernández José M.
Ferrari Carlo
Fontana Paolo
Foulger Rebecca E.
Friedberg Iddo
Funk Christopher S.
Gabaldon Toni
Gemovic Branislava
Gillis Jesse
Ginter Filip
Giollo Manuel
Glisic Sanja
Goldberg Tatyana
Gong Qingtian
Gough Julian
Greene Casey S.
Hakala Kai
Hamp Tobias
Hieta Reija
Holm Liisa
Hsu Wen-Lian
Huntley Rachael P.
Jiang Yuxiang
Jones David T.
Kaewphan Suwisa
Kahanda Indika
Kansakar Lakesh
Khan Ishita K.
Kihara Daisuke
Koo Da Chen Emily
Koskinen Patrik
Lavezzo Enrico
Lee David
Lees Jonathan G.
Legge Duncan
Lepore Rosalba
Li Biao
Lin Alexandra
Linial Michal
Lovering Ruth C.
Magrane Michele
Maietta Paolo
Marcet-Houben Marina
Martelli Pier Luigi
Martin Maria J.
Mehryary Farrokh
Melidoni Anna N.
Mesiti Marco
Minneci Federico
Mooney Sean D.
Moreau Yves
Mutowo-Meullenet Prudence
Nepusz Tamás
Ning Wei
O'Donovan Claire
Oates Matt
Ofer Dan
Orengo Christine A.
Oron Tal Ronnen
Paccanaro Alberto
Pavlidis Paul
Penfold-Brown Duncan
Perovic Vladimir
Pichler Klemens
Piovesan Damiano
Politano Gianfranco
Profiti Giuseppe
Radivojac Predrag
Rappoport Nadav
Re Matteo
Rehman Hafeez Ur
Richter Lothar
Robinson Peter N.
Romero Alfonso E.
Rost Burkhard
Sahraeian Sayed M.E.
Salakoski Tapio
Salamov Asaf
Sasidharan Rajkumar
Savino Alessandro
Sedeño-Cortés Adriana E.
Sharan Malvika
Shasha Dennis
Shypitsyna Aleksandra
Sillitoe Ian
Skunca Nives
Smithers Ben
Stern Amos
Sternberg Michael J.E.
Supek Fran
Tian Weidong
Toppo Stefano
Tosatto Silvio C.E.
Tramontano Anna
Tranchevent Léon-Charles
Tress Michael L.
Törönen Petri
Valencia Alfonso
Valentini Giorgio
van Dijk Aalt D.J.
Veljkovic Nevena
Veljkovic Veljko
Vencio Ricardo Z.N.
Verspoor Karin M.
Vogel Jörg
Vucetic Slobodan
Wang Zheng
Wass Mark N.
Yang Haixuan
Youngs Noah
Zakeri Pooya
Zhang Shanshan
Zhong Zhaolong
Zhou Yuanpeng
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

Brage HiM

Positive-Unlabeled Learning in the Context of Protein Function Prediction

Author: Youngs Noah
Publication venue: New York University
Publication date: 01/01/2014

With the recent proliferation of large, unlabeled data sets, a particular subclass of semisupervised learning problems has become more prevalent. Known as positive-unlabeled learning (PU learning), this scenario provides only positive labeled examples, usually just a small fraction of the entire dataset, with the remaining examples unknown and thus potentially belonging to either the positive or negative class. Since the vast majority of traditional machine learning classifiers require both positive and negative examples in the training set, a new class of algorithms has been developed to deal with PU learning problems. A canonical example of this scenario is topic labeling of a large corpus of documents. Once the size of a corpus reaches into the thousands, it becomes largely infeasible to have a curator read even a sizable fraction of the documents, and annotate them with topics. In addition, the entire set of topics may not be known, or may change over time, making it impossible for a curator to annotate which documents are NOT about certain topics. Thus a machine learning algorithm needs to be able to learn from a small set of positive examples, without knowledge of the negative class, and knowing that the unlabeled training examples may contain an arbitrary number of additional but as yet unknown positive examples. Another example of a PU learning scenario recently garnering attention is the protein function prediction problem (PFP problem). While the number of organisms with fully sequenced genomes continues to grow, the progress of annotating those sequences with the biological functions that they perform lags far behind. Machine learning methods have already been successfully applied to this problem, but with many organisms having a small number of positive annotated training examples, and the lack of availability of almost any labeled negative examples, PU learning algorithms have the potential to make large gains in predictive performance. 
The first part of this dissertation motivates the protein function prediction problem, explores previous work, and introduces novel methods that improve upon previously reported benchmarks for a particular type of learning algorithm, known as Gaussian Random Field Label Propagation (GRFLP). In addition, we present improvements to the computational efficiency of the GRFLP algorithm, and a modification to the traditional structure of the PFP learning problem that allows for simultaneous prediction across multiple species. The second part of the dissertation focuses specifically on the positive-unlabeled aspects of the PFP problem. Two novel algorithms are presented, and rigorously compared to existing PU learning techniques in the context of protein function prediction. Additionally, we take a step back and examine some of the theoretical considerations of the PU scenario in general, and provide an additional novel algorithm applicable in any PU context. This algorithm is tailored for situations in which the labeled positive examples are a small fraction of the set of true positive examples, and where the labeling process may be subject to some type of bias rather than being a random selection of true positives (arguably some of the most difficult PU learning scenarios). The third and fourth sections return to the PFP problem, examining the power of tertiary structure as a predictor of protein function, as well as presenting two case studies of function prediction performance on novel benchmarks. Lastly, we conclude with several promising avenues of future research into both PU learning in general, and the protein function prediction problem specifically.

ProQuest OAI Repository
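
The positive-unlabeled setting described in the abstract above can be illustrated with a toy two-step heuristic: treat the labeled positives as a cluster, then pick the unlabeled examples farthest from its centroid as tentative negatives. This is only a minimal sketch under stated assumptions, not one of the dissertation's algorithms; the function names, the Euclidean distance, and the toy feature vectors are all hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def select_reliable_negatives(positives, unlabeled, k):
    """Return the k unlabeled vectors farthest from the positive centroid.

    Assumption: examples far from every known positive are the least
    likely to be hidden positives. Real PU methods are more careful.
    """
    center = centroid(positives)
    ranked = sorted(unlabeled, key=lambda x: dist(x, center), reverse=True)
    return ranked[:k]


# Toy data: two labeled positives and four unlabeled examples.
positives = [[1.0, 1.0], [1.2, 0.8]]
unlabeled = [[1.1, 0.9], [5.0, 5.0], [4.0, 6.0], [0.9, 1.1]]
negatives = select_reliable_negatives(positives, unlabeled, k=2)
```

The two unlabeled points near the positives are left unlabeled rather than forced into the negative class, which is the core difference from treating every unlabeled example as a negative.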

Negative Example Selection for Protein Function Prediction: The NoGO Database

Author: Dennis Shasha (46527)
Duncan Penfold-Brown (577557)
Noah Youngs (577556)
Richard Bonneau (4318)
Publication date: 01/06/2014

Negative examples – genes that are known not to carry out a given protein function – are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state-of-the-art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html).

Directory of Open Access Journals

PubMed Central

FigShare

Performance measures for negative example prediction on the human genome.

Author: Dennis Shasha (46527)
Duncan Penfold-Brown (577557)
Noah Youngs (577556)
Richard Bonneau (4318)

The number of erroneous negative example predictions is plotted as a function of the number of negative examples chosen, for each of the three branches of GO. The Rocchio, NETL, and SNOB algorithms show consistently strong performance, with SNOB achieving the lowest error rate in each branch. The “Sibling” and “All non-positive as negative” heuristics have been omitted, as their poor performance dramatically skewed the scale of the images (see figure S3, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003644#pcbi.1003644.s003, for results including the sibling method).

FigShare
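
The curve described in the caption above (erroneous negatives as a function of the number of negatives chosen) can be sketched as a running count over a ranked list. This is a hedged illustration, not the authors' evaluation code: it assumes that an "erroneous" negative is a gene chosen as a negative that turns out to carry the annotation in held-out data, and the function name and gene identifiers are hypothetical.

```python
def erroneous_negative_curve(ranked_negatives, later_positives):
    """Cumulative number of chosen negatives that were actually positive.

    ranked_negatives: genes in the order an algorithm would choose them
                      as negative examples for one GO term.
    later_positives:  set of genes found to have the term in held-out
                      annotations (assumed definition of an error).
    Entry i of the result is the error count after choosing i + 1 genes.
    """
    errors = 0
    curve = []
    for gene in ranked_negatives:
        if gene in later_positives:
            errors += 1
        curve.append(errors)
    return curve


# Toy example: five chosen negatives, two of which later gained the term.
ranked = ["g1", "g2", "g3", "g4", "g5"]
later = {"g3", "g5"}
curve = erroneous_negative_curve(ranked, later)
```

A method whose curve stays at zero for longer keeps more of its highest-confidence negatives free of errors.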

Performance measures for function prediction.

Author: Dennis Shasha (46527)
Duncan Penfold-Brown (577557)
Noah Youngs (577556)
Richard Bonneau (4318)

AUC_ROC measures for function prediction using the best-performing negative example selection methods, with the random negative example selector included for comparison. Performance measures are broken up by ontology branch, and represent the average AUC_ROC for all GO terms predicted in that branch.

FigShare
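
The per-branch averaging described in the caption above can be sketched as follows: compute an AUC_ROC for each GO term, then take the mean within each ontology branch. This is a minimal illustration under stated assumptions, not the authors' code; the pairwise (Mann-Whitney) formulation of AUC, the function names, the branch labels, and the toy scores are chosen here for clarity.

```python
from collections import defaultdict


def auc_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc_by_branch(term_results):
    """Average per-term AUC within each branch.

    term_results: iterable of (branch, scores, labels), one per GO term.
    """
    per_branch = defaultdict(list)
    for branch, scores, labels in term_results:
        per_branch[branch].append(auc_roc(scores, labels))
    return {b: sum(v) / len(v) for b, v in per_branch.items()}


# Toy input: two terms in one branch and one term in another.
results = [
    ("MF", [0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),
    ("MF", [0.7, 0.2], [1, 0]),
    ("BP", [0.4, 0.6], [1, 0]),
]
branch_means = mean_auc_by_branch(results)
```

The pairwise form is quadratic in the number of genes per term, which is fine for a small illustration; a rank-based computation would scale better.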

Performance measures for RNA binding.

Author: Dennis Shasha (46527)
Duncan Penfold-Brown (577557)
Noah Youngs (577556)
Richard Bonneau (4318)

Performance of the competing algorithms on a specific GO category: GO:0003723 RNA binding, with validation data augmented by annotations taken from [16] (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003644#pcbi.1003644-Baltz1). The left panel shows the complete results, while the right panel is rescaled to show the differences between algorithms near the origin. The SNOB algorithm achieves the fewest false negatives for large numbers of negative examples, while the Rocchio and NETL algorithms maintain a zero false negative rate for a greater number of negative examples.

FigShare

Performance measures for mitochondrion organization.

Author: Dennis Shasha (46527)
Duncan Penfold-Brown (577557)
Noah Youngs (577556)
Richard Bonneau (4318)

ROC curves are depicted for each algorithm on the golden set of annotations for GO:0007005 in yeast, calculated through cross-validation. SNOB shows the highest area under the curve (AUC), followed by NETL and Rocchio, which have approximately equal AUCs.

FigShare