
20 research outputs found

Wide-scope biomedical named entity recognition and normalization with CRFs, fuzzy matching and character level modeling

Author: Filip Ginter
Kai Hakala
Niko Miekka
Suwisa Kaewphan
Tapio Salakoski
Publication venue: 'Oxford University Press (OUP)'
Publication date: 28/10/2022

We present a system for automatically identifying a multitude of biomedical entities from the literature. This work is based on our previous efforts in the BioCreative VI: Interactive Bio-ID Assignment shared task, in which our system demonstrated state-of-the-art performance with the highest achieved results in named entity recognition. In this paper we describe the original conditional random field-based system used in the shared task as well as experiments conducted since, including better hyperparameter tuning and character level modeling, which led to further performance improvements. For normalizing the mentions into unique identifiers we use fuzzy character n-gram matching. The normalization approach has also been improved with a better abbreviation resolution method and stricter guideline compliance, resulting in vastly improved results for various entity types. All tools and models used for both named entity recognition and normalization are publicly available under an open license.
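As a rough illustration of what fuzzy character n-gram matching can look like, the sketch below scores a mention against lexicon names by Jaccard overlap of padded character trigrams. This is an assumption made for illustration only: the abstract does not state the similarity measure, the n-gram size, or the threshold the authors use, and the function names, lexicon entries, and identifiers here are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, padded string."""
    pad = "#" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize(mention, lexicon, threshold=0.5):
    """Map a mention to the identifier of the most similar lexicon name.

    lexicon: dict of name -> identifier (hypothetical entries).
    threshold: assumed cut-off below which no identifier is returned.
    """
    grams = char_ngrams(mention)
    best_id, best_score = None, 0.0
    for name, identifier in lexicon.items():
        score = jaccard(grams, char_ngrams(name))
        if score > best_score:
            best_id, best_score = identifier, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Hypothetical lexicon with made-up identifiers.
lexicon = {"HeLa": "ID:1", "Escherichia coli": "ID:2"}
print(normalize("Escherichia colli", lexicon))  # misspelled mention
print(normalize("xyz", lexicon))                # unrelated mention
```

Set overlap tolerates small spelling variations because a single inserted or changed character only disturbs the few n-grams that span it.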

UTUPub

Cell line name recognition in support of the identification of synthetic lethality in cancer from text

Author: Ginter Filip
Kaewphan Suwisa
Ohta Tomoko
Pyysalo Sampo
Van de Peer Yves
Van Landeghem Sofie
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/10/2015

Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.

Ghent University Academic Bibliography

PubMed Central

UPSpace at the University of Pretoria

Finding novel relationships with integrated gene-gene association network analysis of Synechocystis sp. PCC 6803 using species-independent text-mining

Author: Filip Ginter
Patrik R. Jones
Sanna M. Kreula
Suwisa Kaewphan
Publication venue: 'PeerJ'
Publication date: 28/10/2022

The increasing move towards open access full-text scientific literature enhances our ability to utilize advanced text-mining methods to construct information-rich networks that no human will be able to grasp simply from 'reading the literature'. The utility of text-mining for well-studied species is obvious though the utility for less studied species, or those with no prior track-record at all, is not clear. Here we present a concept for how advanced text-mining can be used to create information-rich networks even for less well studied species and apply it to generate an open-access gene-gene association network resource for Synechocystis sp. PCC 6803, a representative model organism for cyanobacteria and first case-study for the methodology. By merging the text-mining network with networks generated from species-specific experimental data, network integration was used to enhance the accuracy of predicting novel interactions that are biologically relevant. A rule-based algorithm (filter) was constructed in order to automate the search for novel candidate genes with a high degree of likely association to known target genes by (1) ignoring established relationships from the existing literature, as they are already 'known', and (2) demanding multiple independent evidences for every novel and potentially relevant relationship. Using selected case studies, we demonstrate the utility of the network resource and filter to (i) discover novel candidate associations between different genes or proteins in the network, and (ii) rapidly evaluate the potential role of any one particular gene or protein. The full network is provided as an open-source resource.
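The two filter rules in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the edge representation, the `min_sources` threshold, the evidence source labels, and the gene names are all hypothetical.

```python
from collections import defaultdict


def novel_candidates(target, edges, known_pairs, min_sources=2):
    """Return genes newly associated with `target`.

    edges: iterable of (gene_a, gene_b, source) tuples, where `source`
        names the evidence network an association came from (assumed format).
    known_pairs: set of frozenset({gene_a, gene_b}) already established
        in the literature; rule (1) discards these.
    min_sources: rule (2), the number of independent evidence sources
        required (assumed default of 2).
    """
    support = defaultdict(set)
    for gene_a, gene_b, source in edges:
        if target not in (gene_a, gene_b):
            continue
        other = gene_b if gene_a == target else gene_a
        if other == target:
            continue  # ignore self-associations
        if frozenset((target, other)) in known_pairs:
            continue  # rule (1): already 'known'
        support[other].add(source)
    return sorted(g for g, sources in support.items()
                  if len(sources) >= min_sources)


# Hypothetical genes and evidence sources.
edges = [
    ("g1", "g2", "coexpression"),
    ("g2", "g1", "proteomics"),
    ("g1", "g3", "coexpression"),
    ("g1", "g4", "coexpression"),
    ("g4", "g1", "proteomics"),
    ("g5", "g6", "coexpression"),
]
known = {frozenset(("g1", "g4"))}
print(novel_candidates("g1", edges, known))
```

Storing the evidence per gene as a set of source names means repeated edges from the same source count only once, which is one simple way to read "independent" evidence.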

UTUPub

Proceedings of the 2013 Workshop on Biomedical Natural Language Processing (BioNLP'13)

Author: Filip Ginter
Sofie Van Landeghem
Suwisa Kaewphan
Yves Van de Peer
Publication venue: 'Indiana University Press (Project Muse)'
Publication date: 28/10/2022

UTUPub

Neural Network and Random Forest Models in Protein Function Prediction

Author: Björne Jari
Ginter Filip
Hakala Kai
Kaewphan Suwisa
Mehryary Farrokh
Moen Hans
Salakoski Tapio
Tolvanen Martti
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/10/2022

Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system automatically assigning Gene Ontology (GO) terms to the given input protein sequence. We develop an ensemble system which combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with an NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models together with the data pre-processing and feature generation tools are publicly available as open source software at https://github.com/TurkuNLP/CAFA3.

UTUPub

An expanded evaluation of protein function prediction methods shows an improvement in accuracy

Author: Almeida-e-Silva Danillo C.
Altenhoff Adrian
Babbitt Patricia C.
Bankapur Asma R.
Bargsten Joachim W.
Ben-Hur Asa
Benso Alfredo
Bhat Prajwal
Bkc Dukka
Bonneau Richard
Brenner Steven E.
Bryson Kevin
Cao Renzhi
Casadio Rita
Cejuela Juan M.
Chapman Samuel
Chen Ching-Tai
Cheng Jianlin
Cibrian-Uhalte Elena
Clark Wyatt T.
Cozzetto Domenico
D'Andrea Daniel
Das Sayoni
Dawson Natalie L.
del Pozo Angela
Denny Paul
Dessimoz Christophe
Di Carlo Stefano
Dogan Tunca
ElShal Sarah
Falda Marco
Fang Hai
Feng Shou
Fernández José M.
Ferrari Carlo
Fontana Paolo
Foulger Rebecca E.
Friedberg Iddo
Funk Christopher S.
Gabaldon Toni
Gemovic Branislava
Gillis Jesse
Ginter Filip
Giollo Manuel
Glisic Sanja
Goldberg Tatyana
Gong Qingtian
Gough Julian
Greene Casey S.
Hakala Kai
Hamp Tobias
Hieta Reija
Holm Liisa
Hsu Wen-Lian
Huntley Rachael P.
Jiang Yuxiang
Jones David T.
Kaewphan Suwisa
Kahanda Indika
Kansakar Lakesh
Khan Ishita K.
Kihara Daisuke
Koo Da Chen Emily
Koskinen Patrik
Lavezzo Enrico
Lee David
Lees Jonathan G.
Legge Duncan
Lepore Rosalba
Li Biao
Lin Alexandra
Linial Michal
Lovering Ruth C.
Magrane Michele
Maietta Paolo
Marcet-Houben Marina
Martelli Pier Luigi
Martin Maria J.
Mehryary Farrokh
Melidoni Anna N.
Mesiti Marco
Minneci Federico
Mooney Sean D.
Moreau Yves
Mutowo-Meullenet Prudence
Nepusz Tamás
Ning Wei
O'Donovan Claire
Oates Matt
Ofer Dan
Orengo Christine A.
Oron Tal Ronnen
Paccanaro Alberto
Pavlidis Paul
Penfold-Brown Duncan
Perovic Vladimir
Pichler Klemens
Piovesan Damiano
Politano Gianfranco
Profiti Giuseppe
Radivojac Predrag
Rappoport Nadav
Re Matteo
Rehman Hafeez Ur
Richter Lothar
Robinson Peter N.
Romero Alfonso E.
Rost Burkhard
Sahraeian Sayed M.E.
Salakoski Tapio
Salamov Asaf
Sasidharan Rajkumar
Savino Alessandro
Sedeño-Cortés Adriana E.
Sharan Malvika
Shasha Dennis
Shypitsyna Aleksandra
Sillitoe Ian
Skunca Nives
Smithers Ben
Stern Amos
Sternberg Michael J.E.
Supek Fran
Tian Weidong
Toppo Stefano
Tosatto Silvio C.E.
Tramontano Anna
Tranchevent Léon-Charles
Tress Michael L.
Törönen Petri
Valencia Alfonso
Valentini Giorgio
van Dijk Aalt D.J.
Veljkovic Nevena
Veljkovic Veljko
Vencio Ricardo Z.N.
Verspoor Karin M.
Vogel Jörg
Vucetic Slobodan
Wang Zheng
Wass Mark N.
Yang Haixuan
Youngs Noah
Zakeri Pooya
Zhang Shanshan
Zhong Zhaolong
Zhou Yuanpeng
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Keywords: Protein function prediction, Disease gene prioritization

Brage HiM

An Expanded Evaluation of Protein Function Prediction Methods Shows an Improvement In Accuracy

Author: Almeida-e-Silva Danillo C.
Altenhoff Adrian
Babbitt Patricia C.
Bankapur Asma R.
Bargsten Joachim W.
Ben-Hur Asa
Benso Alfredo
Bhat Prajwal
BKC Dukka
Bonneau Richard
Brenner Steven E.
Bryson Kevin
Cao Renzhi
Casadio Rita
Cejuela Juan M.
Chapman Samuel
Chen Ching-Tai
Cheng Jianlin
Cibrian-Uhalte Elena
Clark Wyatt T.
Cozzetto Domenico
D'Andrea Daniel
Das Sayoni
Dawson Natalie L.
del Pozo Angela
Denny Paul
Dessimoz Christophe
Di Carlo Stefano
Dogan Tunca
ElShal Sarah
Falda Marco
Fang Hai
Feng Shou
Fernández José M.
Ferrari Carlo
Fontana Paolo
Foulger Rebecca E.
Friedberg Iddo
Funk Christopher S.
Gabaldon Toni
Gemovic Branislava
Gillis Jesse
Ginter Filip
Giollo Manuel
Glisic Sanja
Goldberg Tatyana
Gong Qingtian
Gough Julian
Greene Casey S.
Hakala Kai
Hamp Tobias
Hieta Reija
Holm Liisa
Hsu Wen-Lian
Huntley Rachael P.
Jiang Yuxiang
Jones David T.
Kaewphan Suwisa
Kahanda Indika
Kansakar Lakesh
Khan Ishita K.
Kihara Daisuke
Koo Da Chen Emily
Koskinen Patrik
Lavezzo Enrico
Lee David
Lees Jonathan G.
Legge Duncan
Lepore Rosalba
Li Biao
Lin Alexandra
Linial Michal
Lovering Ruth C.
Magrane Michele
Maietta Paolo
Marcet-Houben Marina
Martelli Pier Luigi
Martin Maria J.
Mehryary Farrokh
Melidoni Anna N.
Mesiti Marco
Minneci Federico
Mooney Sean D.
Moreau Yves
Mutowo-Meullenet Prudence
Nepusz Tamás
Ning Wei
O'Donovan Claire
Oates Matt
Ofer Dan
Orengo Christine A.
Oron Tal Ronnen
Paccanaro Alberto
Pavlidis Paul
Penfold-Brown Duncan
Perovic Vladimir
Pichler Klemens
Piovesan Damiano
Politano Gianfranco
Profiti Giuseppe
Radivojac Predrag
Rappoport Nadav
Re Matteo
Rehman Hafeez Ur
Richter Lothar
Robinson Peter N.
Romero Alfonso E.
Rost Burkhard
Sahraeian Sayed M.E.
Salakoski Tapio
Salamov Asaf
Sasidharan Rajkumar
Savino Alessandro
Sedeño-Cortés Adriana E.
Sharan Malvika
Shasha Dennis
Shypitsyna Aleksandra
Skunca Nives
Smithers Ben
Stern Amos
Sternberg Michael J.E.
Sillitoe Ian
Supek Fran
Tian Weidong
Toppo Stefano
Tosatto Silvio C.E.
Tramontano Anna
Tranchevent Léon-Charles
Tress Michael L.
Törönen Petri
Valencia Alfonso
Valentini Giorgio
van Dijk Aalt D.J.
Veljkovic Nevena
Veljkovic Veljko
Vencio Ricardo Z.N.
Verspoor Karin M.
Vogel Jörg
Vucetic Slobodan
Wang Zheng
Wass Mark N.
Yang Haixuan
Youngs Noah
Zakeri Pooya
Zhang Shanshan
Zhong Zhaolong
Zhou Yuanpeng
Publication venue: The Aquila Digital Community
Publication date: 07/09/2016

Aquila Digital Community

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Author: Alborzi Seyed Ziaeddin
Altenhoff Adrian
Amezola Miguel
Antczak Magdalena
Aridhi Sabeur
Asgari Ehsaneddin
Atalay Volkan
Babbitt Patricia C.
Barot Meet
Ben-Hur Asa
Benso Alfredo
Bergquist Timothy R.
Berselli Michele
Bhat Prajwal
Björne Jari
Black Gage S.
Boecker Florian
Bonneau Richard
Borukhov Itamar
Bosco Giovanni
Boudellioua Imane
Brackenridge Danielle A.
Brenner Steven E.
Cao Renzhi
Carraro Marco
Casadio Rita
Cetin-Atalay Rengul
Chandler Caleb
Chang Jia-Ming
Cheng Jianlin
Chi Po-Han
Cozzetto Domenico
Crocker Alex W.
Dai Suyang
Dalkiran Alperen
Das Sayoni
Davidović Radoslav S.
Davis Larry
Dayton Jonathan B.
Dessimoz Christophe
Devignes Marie-Dominique
Di Carlo Stefano
Dogan Tunca
Dzeroski Saso
Emily Koo Da Chen
Fa Rui
Fabris Fabio
Falda Marco
Fang Hai
Fernández José M.
Fontana Paolo
Frank Yotam
Frasca Marco
Freddolino Peter L.
Freitas Alex A.
Friedberg Iddo
Gemovic Branislava
Georghiou George
Ginter Filip
Gligorijević Vladimir
Goldberg Tatyana
Gough Julian
Greene Casey S.
Grossi Giuliano
Hakala Kai
Hamid Md Nafiz
Hoehndorf Robert
Hogan Deborah A.
Holm Liisa
Hou Jie
Hurto Rebecca L.
Jain Aashish
Jeffery Constance J.
Jiang Yuxiang
Jo Dane
Johnson Devon
Jones David T.
Kacsoh Balint Z.
Kaewphan Suwisa
Kahanda Indika
Kihara Daisuke
Kulmanov Maxat
Larsen Dallas J.
Lavezzo Enrico
Lee Alexandra J.
Lees Jonathan Gill
Lewis Kimberley A.
Liao Wen-Hung
Lichtarge Olivier
Linial Michal
Liu Yi-Wei
Mao Qizhong
Martelli Pier Luigi
Martin Maria J.
McGuffin Liam
McHardy Alice C.
Medlar Alan J.
Mehryary Farrokh
Mesiti Marco
Moen Hans
Mofrad Mohammad R. K.
Mooney Sean D.
Nguyen Huy N.
Notaro Marco
Novikov Ilya
Omdahl Ashton R.
Orengo Christine A.
O’Donovan Claire
Paccanaro Alberto
Pascarelli Stefano
Perovic Vladimir R.
Petrini Alessandro
Piovesan Damiano
Politano Gianfranco
Profiti Giuseppe
Radivojac Predrag
Re Matteo
Reeb Jonas
Rehman Hafeez Ur
Renaux Alexandre
Rifaioglu Ahmet S.
Ritchie David W.
Roche Daniel B.
Rodriguez Jose Manuel
Romero Alfonso E.
Rose Peter W.
Rost Burkhard
Sagers Luke W.
Saidi Rabie
Salakoski Tapio
Savojardo Castrense
Sillitoe Ian
Suh Erica
Sumonja Neven
Supek Fran
Thurlby Natalie
Tian Weidong
Tolvanen Martti E. E.
Toppo Stefano
Torres Mateo
Tosatto Silvio C. E.
Tress Michael L.
Tseng Wei-Cheng
Törönen Petri
Valentini Giorgio
Veljkovic Nevena
Vesztrocy Alex Warwick
Vidulin Vedrana
Vucetic Slobodan
Wan Cen
Wang Zheng
Wass Mark N.
Wilkins Angela
Yang Haixuan
Yao Shuwei
You Ronghui
Yunes Jeffrey M.
Zhang Chengxin
Zhang Feng
Zhang Shanshan
Zhang Yang
Zhang Zihan
Zhao Chenguang
Zhou Naihui
Zhu Shanfeng
Zosa Elaine
Šmuc Tomislav
Publication date: 01/01/2019

Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

REPISALUD

Archivio istituzionale della ricerca - Università di Padova

Helmholtz Zentrum für Infektionsforschung Repository

Central Archive at the University of Reading

AIR Universita degli studi di Milano

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Repository of the Vinča Nuclear Institute (VinaR)

OpenMETU (Middle East Technical University)

Explore Bristol Research

Deep Blue Documents at the University of Michigan

Archivio istituzionale della ricerca - Fondazione Edmund Mach

HAL Clermont Université

HAL Descartes

University of Miami: Scholarship Miami

Helsingin yliopiston digitaalinen arkisto

Hal-Diderot

Hacettepe University Institutional Repository

Repository for Publications and Research Data

INRIA a CCSD electronic archive server

UCL Discovery

Kent Academic Repository

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Author: Aashish Jain
Adrian Altenhoff
Ahmet S. Rifaioglu
Alan J. Medlar
Alberto Paccanaro
Alessandro Petrini
Alex A. Freitas
Alex W. Crocker
Alex Warwick Vesztrocy
Alexandra J. Lee
Alexandre Renaux
Alfonso E. Romero
Alfredo Benso
Alice C. McHardy
Alperen Dalkıran
Angela Wilkins
Asa Ben-Hur
Ashton R. Omdahl
Balint Z. Kacsoh
Branislava Gemovic
Burkhard Rost
Caleb Chandler
Casey S. Greene
Castrense Savojardo
Cen Wan
Chenguang Zhao
Chengxin Zhang
Christine A. Orengo
Christophe Dessimoz
Claire O’Donovan
Constance J. Jeffery
Da Chen Emily Koo
Daisuke Kihara
Dallas J. Larsen
Damiano Piovesan
Dane Jo
Daniel B. Roche
Danielle A. Brackenridge
David T. Jones
David W. Ritchie
Deborah A. Hogan
Devon Johnson
Domenico Cozzetto
Ehsaneddin Asgari
Elaine Zosa
Enrico Lavezzo
Erica Suh
Fabio Fabris
Farrokh Mehryary
Feng Zhang
Filip Ginter
Florian Boecker
Fran Supek
Gage S. Black
George Georghiou
Gianfranco Politano
Giorgio Valentini
Giovanni Bosco
Giuliano Grossi
Giuseppe Profiti
Hafeez Ur Rehman
Hai Fang
Haixuan Yang
Hans Moen
Heiko Schoof
Huy N. Nguyen
Ian Sillitoe
Iddo Friedberg
Ilya Novikov
Imane Boudellioua
Indika Kahanda
Itamar Borukhov
Jari Björne
Jeffrey M. Yunes
Jia-Ming Chang
Jianlin Cheng
Jie Hou
Jonas Reeb
Jonathan B. Dayton
Jonathan Gill Lees
Jose Manuel Rodriguez
José M. Fernández
Julian Gough
Kai Hakala
Kimberley A. Lewis
Larry Davis
Liam J. McGuffin
Liisa Holm
Magdalena Antczak
Marco Carraro
Marco Falda
Marco Frasca
Marco Mesiti
Marco Notaro
Maria J. Martin
Marie-Dominique Devignes
Mark N. Wass
Martti E.E. Tolvanen
Mateo Torres
Matteo Re
Maxat Kulmanov
Md Nafiz Hamid
Meet Barot
Michael L. Tress
Michal Linial
Michele Berselli
Miguel Amezola
Mohammad R.K. Mofrad
Naihui Zhou
Natalie Thurlby
Neven Sumonja
Nevena Veljkovic
Olivier Lichtarge
Paolo Fontana
Patricia C. Babbitt
Peter L. Freddolino
Peter W. Rose
Petri Törönen
Pier Luigi Martelli
Po-Han Chi
Prajwal Bhat
Predrag Radivojac
Qizhong Mao
Rabie Saidi
Radoslav S. Davidović
Rebecca L. Hurto
Rengul Cetin Atalay
Renzhi Cao
Richard Bonneau
Rita Casadio
Robert Hoehndorf
Ronghui You
Rui Fa
Sabeur Aridhi
Saso Dzeroski
Sayoni Das
Sean D. Mooney
Seyed Ziaeddin Alborzi
Shanfeng Zhu
Shanshan Zhang
Shuwei Yao
Silvio C.E. Tosatto
Slobodan Vucetic
Stefano Di Carlo
Stefano Pascarelli
Stefano Toppo
Steven E. Brenner
Suwisa Kaewphan
Suyang Dai
Tapio Salakoski
Tatyana Goldberg
Timothy R. Bergquist
Tomislav Šmuc
Tunca Dogan
Vedrana Vidulin
Vladimir Gligorijević
Vladimir R. Perovic
Volkan Atalay
Wei-Cheng Tseng
Weidong Tian
Wen-Hung Liao
Yang Zhang
Yi-Wei Liu
Yotam Frank
Yuxiang Jiang
Zheng Wang
Zihan Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 27/10/2022

Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

UTUPub