32 research outputs found

Guest editorial: special issue on structured prediction

Author: C. H. Lampert
C. Sutton
C.-N. Hsu
Charles Parker
F. Lauer
F. Maes
G. H. Bakir
G. Neu
H. Daumé III
I. Tsochantaridis
Prasad Tadepalli
Y. Mao
Yasemin Altun
Publication venue: 'Springer Science and Business Media LLC'

Inferring latent task structure for Multitask Learning by Multiple Kernel Learning

Author: B Schölkopf
C Chang
C Leslie
Christian Widmer
F Bach
G Rätsch
G Schweikert
Gunnar Rätsch
H Daumé
H Daumé III
J Blitzer
J Robinson
L Bottou
L Jacob
M Kloft
Nora C Toussaint
P Gehler
R Caruana
S Sonnenburg
Schuller Ben-David
T Evgeniou
T Joachims
V Vapnik
Y Xue
Yasemin Altun
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract
Background: The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics, many problems can be cast into the Multitask Learning scenario by incorporating data from several organisms. However, combining information from several tasks requires careful consideration of the degree of similarity between tasks. Our proposed method simultaneously learns or refines the similarity between tasks along with the Multitask Learning classifier. This is done by formulating the Multitask Learning problem as Multiple Kernel Learning, using the recently published q-Norm MKL algorithm.
Results: We demonstrate the performance of our method on two problems from Computational Biology. First, we show that our method is able to improve performance on a splice site dataset with a given hierarchical task structure by refining the task relationships. Second, we consider an MHC-I dataset, for which we assume no knowledge about the degree of task relatedness. Here, we are able to learn the task similarities ab initio along with the Multitask classifiers. In both cases, we outperform the baseline methods that we compare against.
Conclusions: We present a novel approach to Multitask Learning that is capable of learning task similarity along with the classifiers. The framework is very general: it allows prior knowledge about task relationships to be incorporated if available, but is also able to identify task similarities in the absence of such prior information. Both variants show promising results in applications from Computational Biology.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

MPG.PuRe

Structured prediction with reinforcement learning

Author: A. Doan
F. Garcia
Francis Maes
H. Daumé III
I. Tsochantaridis
J. Baxter
J. Lafferty
L. Denoyer
L. Ramshaw
Ludovic Denoyer
Patrick Gallinari
R. Sutton
W. L. Ruzzo
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Investigating heterogeneous protein annotations toward cross-corpora utilization

Author: A Arnold
A Yeh
AM Cohen
B Alex
B Efron
C Nédellec
CJ Kuo
EFTK Sang
EW Noreen
F Rinaldi
F Sha
G Zhou
H Daumé III
H Shatkay
HL Johnson
J Wilbur
JD Kim
Jin-Dong Kim
Jun'ichi Tsujii
K Franzén
K Yoshida
KB Cohen
L Gillick
L Tanabe
MA Mandel
R Bunescu
R Kabiljo
RTH Tsai
Rune Sætre
S Pyysalo
Sampo Pyysalo
T Ohta
V Hatzivassiloglou
X Sun
Y Song
Y Wang
Yue Wang
Publication venue: BioMed Central
Publication date: 01/12/2009

Abstract
Background: The number of corpora, collections of structured texts, has been increasing as a result of the growing interest in applying natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is as yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources.
Results: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data instead of training with pure data; in some cases, the performance drops by nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aforementioned aspects.
Conclusion: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

An eScience-Bayes strategy for analyzing omics data

Author: A Gelman
A Isaksson
BP Carlin
C Desmedt
C Sotiriou
CF Taylor
CN Chi
CP Robert
D Milburn
D Muthas
D Talavera
EC Butcher
EL Kaplan
H Chuang
H Daumé III
HB Mann
HM Berman
Jarl ES Wikberg
JO Berger
JR Chen
L Ein-Dor
L Xu
LD Miller
M Xiao-Li
MA Stiffler
Martin Eklund
N Sha
O Spjuth
Ola Spjuth
P Murray-Rust
P Prusis
PCG da Costa
R Development Core Team
R Edgar
R Tonikian
RG Smock
RL Ho
S Gianni
S Lockless
S Michiels
SR Eddy
U Wickenberg-Bolin
Y Pawitan
Y Wang
Z Kutalik
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract
Background: The omics fields promise to revolutionize our understanding of biology and biomedicine. However, their potential is compromised by the challenge of analyzing the huge datasets produced. Analysis of omics data is plagued by the curse of dimensionality, resulting in imprecise estimates of model parameters and performance. Moreover, the integration of omics data with other data sources is difficult to shoehorn into classical statistical models. This has resulted in ad hoc approaches to address specific problems.
Results: We present a general approach to omics data analysis that alleviates these problems. By combining eScience and Bayesian methods, we retrieve scientific information and data from multiple sources and coherently incorporate them into large models. These models improve the accuracy of predictions and offer new insights into the underlying mechanisms. This "eScience-Bayes" approach is demonstrated in two proof-of-principle applications, one for breast cancer prognosis prediction from transcriptomic data and one for protein-protein interaction studies based on proteomic data.
Conclusions: Bayesian statistics provide the flexibility to tailor statistical models to the complex data structures in omics biology, as well as permitting coherent integration of multiple data sources. However, Bayesian methods are in general computationally demanding and require specification of possibly thousands of prior distributions. eScience can help us overcome these difficulties. The eScience-Bayes approach thus permits us to fully leverage the advantages of Bayesian methods, resulting in models with improved predictive performance that give more information about the underlying biological system.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

Author: A Abi-Haidar
A Ceol
A Chatr-aryamontri
A Cohen
A Kolchinsky
A Lourenco
A McCallum
A Ng
A Yeh
Alfonso Valencia
AM Cohen
Andrew Chatr-aryamontri
Andrew Winter
Ashish V Tendulkar
B Aranda
B Settles
BP Suomela
C Blaschke
C Elkan
C Stark
Charles Elkan
D Bauer
D Salgado
David Salgado
E Marcotte
F Ehrler
F Leitner
F Rinaldi
Fabio Rinaldi
Feifan Liu
Florian Leitner
G Andrew
Gerold Schneider
Gianni Cesareni
GL Poulter
Graciela Gonzalez
H Daumé III
H Hermjakob
H Shatkay
H Wang
Hagit Shatkay
HK Rekapalli
I Donaldson
J Lin
Jean-Fred Fontaine
JR Curran
Keith Noto
KG Dowell
L Tanabe
Leonardo Briganti
Livia Perfetto
Luana Licata
Luis Rocha
Luisa Castagnoli
M Hall
M Harris
M Hollander
M Krallinger
M Oberoi
Marta Iannuccelli
Martin Krallinger
Miguel A Andrade-Navarro
Miguel Vazquez
Mike Tyers
P Wang
R Chowdhary
R Hoffmann
Rafal Rak
Rezarta Islamaj Dogan
Robert Leaman
S Kim
S Matos
S Orchard
Sergio Matos
Shashank Agarwal
Sun Kim
T Kappeler
T Ono
T Zhang
W Baumgartner
W Hersh
W John Wilbur
W Wilbur
Xinglong Wang
Y Niu
Y Sasaki
Z Cao
Zhiyong Lu
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Determining the usefulness of biomedical text mining systems requires realistic task definitions and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose, the BCIII-ACT corpus was provided, which includes training, development and test sets of over 12,000 PPI-relevant and non-relevant PubMed abstracts labeled manually by domain experts, also recording the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full-text articles and the interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.
RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in the ACT and 8 in the IMT), and a total of 62 persons were involved either as participants or in preparing data sets and evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89%, and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, and the best MCC score was 0.55. For competitive systems with an acceptable recall (above 35%), the macro-averaged precision ranged between 50% and 80%, with a maximum F-score of 55%.
CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time needed with unranked results. Detecting associations between full-text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.

Crossref

Springer - Publisher Connector

Monash University Research Portal