24 research outputs found

Fast Label Embeddings via Randomized Linear Algebra

Authors: A Beck, AJ Izenman, CR Rao, F Tai, H Wang, JM Geusebroek, L Breiman, L Schietgat, M Barker, M Cissé, N Halko, P Geladi, S Friedland, Y Nesterov
Publication date: 05/07/2015

Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work, we use a correspondence between rank-constrained estimation and low-dimensional label embeddings to derive a fast label embedding algorithm that works in both the multiclass and multilabel settings. The result is a randomized algorithm whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state-of-the-art results.

Comment: To appear in the proceedings of the ECML/PKDD 2015 conference. Reference implementation available at https://github.com/pmineiro/randembe
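The abstract does not spell out the algorithm, so the following is only a generic illustration of the kind of randomized linear algebra primitive the title refers to: a randomized low-rank SVD built from a random range finder. The function name `randomized_svd` and its parameters (`oversample`, `n_power_iter`) are assumptions chosen for this sketch. It is not the authors' label embedding method or their reference implementation.

```python
import numpy as np


def randomized_svd(A, rank, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top-`rank` SVD of A with a random range finder.

    Generic sketch (hypothetical helper, not the paper's algorithm).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    k = min(rank + oversample, n)

    # Sample the column space of A with a Gaussian test matrix.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k)))

    # A few power iterations sharpen the captured subspace;
    # re-orthonormalizing each time keeps the iteration stable.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # Solve the small projected problem exactly, then lift back.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic matrix of exact rank 5, standing in for a large data matrix.
    A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 80))
    U, s, Vt = randomized_svd(A, rank=5)
    print(np.linalg.norm(U @ np.diag(s) @ Vt - A))
```

The expensive products involve only a tall, thin matrix with `rank + oversample` columns, which is what makes this family of methods attractive when the output space is very large.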

arXiv.org e-Print Archive

Crossref

Predicting gene function using hierarchical multi-label decision tree ensembles

Authors: A Clare, B Hayete, C Vens, Celine Vens, D Kocev, Dragi Kocev, E Zdobnov, F Provost, F Wilcoxon, G Obozinski, GR Lanckriet, H Blockeel, H Chua, H Drucker, H Lee, H Mewes, Hendrik Blockeel, J Davis, J Gough, J Quinlan, J Rousu, J Struyf, Jan Struyf, L Breiman, L Pena-Castillo, Leander Schietgat, M Ashburner, M Deng, M Ouali, N Cesa-Bianchi, O Troyanskaya, R Caruana, S Altschul, S Mostafavi, Sašo Džeroski, T Hughes, T Joachims, U Karaoz, W Kim, W Tian, Y Chen, Y Guan, Z Barutcuoglu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: S. cerevisiae, A. thaliana and M. musculus are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability. Results: We study the use of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use. Conclusions: Our results suggest that decision tree based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.
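To make "respecting a given hierarchy" and "ensembles of such trees" concrete, here is a minimal sketch under stated assumptions. The toy hierarchy `PARENT`, the helper names, and the choice to average per-class scores across trees and then threshold are all illustrative. The abstract does not describe how the authors' algorithm combines trees or enforces the hierarchy.

```python
from typing import Dict, List, Optional, Set

# Hypothetical FunCat-style toy hierarchy: class -> parent (None = top level).
PARENT: Dict[str, Optional[str]] = {
    "01": None,
    "01.01": "01",
    "01.01.03": "01.01",
    "02": None,
    "02.07": "02",
}


def with_ancestors(labels: Set[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Close a label set upward so every predicted class implies its ancestors."""
    closed: Set[str] = set()
    for label in labels:
        node: Optional[str] = label
        # Stop early at a node already in the set: its ancestors are there too.
        while node is not None and node not in closed:
            closed.add(node)
            node = parent[node]
    return closed


def ensemble_predict(tree_scores: List[Dict[str, float]], threshold: float) -> Set[str]:
    """Average per-class scores over trees, threshold, then enforce the hierarchy.

    Averaging followed by thresholding is an assumption made for this sketch.
    """
    classes = tree_scores[0].keys()
    mean = {c: sum(t[c] for t in tree_scores) / len(tree_scores) for c in classes}
    predicted = {c for c, score in mean.items() if score >= threshold}
    return with_ancestors(predicted, PARENT)


if __name__ == "__main__":
    scores = [
        {"01": 0.9, "01.01": 0.4, "01.01.03": 0.7, "02": 0.1, "02.07": 0.2},
        {"01": 0.8, "01.01": 0.3, "01.01.03": 0.6, "02": 0.2, "02.07": 0.1},
    ]
    print(sorted(ensemble_predict(scores, threshold=0.5)))
```

In the example, the intermediate class "01.01" falls below the threshold on its own but is restored by the upward closure, so the final prediction stays consistent with the hierarchy.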

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Leiden University Scholarly Publications

A two-step target binding and selectivity support vector machines approach for virtual screening of dopamine receptor subtype-selective ligands

Authors: A Bender, A Givehchi, A Monge, A Talevi, A Zhang, Arto Urtti, Bucong Han, BW Matthews, C Enguehard-Gueiffier, C Singer, C Zeng, CA Heidbreder, Chunyan Tan, CW Yap, D Erhan, D Huber, DG Sprous, DI Cho, DR Sibley, EY Chien, F Boeckler, F Micheli, G Aloisi, G Tsoumakas, H Dragos, H Li, H Sun, HJ Verheij, I Salama, J Bostrom, J Overington, J Zhang, JD Durrant, Jingxian Zhang, JJ Irwin, JR Quinlan, K Audouze, K Ehrlich, L Carro, L Herm, L Lopez, L Michielan, L Schietgat, LJ Bellis, LY Han, M Albersen, M Pilla, M-L Zhang, MJ Wester, MM Simpson, MWB Trotter, MY Cha, N Huang, NK Mishra, P Jenner, P Mahe, P Willett, Q Wang, R Burbidge, R Czerminski, R Kohavi, RA Johnson, RB McCall, RD Clark, S Lober, S Renner, T Sato, VN Vapnik, XH Liu, XH Ma, Xiaona Wei, Y Wang, Y Xue, Yuyang Jiang, Yuzong Chen, Z Shi, ZR Li
Publication venue: Public Library of Science (PLoS)
Publication date: 15/06/2012

DOI: 10.1371/journal.pone.0039076
PLoS ONE 7(6)

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

ScholarBank@NUS

FigShare

The evolutionary signal in metagenome phyletic profiles predicts many gene functions

Background. The function of many genes is still not known even in model organisms. An increasing availability of microbiome DNA sequencing data provides an opportunity to infer gene function in a systematic manner. Results. We evaluated whether the evolutionary signal contained in metagenome phyletic profiles (MPPs) is predictive of a broad array of gene functions. The MPPs are an encoding of environmental DNA sequencing data that consists of relative abundances of gene families across metagenomes. We find that such MPPs can accurately predict 826 Gene Ontology functional categories, while drawing on human gut microbiomes, ocean metagenomes, and DNA sequences from various other engineered and natural environments. Overall, in this task, the MPPs are highly accurate, and moreover they provide coverage for a set of Gene Ontology terms largely complementary to standard phylogenetic profiles derived from fully sequenced genomes. We also find that metagenomes approximated from taxon relative abundance obtained via 16S rRNA gene sequencing may provide surprisingly useful predictive models. Crucially, the MPPs derived from different types of environments can infer distinct, non-overlapping sets of gene functions and therefore complement each other. Consistently, simulations on more than 5000 metagenomes indicate that the amount of data is not in itself critical for maximizing predictive accuracy, while the diversity of sampled environments appears to be the critical factor for obtaining robust models. Conclusions. In past work, metagenomics has provided invaluable insight into the ecology of various habitats, into the diversity of microbial life and also into human health and disease mechanisms. We propose that environmental DNA sequencing additionally constitutes a useful tool to predict biological roles of genes, yielding inferences out of reach for existing comparative genomics approaches.
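As a rough illustration of the encoding described above (relative abundances of gene families across metagenomes), the sketch below turns a count matrix into per-metagenome relative abundances. The function name `phyletic_profiles`, the toy counts, and the simple column-sum normalization are assumptions. The abstract does not state the exact normalization the authors use.

```python
import numpy as np


def phyletic_profiles(counts):
    """Convert gene-family counts into relative abundances per metagenome.

    `counts` has one row per gene family and one column per metagenome.
    Each row of the result is that family's profile across metagenomes.
    Column-sum normalization is an assumption made for this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)
    # Avoid division by zero for metagenomes with no counted reads.
    safe_totals = np.where(totals == 0, 1.0, totals)
    return counts / safe_totals


if __name__ == "__main__":
    # Hypothetical counts: 3 gene families (rows) x 3 metagenomes (columns).
    counts = [[10, 0, 5],
              [30, 20, 5],
              [60, 80, 0]]
    print(phyletic_profiles(counts))
```

Each row could then serve as a feature vector for a classifier that predicts whether the gene family belongs to a given functional category.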

Crossref

ZENODO

Directory of Open Access Journals

Full-text Institutional Repository of the Ruđer Bošković Institute

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Open Babel: An open chemical toolbox

Authors: A Amini, A Andronico, A Bender, A Gakh, A Karwath, A Maunz, A Poater, A Rappe, AA Gakh, AD Hill, B-b Yan, BD McKay, C Helma, C Reynès, Chris Morley, CR Jacob, Craig A James, CW Bullock, D Filimonov, D Lagorce, D Weininger, DC Bas, DC Lonie, DR Koes, F Fontaine, Geoffrey R Hutchison, GL Holliday, HL Morgan, I Wallach, IV Filippov, IV Tetko, J Ahmed, J Kazius, J Myers, J Wang, JH Chen, JJ Langham, JL Melville, JL Sharman, K Fogel, K Martin, L Fabian, L Liu, L Schietgat, M Brüstle, M Buehler, M Dehmer, M Konyk, M Krier, M Kuhn, MA Meineke, MA Miteva, Michael Banck, MJ Gómez, N O'Boyle, N Zonta, NM O'Boyle, Noel M O'Boyle, O Sperandio, P Lind, P Murray-Rust, P Rydberg, P Tosco, R Esposito, RA Bauer, RS Armen, S Arbor, S Ingsriswang, SV Trepalin, T Cheng, T Halgren, T Kogej, T Pencheva, Tim Vandermeersch, TWH Backman, U Schmidt, VV Mihaleva, William H Green, X Jiang, X Wang, YD Paila, Z Huang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: A frequent problem in computational modeling is the interconversion of chemical structures between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing problem due to the multitude of different application areas for chemistry data, differences in the data stored by different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-neutral formats. Results: We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics algorithms, from partial charge assignment and aromaticity detection, to bond order perception and canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and outline a variety of uses both in terms of software products and scientific research, including applications far beyond simple format interconversion. Conclusions: Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and substructure and similarity searching. For developers, it can be used as a programming library to handle chemical data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely available under an open-source license fro
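As a small illustration of using Open Babel as a programming library, the sketch below reads a SMILES string through the `pybel` Python bindings and writes it back out as an MDL molfile. It assumes the bindings are installed and uses the Open Babel 3.x import path (`from openbabel import pybel`); the 2.3 release discussed in the abstract exposed the module as a top-level `import pybel` instead. The helper name `describe` is hypothetical.

```python
try:
    # Import path for Open Babel 3.x; the 2.x series used `import pybel`.
    from openbabel import pybel
except ImportError:
    pybel = None  # bindings not installed; describe() will raise


def describe(smiles):
    """Parse a SMILES string and report a few properties plus a molfile."""
    if pybel is None:
        raise RuntimeError("Open Babel Python bindings are not installed")
    mol = pybel.readstring("smi", smiles)
    return {
        "formula": mol.formula,        # molecular formula
        "molwt": mol.molwt,            # molecular weight
        "molfile": mol.write("mol"),   # same molecule in MDL molfile format
    }


if __name__ == "__main__":
    if pybel is not None:
        info = describe("CCO")  # ethanol
        print(info["formula"], round(info["molwt"], 2))
        print(info["molfile"])
```

The same read-then-write pattern applies to other format codes supported by the installed build, which is the interconversion use case the abstract emphasizes.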

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Irish Universities

PubMed Central

Cork Open Research Archive

Multi-Task Drug Bioactivity Classification with Graph Labeling Ensembles

Authors: A. Ceroni, A. Esuli, D. Opitz, H. Su, J. Rousu, L. Breiman, L. Ralaivola, L. Schietgat, M. Trotter, N. Meinshausen, O. Obrezanova, P. Shivakumar, R.E. Schapire, T. Hastie, Y. Wang
Publication date: 01/01/2011

We present a new method for drug bioactivity classification based on learning an ensemble of multi-task classifiers. As the base classifiers of the ensemble we use Max-Margin Conditional Random Field (MMCRF) models, which have previously obtained state-of-the-art accuracy on this problem. MMCRF relies on a graph structure coupling the set of tasks together, and thus turns the multi-task learning problem into a graph labeling problem. In our ensemble method, the graphs of the base classifiers are random, constructed by random pairing or random spanning tree extraction over the set of tasks. We compare the ensemble approaches on datasets containing the cancer inhibition potential of drug-like molecules against 60 cancer cell lines. In our experiments, we find that ensembles based on random graphs surpass the accuracy of a single SVM as well as a single MMCRF model relying on a graph built from auxiliary data.
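The two random graph constructions named in the abstract can be sketched as follows. This is only an illustration of what "random pairing" and "random spanning tree" over a task set might look like. The function names are hypothetical, the abstract does not give the authors' sampling procedure, and the tree sampler below (attach each task to a randomly chosen earlier task) is an assumption that does not sample uniformly over all spanning trees.

```python
import random


def random_pairing(tasks, rng):
    """Shuffle the tasks and join consecutive ones into disjoint pairs."""
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    # With an odd number of tasks, the last one stays unpaired.
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]


def random_spanning_tree(tasks, rng):
    """Build a random tree that connects all tasks with len(tasks) - 1 edges.

    Each task, in shuffled order, is attached to a uniformly chosen task that
    already belongs to the tree (a random recursive tree, an assumption here).
    """
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    edges = []
    for i in range(1, len(shuffled)):
        parent = shuffled[rng.randrange(i)]
        edges.append((parent, shuffled[i]))
    return edges


if __name__ == "__main__":
    rng = random.Random(0)
    cell_lines = [f"cell_line_{i}" for i in range(60)]  # placeholder task names
    # One random graph per base classifier in the ensemble.
    ensemble_graphs = [random_spanning_tree(cell_lines, rng) for _ in range(5)]
    print(len(ensemble_graphs), len(ensemble_graphs[0]))
    print(random_pairing(cell_lines, rng)[:3])
```

Each sampled edge list would define the coupling graph for one base model, so different ensemble members see different task dependencies.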

CiteSeerX

Crossref