112 research outputs found

A machine learning based framework to identify and classify long terminal repeat retrotransposons

Author: Blockeel Hendrik
Carareto Claudia MA
Cerri Ricardo
Costa Eduardo
Fischer Carlos N
Ramon Jan
Schietgat Leander
Vens Celine
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2018

Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-LEARNER, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework towards LTR retrotransposons, a particular type of TEs characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: REPEATMASKER, CENSOR and LTRDIGEST. In contrast to these methods, TE-LEARNER is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance, while being able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-LEARNER's predictions which did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Ghent University Academic Bibliography

Directory of Open Access Journals

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

On the complexity of haplotyping a microbial community

Author: Aubrey Wayne
Clare Amanda
Creevey Chris
de Grave Kurt
Nicholls Sam
Schietgat Leander
Publication venue: Cold Spring Harbor Laboratory
Publication date: 15/05/2021

Aberystwyth Research Portal

Recovery of gene haplotypes from a metagenome

Author: Aubrey Wayne
Clare Amanda
Creevey Christopher
de Grave Kurt
Edwards Arwyn
Huws Sharon
Nicholls Sam
Schietgat Leander
Soares Andre
Publication date: 22/11/2017

Queen's University Belfast Research Portal

Aberystwyth Research Portal

University of Birmingham Research Portal

Fast Label Embeddings via Randomized Linear Algebra

Author: A Beck
AJ Izenman
CR Rao
F Tai
H Wang
JM Geusebroek
L Breiman
L Schietgat
M Barker
M Cissé
N Halko
P Geladi
S Friedland
Y Nesterov
Publication date: 05/07/2015

Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state-of-the-art results. Comment: To appear in the proceedings of the ECML/PKDD 2015 conference. Reference implementation available at https://github.com/pmineiro/randembe
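The abstract above mentions a randomized algorithm for low dimensional label embeddings without giving its steps. As a rough illustration only, the sketch below shows a generic random Gaussian projection of a sparse label matrix, a sketching step commonly used in randomized linear algebra. It is not the authors' algorithm; the function name `gaussian_sketch` and its parameters are hypothetical.

```python
import random


def gaussian_sketch(label_rows, k, seed=0):
    """Project sparse label rows to k dimensions with a random Gaussian matrix.

    Each row is a set of active label indices (a 0/1 label vector).
    The result for a row is the sum of the Gaussian rows of its active labels,
    i.e. the product of the label vector with a random matrix Omega.
    Hypothetical helper for illustration, not the paper's method.
    """
    rng = random.Random(seed)
    omega = {}  # rows of Omega, generated lazily per label index
    sketch = []
    for labels in label_rows:
        vec = [0.0] * k
        for j in sorted(labels):  # fixed order keeps results reproducible
            if j not in omega:
                omega[j] = [rng.gauss(0.0, 1.0) for _ in range(k)]
            for d in range(k):
                vec[d] += omega[j][d]
        sketch.append(vec)
    return sketch


if __name__ == "__main__":
    rows = [{0, 2}, {1}, {0, 2}]
    for vec in gaussian_sketch(rows, k=4):
        print(vec)
```

Examples with identical label sets receive identical embeddings, and the full random matrix is never materialized, which matters when the label space is very large.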

arXiv.org e-Print Archive

Crossref

Predicting gene function using hierarchical multi-label decision tree ensembles

Author: A Clare
B Hayete
C Vens
Celine Vens
D Kocev
Dragi Kocev
E Zdobnov
F Provost
F Wilcoxon
G Obozinski
GR Lanckriet
H Blockeel
H Chua
H Drucker
H Lee
H Mewes
Hendrik Blockeel
J Davis
J Gough
J Quinlan
J Rousu
J Struyf
Jan Struyf
L Breiman
L Pena-Castillo
Leander Schietgat
M Ashburner
M Deng
M Ouali
N Cesa-Bianchi
O Troyanskaya
R Caruana
S Altschul
S Mostafavi
Sašo Džeroski
T Hughes
T Joachims
U Karaoz
W Kim
W Tian
Y Chen
Y Guan
Z Barutcuoglu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: S. cerevisiae, A. thaliana and M. musculus are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability. Results: We study the use of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use. Conclusions: Our results suggest that decision tree based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.
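The abstract above refers to predictions that respect a hierarchy of gene functions. A minimal sketch of that constraint follows, assuming a tree-shaped hierarchy stored as a child-to-parent mapping: a label set is consistent when every predicted function is accompanied by all of its ancestors. The category codes and function names are made up for illustration, and GO is a directed acyclic graph rather than a tree, so this is not the algorithm from the paper.

```python
def ancestors(label, parent):
    """Return all ancestors of `label` in a tree given as child -> parent."""
    result = []
    while label in parent:
        label = parent[label]
        result.append(label)
    return result


def close_under_hierarchy(labels, parent):
    """Add every ancestor of every label, so the set respects the hierarchy."""
    closed = set(labels)
    for label in labels:
        closed.update(ancestors(label, parent))
    return closed


def is_consistent(labels, parent):
    """A label set is consistent if it already contains all its ancestors."""
    return close_under_hierarchy(labels, parent) == set(labels)


if __name__ == "__main__":
    # Hypothetical toy hierarchy; codes are illustrative, not real categories.
    toy_parent = {"1.1": "1", "1.1.3": "1.1", "2.5": "2"}
    print(sorted(close_under_hierarchy({"1.1.3"}, toy_parent)))
    print(is_consistent({"1.1"}, toy_parent))
```

Training labels closed in this way keep a parent function at least as frequent as any of its children in every subset of the data.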

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Leiden University Scholary Publications

On the complexity of submap isomorphism and maximum common submap problems

Author: Akutsu
Braquelaire
Bunke
Christine Solnon
Colin de la Higuera
Combier
Damiand
Eppstein
Fradin
Guillaume Damiand
Horváth
Jean-Christophe Janodet
Jiang
Knuth
Koch
Lichtenstein
Lienhardt
Luks
Pinz
Rosenfeld
Schietgat
Solnon
Sorlin
Syslo
Trémeau
Zuckerman
Publication venue: Elsevier BV

Crossref

The evolutionary signal in metagenome phyletic profiles predicts many gene functions

Background. The function of many genes is still not known even in model organisms. An increasing availability of microbiome DNA sequencing data provides an opportunity to infer gene function in a systematic manner. Results. We evaluated if the evolutionary signal contained in metagenome phyletic profiles (MPP) is predictive of a broad array of gene functions. The MPPs are an encoding of environmental DNA sequencing data that consists of relative abundances of gene families across metagenomes. We find that such MPPs can accurately predict 826 Gene Ontology functional categories, while drawing on human gut microbiomes, ocean metagenomes, and DNA sequences from various other engineered and natural environments. Overall, in this task, the MPPs are highly accurate, and moreover they provide coverage for a set of Gene Ontology terms largely complementary to standard phylogenetic profiles, derived from fully sequenced genomes. We also find that metagenomes approximated from taxon relative abundance obtained via 16S rRNA gene sequencing may provide surprisingly useful predictive models. Crucially, the MPPs derived from different types of environments can infer distinct, non-overlapping sets of gene functions and therefore complement each other. Consistently, simulations on > 5000 metagenomes indicate that the amount of data is not in itself critical for maximizing predictive accuracy, while the diversity of sampled environments appears to be the critical factor for obtaining robust models. Conclusions. In past work, metagenomics has provided invaluable insight into ecology of various habitats, into diversity of microbial life and also into human health and disease mechanisms. We propose that environmental DNA sequencing additionally constitutes a useful tool to predict biological roles of genes, yielding inferences out of reach for existing comparative genomics approaches
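The abstract above describes metagenome phyletic profiles as relative abundances of gene families across metagenomes. A minimal sketch of that encoding is given below, assuming each metagenome is available as a dictionary of raw gene family counts. The function name, the input layout and the family names are assumptions for illustration, not the authors' pipeline.

```python
def phyletic_profile(counts_per_metagenome, gene_family):
    """Relative abundance of one gene family in each metagenome.

    `counts_per_metagenome` is a list of dicts mapping gene family -> count.
    A metagenome with no counts at all contributes 0.0.
    """
    profile = []
    for counts in counts_per_metagenome:
        total = sum(counts.values())
        if total:
            profile.append(counts.get(gene_family, 0) / total)
        else:
            profile.append(0.0)
    return profile


if __name__ == "__main__":
    # Hypothetical counts for three metagenomes.
    samples = [{"famA": 2, "famB": 6}, {"famB": 4}, {}]
    print(phyletic_profile(samples, "famA"))
```

Each gene family then has one vector with an entry per metagenome, which can serve as the feature vector for a function classifier.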

Crossref

ZENODO

Directory of Open Access Journals

Full-text Institutional Repository of the Ruđer Bošković Institute

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

A two-step target binding and selectivity support vector machines approach for virtual screening of dopamine receptor subtype-selective ligands

Author: A Bender
A Givehchi
A Monge
A Talevi
A Zhang
Arto Urtti
Bucong Han
BW Matthews
C Enguehard–Gueiffier
C Singer
C Zeng
CA Heidbreder
Chunyan Tan
CW Yap
D Erhan
D Huber
DG Sprous
DI Cho
DR Sibley
EY Chien
F Boeckler
F Micheli
G Aloisi
G Tsoumakas
H Dragos
H Li
H Sun
HJ Verheij
I Salama
J Bostrom
J Overington
J Zhang
JD Durrant
Jingxian Zhang
JJ Irwin
JR Quinlan
K Audouze
K Ehrlich
L Carro
L Herm
L Lopez
L Michielan
L Schietgat
LJ Bellis
LY Han
M Albersen
M Pilla
M-L Zhang
MJ Wester
MM Simpson
MWB Trotter
MY Cha
N Huang
NK Mishra
P Jenner
P Mahe
P Willett
Q Wang
R Burbidge
R Czerminski
R Kohavi
RA Johnson
RB McCall
RD Clark
S Lober
S Renner
T Sato
VN Vapnik
XH Liu
XH Ma
Xiaona Wei
Y Wang
Y Xue
Yuyang Jiang
Yuzong Chen
Z Shi
ZR Li
Publication venue: Public Library of Science (PLoS)
Publication date: 15/06/2012

DOI: 10.1371/journal.pone.0039076, PLoS ONE 7(6)

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

ScholarBank@NUS

FigShare

Open Babel: An open chemical toolbox

Author: A Amini
A Andronico
A Bender
A Gakh
A Karwath
A Maunz
A Poater
A Rappe
AA Gakh
AD Hill
B-b Yan
BD McKay
C Helma
C Reynès
Chris Morley
CR Jacob
Craig A James
CW Bullock
D Filimonov
D Lagorce
D Weininger
DC Bas
DC Lonie
DR Koes
F Fontaine
Geoffrey R Hutchison
GL Holliday
HL Morgan
I Wallach
IV Filippov
IV Tetko
J Ahmed
J Kazius
J Myers
J Wang
JH Chen
JJ Langham
JL Melville
JL Sharman
K Fogel
K Martin
L Fabian
L Liu
L Schietgat
M Brüstle
M Buehler
M Dehmer
M Konyk
M Krier
M Kuhn
MA Meineke
MA Miteva
Michael Banck
MJ Gómez
N O'Boyle
N Zonta
NM O'Boyle
Noel M O'Boyle
O Sperandio
P Lind
P Murray-Rust
P Rydberg
P Tosco
R Esposito
RA Bauer
RS Armen
S Arbor
S Ingsriswang
SV Trepalin
T Cheng
T Halgren
T Kogej
T Pencheva
Tim Vandermeersch
TWH Backman
U Schmidt
VV Mihaleva
William H Green
X Jiang
X Wang
YD Paila
Z Huang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: A frequent problem in computational modeling is the interconversion of chemical structures between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing problem due to the multitude of different application areas for chemistry data, differences in the data stored by different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-neutral formats. Results: We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics algorithms, from partial charge assignment and aromaticity detection, to bond order perception and canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and outline a variety of uses both in terms of software products and scientific research, including applications far beyond simple format interconversion. Conclusions: Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and substructure and similarity searching. For developers, it can be used as a programming library to handle chemical data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely available under an open-source license fro

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Irish Universities

PubMed Central

Cork Open Research Archive