
69 research outputs found

A machine learning based framework to identify and classify long terminal repeat retrotransposons

Author: Blockeel Hendrik
Carareto Claudia MA
Cerri Ricardo
Costa Eduardo
Fischer Carlos N
Ramon Jan
Schietgat Leander
Vens Celine
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2018

Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-LEARNER, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a particular type of TE characterized by long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: REPEATMASKER, CENSOR and LTRDIGEST. In contrast to these methods, TE-LEARNER is the first to incorporate machine learning techniques, outperforming them in terms of predictive performance while learning models and making predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-LEARNER's predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Ghent University Academic Bibliography

Directory of Open Access Journals

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

On the complexity of haplotyping a microbial community

Author: Aubrey Wayne
Clare Amanda
Creevey Chris
de Grave Kurt
Nicholls Sam
Schietgat Leander
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 15/05/2021

Aberystwyth Research Portal

Recovery of gene haplotypes from a metagenome

Author: Aubrey Wayne
Clare Amanda
Creevey Christopher
de Grave Kurt
Edwards Arwyn
Huws Sharon
Schietgat Leander
Nicholls Sam
Soares Andre
Publication date: 22/11/2017

Elucidation of population-level diversity of microbiomes is a significant step towards a complete understanding of the evolutionary, ecological and functional importance of microbial communities. Characterizing this diversity requires the recovery of the exact DNA sequence (haplotype) of each gene isoform from every individual present in the community. To address this, we present Hansel and Gretel: a freely available data structure and algorithm, providing a software package that reconstructs the most likely haplotypes from metagenomes. We demonstrate recovery of haplotypes from short-read Illumina data for a bovine rumen microbiome, and verify that our predictions are 100% accurate with long-read PacBio CCS sequencing. We show that Gretel's haplotypes can be analyzed to determine a significant difference in mutation rates between core and accessory gene families in an ovine rumen microbiome. All tools, documentation and data for evaluation are open source and available via our repository: https://github.com/samstudio8/gretel

Queen's University Belfast Research Portal

Crossref

Aberystwyth Research Portal

University of Birmingham Research Portal

Predicting gene function using hierarchical multi-label decision tree ensembles

Author: A Clare
B Hayete
C Vens
Celine Vens
D Kocev
Dragi Kocev
E Zdobnov
F Provost
F Wilcoxon
G Obozinski
GR Lanckriet
H Blockeel
H Chua
H Drucker
H Lee
H Mewes
Hendrik Blockeel
J Davis
J Gough
J Quinlan
J Rousu
J Struyf
Jan Struyf
L Breiman
L Pena-Castillo
Leander Schietgat
M Ashburner
M Deng
M Ouali
N Cesa-Bianchi
O Troyanskaya
R Caruana
S Altschul
S Mostafavi
Sašo Džeroski
T Hughes
T Joachims
U Karaoz
W Kim
W Tian
Y Chen
Y Guan
Z Barutcuoglu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: S. cerevisiae, A. thaliana and M. musculus are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability. Results: We study the use of decision-tree-based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use. Conclusions: Our results suggest that decision-tree-based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.
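The abstract's requirement that predictions respect a given hierarchy of gene functions can be illustrated with a minimal sketch. It is not the authors' algorithm: it only shows the constraint itself, namely that a predicted function implies all of its ancestors. The toy hierarchy `PARENT` and the function names are hypothetical, and the sketch assumes a tree-shaped hierarchy in which each class has at most one parent.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy hierarchy (child -> parent); the codes are illustrative
# and are not taken from FunCat or GO.
PARENT: Dict[str, str] = {
    "1.1": "1",
    "1.2": "1",
    "1.1.3": "1.1",
    "2.4": "2",
}


def ancestors(label: str, parent: Dict[str, str]) -> Set[str]:
    """Collect all ancestors of a label by following child -> parent links."""
    found: Set[str] = set()
    while label in parent:
        label = parent[label]
        found.add(label)
    return found


def close_under_hierarchy(labels: Iterable[str], parent: Dict[str, str]) -> Set[str]:
    """Add every ancestor so that the label set satisfies the hierarchy constraint."""
    closed: Set[str] = set()
    for label in labels:
        closed.add(label)
        closed |= ancestors(label, parent)
    return closed


def respects_hierarchy(labels: Iterable[str], parent: Dict[str, str]) -> bool:
    """Check that each predicted label's parent is also predicted."""
    label_set = set(labels)
    return all(parent[l] in label_set for l in label_set if l in parent)
```

For example, `close_under_hierarchy({"1.1.3"}, PARENT)` adds the ancestors `"1.1"` and `"1"`, after which `respects_hierarchy` holds for the resulting set.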

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Leiden University Scholary Publications

Graph-based data mining for biological applications

Author: Schietgat Leander
Publication venue: 'IOS Press'
Publication date: 01/01/2011

In many real-world problems, one deals with input or output data that are structured. This thesis investigates the use of graphs as a representation for structured data and introduces relational learning techniques that can efficiently process them. We apply the techniques to two biological problems. On the one hand, we use decision trees to predict the functions of genes, of which the hierarchical relationships can be structured as a graph. On the other hand, we predict chemical activity of molecules by representing them as graphs. We show that, by exploiting graph properties, efficient learning techniques can be developed. It turns out that in both cases, the relational models are not only learned more efficiently, but their predictive performance significantly improves as well.

Lirias

Graph-Based Data Mining for Biological Applications (Graafgebaseerde datamining voor biologische toepassingen)

Author: Schietgat Leander
Publication date: 28/05/2010

The research in this thesis is situated in the domain of relational learning. In particular, we propose learning algorithms that build models for structured data based on graphs. The main goal of this thesis is to increase the efficiency of relational learning algorithms, as well as their applicability to problems from biology and chemistry. In the first part, we study hierarchical multi-label classification (HMC), a variant of classification in which an example can belong to multiple classes and in which the classes are organized in a hierarchy. This hierarchy can be represented by a graph, and the output of an HMC model consists of one or more paths in this graph. An important application of HMC is the prediction of gene functions. It is known that a gene can have multiple functions, and biologists have organized these functions into hierarchies. Instead of using a method that learns an independent model for each class, we propose a method that learns a single model that predicts all classes at once. We show that, in the context of decision trees, this method results in models that are not only learned more efficiently, but that also perform better in terms of predictive performance, complexity and interpretability. When we compare with state-of-the-art techniques for gene function prediction, we find that the proposed HMC method has a higher efficiency and a comparable predictive performance. In the second part, we consider learning algorithms whose input is represented by graphs. The application we have in mind here is the learning of structure-activity relationships (SAR). The goal of SAR is to predict properties of molecules from their chemical structure. To make the learning algorithms more efficient, we exploit specific properties of molecular graphs. Because most molecules can be represented by outerplanar graphs, and because block-and-bridge-preserving (BBP) subgraph isomorphism turns out to be a suitable matching operator for SAR, we can develop a polynomial algorithm that computes a maximum common subgraph of two outerplanar graphs. We use this algorithm to build a metric for molecules and to generate patterns for molecules. These methods turn out to be not only more efficient than existing methods, but also achieve state-of-the-art predictive performance on SAR problems.

Lirias

A polynomial-time metric for outerplanar graphs

Author: Bruynooghe Maurice
Ramon Jan
Schietgat Leander
Publication date: 01/01/2007

In the chemoinformatics context, graphs have become very popular for the representation of molecules. However, many algorithms that handle graphs are computationally very expensive. In this paper we focus on outerplanar graphs, a class of graphs that is able to represent the majority of molecules. We define a metric on outerplanar graphs that is based on finding a maximum common subgraph and we present an algorithm that runs in polynomial time. Having an efficiently computable metric on molecules can improve the virtual screening of molecular databases significantly.
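To make the idea of a metric based on a maximum common subgraph concrete, the sketch below uses one well-known distance of this kind, due to Bunke and Shearer: one minus the size of the maximum common subgraph divided by the size of the larger graph. The abstract does not state the exact definition used in the paper, so this choice is an assumption. The function name `mcs_distance` is hypothetical, and the sketch takes the common-subgraph size as an input rather than computing it, since the polynomial-time computation is the paper's contribution and is not reproduced here.

```python
def mcs_distance(size_g: int, size_h: int, size_mcs: int) -> float:
    """Bunke-Shearer style distance: 1 - |mcs(G, H)| / max(|G|, |H|).

    size_g and size_h are the sizes of the two graphs and size_mcs is the
    size of their maximum common subgraph, all measured in the same unit
    (for example, number of vertices). Computing size_mcs is assumed to be
    done elsewhere.
    """
    if min(size_g, size_h, size_mcs) < 0:
        raise ValueError("sizes must be non-negative")
    if size_mcs > min(size_g, size_h):
        raise ValueError("a common subgraph cannot exceed the smaller graph")
    largest = max(size_g, size_h)
    if largest == 0:
        # Two empty graphs are identical.
        return 0.0
    return 1.0 - size_mcs / largest
```

Under this definition, two identical graphs are at distance 0 and two graphs with no common subgraph are at distance 1.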

Lirias

Predicting protein function and protein-ligand interaction with the 3D neighborhood kernel (Extended Abstract)

Author: Fannes Thomas
Ramon Jan
Schietgat Leander
Publication date: 01/11/2015

Lirias

Predicting protein function with the relative backbone position kernel

Author: Aryal Sunil
Ramon Jan
Schietgat Leander
Publication date: 01/09/2010

We propose a kernel method for predicting the function of proteins that makes use of 3D structural information. Using the kernel in support vector machines, we obtain state-of-the-art accuracy on two datasets of protein structures.

Lirias