10 research outputs found

    Automatic text summarization using pathfinder network scaling

    Contains an erratum. Master's thesis. Artificial Intelligence and Intelligent Systems. Faculdade de Engenharia, Universidade do Porto; Faculdade de Economia, Universidade do Porto. 200

    Genome signature based sequence comparison for taxonomic assignment and tree inference

    In this work we consider the use of the genome signature for two important bioinformatics problems: the taxonomic assignment of metagenome sequences and tree inference from whole genomes. We look at these problems from a sequence comparison point of view and propose machine-learning-based methods as solutions. For the first problem, we propose a novel method based on structural support vector machines that can directly predict paths in a tree implied by evolutionary relationships between taxa. The method uses an ensemble strategy to predict highly specific assignments for sequences of varying length arising from metagenome projects. Through controlled experimental analyses on simulated and real data sets, we show the benefits of our method under realistic conditions. For the task of genome tree inference, we propose a metric learning method. Based on the assumption that different oligonucleotide weights can be more informative for different groups of prokaryotes, as defined by their phylogeny or by genomic or ecological properties, our method learns group-specific distance metrics. We show that it is indeed possible to learn specific distance metrics that provide improved genome trees for the groups. In the outlook, we expect that for the addressed problems the work of this thesis will complement, and in some cases even outperform, alignment-based sequence comparison at a considerably reduced computational cost, allowing it to keep up with advancements in sequencing technologies.
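    As a rough illustration of the idea behind a genome signature and a weighted distance between signatures, the sketch below computes normalized oligonucleotide (k-mer) frequencies and compares two sequences. The abstract does not specify the thesis's features or metric form, so the function names, the choice of k, and the diagonal (per-oligonucleotide) weighting are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
import math


def genome_signature(seq, k=4):
    """Normalized k-mer frequencies over the ACGT alphabet (assumed definition)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made of A, C, G, T are counted; windows containing other
    # symbols (for example N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def weighted_distance(sig_a, sig_b, weights=None):
    """Weighted Euclidean distance; uniform weights give the plain Euclidean distance."""
    if weights is None:
        weights = [1.0] * len(sig_a)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, sig_a, sig_b)))


# Toy sequences, not real genomes.
sig1 = genome_signature("ACGTACGTACGTTTGACA", k=2)
sig2 = genome_signature("GGGCCCGGGCCCATATAT", k=2)
print(len(sig1), weighted_distance(sig1, sig2))
```

    In this sketch, learning a group-specific metric would amount to choosing the `weights` vector from data; how the thesis actually does this is not stated in the abstract.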

    Sequenzvergleich mit Hilfe der Genomsignatur für die taxonomische Einordnung von Sequenzen und das Lernen taxonomischer Bäume

    In this work we consider the use of the genome signature for two important bioinformatics problems: the taxonomic assignment of metagenome sequences and tree inference from whole genomes. We look at these problems from a sequence comparison point of view and propose machine-learning-based methods as solutions. For the first problem, we propose a novel method based on structural support vector machines that can directly predict paths in a tree implied by evolutionary relationships between taxa. The method uses an ensemble strategy to predict highly specific assignments for sequences of varying length arising from metagenome projects. Through controlled experimental analyses on simulated and real data sets, we show the benefits of our method under realistic conditions. For the task of genome tree inference, we propose a metric learning method. Based on the assumption that different oligonucleotide weights can be more informative for different groups of prokaryotes, as defined by their phylogeny or by genomic or ecological properties, our method learns group-specific distance metrics. We show that it is indeed possible to learn specific distance metrics that provide improved genome trees for the groups. In the outlook, we expect that for the addressed problems the work of this thesis will complement, and in some cases even outperform, alignment-based sequence comparison at a considerably reduced computational cost, allowing it to keep up with advancements in sequencing technologies.

    The PhyloPythiaS Web Server for Taxonomic Assignment of Metagenome Sequences

    Metagenome sequencing is becoming common and there is an increasing need for easily accessible tools for data analysis. An essential step is the taxonomic classification of sequence fragments. We describe a web server for the taxonomic assignment of metagenome sequences with PhyloPythiaS. PhyloPythiaS is a fast and accurate sequence composition-based classifier that utilizes the hierarchical relationships between clades. Taxonomic assignments with the web server can be made with a generic model, or with sample-specific models that users can specify and create. Several interactive visualization modes and multiple download formats allow quick and convenient analysis and downstream processing of taxonomic assignments. Here, we demonstrate usage of our web server by taxonomic assignment of metagenome samples from an acidophilic biofilm community of an acid mine and of a microbial community from cow rumen.

    Bioactivity assessment of natural compounds using machine learning models trained on target similarity between drugs.

    Natural compounds constitute a rich resource of potential small-molecule therapeutics. While experimental access to this resource is limited due to its vast diversity and difficulties in systematic purification, computational assessment of structural similarity with known therapeutic molecules offers a scalable approach. Here, we assessed functional similarity between natural compounds and approved drugs by combining multiple chemical similarity metrics and physicochemical properties using a machine-learning approach. We computed pairwise similarities between 1410 drugs for training classification models and used the drugs' shared protein targets as class labels. The best-performing models were random forests, which gave an average area under the ROC curve of 0.9, a Matthews correlation coefficient of 0.35, and an F1 score of 0.33, suggesting that they captured the structure-activity relation well. The models were then used to predict protein targets of circa 11k natural compounds by comparing them with the drugs. This revealed the therapeutic potential of several natural compounds, including those with support from previously published sources as well as those hitherto unexplored. We experimentally validated the activity of one predicted pair, viz., Cox-1 inhibition by 5-methoxysalicylic acid, a molecule commonly found in tea, herbs and spices. In contrast, another natural compound, 4-isopropylbenzoic acid, which had the highest similarity score under the most heavily weighted similarity metric but was not picked by our models, did not inhibit Cox-1. Our results demonstrate the utility of a machine-learning approach combining multiple chemical features for uncovering the protein binding potential of natural compounds.
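    For readers unfamiliar with the two threshold-dependent scores quoted above, the sketch below computes the F1 score and the Matthews correlation coefficient from binary confusion-matrix counts using their standard definitions. The function name and the counts in the example are made up for illustration and are not taken from the study.

```python
import math


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) for binary confusion counts; undefined ratios fall back to 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Hypothetical, heavily imbalanced counts (few positive pairs, many negatives).
f1, mcc = f1_and_mcc(tp=30, fp=90, fn=30, tn=9850)
print(f"F1 = {f1:.3f}, MCC = {mcc:.3f}")
```

    Unlike the area under the ROC curve, both scores depend on a decision threshold and are sensitive to class imbalance, so they can be modest even when the ranking quality is high.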

    Taxonomic assignments of the cow rumen metagenome scaffolds with the PhyloPythiaS generic model.

    This data set contained 26,042 scaffolds in total. The assignments are shown at the order level. Each slice represents the number of bases assigned. The "Other" slice represents sequences that were unassigned or assigned at a higher level.

    Percentage of bases correctly assigned to modeled taxa by different methods for the AMD metagenome scaffolds.

    The reference taxonomic affiliations were obtained by aligning the test scaffolds with the draft genomes. For PhyloPythiaS (both generic and sample-specific), the drop in accuracy is mostly due to sequences left unassigned at a particular rank, whereas other methods produced more false assignments. Thermoplasmatales archaeon Gpl (comprising 21.8% of the total bases) has no defined parental clade at the genus and family ranks, contributing to the lower accuracy values observed for these ranks. Additional measures are shown in Figure S6 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0038581#pone.0038581.s006).

    Taxonomic assignments of the acid mine drainage metagenome scaffolds.

    Each slice represents the number of bases assigned. (a) The PhyloPythiaS generic model at the phylum level; (b) the PhyloPythiaS sample-specific model at the phylum level; (c) the PhyloPythiaS sample-specific model at various ranks; (d) taxonomic reference composition, obtained by alignment of the scaffolds with draft genome assemblies; (e) quantitative cell counts from a FISH study, reproduced from Tyson et al. (2004) [13] (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0038581#pone.0038581-Tyson1); and (f) NBC with N-mer length 15 and Bacteria/Archaea genomes at the phylum level. The "Other" slice represents sequences that were unassigned or assigned at a higher level. Assignments were mapped to the phylum level in plots a, b and f for ease of visualization.

    Taxonomic distance and consistency analysis of the 15 genome bins from the cow rumen metagenome consisting of 466 scaffolds in total.

    The first three columns describe the dataset, while the last three columns summarize the predictions of the PhyloPythiaS generic model. The last three columns show the average taxonomic distance between the predicted order and the correct order (Tax Dist), the consistency calculated based on the fraction of assigned scaffolds (Const_n_scaff) and the consistency calculated based on the fraction of assigned base pairs (Const_n_bp). See 'Results' for the definitions of taxonomic distance and consistency. The micro average is the average value over all scaffolds and the macro average represents the average over the genome bins.
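    The difference between the micro and macro averages described in this caption can be shown with a small worked example. The sketch below assumes one value per scaffold (for instance a taxonomic distance) grouped by genome bin; the bin names and values are invented, and the helper name is hypothetical.

```python
def micro_macro_average(bins):
    """bins maps a bin name to a non-empty list of per-scaffold values."""
    # Micro average: pool every scaffold, so large bins dominate.
    all_values = [v for values in bins.values() for v in values]
    micro = sum(all_values) / len(all_values)
    # Macro average: average each bin first, so every bin counts equally.
    per_bin = [sum(values) / len(values) for values in bins.values()]
    macro = sum(per_bin) / len(per_bin)
    return micro, macro


# Invented values: a large bin predicted well and a small bin predicted poorly.
bins = {"bin_A": [0, 0, 0, 0], "bin_B": [2, 4]}
micro, macro = micro_macro_average(bins)
print(micro, macro)
```

    With these invented values the micro average is pulled toward the larger bin, while the macro average weights both bins equally.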