
    Sparse Dynamic Programming on DAGs with Small Width

    The minimum path cover problem asks us to find a minimum-cardinality set of paths that cover all the nodes of a directed acyclic graph (DAG). We study the case when the size k of a minimum path cover is small, that is, when the DAG has a small width. This case is motivated by applications in pan-genomics, where the genomic variation of a population is expressed as a DAG. We observe that classical alignment algorithms exploiting sparse dynamic programming can be extended to the sequence-against-DAG case by mimicking the algorithm for sequences on each path of a minimum path cover and by handling an evaluation order anomaly with reachability queries. Namely, we introduce a general framework for DAG extensions of sparse dynamic programming. This framework produces algorithms that are slower than their counterparts on sequences only by a factor k. We illustrate this on two classical problems extended to DAGs: longest increasing subsequence and longest common subsequence. For the former, we obtain an algorithm with running time O(k|E| log |V|). This matches the optimal solution to the classical problem variant when the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. We then apply this technique to the co-linear chaining problem, which is a generalization of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets and a two-dimensional range search tree modified to support range maximum queries. We also study a general sequence-to-DAG alignment formulation that allows affine gap costs in the sequence. The main ingredient of the proposed framework is a new algorithm for finding a minimum path cover of a DAG G = (V, E) in O(k|E| log |V|) time, improving all known time bounds when k is small and the DAG is not too dense. In addition to boosting the sparse dynamic programming framework, an immediate consequence of this new minimum path cover algorithm is an improved space/time tradeoff for reachability queries in arbitrary directed graphs.
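    As a point of reference for the factor-k slowdown described above, the classical longest increasing subsequence problem on a plain sequence (a DAG that is a single path, so k = 1) is solvable in O(n log n) time by patience sorting. The sketch below shows only that sequence baseline; it is not the paper's DAG algorithm, and the function name is ours. On a general DAG, the framework runs the same kind of update along each path of a minimum path cover, using reachability queries to handle the evaluation order anomaly, which is where the extra factor k enters.

        import bisect

        def lis_length(seq):
            # tails[i] = smallest possible last element of an increasing
            # subsequence of length i + 1 among the elements seen so far.
            tails = []
            for x in seq:
                i = bisect.bisect_left(tails, x)   # first position with tails[i] >= x
                if i == len(tails):
                    tails.append(x)                # x extends the longest subsequence found so far
                else:
                    tails[i] = x                   # x is a smaller tail for subsequences of length i + 1
            return len(tails)

        print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))   # 4  (e.g. 1, 4, 5, 9)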

    Third-generation RNA-sequencing analysis: graph alignment and transcript assembly with long reads

    The information contained in the genome of an organism, its DNA, is expressed through transcription of its genes to RNA, in quantities determined by many internal and external factors. Studying gene expression can therefore give valuable information for, e.g., clinical diagnostics. A common analysis workflow for RNA-sequencing (RNA-seq) data consists of mapping the sequencing reads to a reference genome, followed by transcript assembly and quantification based on these alignments. The advent of second-generation sequencing revolutionized the field by reducing sequencing costs roughly 50,000-fold. Now another revolution is imminent, with third-generation sequencing platforms producing reads an order of magnitude longer. However, a higher error rate, higher cost, and lower throughput compared to second-generation sequencing bring their own challenges. To compensate for the low throughput and high cost, hybrid approaches that use both short second-generation and long third-generation reads have recently gathered interest.
    The first part of this thesis focuses on the analysis of short-read RNA-seq data. As short-read mapping is an already well-researched field, we focus on giving a literature review of the topic. For transcript assembly we propose an approach, novel at the time of publication, that uses minimum-cost flows to cover a graph created from the read alignments with a set of paths of minimum total cost under some cost model. Various network-flow-based solutions were proposed in parallel to, as well as after, ours.
    The second part, where the main contributions of this thesis lie, focuses on the analysis of long-read RNA-seq data. The driving point of our research has been the Minimum Path Cover with Subpath Constraints (MPC-SC) model, in which transcript assembly is modeled as a minimum path cover problem with the addition that each chain of exons (subpath constraint) created from the long reads must be completely contained in some solution path. In addition to implementing this concept, we experimentally studied different approaches for finding the exon chains in practice. The evaluated approaches included aligning the long reads to a graph created from the short-read alignments instead of the reference genome, which led to our final contribution: extending a co-linear chaining algorithm from between two sequences to between a sequence and a directed acyclic graph.
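    To make the final contribution concrete, the sketch below shows the classical co-linear chaining setting between two plain sequences: given a set of anchors, each stating that an interval of one sequence matches an interval of the other, pick a chain of anchors that appear in the same order in both sequences and maximize the total anchored length. This is a standard quadratic dynamic program, not the thesis's sequence-to-DAG algorithm, and the anchor encoding and function name are illustrative assumptions.

        def colinear_chain_score(anchors):
            # Each anchor (x1, x2, y1, y2) states that T[x1..x2] matches P[y1..y2]
            # (closed intervals).  A chain uses anchors that occur in the same order,
            # without overlap, in both sequences; we maximize total anchored length in T.
            anchors = sorted(anchors, key=lambda a: (a[1], a[3]))   # sort by end positions
            best = []                                               # best[i] = best chain ending at anchor i
            for i, (x1, x2, y1, y2) in enumerate(anchors):
                score = x2 - x1 + 1
                for j, (px1, px2, py1, py2) in enumerate(anchors[:i]):
                    if px2 < x1 and py2 < y1:                       # anchor j ends before anchor i starts
                        score = max(score, best[j] + (x2 - x1 + 1))
                best.append(score)
            return max(best) if best else 0

        # Two chainable anchors (total length 3 + 4 = 7) beat the single long anchor (length 6):
        print(colinear_chain_score([(0, 2, 0, 2), (5, 8, 4, 7), (1, 6, 1, 6)]))   # 7

    Roughly speaking, in the sequence-to-DAG extension the second sequence is replaced by a DAG built from the short-read alignments, and the precedence test on that side involves reachability in the graph rather than a simple coordinate comparison.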

    YOC, a new strategy for pairwise alignment of collinear genomes

    Background: Comparing and aligning genomes is a key step in analyzing closely related genomes. Despite the development of many genome aligners in the last 15 years, the problem is not yet fully resolved, even when aligning closely related bacterial genomes of the same species. In addition, no procedures are available to assess the quality of genome alignments or to compare genome aligners.
    Results: We designed an original method for pairwise genome alignment, named YOC, which employs a highly sensitive similarity detection method together with a recent collinear chaining strategy that allows overlaps. YOC improves the reliability of collinear genome alignments, while preserving or even improving sensitivity. We also propose an original qualitative evaluation criterion for measuring the relevance of genome alignments. We used this criterion to compare and benchmark YOC against five recent genome aligners on large bacterial genome datasets, and showed that it is suitable for identifying the specificities and potential flaws of their underlying strategies.
    Conclusions: The YOC prototype is available at https://github.com/ruricaru/YOC. It has several advantages over existing genome aligners: (1) it is based on a simplified two-phase alignment strategy, (2) it is easy to parameterize, and (3) it produces reliable genome alignments, which are easier to analyze and to use.

    Identifying Relevant Evidence for Systematic Reviews and Review Updates

    Systematic reviews identify, assess, and synthesise the evidence available to answer complex research questions. They are essential in healthcare, where the volume of evidence in scientific research publications is vast and cannot feasibly be identified or analysed by individual clinicians or decision makers. However, the process of creating a systematic review is time-consuming and expensive. The pace of scientific publication in medicine and related fields also means that evidence bases are continually changing and review conclusions can quickly become out of date. Developing methods to support the creation and updating of reviews is therefore essential to reduce the workload required and thereby ensure that reviews remain up to date. This research aims to support systematic reviews, and thus improve healthcare, through natural language processing and information retrieval techniques. More specifically, this thesis aims to support the process of identifying relevant evidence for systematic reviews and review updates, reducing the workload required from researchers. The research proposes methods to improve the ranking of studies for systematic reviews. In addition, this thesis describes a dataset of systematic review updates in the field of medicine, created using 25 Cochrane reviews. Moreover, it develops an algorithm that automatically refines the Boolean query to improve the identification of relevant studies for review updates. The research demonstrates that automating the process of identifying relevant evidence can reduce the workload of conducting and updating systematic reviews.