5 research outputs found

    Sparse Dynamic Programming on DAGs with Small Width

    The minimum path cover problem asks us to find a minimum-cardinality set of paths that cover all the nodes of a directed acyclic graph (DAG). We study the case when the size k of a minimum path cover is small, that is, when the DAG has a small width. This case is motivated by applications in pan-genomics, where the genomic variation of a population is expressed as a DAG. We observe that classical alignment algorithms exploiting sparse dynamic programming can be extended to the sequence-against-DAG case by mimicking the algorithm for sequences on each path of a minimum path cover and handling an evaluation order anomaly with reachability queries. Namely, we introduce a general framework for DAG-extensions of sparse dynamic programming. This framework produces algorithms that are slower than their counterparts on sequences only by a factor k. We illustrate this on two classical problems extended to DAGs: longest increasing subsequence and longest common subsequence. For the former, we obtain an algorithm with running time $O(k|E| \log |V|)$. This matches the optimal solution to the classical problem variant when the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. We then apply this technique to the co-linear chaining problem, which is a generalization of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets and a two-dimensional range search tree modified to support range maximum queries. We also study a general sequence-to-DAG alignment formulation that allows affine gap costs in the sequence.
The main ingredient of the proposed framework is a new algorithm for finding a minimum path cover of a DAG (V, E) in $O(k|E| \log |V|)$ time, improving all known time-bounds when k is small and the DAG is not too dense. In addition to boosting the sparse dynamic programming framework, an immediate consequence of this new minimum path cover algorithm is an improved space/time tradeoff for reachability queries in arbitrary directed graphs.
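To make the notion of width concrete, the following is a minimal sketch of a textbook baseline for computing the size k of a minimum path cover in which paths may share nodes. It is not the $O(k|E| \log |V|)$ algorithm described in the abstract: it builds the transitive closure and runs a simple augmenting-path bipartite matching on it, which is far slower on large graphs. The function name `min_path_cover_size` and the input convention (nodes numbered 0..n-1, edges as pairs) are assumptions made for illustration.

```python
def min_path_cover_size(n, edges):
    """Size k of a minimum path cover of a DAG with nodes 0..n-1.

    Paths are allowed to share nodes, so k equals the width of the DAG.
    Naive baseline: transitive closure + bipartite matching.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    # reach[s] = set of nodes reachable from s by a non-empty path.
    reach = []
    for s in range(n):
        seen = set()
        stack = list(adj[s])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        reach.append(seen)

    # match_right[v] = left node currently matched to right copy of v.
    match_right = [-1] * n

    def try_augment(u, visited):
        # Kuhn's algorithm: look for an augmenting path starting at u.
        for v in reach[u]:
            if v not in visited:
                visited.add(v)
                if match_right[v] == -1 or try_augment(match_right[v], visited):
                    match_right[v] = u
                    return True
        return False

    matched = sum(try_augment(u, set()) for u in range(n))
    # Each matched pair merges two path fragments into one.
    return n - matched


# A "bow-tie" DAG: two sources, one middle node, two sinks.
print(min_path_cover_size(5, [(0, 2), (1, 2), (2, 3), (2, 4)]))
```

On the bow-tie example the two covering paths both pass through the middle node, which is why the matching is taken on the transitive closure rather than on the original edges.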

    Third-generation RNA-sequencing analysis : graph alignment and transcript assembly with long reads

    The information contained in the genome of an organism, its DNA, is expressed through transcription of its genes to RNA, in quantities determined by many internal and external factors. As such, studying gene expression can yield valuable information, e.g. for clinical diagnostics. A common analysis workflow for RNA-sequencing (RNA-seq) data consists of mapping the sequencing reads to a reference genome, followed by transcript assembly and quantification based on these alignments. The advent of second-generation sequencing revolutionized the field by reducing sequencing costs 50,000-fold. Now another revolution is imminent, with third-generation sequencing platforms producing reads that are an order of magnitude longer. However, a higher error rate, higher cost, and lower throughput compared to second-generation sequencing bring their own challenges. To compensate for the low throughput and high cost, hybrid approaches using both short second-generation and long third-generation reads have gathered recent interest. The first part of this thesis focuses on the analysis of short-read RNA-seq data. As short-read mapping is an already well-researched field, we focus on giving a literature review of the topic. For transcript assembly we propose a novel (at the time of the publication) approach of using minimum-cost flows to solve the problem of covering a graph created from the read alignments with a set of paths with the minimum cost, under some cost model. Various network-flow-based solutions were proposed in parallel to, as well as after, ours. The second part, where the main contributions of this thesis lie, focuses on the analysis of long-read RNA-seq data.
The driving point of our research has been the Minimum Path Cover with Subpath Constraints (MPC-SC) model, where transcript assembly is modeled as a minimum path cover problem, with the addition that each of the chains of exons (subpath constraints) created from the long reads must be completely contained in a solution path. In addition to implementing this concept, we experimentally studied different approaches on how to find the exon chains in practice. The evaluated approaches included aligning the long reads to a graph created from short read alignments instead of the reference genome, which led to our final contribution: extending a co-linear chaining algorithm from between two sequences to between a sequence and a directed acyclic graph.
In transcription, RNA molecules are produced using the organism's genes as templates. Numerous factors, both internal and external to the cell, determine which genes are transcribed and to what extent. Studying this process yields valuable information, for example for medical diagnostics. One common way of analyzing RNA-sequencing data consists of three parts: aligning the read sequences to a reference genome, assembling the transcripts, and determining the expression levels of the transcripts. With the development of second-generation sequencing technology, the cost of sequencing dropped considerably, which allowed RNA-sequencing data to be used for an ever wider range of purposes. Now third-generation sequencing technologies offer reads that are an order of magnitude longer, which broadens the possibilities for analysis. However, a higher error rate, higher cost, and a smaller amount of data produced bring their own challenges. Using second- and third-generation technologies together, the so-called hybrid approach, is a research direction that has recently attracted much interest. The first part of this thesis focuses on the analysis of second-generation, so-called short, RNA reads. Aligning these short reads to a reference genome has been studied since the 2000s, so in this area we focus on the existing literature. For transcript assembly, we present a method that uses a minimum-cost flow model. In the minimum-cost flow model, a graph built from the reads is covered with a set of paths whose cost is as small as possible. Flow models have also been used in analysis tools developed by other researchers. The main contribution of this thesis is in its second part, which focuses on the analysis of so-called long RNA reads. The starting point of our research has been a model in which subpath constraints are added to the Minimum Path Cover problem. Each subpath constraint corresponds to an exon chain covered by some long read, and each subpath constraint must be entirely contained in some path of the path cover. In addition to implementing this concept, we experimentally tested different approaches for finding the exon chains. The approaches tested included aligning the long reads directly to a graph built from short reads instead of to the reference genome. This approach led to the final contribution of this thesis: generalizing the co-linear chaining algorithm from two sequences to a sequence and a directed acyclic graph.
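As a point of reference for the two-sequence case that the thesis generalizes, here is a minimal sketch of co-linear chaining under simplifying assumptions: anchors are half-open intervals `(q_start, q_end, t_start, t_end)`, anchors in a chain may not overlap in either sequence, and the score of a chain is the total anchor length on the query. The function name `chain_anchors`, the anchor format, and the quadratic-time dynamic program are illustrative choices, not the formulation or algorithm used in the thesis.

```python
def chain_anchors(anchors):
    """Best score of a co-linear chain of non-overlapping anchors.

    Each anchor is (q_start, q_end, t_start, t_end) with half-open
    intervals; anchor a may precede anchor b if a ends before b starts
    in both the query and the target. Simple O(n^2) dynamic program.
    """
    # Sorting by query end guarantees every valid predecessor of an
    # anchor appears earlier in the list (anchors are non-empty).
    order = sorted(anchors, key=lambda a: (a[1], a[3]))
    best = []  # best[i] = best chain score ending at order[i]
    for i, (qs, qe, ts, te) in enumerate(order):
        top = 0
        for j in range(i):
            _, prev_qe, _, prev_te = order[j]
            if prev_qe <= qs and prev_te <= ts:
                top = max(top, best[j])
        best.append((qe - qs) + top)
    return max(best, default=0)


anchors = [(0, 3, 0, 3), (3, 6, 10, 13), (4, 8, 4, 8)]
print(chain_anchors(anchors))
```

Sparse dynamic programming replaces the inner loop with a range-maximum query over previously processed anchors, which is where the range search structures mentioned in such algorithms come in.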

    Fast and accurate cDNA mapping and splice site identification

    Mapping and alignment of cDNA sequences containing splice sites is an algorithmically and computationally challenging task. Most recently developed spliced aligners are designed for mapping short reads and sacrifice sensitivity for increased performance. We present mesalina, a highly accurate spliced aligner, that can also be used to detect novel non-canonical splice sites and whose performance is more robust with respect to increasing read length. Mesalina utilizes the seed-extend strategy, combining fast retrieval of maximal exact matches with a sensitive sandwich dynamic programming algorithm. Preliminary results indicate that mesalina is accurate and very fast, especially for mapping longer reads. In particular, it is more than ten times faster than mappers with a comparable accuracy. Mesalina is available from https://github.ugent.be/ComputationalBiology/mesalina
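The seeding step of a seed-extend strategy can be illustrated with a naive search for maximal exact matches (MEMs): exact matches between read and reference that cannot be extended in either direction. The sketch below scans every diagonal of the comparison matrix in O(|query| x |ref|) time; it does not reflect how mesalina retrieves MEMs, which the abstract does not describe, and the function name `maximal_exact_matches` and the `min_len` parameter are assumptions for illustration.

```python
def maximal_exact_matches(query, ref, min_len=1):
    """Return all MEMs as (query_pos, ref_pos, length) triples.

    Naive diagonal scan: on diagonal d, query position q is compared
    with reference position q + d, and maximal runs of equal
    characters are reported when they reach min_len.
    """
    mems = []
    for d in range(-(len(query) - 1), len(ref)):
        q = max(0, -d)
        r = q + d
        run = 0  # length of the current run of matching characters
        while q < len(query) and r < len(ref):
            if query[q] == ref[r]:
                run += 1
            else:
                if run >= min_len:
                    mems.append((q - run, r - run, run))
                run = 0
            q += 1
            r += 1
        # A run that reaches the end of either string is also maximal.
        if run >= min_len:
            mems.append((q - run, r - run, run))
    return mems


print(maximal_exact_matches("ACGT", "TACGA", min_len=3))
```

In a full aligner, seeds like these would then be chained and the gaps between them filled by a more sensitive dynamic programming step.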

    ALFALFA : fast and accurate mapping of long next generation sequencing reads
