
    Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

    Background: The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how best to assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results: In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions: Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
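Contig-size statistics such as N50 are among the standard measures used in assembly evaluations like this one. A minimal sketch of how N50 is computed (the function name and example contig lengths are illustrative, not taken from the paper):

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Total is 100; the cumulative sums 40, 70 cross the halfway mark at 30.
print(n50([40, 30, 15, 10, 5]))  # -> 30
```

N50 alone rewards long but possibly misjoined contigs, which is one reason the paper combines it with reference-based checks such as optical maps and Fosmid sequences.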

    Chaining with Overlaps Revisited

    Chaining algorithms aim to form a semi-global alignment of two sequences based on a set of anchoring local alignments given as input. Depending on the optimization criteria and the exact definition of a chain, there are several O(n log n) time algorithms to solve this problem optimally, where n is the number of input anchors. In this paper, we focus on a formulation allowing the anchors to overlap in a chain. This formulation was studied by Shibuya and Kurochkin (WABI 2003), but their algorithm comes with no proof of correctness. We revisit and modify their algorithm to consider a strict definition of the precedence relation on anchors, adding the derivation required to establish the correctness of the resulting algorithm, which runs in O(n log^2 n) time on anchors formed by exact matches. With the more relaxed definition of the precedence relation considered by Shibuya and Kurochkin, or when anchors are non-nested, such as matches of uniform length (k-mers), the algorithm takes O(n log n) time. We also establish a connection between chaining with overlaps and the widely studied longest common subsequence problem. 2012 ACM Subject Classification: Theory of computation → Pattern matching; Theory of computation → Dynamic programming; Applied computing → Genomics. Peer reviewed.
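The underlying chaining recurrence can be sketched for the simpler no-overlap case. The quadratic version below is illustrative only; the paper's algorithms use range-search data structures to reach O(n log n) and O(n log^2 n), and its overlap-allowing formulation is more involved:

```python
def chain_no_overlaps(anchors):
    """Best chain score (total anchor length covered) where each anchor
    must end before the next one starts in both sequences (no overlaps).
    Anchors are exact matches (x, y, l): T[x..x+l) matches Q[y..y+l).
    Simple O(n^2) dynamic programming for exposition."""
    anchors = sorted(anchors)              # by start in the first sequence
    best = [l for (_, _, l) in anchors]    # chain ending at each anchor
    for j in range(len(anchors)):
        xj, yj, lj = anchors[j]
        for i in range(j):
            xi, yi, li = anchors[i]
            if xi + li <= xj and yi + li <= yj:   # anchor i precedes j
                best[j] = max(best[j], best[i] + lj)
    return max(best, default=0)
```

Allowing overlaps means an anchor may contribute only the part not already covered by its predecessor, which is what makes the precedence relation and the correctness argument delicate.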

    Sparse Dynamic Programming on DAGs with Small Width

    The minimum path cover problem asks us to find a minimum-cardinality set of paths that cover all the nodes of a directed acyclic graph (DAG). We study the case when the size k of a minimum path cover is small, that is, when the DAG has a small width. This case is motivated by applications in pan-genomics, where the genomic variation of a population is expressed as a DAG. We observe that classical alignment algorithms exploiting sparse dynamic programming can be extended to the sequence-against-DAG case by mimicking the algorithm for sequences on each path of a minimum path cover and handling an evaluation order anomaly with reachability queries. Namely, we introduce a general framework for DAG-extensions of sparse dynamic programming. This framework produces algorithms that are slower than their counterparts on sequences only by a factor k. We illustrate this on two classical problems extended to DAGs: longest increasing subsequence and longest common subsequence. For the former, we obtain an algorithm with running time O(k|E| log|V|). This matches the optimal solution to the classical problem variant when the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. We then apply this technique to the co-linear chaining problem, which is a generalization of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets and a two-dimensional range search tree modified to support range maximum queries. We also study a general sequence-to-DAG alignment formulation that allows affine gap costs in the sequence.
The main ingredient of the proposed framework is a new algorithm for finding a minimum path cover of a DAG (V, E) in O(k|E| log|V|) time, improving all known time bounds when k is small and the DAG is not too dense. In addition to boosting the sparse dynamic programming framework, an immediate consequence of this new minimum path cover algorithm is an improved space/time tradeoff for reachability queries in arbitrary directed graphs.
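The classical longest increasing subsequence algorithm that the framework generalizes to DAGs can be sketched as follows (this is the textbook patience-sorting variant, corresponding to the k = 1 case where the DAG is a single path; it is not the paper's code):

```python
import bisect

def lis_length(seq):
    """O(n log n) longest increasing subsequence on a plain sequence.
    tails[i] holds the smallest possible tail value of an increasing
    subsequence of length i + 1; each element either extends the longest
    subsequence found so far or improves some tail."""
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))  # -> 4 (e.g. 1, 4, 5, 9)
```

The DAG extension runs this kind of sparse computation along each of the k paths of a minimum path cover, using reachability queries to fix the evaluation order, which is where the extra factor k in the running time comes from.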

    Third-generation RNA-sequencing analysis : graph alignment and transcript assembly with long reads

    The information contained in the genome of an organism, its DNA, is expressed through transcription of its genes to RNA, in quantities determined by many internal and external factors. As such, studying gene expression can give valuable information for, e.g., clinical diagnostics. A common analysis workflow for RNA-sequencing (RNA-seq) data consists of mapping the sequencing reads to a reference genome, followed by transcript assembly and quantification based on these alignments. The advent of second-generation sequencing revolutionized the field by reducing sequencing costs 50,000-fold. Now another revolution is imminent, with third-generation sequencing platforms producing an order of magnitude higher read lengths. However, a higher error rate, higher cost, and lower throughput compared to second-generation sequencing bring their own challenges. To compensate for the low throughput and high cost, hybrid approaches using both short second-generation and long third-generation reads have gathered recent interest. The first part of this thesis focuses on the analysis of short-read RNA-seq data. As short-read mapping is an already well-researched field, we focus on giving a literature review of the topic. For transcript assembly we propose a novel (at the time of publication) approach of using minimum-cost flows to solve the problem of covering a graph created from the read alignments with a set of paths of minimum cost, under some cost model. Various network-flow-based solutions were proposed in parallel to, as well as after, ours. The second part, where the main contributions of this thesis lie, focuses on the analysis of long-read RNA-seq data.
The driving point of our research has been the Minimum Path Cover with Subpath Constraints (MPC-SC) model, where transcript assembly is modeled as a minimum path cover problem, with the addition that each of the chains of exons (subpath constraints) created from the long reads must be completely contained in a solution path. In addition to implementing this concept, we experimentally studied different approaches to finding the exon chains in practice. The evaluated approaches included aligning the long reads to a graph created from short-read alignments instead of the reference genome, which led to our final contribution: extending a co-linear chaining algorithm from between two sequences to between a sequence and a directed acyclic graph.

[Abstract in Finnish, translated:] In transcription, RNA molecules are produced according to the template of an organism's genes. Numerous factors, both internal and external to the cell, determine which genes are transcribed and to what extent. Studying this process provides valuable information for, e.g., medical diagnostics. One common way of analyzing RNA sequencing data consists of three parts: aligning the read sequences to a reference genome, assembling the transcripts, and determining the expression levels of the transcripts. With the development of second-generation sequencing technology, the cost of sequencing dropped considerably, which allowed RNA sequencing data to be used for ever more purposes. Now, third-generation sequencing technologies offer reads that are an order of magnitude longer, which expands the possibilities for analysis. However, a higher error rate, a higher cost, and a smaller amount of produced data bring their own challenges. Using second- and third-generation technologies together, the so-called hybrid approach, is a research direction that has attracted much interest recently. The first part of this thesis focuses on the analysis of second-generation, i.e. short, RNA reads. The alignment of these short reads to a reference genome has been studied since the 2000s, so in this area we concentrate on the existing literature. In the field of transcript assembly, we present a method that uses a minimum-cost flow model, in which a graph created from the reads is covered with a set of paths of the lowest possible cost. Flow models have also been used in analysis tools developed by other researchers. The main contribution of this thesis is in the second part, which focuses on the analysis of long RNA reads. The starting point of our research has been a model in which subpath constraints are added to the Minimum Path Cover problem. Each subpath constraint corresponds to an exon chain covered by some long read, and each subpath constraint must be contained entirely in some path of the path cover. In addition to implementing this concept, we experimentally tested different approaches for finding the exon chains. These included aligning the long reads directly to a graph created from the short reads instead of to the reference genome. This approach led to the final contribution of this thesis: generalizing the co-linear chaining algorithm from between two sequences to between a sequence and a directed acyclic graph.
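The unconstrained minimum path cover problem underlying the MPC-SC model can be sketched via the classical reduction to maximum bipartite matching (an illustrative baseline for vertex-disjoint path covers on a DAG; the thesis's subpath-constrained variant is more involved):

```python
def min_path_cover(n, edges):
    """Minimum number of vertex-disjoint paths covering all n nodes of a
    DAG, via the classical reduction: answer = n - |maximum matching| in
    the bipartite graph pairing each node's out-side with successors'
    in-sides. Uses a simple augmenting-path matcher."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    match_to = [-1] * n  # match_to[v] = node whose outgoing edge enters v

    def augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_to[v] == -1 or augment(match_to[v], seen):
                match_to[v] = u
                return True
        return False

    matching = sum(augment(u, set()) for u in range(n))
    return n - matching

# A diamond 0->1, 0->2, 1->3, 2->3 needs two paths, e.g. 0-1-3 and 2.
print(min_path_cover(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))  # -> 2
```

In the transcript-assembly setting, each solution path corresponds to a candidate transcript; MPC-SC additionally forces every long-read exon chain to lie entirely within one such path.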

    Methods for Transcriptome Assembly in the Allopolyploid Brassica napus

    Canada is the world’s largest producer of canola, and production is ever increasing, with an annual growth rate of 9.38% according to FAOSTAT. In 2017, canola acreage surpassed wheat in Saskatchewan, the highest producer of both crops in Canada. Country-wide, the total farming area of canola increased by 9.9% to 22.4 million acres, while wheat area saw a slight decline to 23.3 million acres. While Canada is the highest producer of the crop, yields are lower than in other countries. To maximize the benefit of this market, canola cultivation could be made more efficient with further characterization of the organism’s genes and their involvement in plant robustness. Such studies using transcriptome analysis have been successful in organisms with relatively small and simple genomes. However, such analyses in B. napus are complicated by the allopolyploid genome structure resulting from ancestral whole-genome duplications in the species’ evolutionary history. Homeologous gene pairs originating from the orthology between the two B. napus progenitor species complicate the process of transcriptome assembly. Modern assemblers (Trinity, Oases, and SOAPdenovo-Trans) were used to generate several de novo transcriptome assemblies for B. napus. A variety of metrics were used to determine the impact that the complex genome structure has on transcriptome studies. In particular, the most important questions for transcriptome assembly in B. napus were how varying the k-mer parameter affects assembly quality, and to what extent similar genes resulting from homeology within B. napus complicate the process of assembly.
The metrics used for evaluating the assemblies include basic assembly statistics, such as the number of contigs and contig lengths (via N25, N50 and N75 statistics); more involved investigation via comparison to annotated coding DNA sequences; scores from evaluation software for de novo transcriptome assemblies; and, finally, quantification of homeolog differentiation by alignment to previously identified pairs of homeologous genes. These metrics provided a picture of the trade-offs between assembly software packages and the k parameter determining the length of the subsequences used to build de Bruijn graphs for de novo transcriptome assembly. It was shown that shorter k-mer lengths produce fewer, more complete contigs due to the shorter required overlap between read sequences, while longer k-mer lengths increase the sensitivity of an assembler to sequence variation between similar gene sequences. The Trinity assembler outperformed Oases and SOAPdenovo-Trans when considering the total breadth of evaluation metrics, generating longer transcripts with fewer chimeras between homeologous gene pairs.
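The role of the k parameter can be illustrated with a minimal de Bruijn graph construction (function and variable names are ours, not taken from any of the assemblers discussed):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph from reads: nodes are (k-1)-mers, and each
    k-mer occurring in a read contributes an edge prefix -> suffix.
    Smaller k requires less overlap between reads to join them (fewer,
    longer contigs); larger k better separates near-identical sequences
    such as homeologous gene pairs, at the cost of fragmentation."""
    edges = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])
    return edges

# The read ACGT with k = 3 yields the 3-mers ACG and CGT,
# giving edges AC -> CG and CG -> GT.
print(dict(de_bruijn(["ACGT"], 3)))
```

Assemblies then correspond to walks in this graph, which is why the same read set can produce quite different contig sets as k varies.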

    Computational problems of analysis of short next generation sequencing reads

    Short-read next generation sequencing (NGS) has had a significant impact on modern genomics, genetics, cell biology and medicine, especially on metagenomics, comparative genomics, polymorphism detection, mutation screening, transcriptome profiling, methylation profiling, chromatin remodelling and many more applications. However, NGS is prone to errors, which complicates scientific conclusions. NGS technologies consist of shearing DNA molecules into a collection of numerous small fragments, called a ‘library’, followed by extensive parallel sequencing of these fragments. The sequenced overlapping fragments are called ‘reads’, and they are assembled into contiguous strings. These contiguous sequences are in turn assembled into genomes for further analysis. Computational sequencing problems are those arising from the numerical processing of sequenced samples, which involves procedures such as quality scoring, mapping/assembly, and, perhaps surprisingly, error correction of the data. This paper reviews post-processing errors and computational methods to discern them, and includes a sequencing dictionary. We present quality control of raw data and the errors arising at the steps of aligning sequencing reads to a reference genome and assembly. Finally, this work presents the identification of mutations (“variant calling”) in sequencing data and its quality control.
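The quality scores used in this kind of quality control follow the Phred convention, Q = -10 log10(P), stored in FASTQ files as ASCII characters. A minimal sketch (offset 33 is the common Sanger/Illumina 1.8+ encoding; other historical offsets exist):

```python
def phred_to_error_prob(q):
    """Invert the Phred scale Q = -10 * log10(P): P = 10 ** (-Q / 10),
    so Q30 means a 1-in-1000 chance the base call is wrong."""
    return 10 ** (-q / 10)

def ascii_to_phred(char, offset=33):
    """FASTQ encodes each base's quality as chr(Q + offset)."""
    return ord(char) - offset

print(phred_to_error_prob(30))  # -> 0.001
print(ascii_to_phred('I'))      # -> 40
```

Filtering or trimming reads below a quality threshold is typically the first of the post-processing steps the review covers, before alignment and variant calling.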