101 research outputs found

    Comparison of spliced alignment software in analyzing RNA-Seq data

    A recently developed protocol for high-throughput sequencing of the RNA in a cell, RNA-seq, generates from hundreds of thousands to a few billion short sequence fragments from each RNA sample. Aligning these fragments, or 'reads', to the reference genome quickly and accurately is a challenging task that has been tackled by many researchers over the past five years. In this thesis I review the process of RNA-seq data creation and analysis, and introduce and compare some of the popular alignment software. As part of the thesis, I implemented an alignment tool based on the novel idea of a limited-range BWT-transformed index. This software, called SpliceAligner, is also introduced in detail. In addition to my own software, I chose Tophat, SpliceMap, MapSplice, SOAPsplice and SHRiMP2 for comparison. I tested the chosen software on simulated data sets with read lengths of 50, 100, 150 and 250 base pairs, as well as with data from a real RNA-seq experiment. I ranked the software based on running time, number of reads mapped and accuracy of the alignments. I also predicted transcripts from the alignments of the simulated data, and measured the correctness of the predictions. With read lengths of 50, 100 and 150 base pairs, speed, alignment accuracy and ease of use make Tophat a solid top choice. MapSplice is comparable in speed and alignment accuracy, and SOAPsplice is only slightly behind, but their user interfaces are much more complicated. However, Tophat slowed down significantly as the read length increased to 250 base pairs, and SOAPsplice failed to run at all with 250-base-pair reads. This leaves MapSplice as the top choice for long reads in most cases. My software SpliceAligner was competitive in alignment accuracy with the top choices, but work remains to be done on the running speed as well as on multiple small optimizations.

    Third-generation RNA-sequencing analysis : graph alignment and transcript assembly with long reads

    The information contained in the genome of an organism, its DNA, is expressed through transcription of its genes to RNA, in quantities determined by many internal and external factors. As such, studying the gene expression can give valuable information for e.g. clinical diagnostics. A common analysis workflow of RNA-sequencing (RNA-seq) data consists of mapping the sequencing reads to a reference genome, followed by the transcript assembly and quantification based on these alignments. The advent of second-generation sequencing revolutionized the field by reducing the sequencing costs by 50,000-fold. Now another revolution is imminent with the third-generation sequencing platforms producing an order of magnitude higher read lengths. However, higher error rate, higher cost and lower throughput compared to the second-generation sequencing bring their own challenges. To compensate for the low throughput and high cost, hybrid approaches using both short second-generation and long third-generation reads have gathered recent interest. The first part of this thesis focuses on the analysis of short-read RNA-seq data. As short-read mapping is an already well-researched field, we focus on giving a literature review of the topic. For transcript assembly we propose a novel (at the time of the publication) approach of using minimum-cost flows to solve the problem of covering a graph created from the read alignments with a set of paths with the minimum cost, under some cost model. Various network-flow-based solutions were proposed in parallel to, as well as after, ours. The second part, where the main contributions of this thesis lie, focuses on the analysis of long-read RNA-seq data. 
The driving point of our research has been the Minimum Path Cover with Subpath Constraints (MPC-SC) model, where transcript assembly is modeled as a minimum path cover problem, with the addition that each of the chains of exons (subpath constraints) created from the long reads must be completely contained in a solution path. In addition to implementing this concept, we experimentally studied different approaches on how to find the exon chains in practice. The evaluated approaches included aligning the long reads to a graph created from short read alignments instead of the reference genome, which led to our final contribution: extending a co-linear chaining algorithm from between two sequences to between a sequence and a directed acyclic graph. In transcription, RNA molecules are created using the organism's genes as a template. Numerous factors, both internal and external to the cell, determine which genes are transcribed and to what extent. Studying this process yields valuable information for, e.g., medical diagnostics. One common way of analyzing RNA-sequencing data consists of three parts: aligning the reads to a reference genome, assembling the transcripts, and determining the expression levels of the transcripts. With the development of second-generation sequencing technology, the cost of sequencing fell considerably, which allowed RNA-sequencing data to be used for an ever wider range of purposes. Third-generation sequencing technologies now offer reads an order of magnitude longer, which broadens the possibilities for analysis. However, the higher error rate, higher cost and smaller amount of data produced bring their own challenges. Using second- and third-generation technologies together, the so-called hybrid approach, is a research direction that has attracted much interest recently. The first part of this thesis focuses on the analysis of second-generation, i.e. short, RNA reads. 
The alignment of these short reads to a reference genome has been studied since the 2000s, so in this area we concentrate on the existing literature. For transcript assembly we present a method that uses the minimum-cost flow model. In the minimum-cost flow model, the graph created from the reads is covered with a set of paths whose cost is the smallest possible. Flow models have also been used in analysis tools developed by other researchers. The main contribution of this thesis lies in the second part, which focuses on the analysis of long RNA reads. The starting point of our research has been a model in which subpath constraints are added to the Minimum Path Cover problem. Each subpath constraint corresponds to an exon chain covered by some long read, and every subpath constraint must be entirely contained in some path of the path cover. In addition to implementing this concept, we experimentally tested different approaches for finding the exon chains. The approaches tested included aligning the long reads directly to a graph created from short reads instead of to the reference genome. This approach led to the final contribution of this thesis: generalizing the co-linear chaining algorithm from two sequences to a sequence and a directed acyclic graph.
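
The abstract above mentions co-linear chaining between two sequences as the starting point for the DAG extension. The following is a minimal sketch of that two-sequence case only, under simplifying assumptions that are not taken from the thesis: anchors are closed intervals `(q_start, q_end, t_start, t_end)`, anchors in a chain may not overlap, the score is the total query length covered, and the function name `chain_anchors` is hypothetical. It uses a quadratic-time dynamic program rather than the sparse dynamic programming the thesis refers to.

```python
def chain_anchors(anchors):
    """Return (score, chain) for the best non-overlapping co-linear chain.

    anchors: list of (q_start, q_end, t_start, t_end) closed intervals.
    chain:   indices into `anchors`, in chain order.
    """
    n = len(anchors)
    if n == 0:
        return 0, []
    # Process anchors by increasing query start, so every possible
    # predecessor of an anchor has already been scored.
    order = sorted(range(n), key=lambda i: anchors[i][0])
    best = [0] * n       # best[i]: best chain score ending at anchor i
    prev = [None] * n    # prev[i]: predecessor of i in that chain
    for pos, i in enumerate(order):
        qs, qe, ts, te = anchors[i]
        length = qe - qs + 1
        best[i] = length
        for j in order[:pos]:
            _, pqe, _, pte = anchors[j]
            # j may precede i only if it ends before i starts
            # in both the query and the target.
            if pqe < qs and pte < ts and best[j] + length > best[i]:
                best[i] = best[j] + length
                prev[i] = j
    end = max(range(n), key=lambda i: best[i])
    chain = []
    cur = end
    while cur is not None:
        chain.append(cur)
        cur = prev[cur]
    chain.reverse()
    return best[end], chain
```

For example, with anchors `[(0, 4, 0, 4), (5, 9, 20, 24), (6, 8, 6, 8), (10, 14, 10, 14)]`, the second anchor jumps far ahead in the target and blocks the last one, so the best chain skips it.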

    Evaluating approaches to find exon chains based on long reads

    Transcript prediction can be modeled as a graph problem where exons are modeled as nodes and reads spanning two or more exons are modeled as exon chains. Pacific Biosciences third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions. We survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity/precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph on which the long-read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy. Availability: The simulated data and in-house scripts used for this article are available at http://www.cs.helsinki.fi/group/gsa/exon-chains/exon-chains-bib.tar.bz2. Peer reviewed
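
The graph model described in this abstract (exons as nodes, reads as exon chains) can be illustrated with a small sketch. Everything here is an assumption for illustration and is not the authors' code: exons are plain string identifiers, each read has already been reduced to an ordered list of exons, and the helper names `build_splicing_graph` and `path_contains_chain` are hypothetical.

```python
from collections import Counter


def build_splicing_graph(exon_chains):
    """Build nodes and weighted arcs from a list of exon chains.

    Each chain is an ordered list of exon identifiers seen in one read.
    An arc (a, b) is added for every pair of consecutive exons, and its
    count records how many chains support it.
    """
    nodes = set()
    arcs = Counter()
    for chain in exon_chains:
        nodes.update(chain)
        for a, b in zip(chain, chain[1:]):
            arcs[(a, b)] += 1
    return nodes, arcs


def path_contains_chain(path, chain):
    """True if `chain` occurs as a contiguous subpath of `path`."""
    m = len(chain)
    return any(path[i:i + m] == chain for i in range(len(path) - m + 1))
```

A predicted transcript such as `["e1", "e2", "e3", "e4"]` is consistent with the chain `["e2", "e3"]` but not with `["e1", "e3"]`, because the latter implies that exon `e2` is skipped.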

    Sparse Dynamic Programming on DAGs with Small Width

    The minimum path cover problem asks us to find a minimum-cardinality set of paths that cover all the nodes of a directed acyclic graph (DAG). We study the case when the size k of a minimum path cover is small, that is, when the DAG has a small width. This case is motivated by applications in pan-genomics, where the genomic variation of a population is expressed as a DAG. We observe that classical alignment algorithms exploiting sparse dynamic programming can be extended to the sequence-against-DAG case by mimicking the algorithm for sequences on each path of a minimum path cover and handling an evaluation order anomaly with reachability queries. Namely, we introduce a general framework for DAG-extensions of sparse dynamic programming. This framework produces algorithms that are slower than their counterparts on sequences only by a factor k. We illustrate this on two classical problems extended to DAGs: longest increasing subsequence and longest common subsequence. For the former, we obtain an algorithm with running time O(k|E| log |V|). This matches the optimal solution to the classical problem variant when the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. We then apply this technique to the co-linear chaining problem, which is a generalization of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets and a two-dimensional range search tree modified to support range maximum queries. We also study a general sequence-to-DAG alignment formulation that allows affine gap costs in the sequence. 
The main ingredient of the proposed framework is a new algorithm for finding a minimum path cover of a DAG (V, E) in O(k|E| log |V|) time, improving all known time-bounds when k is small and the DAG is not too dense. In addition to boosting the sparse dynamic programming framework, an immediate consequence of this new minimum path cover algorithm is an improved space/time tradeoff for reachability queries in arbitrary directed graphs. Peer reviewed
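
To make the quantity k concrete, the sketch below computes the width of a small DAG, i.e. the size of a minimum path cover in which paths may share nodes. It is not the O(k|E| log |V|) algorithm from the paper. It uses the classical textbook reduction instead: take the transitive closure, find a maximum bipartite matching on it, and subtract the matching size from the number of nodes. The function name `dag_width` and the input format (nodes numbered 0..n-1, edges as pairs) are assumptions made for this illustration, and the input is assumed to be acyclic.

```python
def dag_width(n, edges):
    """Size of a minimum path cover (paths may overlap) of a DAG.

    n:     number of nodes, labelled 0..n-1
    edges: iterable of (u, v) arcs; the graph is assumed to be acyclic
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    # Transitive closure: reach[s] lists all nodes reachable from s.
    reach = []
    for s in range(n):
        seen = set()
        stack = list(adj[s])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        reach.append(sorted(seen))

    # Maximum bipartite matching (augmenting paths) between a "left"
    # copy and a "right" copy of the nodes, using closure arcs.
    match_right = [-1] * n

    def try_augment(u, visited):
        for v in reach[u]:
            if v not in visited:
                visited.add(v)
                if match_right[v] == -1 or try_augment(match_right[v], visited):
                    match_right[v] = u
                    return True
        return False

    matched = sum(1 for u in range(n) if try_augment(u, set()))
    # Each matched pair links two nodes into the same path,
    # so the number of paths is n minus the matching size.
    return n - matched
```

Working on the transitive closure is what allows paths to share nodes. In the DAG with arcs 0→2, 1→2, 2→3, 2→4, two paths (0-2-3 and 1-2-4) suffice, whereas node-disjoint paths would need three.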

    Novel germline variant in the histone demethylase and transcription regulator KDM4C induces a multi-cancer phenotype

    Background: Genes involved in epigenetic regulation are central for chromatin structure and gene expression. Specific mutations in these might promote carcinogenesis in several tissue types. Methods: We used exome, whole-genome and Sanger sequencing to detect rare variants shared by seven affected individuals in a striking early-onset multi-cancer family. The only variant that segregated with malignancy resided in a histone demethylase KDM4C. Consequently, we went on to study the epigenetic landscape of the mutation carriers with ATAC, ChIP (chromatin immunoprecipitation) and RNA-sequencing from lymphoblastoid cell lines to identify possible pathogenic effects. Results: A novel variant in KDM4C, encoding a H3K9me3 histone demethylase and transcription regulator, was found to segregate with malignancy in the family. Based on Roadmap Epigenomics Project data, differentially accessible chromatin regions between the variant carriers and controls enrich to normally H3K9me3-marked chromatin. We could not detect a difference in global H3K9 trimethylation levels. However, carriers of the variant seemed to have more trimethylated H3K9 at transcription start sites. Pathway analyses of ChIP-seq and differential gene expression data suggested that genes regulated through KDM4C interaction partner EZH2 and its interaction partner PLZF are aberrantly expressed in mutation carriers. Conclusions: The apparent dysregulation of H3K9 trimethylation and KDM4C-associated genes in lymphoblastoid cells supports the hypothesis that the KDM4C variant is causative of the multi-cancer susceptibility in the family. As the variant is ultrarare, located in the conserved catalytic JmjC domain and predicted pathogenic by the majority of available in silico tools, further studies on the role of KDM4C in cancer predisposition are warranted. Peer reviewed

    Vitamin C boosts DNA demethylation in TET2 mutation carriers

    Background: Accurate regulation of DNA methylation is necessary for normal cells to differentiate, develop and function. TET2 catalyzes stepwise DNA demethylation in hematopoietic cells. Mutations in the TET2 gene predispose to hematological malignancies by causing DNA methylation overload and aberrant epigenomic landscape. Studies on mice and cell lines show that the function of TET2 is boosted by vitamin C. Thus, by strengthening the demethylation activity of TET2, vitamin C could play a role in the prevention of hematological malignancies in individuals with TET2 dysfunction. We recently identified a family with lymphoma predisposition where a heterozygous truncating germline mutation in TET2 segregated with nodular lymphocyte-predominant Hodgkin lymphoma. The mutation carriers displayed a hypermethylation pattern that was absent in the family members without the mutation. Methods: In a clinical trial of 1 year, we investigated the effects of oral 1 g/day vitamin C supplementation on DNA methylation by analyzing genome-wide DNA methylation and gene expression patterns from the family members. Results: We show that vitamin C reinforces the DNA demethylation cascade, reduces the proportion of hypermethylated loci and diminishes gene expression differences between TET2 mutation carriers and control individuals. Conclusions: These results suggest that vitamin C supplementation increases DNA methylation turnover and provide a basis for further work to examine the potential benefits of vitamin C supplementation in individuals with germline and somatic TET2 mutations. Peer reviewed

    Next-generation sequencing in a large pedigree segregating visceral artery aneurysms suggests potential role of COL4A1/COL4A2 in disease etiology

    Background: Visceral artery aneurysms (VAAs) can be fatal if ruptured. Although a relatively rare incident, it holds a contemporary mortality rate of approximately 12%. VAAs have multiple possible causes, one of which is genetic predisposition. Here, we present a striking family with seven individuals affected by VAAs, and one individual affected by a visceral artery pseudoaneurysm. Methods: We exome sequenced the affected family members and the parents of the proband to find a possible underlying genetic defect. As exome sequencing did not reveal any feasible protein-coding variants, we combined whole-genome sequencing of two individuals with linkage analysis to find a plausible non-coding culprit variant. Variants were ranked by the deep learning framework DeepSEA. Results: Two of seven top-ranking variants, NC_000013.11:g.108154659C>T and NC_000013.11:g.110409638C>T, were found in all VAA-affected individuals, but not in the individual affected by the pseudoaneurysm. The second variant is in a candidate cis-regulatory element in the fourth intron of COL4A2, proximal to COL4A1. Conclusions: As type IV collagens are essential for the stability and integrity of the vascular basement membrane and involved in vascular disease, we conclude that COL4A1 and COL4A2 are strong candidates for VAA susceptibility genes. Peer reviewed