16 research outputs found

    Applying the Positional Burrows–Wheeler Transform to All-Pairs Hamming distance

    Crochemore et al. gave in WABI 2017 an algorithm that, given a set of input strings, finds all pairs of strings whose Hamming distance is at most a given threshold. The proposed algorithm first finds all long enough exact matches between the strings and sorts these into pairs whose coordinates also match. The remaining pairs are then verified against the Hamming distance threshold. The algorithm was shown to run in average-case linear time, under some constraints and assumptions. We show that the Positional Burrows-Wheeler Transform (PBWT) by Durbin (Bioinformatics, 2014) can be used to directly find all exact matches whose coordinates also match. The same structure also extends to verifying the pairs against the Hamming distance threshold, and the same analysis as for the algorithm of Crochemore et al. applies. As a side result, we show how to extend the PBWT to non-binary alphabets. The new operations provided by the PBWT find further applications in tasks similar to those considered here.
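The PBWT machinery the abstract builds on can be sketched for the binary case as a column-by-column stable sort, following Durbin's positional prefix ordering. This is a minimal illustration only, not the authors' non-binary extension; the function name is ours:

```python
def pbwt_orders(haplotypes):
    """For each column k = 0..N, yield the positional prefix array a_k:
    sequence indices sorted by their reversed prefixes x[0..k)."""
    m = len(haplotypes)
    a = list(range(m))                      # a_0: the input order
    yield list(a)
    for k in range(len(haplotypes[0])):
        zeros, ones = [], []
        for i in a:                         # stable partition on column k
            (zeros if haplotypes[i][k] == 0 else ones).append(i)
        a = zeros + ones                    # a_{k+1}
        yield list(a)
```

Sequences adjacent in a_k share the longest matches ending at column k, which is what makes reading off coordinate-matching exact matches direct.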

    Verbatim Implementation of a Fast and Space-Efficient Indexed Pattern Matching Algorithm

    Approximate string matching considers finding a given pattern in a text while allowing some number of differences. In the offline version of the problem, the text is known beforehand and is preprocessed into an indexing data structure. While the problem has received a lot of attention and has many practical uses in bioinformatics, the common tools often do not use the algorithms with the best known time and space complexities. Hence it is interesting to compare the performance of an efficient algorithm to tools that rely on heuristics. In this work, a pattern matching algorithm by T.-W. Lam, W.-K. Sung and S.-S. Wong is described. An implementation of the algorithm is provided and tested against two other tools, namely Erne 2 and readaligner 2012. The algorithm by Lam, Sung and Wong searches the text for the pattern while allowing one mismatch or difference, that is, also allowing character insertion and deletion. It makes use of certain types of compressed suffix arrays and compressed suffix trees that provide fast operations. Additionally, to restrict the search to relevant parts of the suffix tree, a sample is taken from the suffix array and the sampled indices are stored in a data structure that provides double-logarithmic worst-case range queries. To find the pattern in the text while allowing k errors, the algorithm is combined with a dynamic programming algorithm; the latter is used to find partial matches with k - 1 errors. The candidate occurrences are located in the suffix tree, and these branches are used with Lam, Sung and Wong's algorithm. For a text of length n and a pattern of length m drawn from an alphabet of size σ, the time complexity of the algorithm is O(σ^k m^k (k + log log n) + occ) with an O(n (log n)^(1/2) log σ)-bit indexing data structure, where occ is the number of occurrences in the text, given that σ is O(2^((log n)^(1/2))).
For short patterns, this is the best known time complexity with an indexing data structure of the given size. The results indicate that in practice relying on heuristics yields better results in terms of time and memory use. While such an outcome is not remarkable, some important data structures were implemented in the process. An implementation of S. S. Rao's compressed suffix array already existed, but it was rewritten to allow using different supporting data structures, e.g. for rank and select support. The inverse suffix array described by Lam, Sung and Wong was implemented. Also, while implementations of X-fast and Y-fast tries were available, to the author's knowledge a publicly available implementation of these combined with perfect hashing had not been produced before. Moreover, no implementation of Lam, Sung and Wong's algorithm was known to exist before.
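The dynamic-programming component mentioned above can be illustrated with the classic O(mn) semi-global edit-distance scan. This is a simplified stand-in for the combined indexed algorithm, with an illustrative function name:

```python
def approx_find(text, pattern, k):
    """Report end positions j such that pattern matches a substring of
    text ending at j with at most k errors (mismatch, insertion, deletion).
    Classic O(mn) semi-global DP; a stand-in for the indexed algorithm."""
    m = len(pattern)
    prev = list(range(m + 1))       # DP column for the empty text prefix
    hits = []
    for j, c in enumerate(text):
        cur = [0]                   # an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur.append(min(prev[i - 1] + cost,   # match or mismatch
                           prev[i] + 1,          # extra character in the text
                           cur[i - 1] + 1))      # extra character in the pattern
        if cur[m] <= k:
            hits.append(j)          # pattern ends at text[j] with <= k errors
        prev = cur
    return hits
```

For example, `approx_find("abracadabra", "cad", 0)` reports only the exact occurrence ending at position 6.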

    Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

    Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into a set P of disjoint segments such that each segment [a,b] in P has length at least L and the number d(a,b) = |{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized over [a,b] in P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn) time algorithm to solve the problem, improving over an earlier O(mn^2) solution. This improvement makes it possible to apply the algorithm in a pan-genomic setting where the input strings are aligned haplotype sequences of complete human chromosomes, with the goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give experimental evidence on the practicality of the approach in this pan-genomic setting.
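The problem statement can be made concrete with a brute-force dynamic program over segment boundaries. This is a sketch of the definition only, quadratic in n; the paper's contribution is the O(mn) algorithm, which this is not:

```python
def min_segmentation(rows, L):
    """Brute-force solution to the minimum segmentation problem: split
    columns [0, n) into segments of length >= L minimising the maximum
    number d(a, b) of distinct substrings per segment."""
    n = len(rows[0])
    INF = float("inf")
    best = [0] + [INF] * n              # best[j]: optimum for columns [0, j)
    cut = [0] * (n + 1)                 # previous boundary, for traceback
    for j in range(L, n + 1):
        for i in range(0, j - L + 1):   # candidate last segment [i, j)
            if best[i] == INF:
                continue
            d = len({r[i:j] for r in rows})
            if max(best[i], d) < best[j]:
                best[j], cut[j] = max(best[i], d), i
    if best[n] == INF:                  # n < L: no valid segmentation
        return INF, []
    segs, j = [], n
    while j > 0:                        # recover the segment boundaries
        segs.append((cut[j], j))
        j = cut[j]
    return best[n], segs[::-1]
```

On three four-column haplotypes with L = 2, splitting into two segments of two columns can beat the single full-length segment, mirroring how founder blocks reduce the founder count.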

    Evaluating approaches to find exon chains based on long reads

    Transcript prediction can be modeled as a graph problem where exons are modeled as nodes and reads spanning two or more exons are modeled as exon chains. Pacific Biosciences third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions. We survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity/precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph on which the long-read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy. Availability: The simulated data and in-house scripts used for this article are available at http://www.cs.helsinki.fi/group/gsa/exon-chains/exon-chains-bib.tar.bz2.
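The graph model described above can be made concrete with a toy builder that turns exon chains into weighted splicing-graph arcs. The representation and names are illustrative, not the article's pipeline:

```python
from collections import defaultdict

def splicing_graph(exon_chains):
    """Collect splicing-graph arcs from exon chains: each read contributes
    an arc for every pair of consecutive exons it spans, weighted by the
    number of supporting reads."""
    arcs = defaultdict(int)
    for chain in exon_chains:
        for u, v in zip(chain, chain[1:]):
            arcs[(u, v)] += 1
    return dict(arcs)
```

A misaligned read that skips or invents an exon would add a spurious arc here, which is exactly the failure mode the article measures.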

    Linear time minimum segmentation enables scalable founder reconstruction

    Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into a set P of disjoint segments such that each segment [a,b] in P has length at least L and the number d(a,b) = |{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized over [a,b] in P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. Results: We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier O(mn^2) solution. Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences

    Founder reconstruction enables scalable and seamless pangenomic analysis

    Motivation: Variant calling workflows that utilize a single reference sequence are the de facto standard elementary genomic analysis routine for resequencing projects. Various ways to enhance the reference with pangenomic information have been proposed, but scalability combined with seamless integration into existing workflows remains a challenge. Results: We present PanVC with founder sequences, a scalable and accurate variant calling workflow based on a multiple alignment of reference sequences. Scalability is achieved by collapsing duplicate parts, up to a limit, into a founder multiple alignment, which is then indexed using a hybrid scheme that exploits general-purpose read aligners. Our implemented workflow uses GATK or BCFtools for variant calling, but the various steps of our workflow (e.g. the vcf2multialign tool, founder reconstruction) can be of independent interest as a basis for creating novel pangenome analysis workflows beyond variant calling.

    On Founder Segmentations of Aligned Texts

    Aligned texts, such as sequence alignments or multiple sequence alignments, are sets of two or more texts in which corresponding parts have been identified. Typically the corresponding parts are determined so that the number of editing operations, such as replacements, insertions or deletions, needed to transform one text into another is minimised. Such text alignments have a variety of uses when applied to biological sequences. In this dissertation, using a set of aligned texts as the input, we consider a set of related problems concerned with finding segmentations, i.e. splitting the texts into shorter parts by some criteria. Our first aim is to identify equivalent parts of the texts for the purpose of data compression. From the resulting segmentation we would like to construct a smaller number of sequences. These are called founder sequences since, in the case of DNA sequences, they can be seen as those of the founding members of some population. We solve a variant of the problem in which the maximum number of distinct segments over all groups of segments separated by segment boundaries is minimised, given that the segment boundaries occur at the same positions in all input texts. Our algorithm works in linear time and takes the minimum segment length as a parameter. We also adapt the algorithm to process a set of variant records directly, and use the resulting founder sequences to build an index for a genotyping workflow. Our second aim is to find a segmentation in which the text segments are distinct by certain criteria. From such a segmentation, a type of graph can be built on which the problem of offline string matching on a labelled graph can be solved efficiently. The graphs in question are called founder graphs, and they have the property of being repeat-free or semi-repeat-free.
We present algorithms for finding the required segmentation, constructing an index data structure and determining whether a given pattern occurs in the built graph, in either linear or near-linear time. The achieved time complexities make the algorithms relevant for practical purposes. The work is based on five published papers and previously unpublished research. In Paper I, we extend the positional Burrows-Wheeler transform to constant-sized alphabets with more than two characters. The transform is used as an elementary part of the algorithm for segmenting a set of aligned texts for the purpose of generating founder sequences, presented in Paper II. These are incorporated into a genotyping workflow for short reads in Paper III, where we compare the variant calling accuracy to that of other workflows in experiments with both simulated and natural data. In particular, we show that utilising founder sequences this way yields good precision and recall, especially in the case of single-nucleotide variants. In Paper IV, we describe founder graphs and show how to construct them from a set of aligned texts in which the unaligned texts are all of the same length. We also show that if the node labels are repeat-free, i.e. sufficiently unique, the graph admits efficient indexing. We extend the theory of founder graphs in Paper V by showing how to construct an indexable founder graph from a set of aligned texts that also contains insertions and deletions. In this case, we make use of semi-repeat-free founder graphs. Additionally, we show in this dissertation that semi-repeat-free founder graphs admit a type of prefix property. We make use of this property to augment the index generated from a founder graph with path information, for the purpose of identifying, from a set of predefined paths, the path on which a given pattern occurs.
This part of the work has not been published earlier.
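The repeat-free property central to the founder-graph results can be illustrated with a naive check over a gapless alignment. This is a simplification of the dissertation's definition, assuming every occurrence of a node label must start at its own segment's column; the function name is ours:

```python
def is_repeat_free(rows, boundaries):
    """Naive check of the repeat-free property: every distinct segment
    string (node label) may occur in the input texts only at its own
    segment's starting column. boundaries lists segment starts plus n."""
    for a, b in zip(boundaries, boundaries[1:]):
        for label in {r[a:b] for r in rows}:    # the block's node labels
            for r in rows:
                j = r.find(label)
                while j != -1:
                    if j != a:                  # occurrence outside the block
                        return False
                    j = r.find(label, j + 1)
    return True
```

When this property holds, each label pins down a unique position in the graph, which is what makes efficient indexing possible.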