1,208 research outputs found

    Chaining with Overlaps Revisited

    Get PDF
    Chaining algorithms aim to form a semi-global alignment of two sequences based on a set of anchoring local alignments as input. Depending on the optimization criteria and the exact definition of a chain, there are several O(n log n) time algorithms to solve this problem optimally, where n is the number of input anchors. In this paper, we focus on a formulation allowing the anchors to overlap in a chain. This formulation was studied by Shibuya and Kurochkin (WABI 2003), but their algorithm comes with no proof of correctness. We revisit and modify their algorithm to consider a strict definition of precedence relation on anchors, adding the required derivation to convince on the correctness of the resulting algorithm that runs in O(n log2 n) time on anchors formed by exact matches. With the more relaxed definition of precedence relation considered by Shibuya and Kurochkin or when anchors are non-nested such as matches of uniform length (k-mers), the algorithm takes O(n log n) time. We also establish a connection between chaining with overlaps and the widely studied longest common subsequence problem. 2012 ACM Subject Classification Theory of computation ! Pattern matching; Theory of computation ! Dynamic programming; Applied computing ! Genomics.Peer reviewe

    Accurate spliced alignment of long RNA sequencing reads

    Get PDF
    Motivation: Long-read RNA sequencing technologies are establishing themselves as the primary techniques to detect novel isoforms, and many such analyses are dependent on read alignments. However, the error rate and sequencing length of the reads create new challenges for accurately aligning them, particularly around small exons. Results: We present an alignment method uLTRA for long RNA sequencing reads based on a novel two-pass collinear chaining algorithm. We show that uLTRA produces higher accuracy over state-of-the-art aligners with substantially higher accuracy for small exons on simulated and synthetic data. On simulated data, uLTRA achieves an accuracy of about 60% for exons of length 10 nucleotides or smaller and close to 90% accuracy for exons of length between 11 and 20 nucleotides. On biological data where true read location is unknown, we show several examples where uLTRA aligns to known and novel isoforms containing small exons that are not detected with other aligners. While uLTRA obtains its accuracy using annotations, it can also be used as a wrapper around minimap2 to align reads outside annotated regions.Peer reviewe

    Chaining of Maximal Exact Matches in Graphs

    Full text link
    We study the problem of finding maximal exact matches (MEMs) between a query string QQ and a labeled directed acyclic graph (DAG) G=(V,E,)G=(V,E,\ell) and subsequently co-linearly chaining these matches. We show that it suffices to compute MEMs between node labels and QQ (node MEMs) to encode full MEMs. Node MEMs can be computed in linear time and we show how to co-linearly chain them to solve the Longest Common Subsequence (LCS) problem between QQ and GG. Our chaining algorithm is the first to consider a symmetric formulation of the chaining problem in graphs and runs in O(k2V+E+kNlogN)O(k^2|V| + |E| + kN\log N) time, where kk is the width (minimum number of paths covering the nodes) of GG, and NN is the number of node MEMs. We then consider the problem of finding MEMs when the input graph is an indexable elastic founder graph (subclass of labeled DAGs studied by Equi et al., Algorithmica 2022). For arbitrary input graphs, the problem cannot be solved in truly sub-quadratic time under SETH (Equi et al., ICALP 2019). We show that we can report all MEMs between QQ and an indexable elastic founder graph in time O(nH2+m+Mκ)O(nH^2 + m + M_\kappa), where nn is the total length of node labels, HH is the maximum number of nodes in a block of the graph, m=Qm = |Q|, and MκM_\kappa is the number of MEMs of length at least κ\kappa. The results extend to the indexing problem, where the graph is preprocessed and a set of queries is processed as a batch.Comment: 19 pages, 1 figur

    On Computational Small Steps and Big Steps: Refocusing for Outermost Reduction

    Get PDF
    We study the relationship between small-step semantics, big-step semantics and abstract machines, for programming languages that employ an outermost reduction strategy, i.e., languages where reductions near the root of the abstract syntax tree are performed before reductions near the leaves.In particular, we investigate how Biernacka and Danvy's syntactic correspondence and Reynolds's functional correspondence can be applied to inter-derive semantic specifications for such languages.The main contribution of this dissertation is three-fold:First, we identify that backward overlapping reduction rules in the small-step semantics cause the refocusing step of the syntactic correspondence to be inapplicable.Second, we propose two solutions to overcome this in-applicability: backtracking and rule generalization.Third, we show how these solutions affect the other transformations of the two correspondences.Other contributions include the application of the syntactic and functional correspondences to Boolean normalization.In particular, we show how to systematically derive a spectrum of normalization functions for negational and conjunctive normalization

    Co-Linear Chaining on Pangenome Graphs

    Get PDF
    Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. The seed-chain-extend based alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width [Makinen et al., TALG\u2719] and how incorporating gap cost in the scoring function improves alignment accuracy [Chandra and Jain, RECOMB\u2723]. However, it remains open on how to effectively generalize these techniques for general pangenome graphs which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy

    Screening synteny blocks in pairwise genome comparisons through integer programming

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>It is difficult to accurately interpret chromosomal correspondences such as true orthology and paralogy due to significant divergence of genomes from a common ancestor. Analyses are particularly problematic among lineages that have repeatedly experienced whole genome duplication (WGD) events. To compare multiple "subgenomes" derived from genome duplications, we need to relax the traditional requirements of "one-to-one" syntenic matchings of genomic regions in order to reflect "one-to-many" or more generally "many-to-many" matchings. However this relaxation may result in the identification of synteny blocks that are derived from ancient shared WGDs that are not of interest. For many downstream analyses, we need to eliminate weak, low scoring alignments from pairwise genome comparisons. Our goal is to objectively select subset of synteny blocks whose total scores are maximized while respecting the duplication history of the genomes in comparison. We call this "quota-based" screening of synteny blocks in order to appropriately fill a quota of syntenic relationships within one genome or between two genomes having WGD events.</p> <p>Results</p> <p>We have formulated the synteny block screening as an optimization problem known as "Binary Integer Programming" (BIP), which is solved using existing linear programming solvers. The computer program QUOTA-ALIGN performs this task by creating a clear objective function that maximizes the compatible set of synteny blocks under given constraints on overlaps and depths (corresponding to the duplication history in respective genomes). Such a procedure is useful for any pairwise synteny alignments, but is most useful in lineages affected by multiple WGDs, like plants or fish lineages. For example, there should be a 1:2 ploidy relationship between genome A and B if genome B had an independent WGD subsequent to the divergence of the two genomes. We show through simulations and real examples using plant genomes in the rosid superorder that the quota-based screening can eliminate ambiguous synteny blocks and focus on specific genomic evolutionary events, like the divergence of lineages (in cross-species comparisons) and the most recent WGD (in self comparisons).</p> <p>Conclusions</p> <p>The QUOTA-ALIGN algorithm screens a set of synteny blocks to retain only those compatible with a user specified ploidy relationship between two genomes. These blocks, in turn, may be used for additional downstream analyses such as identifying true orthologous regions in interspecific comparisons. There are two major contributions of QUOTA-ALIGN: 1) reducing the block screening task to a BIP problem, which is novel; 2) providing an efficient software pipeline starting from all-against-all BLAST to the screened synteny blocks with dot plot visualizations. Python codes and full documentations are publicly available <url>http://github.com/tanghaibao/quota-alignment</url>. QUOTA-ALIGN program is also integrated as a major component in SynMap <url>http://genomevolution.com/CoGe/SynMap.pl</url>, offering easier access to thousands of genomes for non-programmers.</p

    The Conflict Notion and its Static Detection: a Formal Survey

    Get PDF
    The notion of policy is widely used to enable a flexible control of many systems: access control, privacy, accountability, data base, service, contract , network configuration, and so on. One important feature is to be able to check these policies against contradictions before the enforcement step. This is the problem of the conflict detection which can be done at different steps and with different approaches. This paper presents a review of the principles for conflict detection in related security policy languages. The policy languages, the notions of conflict and the means to detect conflicts are various, hence it is difficult to compare the different principles. We propose an analysis and a comparison of the five static detection principles we found in reviewing more than forty papers of the literature. To make the comparison easier we develop a logical model with four syntactic types of systems covering most of the literature examples. We provide a semantic classification of the conflict notions and thus, we are able to relate the detection principles, the syntactic types and the semantic classification. Our comparison shows the exact link between logical consistency and the conflict notions, and that some detection principles are subject to weaknesses if not used with the right conditions
    corecore