19 research outputs found

    Approximating Longest Common Substring with k mismatches: Theory and Practice

    In the problem of the longest common substring with k mismatches, we are given two strings X, Y and must find the maximal length ℓ such that there is a length-ℓ substring of X and a length-ℓ substring of Y that differ in at most k positions. The length ℓ can be used as a robust measure of similarity between X and Y. In this work, we develop new approximation algorithms for computing ℓ that are significantly more efficient than previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is likely close to optimal, as we demonstrate via a conditional lower bound.
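    For intuition, ℓ follows directly from the definition by brute force: try every pair of starting positions and extend while at most k mismatches have been seen. A minimal Python sketch of this definition (cubic time in the worst case, for exposition only; this is not the paper's approximation algorithm):

        def lcs_k_mismatches(x, y, k):
            # longest common substring of x and y with at most k mismatching positions
            best = 0
            for i in range(len(x)):
                for j in range(len(y)):
                    mismatches, length = 0, 0
                    while i + length < len(x) and j + length < len(y):
                        if x[i + length] != y[j + length]:
                            mismatches += 1
                            if mismatches > k:
                                break
                        length += 1
                        best = max(best, length)
            return best

        print(lcs_k_mismatches("abcde", "axcye", 1))  # 3: "abc" vs "axc" differ in one position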

    Fast and sensitive mapping of nanopore sequencing reads with GraphMap

    Realizing the democratic promise of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. Here we present GraphMap, a mapping algorithm designed to analyse nanopore sequencing reads, which progressively refines candidate alignments to robustly handle potentially high error rates and uses fast graph traversal to align long reads with speed and high precision (>95%). Evaluation on MinION sequencing data sets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by 10–80% and maps >95% of bases. GraphMap alignments enabled single-nucleotide variant calling on the human genome with increased sensitivity (15%) over the next-best mapper, precise detection of structural variants from 100 bp to 4 kbp in length, and species- and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.
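    The seed-based candidate selection at the heart of long-read mappers can be illustrated with a toy sketch: collect exact k-mer seed hits against an indexed reference, then vote by diagonal to pick the candidate region that would subsequently be refined. This is a deliberately simplified illustration, not GraphMap's actual pipeline (GraphMap uses error-tolerant gapped spaced seeds and several refinement stages); all names and parameters below are hypothetical:

        from collections import Counter

        def kmer_index(ref, k=12):
            # index every k-mer of the reference by its position
            index = {}
            for i in range(len(ref) - k + 1):
                index.setdefault(ref[i:i + k], []).append(i)
            return index

        def candidate_region(read, index, k=12):
            # each seed hit votes for a diagonal (ref_pos - read_pos); the
            # best-supported diagonal anchors the candidate alignment
            votes = Counter()
            for j in range(len(read) - k + 1):
                for i in index.get(read[j:j + k], []):
                    votes[i - j] += 1
            return votes.most_common(1)[0][0] if votes else None

        ref = "ACGTTAGCCGATCGATCGGATTACAGGCATCGATCCGGAATT"
        print(candidate_region(ref[10:34], kmer_index(ref)))  # 10: offset where the read maps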

    Tight Conditional Lower Bounds for Longest Common Increasing Subsequence

    We consider the canonical generalization of the well-studied Longest Increasing Subsequence problem to multiple sequences, called k-LCIS: given k integer sequences X_1, ..., X_k of length at most n, the task is to determine the length of the longest common subsequence of X_1, ..., X_k that is also strictly increasing. Especially for the case of k=2 (called LCIS for short), several algorithms have been proposed that require quadratic time in the worst case. Assuming the Strong Exponential Time Hypothesis (SETH), we prove a tight lower bound, specifically, that no algorithm solves LCIS in (strongly) subquadratic time. Interestingly, the proof makes no use of normalization tricks common to hardness proofs for similar problems such as LCS. We further strengthen this lower bound to rule out O((nL)^{1-epsilon}) time algorithms for LCIS, where L denotes the solution size, and to rule out O(n^{k-epsilon}) time algorithms for k-LCIS. We obtain the same conditional lower bounds for the related Longest Common Weakly Increasing Subsequence problem.
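    For reference, here is the classic quadratic-time dynamic program for LCIS (the k=2 case) that the SETH lower bound shows to be essentially optimal; a minimal Python version (illustrative, not taken from the paper):

        def lcis_length(a, b):
            # dp[j] = length of the longest common increasing subsequence of the
            # prefix of a processed so far and b, ending exactly with b[j]
            dp = [0] * len(b)
            for x in a:
                best = 0  # max dp[j] over earlier positions j with b[j] < x
                for j, y in enumerate(b):
                    if y == x:
                        dp[j] = max(dp[j], best + 1)
                    elif y < x:
                        best = max(best, dp[j])
            return max(dp, default=0)

        print(lcis_length([3, 1, 4, 1, 5], [1, 4, 5]))  # 3: the LCIS is [1, 4, 5]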

    Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates

    Many datasets, such as market basket data, text or hypertext documents, and sensor observations recorded in different locations or time periods, are modeled as a collection of sets over a ground set of keys. We are interested in basic aggregates such as the weight or selectivity of keys that satisfy some selection predicate defined over keys' attributes and membership in particular sets. This general formulation includes basic aggregates such as the Jaccard coefficient, Hamming distance, and association rules. On massive data sets, exact computation can be inefficient or infeasible. Sketches based on coordinated random samples are classic summaries that support approximate query processing. Queries are resolved by generating a sketch (sample) of the union of the sets used in the predicate from the sketches of these sets, and then applying an estimator to this union-sketch. We derive novel tighter (unbiased) estimators that leverage sampled keys that are present in the union of the applicable sketches but excluded from the union-sketch. We establish analytically that our estimators dominate estimators applied to the union-sketch for all queries and data sets. Empirical evaluation on synthetic and real data reveals that on typical applications we can expect a 25% to 4-fold reduction in estimation error.
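    As background, the union-sketch baseline that the paper's tighter estimators dominate is easy to state for coordinated bottom-k samples: coordination comes from ranking every key with one shared hash function, and a query such as the Jaccard coefficient is estimated from the k lowest-ranked keys of the union. A minimal Python sketch (illustrative; not the authors' implementation):

        import hashlib

        def rank(key):
            # one shared hash function coordinates the samples across all sets
            return int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:8], "big")

        def bottom_k(keys, k):
            # coordinated bottom-k sketch: the k keys with the smallest ranks
            return set(sorted(keys, key=rank)[:k])

        def jaccard_union_sketch(sk_a, sk_b, k):
            # union-sketch estimator: restrict to the k lowest-ranked keys of the
            # union of the two sketches, count the fraction present in both
            union_sketch = sorted(sk_a | sk_b, key=rank)[:k]
            if not union_sketch:
                return 0.0
            return sum(1 for key in union_sketch if key in sk_a and key in sk_b) / len(union_sketch)

        a, b = set(range(0, 800)), set(range(200, 1000))
        print(jaccard_union_sketch(bottom_k(a, 64), bottom_k(b, 64), 64))  # near 600/1000 = 0.6

    The keys of sk_a | sk_b that fall outside the k lowest ranks of the union are exactly the discarded samples that the paper's estimators put back to work.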

    SUPPLY CHAIN NETWORK DESIGN: RISK-AVERSE VS. RISK-NEUTRAL DECISION MAKING

    Recent events, such as the Heparin tragedy, highlight the necessity for designers and planners of supply chain networks to consider the risk of disruptions despite their low probability of occurrence. One effective way to hedge against supply chain network disruptions is a robustly designed supply chain network. This involves strategic decisions, such as choosing which markets to serve, which suppliers to source from, where to locate plants, and which types of facilities to use, as well as tactical decisions, such as production and capacity allocation. In this dissertation, we focus on models for designing supply chain networks that are resilient to disruptions. We consider two types of decision-making policies. A risk-neutral policy is based on cost minimization: the decision-maker selects the set of decisions that minimizes expected cost. We also consider a risk-averse policy wherein, rather than selecting facilities that minimize expected cost, the decision-maker uses a Conditional Value-at-Risk approach to measure and quantify risk. Such network design problems, however, belong to the class of NP-hard problems. Accordingly, we develop efficient heuristic algorithms and metaheuristic approaches that obtain acceptable solutions in reasonable runtimes, so that the decision-making process is facilitated with at most a moderate reduction in solution quality. Finally, we perform statistical analyses (e.g., logistic regression) to assess the likelihood of selection for each facility. These models allow us to identify the factors that impact facility selection under both the risk-neutral and risk-averse policies.
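    As a concrete illustration of the risk-averse criterion, Conditional Value-at-Risk at level alpha is the expected cost over the worst (1 - alpha) fraction of scenarios. A minimal Python computation over a discrete scenario set, via the Rockafellar-Uryasev identity CVaR = VaR + E[(cost - VaR)+] / (1 - alpha) (illustrative only, not the dissertation's network design model):

        def cvar(costs, probs, alpha=0.95):
            # Value-at-Risk: smallest cost c with P(cost <= c) >= alpha
            scenarios = sorted(zip(costs, probs))
            cum, var = 0.0, scenarios[-1][0]
            for c, p in scenarios:
                cum += p
                if cum >= alpha:
                    var = c
                    break
            # expected excess over VaR, rescaled to the (1 - alpha) tail
            tail = sum(max(c - var, 0.0) * p for c, p in scenarios)
            return var + tail / (1.0 - alpha)

        # three disruption scenarios: routine, moderate, severe
        print(cvar([100.0, 150.0, 500.0], [0.90, 0.08, 0.02]))  # 290.0, vs expected cost 112.0

    The gap between the expected cost (112.0) and the CVaR (290.0) is what separates the facility choices of the risk-neutral and risk-averse decision-makers.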

    Linear-Time Algorithm for Long LCF with k Mismatches

    In the Longest Common Factor with k Mismatches (LCF_k) problem, we are given two strings X and Y of total length n, and we are asked to find a pair of maximal-length factors, one of X and the other of Y, such that their Hamming distance is at most k. Thankachan et al. [2016] showed that this problem can be solved in O(n log^k n) time and O(n) space for constant k. We consider the LCF_k(l) problem, in which we assume that the sought factors have length at least l. We use difference covers to reduce the LCF_k(l) problem with l = Omega(log^{2k+2} n) to a task involving m = O(n/log^{k+1} n) synchronized factors. The latter can be solved in O(m log^{k+1} m) time, which results in a linear-time algorithm for LCF_k(l) with l = Omega(log^{2k+2} n). In general, our solution to the LCF_k(l) problem for arbitrary l takes O(n + n log^{k+1} n / sqrt{l}) time.
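    The difference-cover technique behind the reduction admits a compact statement: D ⊆ {0, ..., m-1} is a difference cover modulo m if every residue modulo m can be written as (a - b) mod m with a, b in D. Sampling positions whose index modulo m lies in D then yields, for any two positions i and j, a common shift h < m at which both are sampled, which is what synchronizes the factors. A small Python check (illustrative only):

        from itertools import product

        def is_difference_cover(d, m):
            # every residue mod m must arise as (a - b) mod m for some a, b in d
            return {(a - b) % m for a, b in product(d, repeat=2)} == set(range(m))

        print(is_difference_cover({0, 1, 3}, 7))  # True: {0, 1, 3} covers all residues mod 7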