19 research outputs found

    Approximating Longest Common Substring with k mismatches: Theory and Practice

    In the problem of the longest common substring with k mismatches, we are given two strings X, Y and must find the maximal length ℓ such that there is a length-ℓ substring of X and a length-ℓ substring of Y that differ in at most k positions. The length ℓ can be used as a robust measure of similarity between X and Y. In this work, we develop new approximation algorithms for computing ℓ that are significantly more efficient than previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is likely close to optimal, as we demonstrate via a conditional lower bound.
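    For intuition, ℓ follows directly from the definition by brute force: try every pair of starting positions and extend while at most k mismatches have been seen. A minimal Python sketch of this definition (cubic time in the worst case, for exposition only; this is not the paper's approximation algorithm):

        def lcs_k_mismatches(x, y, k):
            # longest common substring of x and y with at most k mismatching positions
            best = 0
            for i in range(len(x)):
                for j in range(len(y)):
                    mismatches, length = 0, 0
                    while i + length < len(x) and j + length < len(y):
                        if x[i + length] != y[j + length]:
                            mismatches += 1
                            if mismatches > k:
                                break
                        length += 1
                        best = max(best, length)
            return best

        print(lcs_k_mismatches("abcde", "axcye", 1))  # 3: "abc" vs "axc" differ in one position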

    Fast and sensitive mapping of nanopore sequencing reads with GraphMap

    Realizing the democratic promise of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. Here we present GraphMap, a mapping algorithm designed to analyse nanopore sequencing reads, which progressively refines candidate alignments to robustly handle potentially high error rates and uses fast graph traversal to align long reads with speed and high precision (>95%). Evaluation on MinION sequencing data sets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by 10–80% and maps >95% of bases. GraphMap alignments enabled single-nucleotide variant calling on the human genome with increased sensitivity (15%) over the next-best mapper, precise detection of structural variants from 100 bp to 4 kbp in length, and species- and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.
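    The seed-based candidate selection at the heart of long-read mappers can be illustrated with a toy sketch: collect exact k-mer seed hits against an indexed reference, then vote by diagonal to pick the candidate region that would subsequently be refined. This is a deliberately simplified illustration, not GraphMap's actual pipeline (GraphMap uses error-tolerant gapped spaced seeds and several refinement stages); all names and parameters below are hypothetical:

        from collections import Counter

        def kmer_index(ref, k=12):
            # index every k-mer of the reference by its position
            index = {}
            for i in range(len(ref) - k + 1):
                index.setdefault(ref[i:i + k], []).append(i)
            return index

        def candidate_region(read, index, k=12):
            # each seed hit votes for a diagonal (ref_pos - read_pos); the
            # best-supported diagonal anchors the candidate alignment
            votes = Counter()
            for j in range(len(read) - k + 1):
                for i in index.get(read[j:j + k], []):
                    votes[i - j] += 1
            return votes.most_common(1)[0][0] if votes else None

        ref = "ACGTTAGCCGATCGATCGGATTACAGGCATCGATCCGGAATT"
        print(candidate_region(ref[10:34], kmer_index(ref)))  # 10: offset where the read maps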

    Tight Conditional Lower Bounds for Longest Common Increasing Subsequence

    We consider the canonical generalization of the well-studied Longest Increasing Subsequence problem to multiple sequences, called k-LCIS: given k integer sequences X_1, ..., X_k of length at most n, the task is to determine the length of the longest common subsequence of X_1, ..., X_k that is also strictly increasing. Especially for the case of k=2 (called LCIS for short), several algorithms have been proposed that require quadratic time in the worst case. Assuming the Strong Exponential Time Hypothesis (SETH), we prove a tight lower bound, specifically, that no algorithm solves LCIS in (strongly) subquadratic time. Interestingly, the proof makes no use of normalization tricks common to hardness proofs for similar problems such as LCS. We further strengthen this lower bound to rule out O((nL)^{1-epsilon}) time algorithms for LCIS, where L denotes the solution size, and to rule out O(n^{k-epsilon}) time algorithms for k-LCIS. We obtain the same conditional lower bounds for the related Longest Common Weakly Increasing Subsequence problem.
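    For reference, here is the classic quadratic-time dynamic program for LCIS (the k=2 case) that the SETH lower bound shows to be essentially optimal; a minimal Python version (illustrative, not taken from the paper):

        def lcis_length(a, b):
            # dp[j] = length of the longest common increasing subsequence of the
            # prefix of a processed so far and b, ending exactly with b[j]
            dp = [0] * len(b)
            for x in a:
                best = 0  # max dp[j] over earlier positions j with b[j] < x
                for j, y in enumerate(b):
                    if y == x:
                        dp[j] = max(dp[j], best + 1)
                    elif y < x:
                        best = max(best, dp[j])
            return max(dp, default=0)

        print(lcis_length([3, 1, 4, 1, 5], [1, 4, 5]))  # 3: the LCIS is [1, 4, 5]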

    Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates

    Many datasets, such as market basket data, text or hypertext documents, and sensor observations recorded in different locations or time periods, are modeled as a collection of sets over a ground set of keys. We are interested in basic aggregates such as the weight or selectivity of keys that satisfy some selection predicate defined over keys' attributes and membership in particular sets. This general formulation includes basic aggregates such as the Jaccard coefficient, Hamming distance, and association rules. On massive data sets, exact computation can be inefficient or infeasible. Sketches based on coordinated random samples are classic summaries that support approximate query processing. Queries are resolved by generating a sketch (sample) of the union of the sets used in the predicate from the sketches of these sets, and then applying an estimator to this union-sketch. We derive novel tighter (unbiased) estimators that leverage sampled keys that are present in the union of the applicable sketches but excluded from the union-sketch. We establish analytically that our estimators dominate estimators applied to the union-sketch for all queries and data sets. Empirical evaluation on synthetic and real data reveals that on typical applications we can expect a 25% to 4-fold reduction in estimation error.
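    As background, the union-sketch baseline that the paper's tighter estimators dominate is easy to state for coordinated bottom-k samples: coordination comes from ranking every key with one shared hash function, and a query such as the Jaccard coefficient is estimated from the k lowest-ranked keys of the union. A minimal Python sketch (illustrative; not the authors' implementation):

        import hashlib

        def rank(key):
            # one shared hash function coordinates the samples across all sets
            return int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:8], "big")

        def bottom_k(keys, k):
            # coordinated bottom-k sketch: the k keys with the smallest ranks
            return set(sorted(keys, key=rank)[:k])

        def jaccard_union_sketch(sk_a, sk_b, k):
            # union-sketch estimator: restrict to the k lowest-ranked keys of the
            # union of the two sketches, count the fraction present in both
            union_sketch = sorted(sk_a | sk_b, key=rank)[:k]
            if not union_sketch:
                return 0.0
            return sum(1 for key in union_sketch if key in sk_a and key in sk_b) / len(union_sketch)

        a, b = set(range(0, 800)), set(range(200, 1000))
        print(jaccard_union_sketch(bottom_k(a, 64), bottom_k(b, 64), 64))  # near 600/1000 = 0.6

    The keys of sk_a | sk_b that fall outside the k lowest ranks of the union are exactly the discarded samples that the paper's estimators put back to work.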

    SUPPLY CHAIN NETWORK DESIGN: RISK-AVERSE VS. RISK-NEUTRAL DECISION MAKING

    Recent events, such as the Heparin tragedy, highlight the necessity for designers and planners of supply chain networks to consider the risk of disruptions despite their low probability of occurrence. One effective way to hedge against supply chain network disruptions is a robustly designed supply chain network. This involves strategic decisions, such as choosing which markets to serve, which suppliers to source from, where to locate plants, and which types of facilities to use, as well as tactical decisions, such as production and capacity allocation. In this dissertation, we focus on models for designing supply chain networks that are resilient to disruptions. We consider two types of decision-making policies. A risk-neutral policy is based on cost minimization: the decision-maker selects the set of decisions that minimizes expected cost. We also consider a risk-averse policy wherein, rather than selecting facilities that minimize expected cost, the decision-maker uses a Conditional Value-at-Risk approach to measure and quantify risk. Such network design problems, however, belong to the class of NP-hard problems. Accordingly, we develop efficient heuristic algorithms and metaheuristic approaches that obtain acceptable solutions in reasonable runtimes, so that the decision-making process is facilitated with at most a moderate reduction in solution quality. Finally, we perform statistical analyses (e.g., logistic regression) to assess the likelihood of selection for each facility. These models allow us to identify the factors that impact facility selection under both the risk-neutral and risk-averse policies.
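    As a concrete illustration of the risk-averse criterion, Conditional Value-at-Risk at level alpha is the expected cost over the worst (1 - alpha) fraction of scenarios. A minimal Python computation over a discrete scenario set, via the Rockafellar-Uryasev identity CVaR = VaR + E[(cost - VaR)+] / (1 - alpha) (illustrative only, not the dissertation's network design model):

        def cvar(costs, probs, alpha=0.95):
            # Value-at-Risk: smallest cost c with P(cost <= c) >= alpha
            scenarios = sorted(zip(costs, probs))
            cum, var = 0.0, scenarios[-1][0]
            for c, p in scenarios:
                cum += p
                if cum >= alpha:
                    var = c
                    break
            # expected excess over VaR, rescaled to the (1 - alpha) tail
            tail = sum(max(c - var, 0.0) * p for c, p in scenarios)
            return var + tail / (1.0 - alpha)

        # three disruption scenarios: routine, moderate, severe
        print(cvar([100.0, 150.0, 500.0], [0.90, 0.08, 0.02]))  # 290.0, vs expected cost 112.0

    The gap between the expected cost (112.0) and the CVaR (290.0) is what separates the facility choices of the risk-neutral and risk-averse decision-makers.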

    Linear-Time Algorithm for Long LCF with k Mismatches

    In the Longest Common Factor with k Mismatches (LCF_k) problem, we are given two strings X and Y of total length n, and we are asked to find a pair of maximal-length factors, one of X and the other of Y, such that their Hamming distance is at most k. Thankachan et al. [2016] showed that this problem can be solved in O(n log^k n) time and O(n) space for constant k. We consider the LCF_k(l) problem, in which we assume that the sought factors have length at least l. We use difference covers to reduce the LCF_k(l) problem with l = Omega(log^{2k+2} n) to a task involving m = O(n/log^{k+1} n) synchronized factors. The latter can be solved in O(m log^{k+1} m) time, which results in a linear-time algorithm for LCF_k(l) with l = Omega(log^{2k+2} n). In general, our solution to the LCF_k(l) problem for arbitrary l takes O(n + n log^{k+1} n / sqrt{l}) time.
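    The difference-cover technique behind the reduction admits a compact statement: D ⊆ {0, ..., m-1} is a difference cover modulo m if every residue modulo m can be written as (a - b) mod m with a, b in D. Sampling positions whose index modulo m lies in D then yields, for any two positions i and j, a common shift h < m at which both are sampled, which is what synchronizes the factors. A small Python check (illustrative only):

        from itertools import product

        def is_difference_cover(d, m):
            # every residue mod m must arise as (a - b) mod m for some a, b in d
            return {(a - b) % m for a, b in product(d, repeat=2)} == set(range(m))

        print(is_difference_cover({0, 1, 3}, 7))  # True: {0, 1, 3} covers all residues mod 7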