    Dynamic Approximate All-Pairs Shortest Paths: Breaking the O(mn) Barrier and Derandomization

    We study dynamic $(1+\epsilon)$-approximation algorithms for the all-pairs shortest paths problem in unweighted undirected $n$-node $m$-edge graphs under edge deletions. The fastest algorithm for this problem is a randomized algorithm with a total update time of $\tilde O(mn/\epsilon)$ and constant query time by Roditty and Zwick [FOCS 2004]. The fastest deterministic algorithm is from a 1981 paper by Even and Shiloach [JACM 1981]; it has a total update time of $O(mn^2)$ and constant query time. We improve these results as follows: (1) We present an algorithm with a total update time of $\tilde O(n^{5/2}/\epsilon)$ and constant query time that has an additive error of $2$ in addition to the $1+\epsilon$ multiplicative error. This beats the previous $\tilde O(mn/\epsilon)$ time when $m=\Omega(n^{3/2})$. Note that the additive error is unavoidable since, even in the static case, an $O(n^{3-\delta})$-time (a so-called truly subcubic) combinatorial algorithm with $1+\epsilon$ multiplicative error cannot have an additive error less than $2-\epsilon$, unless we make a major breakthrough for Boolean matrix multiplication [Dor et al. FOCS 1996] and many other long-standing problems [Vassilevska Williams and Williams FOCS 2010]. The algorithm can also be turned into a $(2+\epsilon)$-approximation algorithm (without an additive error) with the same time guarantees, improving the recent $(3+\epsilon)$-approximation algorithm with $\tilde O(n^{5/2+O(\sqrt{\log{(1/\epsilon)}/\log n})})$ running time of Bernstein and Roditty [SODA 2011] in terms of both approximation and time guarantees. (2) We present a deterministic algorithm with a total update time of $\tilde O(mn/\epsilon)$ and a query time of $O(\log\log n)$. The algorithm has a multiplicative error of $1+\epsilon$ and gives the first improved deterministic algorithm since 1981. It also answers an open question raised by Bernstein [STOC 2013]. Comment: A preliminary version was presented at the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS 2013).
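The crossover point quoted in this abstract can be checked by comparing the two total update times directly. A short derivation, ignoring the polylogarithmic factors hidden by $\tilde O$:

```latex
\[
\frac{n^{5/2}}{\epsilon} \le \frac{mn}{\epsilon}
\iff n^{5/2} \le mn
\iff m \ge n^{3/2}.
\]
```

For the densest graphs, $m = \Theta(n^2)$, the older bound becomes $n^3/\epsilon$, so the gain is a factor of about $n^{1/2}$.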

    Lower bound techniques for data structures

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 135-143). We describe new techniques for proving lower bounds on data-structure problems, with the following broad consequences:
    * the first $\Omega(\lg n)$ lower bound for any dynamic problem, improving on a bound that had been standing since 1989;
    * for static data structures, the first separation between linear and polynomial space. Specifically, for some problems that have constant query time when polynomial space is allowed, we can show $\Omega(\lg n / \lg\lg n)$ bounds when the space is $O(n \cdot \mathrm{polylog}\, n)$.
    Using these techniques, we analyze a variety of central data-structure problems, and obtain improved lower bounds for the following:
    * the partial-sums problem (a fundamental application of augmented binary search trees);
    * the predecessor problem (which is equivalent to IP lookup in Internet routers);
    * dynamic trees and dynamic connectivity;
    * orthogonal range stabbing;
    * orthogonal range counting, and orthogonal range reporting;
    * the partial match problem (searching with wild-cards);
    * $(1+\epsilon)$-approximate near neighbor on the hypercube;
    * approximate nearest neighbor in the $\ell_\infty$ metric.
    Our new techniques lead to surprisingly non-technical proofs. For several problems, we obtain simpler proofs for bounds that were already known. By Mihai Pǎtraşcu. Ph.D.

    A Nearly-Linear Time Algorithm for Linear Programs with Small Treewidth: A Multiscale Representation of Robust Central Path

    Arising from structural graph theory, treewidth has become a focus of study in fixed-parameter tractable algorithms in various communities including combinatorics, integer-linear programming, and numerical analysis. Many NP-hard problems are known to be solvable in $\widetilde{O}(n \cdot 2^{O(\mathrm{tw})})$ time, where $\mathrm{tw}$ is the treewidth of the input graph. Analogously, many problems in P should be solvable in $\widetilde{O}(n \cdot \mathrm{tw}^{O(1)})$ time; however, due to the lack of appropriate tools, only a few such results are currently known. [Fom+18] conjectured this to hold as broadly as all linear programs; in our paper, we show this is true: Given a linear program of the form $\min_{Ax=b,\ \ell \leq x \leq u} c^{\top} x$, and a width-$\tau$ tree decomposition of a graph $G_A$ related to $A$, we show how to solve it in time $\widetilde{O}(n \cdot \tau^2 \log(1/\varepsilon))$, where $n$ is the number of variables and $\varepsilon$ is the relative accuracy. Combined with recent techniques in vertex-capacitated flow [BGS21], this leads to an algorithm with $\widetilde{O}(n \cdot \mathrm{tw}^2 \log(1/\varepsilon))$ run-time. Besides being the first of its kind, our algorithm has run-time nearly matching the fastest run-time for solving the sub-problem $Ax=b$ (under the assumption that no fast matrix multiplication is used). We obtain these results by combining recent techniques in interior-point methods (IPMs), sketching, and a novel representation of the solution under a multiscale basis similar to the wavelet basis.

    Feature Extraction and Duplicate Detection for Text Mining: A Survey

    Text mining, also known as Intelligent Text Analysis, is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature extraction is one of the important techniques in data reduction to discover the most important features. Processing massive amounts of data stored in an unstructured form is a challenging task. Several pre-processing methods and algorithms are needed to extract useful features from huge amounts of data. The survey covers different text summarization, classification, and clustering methods to discover useful features, and also the discovery of query facets, which are multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user. When dealing with a collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (remove and replace duplicates). The survey presents existing text mining techniques to extract relevant features, detect duplicates, and replace the duplicate data to deliver fine-grained knowledge to the user.

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining a large number of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which implements smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up.
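The spaced-seed idea mentioned in this abstract can be illustrated with a minimal Python sketch. The function name `spaced_kmers`, the mask `1101`, and the example reads are assumptions chosen for illustration; they are not taken from Seed-Kraken or Kraken. In the mask, `1` marks a position that contributes to the seed and `0` a position that is ignored.

```python
def spaced_kmers(read: str, mask: str) -> list[str]:
    """Return the spaced k-mers of `read` under `mask` (hypothetical helper).

    `mask` is a string of '1' (position kept) and '0' (position ignored).
    One spaced k-mer is produced for every window of len(mask) characters.
    """
    if not mask or set(mask) - {"0", "1"}:
        raise ValueError("mask must be a non-empty string of '0' and '1'")
    # Offsets inside each window whose characters are kept.
    kept = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)
    return [
        "".join(read[start + i] for i in kept)
        for start in range(len(read) - span + 1)
    ]


if __name__ == "__main__":
    print(spaced_kmers("ACGTAC", "1101"))  # ['ACT', 'CGA', 'GTC']
    # A substitution at the ignored position does not change the seed.
    print(spaced_kmers("ACGT", "1101") == spaced_kmers("ACTT", "1101"))  # True
```

A substitution that falls on a `0` position leaves the extracted seed unchanged, which is the property that makes spaced seeds tolerant of isolated mismatches compared with contiguous k-mers.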