191 research outputs found

    Fast and Compact Regular Expression Matching

    We study four problems in string matching: regular expression matching, approximate regular expression matching, string edit distance, and subsequence indexing. We work on a standard word RAM model of computation that allows logarithmic-sized words to be manipulated in constant time. For each problem, we show how to improve the space and/or remove a dependency on the alphabet size, using either an improved tabulation technique of an existing algorithm or a new combination of known algorithms.

    Dictionaries Revisited

    Dictionaries are probably the most well-studied class of data structures. A dictionary supports insertions, deletions, membership queries, and usually successor, predecessor, and extract-min. Given their centrality to both the theory and practice of data structures, surprisingly basic questions about them remain unsolved and sometimes even unposed. This talk focuses on questions that arise from the disparity between the way large-scale dictionaries are analyzed and the way they are used in practice.

    Insertion Sort is O(n log n)

    Traditional Insertion Sort runs in O(n^2) time because each insertion takes O(n) time. When people run Insertion Sort in the physical world, they leave gaps between items to accelerate insertions. Gaps help in computers as well. This paper shows that Gapped Insertion Sort has insertion times of O(log n) with high probability, yielding a total running time of O(n log n) with high probability. Comment: 6 pages, LaTeX. In Proceedings of the Third International Conference on Fun With Algorithms, FUN 200
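
    The O(n) cost per insertion in the traditional algorithm comes from shifting elements to open a slot. The sketch below counts those shifts. It illustrates only the baseline that the abstract starts from, not the paper's Gapped Insertion Sort, and the function name is hypothetical.

```python
def insertion_sort_with_shift_count(items):
    """Sort a copy of `items` with traditional insertion sort.

    Returns (sorted_list, shifts), where `shifts` counts how many
    elements were moved one slot to the right to make room.
    """
    a = list(items)
    shifts = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift every larger element right until the slot for `key` opens.
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            shifts += 1
            j -= 1
        a[j + 1] = key
    return a, shifts
```

    On a reverse-sorted input of n items, every insertion shifts all earlier items, for n(n-1)/2 shifts in total. Leaving gaps between items is what lets an insertion stop after moving only a few of them.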

    Improved Distortion and Spam Resistance for PageRank

    For a directed graph $G = (V,E)$, a ranking function, such as PageRank, provides a way of mapping elements of $V$ to non-negative real numbers so that nodes can be ordered. Brin and Page argued that the stationary distribution, $R(G)$, of a random walk on $G$ is an effective ranking function for queries on an idealized web graph. However, $R(G)$ is not defined for all $G$, and in particular, it is not defined for the real web graph. Thus, they introduced PageRank to approximate $R(G)$ for graphs $G$ with ergodic random walks while being defined on all graphs. PageRank is defined as a random walk on a graph, where with probability $(1-\epsilon)$, a random out-edge is traversed, and with \emph{reset probability} $\epsilon$ the random walk instead restarts at a node selected using a \emph{reset vector} $\hat{r}$. Originally, $\hat{r}$ was taken to be uniform on the nodes, and we call this version UPR. In this paper, we introduce graph-theoretic notions of quality for ranking functions, specifically \emph{distortion} and \emph{spam resistance}. We show that UPR has high distortion and low spam resistance and we show how to select an $\hat{r}$ that yields low distortion and high spam resistance. Comment: 36 page
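
    The random walk with a reset vector described above can be approximated by power iteration. The following is a minimal sketch, not the paper's method for choosing the reset vector. The function name, the adjacency-list input, the default reset probability of 0.15, and the handling of nodes without out-edges (the walk restarts via the reset vector) are all assumptions made for illustration.

```python
def pagerank(out_edges, reset_vector, epsilon=0.15, iterations=100):
    """Approximate PageRank by power iteration.

    out_edges[u]   : list of nodes that u links to (nodes are 0..n-1)
    reset_vector[v]: probability of restarting at v; entries must sum to 1
    epsilon        : reset probability
    """
    n = len(out_edges)
    rank = list(reset_vector)  # any probability distribution works as a start
    for _ in range(iterations):
        # With probability epsilon the walk restarts via the reset vector.
        nxt = [epsilon * r for r in reset_vector]
        for u, targets in enumerate(out_edges):
            mass = (1 - epsilon) * rank[u]
            if targets:
                # Otherwise a uniformly random out-edge is traversed.
                share = mass / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Assumption: a node with no out-edges also restarts.
                for v in range(n):
                    nxt[v] += mass * reset_vector[v]
        rank = nxt
    return rank


# Uniform reset vector (the UPR variant) on a 3-node cycle.
cycle = [[1], [2], [0]]
uniform_ranks = pagerank(cycle, [1 / 3, 1 / 3, 1 / 3])

# Reset vector concentrated on node 0 in a 2-node cycle.
biased_ranks = pagerank([[1], [0]], [1.0, 0.0], iterations=200)
```

    In the second call all reset mass goes to node 0, so node 0 ends up ranked above node 1 even though the graph is symmetric. This shows how the choice of reset vector shapes the resulting ranking.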

    GPU LSM: A Dynamic Dictionary Data Structure for the GPU

    We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query with average rates of 75 M, 32 M, and 23 M queries/s, respectively. The trade-off for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU. Comment: 11 pages, accepted to appear in the Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS'18
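
    To make the LSM idea concrete, here is a small sequential sketch in Python. It is not the paper's GPU implementation. The class name `TinyLSM`, the fixed batch size, the doubling level sizes, and the tombstone marker for deletions are illustrative assumptions about how an LSM-style dictionary can be organized.

```python
TOMBSTONE = object()  # hypothetical marker meaning "this key was deleted"


def _merge(newer, older):
    """Merge two key-sorted runs; on equal keys the newer entry comes first."""
    out, i, j = [], 0, 0
    while i < len(newer) and j < len(older):
        if newer[i][0] <= older[j][0]:
            out.append(newer[i])
            i += 1
        else:
            out.append(older[j])
            j += 1
    out.extend(newer[i:])
    out.extend(older[j:])
    return out


def _first_index(level, key):
    """Binary search for the first entry in `level` whose key is >= key."""
    lo, hi = 0, len(level)
    while lo < hi:
        mid = (lo + hi) // 2
        if level[mid][0] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo


class TinyLSM:
    def __init__(self, batch_size=4):
        self.b = batch_size
        self.levels = []  # levels[i] is None or a sorted run of b * 2**i entries

    def update_batch(self, batch):
        """Apply exactly b (key, value) updates; value TOMBSTONE deletes a key."""
        if len(batch) != self.b:
            raise ValueError("batch must contain exactly batch_size entries")
        # Stable sort of the reversed batch keeps later updates first per key.
        run = sorted(reversed(batch), key=lambda kv: kv[0])
        i = 0
        # Cascade merges through full levels, like carries in a binary counter.
        while i < len(self.levels) and self.levels[i] is not None:
            run = _merge(run, self.levels[i])
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(run)
        else:
            self.levels[i] = run

    def lookup(self, key):
        """Return the newest value stored for key, or None if absent or deleted."""
        for level in self.levels:  # smaller levels hold more recent updates
            if level is None:
                continue
            pos = _first_index(level, key)
            if pos < len(level) and level[pos][0] == key:
                value = level[pos][1]
                return None if value is TOMBSTONE else value
        return None
```

    Updates touch only a few sorted runs instead of rewriting one large sorted array, while a lookup may have to search several levels. This mirrors the trade-off the abstract reports between update rate and retrieval rate.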

    Tight Bounds for Monotone Minimal Perfect Hashing

    The monotone minimal perfect hash function (MMPHF) problem is the following indexing problem. Given a set $S = \{s_1,\ldots,s_n\}$ of $n$ distinct keys from a universe $U$ of size $u$, create a data structure $DS$ that answers the following query: $\mathrm{RankOp}(q) = \text{rank of } q \text{ in } S$ for all $q \in S$, and an arbitrary answer otherwise. Solutions to the MMPHF problem are in widespread use in both theory and practice. The best upper bound known for the problem encodes $DS$ in $O(n\log\log\log u)$ bits and performs queries in $O(\log u)$ time. It has been an open problem to either improve the space upper bound or to show that this somewhat odd looking bound is tight. In this paper, we show the latter: specifically that any data structure (deterministic or randomized) for monotone minimal perfect hashing of any collection of $n$ elements from a universe of size $u$ requires $\Omega(n \cdot \log\log\log u)$ expected bits to answer every query correctly. We achieve our lower bound by defining a graph $\mathbf{G}$ where the nodes are the possible $\binom{u}{n}$ inputs and where two nodes are adjacent if they cannot share the same $DS$. The size of $DS$ is then lower bounded by the log of the chromatic number of $\mathbf{G}$. Finally, we show that the fractional chromatic number (and hence the chromatic number) of $\mathbf{G}$ is lower bounded by $2^{\Omega(n \log\log\log u)}$.
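
    The query semantics can be illustrated with a naive baseline that simply stores the sorted keys. This is not an MMPHF construction from the paper: storing the keys explicitly costs about n log u bits, far more than the bound discussed above. The function names and the 0-based rank convention are assumptions for illustration.

```python
import bisect


def build_rank_structure(keys):
    """Naive data structure: the keys themselves, in sorted order."""
    return sorted(keys)


def rank_op(ds, q):
    """Return the 0-based rank of q among the stored keys.

    For q not in the key set the result is an arbitrary answer (here, the
    position where q would be inserted), which the problem statement allows.
    """
    return bisect.bisect_left(ds, q)


ds = build_rank_structure([30, 10, 20])
ranks = [rank_op(ds, q) for q in (10, 20, 30)]
```

    A real MMPHF saves space precisely because it may answer arbitrarily on non-members, so it does not need to store enough information to recognize which keys belong to the set.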