186 research outputs found

    Math Search for the Masses: Multimodal Search Interfaces and Appearance-Based Retrieval

    Full text link
    We summarize math search engines and search interfaces produced by the Document and Pattern Recognition Lab in recent years, and in particular the min math search interface and the Tangent search engine. Source code for both systems are publicly available. "The Masses" refers to our emphasis on creating systems for mathematical non-experts, who may be looking to define unfamiliar notation, or browse documents based on the visual appearance of formulae rather than their mathematical semantics.Comment: Paper for Invited Talk at 2015 Conference on Intelligent Computer Mathematics (July, Washington DC

    Pattern Masking for Dictionary Matching:Theory and Practice

    Get PDF
    Data masking is a common technique for sanitizing sensitive data maintained in database systems which is becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem: given a dictionary D of d strings, each of length ℓ, a query string q of length ℓ, and a positive integer z, we are asked to compute a smallest set K⊆{1, …, ℓ}, so that if q[i] is replaced by a wildcard for all i∈K, then q matches at least z strings from D. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for binary strings. We thus approach the problem from a more practical perspective. We show a combinatorial O((dℓ)|K|/3+dℓ)-time and O(dℓ)-space algorithm for PMDM for |K|=O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails (Abboud et al. in SIAM J Comput 47:2527–2555, 2018; Lincoln et al., in: 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018). Our combinatorial algorithm, executed with small |K|, is the backbone of a greedy heuristic that we propose. Our experiments on real-world and synthetic datasets show that our heuristic finds nearly-optimal solutions in practice and is also very efficient. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in D. PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z=1, one obtains the minimal number of mismatches of q with any string from D. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time O(2ℓd). We present a data structure for PMDM that answers queries over D in time O(2ℓ/2(2ℓ/2+τ)ℓ) and requires space O(2ℓd2/τ2+2ℓ/2d), for any parameter τ∈[1, d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017]. This gives a polynomial-time O(d1/4+ϵ)-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture. This is an extended version of a paper that was presented at International Symposium on Algorithms and Computation (ISAAC) 2021

    SoK: Cryptographically Protected Database Search

    Full text link
    Protected database search systems cryptographically isolate the roles of reading from, writing to, and administering the database. This separation limits unnecessary administrator access and protects data in the case of system breaches. Since protected search was introduced in 2000, the area has grown rapidly; systems are offered by academia, start-ups, and established companies. However, there is no best protected search system or set of techniques. Design of such systems is a balancing act between security, functionality, performance, and usability. This challenge is made more difficult by ongoing database specialization, as some users will want the functionality of SQL, NoSQL, or NewSQL databases. This database evolution will continue, and the protected search community should be able to quickly provide functionality consistent with newly invented databases. At the same time, the community must accurately and clearly characterize the tradeoffs between different approaches. To address these challenges, we provide the following contributions: 1) An identification of the important primitive operations across database paradigms. We find there are a small number of base operations that can be used and combined to support a large number of database paradigms. 2) An evaluation of the current state of protected search systems in implementing these base operations. This evaluation describes the main approaches and tradeoffs for each base operation. Furthermore, it puts protected search in the context of unprotected search, identifying key gaps in functionality. 3) An analysis of attacks against protected search for different base queries. 4) A roadmap and tools for transforming a protected search system into a protected database, including an open-source performance evaluation platform and initial user opinions of protected search.Comment: 20 pages, to appear to IEEE Security and Privac

    Towards Optimal Approximate Streaming Pattern Matching by Matching Multiple Patterns in Multiple Streams

    Get PDF
    Recently, there has been a growing focus in solving approximate pattern matching problems in the streaming model. Of particular interest are the pattern matching with k-mismatches (KMM) problem and the pattern matching with w-wildcards (PMWC) problem. Motivated by reductions from these problems in the streaming model to the dictionary matching problem, this paper focuses on designing algorithms for the dictionary matching problem in the multi-stream model where there are several independent streams of data (as opposed to just one in the streaming model), and the memory complexity of an algorithm is expressed using two quantities: (1) a read-only shared memory storage area which is shared among all the streams, and (2) local stream memory that each stream stores separately. In the dictionary matching problem in the multi-stream model the goal is to preprocess a dictionary D={P_1,P_2,...,P_d} of d=|D| patterns (strings with maximum length m over alphabet Sigma) into a data structure stored in shared memory, so that given multiple independent streaming texts (where characters arrive one at a time) the algorithm reports occurrences of patterns from D in each one of the texts as soon as they appear. We design two efficient algorithms for the dictionary matching problem in the multi-stream model. The first algorithm works when all the patterns in D have the same length m and costs O(d log m) words in shared memory, O(log m log d) words in stream memory, and O(log m) time per character. The second algorithm works for general D, but the time cost per character becomes O(log m+log d log log d). We also demonstrate the usefulness of our first algorithm in solving both the KMM problem and PMWC problem in the streaming model. In particular, we obtain the first almost optimal (up to poly-log factors) algorithm for the PMWC problem in the streaming model. We also design a new algorithm for the KMM problem in the streaming model that, up to poly-log factors, has the same bounds as the most recent results that use different techniques. Moreover, for most inputs, our algorithm for KMM is significantly faster on average

    Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution

    Full text link
    Cloud-based data analysis is nowadays common practice because of the lower system management overhead as well as the pay-as-you-go pricing model. The pricing model, however, is not always suitable for query processing as heavy use results in high costs. For example, in query-as-a-service systems, where users are charged per processed byte, collections of queries accessing the same data frequently can become expensive. The problem is compounded by the limited options for the user to optimize query execution when using declarative interfaces such as SQL. In this paper, we show how, without modifying existing systems and without the involvement of the cloud provider, it is possible to significantly reduce the overhead, and hence the cost, of query-as-a-service systems. Our approach is based on query rewriting so that multiple concurrent queries are combined into a single query. Our experiments show the aggregated amount of work done by the shared execution is smaller than in a query-at-a-time approach. Since queries are charged per byte processed, the cost of executing a group of queries is often the same as executing a single one of them. As an example, we demonstrate how the shared execution of the TPC-H benchmark is up to 100x and 16x cheaper in Amazon Athena and Google BigQuery than using a query-at-a-time approach while achieving a higher throughput

    Replicode: A Constructivist Programming Paradigm and Language

    Get PDF
    Replicode is a language designed to encode short parallel programs and executable models, and is centered on the notions of extensive pattern-matching and dynamic code production. The language is domain independent and has been designed to build systems that are modelbased and model-driven, as production systems that can modify their own code. More over, Replicode supports the distribution of knowledge and computation across clusters of computing nodes. This document describes Replicode and its executive, i.e. the system that executes Replicode constructions. The Replicode executive is meant to run on Linux 64 bits and Windows 7 32/64 bits platforms and interoperate with custom C++ code. The motivations for the Replicode language, the constructivist paradigm it rests on, and the higher-level AI goals targeted by its construction, are described by Thórisson (2012), Nivel and Thórisson (2009), and Thórisson and Nivel (2009a, 2009b). An overview presents the main concepts of the language. Section 3 describes the general structure of Replicode objects and describes pattern matching. Section 4 describes the execution model of Replicode and section 5 describes how computation and knowledge are structured and controlled. Section 6 describes the high-level reasoning facilities offered by the system. Finally, section 7 describes how the computation is distributed over a cluster of computing nodes. Consult Annex 1 for a formal definition of Replicode, Annex 2 for a specification of the executive, Annex 3 for the specification of the executable code format (r-code) and its C++ API, and Annex 4 for the definition of the Replicode Extension C++ API
    corecore