    An algebraic approach to the prefix model analysis of binary trie structures and set intersection algorithms

    The trie, or digital tree, is a standard data structure for representing sets of strings over a given finite alphabet. Since Knuth's original work (1973), these data structures have been extensively studied and analyzed. In this paper, we present an algebraic approach to the analysis of the average storage and average time required by the retrieval algorithms of trie structures under the prefix model. This approach extends the work of Flajolet et al. for other models which, unlike the prefix model, assume that no key in a sample set is a prefix of another. As the main application, we analyze the average running time of two algorithms for computing set intersections.
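
    To make the data structure concrete, here is a minimal Python sketch (our own illustration, not the paper's algorithms) of a trie supporting insertion, plus a set intersection computed by descending two tries in lockstep. Note that marking keys at internal nodes is what lets a trie represent sets where one key is a prefix of another, the situation the prefix model allows.

        class TrieNode:
            def __init__(self):
                self.children = {}   # symbol -> TrieNode
                self.is_key = False  # True if a key ends at this node

        def insert(root, key):
            """Insert a string into the trie, one symbol per level."""
            node = root
            for symbol in key:
                node = node.children.setdefault(symbol, TrieNode())
            node.is_key = True

        def intersect(a, b, prefix=""):
            """Yield keys stored in both tries by descending them in lockstep."""
            if a.is_key and b.is_key:
                yield prefix
            for symbol in a.children.keys() & b.children.keys():
                yield from intersect(a.children[symbol], b.children[symbol],
                                     prefix + symbol)

    For example, inserting {"ab", "abc"} into one trie and {"abc", "b"} into another makes intersect yield only "abc"; the traversal visits only the shared prefix structure rather than comparing whole key sets.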

    bdbms -- A Database Management System for Biological Data

    Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs to include: (1) annotation and provenance management, including storage, indexing, manipulation, and querying of annotation and provenance as first-class objects in bdbms; (2) local dependency tracking, to track the dependencies and derivations among data items; (3) update authorization, to support data curation via content-based authorization, in contrast to identity-based authorization; and (4) new access methods and their supporting operators for pattern matching on various types of compressed biological data. This paper presents the design of bdbms along with the techniques proposed to support these functionalities, including an extension to SQL. We also outline some open issues in building bdbms. Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works, and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA.
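
    As a purely illustrative sketch (all names hypothetical, not bdbms's actual SQL extension or API), the Python below models annotation and provenance as first-class attributes of a data item and propagates them through a derivation, the behaviour that bdbms's annotation management and dependency tracking provide inside the DBMS itself.

        from dataclasses import dataclass, field

        @dataclass
        class Item:
            value: object
            annotations: list = field(default_factory=list)  # curator notes
            provenance: list = field(default_factory=list)   # source identifiers
            depends_on: list = field(default_factory=list)   # upstream items

        def derive(value, sources, note):
            """Build an item derived from others, inheriting their provenance."""
            inherited = [p for s in sources for p in s.provenance]
            return Item(value, annotations=[note],
                        provenance=inherited, depends_on=list(sources))

    If an upstream item is later corrected, the depends_on links identify every derived item that may need re-curation, which is the point of local dependency tracking.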

    An Improved Algorithm for Finding the Shortest Synchronizing Words

    A synchronizing word of a deterministic finite complete automaton is a word whose action maps every state to a single one. Finding a shortest or a short synchronizing word is a central computational problem in the theory of synchronizing automata and is applied in other areas such as model-based testing and the theory of codes. Because the problem of finding a shortest synchronizing word is computationally hard, among exact algorithms only exponential ones are known. We redesign the previously fastest known exact algorithm, based on bidirectional breadth-first search, and improve it with respect to time and space in a practical sense. We develop new algorithmic enhancements and adapt the algorithm to multithreaded and GPU computing. Our experiments show that the new algorithm is multiple times faster than the previously fastest one and that its advantage quickly grows with the hardness of the problem instance. Given a modest time limit, we compute the lengths of the shortest synchronizing words for random binary automata up to 570 states, significantly beating the previous record. We refine the experimental estimation of the average reset threshold of these automata. Finally, we develop a general computational package devoted to the problem, which includes an efficient and practical implementation of our algorithm together with several well-known heuristics. Comment: Full version of the ESA 2022 paper.
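
    The paper's algorithm is a tuned bidirectional breadth-first search; as a baseline for comparison, here is a plain forward BFS over subsets of states in Python (our own sketch, with delta given as one transition row per letter). It returns a shortest synchronizing word and uses exponential space in the worst case, which is why engineering the search matters.

        from collections import deque

        def shortest_sync_word(n_states, delta):
            """Shortest word mapping every state to a single one, or None."""
            start = frozenset(range(n_states))
            seen = {start: ""}
            queue = deque([start])
            while queue:
                subset = queue.popleft()
                if len(subset) == 1:
                    return seen[subset]
                for letter, row in enumerate(delta):
                    image = frozenset(row[q] for q in subset)
                    if image not in seen:
                        seen[image] = seen[subset] + str(letter)
                        queue.append(image)
            return None

        # Cerny automaton with 4 states: letter 0 cycles, letter 1 merges 0 into 1.
        delta = [[1, 2, 3, 0], [1, 1, 2, 3]]
        print(shortest_sync_word(4, delta))  # a shortest word of length (4-1)^2 = 9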

    Proceedings of Workshop on Quantum Computing and Quantum Information

    Exact sampling and optimisation in statistical machine translation

    In Statistical Machine Translation (SMT), inference needs to be performed over a high-complexity discrete distribution defined by the intersection between a translation hypergraph and a target language model. This distribution is too complex to be represented exactly, and one typically resorts to approximation techniques either to perform optimisation (the task of searching for the optimum translation) or sampling (the task of finding a subset of translations that is statistically representative of the goal distribution). Beam-search is an example of an approximate optimisation technique, where maximisation is performed over a heuristically pruned representation of the goal distribution. For inference tasks other than optimisation, rather than finding a single optimum, one is really interested in obtaining a set of probabilistic samples from the distribution. This is the case in training, where one wishes to obtain unbiased estimates of expectations in order to fit the parameters of a model. Samples are also necessary in consensus decoding, where one chooses from a sample of likely translations the one that minimises a loss function. Due to the additional computational challenges posed by sampling, n-best lists, a by-product of optimisation, are typically used as a biased approximation to true probabilistic samples. A more direct procedure is to attempt to draw samples directly from the underlying distribution rather than rely on n-best list approximations. Markov Chain Monte Carlo (MCMC) methods, such as Gibbs sampling, offer a way to overcome the tractability issues in sampling; however, their convergence properties are hard to assess. That is, it is difficult to know when, if ever, an MCMC sampler is producing samples that are compatible with the goal distribution. Rejection sampling, a Monte Carlo (MC) method, is more fundamental and natural; it offers strong guarantees, such as unbiased samples, but is typically hard to design for distributions of the kind addressed in SMT, rendering the method intractable. A recent technique that stresses a unified view between the two types of inference tasks discussed here, optimisation and sampling, is the OS* approach. OS* can be seen as a cross between Adaptive Rejection Sampling (an MC method) and A* optimisation. In this view the intractable goal distribution is upperbounded by a simpler (thus tractable) proxy distribution, which is then incrementally refined to be closer to the goal until the maximum is found, or until the sampling performance exceeds a certain level. This thesis introduces an approach to exact optimisation and exact sampling in SMT by addressing the tractability issues associated with the intersection between the translation hypergraph and the language model. The two forms of inference are handled in a unified framework based on the OS* approach. In short, an intractable goal distribution, over which one wishes to perform inference, is upperbounded by tractable proposal distributions. A proposal represents a relaxed version of the complete space of weighted translation derivations, where relaxation happens with respect to the incorporation of the language model. These proposals give an optimistic view on the true model and allow for easier and faster search using standard dynamic programming techniques. In the OS* approach, such proposals are used to perform a form of adaptive rejection sampling. In rejection sampling, samples are drawn from a proposal distribution and accepted or rejected as a function of the mismatch between the proposal and the goal. The technique is adaptive in that rejected samples are used to motivate a refinement of the upperbound proposal that brings it closer to the goal, improving the rate of acceptance. Optimisation can be connected to an extreme form of sampling, thus the framework introduced here suits both exact optimisation and exact sampling. Exact optimisation means that the global maximum is found with a certificate of optimality. Exact sampling means that unbiased samples are independently drawn from the goal distribution. We show that by using this approach exact inference is feasible using only a fraction of the time and space that would be required by a full intersection, without recourse to pruning techniques that only provide approximate solutions. We also show that the vast majority of the entries (n-grams) in a language model can be summarised by shorter and optimistic entries. This means that the computational complexity of our approach is less sensitive to the order of the language model distribution than a full intersection would be. Particularly in the case of sampling, we show that it is possible to draw exact samples compatible with distributions which incorporate a high-order language model component from proxy distributions that are much simpler. In this thesis, exact inference is performed in the context of both hierarchical and phrase-based models of translation, the latter characterising a problem that is NP-complete in nature. EThOS - Electronic Theses Online Service, United Kingdom.
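
    As a minimal illustration of the rejection step (our own toy sketch, far from the thesis's hypergraph and language-model machinery), the Python below draws exact samples from an unnormalised goal distribution through a proposal that upper-bounds it; every rejection in OS* is additionally used to tighten the proposal, raising the acceptance rate over time.

        import random

        def rejection_sample(goal, draw, proposal_pdf, bound):
            """One exact sample from goal, given goal(x) <= bound * proposal_pdf(x)."""
            while True:
                x = draw()
                if random.random() < goal(x) / (bound * proposal_pdf(x)):
                    return x  # accepted: an unbiased draw from the goal

        # Toy goal over three "derivations"; a uniform proposal upper-bounds it
        # with the constant bound = 1.8, since the max weight 0.6 <= 1.8 * (1/3).
        weights = {"a": 0.6, "ab": 0.3, "abc": 0.1}
        items = list(weights)
        sample = rejection_sample(weights.get, lambda: random.choice(items),
                                  lambda x: 1 / len(items), bound=1.8)

    A loose bound wastes draws; refining the proposal after each rejection, as OS* does, is what keeps exact sampling affordable.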

    Proceedings of the Eindhoven FASTAR Days 2004 : Eindhoven, The Netherlands, September 3-4, 2004

    The Eindhoven FASTAR Days (EFD) 2004 were organized by the Software Construction group of the Department of Mathematics and Computer Science at the Technische Universiteit Eindhoven. On September 3rd and 4th 2004, over thirty participants, hailing from the Czech Republic, Finland, France, The Netherlands, Poland and South Africa, gathered at the Department to attend the EFD. The EFD were organized in connection with the research on finite automata by the FASTAR Research Group, which is centered in Eindhoven and at the University of Pretoria, South Africa. FASTAR (Finite Automata Systems: Theoretical and Applied Research) is an international research group that aims to lead in all areas related to finite state systems. The work in FASTAR includes both core and applied parts of this field. The EFD therefore focused on the field of finite automata, with an emphasis on practical aspects and applications. Eighteen presentations, mostly on subjects within this field, were given by researchers as well as students from participating universities and industrial research facilities. This report contains the proceedings of the conference, in the form of papers for twelve of the presentations at the EFD. Most of them were initially reviewed and distributed as handouts during the EFD. After the EFD took place, the papers were revised for publication in these proceedings. We would like to thank the participants for their attendance and presentations, which made the EFD 2004 as successful as they were. Based on this success, it is our intention to make the EFD into a recurring event. Eindhoven, December 2004. Loek Cleophas, Bruce W. Watson.

    Function-specific schemes for verifiable computation

    An integral component of modern computing is the ability to outsource data and computation to powerful remote servers, for instance, in the context of cloud computing or remote file storage. While participants can benefit from this interaction, a fundamental security issue that arises is that of integrity of computation: How can the end-user be certain that the result of a computation over the outsourced data has not been tampered with (not even by a compromised or adversarial server)? Cryptographic schemes for verifiable computation address this problem by accompanying each result with a proof that can be used to check the correctness of the performed computation. Recent advances in the field have led to the first implementations of schemes that can verify arbitrary computations. However, in practice the overhead of these general-purpose constructions remains prohibitive for most applications, with proof computation times (at the server) on the order of minutes or even hours for real-world problem instances. A different approach for designing such schemes targets specific types of computation and builds custom-made protocols, sacrificing generality for efficiency. An important representative of this function-specific approach is an authenticated data structure (ADS), where a specialized protocol is designed that supports query types associated with a particular outsourced dataset. This thesis presents three novel ADS constructions for the important query types of set operations, multi-dimensional range search, and pattern matching, and proves their security under cryptographic assumptions over bilinear groups. The scheme for set operations can support nested queries (e.g., two unions followed by an intersection of the results), extending previous works that only accommodate a single operation. The range search ADS provides an exponential (in the number of attributes in the dataset) asymptotic improvement over previous schemes for storage and computation costs. Finally, the pattern matching ADS supports text pattern and XML path queries with minimal cost, e.g., the overhead at the server is less than 4% compared to simply computing the result, for all our tested settings. The experimental evaluation of all three constructions shows significant improvements in proof-computation time over general-purpose schemes.
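
    The constructions above rely on bilinear groups; as a far simpler stand-in that shows the same client/server pattern, the Python sketch below (ours, not one of the thesis's schemes) uses a Merkle tree as an authenticated data structure for set membership: the server answers a query together with sibling hashes, and the client verifies them against a trusted root digest.

        import hashlib

        def h(data: bytes) -> bytes:
            return hashlib.sha256(data).digest()

        def build_levels(leaves):
            """All levels of a Merkle tree (leaf count assumed a power of two)."""
            level, levels = [h(x) for x in leaves], []
            while True:
                levels.append(level)
                if len(level) == 1:
                    return levels
                level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

        def prove(levels, i):
            """Sibling hashes from leaf i to the root: the server's proof."""
            proof = []
            for level in levels[:-1]:
                proof.append((level[i ^ 1], i % 2 == 1))  # (sibling, sibling-is-left)
                i //= 2
            return proof

        def verify(root, leaf, proof):
            """Client-side check: recompute the root from the answer and proof."""
            digest = h(leaf)
            for sibling, sibling_is_left in proof:
                digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
            return digest == root

        levels = build_levels([b"alice", b"bob", b"carol", b"dave"])
        assert verify(levels[-1][0], b"bob", prove(levels, 1))

    The proof has logarithmic size in the dataset, and the client stores only the root digest; the thesis's bilinear-group constructions achieve richer query types (nested set operations, range search, pattern matching) under the same accompany-the-result-with-a-proof pattern.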