
    A Call to Arms: Revisiting Database Design

    Good database design is crucial to obtaining a sound, consistent database, and, in turn, good database design methodologies are the best way to achieve the right design. These methodologies are taught to most Computer Science undergraduates as part of any Introduction to Database class. They can be considered part of the "canon", and indeed, the overall approach to database design has been unchanged for years. Moreover, none of the major database research assessments identify database design as a strategic research direction. Should we conclude that database design is a solved problem? Our thesis is that database design remains a critical unsolved problem. Hence, it should be the subject of more research. Our starting point is the observation that traditional database design is not used in practice, and if it were used, it would result in designs that are not well adapted to current environments. In short, database design has failed to keep up with the times. In this paper, we put forth arguments to support our viewpoint, analyze the root causes of this situation and suggest some avenues of research.
    Comment: Removed spurious column break. Nothing else was changed.

    Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes

    Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string over S. A "secondary index" for x answers alphabet range queries of the form: Given a range [a_l; a_r] over S, return the set I_{[a_l; a_r]} = {i | x_i \in [a_l; a_r]}. Secondary indexes are heavily used in relational databases and scientific data analysis. It is well known that the obvious solution, storing a dictionary for the position set associated with each character, does not always give optimal query time. In this paper we give the first theoretically optimal data structure for the secondary indexing problem. In the I/O model, the amount of data read when answering a query is within a constant factor of the minimum space needed to represent I_{[a_l; a_r]}, assuming that the size of internal memory is (|S| log n)^{delta} blocks, for some constant delta > 0. The space usage of the data structure is O(n log |S|) bits in the worst case, and we further show how to bound the size of the data structure in terms of the 0-th order entropy of x. We show how to support updates achieving various time-space trade-offs. We also consider an approximate version of the basic secondary indexing problem where a query reports a superset of I_{[a_l; a_r]} containing each element not in I_{[a_l; a_r]} with probability at most epsilon, where epsilon > 0 is the false positive probability. For this problem the amount of data that needs to be read by the query algorithm is reduced to O(|I_{[a_l; a_r]}| log(1/epsilon)) bits.
    Comment: 16 pages
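    To make the problem statement concrete, the following is a minimal sketch of the "obvious solution" the abstract mentions: a dictionary that stores the position set for each character. It is not the paper's optimal data structure. The function names `build_naive_index` and `range_query` are hypothetical, positions are 1-based to match x_1 ... x_n, and Python's default character ordering is assumed to be the order on S.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_naive_index(x):
    """Map each character of x to the sorted list of 1-based positions where it occurs.

    This is the baseline "dictionary of position sets", not the paper's structure.
    """
    positions = defaultdict(list)
    for i, ch in enumerate(x, start=1):
        positions[ch].append(i)  # positions are appended in increasing order
    return dict(positions)


def range_query(index, a_l, a_r):
    """Return I_[a_l; a_r] = {i | x_i in [a_l; a_r]} as a sorted list."""
    chars = sorted(index)             # characters that actually occur in x
    lo = bisect_left(chars, a_l)      # first character >= a_l
    hi = bisect_right(chars, a_r)     # one past the last character <= a_r
    result = []
    for ch in chars[lo:hi]:
        result.extend(index[ch])      # one list is read per character in the range
    return sorted(result)


if __name__ == "__main__":
    index = build_naive_index("banana")
    print(range_query(index, "a", "b"))
```

    A query over a wide range touches one position list per character in the range and then merges them, which illustrates why this baseline does not always give optimal query time.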

    Generic Subsequence Matching Framework: Modularity, Flexibility, Efficiency

    Subsequence matching has appeared to be an ideal approach for solving many problems related to the fields of data mining and similarity retrieval. It has been shown that almost any data class (audio, image, biometrics, signals) is or can be represented by some kind of time series or string of symbols, which can be seen as an input for various subsequence matching approaches. The variety of data types, specific tasks and their partial or full solutions is so wide that the choice, implementation and parametrization of a suitable solution for a given task might be complicated and time-consuming; a possibly fruitful combination of fragments from different research areas may be neither obvious nor easy to realize. The leading authors of this field also mention the implementation bias that makes a proper comparison of competing approaches difficult. Therefore we present a new generic Subsequence Matching Framework (SMF) that tries to overcome the aforementioned problems with a uniform frame that simplifies and speeds up the design, development and evaluation of subsequence matching related systems. We identify several relatively separate subtasks that are solved differently across the literature, and SMF makes it possible to combine them in a straightforward manner, achieving new quality and efficiency. This framework can be used in many application domains and its components can be reused effectively. Its strictly modular architecture and openness also allow efficient solutions from different fields to be incorporated, for instance efficient metric-based indexes. This is an extended version of a paper published at DEXA 2012.