Figure 1. Comparison of the number of matches
Text data is prevalent in life. Some of this data is uncertain and is best modeled by probability distributions. Examples include biological sequence data and automatic ECG annotations, among others. Approximate substring matching over uncertain texts is largely an unexplored problem in data management. In this paper, we study this intriguing question. We propose a semantics called (k, τ)-matching queries and argue that it is more suitable in this context than a related semantics that has been proposed previously. Since uncertainty incurs considerable overhead on indexing as well as the final verification for a match, we devise techniques for both. For indexing, we propose a multilevel filtering technique based on measuring signature distance; for verification, we design two algorithms that give upper and lower bounds and significantly reduce the costs. We validate our algorithms with a systematic evaluation on two real-world datasets and some synthetic datasets.
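To make the (k, τ)-matching idea concrete, here is a minimal brute-force sketch. It rests on assumptions the abstract does not state: a character-level uncertainty model with independent positions, and a reading of the semantics in which a substring matches when the probability that its edit distance to the pattern is at most k reaches τ. The function names are hypothetical, and the enumeration of possible worlds is exponential, which is exactly the verification cost the paper's bounding algorithms aim to avoid.

```python
from itertools import product


def edit_distance(a, b):
    """Classic Levenshtein distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def match_probability(pattern, uncertain, k):
    """P(edit_distance(pattern, S) <= k), where S is drawn from `uncertain`.

    `uncertain` is a list of dicts, one per position, mapping a character
    to its probability (assumed independent across positions).
    """
    total = 0.0
    for world in product(*(pos.items() for pos in uncertain)):
        text = "".join(ch for ch, _ in world)
        prob = 1.0
        for _, p in world:
            prob *= p
        if edit_distance(pattern, text) <= k:
            total += prob
    return total


def is_k_tau_match(pattern, uncertain, k, tau):
    """Assumed (k, tau) test: the match probability must reach tau."""
    return match_probability(pattern, uncertain, k) >= tau


# Hypothetical uncertain substring: the middle base is C or G.
substring = [{"A": 1.0}, {"C": 0.7, "G": 0.3}, {"T": 1.0}]
print(is_k_tau_match("ACT", substring, k=0, tau=0.8))
print(is_k_tau_match("ACT", substring, k=1, tau=0.8))
```

With k = 0 only the world "ACT" counts, so the substring misses the 0.8 threshold; allowing one edit admits "AGT" as well and the substring qualifies.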
Top-K Queries on Uncertain Data: On Score Distribution and Typical Answers
Uncertain data arises in a number of domains, including data integration and sensor networks. Top-k queries that rank results according to some user-defined score are an important tool for exploring large uncertain data sets. As several recent papers have observed, the semantics of top-k queries on uncertain data can be ambiguous due to tradeoffs between reporting high-scoring tuples and tuples with a high probability of being in the resulting data set. In this paper, we demonstrate the need to present the score distribution of top-k vectors to allow the user to choose between results along the score-probability dimension. One option would be to display the complete distribution of all potential top-k tuple vectors, but this set is too large to compute. Instead, we propose to provide a number of typical vectors that effectively sample this distribution. We propose efficient algorithms to compute these vectors. We also extend the semantics and algorithms to the scenario of score ties, which is not dealt with in previous work in the area. Our work includes a systematic empirical study on both real and synthetic datasets. National Natural Science Foundation (Grant number IIS-0086057); National Natural Science Foundation (Grant number IIS-0325838); National Natural Science Foundation (Grant number IIS-0448124)
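The score distribution of top-k vectors can be illustrated with a small brute-force sketch. It assumes a tuple-level uncertainty model with independent existence probabilities and takes the score of a top-k vector to be the sum of its tuple scores; neither detail is stated in the abstract, and the function name is hypothetical. Enumerating all possible worlds is exponential in the number of tuples, which is why the complete distribution is described above as too large to compute in general.

```python
from collections import defaultdict
from itertools import product


def topk_score_distribution(tuples, k):
    """Map each possible top-k total score to its probability.

    `tuples` is a list of (score, existence_probability) pairs, assumed
    independent. Worlds with fewer than k existing tuples are skipped.
    """
    dist = defaultdict(float)
    for present in product([True, False], repeat=len(tuples)):
        prob = 1.0
        scores = []
        for (score, p), here in zip(tuples, present):
            prob *= p if here else 1.0 - p
            if here:
                scores.append(score)
        if prob == 0.0 or len(scores) < k:
            continue  # impossible world, or no full top-k vector
        total = sum(sorted(scores, reverse=True)[:k])
        dist[total] += prob
    return dict(dist)


# Hypothetical data: one high-scoring but unlikely tuple, two certain ones.
readings = [(10, 0.5), (8, 1.0), (5, 1.0)]
print(topk_score_distribution(readings, k=2))
```

In this toy input the top-2 total is 18 when the uncertain tuple exists and 13 when it does not, each with probability 0.5, which is the kind of score-versus-probability tradeoff a user would need to see.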
Accuracy-Aware Uncertain Stream Databases
Previous work has introduced probability distributions as first-class components in uncertain stream database systems. A missing element is how accurate these probability distributions are, which has a profound impact on the accuracy of query results presented to end users. While there is some previous work that studies unreliable intermediate query results in the tuple uncertainty model, to the best of our knowledge, we are the first to consider an uncertain stream database in which accuracy is taken into consideration all the way from the learned distributions based on raw data samples to the query results. We perform an initial study of various components in an accuracy-aware uncertain stream database system, including the representation of accuracy information and how to obtain the accuracy of query results. In addition, we propose novel predicates based on hypothesis testing for decision-making using data with limited accuracy. We augment our study with a comprehensive set of experimental evaluations. I. INTRODUCTION Recent research has extended stream databases to handle uncertain data in order to meet the requirements of ever-increasing applications in sensor networks and ubiquitous computing (e.g., Where do we obtain the probabilities in the first place? In many applications, probability distributions are learned from observations and measurements, a.k.a. samples. Such applications include sensor networks, ubiquitous computing, and scientific databases. Let us look at an example. Example 1 (accuracy of learned probability distributions). A few projects in both academia and industry (e.g., the CarTel project at MIT [24
Answering Aggregation Queries in a Secure System Model
As more sensitive data is captured in electronic form, security becomes more and more important. Data encryption is the main technique for achieving security. While in the past enterprises were hesitant to implement database encryption because of the very high cost, complexity, and performance degradation, they now have to face the ever-growing risk of data theft as well as emerging legislative requirements. Data encryption can be done at multiple tiers within the enterprise. Different choices on where to encrypt the data offer different security features that protect against different attacks. One class of attack that needs to be taken seriously is the compromise of the database server, its software or administrator. A secure way to address this threat is for a DBMS to directly process queries on the ciphertext, without decryption. We conduct a comprehensive study on answering SUM and AVG aggregation queries in such a system model by using a secure homomorphic encryption scheme in a novel way. We demonstrate that the performance of such a solution is comparable to a traditional symmetric encryption scheme (e.g., DES) in which each value is decrypted and the computation is performed on the plaintext. Clearly this traditional encryption scheme is not a viable solution to the problem because the server must have access to the secret key and the plaintext, which violates our system model and security requirements. We study the problem in the setting of a read-optimized DBMS for data warehousing applications, in which SUM and AVG are frequent and crucial
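The idea of computing SUM and AVG directly on ciphertext can be sketched with an additively homomorphic scheme. The abstract does not name the scheme it uses, so the example below substitutes a textbook Paillier construction purely for illustration; the tiny primes, function names, and salary values are all hypothetical, and the code is not secure.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair (tiny primes, for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    mu = pow(lam, -1, n)  # valid because g = n + 1 (needs Python 3.8+)
    return (n, g), (lam, mu)


def encrypt(pub, m):
    n, g = pub
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def encrypted_sum(pub, ciphertexts):
    """Server side: multiply ciphertexts, which adds the plaintexts."""
    n, _ = pub
    acc = 1
    for c in ciphertexts:
        acc = (acc * c) % (n * n)
    return acc


pub, priv = keygen()
salaries = [120, 45, 300]                     # hypothetical column values
column = [encrypt(pub, v) for v in salaries]  # what the server stores
total = decrypt(pub, priv, encrypted_sum(pub, column))  # client decrypts
print(total, total / len(salaries))
```

The server only multiplies ciphertexts and never sees the key or the plaintext; the client decrypts one value and, for AVG, divides by the row count. The plaintext sum must stay below n for the result to be correct.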
Figure 1. Illustration of a typical ECG signal and R-R intervals [31].
Online Windowed Subsequence Matching over Probabilistic Sequences
Windowed subsequence matching over deterministic strings has been studied in previous work in the contexts of knowledge discovery, data mining, and molecular biology. However, we observe that in these applications, as well as in data stream monitoring, complex event processing, and time series data processing in which streams can be mapped to strings, the strings are often noisy and probabilistic. We study this problem in the online setting where efficiency is paramount. We first formulate the query semantics and propose an exact algorithm. Then we propose a randomized approximation algorithm that is faster and, at the same time, provably accurate. Moreover, we devise a filtering algorithm to further enhance efficiency, with an optimization technique that adapts to the contents of the sequence stream. Finally, we propose algorithms for patterns with negations. To verify the algorithms, we conduct a systematic empirical study using three real datasets and some synthetic datasets.
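One way to picture exact windowed subsequence matching over a probabilistic sequence is the small dynamic program below. It is a sketch under assumptions the abstract does not spell out: each stream position carries an independent distribution over symbols, and a window matches when the probability that the pattern occurs in it as a subsequence reaches a threshold tau. The function names are hypothetical, and this is not claimed to be the paper's algorithm.

```python
def subsequence_probability(pattern, window):
    """P(pattern is a subsequence of a string drawn from `window`).

    `window` is a list of dicts mapping symbol -> probability, assumed
    independent per position. dist[j] tracks the probability that a
    greedy matcher has consumed exactly j pattern symbols so far; greedy
    matching is sufficient for deciding subsequence containment.
    """
    m = len(pattern)
    dist = [0.0] * (m + 1)
    dist[0] = 1.0
    for pos in window:
        new = [0.0] * (m + 1)
        new[m] = dist[m]  # a completed match stays completed
        for j in range(m):
            hit = pos.get(pattern[j], 0.0)
            new[j] += dist[j] * (1.0 - hit)  # next symbol not seen here
            new[j + 1] += dist[j] * hit      # next symbol seen, advance
        dist = new
    return dist[m]


def windowed_matches(pattern, stream, w, tau):
    """Start indices of length-w windows whose match probability >= tau."""
    return [i for i in range(len(stream) - w + 1)
            if subsequence_probability(pattern, stream[i:i + w]) >= tau]


# Hypothetical noisy stream of four positions.
stream = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}, {"b": 1.0}, {"c": 1.0}]
print(windowed_matches("ab", stream, w=3, tau=0.7))
```

This version recomputes every window from scratch at a cost of O(w · m) per window; an online setting would want incremental updates and filtering, which is where the approximation and filtering techniques described above come in.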