165 research outputs found

A Framework for High-Accuracy Privacy-Preserving Mining

Author: Agrawal Shipra
Haritsa Jayant R.
Publication date: 01/01/2004

To preserve client privacy in the data mining process, a variety of techniques based on random perturbation of data records have been proposed recently. In this paper, we present a generalized matrix-theoretic model of random perturbation, which facilitates a systematic approach to the design of perturbation mechanisms for privacy-preserving mining. Specifically, we demonstrate that (a) the prior techniques differ only in their settings for the model parameters, and (b) through appropriate choice of parameter settings, we can derive new perturbation techniques that provide highly accurate mining results even under strict privacy guarantees. We also propose a novel perturbation mechanism wherein the model parameters are themselves characterized as random variables, and demonstrate that this feature provides significant improvements in privacy at a very marginal cost in accuracy. While our model is valid for random-perturbation-based privacy-preserving mining in general, we specifically evaluate its utility here with regard to frequent-itemset mining on a variety of real datasets. The experimental results indicate that our mechanisms incur substantially lower identity and support errors as compared to the prior techniques.
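As a rough illustration of what a matrix view of random perturbation can look like, the sketch below uses the classic randomized-response scheme on a single boolean attribute. This is not the mechanism proposed in the paper; the function names, the retention probability `p = 0.9`, and the counts are assumptions chosen for the example. Each entry `A[v][u]` is the probability that an original value `u` is reported as `v`, so the expected perturbed counts are `A` times the original counts, and the miner can estimate the original counts by applying the inverse of `A`.

```python
def perturbation_matrix(p):
    """2x2 matrix for randomized response: keep the value with
    probability p, flip it with probability 1 - p (hypothetical setup).
    A[v][u] = P(reported value is v | original value is u)."""
    return [[p, 1 - p], [1 - p, p]]


def apply_matrix(a, x):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [a[0][0] * x[0] + a[0][1] * x[1],
            a[1][0] * x[0] + a[1][1] * x[1]]


def invert_2x2(a):
    """Invert a 2x2 matrix; fails when p = 0.5, where nothing is recoverable."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("perturbation matrix is singular")
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


# Illustrative counts: 700 records with value 0, 300 with value 1.
original = [700.0, 300.0]
A = perturbation_matrix(0.9)

# Expected counts the miner would observe after perturbation.
expected_perturbed = apply_matrix(A, original)

# Reconstruction: apply the inverse matrix to the observed counts.
estimate = apply_matrix(invert_2x2(A), expected_perturbed)
```

With real data the observed counts are random rather than exactly equal to their expectation, so the reconstructed counts are estimates whose error depends on the sample size and on how close the matrix is to singular.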

arXiv.org e-Print Archive

CiteSeerX

Open Access Repository of IISc Research Publications

Providing Diversity in K-Nearest Neighbor Query Results

Author: Haritsa Jayant R.
Jain Anoop
Sarda Parag
Publication date: 15/10/2003

Given a point query Q in multi-dimensional space, K-Nearest Neighbor (KNN) queries return the K closest answers in the database with respect to Q, according to a given distance metric. In this scenario, it is possible that a majority of the answers may be very similar to each other, especially when the data has clusters. For a variety of applications, such homogeneous result sets may not add value to the user. In this paper, we consider the problem of providing diversity in the results of KNN queries, that is, to produce the closest result set such that each answer is sufficiently different from the rest. We first propose a user-tunable definition of diversity, and then present an algorithm, called MOTLEY, for producing a diverse result set as per this definition. Through a detailed experimental evaluation on real and synthetic data, we show that MOTLEY can produce diverse result sets by reading only a small fraction of the tuples in the database. Further, it imposes no additional overhead on the evaluation of traditional KNN queries, thereby providing a seamless interface between diversity and distance. Comment: 20 pages, 11 figures
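To make the idea of a diverse KNN result concrete, here is a minimal greedy sketch. It is not the MOTLEY algorithm and does not use the paper's definition of diversity; it assumes Euclidean distance and a hypothetical threshold `min_div`, and it scans every point rather than reading only a fraction of the database. Points are visited nearest-first, and a point is kept only if it lies at least `min_div` away from every answer already kept.

```python
import math


def diverse_knn(points, query, k, min_div):
    """Greedy sketch (not MOTLEY): return up to k points, nearest to
    `query` first, such that any two returned points are at least
    `min_div` apart. With min_div = 0 this is plain KNN."""
    ordered = sorted(points, key=lambda p: math.dist(p, query))
    result = []
    for p in ordered:
        # Keep p only if it is far enough from every answer chosen so far.
        if all(math.dist(p, r) >= min_div for r in result):
            result.append(p)
            if len(result) == k:
                break
    return result


# Illustrative data: a tight cluster near the query plus two outliers.
points = [(1.0, 0.0), (1.1, 0.0), (1.2, 0.0), (0.0, 3.0), (5.0, 5.0)]
query = (0.0, 0.0)

plain = diverse_knn(points, query, 3, 0.0)    # the three clustered points
diverse = diverse_knn(points, query, 3, 1.0)  # one per well-separated region
```

In this toy data the plain query returns three near-duplicates from the cluster, while the diverse query trades some closeness for answers that differ from one another.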

arXiv.org e-Print Archive

Open Access Repository of IISc Research Publications

Scalable and Dynamic Regeneration of Big Data Volumes

Author: Haritsa Jayant
Sanghi Anupam
Sood Raghav
Tirthapura Srikanta
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2018

A core requirement of database engine testing is the ability to create synthetic versions of the customer’s data warehouse at the vendor site. A rich body of work exists on synthetic database regeneration, but it suffers from critical limitations with regard to: (a) maintaining statistical fidelity to the client’s query processing, and/or (b) scaling to large data volumes. In this paper, we present HYDRA, a workload-dependent database regenerator that leverages a declarative approach to data regeneration to assure volumetric similarity, a crucial aspect of statistical fidelity, and materially improves on the prior art by adding scale, dynamism and functionality. First, Hydra uses an optimized linear programming (LP) formulation based on a novel region-partitioning approach. This spatial strategy drastically reduces the LP complexity, enabling it to handle query workloads on which contemporary techniques fail. Second, Hydra incorporates deterministic post-LP processing algorithms that provide high efficiency and improved accuracy. Third, Hydra introduces the concept of dynamic regeneration by constructing a minuscule database summary that can on-the-fly regenerate databases of arbitrary size during query execution, while obeying volumetric specifications derived from the query workload. A detailed experimental evaluation on standard OLAP benchmarks demonstrates that Hydra can efficiently and dynamically regenerate large warehouses that accurately mimic the desired statistical characteristics.

Digital Repository @ Iowa State University (ISU)

HYDRA: A Dynamic Big Data Regenerator

Author: Haritsa Jayant
Sanghi Anupam
Singh Dharmendra
Sood Raghav
Tirthapura Srikanta
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2018

A core requirement of database engine testing is the ability to create synthetic versions of the customer’s data warehouse at the vendor site. Prior work on synthetic data regeneration suffers from critical limitations with regard to (a) scaling to large data volumes, (b) handling complex query workloads, and (c) producing data on demand. In this demo, we present HYDRA, a workload-dependent dynamic data regenerator that materially addresses these limitations. It introduces the concept of dynamic regeneration by constructing a minuscule memory-resident database summary that can on-the-fly regenerate databases of arbitrary size during query execution. Further, since the data is generated in memory, the velocity of generation can be closely regulated. Finally, to complement dynamic regeneration, Hydra also ensures that the process of summary construction is data-scale-free.

Digital Repository @ Iowa State University (ISU)

Advances in Real-Time Database Systems Research Special Section on RTDBS of ACM SIGMOD Record 25(1), March 1996.

Author: Adelberg Brad
Andler S.
Berndtsson M.
Buchmann Alejandro
Eftring B.
Eriksson J.
Garcia-Molina Hector
Hansson J.
Haritsa Jayant
Kao Ben
Kuo Tei-Wei
Lin Kwei-Jay
Mellin J.
Mok Aloysius
Peng Ching-Shan
Ramamritham Krithi
Seshadri S.
Sivasankaran Raju
Son Sang
Stankovic John
Towsley Don
Ulusoy Ozgur
Xiong Ming
Publication venue: Boston University Computer Science Department
Publication date: 15/01/1996

A Real-Time DataBase System (RTDBS) can be viewed as an amalgamation of a conventional DataBase Management System (DBMS) and a real-time system. Like a DBMS, it has to process transactions and guarantee ACID database properties. Furthermore, it has to operate in real-time, satisfying time constraints imposed on transaction commitments. An RTDBS may exist as a stand-alone system or as an embedded component in a larger multidatabase system. The publication in 1988 of a special issue of ACM SIGMOD Record on Real-Time DataBases signaled the birth of the RTDBS research area -- an area that brings together researchers from both the database and real-time systems communities. Today, almost eight years later, I am pleased to present in this special section of ACM SIGMOD Record a review of recent advances in RTDBS research. There were 18 submissions to this special section, of which eight papers were selected for inclusion to provide the readers of ACM SIGMOD Record with an overview of current and future research directions within the RTDBS community. In this paper [below], I summarize these directions and provide the reader with pointers to other publications for further information. -Azer Bestavros, Guest Editor

Boston University Institutional Repository (OpenBU)

Transaction Scheduling in Firm Real-Time Database Systems

Author: Haritsa Jayant R
Publication venue: University of Wisconsin-Madison Department of Computer Sciences
Publication date: 01/01/1991

Minds@University of Wisconsin