MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning
Training deep networks and tuning hyperparameters on large datasets is computationally intensive. One of the primary research directions for efficient training is to reduce training costs by selecting well-generalizing subsets of training data. Compared to simple adaptive random subset selection baselines, existing intelligent subset selection approaches are not competitive, because their time-consuming subset selection step involves computing model-dependent gradients and feature embeddings and applying greedy maximization of submodular objectives. Our key insight is that removing the reliance on downstream model parameters lets subset selection run as a pre-processing step and allows multiple models to be trained at no additional selection cost. In this work, we propose MILO, a model-agnostic subset selection framework that decouples subset selection from model training while enabling superior model convergence and performance through an easy-to-hard curriculum. Our empirical results indicate that MILO can train models and tune hyperparameters faster than full-dataset training or tuning without compromising performance.
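A minimal sketch of the idea behind a model-agnostic, easy-to-hard curriculum (this is an illustration, not MILO's actual algorithm): difficulty scores are computed once, independently of the downstream model, and each epoch samples a subset from a pool that gradually widens from the easiest examples toward the full dataset.

```python
import numpy as np

def curriculum_subsets(difficulty, subset_size, num_epochs, seed=0):
    """Easy-to-hard curriculum over precomputed, model-independent
    difficulty scores (lower = easier). Illustrative sketch only:
    early epochs sample mostly easy examples; later epochs widen
    the candidate pool toward the hardest ones."""
    rng = np.random.default_rng(seed)
    order = np.argsort(difficulty)  # indices, easiest first
    n = len(difficulty)
    subsets = []
    for epoch in range(num_epochs):
        # Candidate pool grows linearly from subset_size to the full set.
        frac = (epoch + 1) / num_epochs
        pool = order[: max(subset_size, int(frac * n))]
        subsets.append(rng.choice(pool, size=subset_size, replace=False))
    return subsets

# Toy usage: 10 examples, pick 3 per epoch over 4 epochs.
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 1.0])
subs = curriculum_subsets(scores, subset_size=3, num_epochs=4)
```

Because the scores never depend on the model being trained, the same precomputed subsets can be reused across every model and hyperparameter configuration.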
Information driven evaluation of data hiding algorithms
Abstract. Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of modifying the database in such a way as to prevent the discovery of sensitive information. Given the large number of techniques that can be used to achieve this goal, it is necessary to provide standard evaluation metrics to determine the best algorithms for a specific application or context. Currently, however, there is no common set of parameters that can be used for this purpose. This paper explores the problem of PPDM algorithm evaluation, starting from the key goal of preserving data quality. To achieve this goal, we propose a formal definition of data quality specifically tailored for use in the context of PPDM algorithms, a set of evaluation parameters, and an evaluation algorithm. The resulting evaluation core process is then presented as part of a more general three-step evaluation framework that also takes into account other aspects of algorithm evaluation, such as efficiency, scalability, and level of privacy.
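Two evaluation parameters of the kind the paper argues for can be sketched concretely (a toy illustration with hypothetical rule sets, not the paper's formal definition): how many sensitive association rules a sanitized database actually hides, and how many harmless rules are lost as a side effect.

```python
def rule_preservation(original_rules, sanitized_rules, sensitive_rules):
    """Toy PPDM evaluation metrics over association rules (plain strings
    here). Returns (hiding_rate, side_effect):
      - hiding_rate: fraction of sensitive rules no longer minable
      - side_effect: fraction of non-sensitive rules accidentally lost
    """
    orig = set(original_rules)
    sani = set(sanitized_rules)
    sens = set(sensitive_rules)
    hidden = sens - sani
    non_sensitive = orig - sens
    lost = non_sensitive - sani
    hiding_rate = len(hidden) / len(sens) if sens else 1.0
    side_effect = len(lost) / len(non_sensitive) if non_sensitive else 0.0
    return hiding_rate, side_effect

h, s = rule_preservation(
    original_rules={"A->B", "B->C", "C->D", "A->D"},
    sanitized_rules={"B->C", "C->D"},
    sensitive_rules={"A->B"},
)
```

Here the sanitization hides the sensitive rule completely (h = 1.0) but also destroys one of the three non-sensitive rules (s ≈ 0.33), which is exactly the data-quality trade-off a standardized evaluation would need to expose.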
Limiting Privacy Breaches in Privacy Preserving Data Mining
There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average. In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.
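The randomize-then-compensate pattern described above can be sketched with classic Warner-style randomized response on a single bit (a minimal illustration of the general idea, not the paper's association-rule scheme): each record reports its true bit with probability p and the flipped bit otherwise, and the server inverts the known randomization at the aggregate level.

```python
import random

def randomize(bit, p, rng):
    """Report the true bit with probability p, the flipped bit otherwise."""
    return bit if rng.random() < p else 1 - bit

def estimate_true_fraction(reports, p):
    """Aggregate compensation. Since E[observed 1s] = p*f + (1-p)*(1-f),
    solving for the true fraction f gives the unbiased estimator below."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
truth = [1] * 300 + [0] * 700          # true fraction of 1s is 0.3
reports = [randomize(b, 0.7, rng) for b in truth]
est = estimate_true_fraction(reports, 0.7)
```

The aggregate estimate is close to 0.3 even though no individual report is trustworthy; the breach the paper worries about is that for skewed data distributions, some individual reports can still be inverted with high confidence.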
PRIVACY PRESERVING INFORMATION SHARING
Modern business creates an increasing need for sharing, querying and mining information across autonomous enterprises while maintaining privacy of their own data records. The capability of preserving privacy in query processing algorithms can be demonstrated in two ways: through statistics and through cryptography. The statistical approach evaluates disclosure by its effect on an adversary's probability assumptions regarding privacy-sensitive data properties, while the cryptographic approach gives comparative lower bounds on the computational complexity of learning these properties. This dissertation presents results in both approaches. First, it considers the setup with one central server and a large number of clients connected only to the server, each client having a private data record. The server wants to generate an aggregate model of clients' data, and the clients want to limit disclosure of their individual records. Before sending to the server, each client hides its record using randomization, i.e. replaces the record with another one drawn from a certain distribution that depends on the original record. Disclosure is limited statistically by providing guarantees against "privacy breaches": situations when the randomized record significantly alters the server's probability assumptions.
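The "privacy breach" notion in this line of work can be quantified with a one-line Bayes computation (an illustrative sketch for a single randomized bit, with the breach threshold chosen arbitrarily here): a breach occurs when seeing the randomized value lifts the server's belief about the true value from a modest prior to a high posterior.

```python
def posterior_of_one(prior, p):
    """P(true bit = 1 | reported bit = 1) when randomization reports the
    truth with probability p. Illustrative breach check: a jump from a
    modest prior to a high posterior is exactly the kind of "privacy
    breach" the statistical guarantees are meant to rule out."""
    num = p * prior
    den = p * prior + (1 - p) * (1 - prior)
    return num / den

prior = 0.3
post = posterior_of_one(prior, p=0.9)   # weak randomization
# With p = 0.9 the posterior jumps from 0.3 to about 0.79 -- a large
# update, so this randomization would fail a strict breach threshold.
```

Lowering p (more randomization) shrinks the posterior jump, which is the knob such schemes trade off against aggregate accuracy.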
A probabilistic algorithm for updating files over a communication link
1 Introduction. Consider two persons, P and Q; assume that P knows a binary string x and Q knows a binary string y (assumed to be close to x, see below). The persons can send bits to each other (Figure 1). P wants to know the string y; the communication protocol should require as few bits as possible, exploiting the fact that P already knows the string x that y is close to. The distance between y and x is measured by the number of edit operations needed to transform x into y (see Definition 2.1); note that distance in this sense is not symmetric. We present a probabilistic algorithm and estimate the number of transmitted bits and the running time.
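The flavor of such a protocol can be sketched with a simple hash-exchange scheme in the rsync style (an illustration of the general idea, not this paper's probabilistic edit-distance algorithm): P sends short hashes of fixed-size blocks of x, and Q replies with a delta that references blocks P already has plus literal bytes for everything else.

```python
import hashlib

BLOCK = 4  # toy block size; real protocols use much larger blocks

def sender_digest(x: bytes):
    """P's message: a hash of each fixed block of x (P knows x, wants y)."""
    blocks = [x[i:i + BLOCK] for i in range(0, len(x), BLOCK)]
    return {hashlib.sha256(b).hexdigest(): i for i, b in enumerate(blocks)}

def responder_delta(y: bytes, digests):
    """Q's reply: ('ref', block_index) for stretches of y that P already
    has, and ('lit', byte) for everything else."""
    delta, i = [], 0
    while i < len(y):
        h = hashlib.sha256(y[i:i + BLOCK]).hexdigest()
        if len(y) - i >= BLOCK and h in digests:
            delta.append(('ref', digests[h]))
            i += BLOCK
        else:
            delta.append(('lit', y[i:i + 1]))
            i += 1
    return delta

def reconstruct(x: bytes, delta):
    """P rebuilds y from its own copy of x plus Q's delta."""
    out = b''
    for kind, v in delta:
        out += x[v * BLOCK:(v + 1) * BLOCK] if kind == 'ref' else v
    return out

x = b'abcdefghijkl'
y = b'abcdXXefghijkl'   # y = x with two bytes inserted
delta = responder_delta(y, sender_digest(x))
```

When x and y are close in edit distance, most of the delta is block references, so the bits transmitted scale with the edit distance rather than with the length of y, the same goal the paper's probabilistic algorithm pursues with tighter guarantees.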
- …