MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning
Training deep networks and tuning hyperparameters on large datasets is computationally intensive. One of the primary research directions for efficient training is to reduce training costs by selecting well-generalizing subsets of training data. Compared to simple adaptive random subset selection baselines, existing intelligent subset selection approaches are not competitive, because their time-consuming subset selection step involves computing model-dependent gradients and feature embeddings and applying greedy maximization of submodular objectives. Our key insight is that removing the reliance on downstream model parameters lets subset selection run as a pre-processing step and allows multiple models to be trained at no additional selection cost. In this work, we propose MILO, a model-agnostic subset selection framework that decouples subset selection from model training while enabling superior model convergence and performance through an easy-to-hard curriculum. Our empirical results indicate that MILO can train models and tune hyperparameters faster than full-dataset training or tuning without compromising performance.
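A minimal sketch of the idea behind a model-agnostic, easy-to-hard curriculum (this is an illustration, not MILO's actual algorithm): difficulty scores are computed once, independently of the downstream model, and each epoch samples a subset from a pool that gradually widens from the easiest examples toward the full dataset.

```python
import numpy as np

def curriculum_subsets(difficulty, subset_size, num_epochs, seed=0):
    """Easy-to-hard curriculum over precomputed, model-independent
    difficulty scores (lower = easier). Illustrative sketch only:
    early epochs sample mostly easy examples; later epochs widen
    the candidate pool toward the hardest ones."""
    rng = np.random.default_rng(seed)
    order = np.argsort(difficulty)  # indices, easiest first
    n = len(difficulty)
    subsets = []
    for epoch in range(num_epochs):
        # Candidate pool grows linearly from subset_size to the full set.
        frac = (epoch + 1) / num_epochs
        pool = order[: max(subset_size, int(frac * n))]
        subsets.append(rng.choice(pool, size=subset_size, replace=False))
    return subsets

# Toy usage: 10 examples, pick 3 per epoch over 4 epochs.
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 1.0])
subs = curriculum_subsets(scores, subset_size=3, num_epochs=4)
```

Because the scores never depend on the model being trained, the same precomputed subsets can be reused across every model and hyperparameter configuration.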
Information driven evaluation of data hiding algorithms
Abstract. Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of modifying the database in such a way as to prevent the discovery of sensitive information. Given the large number of techniques that can be used to achieve this goal, it is necessary to provide standard evaluation metrics to determine the best algorithms for a specific application or context. Currently, however, there is no common set of parameters that can be used for this purpose. This paper explores the problem of PPDM algorithm evaluation, starting from the key goal of preserving data quality. To achieve this goal, we propose a formal definition of data quality specifically tailored for use in the context of PPDM algorithms, a set of evaluation parameters, and an evaluation algorithm. The resulting evaluation core process is then presented as part of a more general three-step evaluation framework that also takes into account other aspects of algorithm evaluation, such as efficiency, scalability, and level of privacy.
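Two evaluation parameters of the kind the paper argues for can be sketched concretely (a toy illustration with hypothetical rule sets, not the paper's formal definition): how many sensitive association rules a sanitized database actually hides, and how many harmless rules are lost as a side effect.

```python
def rule_preservation(original_rules, sanitized_rules, sensitive_rules):
    """Toy PPDM evaluation metrics over association rules (plain strings
    here). Returns (hiding_rate, side_effect):
      - hiding_rate: fraction of sensitive rules no longer minable
      - side_effect: fraction of non-sensitive rules accidentally lost
    """
    orig = set(original_rules)
    sani = set(sanitized_rules)
    sens = set(sensitive_rules)
    hidden = sens - sani
    non_sensitive = orig - sens
    lost = non_sensitive - sani
    hiding_rate = len(hidden) / len(sens) if sens else 1.0
    side_effect = len(lost) / len(non_sensitive) if non_sensitive else 0.0
    return hiding_rate, side_effect

h, s = rule_preservation(
    original_rules={"A->B", "B->C", "C->D", "A->D"},
    sanitized_rules={"B->C", "C->D"},
    sensitive_rules={"A->B"},
)
```

Here the sanitization hides the sensitive rule completely (h = 1.0) but also destroys one of the three non-sensitive rules (s ≈ 0.33), which is exactly the data-quality trade-off a standardized evaluation would need to expose.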
Limiting Privacy Breaches in Privacy Preserving Data Mining
There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average. In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.
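The randomize-then-compensate pattern described above can be sketched with classic Warner-style randomized response on a single bit (a minimal illustration of the general idea, not the paper's association-rule scheme): each record reports its true bit with probability p and the flipped bit otherwise, and the server inverts the known randomization at the aggregate level.

```python
import random

def randomize(bit, p, rng):
    """Report the true bit with probability p, the flipped bit otherwise."""
    return bit if rng.random() < p else 1 - bit

def estimate_true_fraction(reports, p):
    """Aggregate compensation. Since E[observed 1s] = p*f + (1-p)*(1-f),
    solving for the true fraction f gives the unbiased estimator below."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
truth = [1] * 300 + [0] * 700          # true fraction of 1s is 0.3
reports = [randomize(b, 0.7, rng) for b in truth]
est = estimate_true_fraction(reports, 0.7)
```

The aggregate estimate is close to 0.3 even though no individual report is trustworthy; the breach the paper worries about is that for skewed data distributions, some individual reports can still be inverted with high confidence.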
PRIVACY PRESERVING INFORMATION SHARING
Modern business creates an increasing need for sharing, querying and mining information across autonomous enterprises while maintaining privacy of their own data records. The capability of preserving privacy in query processing algorithms can be demonstrated in two ways: through statistics and through cryptography. The statistical approach evaluates disclosure by its effect on an adversary's probability assumptions regarding privacy-sensitive data properties, while the cryptographic approach gives comparative lower bounds on the computational complexity of learning these properties. This dissertation presents results in both approaches. First, it considers the setup with one central server and a large number of clients connected only to the server, each client having a private data record. The server wants to generate an aggregate model of clients' data, and the clients want to limit disclosure of their individual records. Before sending to the server, each client hides its record using randomization, i.e. replaces the record with another one drawn from a certain distribution that depends on the original record. Disclosure is limited statistically by providing guarantees against "privacy breaches": situations when the randomized record significantly alters the server's probability assumptions.
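The "privacy breach" notion in this line of work can be quantified with a one-line Bayes computation (an illustrative sketch for a single randomized bit, with the breach threshold chosen arbitrarily here): a breach occurs when seeing the randomized value lifts the server's belief about the true value from a modest prior to a high posterior.

```python
def posterior_of_one(prior, p):
    """P(true bit = 1 | reported bit = 1) when randomization reports the
    truth with probability p. Illustrative breach check: a jump from a
    modest prior to a high posterior is exactly the kind of "privacy
    breach" the statistical guarantees are meant to rule out."""
    num = p * prior
    den = p * prior + (1 - p) * (1 - prior)
    return num / den

prior = 0.3
post = posterior_of_one(prior, p=0.9)   # weak randomization
# With p = 0.9 the posterior jumps from 0.3 to about 0.79 -- a large
# update, so this randomization would fail a strict breach threshold.
```

Lowering p (more randomization) shrinks the posterior jump, which is the knob such schemes trade off against aggregate accuracy.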
A probabilistic algorithm for updating files over a communication link
1 Introduction. Consider two persons, P and Q; assume that P knows a binary string x and Q knows a binary string y (assumed to be close to x, see below). The persons can send bits to each other (Figure 1). P wants to know the string y; the communication protocol should require as few bits as possible, exploiting the fact that P already knows the string x that y is close to. The distance between y and x is measured by the number of edit operations needed to transform x into y (see Definition 2.1); note that distance in this sense is not symmetric. We present a probabilistic algorithm and estimate the number of transmitted bits and the running time.
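The flavor of such a protocol can be sketched with a simple hash-exchange scheme in the rsync style (an illustration of the general idea, not this paper's probabilistic edit-distance algorithm): P sends short hashes of fixed-size blocks of x, and Q replies with a delta that references blocks P already has plus literal bytes for everything else.

```python
import hashlib

BLOCK = 4  # toy block size; real protocols use much larger blocks

def sender_digest(x: bytes):
    """P's message: a hash of each fixed block of x (P knows x, wants y)."""
    blocks = [x[i:i + BLOCK] for i in range(0, len(x), BLOCK)]
    return {hashlib.sha256(b).hexdigest(): i for i, b in enumerate(blocks)}

def responder_delta(y: bytes, digests):
    """Q's reply: ('ref', block_index) for stretches of y that P already
    has, and ('lit', byte) for everything else."""
    delta, i = [], 0
    while i < len(y):
        h = hashlib.sha256(y[i:i + BLOCK]).hexdigest()
        if len(y) - i >= BLOCK and h in digests:
            delta.append(('ref', digests[h]))
            i += BLOCK
        else:
            delta.append(('lit', y[i:i + 1]))
            i += 1
    return delta

def reconstruct(x: bytes, delta):
    """P rebuilds y from its own copy of x plus Q's delta."""
    out = b''
    for kind, v in delta:
        out += x[v * BLOCK:(v + 1) * BLOCK] if kind == 'ref' else v
    return out

x = b'abcdefghijkl'
y = b'abcdXXefghijkl'   # y = x with two bytes inserted
delta = responder_delta(y, sender_digest(x))
```

When x and y are close in edit distance, most of the delta is block references, so the bits transmitted scale with the edit distance rather than with the length of y, the same goal the paper's probabilistic algorithm pursues with tighter guarantees.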
- …