
    MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning

    Training deep networks and tuning hyperparameters on large datasets is computationally intensive. One of the primary research directions for efficient training is to reduce training costs by selecting well-generalizable subsets of training data. Compared to simple adaptive random subset selection baselines, existing intelligent subset selection approaches are not competitive because of the time-consuming subset selection step, which involves computing model-dependent gradients and feature embeddings and applying greedy maximization of submodular objectives. Our key insight is that removing the reliance on downstream model parameters enables subset selection as a pre-processing step and makes it possible to train multiple models at no additional cost. In this work, we propose MILO, a model-agnostic subset selection framework that decouples subset selection from model training while enabling superior model convergence and performance by using an easy-to-hard curriculum. Our empirical results indicate that MILO can train models 3×–10× faster and tune hyperparameters 20×–75× faster than full-dataset training or tuning without compromising performance.
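    The core ideas of the abstract (model-independent difficulty scores, subsets pre-computed before any training, and an easy-to-hard curriculum) can be illustrated with a short sketch. This is a minimal illustration, not MILO's actual algorithm: the difficulty scores are assumed to come from some model-independent proxy, and the names (easy_to_hard_subsets, difficulty) are invented for the example.

```python
import numpy as np

def easy_to_hard_subsets(difficulty, subset_frac=0.1, num_epochs=10, seed=0):
    """Pre-compute one training subset per epoch, ordered easy -> hard.

    difficulty : 1-D array of model-independent difficulty scores, one per
                 training example (illustrative proxy; MILO's actual scoring
                 may differ).
    Returns a list of index arrays, one per epoch.
    """
    rng = np.random.default_rng(seed)
    n = len(difficulty)
    k = max(1, int(subset_frac * n))
    order = np.argsort(difficulty)          # easiest examples first
    subsets = []
    for epoch in range(num_epochs):
        # Gradually widen the sampling pool from the easiest k examples to all n.
        pool_size = int(k + (n - k) * epoch / max(1, num_epochs - 1))
        pool = order[:pool_size]
        subsets.append(rng.choice(pool, size=k, replace=False))
    return subsets

# Usage: scores could come from any model-independent proxy.
scores = np.random.rand(50_000)
subsets = easy_to_hard_subsets(scores, subset_frac=0.05, num_epochs=5)
# Because selection never touches model parameters, each subsets[e] can be
# reused to train or tune any number of models at no extra selection cost.
```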

    Information driven evaluation of data hiding algorithms

    Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of modifying the database in such a way as to prevent the discovery of sensitive information. Due to the large number of possible techniques that can be used to achieve this goal, it is necessary to provide some standard evaluation metrics to determine the best algorithms for a specific application or context. Currently, however, there is no common set of parameters that can be used for this purpose. This paper explores the problem of PPDM algorithm evaluation, starting from the key goal of preserving data quality. To achieve this goal, we propose a formal definition of data quality specifically tailored for use in the context of PPDM algorithms, a set of evaluation parameters and an evaluation algorithm. The resulting evaluation core process is then presented as part of a more general three-step evaluation framework, also taking into account other aspects of algorithm evaluation such as efficiency, scalability and level of privacy.
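    As an illustration of the kind of evaluation parameter the paper argues for, the sketch below computes one simple data-quality measure on a sanitized transaction database: the relative change in item frequencies before and after hiding. This is a generic example, not the parameter set defined in the paper; frequency_dissimilarity and the toy database are invented for the sketch.

```python
from collections import Counter

def frequency_dissimilarity(original, sanitized):
    """One illustrative data-quality parameter: relative change in item
    frequencies between the original and the sanitized transaction database.
    Lower values mean the hiding algorithm preserved more data quality.
    """
    orig_counts = Counter(item for t in original for item in t)
    san_counts = Counter(item for t in sanitized for item in t)
    total = sum(orig_counts.values())
    return sum(abs(orig_counts[i] - san_counts.get(i, 0))
               for i in orig_counts) / total

# Usage: each transaction is a set of items; some items were hidden.
db = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
db_sanitized = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]
print(frequency_dissimilarity(db, db_sanitized))
```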

    Limiting Privacy Breaches in Privacy Preserving Data Mining

    There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average. In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from [9] to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.
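    The seed idea can be made concrete with a toy sketch: instead of transmitting a long randomized transaction, the client searches for a pseudorandom-generator seed whose generated transaction happens to contain the original items, and sends only the seed. The sketch below is a simplified illustration under that assumption; it does not reproduce the paper's actual randomization operator or its amplification analysis, and the names (find_seed, regenerate) are invented for the example.

```python
import random

def find_seed(original_items, universe, out_size, max_tries=1_000_000):
    """Search for a PRG seed whose generated 'long' randomized transaction
    contains all items of the original transaction. The sender then
    transmits only the seed; the receiver regenerates the transaction.
    """
    original_items = set(original_items)
    for seed in range(max_tries):
        rng = random.Random(seed)
        randomized = set(rng.sample(universe, out_size))
        if original_items <= randomized:
            return seed, randomized
    raise RuntimeError("no suitable seed found; increase out_size or max_tries")

def regenerate(seed, universe, out_size):
    """Receiver side: reconstruct the randomized transaction from the seed."""
    return set(random.Random(seed).sample(universe, out_size))

# Usage: a 1000-item catalogue; the true transaction holds items 3 and 17.
universe = list(range(1000))
seed, txn = find_seed({3, 17}, universe, out_size=200)
assert regenerate(seed, universe, out_size=200) == txn  # only the seed is sent
```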

    PRIVACY PRESERVING INFORMATION SHARING

    Modern business creates an increasing need for sharing, querying and mining information across autonomous enterprises while maintaining privacy of their own data records. The capability of preserving privacy in query processing algorithms can be demonstrated in two ways: through statistics and through cryptography. The statistical approach evaluates disclosure by its effect on an adversary’s probability assumptions regarding privacy-sensitive data properties, while the cryptographic approach gives comparative lower bounds on the computational complexity of learning these properties. This dissertation presents results in both approaches. First, it considers the setup with one central server and a large number of clients connected only to the server, each client having a private data record. The server wants to generate an aggregate model of clients’ data, and the clients want to limit disclosure of their individual records. Before sending to the server, each client hides its record using randomization, i.e. replaces the record with another one drawn from a certain distribution that depends on the original record. Disclosure is limited statistically by providing guarantees against “privacy breaches”: situations when the randomized record significantly alters the server’s probability …
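    The client-side randomization described above can be illustrated with a generic randomized-response sketch for a single binary attribute: each client flips its bit with some probability, and the server inverts the flip probability at the aggregate level. This is an assumed, simplified operator for illustration only, not the dissertation's construction.

```python
import random

def randomize_bit(bit, p_keep=0.7):
    """Client side: keep the true bit with probability p_keep, otherwise flip
    it, so the sent value is drawn from a distribution that depends on the
    original record (generic randomized-response sketch)."""
    return bit if random.random() < p_keep else 1 - bit

def estimate_true_fraction(randomized_bits, p_keep=0.7):
    """Server side: invert the randomization at the aggregate level using
    E[observed fraction] = p_keep * f + (1 - p_keep) * (1 - f)."""
    observed = sum(randomized_bits) / len(randomized_bits)
    return (observed - (1 - p_keep)) / (2 * p_keep - 1)

# Usage: 10,000 clients, 30% of whom hold the sensitive value 1.
true_bits = [1] * 3000 + [0] * 7000
sent = [randomize_bit(b) for b in true_bits]
print(estimate_true_fraction(sent))   # close to 0.30 on average
```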

    A probabilistic algorithm for updating files over a communication link

    Consider two persons, P and Q; assume that P knows a binary string x and Q knows a binary string y (that is assumed to be close to x, see below). The persons can send bits to each other (Figure 1). P wants to know the string y; the communication protocol should require as few bits as possible, using the fact that P already knows the string x that y is close to. The distance between y and x is measured by the number of edit operations needed to transform x into y (see Definition 2.1); let us mention that distance in our sense is not symmetric. We present a probabilistic algorithm and estimate the number of transmitted bits and the running time.
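    To make the setting concrete, the sketch below shows a much simpler block-hash exchange in the same spirit: P sends hashes of fixed-size blocks of x, and Q answers with references to blocks P already holds plus literal bytes for the rest. This only illustrates the communication-saving idea under assumed helper names (block_hashes, encode_update, decode_update); the paper's probabilistic protocol for edit distance is different and more efficient.

```python
import hashlib

BLOCK = 64  # block size in bytes (illustrative)

def block_hashes(data):
    """P's side: split x into fixed-size blocks and hash each one."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def encode_update(x_hashes, y):
    """Q's side: describe y using references to P's blocks where possible,
    sending literal bytes only for blocks P does not already have."""
    known = {h: idx for idx, h in enumerate(x_hashes)}
    ops = []
    for i in range(0, len(y), BLOCK):
        chunk = y[i:i + BLOCK]
        h = hashlib.sha256(chunk).hexdigest()
        ops.append(("ref", known[h]) if h in known else ("raw", chunk))
    return ops

def decode_update(x, ops):
    """P's side: reconstruct y from its own blocks plus the literal chunks."""
    out = bytearray()
    for kind, payload in ops:
        out += x[payload * BLOCK:(payload + 1) * BLOCK] if kind == "ref" else payload
    return bytes(out)

# Usage: y differs from x in a single 64-byte region.
x = b"A" * 256 + b"B" * 256
y = b"A" * 256 + b"C" * 64 + b"B" * 192
ops = encode_update(block_hashes(x), y)
assert decode_update(x, ops) == y
```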
