19 research outputs found

    Reconstruction Methods for Providing Privacy in Data Mining

    Data mining is the process of finding correlations or patterns among the many fields in large databases. A fruitful direction for data mining research is the development of techniques that incorporate privacy concerns: the central requirement addressed in this paper is that the accurate individual data should be perturbed before being provided to users. For this reason, much research effort has recently been devoted to the problem of providing privacy in data mining. We consider the concrete case of building a decision-tree classifier from data in which the values of individual records have been perturbed. The resulting data records look very different from the original records, and the distribution of data values is also very different from the original distribution. By reconstructing an estimate of the original distribution, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.
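    As a rough illustration of this reconstruction idea (a simplified sketch, not the paper's exact algorithm), the code below perturbs numeric values with Gaussian noise before "release" and then iteratively re-estimates the original value distribution over histogram bins via Bayes' rule; all data and parameter choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(values, noise_std):
    """Value distortion: randomize each value before release."""
    return values + rng.normal(0.0, noise_std, size=values.shape)

def reconstruct_distribution(released, noise_std, bins, n_iter=50):
    """Iteratively re-estimate the original distribution over histogram
    bins from the released (perturbed) values, assuming a Gaussian
    noise model and applying Bayes' rule per record."""
    centers = 0.5 * (bins[:-1] + bins[1:])
    probs = np.full(len(centers), 1.0 / len(centers))  # uniform prior
    for _ in range(n_iter):
        # likelihood of each released value given each candidate bin center
        lik = np.exp(-0.5 * ((released[:, None] - centers[None, :]) / noise_std) ** 2)
        post = lik * probs                       # Bayes' rule, unnormalized
        post /= post.sum(axis=1, keepdims=True)  # normalize per record
        probs = post.mean(axis=0)                # updated bin probabilities
    return probs

# toy data: a bimodal "age" attribute, heavily perturbed before release
ages = np.concatenate([rng.normal(30, 3, 500), rng.normal(60, 3, 500)])
released = perturb(ages, noise_std=15.0)
bins = np.linspace(0, 90, 31)
est = reconstruct_distribution(released, 15.0, bins)
print(bins[est.argmax()])  # the most probable bin sits near one of the true modes
```

    Even though each released record is individually very noisy, the aggregate shape of the distribution can be recovered well enough for distribution-based learners such as decision trees.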

    Sanitizing and Minimizing Databases for Software Application Test Outsourcing

    Abstract—Testing software applications that use nontrivial databases is increasingly outsourced to test centers in order to achieve lower cost and higher quality. Data privacy laws prevent organizations from sharing such databases with test centers because they contain sensitive information, and the situation is aggravated by big data: large databases are time-consuming and difficult to anonymize, distribute, and test with. Deleting data randomly often leads to significantly worse test coverage and fewer uncovered faults, thereby reducing the quality of software applications. We propose a novel approach for Protecting and mInimizing databases for Software TestIng taSks (PISTIS) that both sanitizes and minimizes a database that comes along with an application. PISTIS uses a weight-based data clustering algorithm that partitions the data in the database using information, obtained via program analysis, that describes how the data is used by the application. For each cluster, a centroid object is computed that represents the different persons or entities in the cluster, and associative rule mining is used to derive constraints ensuring that the centroid objects are representative of the general population of the data in the cluster. Doing so also sanitizes the information, since these centroid objects replace the original data, making it difficult for attackers to infer sensitive information. Thus, we reduce a large database to a few centroid objects, and our experiments with two applications show that test coverage stays within a close range of its original level.
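    The centroid-replacement step can be sketched as follows. This is a minimal illustration using plain k-means rather than PISTIS's weight-based, program-analysis-informed clustering or its rule-mining constraints; the data and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, n_iter=20):
    """Plain k-means clustering (stand-in for the paper's weight-based
    clustering, which additionally uses program-analysis information)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def sanitize_and_minimize(X, k):
    """Replace every record with its cluster centroid and keep one
    centroid per cluster: the database shrinks from len(X) rows to k,
    and no original record appears in the released data."""
    labels, centers = kmeans(X, k)
    return centers, labels

# toy "database": 1000 records with two numeric attributes (e.g. age, salary)
db = np.column_stack([rng.normal(40, 10, 1000), rng.normal(50_000, 8_000, 1000)])
mini_db, labels = sanitize_and_minimize(db, k=5)
print(mini_db.shape)  # (5, 2): five representative records replace a thousand
```

    The sanitization effect comes from the replacement itself: testers exercise the application against representative synthetic rows instead of any real person's data.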

    Privacy Preserving Data Mining

    Data mining techniques provide benefits in many areas such as medicine, sports, marketing, and signal processing, as well as data and network security. However, although data mining techniques are used in security applications such as intrusion detection, biometric authentication, and fraud and malware classification, privacy has become a serious problem, especially in data mining applications that involve the collection and sharing of personal data. For these reasons, the problem of protecting privacy in the context of data mining differs from traditional data privacy protection, as data mining can act as both friend and foe. This chapter covers previously developed privacy-preserving data mining techniques in two parts: (i) techniques proposed for the input data that will be subject to data mining, and (ii) techniques proposed for the processed data (the output of the data mining algorithms). It also presents attacks against the privacy of data mining applications. The chapter concludes with a discussion of next-generation privacy-preserving data mining applications at both the individual and organizational levels.

    Verifiable and private top-k monitoring


    Secure server-aided top-k monitoring

    National Research Foundation (NRF) Singapore

    A generic privacy ontology and its applications to different domains

    Privacy is becoming increasingly important due to the advent of e-commerce, but it is equally important in other application domains. Applications frequently require customers to divulge many personal details about themselves, which must be protected carefully in accordance with privacy principles and regulations. Here, we define a privacy ontology to support the provision of privacy and to help derive the level of privacy associated with transactions and applications. The privacy ontology provides a framework for developers and service providers to guide and benchmark their applications and systems with regard to the concepts of privacy and the levels and dimensions experienced. Furthermore, it gives users, or data subjects, the ability to describe their own privacy requirements and to measure them when dealing with other parties that process personal information. The ontology captures the knowledge of the privacy domain and its quality aspects, dimensions, and assessment criteria. It is composed of a core ontology, which we call the generic privacy ontology, and of application-domain-specific extensions, which commit to some application-domain concepts, properties, and relationships as well as to all of those in the generic privacy ontology. This allows privacy dimensions to be evaluated in different application domains, and we present case studies for two such domains: a restricted B2C e-commerce scenario and a restricted hospital scenario from the medical domain.
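    The core-plus-extension layering might be modelled along the following lines. All concept names, dimensions, and the level-derivation rule here are hypothetical stand-ins; the paper's actual ontology vocabulary is not reproduced.

```python
from dataclasses import dataclass, field

@dataclass
class PrivacyDimension:
    """One assessed privacy dimension (names here are made up)."""
    name: str   # e.g. "visibility", "retention", "granularity"
    level: int  # ordinal privacy level achieved for this dimension

@dataclass
class GenericPrivacyOntology:
    """Core concepts shared by every application domain."""
    dimensions: list = field(default_factory=list)

    def overall_level(self):
        # illustrative derivation rule: a transaction is only as
        # private as its weakest dimension
        return min(d.level for d in self.dimensions)

@dataclass
class EcommercePrivacyOntology(GenericPrivacyOntology):
    """Domain extension: commits to e-commerce-specific concepts
    while inheriting all of the generic ontology's concepts."""
    payment_data_protected: bool = True

profile = EcommercePrivacyOntology(dimensions=[
    PrivacyDimension("visibility", 2),
    PrivacyDimension("retention", 3),
])
print(profile.overall_level())  # 2
```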

    Programação genética aplicada à identificação de acidentes de uma usina nuclear PWR [Genetic programming applied to accident identification in a PWR nuclear power plant]

    This work presents the results of a study that evaluated the efficiency of the evolutionary computation algorithm genetic programming as a technique for optimization and feature generation in a pattern recognition system for the diagnosis of accidents in a pressurized water reactor (PWR) nuclear power plant. The foundations of a typical pattern recognition system, the state of the art of genetic programming, and similar accident/transient diagnosis systems for nuclear power plants are also presented. Given the time evolution of seventeen operational variables over the three accident scenarios considered, plus the normal condition, the task of genetic programming was to evolve non-linear regressors combining those variables so as to provide the most discriminatory information for each of the events. After exhaustive tests with many variable associations, genetic programming proved to be a methodology capable of attaining success rates at, or very close to, 100%, with quite simple parametrization of the algorithm and in very reasonable training time, placing it at a level of performance similar or even superior to other comparable systems in the scientific literature, while having the additional advantage of requiring very little (and often no) pretreatment of the data.
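    A toy version of this setup, evolving arithmetic expression trees over operational variables and scoring them by how well their sign discriminates an event, might look like the sketch below. It uses 4 variables instead of 17 and a deliberately simple target pattern; the representation, operators, and parameters are illustrative, not the thesis's actual configuration.

```python
import random

random.seed(2)
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
N_VARS = 4  # stand-in for the seventeen operational variables

def random_expr(depth=3):
    """Random expression tree: leaves are variable indices,
    internal nodes are arithmetic operators."""
    if depth == 0 or random.random() < 0.3:
        return random.randrange(N_VARS)
    return (random.choice(list(OPS)), random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if isinstance(expr, int):
        return x[expr]
    op, a, b = expr
    return OPS[op](evaluate(a, x), evaluate(b, x))

def fitness(expr, data):
    """Fraction of samples where the sign of the evolved regressor
    agrees with the event label (+1 / -1)."""
    return sum((evaluate(expr, x) > 0) == (y > 0) for x, y in data) / len(data)

def mutate(expr):
    """Subtree mutation: replace a randomly chosen subtree."""
    if isinstance(expr, int) or random.random() < 0.3:
        return random_expr(2)
    op, a, b = expr
    return (op, mutate(a), b) if random.random() < 0.5 else (op, a, mutate(b))

# toy "event": label +1 iff x0 - x2 > 0, a pattern GP can rediscover
data = []
for _ in range(300):
    x = [random.uniform(-1, 1) for _ in range(N_VARS)]
    data.append((x, 1 if x[0] - x[2] > 0 else -1))

pop = [random_expr() for _ in range(100)]
for _ in range(30):  # elitism: keep the best 20, refill with their mutants
    pop.sort(key=lambda e: -fitness(e, data))
    pop = pop[:20] + [mutate(random.choice(pop[:20])) for _ in range(80)]
best = max(pop, key=lambda e: fitness(e, data))
print(fitness(best, data))
```

    The appeal noted in the abstract is visible even here: the evolved expression is itself a human-readable combination of the raw variables, so no separate feature-engineering or data-pretreatment stage is needed.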

    Privacy preserving data publishing with multiple sensitive attributes

    Data mining is the process of extracting hidden predictive information from large databases; it has great potential to help governments, researchers, and companies focus on the most significant information in their data warehouses. High-quality data and effective data publishing are needed to gain a high impact from the data mining process. However, there is a clear need to preserve individual privacy in the released data. Privacy-preserving data publishing is a research area concerned with eliminating privacy threats while still providing useful information in the released data. Datasets normally include many sensitive attributes and may contain static or dynamic data, and they may need to be published in multiple updated releases with different time stamps. As a concrete example, public opinions include highly sensitive information about individuals and may reflect a person's perspective, understanding, particular feelings, way of life, and desires. On the one hand, public opinion is often collected through a central server that keeps a user profile for each participant and needs to publish this data for researchers to analyze in depth. On the other hand, new privacy concerns arise and users' privacy can be put at risk. A user's opinion is sensitive information and must be protected both before and after data publishing. Each user's opinions concern only a few issues, while the total number of issues is huge; we therefore deal with multiple sensitive attributes in order to develop an efficient model. Furthermore, opinions are gathered and published periodically, so correlations between sensitive attributes in different releases may occur. Thus the anonymization technique must account for previous releases as well as the dependencies between released issues. This dissertation identifies this new privacy problem for public opinions. In addition, it presents two probabilistic anonymization algorithms based on the concepts of k-anonymity [1, 2] and l-diversity [3, 4] to solve the problems of publishing datasets with multiple sensitive attributes and of publishing dynamic datasets. The proposed algorithms provide a heuristic solution for multidimensional quasi-identifiers and multidimensional sensitive attributes using a probabilistic l-diversity definition. Experimental results show that these algorithms clearly outperform existing algorithms in terms of anonymization accuracy.
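    The two privacy definitions the dissertation builds on can be illustrated with simple checks: k-anonymity requires every quasi-identifier group in the release to contain at least k records, and (distinct) l-diversity additionally requires at least l different sensitive values per group. The sketch below is a minimal single-sensitive-attribute illustration with made-up data, not the dissertation's probabilistic multi-attribute algorithms.

```python
from collections import defaultdict

def equivalence_classes(rows, qi_cols):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qi_cols)].append(row)
    return groups

def is_k_anonymous(rows, qi_cols, k):
    """Every QI group must contain at least k records."""
    return all(len(g) >= k for g in equivalence_classes(rows, qi_cols).values())

def is_l_diverse(rows, qi_cols, sensitive_col, l):
    """Distinct l-diversity: every QI group must contain at least l
    different sensitive values (one sensitive column for brevity;
    the dissertation targets multiple sensitive attributes)."""
    return all(len({r[sensitive_col] for r in g}) >= l
               for g in equivalence_classes(rows, qi_cols).values())

# toy release: QI = (generalized age, masked zip), sensitive = opinion
table = [
    {"age": "20-30", "zip": "481**", "opinion": "A"},
    {"age": "20-30", "zip": "481**", "opinion": "B"},
    {"age": "30-40", "zip": "482**", "opinion": "A"},
    {"age": "30-40", "zip": "482**", "opinion": "A"},
]
print(is_k_anonymous(table, ("age", "zip"), k=2))           # True
print(is_l_diverse(table, ("age", "zip"), "opinion", l=2))  # False: second group is uniform
```

    The second group shows why k-anonymity alone is insufficient for opinion data: both of its records share the same opinion, so an attacker who links a participant to that group learns the sensitive value exactly, which is what the l-diversity requirement rules out.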