
    Assessing Utility of Differential Privacy for RCTs

    Randomized controlled trials (RCTs) have become a powerful tool for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for inference in the biomedical fields and in many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of the inference, and these studies typically include the response data collected, de-identified and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of strong privacy-preservation methodology (with differential privacy (DP) guarantees) on published analyses from RCTs, leveraging the availability of replication packages (research compendia) in economics and policy analysis. We provide simulation studies and demonstrate how we can replicate the analysis in a published economics article on privacy-protected data under various parametrizations. We find that relatively straightforward DP-based methods allow for inference-valid protection of the published data, though computational issues may limit more complex analyses from using these methods. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.
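    A minimal sketch of the kind of "relatively straightforward DP-based method" the abstract alludes to, assuming a bounded numeric RCT outcome protected record-by-record with the Laplace mechanism before release; the column names, bounds, and epsilon value are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def laplace_mechanism(values, lower, upper, epsilon, rng=None):
    """Release a clipped numeric column with per-record Laplace noise.

    For a single bounded record the sensitivity is (upper - lower), so adding
    Laplace noise with scale (upper - lower) / epsilon yields an epsilon-DP
    release of that record under input perturbation.
    """
    rng = rng or np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    scale = (upper - lower) / epsilon
    return clipped + rng.laplace(loc=0.0, scale=scale, size=len(clipped))

# Illustrative use: protect a hypothetical bounded RCT outcome, then rerun a
# simple difference-in-means estimate on the protected data.
rng = np.random.default_rng(0)
treat = rng.binomial(1, 0.5, size=1_000)                   # hypothetical assignment
outcome = 10 + 2 * treat + rng.normal(0, 3, size=1_000)    # hypothetical outcome

protected = laplace_mechanism(outcome, lower=0, upper=25, epsilon=1.0, rng=rng)
effect_original = outcome[treat == 1].mean() - outcome[treat == 0].mean()
effect_protected = protected[treat == 1].mean() - protected[treat == 0].mean()
print(f"original ATE estimate:  {effect_original:.3f}")
print(f"protected ATE estimate: {effect_protected:.3f}")
```

    Comparing the two estimates across many simulated draws is the sort of utility check the paper's replication exercise performs at much larger scale.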

    User's Privacy in Recommendation Systems Applying Online Social Network Data, A Survey and Taxonomy

    Recommender systems have become an integral part of many social networks and extract knowledge from a user's personal and sensitive data both explicitly, with the user's knowledge, and implicitly. This trend has created major privacy concerns, as users are mostly unaware of what data is being used, how much of it is used, and how securely it is handled. In this context, several works have addressed privacy concerns around the use of online social network data by recommender systems. This paper surveys the main privacy concerns, measurements, and privacy-preserving techniques used in large-scale online social networks and recommender systems. It draws on prior work on security, privacy preservation, statistical modeling, and datasets to provide an overview of the technical difficulties and problems associated with privacy preservation in online social networks.

    Impacts of frequent itemset hiding algorithms on privacy preserving data mining

    Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2010. Includes bibliographical references (leaves: 54-58). Text in English; abstract in Turkish and English. x, 69 leaves.
    The inexorable growth of computing capabilities and the collection of large amounts of data in recent years have made data mining a popular analysis tool. Association rules (frequent itemsets), classification, and clustering are the main methods used in data mining research. The first part of this thesis is the implementation and comparison of two frequent itemset mining algorithms that work without candidate itemset generation: Matrix Apriori and FP-Growth. The comparison revealed that Matrix Apriori has higher performance owing to its faster data structure. One of the great challenges of data mining is finding hidden patterns without violating data owners' privacy, and privacy-preserving data mining came into prominence as a solution. In the second study of the thesis, the Matrix Apriori algorithm is modified and a frequent itemset hiding framework is developed. Four frequent itemset hiding algorithms are proposed such that: i) all versions work without pre-mining, so the privacy breach caused by the knowledge obtained from finding frequent itemsets is prevented in advance; ii) efficiency is increased since no pre-mining is required; iii) supports are found during the hiding process, and at the end the sanitized dataset and its frequent itemsets are given as outputs, so no post-mining is required; iv) the heuristics use pattern lengths rather than transaction lengths, eliminating the possibility of distorting more valuable data.
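    A minimal sketch of the general support-reduction idea behind frequent itemset hiding: lower the support of a sensitive itemset below the mining threshold by deleting one of its items from selected transactions. This is a generic greedy heuristic for illustration only, not the thesis's Matrix Apriori based, pattern-length-driven algorithms; the victim-selection rule and data are assumptions.

```python
def hide_itemset(transactions, sensitive, min_support):
    """Sanitize transactions so that `sensitive` falls below `min_support`.

    Greedy heuristic: while the sensitive itemset is still frequent, pick a
    supporting transaction and remove one item of the itemset from it.
    """
    sensitive = set(sensitive)
    transactions = [set(t) for t in transactions]

    def support():
        return sum(1 for t in transactions if sensitive <= t)

    while support() >= min_support:
        # Pick the shortest supporting transaction to limit side effects on
        # other, non-sensitive patterns (an illustrative choice only).
        victim = min((t for t in transactions if sensitive <= t), key=len)
        victim.discard(next(iter(sensitive)))
    return [sorted(t) for t in transactions]

# Illustrative use: hide {bread, milk} below a support count of 2.
data = [["bread", "milk", "eggs"], ["bread", "milk"],
        ["bread", "beer"], ["milk", "eggs"]]
print(hide_itemset(data, {"bread", "milk"}, min_support=2))
```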

    Adaptive Anomaly Detection via Self-Calibration and Dynamic Updating

    The deployment and use of Anomaly Detection (AD) sensors often require the intervention of a human expert to manually calibrate and optimize their performance. Depending on the site and the type of traffic it receives, the operators might have to provide recent and sanitized training data sets, the characteristics of expected traffic (i.e., the outlier ratio), and exceptions or even expected future modifications of the system's behavior. In this paper, we study the potential performance issues that stem from fully automating the AD sensors' day-to-day maintenance and calibration. Our goal is to remove the dependence on a human operator, using an unlabeled, and thus potentially dirty, sample of incoming traffic. To that end, we propose to enhance the training phase of AD sensors with a self-calibration phase, leading to the automatic determination of the optimal AD parameters. We show how this novel calibration phase can be employed in conjunction with previously proposed methods for training-data sanitization, resulting in a fully automated AD maintenance cycle. Our approach is completely agnostic to the underlying AD sensor algorithm. Furthermore, the self-calibration can be applied in an online fashion to ensure that the resulting AD models reflect changes in the system's behavior which would otherwise render the sensor's internal state inconsistent. We verify the validity of our approach through a series of experiments in which we compare the manually obtained optimal parameters with the ones computed by the self-calibration phase. Modeling traffic from two different sources, the fully automated calibration shows, in the worst case, a 7.08% reduction in detection rate and a 0.06% increase in false positives when compared to the optimal selection of parameters. Finally, our adaptive models outperform the statically generated ones, retaining the gains in performance from the sanitization process over time.
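    A minimal, sensor-agnostic sketch of the self-calibration and online-updating idea, assuming the AD sensor exposes a numeric anomaly score per sample: the decision threshold is chosen automatically from an unlabeled (possibly dirty) calibration sample by targeting an estimated outlier ratio, then blended with recalibrations on newer traffic. The score distribution, outlier ratio, and blending factor are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def self_calibrate_threshold(scores, outlier_ratio=0.01):
    """Pick a threshold so roughly `outlier_ratio` of the unlabeled
    calibration sample would be flagged as anomalous.

    Only anomaly scores are needed, so the step is agnostic to the
    underlying AD algorithm.
    """
    return np.quantile(scores, 1.0 - outlier_ratio)

def update_threshold(old_threshold, new_scores, outlier_ratio=0.01, alpha=0.1):
    """Online update: blend the old threshold with one recalibrated on the
    latest traffic window, so the model tracks drift in system behavior."""
    new_threshold = self_calibrate_threshold(new_scores, outlier_ratio)
    return (1 - alpha) * old_threshold + alpha * new_threshold

# Illustrative use with synthetic anomaly scores.
rng = np.random.default_rng(1)
calibration_scores = rng.gamma(shape=2.0, scale=1.0, size=10_000)
threshold = self_calibrate_threshold(calibration_scores, outlier_ratio=0.01)
print(f"auto-calibrated threshold: {threshold:.3f}")
```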

    Large-scale Wireless Local-area Network Measurement and Privacy Analysis

    The edge of the Internet is increasingly becoming wireless, so understanding the wireless edge is important for understanding the performance and security aspects of the Internet experience. This is especially true for enterprise-wide wireless local-area networks (WLANs), as organizations increasingly depend on WLANs for mission-critical tasks. Studying a live production WLAN, especially a large-scale network, is a difficult undertaking. Two fundamental difficulties are (1) building a scalable network measurement infrastructure to collect traces from a large-scale production WLAN, and (2) preserving user privacy while sharing the collected traces with the network research community. In this dissertation, we present our experience in designing and implementing one of the largest distributed WLAN measurement systems in the United States, the Dartmouth Internet Security Testbed (DIST), with a particular focus on our solutions to the challenges of efficiency, scalability, and security. We also present an extensive evaluation of the DIST system. To understand the severity of some potential trace-sharing risks for an enterprise-wide, large-scale wireless network, we conduct a privacy analysis on one kind of wireless network trace, a user-association log, collected from a large-scale WLAN. We introduce a machine-learning based approach that can extract and quantify sensitive information from a user-association log, even though it has been sanitized. Finally, we present a case study that evaluates the tradeoff between utility and privacy in WLAN trace sanitization.
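    A minimal sketch of the style of machine-learning privacy analysis described above, assuming the sanitized user-association log has already been turned into per-user feature vectors (e.g. association counts per access point and hour bucket). The features, labels, model choice, and synthetic data are illustrative stand-ins; the dissertation's actual attack and feature set are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data: each row is a pseudonymized user's association profile,
# and the label is a sensitive attribute the sanitization was supposed to
# hide (e.g. the user's department).
rng = np.random.default_rng(2)
n_users, n_features = 500, 48
X = rng.poisson(lam=3.0, size=(n_users, n_features))
y = rng.integers(0, 4, size=n_users)  # 4 hypothetical departments

# If the classifier beats the majority-class baseline, the sanitized log
# still leaks the sensitive attribute.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
baseline = np.bincount(y).max() / n_users
print(f"attack accuracy: {scores.mean():.3f} vs. baseline {baseline:.3f}")
```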

    Efficient Generation of Social Network Data from Computer-Mediated Communication Logs

    The insider threat poses a significant risk to any network or information system. A general definition of the insider threat is an authorized user performing unauthorized actions, a broad definition with no specification of severity or action. While limited research has been able to classify and detect insider threats, it is generally understood that insider attacks are planned and that there is a time period in which the organization's leadership can intervene and prevent the attack. Previous studies have shown that the person's behavior will generally change, and it is possible that social network analysis could be used to observe those changes. Unfortunately, generating social network data can be a time-consuming and manually intensive process. This research discusses the automatic generation of such data from computer-mediated communication records. Using the tools developed in this research, raw social network data can be gathered from communication logs quickly and cheaply. Ideas on further analysis of this data for insider threat mitigation are then presented.
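    A minimal sketch of generating social network data from communication logs, assuming each record carries a sender, a list of recipients, and a timestamp; the field names and the simple message-count weighting are illustrative, not the thesis's actual tooling.

```python
from collections import Counter

def build_edge_list(log_records):
    """Turn raw communication records into a weighted, directed edge list.

    Each record is expected to look like:
        {"sender": "alice", "recipients": ["bob", "carol"], "timestamp": ...}
    The weight of an edge (a, b) is the number of messages a sent to b.
    """
    edges = Counter()
    for record in log_records:
        sender = record["sender"]
        for recipient in record["recipients"]:
            if recipient != sender:
                edges[(sender, recipient)] += 1
    return [(a, b, w) for (a, b), w in edges.items()]

# Illustrative use with a toy e-mail log.
log = [
    {"sender": "alice", "recipients": ["bob"], "timestamp": "2024-01-01T09:00"},
    {"sender": "alice", "recipients": ["bob", "carol"], "timestamp": "2024-01-01T10:15"},
    {"sender": "bob", "recipients": ["alice"], "timestamp": "2024-01-01T10:20"},
]
for sender, recipient, weight in build_edge_list(log):
    print(f"{sender} -> {recipient}: {weight}")
```

    The resulting edge list can then be fed into standard social network analysis tooling, for example to track changes in a user's centrality over time, which is the kind of behavioral-change signal the abstract suggests could support insider threat mitigation.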

    Privacy evaluation of fairness-enhancing pre-processing techniques

    The prevalence of decision-making algorithms, based on increasingly powerful pattern-recognition machine learning models, has brought a growing wave of concern about discrimination and the fairness of those algorithms' predictions, as well as their impact on the treatment of minority or under-represented groups. This has in turn fuelled the development of new techniques to mitigate those issues and helped outline the challenges related to them. In this work, we analyse recent fairness-enhancing pre-processing techniques and evaluate how they control the fairness-utility trade-off and the dataset's ability to be used successfully in downstream tasks. We focus on three techniques that attempt to hide a sensitive attribute in a dataset: two based on Generative Adversarial Network architectures (LAFTR [67] and GANSan [6]), and one based on a deterministic transformation of the dataset relying on density functions (Disparate Impact Remover [33]). First, we analyse the control over the fairness-utility trade-off that each of these techniques offers. We then attempt to revert the transformation each technique applies to the data, using a variation of an auto-encoder built specifically for this purpose, which we call the reconstructor. Lastly, we show that even though these techniques offer practical guarantees on specific fairness metrics, basic machine learning classifiers are often able to successfully predict the sensitive attribute from the transformed data, effectively enabling discrimination and potentially nullifying the protection the technique was meant to provide, creating serious privacy risks. This creates what we believe is a major issue in fairness-enhancing technique research, which is in large part due to the intricate relationship between fairness and privacy.
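    A minimal sketch of the final experiment described above: train a basic classifier to recover the sensitive attribute from the transformed dataset and compare it against the majority-class baseline. The transformed features here are synthetic placeholders; in practice the outputs of LAFTR, GANSan, or Disparate Impact Remover would be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical stand-in for a dataset already transformed by a fairness
# pre-processing technique; `s` is the sensitive attribute it should hide.
rng = np.random.default_rng(3)
X_transformed = rng.normal(size=(2_000, 20))
s = rng.integers(0, 2, size=2_000)

X_train, X_test, s_train, s_test = train_test_split(
    X_transformed, s, test_size=0.3, random_state=0
)
attacker = LogisticRegression(max_iter=1_000).fit(X_train, s_train)
attack_acc = accuracy_score(s_test, attacker.predict(X_test))
baseline = max(np.mean(s_test), 1 - np.mean(s_test))

# Attack accuracy well above the baseline means the sensitive attribute is
# still recoverable despite the fairness transformation.
print(f"sensitive-attribute attack accuracy: {attack_acc:.3f} (baseline {baseline:.3f})")
```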