
    Understanding and controlling leakage in machine learning

    Machine learning models are being increasingly adopted in a variety of real-world scenarios. However, the privacy and confidentiality implications introduced in these scenarios are not well understood. Towards better understanding such implications, we focus on scenarios involving interactions between numerous parties prior to, during, and after training relevant models. Central to these interactions is sharing information for a purpose, e.g., contributing data samples towards a dataset or returning predictions via an API. This thesis takes a step toward understanding and controlling leakage of private information during such interactions. In the first part of the thesis, we investigate leakage of private information in visual data, specifically photos representative of content shared on social networks. There is a long line of work tackling leakage of personally identifiable information in social photos, especially using face- and body-level visual cues. However, we argue this presents only a narrow perspective, as images reveal a wide spectrum of multimodal private information (e.g., disabilities, name-tags). Consequently, we work towards a Visual Privacy Advisor that aims to holistically identify and mitigate privacy risks when sharing social photos. In the second part, we address leakage during training of ML models. We observe that learning algorithms are being increasingly used to train models on rich decentralized datasets, e.g., personal data on numerous mobile devices. In such cases, information in the form of high-dimensional model parameter updates is anonymously aggregated from participating individuals. However, we find that the updates encode sufficient identifying information to be linked back to participating individuals. We additionally propose methods to mitigate this leakage while maintaining high utility of the updates. In the third part, we discuss leakage of confidential information at inference time of black-box models. In particular, we find models lend themselves to model functionality stealing attacks: an adversary can interact with the black-box model towards creating a replica `knock-off' model that exhibits similar test-set performance. As such attacks pose a severe threat to the intellectual property of the model owner, we also work towards effective defenses. Our defense strategy, which introduces bounded and controlled perturbations to predictions, can significantly amplify the error rates of model stealing attackers. In summary, this thesis advances understanding of privacy leakage when information is shared in raw visual forms, during training of models, and at inference time when models are deployed as black boxes. In each case, we further propose techniques to mitigate the leakage of information, enabling widespread adoption in real-world scenarios.
    Max Planck Institute for Informatics
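
    As a rough illustration of the defense idea (bounded, controlled perturbations to predictions), the sketch below adds zero-sum noise to a returned probability vector under an L1 budget while keeping the top-1 label intact, so benign accuracy is preserved while a model-stealing adversary trains on distorted signals. This is a minimal stand-in, not the thesis's actual defense; `epsilon` and the label-preservation constraint are assumptions.

```python
import numpy as np

def perturb_prediction(p, epsilon=0.2, rng=None):
    """Return a perturbed probability vector within an L1 ball of radius
    epsilon around p, keeping the top-1 label unchanged (utility constraint)."""
    rng = np.random.default_rng() if rng is None else rng
    top = int(np.argmax(p))
    noise = rng.normal(size=p.shape)
    noise -= noise.mean()                     # zero-sum: perturbed vector still sums to 1
    noise *= epsilon / (np.abs(noise).sum() + 1e-12)
    q = np.clip(p + noise, 1e-6, None)
    q /= q.sum()
    if int(np.argmax(q)) != top:              # revert if the predicted label flipped
        return p
    return q

p = np.array([0.7, 0.2, 0.1])
print(perturb_prediction(p))                  # distorted probabilities, same argmax
```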

    Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data

    In the real world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, the majority of existing solutions rely substantially on feature engineering, such as biographical feature extraction or construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable in privacy-sensitive domains. Instead, we solve the name disambiguation task in a restricted setting by leveraging only the relational data in the form of anonymized graphs. Second, most of the existing works for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task be performed in an online streaming fashion in order to identify records of new ambiguous entities having no preexisting records. Finally, we investigate the potential disclosure risk of textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, in this dissertation, we present a number of novel approaches to address the name disambiguation task from the above three aspects independently, namely relational, streaming, and privacy-preserving textual data.
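
    As a toy illustration of the restricted relational setting, one can treat each document bearing the ambiguous name as a graph node, connect documents that share anonymized relational evidence (e.g., a common co-author node), and read each connected component as one real-life person. The sketch below uses networkx with invented document IDs and is far simpler than the dissertation's actual methods.

```python
import networkx as nx

# Each document linked to the shared name reference is a node; edges connect
# documents that share an (anonymized) co-author or citation relation.
G = nx.Graph()
docs = ["d1", "d2", "d3", "d4"]
G.add_nodes_from(docs)
G.add_edges_from([("d1", "d2"), ("d3", "d4")])   # hypothetical relational evidence

# Minimal disambiguation: each connected component is treated as the set of
# documents belonging to one real-life person behind the shared name.
partitions = [sorted(c) for c in nx.connected_components(G)]
print(partitions)   # e.g. [['d1', 'd2'], ['d3', 'd4']]
```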

    ESSAYS ON QUALITY CERTIFICATION IN FOOD AND AGRICULTURAL MARKETS

    This dissertation features three essays exploring the market impacts of two types of quality certification: a voluntary non-GMO label and a mandatory food safety standard. In the first essay, I use a hedonic framework to examine whether firms use a voluntary quality certification for non-GMO products to extract rent from customers. Using U.S. retail scanner data coupled with data from a voluntary non-GMO label, I find no evidence of price premiums or quantity changes for newly certified non-GMO products. Instead, the label may induce firms to develop new non-GMO products targeted to high-valuation consumers. The second essay examines how voluntary non-GMO food labeling impacts demand in the ready-to-eat (RTE) cereal industry. I estimate a discrete-choice, random-coefficients logit demand model using monthly data for 50 cereal brands across 100 designated market areas (DMAs). Consumer tastes for the label are widely distributed, and this heterogeneity plays a substantial role in individual choices; but, on average, the non-GMO label has a positive impact on demand. I estimate welfare effects by simulating two labeling scenarios: one in which all brands use the non-GMO label, and one in which no brands use the label. The simulation results suggest that non-GMO labeling in the RTE cereal industry may improve consumer welfare on average. In the final essay, we use data from an original national survey of produce growers to examine whether complying with the Food Safety Modernization Act's Produce Rule will be prohibitively costly for some growers. We examine how expenditures on the food safety measures required by the Rule vary with farm size and practices, using a double hurdle model to control for selectivity in using food safety practices and reporting expenditures. Expenditures per acre decrease with farm size, and growers using sustainable farming practices spend more than conventional growers on many food safety practices. We use our estimates to quantify how the cost burden of compliance varies with farm size. We also explore the policy implications of exemptions to the Rule by simulating how changes to exemption thresholds might affect the cost burden of each food safety practice on farms at the threshold.
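
    For readers unfamiliar with the hedonic framework of the first essay, the idea can be sketched as regressing log price on product attributes plus a non-GMO label indicator, whose coefficient measures any label premium. The sketch below uses simulated data with assumed variable names and effect sizes; it is not the dissertation's specification or data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
non_gmo = rng.integers(0, 2, n)          # 1 if the product carries the label
size_oz = rng.uniform(10, 30, n)         # a hypothetical product attribute
# Simulated log price with a zero true label premium (as the essay finds).
log_price = 1.0 + 0.02 * size_oz + 0.00 * non_gmo + rng.normal(0, 0.1, n)

X = sm.add_constant(np.column_stack([non_gmo, size_oz]))
fit = sm.OLS(log_price, X).fit()
print(fit.params)   # coefficient on non_gmo estimates the label's price premium
```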

    Detection and Explanation of Distributed Denial of Service (DDoS) Attack Through Interpretable Machine Learning

    Distributed denial of service (DDoS) is a network-based attack in which the attacker aims to overwhelm a victim server. The attacker floods the server by sending an enormous number of network packets in a distributed manner, beyond the server's capacity, thus disrupting its normal service. In this dissertation, we focus on building intelligent detectors that can learn with little human interaction and detect DDoS attacks accurately. Machine learning (ML) has shown promising outcomes across technologies, including cybersecurity, and provides intelligence when applied to intrusion detection systems (IDSs). In addition, among state-of-the-art ML-based IDSs, ensemble classifiers (combinations of classifiers) outperform single classifiers. Therefore, we have implemented both supervised and unsupervised ensemble frameworks to build IDSs with better DDoS detection accuracy and lower false alarms compared to existing ones. Our experiments, conducted on popular benchmark datasets such as NSL-KDD, UNSW-NB15, and CICIDS2017, achieve detection accuracy of up to 99.1% with a false positive rate as low as 0.01%. As feature selection is one of the mandatory preprocessing phases in ML classification, we have designed several feature selection techniques for better performance in terms of DDoS detection accuracy, false positive alarms, and training time. Initially, we implemented an ensemble framework for feature selection (FS) methods which combines almost all well-known FS methods and yields better outcomes compared to any single FS method. The goal of my dissertation is not only to detect DDoS attacks precisely but also to provide explanations for these detections. Interpretable machine learning (IML) techniques are used to explain a detected DDoS attack in terms of the effectiveness of the corresponding features. We also implemented a novel feature selection approach based on IML which helps to find optimal features that are then used to retrain our models. The retrained model performs better than with the general feature selection process. Moreover, we developed an explainer model using IML that identifies detected DDoS attacks with proper explanations based on the effectiveness of the features. The contribution of this dissertation is five-fold, with the ultimate goal of detecting the most frequent DDoS attacks in cyber security. To detect DDoS attacks, we first used ensemble machine learning classification with both supervised and unsupervised classifiers. For better performance, we then implemented and applied two feature selection approaches, an ensemble feature selection framework and an IML-based feature selection approach, both individually and in combination with the supervised ensemble framework. Furthermore, we added explanations for the detected DDoS attacks with the help of explainer models built using the LIME and SHAP IML methods. To build trustworthy explainer models, a detailed survey was conducted on interpretable machine learning methods and their associated tools. We applied the designed framework in various domains, such as smart grids and NLP-based IDSs, to verify its efficacy and its ability to perform as a generic model.
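
    A minimal sketch of the two main ingredients, a soft-voting supervised ensemble and a SHAP-based explanation of a detection, is shown below. The features and labels are synthetic stand-ins for flow records from a dataset such as CICIDS2017, and the exact ensemble composition is an assumption, not the dissertation's configuration.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for preprocessed flow features: 1 = DDoS, 0 = benign.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Supervised ensemble: soft voting over heterogeneous base classifiers.
ens = VotingClassifier([
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
], voting="soft").fit(X_tr, y_tr)
print("accuracy:", ens.score(X_te, y_te))

# SHAP explanation of one detection via a tree-based model: per-feature
# contributions indicate which features drove the DDoS verdict.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
explainer = shap.TreeExplainer(rf)
print(explainer.shap_values(X_te[:1]))
```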

    Anonymizing and Trading Person-specific Data with Trust

    In the past decade, data privacy, security, and trustworthiness have gained tremendous attention from research communities, and these are still active areas of research with the proliferation of cloud services and social media applications. Data is growing at a rapid pace. It has become an integral part of almost every industry and business, including commercial and non-profit organizations. It often contains person-specific information, and a data custodian who holds it must be responsible for managing its use, disclosure, accuracy, and privacy protection. In this thesis, we present three research problems. The first two problems address the concerns of stakeholders regarding privacy protection, data trustworthiness, and profit distribution in the online market for trading person-specific data. The third problem addresses health information custodians' (HICs') concerns about privacy-preserving healthcare network data publishing. Our first research problem is identified in cloud-based data integration services, where data providers collaborate with their trading partners in order to deliver quality data mining services. Data-as-a-Service (DaaS) enables data integration to serve the demands of data consumers. Data providers face challenges not only to protect private data over the cloud but also to legally adhere to privacy compliance rules when trading person-specific data. We propose a model that allows the collaboration of multiple data providers for integrating their data and derives the contribution of each data provider by valuating the incorporated cost factors. This model serves as a guide for business decision-making, such as estimating the potential privacy risk and finding the sub-optimal value for publishing mashup data. Experiments on real-life data demonstrate that our approach can identify the sub-optimal value in data mashup for different privacy models, including K-anonymity, LKC-privacy, and ϵ-differential privacy, with various anonymization algorithms and privacy parameters. Second, consumers demand good-quality data for accurate analysis and effective decision-making, while data providers intend to maximize their profits by competing with peer providers. In addition, the data providers or custodians must conform to privacy policies to avoid potential penalties for privacy breaches. To address these challenges, we propose a two-fold solution: (1) we present the first information entropy-based trust computation algorithm, IEB_Trust, that allows a semi-trusted arbitrator to detect the covert behavior of a dishonest data provider and choose qualified providers for a data mashup, and (2) we incorporate the Vickrey-Clarke-Groves (VCG) auction mechanism for the valuation of data providers' attributes into the data mashup process. Experiments on real-life data demonstrate the robustness of our approach in restricting dishonest providers from participating in the data mashup and in improving efficiency in comparison to provenance-based approaches. Furthermore, we derive the monetary shares for the chosen providers from their information utility and trust scores over the differentially private release of the integrated dataset under their joint privacy requirements. Finally, we address the concerns of HICs about exchanging healthcare data to provide better and more timely services while mitigating the risk of exposing patients' sensitive information to privacy threats.
    We first model a complex healthcare dataset using a heterogeneous information network that consists of multi-type entities and their relationships. We then propose DiffHetNet, an edge-based differentially private algorithm, to protect the sensitive links of patients from inbound and outbound attacks in the heterogeneous health network. We evaluate the performance of our proposed method in terms of information utility and efficiency on different types of real-life datasets that can be modeled as networks. Experimental results suggest that DiffHetNet generally yields less information loss and is significantly more efficient in terms of runtime in comparison with existing network anonymization methods. Furthermore, DiffHetNet is scalable to large network datasets.
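
    One plausible reading of an information entropy-based trust check is sketched below: compare the entropy of the attribute values a provider claimed to hold against the entropy of what it actually submits, so a covert provider that pads its contribution with duplicates scores lower. This is an illustrative stand-in, not the IEB_Trust algorithm itself; the scoring formula is an assumption.

```python
import numpy as np

def shannon_entropy(values):
    """Shannon entropy (bits) of a provider's attribute value distribution."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def trust_score(claimed, observed):
    """Score in [0, 1]: 1 when the entropy of the data a provider actually
    submits matches the entropy of what it claimed to hold."""
    h_c, h_o = shannon_entropy(claimed), shannon_entropy(observed)
    return 1.0 - abs(h_c - h_o) / max(h_c, h_o, 1e-12)

honest = trust_score(["a", "b", "c", "a"], ["a", "b", "c", "a"])
covert = trust_score(["a", "b", "c", "a"], ["a", "a", "a", "a"])
print(honest, covert)   # the dishonest (padding) provider scores lower
```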

    Achieving network resiliency using sound theoretical and practical methods

    Computer networks have revolutionized the life of every citizen in our modern interconnected society. The impact of networked systems spans every aspect of our lives, from financial transactions to healthcare and critical services, making these systems an attractive target for malicious entities that aim to make financial or political profit. Specifically, the past decade has witnessed an astounding increase in the number and complexity of sophisticated and targeted attacks, known as advanced persistent threats (APTs). Those attacks led to a paradigm shift in the security and reliability communities' perspective on system design; researchers and government agencies accepted the inevitability of incidents and malicious attacks, and marshaled their efforts into the design of resilient systems. Rather than focusing solely on preventing failures and attacks, resilient systems are able to maintain an acceptable level of operation in the presence of such incidents, and then recover gracefully into normal operation. Alongside prevention, resilient system design focuses on incident detection as well as timely response. Unfortunately, the resiliency efforts of research and industry experts have been hindered by an apparent schism between theory and practice, which allows attackers to maintain the upper hand. This lack of compatibility between the theory and practice of system design is attributed to the following challenges. First, theoreticians often make impractical and unjustifiable assumptions that allow for mathematical tractability while sacrificing accuracy. Second, the security and reliability communities often lack clear definitions of success criteria when comparing different system models and designs. Third, system designers often make implicit or unstated assumptions to favor practicality and ease of design. Finally, resilient systems are tested in private and isolated environments where validation and reproducibility of the results are not publicly accessible. In this thesis, we set about showing that the proper synergy between theoretical analysis and practical design can enhance the resiliency of networked systems. We illustrate the benefits of this synergy by presenting resiliency approaches that target the inter- and intra-networking levels. At the inter-networking level, we present CPuzzle as a means to protect the Transmission Control Protocol (TCP) connection establishment channel from state-exhaustion distributed denial of service (DDoS) attacks. CPuzzle leverages client puzzles to limit the rate at which misbehaving users can establish TCP connections. We model the problem of determining the puzzle difficulty as a Stackelberg game and solve for the equilibrium strategy that balances the users' utilities against CPuzzle's resilience capabilities. Furthermore, to handle volumetric DDoS attacks, we extend CPuzzle and implement Midgard, a cooperative approach that involves end-users in the process of tolerating and neutralizing DDoS attacks. Midgard is a middlebox that resides at the edge of an Internet service provider's network and uses client puzzles at the IP level to allocate bandwidth to its users. At the intra-networking level, we present sShield, a game-theoretic network response engine that manipulates a network's connectivity in response to an attacker who is moving laterally to compromise a high-value asset.
    To implement such decision-making algorithms, we leverage recent advances in software-defined networking (SDN) to collect logs and security alerts about the network and to implement response actions. However, the programmability offered by SDN comes with an increased chance of design-time bugs that can have drastic consequences for the reliability and security of a networked system. We therefore introduce BiFrost, an open-source tool that aims to verify safety and security properties of data-plane programs. BiFrost translates data-plane programs into functionally equivalent sequential circuits, and then uses well-established hardware reduction, abstraction, and verification techniques to establish correctness proofs about data-plane programs. By focusing on these four key efforts, CPuzzle, Midgard, sShield, and BiFrost, we believe that this work illustrates the benefits that the synergy between theory and practice can bring to the world of resilient system design. This thesis is an attempt to pave the way for further cooperation and coordination between theoreticians and practitioners, in the hope of designing resilient networked systems.
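
    The client-puzzle mechanism behind CPuzzle and Midgard can be illustrated with a hashcash-style proof of work: the server issues a random nonce and a difficulty, and the client must expend CPU finding a suffix whose hash clears the difficulty before its connection or bandwidth request is honored. A minimal sketch, assuming SHA-256 and a leading-zero-bits difficulty measure rather than either system's actual protocol:

```python
import hashlib
import os

def make_puzzle(difficulty_bits):
    """Server side: issue a random nonce; the client must find a suffix such
    that SHA-256(nonce || suffix) has `difficulty_bits` leading zero bits."""
    return os.urandom(16), difficulty_bits

def solve(nonce, difficulty_bits):
    """Client side: brute-force search, costing ~2**difficulty_bits hashes."""
    target = 1 << (256 - difficulty_bits)
    suffix = 0
    while True:
        digest = hashlib.sha256(nonce + suffix.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return suffix
        suffix += 1

def verify(nonce, difficulty_bits, suffix):
    """Server side: one hash suffices to check the submitted solution."""
    digest = hashlib.sha256(nonce + suffix.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

nonce, bits = make_puzzle(16)     # difficulty could rise for suspected abusers
s = solve(nonce, bits)
print(verify(nonce, bits, s))     # True
```

    The asymmetry is the point: verification costs the server one hash, while solving costs the client exponentially more as the difficulty rises, which is what lets a Stackelberg-style difficulty policy throttle misbehaving users.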

    Ex-ante demand assessment of human excreta recovery and reuse in agriculture.

    Doctoral Degree. University of KwaZulu-Natal, Pietermaritzburg. Abstract available in PDF.

    Privacy-Aware Risk-Based Access Control Systems

    Modern organizations collect massive amounts of data, both internally (from their employees and processes) and externally (from customers, suppliers, and partners). The increasing availability of these large datasets has been made possible by growing storage and processing capabilities. From a technical perspective, organizations are therefore now in a position to exploit these diverse datasets to create new data-driven businesses or to optimize existing processes (real-time customization, predictive analytics, etc.). However, this kind of data often contains very sensitive information that, if leaked or misused, can lead to privacy violations. Privacy is becoming increasingly relevant for organizations and businesses, due to strong regulatory frameworks (e.g., the EU General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA)) and the increasing awareness of citizens about personal data issues. Privacy breaches and failure to meet privacy requirements can have a tremendous impact on companies (e.g., reputation loss, noncompliance fines, legal actions). Privacy violation threats are not caused exclusively by external actors gaining access through security gaps. Privacy breaches can also originate from internal actors, sometimes even trusted and authorized ones. As a consequence, most organizations prefer to strongly limit (even internally) the sharing and dissemination of data, thereby making most of the information unavailable to decision-makers and preventing the organization from fully exploiting the power of these new data sources. In order to unlock this potential, while controlling the privacy risk, it is necessary to develop novel data sharing and access control mechanisms able to support risk-based decision making and weigh the advantages of information against privacy considerations. To achieve this, access control decisions must be based on a (dynamically assessed) estimation of expected costs and benefits compared to the risk, and not (as in traditional access control systems) on a predefined policy that statically defines which accesses are allowed and denied. In risk-based access control, the risk corresponding to each access request is estimated; if the risk is lower than a given threshold (possibly related to the trustworthiness of the requester), access is granted, otherwise it is denied. The aim is to be more permissive than traditional access control systems and thereby allow better exploitation of data. Although existing risk-based access control models provide an important step towards better management and exploitation of data, they have a number of drawbacks which limit their effectiveness. In particular, most existing risk-based systems only support binary access decisions: the outcome is “allowed” or “denied”, whereas in real life we often have exceptions based on additional conditions (e.g., “I cannot provide this information unless you sign the following non-disclosure agreement.” or “I cannot disclose this data, because it contains personally identifiable information, but I can disclose an anonymized version of the data.”). In other words, the system should be able to propose risk mitigation measures that reduce the risk (e.g., disclosing a partial or anonymized version of the requested data) instead of denying risky access requests.
    Alternatively, it should be able to propose appropriate trust enhancement measures (e.g., stronger authentication); once these are accepted or fulfilled by the requester, more information can be shared. The aim of this thesis is to propose and validate a novel privacy-enhancing access control approach offering adaptive and fine-grained access control for sensitive datasets. This approach enhances access to data, but it also mitigates privacy threats originated by authorized internal actors. In more detail: 1. We demonstrate the relevance and evaluate the impact of threats from authorized actors. To this aim, we developed EPIC (Evaluating Privacy violation rIsk in Cyber security systems), a privacy threat identification methodology, and apply EPIC in a cybersecurity use case where very sensitive information is used. 2. We present a privacy-aware risk-based access control framework that supports access control in dynamic contexts through trust enhancement mechanisms and privacy risk mitigation strategies. This allows us to strike a balance between the privacy risk and the trustworthiness of the data request. If the privacy risk is too large compared to the trust level, the framework can identify adaptive strategies that decrease the privacy risk (e.g., by removing or obfuscating part of the data through anonymization) and/or increase the trust level (e.g., by imposing additional obligations on the requester). 3. We show how the privacy-aware risk-based approach can be integrated into existing access control models such as RBAC and ABAC, and that it can be realized using a declarative policy language with a number of advantages, including usability, flexibility, and scalability. 4. We evaluate our approach using several industrially relevant use cases, elaborated to meet the requirements of the industrial partner (SAP) of this industrial doctorate.
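
    A minimal sketch of the framework's decision logic, under assumed numeric placeholders for risk, trust, and the effect of each measure: estimate the effective risk of a request, grant if it falls under a threshold, and otherwise check whether an anonymized view (risk mitigation) or a signed obligation (trust enhancement) would bring it under the threshold before denying.

```python
from dataclasses import dataclass

@dataclass
class Request:
    risk: float        # estimated privacy risk of granting full access
    trust: float       # trust level of the requester

def decide(req, threshold=0.5, anonymization_gain=0.4, obligation_gain=0.2):
    """Grant, deny, or propose a mitigation / trust-enhancement measure.
    The gain constants are hypothetical stand-ins for assessed effects."""
    effective = req.risk - req.trust
    if effective <= threshold:
        return "allow"
    if effective - anonymization_gain <= threshold:
        return "allow anonymized view"             # risk mitigation measure
    if effective - obligation_gain <= threshold:
        return "allow after signed obligation"     # trust enhancement measure
    return "deny"

print(decide(Request(risk=0.4, trust=0.1)))   # allow
print(decide(Request(risk=0.9, trust=0.1)))   # allow anonymized view
print(decide(Request(risk=1.2, trust=0.0)))   # deny
```

    The point of the sketch is the non-binary outcome space: rather than a hard allow/deny, the decision function surfaces the cheapest measure that brings the request within the acceptable risk budget.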

    The Relationship Between Caregiver Intimate Partner Violence, Posttraumatic Stress, Child Cognitive Self-development, And Treatment Attrition Among Child Sexual Abuse Victims.

    Child sexual abuse (CSA) is a worldwide problem, with two-thirds of all cases going unreported. A wealth of research over the last 30 years demonstrates the negative emotional, cognitive, physical, spiritual, academic, and social effects of CSA. As a result, researchers and mental health professionals frequently attempt to measure the efficacy of treatment modalities in order to assess which treatments lead to better outcomes. However, in order to effectively study treatment outcomes, researchers must be able to track the status of child functioning and symptomology before, during, and after treatment. Because high levels of treatment attrition exist among CSA victims, researchers are unable to effectively study outcomes due to large losses in research participants, loss of statistical power, and threats to external validity (Kazdin, 1990). Moreover, due to the high prevalence of concurrent family violence, caregivers experiencing intimate partner violence are more than twice as likely to have children who are also direct victims of abuse (Kazdin, 1996). Caregivers ultimately make the decisions regarding whether or not a child stays in treatment; therefore, it is important to examine the influence of both parent factors (e.g., intimate partner violence) and child factors (e.g., traumatization and/or disturbances in cognitive self-development) on treatment attrition. This two-pronged approach of examining both child and family characteristics simultaneously with attrition patterns offers a more complete picture of the ways concurrent family violence influences treatment than looking at child and caregiver factors separately. The purpose of this study was to investigate the relationships between caregiver intimate partner violence, child posttraumatic stress (Trauma Symptom Checklist for Children [TSCC]; Briere, 1996), child cognitive self-development (Trauma and Attachment Belief Scale [TABS]; Pearlman, 2003), and treatment attrition. The statistical analyses in this study included (a) logistic regression, (b) Poisson regression, and (c) the chi-square test for independence. Elevated TSCC subscale scores in posttraumatic stress predicted both an increased number of sessions attended and an increased number of sessions missed. Elevated TABS subscale scores in self-trust predicted an increased number of sessions attended and a decreased number of sessions missed. Elevated TABS subscale scores in other-intimacy and self-control predicted an increased number of sessions missed. Moreover, the presence of past or current caregiver intimate partner violence predicted a decrease in the number of sessions attended. While no relationship existed between child posttraumatic stress or cognitive self-development and whether a child graduated or prematurely terminated from treatment, children with parents who confirmed past or current intimate partner violence were 2.5 times more likely to prematurely terminate from treatment.