    Screen Capture for Sensitive Systems

    Maintaining usable security in application domains such as healthcare or power systems requires an ongoing conversation among stakeholders such as end-users, administrators, developers, and policy makers. Each party has power to influence the design and implementation of the application and its security posture, and effective communication among stakeholders is one key to achieving influence and adapting an application to meet evolving needs. In this thesis, we develop a system that combines keyboard/video/mouse (KVM) capture with automatic text redaction to produce precise technical content that can enrich stakeholder communications, improve end-user influence on system evolution, and help reveal the definition of "usable security." Text-redacted screen captures reduce the sensitivity of captured material and thus can facilitate timely data sharing among stakeholders. KVM-based capture makes our system both application and operating-system independent because it eliminates software-interface dependencies on capture targets. Thus, our work can be used to instrument closed or certified systems where capture software cannot be installed or where documentation and support are lacking. It can instrument widely varying platforms that lack standards compliance and interoperability, or redact special document formats while they are displayed onscreen. We present three techniques for redacting text from screenshots and two redaction applications. One application can capture, text-redact, and edit screen video; the other can text-redact and edit static screenshots. We also present empirical measurements of redaction effectiveness and processing latency to demonstrate system performance. When applied to our principal dataset, redaction removes text with over 93% accuracy and simultaneously preserves more than 76% of image pixels on average. Thus by default, it retains more visual context than a technique such as blindly redacting entire screenshots. Finally, our system redacts each screenshot in 0.1 to 21 seconds depending on which technique it applies.
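    The thesis describes KVM-based capture and three redaction techniques of its own; as a hedged illustration of the general idea of masking detected text in a screenshot, the sketch below uses off-the-shelf OCR (pytesseract) to locate word bounding boxes and black them out. File names and the confidence threshold are illustrative assumptions, not part of the original system.

```python
# Illustrative sketch only: locate text with Tesseract OCR and fill each
# detected word's bounding box with a black rectangle.
import cv2                      # pip install opencv-python
import pytesseract              # pip install pytesseract (needs the Tesseract binary)

def redact_text(in_path: str, out_path: str, min_conf: float = 60.0) -> int:
    """Black out every word detected above a confidence threshold.
    Returns the number of regions redacted."""
    image = cv2.imread(in_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    redacted = 0
    for i, conf in enumerate(data["conf"]):
        if float(conf) < min_conf or not data["text"][i].strip():
            continue
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 0), thickness=-1)
        redacted += 1
    cv2.imwrite(out_path, image)
    return redacted

if __name__ == "__main__":
    # Hypothetical input/output paths for demonstration.
    print(redact_text("screenshot.png", "screenshot_redacted.png"), "regions redacted")
```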

    Information protection in content-centric networks

    Information-centric networks have distinct advantages with regard to securing sensitive content as a result of their new approaches to managing data in potential future internet architectures. These kinds of systems, because of their data-centric perspective, provide the opportunity to embed policy-centric content management components that can address looming problems in information distribution that both companies and federal agencies are beginning to face with respect to sensitive content. This information-centricity facilitates the application of security techniques that are very difficult and in some cases impossible to apply in traditional packetized networks. This work addresses the current state of the art in both these kinds of cross-domain systems and information-centric networking in general. It then covers other related work, outlining why information-centric networks are more powerful than traditional packetized networks with regard to usage management. It introduces a taxonomy of types of policy-centric usage-managed information network systems and an associated methodology for evaluating the individual taxonomic elements. Finally, it presents an experimental evaluation of the defined architectural options and compares the results with anticipated outcomes.
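    As a hypothetical sketch of what "policy-centric usage management" can look like in an information-centric setting (the thesis defines a full taxonomy of such systems), the example below attaches a usage policy to a named content object and has a node evaluate the policy before serving the data. The names, fields, and policy vocabulary are illustrative assumptions only.

```python
# Minimal sketch: named content objects carry a usage policy that a serving
# node checks before returning the payload.
from dataclasses import dataclass, field

@dataclass
class ContentObject:
    name: str                                     # e.g. "/agency/reports/42"
    payload: bytes
    policy: dict = field(default_factory=dict)    # e.g. {"allowed_roles": {"analyst"}}

class Node:
    def __init__(self):
        self.store = {}

    def publish(self, obj: ContentObject) -> None:
        self.store[obj.name] = obj

    def fetch(self, name: str, requester_role: str):
        """Return the payload only if the content's policy admits the requester."""
        obj = self.store.get(name)
        if obj is None:
            return None
        allowed = obj.policy.get("allowed_roles", set())
        return obj.payload if requester_role in allowed else None

node = Node()
node.publish(ContentObject("/agency/reports/42", b"sensitive report",
                           policy={"allowed_roles": {"analyst"}}))
print(node.fetch("/agency/reports/42", "analyst"))   # b'sensitive report'
print(node.fetch("/agency/reports/42", "public"))    # None
```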

    Automating Disk Image Redaction

    In order to comply with best preservation and curation practices, collecting institutions must ensure that private and sensitive information contained in born-digital materials has been properly redacted before the materials are made available. Institutions receiving donor media in the form of hard disks, USB flash drives, compact disks, floppy disks, and even entire computers are increasingly creating bit-identical copies called disk images. Redacting data from within a disk image is currently a manual, time-consuming task. In this project, I demonstrate the feasibility of automating disk image redaction using open-source forensic software. I discuss the problems encountered when redacting disk images using automated methods and ways to improve future disk image redaction tools. Master of Science in Information Science
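    The project itself relies on open-source forensic tooling; as a simpler, hedged stand-in for the automated approach, the sketch below scans a raw (uncompressed) disk image for a sensitive pattern such as a Social Security number and overwrites each match in place. The file names and regex are illustrative, and this ignores filesystem structure entirely.

```python
# Sketch: copy the image, then find and overwrite a sensitive pattern in the copy.
import re
import shutil

SSN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")   # example sensitive pattern

def redact_image(src: str, dst: str, pattern=SSN) -> int:
    shutil.copyfile(src, dst)                  # never modify the original evidence copy
    count = 0
    with open(dst, "r+b") as img:
        data = img.read()                      # fine for modest images; stream for large ones
        for match in pattern.finditer(data):
            img.seek(match.start())
            img.write(b"X" * (match.end() - match.start()))
            count += 1
    return count

if __name__ == "__main__":
    # Hypothetical image names for demonstration.
    print(redact_image("donor_drive.dd", "donor_drive_redacted.dd"), "matches redacted")
```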

    Understanding and controlling leakage in machine learning

    Machine learning models are being increasingly adopted in a variety of real-world scenarios. However, the privacy and confidentiality implications introduced in these scenarios are not well understood. Towards better understanding such implications, we focus on scenarios involving interactions between numerous parties prior to, during, and after training relevant models. Central to these interactions is sharing information for a purpose, e.g., contributing data samples to a dataset or returning predictions via an API. This thesis takes a step toward understanding and controlling leakage of private information during such interactions. In the first part of the thesis we investigate leakage of private information in visual data and specifically, photos representative of content shared on social networks. There is a long line of work to tackle leakage of personally identifiable information in social photos, especially using face- and body-level visual cues. However, we argue this presents only a narrow perspective, as images reveal a wide spectrum of multimodal private information (e.g., disabilities, name-tags). Consequently, we work towards a Visual Privacy Advisor that aims to holistically identify and mitigate privacy risks when sharing social photos. In the second part, we address leakage during training of ML models. We observe learning algorithms are being increasingly used to train models on rich decentralized datasets, e.g., personal data on numerous mobile devices. In such cases, information in the form of high-dimensional model parameter updates is anonymously aggregated from participating individuals. However, we find that the updates encode enough identifying information to be linked back to participating individuals. We additionally propose methods to mitigate this leakage while maintaining high utility of the updates. In the third part, we discuss leakage of confidential information during inference time of black-box models. In particular, we find models lend themselves to model functionality stealing attacks: an adversary can interact with the black-box model towards creating a replica `knock-off' model that exhibits similar test-set performance. As such attacks pose a severe threat to the intellectual property of the model owner, we also work towards effective defenses. Our defense strategy, which introduces bounded and controlled perturbations into predictions, can significantly amplify the error rates of model-stealing attackers. In summary, this thesis advances understanding of privacy leakage when information is shared in raw visual form, during training of models, and at inference time when models are deployed as black boxes. In each of these cases, we further propose techniques to mitigate leakage of information to enable widespread adoption of these techniques in real-world scenarios. Max Planck Institute for Informatics
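    As a minimal sketch of the defense idea described above (bounded, controlled perturbation of black-box predictions), not the authors' exact method, the example below adds bounded noise to a probability vector, keeps the top-1 label unchanged, and renormalizes. The perturbation budget and the rule for preserving the predicted class are illustrative design choices.

```python
# Sketch: perturb a model's output distribution within a bounded budget
# while keeping it a valid probability vector with the same argmax.
import numpy as np

def perturb_predictions(probs: np.ndarray, eps: float = 0.1, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    noise = rng.uniform(-eps, eps, size=probs.shape)
    perturbed = np.clip(probs + noise, 1e-6, None)
    perturbed /= perturbed.sum()
    if perturbed.argmax() != probs.argmax():      # never change the predicted class
        return probs
    return perturbed

probs = np.array([0.70, 0.20, 0.10])
print(perturb_predictions(probs))   # a nearby, still-valid distribution
```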

    UNC Law Library's Redaction of its Digitized Collection of North Carolina Supreme Court Briefs: A Case Study

    This study evaluates the digital redaction process as undertaken by the University of North Carolina Kathrine R. Everett Law Library as part of digitizing their collection of North Carolina Supreme Court briefs. New privacy concerns are raised by digitizing court documents and making them available online. Libraries have an interest in digitizing their print collections of court documents for public access on the Internet, but have received no clear guidance on how to proceed in the face of legal concerns. The purpose of this research is to inform libraries of the legal, ethical, and practical situation surrounding redaction of digitized court documents faced by the UNC Law Library under North Carolina law and to provide libraries with an example of redaction procedures. This research was accomplished through a case study of the UNC Law Library, composed of interviews with the law librarians and observation of the redaction process. Master of Science in Library Science

    Diverse Community Data for Benchmarking Data Privacy Algorithms

    The Diverse Communities Data Excerpts are the core of a National Institute of Standards and Technology (NIST) program to strengthen understanding of tabular data deidentification technologies such as synthetic data. Synthetic data is an ambitious attempt to democratize the benefits of big data; it uses generative models to recreate sensitive personal data with new records for public release. However, it is vulnerable to the same bias and privacy issues that impact other machine learning applications, and can even amplify those issues. When deidentified data distributions introduce bias or artifacts, or leak sensitive information, they propagate these problems to downstream applications. Furthermore, real-world survey conditions such as diverse subpopulations, heterogeneous non-ordinal data spaces, and complex dependencies between features pose specific challenges for synthetic data algorithms. These observations motivate the need for real, diverse, and complex benchmark data to support a robust understanding of algorithm behavior. This paper introduces four contributions: new theoretical work on the relationship between diverse populations and challenges for equitable deidentification; public benchmark data focused on diverse populations and challenging features curated from the American Community Survey; an open-source suite of evaluation metrology for deidentified datasets; and an archive of evaluation results on a broad collection of deidentification techniques. The initial set of evaluation results demonstrates the suitability of these tools for investigations in this field.
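    As an illustrative metric sketch (not the NIST evaluation suite itself), one simple way to assess a deidentified table is the total variation distance between the marginal distribution of each categorical column in the real and synthetic data. The column values below are hypothetical.

```python
# Sketch: total variation distance between two empirical categorical marginals.
from collections import Counter

def marginal_tvd(real_col, synth_col) -> float:
    real_counts, synth_counts = Counter(real_col), Counter(synth_col)
    categories = set(real_counts) | set(synth_counts)
    n_real, n_synth = len(real_col), len(synth_col)
    return 0.5 * sum(abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
                     for c in categories)

real  = ["employed", "employed", "student", "retired"]
synth = ["employed", "student", "student", "retired"]
print(f"TVD = {marginal_tvd(real, synth):.3f}")   # 0.0 means the marginals match exactly
```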

    Just Because We Can Doesn’t Mean We Should: On Knowing and Protecting Data Produced by the Jewish Consumptives’ Relief Society

    A recent project at the University of Denver Libraries used handwritten text recognition (HTR) software to create transcriptions of records from the Jewish Consumptives’ Relief Society (JCRS), a tuberculosis sanatorium located in Denver, Colorado from 1904 to 1954. Among a great many other potential uses, these type- and hand-written records give insight into the human experience of disease and epidemic, its treatment, its effect on cultures, and of Jewish immigration to and early life in the American West. Our intent is to provide these transcripts as data so the text may be computationally analyzed, pursuant to a larger effort in developing capacity in services and infrastructure to support digital humanities as a library, and to contribute to the emerging HTR ecosystem in archival work. Just because we can, however, doesn’t always mean we should: the realities of publishing large datasets online that contain medical and personal histories of potentially vulnerable people and communities introduce serious ethical considerations. This paper both underscores the value of HTR and frames ethical considerations related to protecting data derived from it. It suggests a terms-of-use intervention perhaps valuable to similar projects, one that balances meeting the research needs of digital scholars with the care and respect of persons, their communities and inheritors, whose lives produced the very data now valuable to those researchers.

    Detecting Important Terms in Source Code for Program Comprehension

    Software Engineering research has become extremely dependent on terms (words in textual data) extracted from source code. Different techniques have been proposed to extract the "most important" terms from code. These terms are typically used as input to research prototypes: the quality of the output of these prototypes will depend on the quality of the term extraction technique. At present no consensus exists about which technique predicts the best terms for code comprehension. We perform a literature review, and propose a unified prediction model based on a Naive Bayes algorithm. We evaluate our model in a field study with professional programmers, as well as a standard 10-fold synthetic study. We found our model predicts the top quartile of the most-important terms with approximately 50% precision and recall, outperforming other popular techniques. We found the predictions from our model to help programmers to the same degree as the gold set.
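    As a hedged sketch of the general approach (not the paper's actual features, dataset, or model), the example below featurizes terms extracted from source code and trains a Naive Bayes classifier to predict whether each term is "important" for comprehension. The features and labels are toy stand-ins.

```python
# Sketch: predict term importance from simple per-term features with Naive Bayes.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Per-term features: [relative frequency in file, appears in a method name (0/1),
#                     appears in a comment (0/1)]  -- illustrative only.
X = np.array([
    [0.12, 1, 1],   # "account"
    [0.02, 0, 0],   # "tmp"
    [0.08, 1, 0],   # "balance"
    [0.01, 0, 1],   # "idx"
])
y = np.array([1, 0, 1, 0])   # 1 = judged important by programmers

model = GaussianNB().fit(X, y)
new_terms = np.array([[0.09, 1, 1], [0.015, 0, 0]])
print(model.predict_proba(new_terms)[:, 1])   # probability each new term is important
```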