3,368 research outputs found

    A Proposal of a Privacy-preserving Questionnaire by Non-deterministic Information and Its Analysis

    We focus on questionnaires consisting of three-choice or multiple-choice questions, and propose a privacy-preserving questionnaire based on non-deterministic information. Each respondent usually answers one choice from the multiple choices, and each answer is stored as a tuple in a table data set. The organizer of the questionnaire analyzes this table data set and obtains rules and tendencies. If the table data set contains personal information, the organizer needs to employ analytical procedures with privacy-preserving functionality. In this paper, we propose a new framework in which each respondent intentionally answers with non-deterministic information instead of deterministic information. For example, a respondent answers ‘either A, B, or C’ instead of the actual choice A, intentionally diluting the choice. This is similar in concept to k-anonymity. Non-deterministic information is desirable for preserving each respondent's information. We follow the framework of Rough Non-deterministic Information Analysis (RNIA) and apply RNIA to the privacy-preserving questionnaire by non-deterministic information. In current data mining algorithms, tuples with non-deterministic information may be removed during the data cleaning process. However, RNIA can handle such tuples as well as tuples with deterministic information. By using RNIA, we can consider new types of privacy-preserving questionnaires. (2016 IEEE International Conference on Big Data, December 5-8, 2016, Washington DC, US)
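
    To make the diluted-answer idea concrete, here is a minimal Python sketch (our illustration, not the paper's RNIA implementation) of the certain/possible support interval that a rough-set style analysis can extract from non-deterministic answers; the table of answers is invented for the example.

```python
# Minimal sketch (not from the paper): counting "certain" and "possible"
# support for a choice when respondents may give non-deterministic answers.
# Each answer is a set of admissible choices, e.g. {"A"} (definite) or
# {"A", "B", "C"} (intentionally diluted).

answers = [
    {"A"},            # definite answer
    {"A", "B"},       # diluted: either A or B
    {"B"},
    {"A", "B", "C"},  # fully diluted
    {"C"},
]

def certain_support(choice, answers):
    """Tuples that definitely selected `choice` (singleton answers only)."""
    return sum(1 for a in answers if a == {choice})

def possible_support(choice, answers):
    """Tuples that may have selected `choice` (choice is admissible)."""
    return sum(1 for a in answers if choice in a)

for c in ("A", "B", "C"):
    print(c, certain_support(c, answers), possible_support(c, answers))
# The [certain, possible] interval mirrors the lower/upper bounds that
# RNIA-style analysis derives from non-deterministic tables.
```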

    Lessons Learned: Surveying the Practicality of Differential Privacy in the Industry

    Since its introduction in 2006, differential privacy has emerged as a predominant statistical tool for quantifying data privacy in academic works. Yet despite the plethora of research and open-source utilities that have accompanied its rise, with limited exceptions, differential privacy has failed to achieve widespread adoption in the enterprise domain. Our study aims to shed light on the fundamental causes underlying this academic-industrial utilization gap through detailed interviews with 24 privacy practitioners across 9 major companies. We analyze the results of our survey to provide key findings and suggestions for companies striving to improve privacy protection in their data workflows, and highlight the necessary but missing requirements of existing differential privacy tools, with the goal of guiding researchers working towards the broader adoption of differential privacy. Our findings indicate that analysts suffer from lengthy bureaucratic processes for requesting access to sensitive data, yet once access is granted, only scarcely enforced privacy policies stand between rogue practitioners and misuse of private information. We thus argue that differential privacy can significantly improve the processes of requesting and conducting data exploration across silos, and conclude that with a few of the improvements suggested herein, the practical use of differential privacy across the enterprise is within striking distance.

    On Two Apriori-Based Rule Generators: Apriori in Prolog and Apriori in SQL

    This paper focuses on two Apriori-based rule generators. The first is implemented in Prolog and C, and the second in SQL; they are named Apriori in Prolog and Apriori in SQL, respectively. Each rule generator is based on the Apriori algorithm, but each has its own properties. Apriori in Prolog employs the equivalence classes defined by table data sets and follows the framework of rough sets. On the other hand, Apriori in SQL generates rules by direct search and does not make use of equivalence classes. This paper clarifies the properties of these two rule generators and considers effective applications of each to existing data sets.
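
    As background for what either generator computes, the sketch below gives a generic level-wise Apriori pass in Python: frequent itemsets are found first, then rules are filtered by confidence. It is illustrative only, with invented transactions and thresholds, and does not correspond to the Prolog/equivalence-class or SQL implementations studied in the paper.

```python
# Minimal Apriori sketch (illustrative only; not the Prolog or SQL
# implementations discussed in the paper).
from itertools import combinations

transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]
min_support, min_confidence = 0.5, 0.6

def support(itemset):
    """Fraction of transactions containing all items of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise frequent-itemset generation.
items = sorted({i for t in transactions for i in t})
frequent = []
level = [frozenset([i]) for i in items if support({i}) >= min_support]
while level:
    frequent.extend(level)
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]

# Rule generation: X -> Y with confidence = support(X U Y) / support(X).
for itemset in (f for f in frequent if len(f) > 1):
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = support(itemset) / support(lhs)
            if conf >= min_confidence:
                print(set(lhs), "->", set(itemset - lhs), round(conf, 2))
```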

    A Multi-site Resting State fMRI Study on the Amplitude of Low Frequency Fluctuations in Schizophrenia

    Background: This multi-site study compares resting state fMRI amplitude of low frequency fluctuations (ALFF) and fractional ALFF (fALFF) between patients with schizophrenia (SZ) and healthy controls (HC). Methods: Eyes-closed resting fMRI scans (5:38 min; n = 306, 146 SZ) were collected from 6 Siemens 3T scanners and one GE 3T scanner. Imaging data were pre-processed using an SPM pipeline. Power in the low frequency band (0.01–0.08 Hz) was calculated both for the original pre-processed data as well as for the pre-processed data after regressing out the six rigid-body motion parameters, mean white matter (WM) and cerebral spinal fluid (CSF) signals. Both original and regressed ALFF and fALFF measures were modeled with site, diagnosis, age, and diagnosis × age interactions. Results: Regressing out motion and non-gray matter signals significantly decreased fALFF throughout the brain as well as ALFF at the cortical edge, but significantly increased ALFF in subcortical regions. Regression had little effect on site, age, and diagnosis effects on ALFF, other than to reduce diagnosis effects in subcortical regions. There were significant effects of site across the brain in all the analyses, largely due to vendor differences. HC showed greater ALFF in the occipital, posterior parietal, and superior temporal lobe, while SZ showed smaller clusters of greater ALFF in the frontal and temporal/insular regions as well as in the caudate, putamen, and hippocampus. HC showed greater fALFF compared with SZ in all regions, though subcortical differences were only significant for original fALFF. Conclusions: SZ show greater eyes-closed resting state low frequency power in frontal cortex, and less power in posterior lobes than do HC; fALFF, however, is lower in SZ than HC throughout the cortex. These effects are robust to multi-site variability. Regressing out physiological noise signals significantly affects both total and fALFF measures, but does not affect the pattern of case/control differences.
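
    For readers unfamiliar with the measures, the following Python sketch (a simplification under an assumed TR of 2 s, not the study's SPM pipeline) shows how ALFF and fALFF are typically computed for a single voxel time series: the amplitude spectrum is summed in the 0.01–0.08 Hz band, and fALFF normalizes that sum by the amplitude across the whole spectrum.

```python
# Illustrative sketch of ALFF/fALFF for one voxel time series (our own
# simplification, not the study's pipeline). The band follows the abstract
# (0.01-0.08 Hz); the TR of 2 s is an assumption.
import numpy as np

def alff_falff(ts, tr=2.0, band=(0.01, 0.08)):
    ts = ts - ts.mean()                      # remove the DC component
    freqs = np.fft.rfftfreq(len(ts), d=tr)   # frequency axis in Hz
    amp = np.abs(np.fft.rfft(ts)) / len(ts)  # amplitude spectrum
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    alff = amp[in_band].sum()                # low-frequency amplitude
    falff = alff / amp[1:].sum()             # fraction of total amplitude
    return alff, falff

# Example: 169 volumes at TR = 2 s is roughly a 5:38 min scan.
rng = np.random.default_rng(0)
print(alff_falff(rng.standard_normal(169)))
```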

    Sharing Privacy-sensitive Access to Neuroimaging and Genetics Data: A Review and Preliminary Validation

    The growth of data sharing initiatives for neuroimaging and genomics represents an exciting opportunity to confront the “small N” problem that plagues contemporary neuroimaging studies while further understanding the role genetic markers play in the function of the brain. Where it is possible, open data sharing provides the most benefits. However, some data cannot be shared at all due to privacy concerns and/or risk of re-identification. Sharing other data sets is hampered by the proliferation of complex data use agreements (DUAs), which preclude truly automated data mining. These DUAs arise because of concerns about the privacy and confidentiality of subjects; though many do permit direct access to data, they often require a cumbersome approval process that can take months. An alternative approach is to share only data derivatives such as statistical summaries; the challenge here is to reformulate computational methods to quantify the privacy risks associated with sharing the results of those computations. For example, a derived map of gray matter is often as identifiable as a fingerprint, so alternative approaches to accessing data are needed. This paper reviews the relevant literature on differential privacy, a framework for measuring and tracking privacy loss in these settings, and demonstrates the feasibility of using this framework to calculate statistics on data distributed at many sites while still providing privacy.
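
    The core mechanism such a review builds on can be sketched briefly: each site clips its values, releases only Laplace-noised aggregates, and a coordinator pools them. The bounds, budget split, and data below are hypothetical and are not the authors' implementation.

```python
# Hedged sketch of differentially private, distributed summary statistics:
# each site releases only a noisy sum and count (Laplace mechanism), and the
# coordinator combines them into a pooled mean.
import numpy as np

def dp_sum_count(values, epsilon, lower, upper):
    """Laplace-noised (sum, count) for values clipped to [lower, upper]."""
    clipped = np.clip(values, lower, upper)
    # Split the privacy budget between the two released statistics.
    sens_sum, sens_count = upper - lower, 1.0
    noisy_sum = clipped.sum() + np.random.laplace(scale=sens_sum / (epsilon / 2))
    noisy_count = len(values) + np.random.laplace(scale=sens_count / (epsilon / 2))
    return noisy_sum, noisy_count

# Three sites each hold private measurements (simulated here).
sites = [np.random.normal(0.5, 0.1, size=n) for n in (120, 80, 200)]
releases = [dp_sum_count(v, epsilon=1.0, lower=0.0, upper=1.0) for v in sites]
total_sum = sum(s for s, _ in releases)
total_count = sum(c for _, c in releases)
print("private pooled mean (approx.):", total_sum / total_count)
```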

    Towards Reproducible and Privacy-preserving Analyses Across Federated Repositories for Omics data

    Even when duly anonymized, health research data has the potential to be disclosive and therefore requires special safeguards according to the European General Data Protection Regulation (GDPR). Furthermore, the incorporation of the FAIR principles (Findable, Accessible, Interoperable, Reusable) for a more favorable reuse of existing data calls for an approach where sensitive data is kept locally and only metadata and aggregated results are shared. Additionally, since central pooling is discouraged by ethical, legal, and societal issues, it is increasingly common to see maturing data management frameworks and platforms adopting the federated approach.
    Current implementations of privacy-preserving analysis frameworks seem to be limited when data becomes very large (millions of rows, hundreds of variables). Data from biological samples, collected by high-throughput technologies such as Next Generation Sequencing (NGS), which allows entire genomes to be sequenced, are an example of this kind of data. The term "genomics" refers to the field of science that studies genomes. The Omics technologies aim to produce a systematic identification of all mRNA (transcriptomics), proteins (proteomics), and metabolites (metabolomics), respectively, present in a given biological sample. Omics data are produced by computational workflows known as bioinformatics pipelines. Reproducing these pipelines is hard, and this difficulty is often underestimated. Nevertheless, reproducibility is important for generating trust in scientific results, and it is therefore fundamental to know how Omics data were generated or obtained.
    This work leverages the promising results of current open-source implementations for distributed privacy-preserving analyses, while aiming to generalize the approach and address some of their shortcomings. To enable the privacy-preserving analysis of Omics data, we introduced the "resource" concept, implemented in one of the studied solutions. The results were promising: the privacy-preserving analysis was effective when using the DataSHIELD framework in conjunction with the "resource" R package. We also concluded that the adoption of specialized DataSHIELD packages for Omics analyses is a viable pathway to privacy-preserving analysis of Omics data. To address the reproducibility challenges, we defined a database model to represent the steps, commands, and operations executed by the bioinformatics pipelines. The database model is promising, but to meet all reproducibility requirements, including container support and integration with code-sharing platforms, it is necessary to use other tools, such as Nextflow or Snakemake, which offer dozens of other tested and mature functions.
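
    The federated, aggregate-only pattern the thesis relies on can be illustrated with a minimal sketch (our illustration, not DataSHIELD's actual API): individual-level records never leave a site, only sums and counts are returned, and a site refuses to answer when the result would rest on too few records. The threshold below is an assumed example value.

```python
# Minimal sketch of federated, aggregate-only analysis. Sensitive data stay
# local; only non-disclosive aggregates cross the site boundary.
MIN_CELL_COUNT = 5  # disclosure-control threshold (assumed for illustration)

class Site:
    def __init__(self, values):
        self._values = values            # individual-level data stay local

    def aggregate(self):
        if len(self._values) < MIN_CELL_COUNT:
            raise PermissionError("result would be potentially disclosive")
        return sum(self._values), len(self._values)

def federated_mean(sites):
    # The coordinator only ever sees per-site (sum, count) pairs.
    sums, counts = zip(*(s.aggregate() for s in sites))
    return sum(sums) / sum(counts)

sites = [Site([2.1, 2.4, 2.2, 2.8, 2.5]),
         Site([1.9, 2.0, 2.3, 2.6, 2.2, 2.1])]
print(federated_mean(sites))
```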

    Extending the Exposure Score of Web Browsers by Incorporating CVSS

    When browsing the Internet, HTTP headers enable both clients and servers to send extra data in their requests or responses, such as the User-Agent string. This string contains information related to the sender’s device, browser, and operating system, yet its content differs from one browser to another. Despite the privacy and security risks of User-Agent strings, very few works have tackled this problem. Our previous work proposed giving Internet browsers relative exposure scores to help users choose less intrusive ones. The objective of this work is to extend that work by: first, conducting a user study to identify its limitations; second, extending the exposure score by incorporating data from the NVD; and third, providing a full implementation instead of a limited prototype. The proposed system assigns scores to users’ browsers upon visiting our website, suggests alternative safer browsers, and allows updating the back-end database with the click of a button. We applied our method to a data set of more than 52 thousand unique browsers. Our performance and validation analysis show that our solution is accurate and efficient. The source code and data set are publicly available here [4].
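
    The paper's exact scoring formula is not reproduced in this abstract; the hypothetical sketch below only illustrates one way a header-based exposure measure could be blended with NVD-derived CVSS severity. The weight, the example scores, and the `exposure_score` helper are all assumptions for illustration.

```python
# Hypothetical sketch of folding CVSS data into a browser exposure score.
# Not the paper's formula; values and weights are invented.
def exposure_score(header_exposure, cvss_scores, weight=0.5):
    """Blend header-based exposure (0-1) with NVD-derived CVSS severity (0-10)."""
    if cvss_scores:
        vuln_exposure = max(cvss_scores) / 10.0   # worst known vulnerability
    else:
        vuln_exposure = 0.0
    return weight * header_exposure + (1 - weight) * vuln_exposure

# Example: a User-Agent judged moderately revealing, with two NVD entries.
print(exposure_score(header_exposure=0.6, cvss_scores=[7.5, 5.3]))
```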

    A Survey of Asynchronous Programming Using Coroutines in the Internet of Things and Embedded Systems

    Many Internet of Things and embedded projects are event-driven, and therefore require asynchronous and concurrent programming. Current proposals for C++20 suggest that coroutines will have native language support, so it is timely to survey the current use of coroutines in embedded systems development. This paper investigates existing research which uses or describes coroutines on resource-constrained platforms. The existing research is analysed with regard to: software platform, hardware platform and capacity; use cases and intended benefits; and the application programming interface design used for coroutines. A systematic mapping study was performed to select studies published between 2007 and 2018 which contained original research into the application of coroutines on resource-constrained platforms. An initial set of 566 candidate papers was reduced to only 35 after filters were applied, revealing the following taxonomy. The C and C++ programming languages were used by 22 of the 35 studies. As regards hardware, 16 studies used 8- or 16-bit processors while 13 used 32-bit processors. The four most common use cases were concurrency (17 papers), network communication (15), sensor readings (9) and data flow (7). The leading intended benefits were code style and simplicity (12 papers), scheduling (9) and efficiency (8). A wide variety of techniques have been used to implement coroutines, including native macros, additional tool-chain steps, new language features and non-portable assembly language. We conclude that there is widespread demand for coroutines on resource-constrained devices. Our findings suggest that there is significant demand for a formalised, stable, well-supported implementation of coroutines in C++, designed with consideration of the special needs of resource-constrained devices, and further that such an implementation would bring benefits specific to such devices. (Comment: 22 pages, 8 figures, to be published in ACM Transactions on Embedded Computing Systems (TECS).)
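
    To illustrate the programming model the surveyed studies target, here is a small cooperative-concurrency example written with Python's asyncio coroutines for brevity (embedded implementations would of course use C/C++ and a bare-metal scheduler); it pairs a periodic sensor-reading task with a consumer that forwards readings, matching the survey's most common use cases.

```python
# Coroutine-style cooperative concurrency: two tasks share one thread and
# yield control at well-defined suspension points.
import asyncio
import random

async def read_sensor(queue):
    while True:
        await queue.put(random.random())   # pretend ADC read
        await asyncio.sleep(0.1)           # yield to the scheduler

async def send_readings(queue, n=5):
    for _ in range(n):
        value = await queue.get()          # suspend until data is available
        print(f"sending {value:.3f}")

async def main():
    queue = asyncio.Queue(maxsize=4)
    sensor = asyncio.create_task(read_sensor(queue))
    await send_readings(queue)
    sensor.cancel()                        # stop the infinite producer

asyncio.run(main())
```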