43 research outputs found

    Securely measuring the overlap between private datasets with cryptosets

    Many scientific questions are best approached by combining data, collected by different groups or across large collaborative networks, into a single analysis. Unfortunately, some of the most interesting and powerful datasets (such as health records, genetic data, and drug discovery data) cannot be freely shared because they contain sensitive information. In many situations, knowing whether private datasets overlap determines whether it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which cannot be broken even with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

    Open is as Open Does: Lessons from Running a “Professional Open Source” Company

    In this presentation, Dr. Leon Rozenblit, Founder and CEO of Prometheus Research, describes the lessons learned from running a professional open source company. He covers the business models, core technologies, architectures, and open-source licensing decisions made over the 15 years Prometheus has been in business. Find out more about RexDB at http://www.rexdb.org, or download the source code at http://www.bitbucket.org/rexdb.

    Developing a Suite of Electronic Data Capture Applications Based on an Open-Source Instrument Definition Standard

    At present, multiple research groups are configuring identical or very similar forms for use in different electronic data capture (EDC) systems, resulting in wasted time, inconsistency across projects, and inefficiency. Ideally, these groups should be able to download form configuration tools from an open library of instrument definitions and reuse them in any common EDC application. This problem and vision for the future led us to develop an open-source, portable research instrument standard for mental health (PRISMH). We tested PRISMH with two of our own EDC applications and had positive results. We are now developing translators to/from REDCap and to Qualtrics. If you want to develop translators to/from your favorite EDC app, join us! An open-source, revision-controlled instrument-definition library, based on a portable open standard, will save resources and enable better data sharing and interoperability across research programs and institutions.

    Improving Research Efficiency, Data Quality, and Data Utility through Integrated Data Management

    Multidimensional data integration and data reuse are emerging challenges in psychological research. We critique several common but inadequate practices and introduce an integrated data management framework, in which data are centralized, cleaned up front, and made available via a query interface for maximum efficiency, quality, and reusability.

    Cryptosets stably estimate the overlap proportion between private datasets, no matter the dataset size, and with accuracy tunable by length.

    Each column of figures corresponds to a different number of public IDs: 500, 1000 and 2000. The first row shows the results of an empirical study, demonstrating that the error (the spread of each data series) is stable across all dataset sizes. The second row shows the analytically derived 95% confidence intervals (Equation 9, http://www.plosone.org/article/info:doi/10.1371/journal.pone.0117898#pone.0117898.e012), which closely match the distribution of empirical estimates and are stable across all dataset sizes. Also evident in these figures is that estimate accuracy is tuned by the length (the number of possible public IDs) of the cryptosets.

    Cryptosets can measure the overlap between chemical collections.

    In this figure we compare two molecular libraries, which have about 5000 scaffolds in common. The libraries' scaffolds are nearly evenly distributed across public IDs, but a subtle, statistically significant correlation demonstrates that they overlap. The estimated overlap is quite good. Moreover, the privacy of the libraries is maintained. Within each public ID bin (representative examples shown for one bin), there are scaffolds unique to each library as well as scaffolds common to both, and there is no way to determine which are which from the cryptosets. Sharing overlaps between molecular libraries could help researchers know when it makes sense to screen a private molecule library with a biological assay.

    Cryptosets are shareable summaries of private data, from which estimates of overlap can be computed.

    They are constructed using a cryptographic hash function to transform private IDs from a dataset into a limited number of public IDs, and then combining these public IDs into a histogram. From this histogram (about 1000 IDs long in practice), the overlap between private datasets can be estimated in a public space. The security of cryptosets relies on the fact that several private IDs map to each public ID. The estimates are based on the Pearson correlation between cryptosets, and can only measure overlap at a predetermined resolution.
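
The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the choice of SHA-256, the reduction of the digest modulo the cryptoset length, and the function names `cryptoset`, `pearson`, and `estimate_overlap` are all hypothetical. The estimator shown (Pearson correlation scaled by the geometric mean of the two dataset sizes) follows from assuming that the hash spreads private IDs uniformly over public IDs; the published method may differ in detail.

```python
import hashlib
import math


def cryptoset(private_ids, length=1000):
    """Build a histogram of public IDs from private IDs (assumed scheme)."""
    hist = [0] * length
    for pid in private_ids:
        # Assumption: SHA-256 digest reduced modulo `length` gives the public ID.
        digest = hashlib.sha256(str(pid).encode("utf-8")).digest()
        public_id = int.from_bytes(digest, "big") % length
        hist[public_id] += 1
    return hist


def pearson(a, b):
    """Pearson correlation between two equal-length histograms."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a flat histogram
    return cov / math.sqrt(var_a * var_b)


def estimate_overlap(a, b):
    """Estimate the number of shared private IDs from two cryptosets.

    Assumption: under uniform hashing, correlation is roughly
    overlap / sqrt(|A| * |B|), so we invert that relationship.
    """
    return pearson(a, b) * math.sqrt(sum(a) * sum(b))


if __name__ == "__main__":
    # Two hypothetical datasets of 10,000 IDs that share 5,000 of them.
    set_a = cryptoset(range(0, 10000))
    set_b = cryptoset(range(5000, 15000))
    print(round(estimate_overlap(set_a, set_b)))
```

Only the two histograms need to be exchanged to compute the estimate. Because many private IDs fall into each public ID bin, a histogram does not reveal which individual IDs a dataset contains, and a longer histogram gives a tighter estimate at the cost of fewer private IDs per bin.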