Search CORE

Springer - Publisher Connector

arXiv.org e-Print Archive

Evaluating Maintainability Prejudices with a Large-Scale Study of Open-Source Projects

Author: B Ray
C Bird
D Spinellis
EJ Weyuker
GA Miller
I Ahmed
I Samoladas
I Samoladas
I Stamelos
K Beck
K Emam El
KM Eisenhardt
RC Martin
S Wagner
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 12/06/2018
Field of study

Exaggeration or context changes can render maintainability experience into prejudice. For example, JavaScript is often seen as least elegant language and hence of lowest maintainability. Such prejudice should not guide decisions without prior empirical validation. We formulated 10 hypotheses about maintainability based on prejudices and test them in a large set of open-source projects (6,897 GitHub repositories, 402 million lines, 5 programming languages). We operationalize maintainability with five static analysis metrics. We found that JavaScript code is not worse than other code, Java code shows higher maintainability than C# code and C code has longer methods than other code. The quality of interface documentation is better in Java code than in other code. Code developed by teams is not of higher and large code bases not of lower maintainability. Projects with high maintainability are not more popular or more often forked. Overall, most hypotheses are not supported by open-source data.Comment: 20 page

Public Library of Science (PLOS)

A Protocol for the Secure Linking of Registries for HPV Surveillance

In order to monitor the effectiveness of HPV vaccination in Canada the linkage of multiple data registries may be required. These registries may not always be managed by the same organization and, furthermore, privacy legislation or practices may restrict any data linkages of records that can actually be done among registries. The objective of this study was to develop a secure protocol for linking data from different registries and to allow on-going monitoring of HPV vaccine effectiveness.A secure linking protocol, using commutative hash functions and secure multi-party computation techniques was developed. This protocol allows for the exact matching of records among registries and the computation of statistics on the linked data while meeting five practical requirements to ensure patient confidentiality and privacy. The statistics considered were: odds ratio and its confidence interval, chi-square test, and relative risk and its confidence interval. Additional statistics on contingency tables, such as other measures of association, can be added using the same principles presented. The computation time performance of this protocol was evaluated.The protocol has acceptable computation time and scales linearly with the size of the data set and the size of the contingency table. The worse case computation time for up to 100,000 patients returned by each query and a 16 cell contingency table is less than 4 hours for basic statistics, and the best case is under 3 hours.A computationally practical protocol for the secure linking of data from multiple registries has been demonstrated in the context of HPV vaccine initiative impact assessment. The basic protocol can be generalized to the surveillance of other conditions, diseases, or vaccination programs

The Francis Crick Institute

De-identifying a public use microdata file from the Canadian national discharge abstract database

Author: A Dale
A de Waal
A Gionis
A Hundepool
A Hundepool
A Machanavajjhala
A Machanavajjhala
A Meyerson
A Narayanan
Agency for Healthcare Research and Quality
B Hore
B Yolles
B-C Chen
BCM Fung
BCM Fung
BCM Fung
C Hogue
C Mackie
C Marsh
C Marsh
C Skinner
C Skinner
Canada Statistics
Canadian Institute for Health Information
Canadian Institute for Health Information
CE Shannon
CE Shannon
CK Liew
D Altman
D Defays
D Defays
D Hutchon
D Lafky
David Paton
DB Rubin
Department of Health and Human Services
Department of Health and Human Services
E Boyko
Federal Court (Canada)
Fida Dankar
G Aggarwal
G Duncan
G Loukides
G Sande
G Sullivan
G Sullivan
GD Smith
GR Heer
Gunes Koru
H Kargupta
J Castro
J Domingo-Ferrer
J Domingo-Ferrer
J Domingo-Ferrer
J Domingo-Ferrer
J Domingo-Ferrer
J Domingo-Ferrer
J Domingo-Ferrer
J Jimenez
J Schoenman
J Xu
JJ Kim
JP Gouweleeuw
K Abraham
K Benitez
K El Emam
K El Emam
K El Emam
K El Emam
K El Emam
K El Emam
K El Emam
K El Emam
K El Emam
K LeFevre
Khaled El Emam
L Alexander
L Sweeney
L Sweeney
L Sweeney
L Sweeney
L Sweeney
L Willenborg
L Willenborg
LA Alexander
LH Cox
M Barbaro
M Templ
ME Nergiz
National Committee on Vital and Health Statistics
P Doyle
P Kooiman
P Nanopoulos
P Samarati
P Samarati
P Samarati
R Bayardo
R Gopal
RA Dandekar
RA Dandekar
RJ Bayardo
RJA Little
S Fienberg
S Hansell
S Ochoa
Statistics Canada
Statistics Canada
Statistics Canada
T de Waal
T Delamothe
T Hedrick
T Zeller Jr
V Ciriani
V Iyengar
V Torra
V Torra
V Torra
VS Iyengar
W Lowrance
W Winkler
WE Winkler
X Xiao
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Abstract Background The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serve as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records. Methods Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy. Results Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm has less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression. Conclusions The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.</p

Springer - Publisher Connector

Efficient and effective pruning strategies for health data de-identification

Author: B Davey
B Fung
B Malin
BCM Fung
CC Aggarwal
DE Willard
F Kohlmayer
F Prasser
F Prasser
F Prasser
Fabian Prasser
FK Dankar
Florian Kohlmayer
G Poulis
J Domingo-Ferrer
J Goldberger
J Soria-Comas
K Babu
K El Emam
K El Emam
K El Emam
K El Emam
K LeFevre
KE Emam
Klaus A. Kuhn
L Mattner
L Sweeney
LH Cox
M Maass
M Nergiz
N Li
P Bose
P Samarati
P Samarati
R Lautenschläger
RJ Bayardo
V Iyengar
W Xia
Z Wan
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation

Author: A Beimel
A Geissbuhler
AF Karr
AF Karr
AL Potosky
Antonis Michalas
AS Lunde
B Pinkas
BA Malin
BA Stewart
BH Bloom
C Clifton
C Friedman
C Quantin
D Vatsalan
EA Durham
G Cormode
G Hripcsak
GM Weber
GM Weber
GM Weber
IS Kohane
J Gichoya
J Vaidya
JF Ludvigsson
JH Holmes
JL Warren
Johan Gustav Bellika
JT Finnell
K Emam El
K Emam El
K Emam El
K Emam El
Kassaye Yitbarek Yigzaw
L Fan
L Lenert
LH Curtis
M Kantarcioglu
MA Hailemichael
MA Hernández
MK Ross
O Goldreich
P Christen
P Paillier
P Saint-Andre
R Cramer
R Lazarus
R Lazarus
R Schnell
RL Richesson
S Tarkoma
SC Pohlig
SM Randall
T Dimitriou
W Du
W Du
WB Lober
Y Lindell
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

Background Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. Methods We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. Results The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N − 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. Conclusions The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians

Munin - Open Research Archive

WestminsterResearch

NORA - Norwegian Open Research Archives

The re-identification risk of Canadians from longitudinal demographics

Abstract Background The public is less willing to allow their personal health information to be disclosed for research purposes if they do not trust researchers and how researchers manage their data. However, the public is more comfortable with their data being used for research if the risk of re-identification is low. There are few studies on the risk of re-identification of Canadians from their basic demographics, and no studies on their risk from their longitudinal data. Our objective was to estimate the risk of re-identification from the basic cross-sectional and longitudinal demographics of Canadians. Methods Uniqueness is a common measure of re-identification risk. Demographic data on a 25% random sample of the population of Montreal were analyzed to estimate population uniqueness on postal code, date of birth, and gender as well as their generalizations, for periods ranging from 1 year to 11 years. Results Almost 98% of the population was unique on full postal code, date of birth and gender: these three variables are effectively a unique identifier for Montrealers. Uniqueness increased for longitudinal data. Considerable generalization was required to reach acceptably low uniqueness levels, especially for longitudinal data. Detailed guidelines and disclosure policies on how to ensure that the re-identification risk is low are provided. Conclusions A large percentage of Montreal residents are unique on basic demographics. For non-longitudinal data sets, the three character postal code, gender, and month/year of birth represent sufficiently low re-identification risk. Data custodians need to generalize their demographic information further for longitudinal data sets.</p

Springer - Publisher Connector

A Systematic Review of Re-Identification Attacks on Health Data

Privacy legislation in most jurisdictions allows the disclosure of health data for secondary purposes without patient consent if it is de-identified. Some recent articles in the medical, legal, and computer science literature have argued that de-identification methods do not provide sufficient protection because they are easy to reverse. Should this be the case, it would have significant and important implications on how health information is disclosed, including: (a) potentially limiting its availability for secondary purposes such as research, and (b) resulting in more identifiable health information being disclosed. Our objectives in this systematic review were to: (a) characterize known re-identification attacks on health data and contrast that to re-identification attacks on other kinds of data, (b) compute the overall proportion of records that have been correctly re-identified in these attacks, and (c) assess whether these demonstrate weaknesses in current de-identification methods.Searches were conducted in IEEE Xplore, ACM Digital Library, and PubMed. After screening, fourteen eligible articles representing distinct attacks were identified. On average, approximately a quarter of the records were re-identified across all studies (0.26 with 95% CI 0.046-0.478) and 0.34 for attacks on health data (95% CI 0-0.744). There was considerable uncertainty around the proportions as evidenced by the wide confidence intervals, and the mean proportion of records re-identified was sensitive to unpublished studies. Two of fourteen attacks were performed with data that was de-identified using existing standards. Only one of these attacks was on health data, which resulted in a success rate of 0.00013.The current evidence shows a high re-identification rate but is dominated by small-scale studies on data that was not de-identified according to existing standards. This evidence is insufficient to draw conclusions about the efficacy of de-identification methods

arXiv.org e-Print Archive

Routes for breaching and protecting genetic privacy

Author: A Acquisti
A Cavoukian
A Kong
A Machanavajjhala
A Narayanan
AD Johnson
AJ Pakstis
AK Manning
AL McGuire
Arvind Narayanan
B Fons
B Malin
B Malin
BA Malin
BM Henn
C Dwork
C Shannon
CD Huff
D Clayton
D He
D Zubakov
DJ Solve
DR Nyholt
DW Craig
EA Zerhouni
EE Schadt
EM Ramos
F Liu
G Church
H Lango Allen
H Li
HK Im
HS Venter
J Burn
J Gitschier
J Kaiser
J Kaye
J Kaye
J Lee
J Marchini
JE Lunshof
JH Park
JM Oliver
JP Roberts
K Benitez
K El Emam
K El Emam
K Silventoinen
KA Tryka
KB Jacobs
KS Kendler
L Kamm
L Sweeney
L Sweeney
LA Sweeney
LA Sweeney
LAP Kohn
LL Rodriguez
M Canim
M Gymrek
M Gymrek
M Kantarcioglu
M Kayser
MD Mailman
N Chatterjee
N Homer
NN Taleb
P Bohannon
P Kwok
P Ohm
P Paillier
PM Visscher
R Braun
R Drmanac
R Khan
R Noumeir
RL Bennett
S Byers
S McClure
S Sankararaman
S Walsh
SE Brenner
SF Terry
SH Friend
T Lumley
TE King
TE King
V Bafna
W Fu
W Hartzog
WG Hill
WW Lowrance
XL Ou
Yaniv Erlich
Z Lin
Publication venue
Publication date: 01/12/2013
Field of study

We are entering the era of ubiquitous genetic information for research, clinical care, and personal curiosity. Sharing these datasets is vital for rapid progress in understanding the genetic basis of human diseases. However, one growing concern is the ability to protect the genetic privacy of the data originators. Here, we technically map threats to genetic privacy and discuss potential mitigation strategies for privacy-preserving dissemination of genetic data.Comment: Draft for comment

Princeton University Open Access Repository