1,543 research outputs found

    Big Data Privacy Context: Literature Effects On Secure Informational Assets

    Get PDF
    This article's objective is to identify research opportunities in the current big data privacy domain by evaluating literature effects on secure informational assets. Until now, no study has analyzed this relation. Its results can foster science, technology, and business. To achieve these objectives, a big data privacy Systematic Literature Review (SLR) is performed on the main scientific peer-reviewed journals in the Scopus database. Bibliometrics and text mining analysis complement the SLR. This study supports big data privacy researchers on: the most and least researched themes, research novelty, most cited works and authors, the evolution of themes over time, and more. In addition, TOPSIS and VIKOR ranks were developed to evaluate literature effects against informational asset indicators. Secure Internet Servers (SIS) was chosen as the decision criterion. Results show that big data privacy literature is strongly focused on computational aspects. However, individuals, societies, organizations, and governments face a technological change that has only just started to be investigated, with growing concerns about law and regulation. TOPSIS and VIKOR ranks differed in several positions, and the only country consistent between literature and SIS adoption is the United States. Countries in the lowest ranking positions represent future research opportunities.
    Comment: 21 pages, 9 figures
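
    The abstract does not spell out the ranking step, but TOPSIS itself is a standard multi-criteria decision method. Below is a minimal Python sketch of how literature indicators could be ranked against an informational-asset criterion such as SIS adoption; the country scores, equal weights, and benefit-criteria choices are illustrative assumptions, not the article's data.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives with TOPSIS (closer to 1 = closer to the ideal).

    matrix  : (n_alternatives, n_criteria) raw scores
    weights : criterion weights summing to 1
    benefit : True for criteria to maximize, False for cost criteria
    """
    X = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    benefit = np.asarray(benefit, dtype=bool)

    # 1. Vector-normalize each criterion column, then weight it.
    V = w * X / np.linalg.norm(X, axis=0)

    # 2. Ideal-best and ideal-worst reference points per criterion.
    best = np.where(benefit, V.max(axis=0), V.min(axis=0))
    worst = np.where(benefit, V.min(axis=0), V.max(axis=0))

    # 3. Euclidean distances to both reference points.
    d_best = np.linalg.norm(V - best, axis=1)
    d_worst = np.linalg.norm(V - worst, axis=1)

    # 4. Relative closeness to the ideal solution.
    return d_worst / (d_best + d_worst)

# Hypothetical example: three countries scored on publication count and
# secure-internet-server adoption (both treated as benefit criteria).
scores = [[120, 45.0],
          [300, 30.0],
          [ 80, 60.0]]
closeness = topsis(scores, weights=[0.5, 0.5], benefit=[True, True])
print(np.argsort(-closeness))  # country indices, best-ranked first
```

    VIKOR would start from the same decision matrix but aggregate distances differently, which is one reason the two rankings can disagree in several positions, as the article reports.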

    Injecting Uncertainty in Graphs for Identity Obfuscation

    Full text link
    Data collected nowadays by social-networking applications create fascinating opportunities for building novel services, as well as for expanding our understanding of social structures and their dynamics. Unfortunately, publishing social-network graphs is considered an ill-advised practice due to privacy concerns. To alleviate this problem, several anonymization methods have been proposed, aiming at reducing the risk of a privacy breach on the published data while still allowing it to be analyzed and relevant conclusions to be drawn. In this paper we introduce a new anonymization approach that is based on injecting uncertainty into social graphs and publishing the resulting uncertain graphs. While existing approaches obfuscate graph data by adding or removing edges entirely, we propose a finer-grained perturbation that adds or removes edges partially: this way we can achieve the same desired level of obfuscation with smaller changes in the data, thus maintaining higher utility. Our experiments on real-world networks confirm that, at the same level of identity obfuscation, our method provides higher usefulness than existing randomized methods that publish standard graphs.
    Comment: VLDB201
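
    The key idea is to replace hard edge additions and removals with partial ones by publishing an uncertain graph in which every edge carries an existence probability. The sketch below only illustrates that core idea; the uniform noise model, parameter names, and non-edge sampling rate are assumptions made for the example, not the paper's calibrated obfuscation scheme.

```python
import random
import networkx as nx

def inject_uncertainty(graph, sigma=0.2, non_edge_fraction=0.05, seed=None):
    """Return an uncertain graph: edges are perturbed partially by assigning
    existence probabilities instead of being added or removed outright."""
    rng = random.Random(seed)
    uncertain = nx.Graph()
    uncertain.add_nodes_from(graph.nodes())

    # Existing edges keep a high, partially reduced probability.
    for u, v in graph.edges():
        uncertain.add_edge(u, v, p=1.0 - rng.uniform(0.0, sigma))

    # A small sample of non-edges receives a small non-zero probability.
    nodes = list(graph.nodes())
    remaining = int(non_edge_fraction * graph.number_of_edges())
    while remaining > 0:
        u, v = rng.sample(nodes, 2)
        if not graph.has_edge(u, v) and not uncertain.has_edge(u, v):
            uncertain.add_edge(u, v, p=rng.uniform(0.0, sigma))
            remaining -= 1
    return uncertain

# Usage: obfuscated = inject_uncertainty(nx.karate_club_graph(), seed=42)
```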

    Microdata Disclosure by Resampling: Empirical Findings for Business Survey Data

    Get PDF
    A problem statistical offices and research institutes face when releasing microdata is the preservation of confidentiality. Traditional methods to avoid disclosure often destroy the structure of the data, i.e., information loss is potentially high. In this paper I discuss an alternative technique for creating scientific-use files, which reproduces the characteristics of the original data quite well. It is based on Fienberg's (1997 and 1994) [5], [6] idea to estimate and resample from the empirical multivariate cumulative distribution function of the data in order to get synthetic data. The procedure creates datasets - the resample - which have the same characteristics as the original survey data. In this paper I present some applications of this method with (a) simulated data and (b) innovation survey data, the Mannheim Innovation Panel (MIP), and compare resampling with a common method of disclosure control, i.e. disturbance with multiplicative error, regarding confidentiality on the one hand and the appropriateness of the disturbed data for different kinds of analyses on the other. The results show that univariate distributions can be better reproduced by unweighted resampling. Parameter estimates can be reproduced quite well if (a) the resampling procedure implements the correlation structure of the original data as a scale and (b) the data is multiplicatively perturbed and a correction term is used. On average, anonymized data with multiplicatively perturbed values better protect against re-identification than the various resampling methods used.
    Keywords: resampling, multiplicative data perturbation, Monte Carlo studies, business survey data
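
    The comparison in the abstract is between resampling-based synthetic data and multiplicative-noise masking. A minimal Python sketch of both routes follows; the lognormal noise, parameter values, and toy data are assumptions made for illustration, and the paper's resampling additionally works from the estimated multivariate cumulative distribution function and uses a correction term, both omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def multiplicative_perturbation(x, sigma=0.1):
    """Mask values by multiplying with noise whose expectation is 1."""
    noise = rng.lognormal(mean=-0.5 * sigma**2, sigma=sigma, size=x.shape)
    return x * noise

def unweighted_resample(data, n=None):
    """Draw a synthetic dataset by resampling whole records with replacement
    from the empirical joint distribution of the original data."""
    n = len(data) if n is None else n
    return data[rng.integers(0, len(data), size=n)]

# Toy example standing in for firm-level survey variables.
original = rng.lognormal(mean=10.0, sigma=1.0, size=(1000, 2))
masked = multiplicative_perturbation(original)
synthetic = unweighted_resample(original)
print(original.mean(axis=0), masked.mean(axis=0), synthetic.mean(axis=0))
```

    Note that unweighted resampling with replacement releases copies of actual original records, which is consistent with the abstract's finding that multiplicative perturbation can offer better protection against re-identification.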

    Distribution-Agnostic Database De-Anonymization Under Synchronization Errors

    Full text link
    There has recently been increased scientific interest in the de-anonymization of users in anonymized databases containing user-level microdata via multifarious matching strategies that utilize publicly available correlated data. Existing literature has emphasized either practical aspects, where the underlying data distribution is not required but there are limited or no theoretical guarantees, or theoretical aspects, under the assumption that the underlying distributions are fully available. In this work, we take a step towards reconciling these two lines of work by providing theoretical guarantees for the de-anonymization of random correlated databases without prior knowledge of the data distribution. Motivated by time-indexed microdata, we consider database de-anonymization under both synchronization errors (column repetitions) and obfuscation (noise). By modifying the previously used replica detection algorithm to accommodate the unknown underlying distribution, proposing a new seeded deletion detection algorithm, and employing statistical and information-theoretic tools, we derive sufficient conditions on the database growth rate for successful matching. Our findings demonstrate that a double-logarithmic seed size relative to the row size ensures successful deletion detection. More importantly, we show that the derived sufficient conditions are the same as in the distribution-aware setting, negating any asymptotic loss of performance due to unknown underlying distributions.
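
    The abstract describes the pipeline only at a high level. The toy sketch below illustrates the role a small seed of known row correspondences can play in detecting column repetitions and deletions; the greedy agreement heuristic, alphabet size, and noise rate are assumptions made for the example and are not the paper's algorithm or guarantees.

```python
import numpy as np

rng = np.random.default_rng(1)

def seeded_column_alignment(original, anonymized, seed_pairs):
    """For each anonymized column, guess the original column it came from
    using only seed rows with known correspondence (a stand-in for seeded
    repetition/deletion detection)."""
    orig_rows = original[[i for i, _ in seed_pairs]]
    anon_rows = anonymized[[j for _, j in seed_pairs]]
    mapping = {}
    for col in range(anon_rows.shape[1]):
        # Agreement rate with every original column, over the seeds only.
        agreement = (orig_rows == anon_rows[:, [col]]).mean(axis=0)
        mapping[col] = int(np.argmax(agreement))
    return mapping

# Toy databases: column 2 deleted, column 0 repeated, 5% of entries rerandomized.
original = rng.integers(0, 4, size=(200, 4))
anonymized = original[:, [0, 0, 1, 3]].copy()
noisy = rng.random(anonymized.shape) < 0.05
anonymized[noisy] = rng.integers(0, 4, size=noisy.sum())
seeds = [(i, i) for i in range(20)]  # known row correspondences (the seed)
print(seeded_column_alignment(original, anonymized, seeds))
```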

    Regulating Data as Property: A New Construct for Moving Forward

    Get PDF
    The global community urgently needs precise, clear rules that define ownership of data and express the attendant rights to license, transfer, use, modify, and destroy digital information assets. In response, this article proposes a new approach for regulating data as an entirely new class of property. Recently, European and Asian public officials and industries have called for data ownership principles to be developed, above and beyond current privacy and data protection laws. In addition, official policy guidance and legal proposals have been published that offer to accelerate realization of a property rights structure for digital information. But how can ownership of digital information be achieved? How can those rights be transferred and enforced? Those calls for data ownership emphasize the impact of ownership on the automotive industry and the vast quantities of operational data which smart automobiles and self-driving vehicles will produce. We looked at how, if at all, the issue was being considered in consumer-facing statements addressing the data collected by their vehicles. To formulate our proposal, we also considered continued advances in scientific research, quantum mechanics, and quantum computing, which confirm that information in any digital or electronic medium is, and always has been, physical, tangible matter. Yet, to date, data regulation has sought to adapt legal constructs for “intangible” intellectual property or to express a series of permissions and constraints tied to specific classifications of data (such as personally identifiable information). We examined legal reforms recently approved by the United Nations Commission on International Trade Law to enable transactions involving electronic transferable records, as well as prior reforms adopted in the United States Uniform Commercial Code and federal law to enable similar transactions involving digital records that were, historically, physical assets (such as promissory notes or chattel paper). Finally, we surveyed prior academic scholarship in the U.S. and Europe to determine whether the physical attributes of digital data had been previously considered in the vigorous debates on how to regulate personal information, or the extent, if at all, to which the solutions developed for transferable records had been considered for larger classes of digital assets. Based on the preceding, we propose that regulation of digital information assets, and clear concepts of ownership, can be built on existing legal constructs that have enabled electronic commercial practices. We propose a property rules construct under which a right to own digital information clearly arises upon creation (whether by keystroke or machine), and suggest when and how that right attaches to specific data through the exercise of technological controls. This construct will enable faster, better adaptations of new rules for the ever-evolving portfolio of data assets being created around the world. This approach will also create more predictable, scalable, and extensible mechanisms for regulating data and is consistent with, and may improve the exercise and enforcement of, rights regarding personal information. We conclude by highlighting existing technologies and their potential to support this construct, and begin an inventory of the steps necessary to proceed further with this process.
