Location Anonymization With Considering Errors and Existence Probability
Mobile devices that can sense their location using GPS or Wi-Fi have become extremely popular. However, many users hesitate to provide their accurate location information to unreliable third parties if doing so means that their identities or sensitive attribute values will be disclosed. Many anonymization approaches, such as k-anonymity, have been proposed to tackle this issue. Existing studies of k-anonymity usually anonymize each user's location so that the anonymized area contains k or more users. These studies, however, consider neither location errors nor the probability that each user is actually present in the anonymized area; as a result, a specific user might still be identified by an untrusted third party. We propose novel privacy and utility metrics that account for location errors and existence probability, together with an efficient algorithm for anonymizing the information associated with users' locations. This is the first work to anonymize locations while considering both location errors and the probability that each user is actually present in the anonymized area. By means of simulations, we show that our proposed method can reduce the risk of a user's attributes being identified while maintaining the utility of the anonymized data.
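The abstract does not give the paper's exact metrics, so the following is only a minimal sketch of the core idea under an assumed isotropic Gaussian error model: grow a user's anonymized rectangle until the expected number of users inside it, summing each user's existence probability, reaches k. All function names and parameters are illustrative.

```python
import math

def existence_prob(loc, rect, sigma):
    """Probability that a user with reported location `loc` and isotropic
    Gaussian error (std `sigma`) actually lies inside rect = (x0, y0, x1, y1).
    Independent x/y errors are assumed, so the probability factorizes."""
    (x, y), (x0, y0, x1, y1) = loc, rect
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return ((cdf((x1 - x) / sigma) - cdf((x0 - x) / sigma)) *
            (cdf((y1 - y) / sigma) - cdf((y0 - y) / sigma)))

def anonymize(target, others, k, sigma, step=10.0):
    """Grow a square around `target` until the expected user count reaches k."""
    half = step
    while True:
        rect = (target[0] - half, target[1] - half,
                target[0] + half, target[1] + half)
        expected = sum(existence_prob(u, rect, sigma) for u in [target] + others)
        if expected >= k:
            return rect, expected
        half += step

others = [(105.0, 98.0), (120.0, 110.0), (300.0, 40.0)]
rect, expected = anonymize((100.0, 100.0), others, k=3, sigma=15.0)
print(rect, round(expected, 2))
```

Under a plain k-anonymity scheme the count would treat every user inside the rectangle as certainly present; weighting by existence probability is what forces the region to grow when errors are large.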
Differential Private Data Collection and Analysis Based on Randomized Multiple Dummies for Untrusted Mobile Crowdsensing
Mobile crowdsensing, which collects environmental information from mobile phone users, is growing in popularity. The collected data can be used by companies for marketing surveys or decision making. However, collecting sensing data may violate the privacy of the users, and the data aggregator and/or the participants of crowdsensing may be untrusted entities. Recent studies have proposed randomized response schemes for anonymized data collection, which allow the sensing data to be analyzed statistically without precise information about individual users' sensing results. However, traditional randomized response schemes and their extensions require a large number of samples to achieve accurate estimation. In this paper, we propose a new anonymized data-collection scheme that can estimate data distributions more accurately. Using simulations with synthetic and real datasets, we show that our proposed method reduces the mean squared error and the JS divergence by more than 85% compared with existing methods.
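For context, here is a minimal sketch of the classic randomized-response pipeline that such schemes build on (the proposed multiple-dummies scheme itself is not reproduced here): each user reports the true category with probability p and a uniformly random other category otherwise, and the server inverts the known perturbation matrix to obtain an unbiased estimate of the true distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
c, p, n = 4, 0.6, 10000                     # categories, truth probability, users
true_dist = np.array([0.1, 0.2, 0.3, 0.4])  # illustrative ground truth

# Perturbation matrix: P[i, j] = Pr(report j | true category i)
P = np.full((c, c), (1 - p) / (c - 1))
np.fill_diagonal(P, p)

truth = rng.choice(c, size=n, p=true_dist)
reports = np.array([rng.choice(c, p=P[t]) for t in truth])

observed = np.bincount(reports, minlength=c) / n
estimate = np.linalg.solve(P.T, observed)   # invert the known channel
print(np.round(estimate, 3))                # close to true_dist for large n
```

The variance of this inversion estimator grows as p approaches 1/c, which is exactly why a small sample size hurts plain randomized response and motivates more sample-efficient schemes.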
Anonymization of Sensitive Quasi-Identifiers for l-diversity and t-closeness
A number of studies on privacy-preserving data mining have been proposed. Most of them assume that quasi-identifiers (QIDs) can be separated from sensitive attributes. For instance, they assume that address, job, and age are QIDs but not sensitive attributes, and that a disease name is a sensitive attribute but not a QID. In practice, however, any of these attributes can act as both a sensitive attribute and a QID. In this paper, we refer to such attributes as sensitive QIDs and propose novel privacy models, namely (l1, ..., lq)-diversity and (t1, ..., tq)-closeness, together with a method that can handle sensitive QIDs. Our method is composed of two algorithms: an anonymization algorithm and a reconstruction algorithm. The anonymization algorithm, conducted by data holders, is simple but effective, whereas the reconstruction algorithm, conducted by data analyzers, can be adapted to each analyzer's objective. Our proposed method was experimentally evaluated using real datasets.
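As a point of reference, here is a minimal sketch of the standard l-diversity check that the per-attribute (l1, ..., lq)-diversity model generalizes: every group of records sharing the same (generalized) QID values must contain at least l distinct sensitive values. The record layout below is invented for illustration.

```python
from collections import defaultdict

def satisfies_l_diversity(records, qid_keys, sensitive_key, l):
    """Group records by their (generalized) QID tuple and require at least
    l distinct sensitive values in every group."""
    groups = defaultdict(set)
    for r in records:
        qid = tuple(r[k] for k in qid_keys)
        groups[qid].add(r[sensitive_key])
    return all(len(values) >= l for values in groups.values())

records = [
    {"age": "30-39", "zip": "153**", "disease": "flu"},
    {"age": "30-39", "zip": "153**", "disease": "cancer"},
    {"age": "40-49", "zip": "154**", "disease": "flu"},
    {"age": "40-49", "zip": "154**", "disease": "flu"},
]
# The second group carries only {"flu"}, so 2-diversity fails here.
print(satisfies_l_diversity(records, ["age", "zip"], "disease", l=2))  # False
```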
Temporal and Spatial Expansion of Urban LOD for Solving Illegally Parked Bicycles in Tokyo
The illegal parking of bicycles is a serious urban problem in Tokyo. The purpose of this study was to sustainably build Linked Open Data (LOD) to help solve the problem of illegally parked bicycles (IPBs) by raising social awareness, in cooperation with the Office for Youth Affairs and Public Safety of the Tokyo Metropolitan Government (Tokyo Bureau). We first extracted information on the problem factors and designed an LOD schema for IPBs. We then collected data from social networking services (SNS) and municipal websites to build the illegally parked bicycle LOD (IPBLOD), comprising more than 200,000 triples. We estimated temporal missing data in the LOD based on causal relations among the problem factors, and estimated spatial missing data based on geospatial features. As a result, the number of IPBs can be inferred with about 70% accuracy, and places where bicycles might be illegally parked can be estimated with about 31% accuracy. We then published the complemented LOD and a web application to visualize the distribution of IPBs in the city. Finally, we applied IPBLOD to a large social activity in order to raise awareness of the IPB issue and to remove IPBs, in cooperation with the Tokyo Bureau.
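A hypothetical sketch of how IPB observations could be stored and queried as LOD with rdflib follows; the actual IPBLOD vocabulary and endpoint are not given in the abstract, so the namespace and predicate names below are invented.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

IPB = Namespace("http://example.org/ipblod#")   # hypothetical vocabulary
g = Graph()

# One illustrative observation triple set.
obs = URIRef("http://example.org/ipblod/obs1")
g.add((obs, RDF.type, IPB.Observation))
g.add((obs, IPB.station, Literal("Nakameguro")))
g.add((obs, IPB.bicycleCount, Literal(42, datatype=XSD.integer)))

# SPARQL query over the graph, as an LOD consumer would issue it.
q = """
PREFIX ipb: <http://example.org/ipblod#>
SELECT ?station ?count WHERE {
  ?o a ipb:Observation ; ipb:station ?station ; ipb:bicycleCount ?count .
}
"""
for row in g.query(q):
    print(row.station, row.count)
```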
Privacy Protection by Anonymizing Based on Status of Provider and Community
When a user receives personal services from a service provider, the service quality can be higher if the user provides more personal information, but the risk of privacy violation also increases. This paper therefore proposes a privacy protection method that avoids unwanted information disclosure by controlling which attributes may be disclosed according to the results of monitoring two elements: the provider's status with respect to the user's background information and the status of the user's community. This control is applied before individual attributes are disclosed in accordance with each user's privacy policy (i.e., the required anonymity level). A system architecture based on this method is also proposed. The validity of the proposed method was confirmed with a desk model.
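A loose sketch of the disclosure-control idea, with all attribute names, scores, and thresholds invented for illustration: an attribute is released only if the monitored provider and community statuses meet the thresholds attached to that attribute in the user's privacy policy.

```python
def disclosable(attributes, policy, provider_status, community_status):
    """Return the subset of attributes whose policy thresholds are met."""
    released = {}
    for name, value in attributes.items():
        # Attributes without an explicit rule default to the strictest policy.
        rule = policy.get(name, {"provider": 1.0, "community": 1.0})
        if provider_status >= rule["provider"] and community_status >= rule["community"]:
            released[name] = value
    return released

attributes = {"age": 34, "address": "Meguro, Tokyo", "hobby": "cycling"}
policy = {
    "age":     {"provider": 0.5, "community": 0.3},
    "address": {"provider": 0.9, "community": 0.8},  # only for trusted providers
    "hobby":   {"provider": 0.2, "community": 0.1},
}
print(disclosable(attributes, policy, provider_status=0.6, community_status=0.5))
# -> {'age': 34, 'hobby': 'cycling'}
```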
Privacy-preserving chi-squared test of independence for small samples
Background: The importance of privacy protection in analyses of personal data, such as genome-wide association studies (GWAS), has grown in recent years. GWAS focuses on identifying single-nucleotide polymorphisms (SNPs) associated with diseases such as cancer and diabetes, and the chi-squared (χ2) hypothesis test of independence can be used for this identification. However, recent studies have shown that publishing the results of χ2 tests on SNPs or personal data can lead to privacy violations. Several studies have proposed anonymization methods for χ2 testing with ε-differential privacy, the de facto privacy metric of the cryptographic community. However, existing methods can only be applied to 2×2 or 2×3 contingency tables; otherwise, their accuracy is low for small numbers of samples. Collecting many highly sensitive samples is difficult in many settings, such as COVID-19 analysis in its early propagation stage. Results: We propose a novel anonymization method, RandChiDist, which anonymizes χ2 testing for small samples. We prove that RandChiDist satisfies differential privacy, and we experimentally evaluate it using synthetic datasets and two real genomic datasets. RandChiDist achieved the fewest Type II errors among existing and baseline methods that can control the Type I error rate. Conclusions: We propose a new differentially private method, RandChiDist, for anonymizing χ2 values of an I×J contingency table with a small number of samples. The experimental results show that RandChiDist outperforms existing methods for small numbers of samples.
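RandChiDist itself is not reproduced here, but the following is a minimal sketch of a standard baseline such methods are compared against: make an I×J contingency table differentially private by adding Laplace noise to each cell (input perturbation), then run the ordinary χ2 test on the noisy table. Under add/remove-one neighboring, one individual changes one cell count by 1, so the table's L1 sensitivity is 1.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

def dp_chi2(table, epsilon):
    """Chi-squared test of independence on a Laplace-perturbed table."""
    noisy = table + rng.laplace(scale=1.0 / epsilon, size=table.shape)
    noisy = np.clip(noisy, 0.1, None)          # keep cell counts positive
    chi2, p, dof, _ = chi2_contingency(noisy)
    return chi2, p

# Illustrative 2x2 table, e.g., SNP genotype vs. disease status.
table = np.array([[20.0, 5.0], [8.0, 17.0]])
print(dp_chi2(table, epsilon=1.0))
```

With few samples, noise of scale 1/ε overwhelms the cell counts, which is the accuracy problem in the small-sample regime that RandChiDist targets.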
Differentially Private Mobile Crowd Sensing Considering Sensing Errors
An increasingly popular class of software known as participatory sensing, or mobile crowdsensing, collects information about people's surroundings via mobile sensing devices. To avoid potential undesired side effects of analyzing such data, such as privacy violations, considerable research over the last decade has aimed at participatory sensing that preserves privacy while analyzing participants' surrounding information. To protect privacy, each participant perturbs the sensed data on his or her own device, and only the perturbed data is reported to the data collector, who then estimates the true data distribution from the reports. As long as the data contain no sensing errors, current methods can estimate the data distribution accurately. However, there has so far been little analysis of data that contain sensing errors. A more precise analysis that maintains the privacy level can only be achieved when a variety of sensing errors are taken into account.
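A minimal sketch of the underlying idea, with illustrative (assumed) matrices: the sensor itself confuses values (an error channel E) before the randomized-response perturbation P is applied, so the collector should invert the composed channel E·P rather than P alone.

```python
import numpy as np

rng = np.random.default_rng(1)
c, n = 3, 20000
true_dist = np.array([0.5, 0.3, 0.2])

E = np.array([[0.9, 0.1, 0.0],       # sensing-error channel (illustrative)
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
p = 0.7                               # randomized-response truth probability
P = np.full((c, c), (1 - p) / (c - 1))
np.fill_diagonal(P, p)

channel = E @ P                       # overall Pr(report j | true value i)
truth = rng.choice(c, size=n, p=true_dist)
reports = np.array([rng.choice(c, p=channel[t]) for t in truth])
observed = np.bincount(reports, minlength=c) / n

naive = np.linalg.solve(P.T, observed)        # ignores sensing errors (biased)
aware = np.linalg.solve(channel.T, observed)  # accounts for them
print(np.round(naive, 3), np.round(aware, 3))
```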
Anonymized Data Collection Satisfying l-Entropy in Ubiquitous Computing
Ubiquitous computing makes it possible to collect sensing data from many users and to use the estimated data distribution for national policy or corporate decision making. However, such data can contain personally identifiable information, creating a risk that users' private information is leaked. The negative survey addresses this problem: every user deliberately reports incorrect information, which protects privacy while still allowing the server to estimate the true data distribution. The traditional negative survey requires data from a large number of users for accurate estimation. Several recently proposed variants can estimate the true distribution from fewer users, but all of them provide only a low level of privacy protection. In this paper, we propose a method that keeps the privacy protection level fixed while obtaining estimates closer to the true distribution than conventional methods. Through mathematical analysis and simulations, we show that our proposed method reduces the mean squared error to between approximately 1/2 and 1/30 of that of recently proposed methods.
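A minimal sketch of the classic negative survey that this work strengthens (the proposed l-entropy variant is not reproduced): every user reports a uniformly random category they do not belong to, and because the reported frequency of category j is (1 − π_j)/(c − 1), the server recovers the true distribution π in closed form.

```python
import numpy as np

rng = np.random.default_rng(2)
c, n = 4, 50000
true_dist = np.array([0.4, 0.3, 0.2, 0.1])

truth = rng.choice(c, size=n, p=true_dist)
# Each user reports a uniformly random category other than their true one.
others = [[j for j in range(c) if j != t] for t in range(c)]
reports = np.array([rng.choice(others[t]) for t in truth])

observed = np.bincount(reports, minlength=c) / n
# Since observed_j = (1 - pi_j) / (c - 1), invert cell by cell:
estimate = 1.0 - (c - 1) * observed
print(np.round(estimate, 3))   # close to true_dist for large n
```

The (c − 1) amplification factor in the inversion is what makes the traditional negative survey sample-hungry: small fluctuations in the observed frequencies become large errors in the estimate.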
A Location Anonymization Method That Considers Location Errors
Mining users' attributes, such as age, income, and hobbies, linked with their movement histories enables marketing and advertisement delivery tailored to user attributes and locations. However, if this information reaches an attacker who knows part of a user's movement history, the linked attributes can be tied to that individual. To prevent such linkage, many anonymization methods based on metrics such as k-anonymity have been proposed. These methods, however, do not consider that location information contains errors, and under error-prone conditions the risk of identifying individuals increases; the utility metrics used for anonymized data likewise ignore such errors. In this paper, assuming the realistic setting in which location information contains errors, we propose a new privacy metric, a utility metric for anonymized data, and an anonymization algorithm based on these metrics. Simulation evaluations show that, compared with conventional methods, our proposed method improves the utility of the anonymized data while reducing the risk of individuals being identified.
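A small illustration of why ignoring location errors is harmful, again under an assumed isotropic Gaussian error model (the paper's exact metrics are not given in the abstract): a region chosen to cover k reported locations can contain far fewer users in expectation once the error grows.

```python
import math

def prob_inside(loc, rect, sigma):
    """Probability that a reported location with Gaussian error (std sigma)
    actually lies inside rect = (x0, y0, x1, y1)."""
    (x, y), (x0, y0, x1, y1) = loc, rect
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return ((cdf((x1 - x) / sigma) - cdf((x0 - x) / sigma)) *
            (cdf((y1 - y) / sigma) - cdf((y0 - y) / sigma)))

reported = [(100, 100), (110, 95), (105, 108)]   # "k = 3" if errors are ignored
rect = (90, 85, 120, 118)                         # covers all reported points
for sigma in (1.0, 10.0, 30.0):
    expected = sum(prob_inside(u, rect, sigma) for u in reported)
    print(sigma, round(expected, 2))              # expected count decays with error
```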
Flexible Anonymized Data Collection Using Randomized Response
Randomized response (RR) schemes realize a privacy-preserving model in which each user probabilistically replaces the category of his or her data with another category and sends the replaced category to a server, which statistically analyzes the collected data and estimates the distribution of the original categories. The replacement is governed by a predetermined probability matrix; changing its values yields different privacy protection levels, and there is a tradeoff between the privacy level and the server's estimation error. Existing studies assume that all users share the same probability matrix, so the privacy level cannot be adjusted per user. In this paper, we propose a model in which users can use different probability matrices. No established method exists for estimating the category distribution when heterogeneous matrices are used, so we quantitatively analyze the estimation error and propose a maximum-likelihood method for estimating the most probable distribution. Through mathematical analysis and simulations with real data, we show that our proposed method reduces the estimation error by approximately 70% compared with the conventional approach.
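The abstract describes the estimator only as a maximum-likelihood estimate, so the concrete EM formulation below is an assumption: each report is treated as drawn from a mixture of known per-user channels, and the category distribution maximizing the likelihood is found iteratively.

```python
import numpy as np

rng = np.random.default_rng(3)
c, n = 3, 30000
true_dist = np.array([0.5, 0.3, 0.2])

def rr_matrix(p):
    """Randomized-response matrix: keep the true category with probability p."""
    P = np.full((c, c), (1 - p) / (c - 1))
    np.fill_diagonal(P, p)
    return P

levels = [rr_matrix(0.9), rr_matrix(0.6), rr_matrix(0.4)]  # per-user privacy levels
choice = rng.integers(len(levels), size=n)                 # each user's chosen level
truth = rng.choice(c, size=n, p=true_dist)
reports = np.array([rng.choice(c, p=levels[m][t]) for m, t in zip(choice, truth)])

# Pr(report y_u | true category i) under each user's own matrix: an n x c array.
lik = np.array([levels[m][:, y] for m, y in zip(choice, reports)])

pi = np.full(c, 1.0 / c)
for _ in range(200):                       # EM iterations
    resp = lik * pi                        # unnormalized posteriors over true categories
    resp /= resp.sum(axis=1, keepdims=True)
    pi = resp.mean(axis=0)                 # M-step: update the category distribution
print(np.round(pi, 3))                     # close to true_dist
```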