Efficient Privacy Preserving Distributed Clustering Based on Secret Sharing
In this paper, we propose a privacy preserving distributed
clustering protocol for horizontally partitioned data based on a very efficient
homomorphic additive secret sharing scheme. The model we use
for the protocol is novel in the sense that it utilizes two non-colluding
third parties. We provide a brief security analysis of our protocol from
an information-theoretic point of view, which is a stronger security model.
We show communication and computation complexity analysis of our
protocol along with another protocol previously proposed for the same
problem. We also include experimental results for computation and communication
overhead of these two protocols. Our protocol not only outperforms
the others in execution time and communication overhead on
data holders, but also uses a more efficient model for many data mining
applications.
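The additive secret sharing idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's protocol: the modulus, the function names, and the two-share split (one share per non-colluding third party) are assumptions made for illustration only.

```python
import secrets

# Assumed public modulus; any sufficiently large modulus would do for the sketch.
MODULUS = 2**61 - 1


def share(value, n_shares=2, modulus=MODULUS):
    """Split an integer into n_shares random additive shares modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_shares - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def reconstruct(shares, modulus=MODULUS):
    """Recover the secret by summing all shares modulo `modulus`."""
    return sum(shares) % modulus


def add_shares(a, b, modulus=MODULUS):
    """Add two shared secrets share by share (the additive homomorphism)."""
    return [(x + y) % modulus for x, y in zip(a, b)]


if __name__ == "__main__":
    # Two hypothetical data holders each share a private value.
    holder_a = share(10)
    holder_b = share(32)
    # Each third party adds the shares it received, never seeing the inputs.
    combined = add_shares(holder_a, holder_b)
    print(reconstruct(combined))
```

Each individual share is uniformly random on its own, which is why a single third party learns nothing about a holder's value as long as the two parties do not collude.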
Privacy Preserving Optics Clustering
OPTICS is a well-known density-based clustering algorithm that builds on the ideas of DBSCAN. Rather than producing an explicit clustering of a data set, it creates an augmented ordering of the database that represents its density-based clustering structure. The resulting cluster ordering contains information equivalent to the density-based clusterings corresponding to a wide range of parameter settings. The same algorithm can be applied in privacy-preserving data mining, where extracting useful information from data distributed over a network requires preserving the privacy of individuals' information. As an example, we consider the problem of clustering a distributed database, where two parties want to learn their cluster numbers on the combined database without revealing either party's information to the other. This problem is a particular instance of secure multi-party computation and can be solved with the protocols proposed in our work together with some standard protocols.
A privacy preservation masking method to support business collaboration.
This paper introduces a privacy preservation masking method to support business collaboration, called Dimensionality Reduction-Based Transformation (DRBT). This method relies on the intuition behind random projection to mask the underlying attribute values subject to cluster analysis. Using DRBT, data owners are able to find a solution that meets privacy requirements and guarantees valid clustering results. DRBT was validated on five real datasets. The major features of this method are: a) it is independent of distance-based clustering algorithms; b) it has a sound mathematical foundation; and c) it does not require CPU-intensive operations. In publication: Stanley R. M. Oliveira
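The random projection idea that this abstract refers to can be sketched as follows. This is a generic Gaussian random projection, not the DRBT method itself; the function names, the seed, and the 1/sqrt(k) scaling are assumptions chosen for illustration.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """Build a d x k matrix of Gaussian entries scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]


def project(points, matrix):
    """Map each d-dimensional point to k dimensions via the matrix."""
    k = len(matrix[0])
    return [
        [sum(x * row[j] for x, row in zip(point, matrix)) for j in range(k)]
        for point in points
    ]


if __name__ == "__main__":
    # Hypothetical records with four attributes each, masked down to two.
    data = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.0, 5.0], [9.0, 8.0, 7.0, 6.0]]
    matrix = random_projection_matrix(d=4, k=2)
    for masked in project(data, matrix):
        print(masked)
```

The projected values no longer expose the original attributes directly, while pairwise distances are approximately preserved when k is large enough, which is what lets distance-based clustering still work on the masked data.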
Formal Representation of the SS-DB Benchmark and Experimental Evaluation in EXTASCID
Evaluating the performance of scientific data processing systems is a
difficult task considering the plethora of application-specific solutions
available in this landscape and the lack of a generally-accepted benchmark. The
dual structure of scientific data coupled with the complex nature of processing
complicates the evaluation procedure further. SS-DB is the first attempt to
define a general benchmark for complex scientific processing over raw and
derived data. It fails to draw sufficient attention though because of the
ambiguous plain language specification and the extraordinary SciDB results. In
this paper, we remedy the shortcomings of the original SS-DB specification by
providing a formal representation in terms of ArrayQL algebra operators and
ArrayQL/SciQL constructs. These are the first formal representations of the
SS-DB benchmark. Starting from the formal representation, we give a reference
implementation and present benchmark results in EXTASCID, a novel system for
scientific data processing. EXTASCID is complete in providing native support
both for array and relational data and extensible in executing any user code
inside the system by the means of a configurable metaoperator. These features
result in an order of magnitude improvement over SciDB at data loading,
extracting derived data, and operations over derived data. Comment: 32 pages, 3 figures
Differentially Private Vertical Federated Clustering
In many applications, multiple parties have private data regarding the same
set of users but on disjoint sets of attributes, and a server wants to leverage
the data to train a model. To enable model learning while protecting the
privacy of the data subjects, we need vertical federated learning (VFL)
techniques, where the data parties share only information for training the
model, instead of the private data. However, it is challenging to ensure that
the shared information maintains privacy while learning accurate models. To the
best of our knowledge, the algorithm proposed in this paper is the first
practical solution for differentially private vertical federated k-means
clustering, where the server can obtain a set of global centers with a provable
differential privacy guarantee. Our algorithm assumes an untrusted central
server that aggregates differentially private local centers and membership
encodings from local data parties. It builds a weighted grid as the synopsis of
the global dataset based on the received information. Final centers are
generated by running any k-means algorithm on the weighted grid. Our approach
for grid weight estimation uses a novel, light-weight, and differentially
private set intersection cardinality estimation algorithm based on the
Flajolet-Martin sketch. To improve the estimation accuracy in the setting with
more than two data parties, we further propose a refined version of the weights
estimation algorithm and a parameter tuning strategy to bring the final
k-means utility close to that in the central private setting. We provide
theoretical utility analysis and experimental evaluation results for the
cluster centers computed by our algorithm and show that our approach performs
better both theoretically and empirically than the two baselines based on
existing techniques.
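The Flajolet-Martin sketch mentioned in this abstract can be illustrated with a plain, non-private version. This is not the paper's differentially private algorithm: it adds no noise, and the hash construction, the number of hash functions, and the inclusion-exclusion step for the intersection are assumptions made for the sketch.

```python
import hashlib

PHI = 0.77351      # standard Flajolet-Martin correction constant
NUM_HASHES = 64    # assumed number of independent hash functions


def _rho(item, seed):
    """Index of the lowest set bit in a 64-bit hash of (seed, item)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return 63
    return (h & -h).bit_length() - 1


def fm_sketch(items, num_hashes=NUM_HASHES):
    """Build one bitmap per hash function, setting bit rho(item) for each item."""
    bitmaps = [0] * num_hashes
    for item in items:
        for seed in range(num_hashes):
            bitmaps[seed] |= 1 << _rho(item, seed)
    return bitmaps


def merge(a, b):
    """The sketch of a union is the bitwise OR of the two sketches."""
    return [x | y for x, y in zip(a, b)]


def _lowest_zero(bitmap):
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r


def estimate(bitmaps):
    """Estimate the number of distinct items from the averaged lowest-zero index."""
    mean_r = sum(_lowest_zero(b) for b in bitmaps) / len(bitmaps)
    return 2 ** mean_r / PHI


def intersection_estimate(a, b):
    """Inclusion-exclusion: |A and B| is roughly |A| + |B| - |A or B|."""
    return estimate(a) + estimate(b) - estimate(merge(a, b))


if __name__ == "__main__":
    # Hypothetical user-ID sets held by two data parties, overlapping on 500 IDs.
    party_a = fm_sketch(range(0, 1000))
    party_b = fm_sketch(range(500, 1500))
    print(round(estimate(party_a)), round(intersection_estimate(party_a, party_b)))
```

Because sketches merge with a bitwise OR, the parties only need to exchange small bitmaps rather than their ID sets. The estimates are approximate and can be noisy for small intersections.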
Semantic preserving text representation and its applications in text clustering
Text mining using the vector space representation has proven to be a valuable tool for classification, prediction, information retrieval and extraction. The nature of text data presents several issues to these tasks, including large dimension and the existence of special polysemous and synonymous words. A variety of techniques have been devised to overcome these shortcomings, including feature selection and word sense disambiguation. Privacy preserving data mining is also an area of emerging interest. Existing techniques for privacy preserving data mining require the use of secure computation protocols, which often incur a greatly increased computational cost. In this paper, a generalization-based method is presented for creating a semantic-preserving vector space which reduces dimension as well as addresses problems with special word types. The SPVSM also allows private text data to be safely represented without degrading cluster accuracy or performance. Further, the result produced is also usable in combination with theoretic based techniques such as latent semantic indexing. The performance of text clustering using the semantic preserving generalization method is evaluated and compared to existing feature selection techniques, and shown to have significant merit from a clustering perspective.