Secure Similarity Search
One of the most substantial ways to protect users' sensitive
information is encryption. This paper concerns keyword index
search systems over encrypted documents. Search with errors over
encrypted data has been considered impossible, because a one-bit
difference between plaintexts may produce an enormous difference
between the corresponding ciphertexts. We propose a novel idea for
handling search with errors over encrypted data. We develop two
similarity search schemes, implement prototypes, and provide
substantial analysis. We also define security requirements for
similarity search over encrypted data. The first scheme achieves
perfect privacy in similarity search, while the second scheme is more efficient.
SSS-V2: Secure Similarity Search
Encrypting information has been regarded as one of the most substantial approaches to protecting users' sensitive information in a radically changing internet technology era. Prior research has considered similarity search over encrypted documents infeasible, because a single-bit difference in a plaintext results in an enormous difference in the corresponding ciphertext. However, we propose a novel idea, Secure Similarity Search (SSS) over encrypted documents, which applies character-wise encryption with approximate string matching to keyword index search systems. To do this, we define the security requirements of similarity search over encrypted data, propose two similarity search schemes, and formally prove the security of the schemes. The first scheme is more efficient, while the second scheme achieves perfect similarity search privacy. Surprisingly, the second scheme turns out to be faster than other keyword index search schemes with keyword-wise encryption, while enjoying the same level of security. The SSS schemes support a "like" query ("ab%") and a query with misprints, in that
the character-wise encryption preserves the degree of similarity between two plaintexts and renders approximate string matching between the corresponding ciphertexts possible without decryption.
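The abstract does not spell out the construction, but the core idea can be sketched. Assuming each character is encrypted deterministically (simulated below with a keyed PRF; the paper's actual scheme is not reproduced here), equal plaintext characters map to equal ciphertext tokens, so the Levenshtein distance between two ciphertext sequences equals the distance between the plaintexts:

```python
import hashlib
import hmac

def enc_char(key, ch):
    # Deterministic per-character token via a keyed PRF. Illustrative only:
    # equal plaintext characters yield equal tokens, which is what lets
    # string similarity survive encryption (not the paper's real scheme).
    return hmac.new(key, ch.encode(), hashlib.sha256).hexdigest()[:8]

def enc_word(key, word):
    return [enc_char(key, c) for c in word]

def edit_distance(a, b):
    # Standard Levenshtein DP over any two sequences comparable with ==.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[n]

key = b"demo-key"
# The distance between ciphertext sequences equals the plaintext distance.
assert edit_distance(enc_word(key, "kitten"), enc_word(key, "sitting")) \
       == edit_distance("kitten", "sitting") == 3
```

This also illustrates why misprint-tolerant queries become possible: the server can run approximate matching on tokens without ever decrypting them.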
Homomorphic Encryption for Speaker Recognition: Protection of Biometric Templates and Vendor Model Parameters
Data privacy is crucial when dealing with biometric data. Accounting for the
latest European data privacy regulation and payment service directive,
biometric template protection is essential for any commercial application.
By ensuring unlinkability across biometric service operators, irreversibility of
leaked encrypted templates, and renewability of, e.g., voice models following
the i-vector paradigm, biometric voice-based systems are prepared for the
latest EU data privacy legislation. Employing Paillier cryptosystems, Euclidean
and cosine comparators are known to satisfy data privacy demands without loss
of discrimination or calibration performance. Bridging gaps from template
protection to speaker recognition, two architectures are proposed for the
two-covariance comparator, serving as a generative model in this study. The
first architecture preserves privacy of biometric data capture subjects. In the
second architecture, model parameters of the comparator are encrypted as well,
such that biometric service providers can supply the same comparison modules
employing different key pairs to multiple biometric service operators. An
experimental proof-of-concept and complexity analysis is carried out on the
data from the 2013-2014 NIST i-vector machine learning challenge.
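For context, the additive homomorphism of the Paillier cryptosystem is what allows linear comparators such as encrypted inner products to be evaluated on i-vectors. A toy sketch with insecurely small primes (illustrative only; not the paper's architectures, which real deployments would build with ~2048-bit moduli):

```python
import math
import random

def keygen(p=10007, q=10009):
    # Toy primes for illustration only; real Paillier needs ~2048-bit primes.
    n = p * q
    n2 = n * n
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                                  # standard generator choice
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (n, g), (n, lam, mu)                # (public key, secret key)

def enc(pk, m):
    n, g = pk
    n2 = n * n
    r = random.randrange(2, n)                 # fresh randomizer per ciphertext
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(sk, c):
    n, lam, mu = sk
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def enc_dot(pk, enc_x, w):
    # Enc(x_i)^w_i multiply to Enc(sum_i w_i * x_i): an inner product
    # <x, w> computed with x encrypted and the weights w in plaintext.
    n, _ = pk
    n2, acc = n * n, 1
    for c, wi in zip(enc_x, w):
        acc = acc * pow(c, wi, n2) % n2
    return acc
```

Multiplying ciphertexts adds plaintexts (Enc(a)·Enc(b) mod n² decrypts to a+b), which is the property that lets a server score encrypted templates without seeing them.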
Secure Metric-Based Index for Similarity Cloud
We propose a similarity index that ensures data privacy and thus is suitable for search systems outsourced to a cloud. The proposed solution can exploit existing efficient metric indexes based on a fixed set of reference points. The method has been fully implemented as a security extension of an existing established approach called M-Index. This Encrypted M-Index supports evaluation of standard range and nearest-neighbor queries in both precise and approximate manners. In the first part of this work, we analyze various levels of privacy in existing or future similarity search systems; the proposed solution keeps a reasonable privacy level while relocating only the necessary amount of work from the server to an authorized client. The Encrypted M-Index has been tested on three real data sets with a focus on various cost components.
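The fixed-reference-point idea underlying M-Index-style structures can be illustrated in a few lines: precomputed pivot distances give a triangle-inequality lower bound that prunes objects without evaluating the true metric. This sketch shows only the plaintext principle, not the Encrypted M-Index itself:

```python
def dist(u, v):
    # Euclidean distance; any metric obeying the triangle inequality works.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def build(objects, pivots):
    # Precompute each object's distances to the fixed reference points.
    return [(o, [dist(o, p) for p in pivots]) for o in objects]

def range_query(index, pivots, q, radius):
    qd = [dist(q, p) for p in pivots]
    hits = []
    for o, od in index:
        # Triangle inequality: |d(q,p) - d(o,p)| <= d(q,o) for every pivot p.
        if max(abs(a - b) for a, b in zip(qd, od)) > radius:
            continue                     # pruned without computing d(q, o)
        if dist(q, o) <= radius:         # verify survivors with the metric
            hits.append(o)
    return hits
```

In an outsourced setting, the server can do this filtering on precomputed pivot distances while the authorized client performs only the final verification, which is the kind of server/client work split the abstract describes.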
Improving k-nn search and subspace clustering based on local intrinsic dimensionality
In several novel applications, such as multimedia and recommender systems, data is often represented as object feature vectors in high-dimensional spaces. High-dimensional data is always a challenge for state-of-the-art algorithms because of the so-called curse of dimensionality. As the dimensionality increases, the discriminative ability of similarity measures diminishes to the point where many data analysis algorithms that depend on them, such as similarity search and clustering, lose their effectiveness. One way to handle this challenge is to select the most important features, which is essential for providing compact object representations as well as improving overall search and clustering performance. Compact feature vectors can further reduce the storage space and computational complexity of search and learning tasks.
Support-Weighted Intrinsic Dimensionality (support-weighted ID) is a new promising feature selection criterion that estimates the contribution of each feature to the overall intrinsic dimensionality. Support-weighted ID identifies relevant features locally for each object, and penalizes those features that have locally lower discriminative power as well as higher density. In fact, support-weighted ID measures the ability of each feature to locally discriminate between objects in the dataset.
Based on support-weighted ID, this dissertation introduces three main research contributions. First, it proposes NNWID-Descent, a similarity graph construction method that utilizes the support-weighted ID criterion to identify and retain relevant features locally for each object, enhancing overall graph quality. Second, with the aim of improving the accuracy and performance of cluster analysis, it introduces k-LIDoids, a subspace clustering algorithm that extends the utility of support-weighted ID within a clustering framework to gradually select the subset of informative and important features per cluster; k-LIDoids constructs clusters while finding a low-dimensional subspace for each cluster. Finally, using the compact object and cluster representations from NNWID-Descent and k-LIDoids, the dissertation defines LID-Fingerprint, a new binary fingerprinting and multi-level indexing framework for high-dimensional data. LID-Fingerprint can be used to hide information from passive adversaries while providing efficient and secure similarity search and retrieval for data stored in the cloud. Compared with other state-of-the-art algorithms, the good practical performance provides evidence for the effectiveness of the proposed algorithms on data in high-dimensional spaces.
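Support-weighted ID builds on estimates of local intrinsic dimensionality. A common maximum-likelihood estimator computed from a point's k-nearest-neighbor distances (the Levina-Bickel form used in the LID literature; the dissertation's weighted criterion itself is not reproduced here) can be sketched as:

```python
import math

def lid_mle(dists):
    # MLE of local intrinsic dimensionality from the k nearest-neighbor
    # distances of a single query point:
    #   LID = -( (1/k) * sum_i ln(r_i / r_max) )^(-1)
    r_max = max(dists)
    s = sum(math.log(r / r_max) for r in dists if r < r_max)
    return -len(dists) / s if s else float("inf")

# Sanity check: neighbor distances taken as exact quantiles of a uniform
# distribution in a d-dimensional ball recover an estimate close to d.
k, d_true = 1000, 5
dists = [((i + 1) / k) ** (1.0 / d_true) for i in range(k)]
```

Intuitively, the faster neighbor distances shrink relative to the k-th distance, the lower the local dimensionality; per-feature variants of this quantity are what the support-weighted criterion aggregates.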
SoK: Cryptographically Protected Database Search
Protected database search systems cryptographically isolate the roles of
reading from, writing to, and administering the database. This separation
limits unnecessary administrator access and protects data in the case of system
breaches. Since protected search was introduced in 2000, the area has grown
rapidly; systems are offered by academia, start-ups, and established companies.
However, there is no best protected search system or set of techniques.
Design of such systems is a balancing act between security, functionality,
performance, and usability. This challenge is made more difficult by ongoing
database specialization, as some users will want the functionality of SQL,
NoSQL, or NewSQL databases. This database evolution will continue, and the
protected search community should be able to quickly provide functionality
consistent with newly invented databases.
At the same time, the community must accurately and clearly characterize the
tradeoffs between different approaches. To address these challenges, we provide
the following contributions:
1) An identification of the important primitive operations across database
paradigms. We find there are a small number of base operations that can be used
and combined to support a large number of database paradigms.
2) An evaluation of the current state of protected search systems in
implementing these base operations. This evaluation describes the main
approaches and tradeoffs for each base operation. Furthermore, it puts
protected search in the context of unprotected search, identifying key gaps in
functionality.
3) An analysis of attacks against protected search for different base
queries.
4) A roadmap and tools for transforming a protected search system into a
protected database, including an open-source performance evaluation platform
and initial user opinions of protected search. (Comment: 20 pages, to appear in IEEE Security and Privacy.)
Don't forget private retrieval: distributed private similarity search for large language models
While the flexible capabilities of large language models (LLMs) allow them to
answer a range of queries based on existing learned knowledge, information
retrieval to augment generation is an important tool to allow LLMs to answer
questions on information not included in pre-training data. Such private
information is increasingly being generated in a wide array of distributed
contexts by organizations and individuals. Performing such information
retrieval using neural embeddings of queries and documents leaks
information about queries and database content unless both are stored locally.
We present Private Retrieval Augmented Generation (PRAG), an approach that uses
multi-party computation (MPC) to securely transmit queries to a distributed set
of servers containing a privately constructed database to return top-k and
approximate top-k documents. This is a first-of-its-kind approach to dense
information retrieval that ensures no server observes a client's query or can
see the database content. The approach introduces a novel MPC-friendly protocol
for inverted file approximate search (IVF) that allows for fast document search
over distributed and private data in sublinear communication complexity. This
work presents new avenues through which data for use in LLMs can be accessed
and used without needing to centralize data or forgo privacy.
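For readers unfamiliar with inverted-file (IVF) approximate search, the plaintext structure that PRAG adapts to MPC can be sketched as follows; the MPC protocol itself is beyond a few lines, so this shows only why the approach is sublinear: a query scans only the probed lists, not the whole database.

```python
def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_ivf(vectors, centroids):
    # Assign each vector id to the inverted list of its nearest centroid.
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        best = min(range(len(centroids)), key=lambda i: sqdist(centroids[i], v))
        lists[best].append(idx)
    return lists

def ivf_search(q, vectors, centroids, lists, k=2, nprobe=1):
    # Probe only the nprobe closest lists, then rank just those candidates,
    # so the query touches a sublinear fraction of the stored vectors.
    probe = sorted(range(len(centroids)),
                   key=lambda i: sqdist(centroids[i], q))[:nprobe]
    cand = [idx for i in probe for idx in lists[i]]
    return sorted(cand, key=lambda idx: sqdist(vectors[idx], q))[:k]
```

In PRAG, both the query and the list contents are secret-shared across servers, but the probe-then-rank shape of the computation is the same.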
Efficient and secure document similarity search in the cloud utilizing MapReduce
Document similarity has important real-life applications, such as finding duplicate web sites and identifying plagiarism. While basic techniques such as k-similarity algorithms have long been known, the overwhelming amount of data being collected, as in big data settings, calls for novel algorithms that find highly similar documents in a reasonably short amount of time. In particular, pairwise comparison of documents sharing a common feature necessitates prohibitively high storage and computation power. The widespread availability of cloud computing provides users easy access to high storage and processing power. Furthermore, outsourcing their data to the cloud guarantees reliability and availability, while privacy and security concerns are not always properly addressed. This leads to the problem of protecting the privacy of sensitive data against adversaries, including the cloud operator. Traditional document similarity algorithms tend to compare all the documents in a data set that share terms (words) with the query document. In our work, we propose a new filtering technique that works on plaintext data and decreases the number of comparisons between the query set and the search set needed to find highly similar documents. The technique, referred to as the ZOLIP algorithm, is efficient and scalable, but does not provide security. We also design and implement three secure similarity search algorithms for text documents, namely Secure Sketch Search, Secure Minhash Search, and Secure ZOLIP. The first algorithm utilizes locality-sensitive hashing techniques and cosine similarity. The second algorithm uses the MinHash algorithm, and the last uses the encrypted ZOLIP signature, the secure version of the ZOLIP algorithm. We utilize the Hadoop distributed file system and the MapReduce parallel programming model to scale our techniques to the big data setting.
Our experimental results on real data show that some of the proposed methods perform better than previous work in the literature in terms of the number of joins and, therefore, speed.
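The MinHash idea behind Secure Minhash Search can be sketched in plaintext (illustrative only; the secure version operates over encrypted data): the fraction of agreeing signature components is an unbiased estimate of the Jaccard similarity of two token sets, so short signatures stand in for whole documents.

```python
import hashlib

def minhash_signature(tokens, num_hashes=128):
    # One minimum per (seeded) hash function; for two sets A and B,
    # P[min-hash of A == min-hash of B] equals Jaccard(A, B).
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of matching components estimates |A & B| / |A | B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Longer signatures reduce the variance of the estimate (roughly proportional to 1/num_hashes), which is the usual accuracy/size trade-off in sketch-based similarity search.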
A Study on Efficient and Secure Set Similarity Joins
University of Tsukuba, 201
Chameleon: A Secure Cloud-Enabled and Queryable System with Elastic Properties
There are two dominant themes that have become increasingly more important in our
technological society. First, the recurrent use of cloud-based solutions which provide
infrastructures, computation platforms, and storage as services. Second, the use of large
application logs for analytics and operational monitoring in critical systems. Moreover,
auditing activities, debugging of applications and inspection of events generated by errors
or potential unexpected operations - including those generated as alerts by intrusion
detection systems - are common situations where extensive logs must be analyzed, and
easy access is required. More often than not, a part of the generated logs can be deemed
as sensitive, requiring a privacy-enhancing and queryable solution.
In this dissertation, our main goal is to propose a novel approach of storing encrypted
critical data in an elastic and scalable cloud-based storage, focusing on handling JSON-based
ciphered documents. To this end, we make use of Searchable and Homomorphic
Encryption methods to allow operations on the ciphered documents. Additionally, our
solution allows the user to remain nearly oblivious to our system's internals, providing
transparency while in use. The achieved end goal is a unified middleware system capable
of providing improved system usability, privacy, and rich querying over the data. This
previously mentioned objective is addressed while maintaining server-side auditable logs,
allowing for searchable capabilities by the log owner or authorized users, with integrity
and authenticity proofs.
Our proposed solution, named Chameleon, provides rich querying facilities on ciphered
data, including conjunctive keyword, ordering correlation, and boolean queries,
while supporting field searching and nested aggregations. The aforementioned operations
allow our solution to provide data analytics upon ciphered JSON documents, using
Elasticsearch as our storage and search engine.