9,625 research outputs found
A practical index for approximate dictionary matching with few mismatches
Approximate dictionary matching is a classic string matching problem
(checking if a query string occurs in a collection of strings) with
applications in, e.g., spellchecking, online catalogs, geolocation, and web
searchers. We present a surprisingly simple solution called a split index,
which is based on the Dirichlet principle, for matching a keyword with few
mismatches, and experimentally show that it offers competitive space-time
tradeoffs. Our implementation in the C++ language is focused mostly on data
compaction, which is beneficial for the search speed (e.g., by being cache
friendly). We compare our solution with other algorithms and we show that it
performs better for the Hamming distance. Query times in the order of 1
microsecond were reported for one mismatch for the dictionary size of a few
megabytes on a medium-end PC. We also demonstrate that a basic compression
technique consisting in -gram substitution can significantly reduce the
index size (up to 50% of the input text size for the DNA), while still keeping
the query time relatively low
Rank, select and access in grammar-compressed strings
Given a string of length on a fixed alphabet of symbols, a
grammar compressor produces a context-free grammar of size that
generates and only . In this paper we describe data structures to
support the following operations on a grammar-compressed string:
\mbox{rank}_c(S,i) (return the number of occurrences of symbol before
position in ); \mbox{select}_c(S,i) (return the position of the th
occurrence of in ); and \mbox{access}(S,i,j) (return substring
). For rank and select we describe data structures of size
bits that support the two operations in time. We
propose another structure that uses
bits and that supports the two queries in , where
is an arbitrary constant. To our knowledge, we are the first to
study the asymptotic complexity of rank and select in the grammar-compressed
setting, and we provide a hardness result showing that significantly improving
the bounds we achieve would imply a major breakthrough on a hard
graph-theoretical problem. Our main result for access is a method that requires
bits of space and time to extract
consecutive symbols from . Alternatively, we can achieve query time using bits of space. This matches a lower bound stated by Verbin
and Yu for strings where is polynomially related to .Comment: 16 page
EsPRESSo: Efficient Privacy-Preserving Evaluation of Sample Set Similarity
Electronic information is increasingly often shared among entities without
complete mutual trust. To address related security and privacy issues, a few
cryptographic techniques have emerged that support privacy-preserving
information sharing and retrieval. One interesting open problem in this context
involves two parties that need to assess the similarity of their datasets, but
are reluctant to disclose their actual content. This paper presents an
efficient and provably-secure construction supporting the privacy-preserving
evaluation of sample set similarity, where similarity is measured as the
Jaccard index. We present two protocols: the first securely computes the
(Jaccard) similarity of two sets, and the second approximates it, using MinHash
techniques, with lower complexities. We show that our novel protocols are
attractive in many compelling applications, including document/multimedia
similarity, biometric authentication, and genetic tests. In the process, we
demonstrate that our constructions are appreciably more efficient than prior
work.Comment: A preliminary version of this paper was published in the Proceedings
of the 7th ESORICS International Workshop on Digital Privacy Management (DPM
2012). This is the full version, appearing in the Journal of Computer
Securit
SoK: Cryptographically Protected Database Search
Protected database search systems cryptographically isolate the roles of
reading from, writing to, and administering the database. This separation
limits unnecessary administrator access and protects data in the case of system
breaches. Since protected search was introduced in 2000, the area has grown
rapidly; systems are offered by academia, start-ups, and established companies.
However, there is no best protected search system or set of techniques.
Design of such systems is a balancing act between security, functionality,
performance, and usability. This challenge is made more difficult by ongoing
database specialization, as some users will want the functionality of SQL,
NoSQL, or NewSQL databases. This database evolution will continue, and the
protected search community should be able to quickly provide functionality
consistent with newly invented databases.
At the same time, the community must accurately and clearly characterize the
tradeoffs between different approaches. To address these challenges, we provide
the following contributions:
1) An identification of the important primitive operations across database
paradigms. We find there are a small number of base operations that can be used
and combined to support a large number of database paradigms.
2) An evaluation of the current state of protected search systems in
implementing these base operations. This evaluation describes the main
approaches and tradeoffs for each base operation. Furthermore, it puts
protected search in the context of unprotected search, identifying key gaps in
functionality.
3) An analysis of attacks against protected search for different base
queries.
4) A roadmap and tools for transforming a protected search system into a
protected database, including an open-source performance evaluation platform
and initial user opinions of protected search.Comment: 20 pages, to appear to IEEE Security and Privac
Identifying Web Tables - Supporting a Neglected Type of Content on the Web
The abundance of the data in the Internet facilitates the improvement of
extraction and processing tools. The trend in the open data publishing
encourages the adoption of structured formats like CSV and RDF. However, there
is still a plethora of unstructured data on the Web which we assume contain
semantics. For this reason, we propose an approach to derive semantics from web
tables which are still the most popular publishing tool on the Web. The paper
also discusses methods and services of unstructured data extraction and
processing as well as machine learning techniques to enhance such a workflow.
The eventual result is a framework to process, publish and visualize linked
open data. The software enables tables extraction from various open data
sources in the HTML format and an automatic export to the RDF format making the
data linked. The paper also gives the evaluation of machine learning techniques
in conjunction with string similarity functions to be applied in a tables
recognition task.Comment: 9 pages, 4 figure
- …