10 research outputs found
On the average-case complexity of pattern matching with wildcards
Pattern matching with wildcards is a string matching problem: the goal is to find all factors of a text of length n that match a pattern of length m, where wildcards (characters that match everything) may be present.
In this paper we present a number of complexity results and fast average-case algorithms for pattern matching where wildcards are allowed in the pattern; the results are easily adapted to the case where wildcards are allowed in the text as well.
We analyse the average-case complexity of these algorithms and derive non-trivial time bounds. These are the first results on the average-case complexity of pattern matching with wildcards that provide a provable separation in time complexity between exact pattern matching and pattern matching with wildcards.
We introduce the wc-period of a string x: the period of the binary mask b in which b[i] = 1 iff position i of x holds a wildcard, and b[i] = 0 otherwise. We denote the length of the wc-period of a string x by wcp(x).
We show the following results for constant alphabet size σ and a pattern x of length m containing wildcards, with wcp(x) = p and a prefix of a given length containing a given number of wildcards:
- If …, there is an optimal algorithm running in O(n log_σ m / m) time on average.
- If …, there is an algorithm running in O(n log_σ m log_2 p / m) time on average.
- If …, any algorithm takes at least …-time on average.
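For concreteness, the matching semantics above can be sketched with the naive O(nm) scan that the average-case algorithms are measured against. The function name and the use of '?' as the wildcard symbol are illustrative choices, not the paper's notation:

```python
# Illustrative sketch only: naive wildcard matching, the O(nm)
# baseline that the average-case bounds above improve upon.
# '?' stands in for the wildcard character.

def wildcard_occurrences(text: str, pattern: str) -> list[int]:
    """Return all positions i where pattern matches text[i:i+m],
    treating '?' in the pattern as a wildcard."""
    n, m = len(text), len(pattern)
    occ = []
    for i in range(n - m + 1):
        if all(p == '?' or p == t for p, t in zip(pattern, text[i:i + m])):
            occ.append(i)
    return occ
```

The fast algorithms in the paper avoid examining every alignment by skipping ahead based on the wc-period, rather than scanning each window in full as this sketch does.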
Weighted ancestors in suffix trees
The classical, ubiquitous, predecessor problem is to construct a data
structure for a set of integers that supports fast predecessor queries. Its
generalization to weighted trees, a.k.a. the weighted ancestor problem, has
been extensively explored and successfully reduced to the predecessor problem.
It is known that any solution for both problems with an input set from a
polynomially bounded universe that preprocesses a weighted tree in O(n
polylog(n)) space requires Ω(log log n) query time. Perhaps the most
important and frequent application of the weighted ancestor problem is for
suffix trees. It has been a long-standing open question whether the weighted
ancestors problem has better bounds for suffix trees. We answer this question
positively: we show that a suffix tree built for a text w[1..n] can be
preprocessed using O(n) extra space, so that queries can be answered in O(1)
time. Thus we improve the running times of several applications. Our
improvement is based on a number of data structure tools and a
periodicity-based insight into the combinatorial structure of a suffix tree. Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201
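The predecessor problem this abstract builds on can be pinned down with a minimal comparison-based sketch: a sorted array answered by binary search in O(log n) per query. The bounds discussed above concern much faster word-RAM structures; the class name here is hypothetical:

```python
import bisect

class PredecessorSet:
    """Static predecessor structure over a set of integers.
    A comparison-based sketch with O(log n) queries; the
    O(log log n) bounds in the abstract refer to word-RAM
    structures over a polynomially bounded universe."""

    def __init__(self, keys):
        self.keys = sorted(keys)

    def predecessor(self, x):
        """Largest key <= x, or None if every key exceeds x."""
        i = bisect.bisect_right(self.keys, x)
        return self.keys[i - 1] if i else None
```

The weighted ancestor problem generalizes this: each root-to-leaf path of the tree is, in effect, such a sorted set of node weights, and a query asks for the predecessor of a target weight along a given path.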
SWiM: Secure Wildcard Pattern Matching From OT Extension
Suppose a server holds a long text string and a receiver holds a short pattern string. Secure pattern matching allows the receiver to learn the locations in the long text where the pattern appears, while leaking nothing else to either party besides the length of their inputs. In this work we consider secure wildcard pattern matching (WPM), where the receiver's pattern is allowed to contain wildcards that match any character.
We present SWiM, a simple and fast protocol for WPM that is based heavily on oblivious transfer (OT) extension. As such, the protocol requires only a small constant number of public-key operations and otherwise uses only very fast symmetric-key primitives. SWiM is secure against semi-honest adversaries. We implemented a prototype of our protocol to demonstrate its practicality. We can perform WPM on a DNA text (4-character alphabet) of length … with a pattern of length … in just over 2 seconds, which is over two orders of magnitude faster than the state-of-the-art scheme of Baron et al. (SCN 2012).
Space-efficient data structures for string searching and retrieval
Let D = {d_1, d_2, ...} be a collection of string documents of n characters in total, drawn from an alphabet Σ = [σ] = {1, 2, 3, ..., σ}. The top-k document retrieval problem is to maintain D as a data structure such that, whenever a query Q = (P, k) arrives, we can report (the identifiers of) the k documents most relevant to the pattern P (of p characters). The relevance of a document d_r with respect to a pattern P is captured by score(P, d_r), which can be any function of the set of locations where P occurs in d_r. Finding the documents most relevant to the user query is the central task of any web-search engine.

In the case of web data, the documents can be demarcated along word boundaries, and all search engines use the inverted index as their backbone data structure. For each word occurring in the document collection, the inverted index stores the list of documents where it appears, often augmented with relevance scores and/or positional information. However, when the data consists of strings (e.g., in bioinformatics or Asian-language texts), there are no word demarcation boundaries and the queries are arbitrary substrings rather than proper valid words. In this case, string data structures have to be used, and the central approach is a suffix tree (or string B-tree) with appropriate augmenting data structures. The work by Hon, Shah and Vitter [FOCS 2009] and Navarro and Nekrich [SODA 2012] resulted in a linear-space data structure with optimal O(p + k) query time for this problem, based on a geometric interpretation of the query.

We extend this central problem in two important directions for massive data sets. First, we consider an external-memory, disk-based index, for which we give near-optimal results. Next, we consider compression aspects of the data structure, reducing the storage space; this is the central goal of the active research field of succinct data structures.
We present several results that improve upon previous work and are currently the best known space-time trade-offs in this area.
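As a toy illustration of the word-based baseline described above, the following sketch builds an inverted index with term-frequency scores and answers top-k queries. All names are illustrative, and real engines use far richer scoring than raw frequency:

```python
from collections import defaultdict
import heapq

# Illustrative only: a word-based inverted index, the "backbone"
# structure described in the abstract. The suffix-tree machinery
# discussed there replaces this when queries are arbitrary substrings.

def build_inverted_index(docs):
    """Map each word to {doc_id: occurrence count} over a list of texts."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word][doc_id] += 1
    return index

def top_k(index, word, k):
    """Return the k doc ids with the highest score(word, doc),
    where the score here is simply the occurrence count."""
    postings = index.get(word, {})
    return heapq.nlargest(k, postings, key=postings.get)
```

The string-data setting removes the word boundaries this relies on: there is no finite vocabulary to index, which is why the abstract's suffix-tree-based structures are needed.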
Efficient Private Matching for Private Databases
Private matching (PM) is a key cryptographic primitive in secure computation that allows several parties to jointly compute a function of their private inputs. This primitive has many practical applications. For instance, in online advertising, two companies may wish to find their common customers for a joint marketing campaign. In this scenario, privacy is of utmost importance, and it is imperative that neither company learn more than its own data and the result of the match.
In recent years, secure computation in general and PM in particular have attracted considerable attention from the research community, partly due to the rise of Big Data along with ever-increasing privacy concerns. This thesis describes three secure private set intersection (PSI) protocols, one private set union (PSU) construction, and one pattern matching scheme. PM can be considered a main building block in all of these protocols.
To securely compute the intersection of two sets of size …, our proposed protocols require only 3 seconds, which is faster than the previous best protocol. In the multi-party setting, we provide the first implementation that takes only 72 seconds to compute PSI for 5 parties with data sets of … items each. For private set union (PSU), our protocol improves on prior work by a factor of up to … for large instances. In addition, our wildcard pattern matching (WPM) protocol runs over two orders of magnitude faster than the state-of-the-art scheme.
Computational Methods for Gene Expression and Genomic Sequence Analysis
Advances in technology produce ever more cost-effective, high-throughput, and large-scale biological data. As a result, there is an urgent need for efficient computational methods for analyzing these massive data. In this dissertation, we introduce methods to address several important issues in gene expression and genomic sequence analysis, two of the most important areas in bioinformatics.

First, we introduce a novel approach to predicting patterns of gene response to multiple treatments when the sample size is small. Researchers are increasingly interested in experiments with many treatments, such as chemical compounds or drug doses. However, due to cost, many experiments do not have large enough samples, making it difficult for conventional methods to predict patterns of gene response. We introduce an approach that exploits dependencies between the outcomes of pairwise comparisons, together with resampling techniques, to predict true patterns of gene response from insufficient samples. This approach deduced more, and better, functionally enriched gene clusters than conventional methods, and is therefore useful for multiple-treatment studies that have small sample sizes or contain highly variably expressed genes.

Second, we introduce a novel method for aligning short reads, which are DNA fragments extracted across the genomes of individuals, to reference genomes. Results from short-read alignment can be used for many studies, such as measuring gene expression or detecting genetic variants. We introduce a method that employs an iterated randomized algorithm based on the FM-index, an efficient data structure for full-text indexing, to align reads to the reference. This method improved alignment performance across a wide range of read lengths and error rates compared to several popular methods, making it a good choice for the community for short-read alignment.

Finally, we introduce a novel approach to detecting genetic variants such as SNPs (single nucleotide polymorphisms) and INDELs (insertions/deletions). This problem has great significance in a wide range of areas, from bioinformatics and genetic research to medicine: for example, one can predict how genomic changes relate to phenotype in an organism of interest, or associate genetic changes with disease risk or the efficacy of medical treatment. We introduce a method that leverages known genetic variants from well-established databases to improve the accuracy of variant detection. This method had higher accuracy than several state-of-the-art methods in many cases, especially for detecting INDELs, and therefore has the potential to be useful in research and clinical applications that rely on identifying genetic variants accurately.
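A minimal sketch of FM-index backward search, the counting primitive that FM-index-based aligners build on. This uses naive suffix-array construction, no rank sampling or compression, and exact matching only; the function names are illustrative:

```python
# Illustrative FM-index sketch: count exact occurrences of a pattern
# via backward search over the Burrows-Wheeler transform (BWT).
# Real aligners add compressed rank structures and inexact matching.

def bwt(text):
    text += "$"  # unique terminator, lexicographically smallest
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Character preceding each sorted suffix (text[-1] wraps to '$').
    return "".join(text[i - 1] for i in sa)

def fm_count(text, pattern):
    """Number of occurrences of pattern in text via backward search."""
    b = bwt(text)
    # C[c]: number of characters in b strictly smaller than c.
    C = {c: sum(1 for x in b if x < c) for c in set(b)}
    lo, hi = 0, len(b)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + b[:lo].count(c)  # rank of c before lo
        hi = C[c] + b[:hi].count(c)  # rank of c before hi
        if lo >= hi:
            return 0
    return hi - lo
```

Each step of the loop extends the match by one character to the left, shrinking the suffix-array interval; the linear scans in `count` are what compressed rank/select structures replace with constant-time queries.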
Advanced rank/select data structures: succinctness, bounds and applications.
The thesis explores new theoretical results and applications of rank and select data structures. Given a string, select(c, i) gives the position of the ith occurrence of character c in the string, while rank(c, p) counts the number of instances of character c on the left of position p.
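The two operations can be pinned down with a direct-from-definition sketch using linear scans; succinct structures answer the same queries in near-constant time while using compressed space:

```python
# Direct-from-definition rank/select (linear scans). These fix only
# the semantics; the thesis concerns structures that answer the same
# queries quickly in little extra space.

def rank(s: str, c: str, p: int) -> int:
    """Number of occurrences of c in s strictly to the left of position p."""
    return s[:p].count(c)

def select(s: str, c: str, i: int) -> int:
    """Position (0-based) of the i-th occurrence of c in s (i >= 1)."""
    seen = 0
    for pos, ch in enumerate(s):
        if ch == c:
            seen += 1
            if seen == i:
                return pos
    raise ValueError(f"fewer than {i} occurrences of {c!r}")
```

Note the duality: rank(s, c, select(s, c, i)) == i - 1, since rank counts strictly to the left of the returned position.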
Succinct rank/select data structures are space-efficient versions of the standard ones, designed to keep data compressed while still answering queries rapidly. They are the basis of more involved compressed and succinct data structures, which in turn are motivated by today's need to analyze and operate on massive data sets quickly, where space efficiency is crucial. The thesis builds on the state of the art left by years of study and produces results on multiple fronts.
By analyzing binary succinct data structures and their link with predecessor data structures, we integrate data structures for the latter problem into the former. The result is a data structure that outperforms that of Patrascu [2008] in a range of cases not studied before, namely when the lower bound for predecessor does not apply and constant-time rank is not feasible.
Further, we propose the first lower bound for succinct data structures on generic strings, establishing a linear trade-off between the time for rank/select execution and the additional space (w.r.t. the plain data) needed by the data structure. The proposal addresses systematic data structures, namely those that only access the underlying string through ADT calls and do not encode it directly.
Also, we propose a matching upper bound that proves the tightness of our lower bound.
Finally, we apply rank/select data structures to the substring counting problem, where we seek to preprocess a text and generate a summary data structure that is stored in lieu of the text and answers substring counting queries with additive error. The results include a provably optimal data structure with generic additive error and a data structure that errs only on infrequent patterns, with significant practical space gains.
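For contrast with the approximate summaries above, here is an exact, error-free baseline for substring counting via a plain suffix array and binary search. It materializes all suffixes for clarity, so it is illustrative only; the point of the thesis's structures is to answer the same queries from a summary far smaller than the text:

```python
import bisect

# Exact baseline for substring counting. The summary structures in
# the abstract replace this with a much smaller sketch that tolerates
# a bounded additive error in the returned counts.

def suffix_array(text):
    """Naive O(n^2 log n) suffix array; fine for illustration."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    """Exact number of occurrences of pattern in text.
    Occurrences of pattern form a contiguous run of sorted suffixes."""
    suffixes = [text[i:] for i in sa]  # illustrative; O(n^2) space
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_left(suffixes, pattern + "\uffff")
    return hi - lo
```

The key fact used here is that all suffixes beginning with the pattern are adjacent in sorted order, so two binary searches delimit the run; "\uffff" serves as a sentinel larger than any character assumed to occur in the text.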
LIPIcs, Volume 248, ISAAC 2022, Complete Volume