10 research outputs found

    On the average-case complexity of pattern matching with wildcards

    Get PDF
    Pattern matching with wildcards is a string matching problem with the goal of finding all factors of a text $t$ of length $n$ that match a pattern $x$ of length $m$, where wildcards (characters that match everything) may be present. In this paper we present a number of complexity results and fast average-case algorithms for pattern matching where wildcards are allowed in the pattern; the results are easily adapted to the case where wildcards are allowed in the text as well. We analyse the average-case complexity of these algorithms and derive non-trivial time bounds. These are the first results on the average-case complexity of pattern matching with wildcards which provide a provable separation in time complexity between exact pattern matching and pattern matching with wildcards. We introduce the wc-period of a string, which is the period of the binary mask $x_b$, where $x_b[i] = a$ if and only if $x[i] \neq \phi$, and $x_b[i] = b$ otherwise. We denote the length of the wc-period of a string $x$ by $\mathrm{wcp}(x)$. We show the following results for a constant $0 < \epsilon < 1$ and a pattern $x$ of length $m$ with $g$ wildcards, where $\mathrm{wcp}(x) = p$ and the prefix of length $p$ contains $g_p$ wildcards:
    - If $\lim_{m \to \infty} g_p/p = 0$, there is an optimal algorithm running in $O(\frac{n \log_\sigma m}{m})$ time on average.
    - If $\lim_{m \to \infty} g_p/p = 1 - \epsilon$, there is an algorithm running in $O(\frac{n \log_\sigma m \log_2 p}{m})$ time on average.
    - If $\lim_{m \to \infty} g/m = \lim_{m \to \infty} 1 - f(m) = 1$, any algorithm takes at least $\Omega(\frac{n \log_\sigma m}{f(m)})$ time on average.
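    The definitions above can be illustrated with a plain brute-force sketch (not the paper's fast average-case algorithms); here '?' stands in for the wildcard symbol $\phi$, and the wc-period is computed naively as the smallest period of the wildcard mask:

```python
def wc_period(x, wildcard='?'):
    """Length of the wc-period of x: the smallest period of the binary
    mask that marks which positions of x hold a wildcard."""
    mask = [c == wildcard for c in x]
    m = len(mask)
    for p in range(1, m + 1):  # smallest p with mask[i] == mask[i + p] for all i
        if all(mask[i] == mask[i + p] for i in range(m - p)):
            return p
    return m

def match_with_wildcards(t, x, wildcard='?'):
    """All starting positions in text t where pattern x occurs,
    with wildcard positions of x matching any character."""
    n, m = len(t), len(x)
    return [i for i in range(n - m + 1)
            if all(x[j] == wildcard or x[j] == t[i + j] for j in range(m))]
```

    For example, the mask of "ab?ab?" repeats with period 3, so its wc-period is 3, while a wildcard-free pattern has wc-period 1 (its mask is constant).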

    Weighted ancestors in suffix trees

    Full text link
    The classical, ubiquitous predecessor problem is to construct a data structure for a set of integers that supports fast predecessor queries. Its generalization to weighted trees, a.k.a. the weighted ancestor problem, has been extensively explored and successfully reduced to the predecessor problem. It is known that any solution for both problems with an input set from a polynomially bounded universe that preprocesses a weighted tree in O(n polylog(n)) space requires Ω(log log n) query time. Perhaps the most important and frequent application of the weighted ancestor problem is for suffix trees. It has been a long-standing open question whether the weighted ancestor problem has better bounds for suffix trees. We answer this question positively: we show that a suffix tree built for a text w[1..n] can be preprocessed using O(n) extra space, so that queries can be answered in O(1) time. Thus we improve the running times of several applications. Our improvement is based on a number of data structure tools and a periodicity-based insight into the combinatorial structure of a suffix tree. Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201
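    The reduction mentioned above is easy to see on a single root-to-node path, where a weighted ancestor query is exactly a predecessor query over the ancestors' weights. A naive bisection sketch (O(log n) per query, nowhere near the paper's O(1) bound for suffix trees) illustrates the connection:

```python
import bisect

def predecessor(sorted_keys, q):
    """Classical predecessor query: largest key <= q, or None."""
    i = bisect.bisect_right(sorted_keys, q)
    return sorted_keys[i - 1] if i > 0 else None

def weighted_ancestor_on_path(ancestor_weights, d):
    """Weighted ancestor restricted to one root-to-node path: the deepest
    ancestor whose weight is at most d. With the ancestors' weights in
    increasing order, this is precisely a predecessor query."""
    return predecessor(ancestor_weights, d)
```
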

    SWiM: Secure Wildcard Pattern Matching From OT Extension

    Get PDF
    Suppose a server holds a long text string and a receiver holds a short pattern string. Secure pattern matching allows the receiver to learn the locations in the long text where the pattern appears, while leaking nothing else to either party besides the length of their inputs. In this work we consider secure wildcard pattern matching (WPM), where the receiver's pattern is allowed to contain wildcards that match any character. We present SWiM, a simple and fast protocol for WPM that is heavily based on oblivious transfer (OT) extension. As such, the protocol requires only a small constant number of public-key operations and otherwise uses only very fast symmetric-key primitives. SWiM is secure against semi-honest adversaries. We implemented a prototype of our protocol to demonstrate its practicality. We can perform WPM on a DNA text (4-character alphabet) of length $10^5$ and a pattern of length $10^3$ in just over 2 seconds, which is over two orders of magnitude faster than the state-of-the-art scheme of Baron et al. (SCN 2012).

    Space-efficient data structures for string searching and retrieval

    Get PDF
    Let D = {d_1, d_2, ...} be a collection of string documents of n characters in total, which are drawn from an alphabet set Sigma = [sigma] = {1, 2, 3, ..., sigma}. The top-k document retrieval problem is to maintain D as a data structure, such that whenever a query Q = (P, k) arrives, we can report (the identifiers of) those k documents that are most relevant to the pattern P (of p characters). The relevance of a document d_r with respect to a pattern P is captured by score(P, d_r), which can be any function of the set of locations where P occurs in d_r. Finding the documents most relevant to the user query is the central task of any web-search engine. In the case of web data, the documents can be demarcated along word boundaries. All search engines use the inverted index as the backbone data structure. For each word occurring in the document collection, the inverted index stores the list of documents where it appears. It is often augmented with relevance scores and/or positional information. However, when the data consists of strings (e.g., in bioinformatics or Asian language texts), there are no word demarcation boundaries and the queries are arbitrary substrings instead of proper valid words. In this case, string data structures have to be used, and the central approach is to use a suffix tree (or string B-tree) with appropriate augmenting data structures. The work by Hon, Shah and Vitter [FOCS 2009], and Navarro and Nekrich [SODA 2012], resulted in a linear-space data structure with an optimal O(p+k) query time for this problem, based on a geometric interpretation of the query. We extend this central problem to two important settings for massive data sets. First, we consider an external-memory, disk-based index, for which we give near-optimal results. Next, we consider compression aspects of the data structure, reducing the storage space. This is a central goal of the active research field of succinct data structures. We present several results that improve upon previous work and are currently the best known space-time trade-offs in this area.
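    The word-based inverted index described above can be sketched in a few lines; this toy uses term frequency as one simple choice of score(P, d_r) and only handles whole words, precisely the limitation that motivates the suffix-tree-based structures for arbitrary substring patterns:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to its postings, sorted by descending relevance;
    here term frequency serves as the score function."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(docs):
        for word in text.split():
            counts[word][doc_id] += 1
    return {w: sorted(((s, d) for d, s in postings.items()), reverse=True)
            for w, postings in counts.items()}

def top_k(index, word, k):
    """Identifiers of the k documents most relevant to `word`."""
    return [d for s, d in index.get(word, [])[:k]]
```
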

    Computational Methods for Gene Expression and Genomic Sequence Analysis

    Get PDF
    Advances in technology currently produce ever more cost-effective, high-throughput, and large-scale biological data. As a result, there is an urgent need for efficient computational methods for analyzing these massive data. In this dissertation, we introduce methods to address several important issues in gene expression and genomic sequence analysis, two of the most important areas in bioinformatics. Firstly, we introduce a novel approach to predicting patterns of gene response to multiple treatments in the case of small sample size. Researchers are increasingly interested in experiments with many treatments, such as chemical compounds or drug doses. However, due to cost, many experiments do not have large enough samples, making it difficult for conventional methods to predict patterns of gene response. Here we introduce an approach which exploited dependencies among pairwise comparison outcomes and resampling techniques to predict true patterns of gene response in the case of insufficient samples. This approach deduced more and better functionally enriched gene clusters than conventional methods. Our approach is therefore useful for multiple-treatment studies which have small sample sizes or contain highly variably expressed genes. Secondly, we introduce a novel method for aligning short reads, which are DNA fragments extracted across the genomes of individuals, to reference genomes. Results from short read alignment can be used for many studies, such as measuring gene expression or detecting genetic variants. Here we introduce a method which employed an iterated randomized algorithm based on the FM-index, an efficient data structure for full-text indexing, to align reads to the reference. This method improved alignment performance across a wide range of read lengths and error rates compared to several popular methods, making it a good choice for the community to perform short read alignment. Finally, we introduce a novel approach to detecting genetic variants such as SNPs (single nucleotide polymorphisms) or INDELs (insertions/deletions). This task has great significance in a wide range of areas, from bioinformatics and genetic research to the medical field. For example, one can predict how genomic changes are related to phenotype in an organism of interest, or associate genetic changes with disease risk or medical treatment efficacy. Here we introduce a method which leveraged known genetic variants in well-established databases to improve the accuracy of detecting variants. This method had higher accuracy than several state-of-the-art methods in many cases, especially for detecting INDELs. Our method therefore has the potential to be useful in research and clinical applications which rely on identifying genetic variants accurately.
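    The FM-index mentioned above finds exact occurrences of a read in the reference via backward search over the Burrows-Wheeler transform. The sketch below shows only that standard building block, not the dissertation's iterated randomized aligner, and uses naive suffix-array construction for brevity (a real index would use linear-time construction and compressed rank structures):

```python
def bwt_index(text):
    """Build what backward search needs from `text` (sentinel '$' appended):
    the suffix array, the BWT-derived C array (count of smaller characters),
    and occ tables (prefix counts of each character in the BWT)."""
    text += '$'
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = ''.join(text[i - 1] for i in sa)  # char preceding each sorted suffix
    chars = sorted(set(bwt))
    C, total = {}, 0
    for c in chars:                 # C[c] = #characters strictly smaller than c
        C[c] = total
        total += bwt.count(c)
    occ = {c: [0] for c in chars}   # occ[c][i] = #occurrences of c in bwt[:i]
    for ch in bwt:
        for c in chars:
            occ[c].append(occ[c][-1] + (ch == c))
    return sa, C, occ

def backward_search(pattern, sa, C, occ):
    """Sorted text positions of `pattern`, matched right to left."""
    lo, hi = 0, len(sa)
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```
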

    Advanced rank/select data structures: succinctness, bounds and applications.

    Get PDF
    The thesis explores new theoretical results and applications of rank and select data structures. Given a string, select(c, i) gives the position of the i-th occurrence of character c in the string, while rank(c, p) counts the number of instances of character c to the left of position p. Succinct rank/select data structures are space-efficient versions of standard ones, designed to keep data compressed and at the same time answer queries rapidly. They are at the basis of more involved compressed and succinct data structures, which in turn are motivated by the present-day need to analyze and operate on massive data sets quickly, where space efficiency is crucial. The thesis builds upon the state of the art left by years of study and produces results on multiple fronts. Analyzing binary succinct data structures and their link with predecessor data structures, we integrate data structures for the latter problem into the former. The result is a data structure which outperforms that of Patrascu '08 in a range of cases which were not studied before, namely when the lower bound for predecessor search does not apply and constant-time rank is not feasible. Further, we propose the first lower bound for succinct data structures on generic strings, achieving a linear trade-off between the time for rank/select execution and the additional space (w.r.t. the plain data) needed by the data structure. The proposal addresses systematic data structures, namely those that only access the underlying string through ADT calls and do not encode it directly. We also propose a matching upper bound that proves the tightness of our lower bound. Finally, we apply rank/select data structures to the substring counting problem, where we seek to preprocess a text and generate a summary data structure which is stored in lieu of the text and answers substring counting queries with additive error. The results include a provably optimal data structure with generic additive error, and a data structure that errs only on infrequent patterns, with significant practical space gains.
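    The query semantics defined at the start of this abstract can be pinned down with a naive reference implementation; succinct structures answer the same queries in constant or near-constant time using compressed space, whereas these scans take time linear in the string:

```python
def rank(s, c, p):
    """rank(c, p): number of occurrences of character c in s[:p],
    i.e. strictly to the left of position p."""
    return s[:p].count(c)

def select(s, c, i):
    """select(c, i): position of the i-th occurrence (1-based) of c in s,
    or None if c occurs fewer than i times."""
    count = 0
    for pos, ch in enumerate(s):
        if ch == c:
            count += 1
            if count == i:
                return pos
    return None
```

    Note the two queries are inverses in the usual sense: rank(c, select(c, i)) = i - 1 under these left-exclusive conventions.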

    35th Symposium on Theoretical Aspects of Computer Science: STACS 2018, February 28-March 3, 2018, Caen, France

    Get PDF

    LIPIcs, Volume 248, ISAAC 2022, Complete Volume

    Get PDF
    LIPIcs, Volume 248, ISAAC 2022, Complete Volume