491 research outputs found
Efficient Matching of Substrings in Uncertain Sequences
Substring matching is fundamental to data mining methods for sequential data. It involves checking the existence of a short subsequence within a longer sequence, ensuring no gaps within a match. Whilst a large amount of existing work has focused on substring matching and mining techniques for certain sequences, there are only a few results for uncertain sequences. Uncertain sequences provide powerful representations for modelling sequence behavioural characteristics in emerging domains, such as bioinformatics, sensor streams and trajectory analysis. In this paper, we focus on the core problem of computing substring matching probability in uncertain sequences and propose an efficient dynamic programming algorithm for this task. We demonstrate that our approach is both theoretically competitive and experimentally effective and scalable. Our results contribute towards a foundation for adapting classic sequence mining methods to deal with uncertain data.
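To make the notion of substring matching probability concrete, the following Python sketch computes the probability that a deterministic pattern occurs at least once in an uncertain sequence. It assumes each position is an independent distribution over characters (a dict from character to probability) and uses a textbook KMP-style automaton with a dynamic program over its states. The function names `build_automaton` and `match_probability` are illustrative, and this is not necessarily the algorithm proposed in the paper.

```python
from typing import Dict, List


def build_automaton(pattern: str, alphabet: str) -> List[Dict[str, int]]:
    """Return delta[state][ch]: the length of the longest prefix of
    `pattern` that is a suffix of pattern[:state] + ch."""
    m = len(pattern)
    delta: List[Dict[str, int]] = []
    for state in range(m):
        row: Dict[str, int] = {}
        for ch in alphabet:
            candidate = pattern[:state] + ch
            k = min(m, len(candidate))
            while k > 0 and candidate[-k:] != pattern[:k]:
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def match_probability(pattern: str, sequence: List[Dict[str, float]]) -> float:
    """Probability that `pattern` occurs as a contiguous substring of the
    uncertain `sequence` (assumption: positions are independent)."""
    if not pattern:
        return 1.0
    alphabet = "".join(sorted({ch for dist in sequence for ch in dist} | set(pattern)))
    delta = build_automaton(pattern, alphabet)
    m = len(pattern)
    # alive[s] = probability of being in automaton state s with no match so far
    alive = [0.0] * m
    alive[0] = 1.0
    matched = 0.0
    for dist in sequence:
        nxt = [0.0] * m
        for state, p in enumerate(alive):
            if p == 0.0:
                continue
            for ch, q in dist.items():
                target = delta[state][ch]
                if target == m:
                    matched += p * q  # first full match; absorb this mass
                else:
                    nxt[target] += p * q
        alive = nxt
    return matched
```

For example, with three positions that are each `a` or `b` with probability 0.5, `match_probability("aa", ...)` counts the worlds `aaa`, `aab` and `baa`, which is 3 of 8 equally likely outcomes.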
Highly Scalable Algorithms for Robust String Barcoding
String barcoding is a recently introduced technique for genomic-based
identification of microorganisms. In this paper we describe the engineering of
highly scalable algorithms for robust string barcoding. Our methods enable
distinguisher selection based on whole genomic sequences of hundreds of
microorganisms of up to bacterial size on a well-equipped workstation, and can
be easily parallelized to further extend the applicability range to thousands
of bacterial size genomes. Experimental results on both randomly generated and
NCBI genomic data show that whole-genome based selection results in a number of
distinguishers nearly matching the information theoretic lower bounds for the
problem
String Searching with Ranking Constraints and Uncertainty
Strings play an important role in many areas of computer science. Searching for a pattern in a string or string collection is one of the most classic problems. Different variations of this problem, such as document retrieval, ranked document retrieval and dictionary matching, have been well studied. The enormous growth of the internet, large genomic projects, sensor networks and digital libraries necessitates not just efficient algorithms and data structures for general string indexing, but also indexes for texts with fuzzy information and support for queries with different constraints. This dissertation addresses some of these problems and proposes indexing solutions. One such variation is the document retrieval query for included and excluded/forbidden patterns, where the objective is to retrieve all the relevant documents that contain the included patterns and do not contain the excluded patterns. We continue the previous work done on this problem and propose a more efficient solution. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario where the query consists of more than two patterns. The forbidden pattern problem suffers from the drawback that linear space (in words) solutions are unlikely to yield a solution better than O(sqrt(n/occ)) per document reporting time, where n is the total length of the documents and occ is the number of output documents. Continuing this path, we introduce a new variation, namely document retrieval with forbidden extension query, where the forbidden pattern is an extension of the included pattern. We also address the more general top-k version of the problem, which retrieves the top k documents, where the ranking is based on the PageRank relevance metric. This problem is motivated by search applications. It also holds theoretical interest, as we show that the hardness of the forbidden pattern problem is alleviated in this problem. We achieve linear space and optimal query time for this variation.
We also propose succinct indexes for both these problems. Position-restricted pattern matching considers the scenario where only part of the text is searched. We propose a succinct index for this problem with efficient query time. An important application for this problem stems from searching in genomic sequences, where only part of the gene sequence is searched for interesting patterns. The problem of computing discriminating (resp. generic) words is to report all minimal (resp. maximal) extensions of a query pattern which are contained in at most (resp. at least) a given number of documents. These problems are motivated by applications in computational biology, text mining and automated text classification. We propose succinct indexes for these problems. Strings with uncertainty and fuzzy information play an important role in increasingly many applications. We propose a general framework for indexing uncertain strings such that a deterministic query string can be searched efficiently. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each position of the string can have different probable characters, with an associated probability of occurrence for each character. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring and automatic ECG annotations. We consider two basic problems of string searching, namely substring searching and string listing. We formulate these well-known problems for the uncertain string paradigm and propose exact and approximate solutions for them. We also discuss a constrained variation of orthogonal range searching. Given a set of points, the task of orthogonal range searching is to build a data structure such that all the points inside an orthogonal query region can be reported. We introduce a new variation, namely shared constraint range searching, which naturally arises in constrained pattern matching applications.
Shared constraint range searching is a special four-sided range reporting query problem in which two of the constraints are shared, effectively reducing the number of independent constraints. For this problem, we propose a linear space index that can match the best known bound for the three-dimensional dominance reporting problem. We extend our data structure to the external memory model
Efficient Indexing for Structured and Unstructured Data
The collection of digital data is growing at an exponential rate. Data originates from a wide range of sources such as text feeds, biological sequencers, internet traffic over routers, sensors and many others. To mine intelligent information from these sources, users have to query the data. Indexing techniques aim to reduce the query time by preprocessing the data. The diversity of data sources in the real world makes it imperative to develop application-specific indexing solutions based on the data to be queried. Data can be structured (e.g., relational tables) or unstructured (e.g., free text). Moreover, increasingly many applications need to seamlessly analyze both kinds of data, making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data, etc. Probabilistic models have been proposed recently for this purpose. These models are also useful for applications where uncertainty is inherent in the data, e.g. sensor networks. This dissertation aims to propose efficient indexing solutions for several problems that lie at the intersection of databases and information retrieval, such as joining ranked inputs and full-text document search. Other well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, the worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated by thorough experimentation
Artificial Sequences and Complexity Measures
In this paper we exploit concepts of information theory to address the
fundamental problem of identifying and defining the most suitable tools to
extract, in an automatic and agnostic way, information from a generic string of
characters. We introduce in particular a class of methods which use in a
crucial way data compression techniques in order to define a measure of
remoteness and distance between pairs of sequences of characters (e.g. texts)
based on their relative information content. We also discuss in detail how
specific features of data compression techniques could be used to introduce the
notion of dictionary of a given sequence and of Artificial Text and we show how
these new tools can be used for information extraction purposes. We point out
the versatility and generality of our method that applies to any kind of
corpora of character strings independently of the type of coding behind them.
We consider as a case study linguistically motivated problems and we present
results for automatic language recognition, authorship attribution and
self-consistent classification.
Comment: Revised version, with major changes, of previous "Data Compression
approach to Information Extraction and Classification" by A. Baronchelli and
V. Loreto. 15 pages; 5 figures
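As a rough illustration of how a compressor can act as a measure of remoteness between texts, the Python sketch below appends a short fragment to a reference text and measures how many extra compressed bytes the fragment costs. A fragment that resembles the reference should cost less. The helper names (`compressed_size`, `extra_cost`, `closest_reference`) and the choice of `zlib` are assumptions made for this example; the paper's own definitions and compressor may differ.

```python
import zlib
from typing import Dict


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: str, fragment: str) -> int:
    """Extra compressed bytes needed to encode `fragment` after `reference`.

    A small value suggests the compressor could reuse material learned
    from the reference, i.e. the two texts are statistically close.
    """
    ref = reference.encode("utf-8")
    return compressed_size(ref + fragment.encode("utf-8")) - compressed_size(ref)


def closest_reference(references: Dict[str, str], fragment: str) -> str:
    """Name of the reference text for which the fragment is cheapest to encode."""
    return min(references, key=lambda name: extra_cost(references[name], fragment))
```

Used for language recognition, `references` would map a language name to a sample text in that language, and `closest_reference` would return the language whose sample makes the unknown fragment cheapest to encode. Very short references give noisy results because the compressor has little to learn from.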
Virus discovery using current and novel methods
Next Generation Sequencing (NGS) technology allows researchers to sequence genetic material from a wide range of sources, including patient and environmental samples, and ancient remains. The recovery of viruses from such datasets can provide insights into the diversity and evolution of both novel and already known viruses. This thesis focuses on two aspects of virus discovery in NGS datasets.
In the first part of this thesis, I present ancient viral sequences from hepatitis B virus, human parvovirus B19, and variola virus. The sequences were recovered from NGS datasets from individuals living in Eurasia between ∼150 and ∼31,630 years ago, using standard sequence matching tools. The data show the past existence of viruses similar to variants circulating today. The sequences reveal a complexity of virus evolution that is not evident when considering modern sequences alone, including revised substitution rates and most recent common ancestor dates, as well as geographic movement and extinction of strains.
The identification of viral sequences in NGS datasets relies heavily on sequence-based matching of unknown sequences to a database of known sequences. Comparisons are usually done at the nucleotide or amino acid level. However, those methods only work well on sequences closely related to those already present in the database. With the aim of identifying more diverged viral sequences, in the second part of this thesis, I present an algorithm to compare sequences based on predicted structural features, such as secondary structures and conserved amino acids. The algorithm is modelled after the music-matching algorithm ‘Shazam’. While initial results of the algorithm are somewhat encouraging, problems remain, in particular with the identification of adequate structural features. Identifying highly diverged viral sequences is thus still a challenging problem, hopefully to be solved in the future
Order-Preserving Pattern Matching Indeterminate Strings
Given an indeterminate string pattern p and an indeterminate string text t, the problem of order-preserving pattern matching with character uncertainties (muOPPM) is to find all substrings of t that satisfy one of the possible orderings defined by p. When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem, with diverse applications in time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions. In this paper, we provide the first polynomial-time algorithms to answer the muOPPM problem when: 1) indetermination is observed on the pattern or text; and 2) indetermination is observed on both the pattern and the text and given by uncertainties between pairs of characters. First, given two strings with the same length m and O(r) uncertain characters per string position, we show that the muOPPM problem can be solved in O(mr lg r) time when one string is indeterminate and r in N^+, and in O(m^2) time when both strings are indeterminate and r=2. Second, given an indeterminate text string of length n, we show that muOPPM can be efficiently solved in polynomial time and linear space
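The problem statement can be illustrated with a small Python sketch. It assumes numeric sequences and treats two windows as order-equivalent when their dense rank encodings coincide. The indeterminate-text variant simply enumerates every choice of candidate values per position, which is exponential brute force and only meant to show the definition; it is not the polynomial-time method of the paper. All function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence


def rank_code(values: Sequence[float]) -> List[int]:
    """Dense ranks of the values: equal codes mean the same relative order."""
    position = {v: i for i, v in enumerate(sorted(set(values)))}
    return [position[v] for v in values]


def oppm(pattern: Sequence[float], text: Sequence[float]) -> List[int]:
    """Start indices of windows of `text` that are order-isomorphic to `pattern`."""
    m = len(pattern)
    target = rank_code(pattern)
    return [i for i in range(len(text) - m + 1)
            if rank_code(text[i:i + m]) == target]


def oppm_indeterminate_text(pattern: Sequence[float],
                            text: Sequence[Sequence[float]]) -> List[int]:
    """Same query when each text position holds several candidate values.

    A window matches if at least one assignment of candidates reproduces
    the pattern's order (brute force over all assignments).
    """
    m = len(pattern)
    target = rank_code(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if any(rank_code(choice) == target for choice in product(*window)):
            hits.append(i)
    return hits
```

For instance, the pattern `[1, 3, 2]` (low, high, middle) matches the text `[10, 20, 15, 30, 25, 40]` at positions 0 and 2, since `[10, 20, 15]` and `[15, 30, 25]` follow the same relative order.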