Search CORE

2,358 research outputs found

Dictionary Matching with One Gap

Author: A. Amir
A. Amir
A. Amir
A. Amir
A.V. Aho
E. Ukkonen
E.M. McCreight
G. Kucherov
G. Myers
G. Myers
G. Navarro
G. Navarro
G.S. Brodal
J.C. Naa
K. Fredriksson
M. Morgante
M. Zhang
M.S. Rahman
P. Bille
T. Haapasalo
Publication venue
Publication date: 01/01/2014
Field of study

The dictionary matching with gaps problem is to preprocess a dictionary

D

d

gapped patterns

P_1,\ldots,P_d

over alphabet

\Sigma

, where each gapped pattern

P_i

is a sequence of subpatterns separated by bounded sequences of don't cares. Then, given a query text

T

of length

n

over alphabet

\Sigma

, the goal is to output all locations in

T

in which a pattern

P_i\in D

1\leq i\leq d

, ends. There is a renewed current interest in the gapped matching problem stemming from cyber security. In this paper we solve the problem where all patterns in the dictionary have one gap with at least

\alpha

and at most

\beta

don't cares, where

\alpha

and

\beta

are given parameters. Specifically, we show that the dictionary matching with a single gap problem can be solved in either

O(d\log d + |D|)

time and

O(d\log^{\varepsilon} d + |D|)

space, and query time

O(n(\beta -\alpha )\log\log d \log ^2 \min \{ d, \log |D| \} + occ)

, where

occ

is the number of patterns found, or preprocessing time and space:

O(d^2 + |D|)

, and query time

O(n(\beta -\alpha ) + occ)

, where

occ

is the number of patterns found. As far as we know, this is the best solution for this setting of the problem, where many overlaps may exist in the dictionary.Comment: A preliminary version was published at CPM 201

arXiv.org e-Print Archive

Crossref

Fast Indexes for Gapped Pattern Matching

Author: D Knuth
G Navarro
J Bader
K Fredriksson
M Crochemore
M Lewenstein
M Morgante
P Bille
P Bille
Philip Bille
R Saikkonen
SP Pissis
T Crawford
T Haapasalo
U Manber
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 28/02/2020
Field of study

We describe indexes for searching large data sets for variable-length-gapped (VLG) patterns. VLG patterns are composed of two or more subpatterns, between each adjacent pair of which is a gap-constraint specifying upper and lower bounds on the distance allowed between subpatterns. VLG patterns have numerous applications in computational biology (motif search), information retrieval (e.g., for language models, snippet generation, machine translation) and capture a useful subclass of the regular expressions commonly used in practice for searching source code. Our best approach provides search speeds several times faster than prior art across a broad range of patterns and texts.Comment: This research is supported by Academy of Finland through grant 319454 and has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

arXiv.org e-Print Archive

Crossref

New Computational Approaches For Multiple Rna Alignment And Rna Search

Author: DeBlasio Daniel
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/01/2009
Field of study

In this thesis we explore the the theory and history behind RNA alignment. Normal sequence alignments as studied by computer scientists can be completed in O(n2) time in the naive case. The process involves taking two input sequences and finding the list of edits that can transform one sequence into the other. This process is applied to biology in many forms, such as the creation of multiple alignments and the search of genomic sequences. When you take into account the RNA sequence structure the problem becomes even harder. Multiple RNA structure alignment is particularly challenging because covarying mutations make sequence information alone insufficient. Existing tools for multiple RNA alignments first generate pair-wise RNA structure alignments and then build the multiple alignment using only the sequence information. Here we present PMFastR, an algorithm which iteratively uses a sequence-structure alignment procedure to build a multiple RNA structure alignment. PMFastR also has low memory consumption allowing for the alignment of large sequences such as 16S and 23S rRNA. Specifically, we reduce the memory consumption to ∼O(band2 ∗ m) where band is the banding size. Other solutions are ∼ O(n2 ∗ m) where n and m are the lengths of the target and query respectively. The algorithm also provides a method to utilize a multi-core environment. We present results on benchmark data sets from BRAliBase, which shows PMFastR outperforms other state-of-the-art programs. Furthermore, we regenerate 607 Rfam seed alignments and show that our automated process creates similar multiple alignments to the manually-curated Rfam seed alignments. While these methods can also be applied directly to genome sequence search, the abundance of new multiple species genome alignments presents a new area for exploration. Many multiple alignments of whole genomes are available and these alignments keep growing in size. These alignments can provide more information to the searcher than just a single sequence. Using the methodology from sequence-structure alignment we developed AlnAlign, which searches an entire genome alignment using RNA sequence structure. While programs have been readily available to align alignments, this is the first to our knowledge that is specifically designed for RNA sequences. This algorithm is presented only in theory and is yet to be tested

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Functionally specified protein signatures distinctive for each of the different blue copper proteins

Author: Anishetty Sharmila
Gautam Pennathur
Giri Anuradha Vivekanandan
Publication venue: BioMed Central
Publication date: 01/09/2004
Field of study

BACKGROUND: Proteins having similar functions from different sources can be identified by the occurrence in their sequences, a conserved cluster of amino acids referred to as pattern, motif, signature or fingerprint. The wide usage of protein sequence analysis in par with the growth of databases signifies the importance of using patterns or signatures to retrieve out related sequences. Blue copper proteins are found in the electron transport chain of prokaryotes and eukaryotes. The signatures already existing in the databases like the type 1 copper blue, multiple copper oxidase, cyt b/b6, photosystem 1 psaA&B, psaG&K, and reiske iron sulphur protein are not specified signatures for blue copper proteins as the name itself suggests. Most profile and motif databases strive to classify protein sequences into a broad spectrum of protein families. This work describes the signatures designed based on the copper metal binding motifs in blue copper proteins. The common feature in all blue copper proteins is a trigonal planar arrangement of two nitrogen ligands [each from histidine] and one sulphur containing thiolate ligand [from cysteine], with strong interactions between the copper center and these ligands. RESULTS: Sequences that share such conserved motifs are crucial to the structure or function of the protein and this could provide a signature of family membership. The blue copper proteins chosen for the study were plantacyanin, plastocyanin, cucumber basic protein, stellacyanin, dicyanin, umecyanin, uclacyanin, cusacyanin, rusticyanin, sulfocyanin, halocyanin, azurin, pseudoazurin, amicyanin and nitrite reductase which were identified in both eukaryotes and prokaryotes. ClustalW analysis of the protein sequences of each of the blue copper proteins was the basis for designing protein signatures or peptides. The protein signatures and peptides identified in this study were designed involving the active site region involving the amino acids bound to the copper atom. It was highly specific for each kind of blue copper protein and the false picks were minimized. The set of signatures designed specifically for the BCP's was entirely different from the existing broad spectrum signatures as mentioned in the background section. CONCLUSIONS: These signatures can be very useful for the annotation of uncharacterized proteins and highly specific to retrieve blue copper protein sequences of interest from the non redundant databases containing a large deposition of protein sequences

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Programmable Hardware Accelerator For Regular Expression Queries

Author: N/A
Publication venue: Technical Disclosure Commons
Publication date: 29/10/2020
Field of study

Regular expression (regex) queries are used extensively in data analytics applications. Hardware-based regex searches can make searches efficient across a variety of application domains. However, hardware accelerators that can support arbitrary regular expressions are currently infeasible due to the very large number of possible states and state transitions. This disclosure describes techniques to map an input regular expression to a non-deterministic finite automaton and hardware such as FPGA or ASIC that can be programmed to filter an input data stream to search for arbitrary (customer-given) regular expressions

Technical Disclosure Common

Order-Preserving Pattern Matching Indeterminate Strings

Author: Bannai Hideo
Francisco Alexandre P.
Henriques Rui
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Annual Symposium on Combinatorial Pattern Matching (CPM 2018)
Publication date: 01/01/2018
Field of study

Given an indeterminate string pattern p and an indeterminate string text t, the problem of order-preserving pattern matching with character uncertainties (muOPPM) is to find all substrings of t that satisfy one of the possible orderings defined by p. When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem with diverse applications on time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions. In this paper, we provide the first polynomial algorithms to answer the muOPPM problem when: 1) indetermination is observed on the pattern or text; and 2) indetermination is observed on both the pattern and the text and given by uncertainties between pairs of characters. First, given two strings with the same length m and O(r) uncertain characters per string position, we show that the muOPPM problem can be solved in O(mr lg r) time when one string is indeterminate and r in N^+ and in O(m^2) time when both strings are indeterminate and r=2. Second, given an indeterminate text string of length n, we show that muOPPM can be efficiently solved in polynomial time and linear space

Dagstuhl Research Online Publication Server

Dagstuhl Reports : Volume 1, Issue 2, February 2011

Author: Schloss Dagstuhl Leibniz-Zentrum für Informatik
Publication venue
Publication date: 09/09/2011
Field of study

Online Privacy: Towards Informational Self-Determination on the Internet (Dagstuhl Perspectives Workshop 11061) : Simone Fischer-Hübner, Chris Hoofnagle, Kai Rannenberg, Michael Waidner, Ioannis Krontiris and Michael Marhöfer Self-Repairing Programs (Dagstuhl Seminar 11062) : Mauro Pezzé, Martin C. Rinard, Westley Weimer and Andreas Zeller Theory and Applications of Graph Searching Problems (Dagstuhl Seminar 11071) : Fedor V. Fomin, Pierre Fraigniaud, Stephan Kreutzer and Dimitrios M. Thilikos Combinatorial and Algorithmic Aspects of Sequence Processing (Dagstuhl Seminar 11081) : Maxime Crochemore, Lila Kari, Mehryar Mohri and Dirk Nowotka Packing and Scheduling Algorithms for Information and Communication Services (Dagstuhl Seminar 11091) Klaus Jansen, Claire Mathieu, Hadas Shachnai and Neal E. Youn

Hochschulschriftenserver - Universität Frankfurt am Main

Multiple Biolgical Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

Author: Nguyen Ken D
Publication venue: ScholarWorks @ Georgia State University
Publication date: 14/12/2011
Field of study

Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences\u27 structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-Complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structure scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm to use in our new three multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamic weighted guidance tree to perform multiple sequence alignment in progressive fashion. The use of dynamic weighted tree allows errors in the early alignment stages to be corrected in the subsequence stages. Other two algorithms utilize sequence knowledge-bases and sequence consistency to produce biological meaningful sequence alignments. To improve the speed of the multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm

ScholarWorks @ Georgia State University