On avoided words, absent words, and their application to biological sequence analysis

Publication venue: BioMed Central

Springer - Publisher Connector

Optimal Computation of Avoided Words

Author: A Akalin
C Acquisti
C Barton
D Belazzougui
DB Searls
F Mignosi
I Rusinov
M Crochemore
P Gawrychowski
RN Mantegna
V Brendel
Publication date: 29/04/2016

The deviation of the observed frequency of a word $w$ from its expected frequency in a given sequence $x$ is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of $w$, denoted by $std(w)$, effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word $w$ of length $k>2$ is a $\rho$-avoided word in $x$ if $std(w) \leq \rho$, for a given threshold $\rho < 0$. Notice that such a word may be completely absent from $x$. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large $k$. In this article, we propose an $O(n)$-time and $O(n)$-space algorithm to compute all $\rho$-avoided words of length $k$ in a given sequence $x$ of length $n$ over a fixed-sized alphabet. We also present a time-optimal $O(\sigma n)$-time and $O(\sigma n)$-space algorithm to compute all $\rho$-avoided words (of any length) in a sequence of length $n$ over an alphabet of size $\sigma$. Furthermore, we provide a tight asymptotic upper bound for the number of $\rho$-avoided words and the expected length of the longest one. We make available an open-source implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.
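
To make the definition in this abstract concrete, here is a minimal brute-force sketch in Python. It is not the linear-time algorithm the abstract announces. The abstract does not spell out the formula, so the sketch assumes the usual model attributed to Brendel et al.: the expected frequency of w is f(prefix) × f(suffix) / f(infix), taken as 0 when the infix does not occur, and std(w) = (f(w) − E(w)) / max(√E(w), 1). The function names `factor_counts`, `std`, and `avoided_words` are illustrative only.

```python
from collections import Counter
from math import sqrt


def factor_counts(x, max_len):
    """Count occurrences (overlaps included) of every factor of x of length 1..max_len."""
    counts = Counter()
    n = len(x)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            counts[x[i:j]] += 1
    return counts


def std(w, counts):
    """Assumed deviation: (f(w) - E(w)) / max(sqrt(E(w)), 1),
    with E(w) = f(w[:-1]) * f(w[1:]) / f(w[1:-1]), or 0 if the infix is absent."""
    infix = counts[w[1:-1]]
    if infix == 0:
        # The infix never occurs, so w cannot occur either; E(w) is taken as 0.
        return 0.0
    expected = counts[w[:-1]] * counts[w[1:]] / infix
    return (counts[w] - expected) / max(sqrt(expected), 1.0)


def avoided_words(x, k, rho, alphabet=None):
    """Naively list all rho-avoided words of length k in x (k > 2, rho < 0).

    Only words whose length-(k-1) prefix occurs in x are tried: otherwise
    E(w) = 0 and std(w) cannot be negative. Words absent from x are included.
    """
    if k <= 2 or rho >= 0:
        raise ValueError("expected k > 2 and rho < 0")
    letters = sorted(set(x)) if alphabet is None else list(alphabet)
    counts = factor_counts(x, k)
    prefixes = {x[i:i + k - 1] for i in range(len(x) - k + 2)}
    return [p + c for p in sorted(prefixes) for c in letters
            if std(p + c, counts) <= rho]


if __name__ == "__main__":
    x = "ABDABDABDEBCEBCEBC"
    print(avoided_words(x, 3, -1.0))
```

In the toy sequence, `AB` and `BC` each occur three times and `B` six times, so `ABC` has expected frequency 1.5 under the assumed model even though it never occurs. This is the "completely absent" case the abstract mentions. The sketch runs in time quadratic or worse in practice, which is exactly the cost the linear-time algorithm is meant to avoid.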

arXiv.org e-Print Archive

Crossref

King's Research Portal

Fractals from genomes: exact solutions of a biology-inspired problem

Author: Bai-Lin Hao
Deckert
Gelfand
Goulden
Guibas
Hao
Jeffrey
Wolfram
Xie
Publication venue: 'Elsevier BV'
Publication date: 01/01/1999

This is a review of a set of recent papers with some new data added. After a brief biological introduction, a visualization scheme of the string composition of long DNA sequences, in particular, of bacterial complete genomes, will be described. This scheme leads to a class of self-similar and self-overlapping fractals in the limit of infinitely long constituent strings. The calculation of their exact dimensions and the counting of true and redundant avoided strings at different string lengths turn out to be one and the same problem. We give an exact solution of the problem using two independent methods: the Goulden-Jackson cluster method in combinatorics and the method of formal language theory. Comment: 24 pages, LaTeX, 5 PostScript figures (two in color), psfi
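
The counting problem mentioned in this abstract can be illustrated by brute force. The sketch below is not the Goulden-Jackson or formal-language method the abstract refers to. It assumes one reading of the terminology: an avoided string of length k is one that does not occur in the sequence; it is "redundant" if it contains a shorter avoided string, and "true" otherwise. The function names and the default alphabet are hypothetical.

```python
from itertools import product


def absent_strings(x, k, alphabet="ACGT"):
    """All length-k strings over the alphabet that never occur in x."""
    present = {x[i:i + k] for i in range(len(x) - k + 1)}
    words = ("".join(t) for t in product(alphabet, repeat=k))
    return [w for w in words if w not in present]


def classify_absent(x, k, alphabet="ACGT"):
    """Split absent length-k strings into (true, redundant) under the assumed reading.

    Every proper factor of w lies inside w[:-1] or w[1:], so w contains a
    shorter absent string exactly when one of those two factors is absent.
    """
    true_avoided, redundant = [], []
    for w in absent_strings(x, k, alphabet):
        if k > 1 and (w[:-1] not in x or w[1:] not in x):
            redundant.append(w)
        else:
            true_avoided.append(w)
    return true_avoided, redundant


if __name__ == "__main__":
    print(classify_absent("AAB", 3, alphabet="AB"))
```

For the toy input `AAB` over the alphabet `AB`, only `AAA` is absent while all its proper factors occur; the other six absent strings of length 3 each contain an absent string of length 2 (`BA` or `BB`).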

arXiv.org e-Print Archive

CiteSeerX

Crossref

On empirical methodology, constraints, and hierarchy in artificial grammar learning

Author: Levelt W.
Publication venue: 'Wiley'
Publication date: 24/09/2019

This paper considers the AGL literature from a psycholinguistic perspective. It first presents a taxonomy of the experimental familiarization test procedures used, which is followed by a consideration of shortcomings and potential improvements of the empirical methodology. It then turns to reconsidering the issue of grammar learning from the point of view of acquiring constraints, instead of the traditional AGL approach in terms of acquiring sets of rewrite rules. This is, in particular, a natural way of handling long‐distance dependences. The final section addresses an underdeveloped issue in the AGL literature, namely how to detect latent hierarchical structure in AGL response patterns

MPG.PuRe

Patterns and Signals of Biology: An Emphasis On The Role of Post Translational Modifications in Proteomes for Function and Evolutionary Progression

Author: Bonham-Carter Oliver
Publication venue: DigitalCommons@UNO
Publication date: 01/05/2016

After synthesis, a protein is still immature until it has been customized for a specific task. Post-translational modifications (PTMs) are steps in biosynthesis that perform this customization of a protein for unique functionalities. PTMs are also important to protein survival because they rapidly enable protein adaptation to environmental stress factors by conformation change. The overarching contribution of this thesis is the construction of a computational profiling framework for the study of biological signals stemming from PTMs associated with stressed proteins. In particular, this work has been developed to predict and detect the biological mechanisms involved in types of stress response with PTMs in mitochondrial (Mt) and non-Mt protein. Before any mechanism can be studied, there must first be some evidence of its existence. This evidence takes the form of signals such as biases of biological actors and types of protein interaction. Our framework has been developed to locate these signals, distilled from “Big Data” resources such as public databases and the entire PubMed literature corpus. We apply this framework to study the signals to learn about protein stress responses involving PTMs and modification sites (MSs). We developed this framework, and its approach to analysis, according to three main facets: (1) by statistical evaluation to determine patterns of signal dominance throughout large volumes of data, (2) by signal location to track down the regions where the mechanisms must be found according to the types and numbers of associated actors at relevant regions in protein, and (3) by text mining to determine how these signals have been previously investigated by researchers. The results gained from our framework enable us to uncover the PTM actors, MSs and protein domains which are the major components of particular stress response mechanisms and may play roles in protein malfunction and disease.

The University of Nebraska, Omaha

BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction

Author: Ari Eszter
Horváth Arnold
Ittzés Péter
Jakó Éena
Podani János
Publication venue: 'Elsevier BV'
Publication date: 01/01/2009

A novel discrete mathematical approach is proposed as an additional tool for molecular systematics which does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms generating mathematical representations directly from DNA/RNA or protein sequences, followed by the output of numerical (scalar or vector) and visual characteristics (graphs). The binary encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (or ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to get an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called the Boolean analysis or BOOL-AN

Crossref

Repository of the Academy's Library

Optimal Computation of Overabundant Words

Author: Almirantis Yannis
Charalampopoulos Panagiotis
Gao Jia
Iliopoulos Costas S.
Mohamed Manal
Pissis Solon P.
Polychronopoulos Dimitris
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)
Publication date: 01/01/2017

The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n-4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server