14 research outputs found

    Fast algorithms for computing sequence distances by exhaustive substring composition

    The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information-theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes, as a paradigm, the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background.
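    The core idea, comparing the full subword (k-mer) composition of two sequences, can be sketched in a few lines. The snippet below is only an illustration under simplifying assumptions (plain frequency vectors, a cosine-style distance, no background adjustment, no suffix-tree machinery), not the paper's algorithm; the function names are invented for the example.

```python
# Illustrative sketch only: compare two DNA sequences by the composition of
# all substrings up to length max_k, via a cosine-style distance between
# frequency vectors. The paper's method additionally adjusts frequencies for
# the statistical background and enumerates all subwords efficiently.
from collections import Counter
from math import sqrt

def kmer_profile(seq, max_k=4):
    """Relative frequency of every substring of length 1..max_k in seq."""
    counts = Counter()
    for k in range(1, max_k + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def composition_distance(seq_a, seq_b, max_k=4):
    """1 - cosine similarity between the two subword frequency profiles."""
    pa, pb = kmer_profile(seq_a, max_k), kmer_profile(seq_b, max_k)
    words = set(pa) | set(pb)
    dot = sum(pa.get(w, 0.0) * pb.get(w, 0.0) for w in words)
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(sum(v * v for v in pb.values()))
    return 1.0 - dot / norm if norm else 1.0

print(composition_distance("ACGTACGTGGCC", "ACGTTTGGCCAA"))
```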

    A reexamination of information theory-based methods for DNA-binding site identification

    Background: Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods. Results: Our results reveal that conventional benchmarking against artificial sequence data frequently leads to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and therefore must be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions about information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results. Conclusion: We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than any other alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
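    The information content the abstract refers to can be illustrated with a small computation. The sketch below assumes the classic per-column measure for a binding-site motif: the relative entropy of the observed base frequencies against a background distribution, which reduces to 2 + sum_b f_b log2 f_b bits when the background is uniform. The column frequencies and the function name are made up for the example, not taken from the paper's data.

```python
# Hedged illustration: information content (in bits) of one column of a
# DNA binding-site motif, computed as the relative entropy between the
# observed base frequencies and a background distribution.
from math import log2

def column_information(freqs, background=None):
    """Relative entropy (bits) of one motif column versus the background."""
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # uniform-genome assumption
    return sum(f * log2(f / background[b]) for b, f in freqs.items() if f > 0)

# Toy column: a strongly conserved A.
column = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}
print(column_information(column))  # against a uniform background
# The "Relative Entropy" variant discussed in the abstract plugs in the
# skewed composition of the actual genome instead of the uniform background:
print(column_information(column, {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}))
```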

    Non-random pre-transcriptional evolution in HIV-1. A refutation of the foundational conditions for neutral evolution

    The complete base sequence of the HIV-1 virus and of the GP120 ENV gene were analyzed to establish their distance from the expected neutral random sequence. A special methodology was devised to achieve this aim. The analyses included: a) the proportion of dinucleotides (signatures); b) the homogeneity of the distribution of dinucleotides and bases (isochores), obtained by dividing the two segments into ten and three sub-segments, respectively; c) the probability of runs of bases and No-bases according to the Bose-Einstein distribution. The analyses showed a huge deviation from the random distribution expected under neutral evolution and neutral-neighbor influence of nucleotide sites. The most significant result is the severe depletion of CG dinucleotides (p < 10^-50), a selective trait of eukaryotes and not of single-stranded RNA virus genomes. These results not only refute neutral evolution and neutral-neighbor influence, but also strongly indicate that any base at any nucleotide site correlates with the whole viral genome or its sub-segments. They suggest that the evolution of HIV-1 is pan-selective rather than neutral or nearly neutral.
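    The dinucleotide "signature" analysis mentioned above can be sketched as an observed-versus-expected odds ratio per dinucleotide, where a ratio well below 1 for CG indicates depletion relative to the random expectation from the base composition. The code below is only an illustration of that general statistic on a toy sequence; it does not reproduce the paper's methodology or its Bose-Einstein run analysis.

```python
# Illustrative sketch: dinucleotide odds ratios rho(XY) = f(XY) / (f(X) * f(Y)).
# Values well below 1 (classically for CG) indicate depletion relative to what
# the single-base composition would predict.
from collections import Counter

def dinucleotide_odds(seq):
    seq = seq.upper()
    n = len(seq)
    base_freq = {b: c / n for b, c in Counter(seq).items()}
    dinuc = Counter(seq[i:i + 2] for i in range(n - 1))
    total = n - 1
    return {d: (c / total) / (base_freq[d[0]] * base_freq[d[1]])
            for d, c in dinuc.items()}

# Toy sequence standing in for a real HIV-1 segment.
rho = dinucleotide_odds("ATGACCGTTAACGGTACTGATCATCGGA" * 4)
for d in ("CG", "GC", "TA"):
    print(d, round(rho.get(d, 0.0), 3))
```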

    Mind the Gap: Transitions Between Concepts of Information in Varied Domains

    The concept of 'information' in five different realms – technological, physical, biological, social and philosophical – is briefly examined. The 'gaps' between these conceptions are discussed, and unifying frameworks of diverse nature, including those of Shannon/Wiener, Landauer, Stonier, Bates and Floridi, are examined. The value of attempting to bridge the gaps, while avoiding shallow analogies, is explained. With information physics gaining general acceptance, and biology gaining the status of an information science, it seems rational to look for links, relationships, analogies and even helpful metaphors between them and the library/information sciences. Prospects for doing so, involving concepts of complexity and emergence, are suggested.

    Information theory and the ethylene genetic network

    The original aim of Information Theory (IT) was to solve a purely technical problem: to increase the performance of communication systems, which are constantly affected by interferences that diminish the quality of the transmitted information. That is, the theory deals only with the problem of transmitting the symbols constituting a message with maximal precision. In Shannon's theory, messages are characterized only by their probabilities, regardless of their value or meaning. As for its present-day status, it is generally acknowledged that Information Theory has solid mathematical foundations and strong, fruitful links with Physics in both theoretical and experimental areas. However, many applications of Information Theory to Biology are limited to using it as a technical tool to analyze biopolymers, such as DNA, RNA or protein sequences. The main point of discussion about the applicability of IT to explaining the information flow in biological systems is that in a classic communication channel the symbols that make up the coded message are transmitted one by one, independently, through a noisy channel, and noise can alter each of the symbols, distorting the message; in contrast, in a genetic communication channel the coded messages are not transmitted in the form of symbols but by signaling cascades. Consequently, the information flow from the emitter to the effector is due to a series of coupled physicochemical processes that must ensure the accurate transmission of the message. In this review we discuss a novel proposal to overcome this difficulty, which consists of modeling gene expression with a stochastic approach that allows the Shannon entropy (H) to be used directly to measure the amount of uncertainty that the genetic machinery has about the correct decoding of a message transmitted into the nucleus by a signaling pathway. From the value of H we can define a function I that measures the amount of information contained in the input message that the cell's genetic machinery is processing during a given time interval. Furthermore, by combining Information Theory with the frequency-response analysis of dynamical systems, we can examine the cell's genetic response to input signals of varying frequency, amplitude and form, in order to determine whether the cell can distinguish between different regimes of information flow from the environment. In the particular case of the ethylene signaling pathway, the amount of information managed by the root cell of Arabidopsis can be correlated with the frequency of the input signal. The ethylene signaling pathway cuts off very low and very high frequencies, leaving a window of frequency response in which the nucleus reads the incoming message as a varying input; outside of this window the nucleus reads the input message as approximately non-varying. This frequency-response analysis is also useful for estimating the rate of information transfer during the transport of each new ERF1 molecule into the nucleus. Additionally, applying Information Theory to the analysis of the flow of information in the ethylene signaling pathway provides deeper insight into the way in which the transition between auxin and ethylene hormonal activity occurs during a circadian cycle. An ambitious goal for the future would be to use Information Theory as the theoretical foundation for a suitable model of the information flow that runs at each level, and through all levels, of biological organization.
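    As a concrete illustration of the quantities H and I described above, the sketch below computes the Shannon entropy of a toy distribution over gene-expression states and an information measure defined as the reduction of uncertainty relative to the maximum-entropy case. The state probabilities, the four-state discretization and the function names are assumptions made for this example; they are not the review's actual model of the ethylene pathway.

```python
# Minimal sketch, assuming a discrete set of expression states for a target
# gene (e.g. ERF1) and a probability distribution over those states.
from math import log2

def shannon_entropy(probs):
    """H = -sum(p * log2(p)) in bits: the uncertainty of the decoding step."""
    return -sum(p * log2(p) for p in probs if p > 0)

def expression_information(probs):
    """I = H_max - H: how much the input signal reduces the nucleus's uncertainty."""
    return log2(len(probs)) - shannon_entropy(probs)

# Toy distributions over four expression states of the target gene.
no_signal   = [0.25, 0.25, 0.25, 0.25]  # no ethylene input: maximal uncertainty, I = 0
with_signal = [0.85, 0.05, 0.05, 0.05]  # strong input: one state dominates, I > 0
print(expression_information(no_signal))
print(expression_information(with_signal))
```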