The Evolution of Word Composition in Metazoan Promoter Sequence
The field of molecular evolution provides many examples of the principle that molecular differences between species contain information about evolutionary history. One surprising case can be found in the frequency of short words in DNA: more closely related species have more similar word compositions. Interest in this has often focused on its utility in deducing phylogenetic relationships. However, it is also of interest because of the opportunity it provides for studying the evolution of genome function. Word-frequency differences between species change too slowly to be purely the result of random mutational drift. Rather, their slow pattern of change reflects the direct or indirect action of purifying selection and the presence of functional constraints. Many such constraints are likely to exist, and an important challenge is to distinguish them. Here we develop a method to do so by isolating the effects acting at different word sizes. We apply our method to 2-, 4-, and 8-base-pair (bp) words across several classes of noncoding sequence. Our major result is that similarities in 8-bp word frequencies scale with evolutionary time for regions immediately upstream of genes. This association is present although weaker in intronic sequence, but cannot be detected in intergenic sequence using our method. In contrast, 2-bp and 4-bp word frequencies scale with time in all classes of noncoding sequence. These results suggest that different genomic processes are involved at different word sizes. The pattern in 2-bp and 4-bp words may be due to evolutionary changes in processes such as DNA replication and repair, as has been suggested before. The pattern in 8-bp words may reflect evolutionary changes in gene-regulatory machinery, such as changes in the frequencies of transcription-factor binding sites, or in the affinity of transcription factors for particular sequences
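As a rough illustration of what "word composition" means in the abstract above, the following minimal sketch counts overlapping k-bp words and compares two profiles. The function names and the Euclidean distance are assumptions made for this example; the abstract does not specify the authors' actual statistic or method.

```python
from collections import Counter


def word_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of every overlapping k-bp word in seq."""
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid  # skip words containing ambiguous bases
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def composition_distance(freq_a: dict[str, float], freq_b: dict[str, float]) -> float:
    """Euclidean distance between two word-frequency profiles.

    The distance measure is an illustrative assumption, not the paper's method.
    """
    words = set(freq_a) | set(freq_b)
    return sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in words) ** 0.5
```

Calling `word_frequencies` with k set to 2, 4 or 8 gives profiles at the three word sizes the abstract mentions.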
The reach of the genome signature in prokaryotes
BACKGROUND: With the increased availability of sequenced genomes there have been several initiatives to infer evolutionary relationships by whole genome characteristics. One of these studies suggested good congruence between genome synteny, shared gene content, 16S ribosomal DNA identity, codon usage and the genome signature in prokaryotes. Here we rigorously test the phylogenetic signal of the genome signature, which consists of the genome-specific relative frequencies of dinucleotides, on 334 sequenced prokaryotic genomes. RESULTS: Intrageneric comparisons show that genomic dissimilarity scores are in general higher than in intraspecific comparisons, in accordance with the suggested phylogenetic signal of the genome signature. Exceptions to this trend (Bartonella spp., Bordetella spp., Salmonella spp. and Yersinia spp.), which have low average intrageneric genomic dissimilarity scores, suggest that members of these genera might be considered the same species. On the other hand, high genomic dissimilarity values for intraspecific analyses suggest that in some cases (e.g. Prochlorococcus marinus, Pseudomonas fluorescens, Buchnera aphidicola and Rhodopseudomonas palustris) different strains from the same species may actually represent different species. Comparing 16S rDNA identity with genomic dissimilarity values corroborates the previously suggested trend in phylogenetic signal, although the dissimilarity values provide only low resolution. CONCLUSION: The genome signature has a distinct phylogenetic signal, independent of individual genetic marker genes. A reliable phylogenetic clustering cannot be based on dissimilarity values alone, as bootstrapping is not possible for this parameter. It can, however, be used to support or refute a given phylogeny and resulting taxonomy
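The abstract above describes the genome signature as the relative frequencies of dinucleotides. A minimal sketch of one common way to compute such a signature and a dissimilarity score follows. The observed/expected ratio and the mean absolute difference are assumptions based on the widely used Karlin-style definition, simplified here by ignoring strand symmetrization; the abstract itself gives no formula.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def genome_signature(seq: str) -> dict[str, float]:
    """Dinucleotide relative abundance: observed f(xy) / (f(x) * f(y))."""
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_mono == 0 or n_di == 0:
        raise ValueError("sequence contains no usable dinucleotides")
    signature = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        signature[x + y] = observed / expected if expected > 0 else 0.0
    return signature


def dissimilarity(sig_a: dict[str, float], sig_b: dict[str, float]) -> float:
    """Mean absolute difference over the 16 dinucleotides (assumed score)."""
    return sum(abs(sig_a[d] - sig_b[d]) for d in sig_a) / 16
```

In practice the signature would be computed over whole genomes or long windows; very short inputs give unstable ratios.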
Sequence composition similarities with the 7SL RNA are highly predictive of functional genomic features
Transposable elements derived from the 7SL RNA gene, such as Alu elements in primates, have had remarkable success in several mammalian lineages. The results presented here show a broad spectrum of functions for genomic segments that display sequence composition similarities with the 7SL RNA gene. Using thoroughly documented loci, we report that DNaseI-hypersensitive sites can be singled out in large genomic sequences by an assessment of sequence composition similarities with the 7SL RNA gene. We apply a root word frequency approach to illustrate a distinctive relationship between the sequence of the 7SL RNA gene and several classes of functional genomic features that are not presumed to be of transposable origin. Transposable elements that show noticeable similarities with the 7SL sequence include Alu sequences, as expected, but also long terminal repeats and the 5′-untranslated regions of long interspersed repetitive elements. In sequences masked for repeated elements, we find, when using the 7SL RNA gene as query sequence, distinctive similarities with promoters, exons and distal gene regulatory regions. The latter being the most notoriously difficult to detect, this approach may be useful for finding genomic segments that have regulatory functions and that may have escaped detection by existing methods
Generalized Whittle-Matérn random field as a model of correlated fluctuations
This paper considers a generalization of the Gaussian random field with
covariance function of the Whittle-Matérn family. Such a random
field can be obtained as the solution to a fractional stochastic differential
equation with two fractional orders. Asymptotic properties of the covariance
functions belonging to this generalized Whittle-Matérn family
are studied and used to deduce the sample path properties of the random
field. The Whittle-Matérn field has been widely used in
modeling geostatistical data such as sea beam data, wind speed, field
temperature and soil data. In this article we show that the generalized
Whittle-Matérn field provides a more flexible model for wind
speed data.
Comment: 22 pages, 10 figures, accepted by Journal of Physics
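For orientation, the standard (non-generalized) Whittle-Matérn covariance and the stochastic equation that produces it can be written as below. The notation is an assumption chosen for this note, since the abstract gives no formulas: $\sigma^2$ is the variance, $\nu > 0$ the smoothness, $\kappa > 0$ an inverse length scale, $K_\nu$ the modified Bessel function of the second kind, $W$ Gaussian white noise, and $d$ the spatial dimension. The generalization in the abstract introduces a second fractional order, whose exact form is not stated here.

```latex
\begin{align}
  C(r) &= \sigma^{2}\,\frac{2^{1-\nu}}{\Gamma(\nu)}\,(\kappa r)^{\nu}\,K_{\nu}(\kappa r),
        \qquad r \ge 0, \\
  (\kappa^{2} - \Delta)^{\alpha/2}\, X(\mathbf{s}) &= W(\mathbf{s}),
        \qquad \alpha = \nu + \tfrac{d}{2},
\end{align}
```

The second line holds up to a constant factor that fixes the variance of the field.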
Estimating the Fraction of Non-Coding RNAs in Mammalian Transcriptomes
Recent studies of mammalian transcriptomes have identified numerous RNA transcripts that do not code for proteins; their identity, however, is largely unknown. Here we explore an approach based on sequence randomness patterns to discern different RNA classes. The relative z-score we use helps distinguish the known ncRNA class from the genome, intergene and intron classes. This leads us to a fractional ncRNA measure of putative ncRNA datasets, which we model as a mixture of genuine ncRNAs and other transcripts derived from genomic, intergenic and intronic sequences. We use this model to analyze six representative datasets identified by the FANTOM3 project and two computational approaches based on comparative analysis (RNAz and EvoFold). Our analysis suggests fewer ncRNAs than estimated by DNA sequencing and comparative analysis, but confirming the validity of our approach and its predictions requires more extensive experimental RNA data
Organization of Excitable Dynamics in Hierarchical Biological Networks
This study investigates the contributions of network topology features to the dynamic behavior of hierarchically organized excitable networks. Representatives of different types of hierarchical networks as well as two biological neural networks are explored with a three-state model of node activation for systematically varying levels of random background network stimulation. The results demonstrate that two principal topological aspects of hierarchical networks, node centrality and network modularity, correlate with the network activity patterns at different levels of spontaneous network activation. The approach also shows that the dynamic behavior of the cerebral cortical systems network in the cat is dominated by the network's modular organization, while the activation behavior of the cellular neuronal network of Caenorhabditis elegans is strongly influenced by hub nodes. These findings indicate the interaction of multiple topological features and dynamic states in the function of complex biological networks
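To make the "three-state model of node activation" with random background stimulation more concrete, here is a minimal sketch of one such model. The state names (susceptible, excited, refractory), the update rules, and the parameters `f` (spontaneous excitation probability) and `p` (recovery probability) are assumptions for illustration; the abstract does not spell out the exact rules used in the study.

```python
import random

S, E, R = "S", "E", "R"  # susceptible, excited, refractory (assumed labels)


def ser_step(adjacency, state, f, p, rng):
    """One synchronous update of a three-state excitable network.

    adjacency: dict mapping node -> list of neighbouring nodes
    state:     dict mapping node -> S, E or R
    f:         probability of spontaneous excitation (background stimulation)
    p:         probability that a refractory node recovers
    rng:       random.Random instance, passed in for reproducibility
    """
    new_state = {}
    for node, s in state.items():
        if s == S:
            excited_neighbor = any(state[n] == E for n in adjacency[node])
            new_state[node] = E if excited_neighbor or rng.random() < f else S
        elif s == E:
            new_state[node] = R  # an excited node always becomes refractory
        else:
            new_state[node] = S if rng.random() < p else R
    return new_state
```

Sweeping `f` from low to high values corresponds to the "systematically varying levels" of background stimulation mentioned in the abstract.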
Analysis and compression of genomic sequences (Análise e compressão de sequências genómicas)
Doctoral thesis in Informatics.
Genetics is nowadays probably the most inspiring source for coding theory study and
developments. Efficient compression algorithms are essential to optimise genomic data
storage and communication. Genomic data compression is a particular case of data
compression. The entropy of DNA sequences is high but variable. At the
intra-genomic level, it is higher in coding regions and lower in non-coding regions. At the
inter-genomic level, it is higher in prokaryotes and lower in eukaryotes. DNA
entropy is reduced by coding the repetitive regions of the DNA more
efficiently. Repetitive regions are mainly composed of inexact patterns, whose errors are
caused by biological processes and DNA dynamics, including mutations, deletions,
insertions or gaps. Exact patterns are less relevant and generally occur as tandem
repeats. DNA redundancy also has statistical and probabilistic manifestations.
The redundancies of DNA sequences are the most fruitful source of compression
resources: the larger repetitions are suited to substitutional compression based on a
dictionary, whereas the statistical and probabilistic evidence makes it possible to model and predict
the succession of symbols (bases) in the sequence, using statistical compression to
exploit this compression potential. At maximum entropy, DNA
costs 2 bits per base to encode. On average, the best available
compressors designed specifically for DNA data reach 1.7 bits/base, which
corresponds to a compression rate of only 15% and demonstrates the
inherent difficulty.
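The arithmetic behind these figures can be checked with a short sketch. A four-letter alphabet at maximum entropy costs log2(4) = 2 bits per base, and 1.7 bits/base is a saving of (2 - 1.7) / 2 = 15%. The function names below are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Empirical zero-order entropy of a sequence, in bits per symbol."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def space_saving(bits_per_base: float, baseline: float = 2.0) -> float:
    """Fraction of space saved relative to the 2 bits/base baseline."""
    return 1 - bits_per_base / baseline


uniform_cost = math.log2(4)  # 2.0 bits per base for A, C, G, T equally likely
saving = space_saving(1.7)   # about 0.15, i.e. the 15% quoted in the abstract
```

`shannon_entropy` only captures single-base statistics; the repeats and context models described in the abstract are what let real compressors go below that zero-order estimate.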
The work developed is a framework for the analysis and compression of
DNA sequences, with DNALight as its most representative application. DNALight is a
hybrid solution for DNA compression based on the cooperative integration of
complementary methodologies that absorb the different kinds of redundancy present in DNA
sequences. In fact, compression is not possible without analysis: the resources for
compression are gathered mostly through analysis, and the recurrences it uncovers make it possible to reduce
the entropy. Innovative algorithms were developed for exact pattern matching
(GRASPm) and for approximate and exact pattern discovery (SimSearch), and their
performance clearly surpasses the state of the art. These algorithms play an
important role in the first phase of DNALight, which implements substitutional compression based on a dictionary of exact and approximate repeats. To maximize
pattern capture, the search is performed at multiple levels, i.e., in the normal 5'-3' sequence,
in the natural complementary 3'-5' sequence, and also in the two remaining artificial
complementary sequences. The second phase of DNALight focuses on exploiting
the redundancies missed in the first phase: probabilistic language models
are built from the less repetitive regions, which constitute the input of this
complementary methodology. In competition, the models generate predictions
supported by the probabilistic analysis of global and local language models. Accurate or
approximate predictions allow compact encodings because they yield a more
skewed probability model for coding, which benefits the arithmetic coding
that concludes the process. Decompression is similar to compression but runs
in reverse. The experimental results place DNALight as
a new member of the state of the art in DNA sequence compression, consistently
surpassing its predecessors, though by a small margin.
Programa de Desenvolvimento Educativo para Portugal (PRODEP
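The multi-level search described in the abstract looks for repeats in four orientations of the sequence. A minimal sketch of generating those four views follows. It assumes that the "natural complementary" sequence is the reverse complement and that the two "artificial" ones are the plain reverse and the plain complement. This reading is an interpretation of the abstract, not a description of the actual GRASPm or SimSearch code.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orientations(seq: str) -> dict[str, str]:
    """Four views of a DNA sequence in which repeats could be searched."""
    seq = seq.upper()
    complement = seq.translate(COMPLEMENT)
    return {
        "forward": seq,                         # normal 5'-3' sequence
        "reverse_complement": complement[::-1], # natural complementary strand
        "complement": complement,               # assumed artificial view
        "reverse": seq[::-1],                   # assumed artificial view
    }
```

A dictionary-based compressor could then record, for each repeat, which of the four views it was found in.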