Identifying statistical dependence in genomic sequences via mutual information estimates
Questions of understanding and quantifying the representation and amount of
information in organisms have become a central part of biological research, as
they potentially hold the key to fundamental advances. In this paper, we
demonstrate the use of information-theoretic tools for the task of identifying
segments of biomolecules (DNA or RNA) that are statistically correlated. We
develop a precise and reliable methodology, based on the notion of mutual
information, for finding and extracting statistical as well as structural
dependencies. A simple threshold function is defined, and its use in
quantifying the level of significance of dependencies between biological
segments is explored. These tools are used in two specific applications. First,
we identify correlations between different parts of the maize
zmSRp32 gene. There, we find significant dependencies between the 5'
untranslated region in zmSRp32 and its alternatively spliced exons. This
observation may indicate the presence of as-yet unknown alternative splicing
mechanisms or structural scaffolds. Second, using data from the FBI's Combined
DNA Index System (CODIS), we demonstrate that our approach is particularly well
suited for the problem of discovering short tandem repeats, an application of
importance in genetic profiling.
Comment: Preliminary version. Final version in EURASIP Journal on
Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
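As a rough illustration of the mutual-information idea in the abstract above, the sketch below computes a plug-in (empirical frequency) estimate of I(X;Y) between two equal-length symbol sequences. The function name and the choice of a plain plug-in estimator are assumptions made for illustration; the paper's own estimator and threshold function are not reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from position-wise paired symbols.

    Illustrative only: this is the textbook empirical estimator, not
    necessarily the procedure used in the paper.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    px = Counter(xs)             # marginal counts of symbols in xs
    py = Counter(ys)             # marginal counts of symbols in ys
    pxy = Counter(zip(xs, ys))   # joint counts of aligned symbol pairs
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[a] * py[b]))
    return mi
```

Two identical sequences over four equally frequent letters give 2 bits, and sequences whose aligned pairs are all equally likely give 0 bits. For short segments the plug-in estimate is biased upward, which is one reason a significance threshold is needed in practice.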
Genome-wide discovery of modulators of transcriptional interactions in human B lymphocytes
Transcriptional interactions in a cell are modulated by a variety of
mechanisms that prevent their representation as pure pairwise interactions
between a transcription factor and its target(s). These include, among others,
transcription factor activation by phosphorylation and acetylation, formation
of active complexes with one or more co-factors, and mRNA/protein degradation
and stabilization processes.
This paper presents a first step towards the systematic, genome-wide
computational inference of genes that modulate the interactions of specific
transcription factors at the post-transcriptional level. The method uses a
statistical test based on changes in the mutual information between a
transcription factor and each of its candidate targets, conditional on the
expression of a third gene. The approach was first validated on a synthetic
network model, and then tested in the context of a mammalian cellular system.
By analyzing 254 microarray expression profiles of normal and tumor-related
human B lymphocytes, we investigated the post-transcriptional modulators of the
MYC proto-oncogene, an important transcription factor involved in
tumorigenesis. Our method discovered a set of 100 putative modulator genes,
responsible for modulating 205 regulatory relationships between MYC and its
targets. The set is significantly enriched in molecules with function
consistent with their activities as modulators of cellular interactions,
recapitulates established MYC regulation pathways, and provides a notable
repertoire of novel regulators of MYC function. The approach has broad
applicability and can be used to discover modulators of any other transcription
factor, provided that adequate expression profile data are available.
Comment: 15 pages, 3 figures, 2 tables; minor changes following referees'
comments; accepted to RECOMB0
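The test described in the abstract above rests on how the mutual information between a transcription factor and a target changes with the expression of a third gene. A minimal sketch of that quantity follows, assuming a crude median binarization of expression values and a comparison between the k samples with the lowest and highest modulator expression. The names `delta_mi`, `binarize` and `mi_bits`, the binarization, and the parameter `k` are all hypothetical choices for illustration; the paper's estimator and significance test are not shown.

```python
from collections import Counter
from math import log2


def mi_bits(xs, ys):
    """Plug-in mutual information (bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def binarize(values):
    """Map values to 0/1 relative to their (upper) median. Crude by design."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    return [int(v >= median) for v in values]


def delta_mi(tf, target, modulator, k=None):
    """MI(tf; target) among the k highest-modulator samples minus the same
    quantity among the k lowest-modulator samples (assumed default k = n // 3)."""
    n = len(modulator)
    if k is None:
        k = n // 3
    if k < 1 or 2 * k > n:
        raise ValueError("k must satisfy 1 <= k <= n / 2")
    order = sorted(range(n), key=lambda i: modulator[i])
    low, high = order[:k], order[-k:]

    def mi_on(indices):
        return mi_bits(binarize([tf[i] for i in indices]),
                       binarize([target[i] for i in indices]))

    return mi_on(high) - mi_on(low)
```

A large positive or negative value suggests that the dependence between the two genes differs between the low and high modulator regimes. Whether a given difference is significant would have to be judged against a null distribution, for example by permuting the modulator values.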
Identification of direct residue contacts in protein-protein interaction by message passing
Understanding the molecular determinants of specificity in protein-protein
interaction is an outstanding challenge of postgenome biology. The availability
of large protein databases generated from sequences of hundreds of bacterial
genomes enables various statistical approaches to this problem. In this context
covariance-based methods have been used to identify correlation between amino
acid positions in interacting proteins. However, these methods have an
important shortcoming, in that they cannot distinguish between directly and
indirectly correlated residues. We developed a method that combines covariance
analysis with global inference analysis, adopted from use in statistical
physics. Applied to a set of >2,500 representatives of the bacterial
two-component signal transduction system, the combination of covariance with
global inference successfully and robustly identified residue pairs that are
proximal in space without resorting to ad hoc tuning parameters, both for
heterointeractions between sensor kinase (SK) and response regulator (RR)
proteins and for homointeractions between RR proteins. The spectacular success
of this approach illustrates the effectiveness of the global inference approach
in identifying direct interaction based on sequence information alone. We
expect this method to be applicable soon to interaction surfaces between
proteins present in only 1 copy per genome as the number of sequenced genomes
continues to expand. Use of this method could significantly increase the
potential targets for therapeutic intervention, shed light on the mechanism of
protein-protein interaction, and establish the foundation for the accurate
prediction of interacting protein partners.
Comment: Supplementary information available on
http://www.pnas.org/content/106/1/67.abstrac
Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies
Existing sequence alignment algorithms use heuristic scoring schemes which
cannot be used as objective distance metrics. Therefore one relies on measures
like the p- or log-det distances, or makes explicit, and often simplistic,
assumptions about sequence evolution. Information theory provides an
alternative, in the form of mutual information (MI) which is, in principle, an
objective and model-independent similarity measure. MI can be estimated by
concatenating and zipping sequences, yielding thereby the "normalized
compression distance". So far this has produced promising results, but with
uncontrolled errors. We describe a simple approach to get robust estimates of
MI from global pairwise alignments. Using standard alignment algorithms, this
gives for animal mitochondrial DNA estimates that are strikingly close to
estimates obtained from the alignment-free methods mentioned above. Our main
result uses algorithmic (Kolmogorov) information theory, but we show that
similar results can also be obtained from Shannon theory. Because it is not
additive, the normalized compression distance is not an optimal metric
for phylogenetics, but we propose a simple modification that overcomes the
issue of additivity. We test several versions of our MI-based distance measures
on a large number of randomly chosen quartets and demonstrate that they all
perform better than traditional measures like the Kimura or log-det (resp.
paralinear) distances. Even a simplified version based on single-letter Shannon
entropies, which can be easily incorporated in existing software packages, gave
superior results throughout the entire animal kingdom. But we see the main
virtue of our approach in a more general way. For example, it can also help to
judge the relative merits of different alignment algorithms, by estimating the
significance of specific alignments.
Comment: 19 pages + 16 pages of supplementary material
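The "concatenating and zipping" estimate mentioned in the abstract above can be sketched with a general-purpose compressor. The example below uses the standard normalized compression distance formula with Python's `zlib`; the choice of compressor and the function names are assumptions for illustration, and the paper's alignment-based estimator and its additive modification are not shown.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is the concatenation of x and y."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 indicate that one sequence adds little information given the other, and values near 1 indicate little shared structure. Because zlib matches repeats only within a 32 KiB window, this particular compressor is a poor fit for long sequences, so the sketch is suitable only for short inputs.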
The landscape of viral associations in human cancers
Here, as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, for which whole-genome and, for a subset, whole-transcriptome sequencing data from 2,658 cancers across 38 tumor types were aggregated, we systematically investigated potential viral pathogens using a consensus approach that integrated three independent pipelines. Viruses were detected in 382 genome and 68 transcriptome datasets. We found a high prevalence of known tumor-associated viruses such as Epstein–Barr virus (EBV), hepatitis B virus (HBV) and human papilloma virus (HPV; for example, HPV16 or HPV18). The study revealed significant exclusivity of HPV and driver mutations in head-and-neck cancer and the association of HPV with APOBEC mutational signatures, which suggests that impaired antiviral defense is a driving force in cervical, bladder and head-and-neck carcinoma. For HBV, HPV16, HPV18 and adeno-associated virus-2 (AAV2), viral integration was associated with local variations in genomic copy numbers. Integrations at the TERT promoter were associated with high telomerase expression, evidently activating this tumor-driving process. High levels of endogenous retrovirus (ERV1) expression were linked to a worse survival outcome in patients with kidney cancer.
On Weight Matrix and Free Energy Models for Sequence Motif Detection
The problem of motif detection can be formulated as the construction of a
discriminant function to separate sequences of a specific pattern from
background. In computational biology, motif detection is used to predict DNA
binding sites of a transcription factor (TF), mostly based on the weight matrix
(WM) model or the Gibbs free energy (FE) model. However, despite the wide
applications, theoretical analysis of these two models and their predictions is
still lacking. We derive asymptotic error rates of prediction procedures based
on these models under different data generation assumptions. This allows a
theoretical comparison between the WM-based and the FE-based predictions in
terms of asymptotic efficiency. Applications of the theoretical results are
demonstrated with empirical studies on ChIP-seq data and protein binding
microarray data. We find that, irrespective of underlying data generation
mechanisms, the FE approach shows higher or comparable predictive power
relative to the WM approach when the number of observed binding sites used for
constructing a discriminant decision is not too small.
Comment: 23 pages, 1 figure and 4 tables
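To make the weight matrix (WM) model in the abstract above concrete, the sketch below builds a log-odds matrix from a handful of aligned binding sites and uses the summed score as a discriminant against background. The pseudocount, the uniform background of 0.25 per base, the example sites, and the function names are assumptions for illustration; the free energy model and the paper's asymptotic analysis are not covered.

```python
from math import log2

ALPHABET = "ACGT"


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Per-position log2 odds of each base versus a uniform background.

    `sites` is a list of equal-length aligned binding-site strings.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for pos in range(width):
        column = [s[pos] for s in sites]
        matrix.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in ALPHABET
        })
    return matrix


def score(matrix, seq):
    """Sum of per-position log-odds; higher means more motif-like."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match matrix width")
    return sum(column[base] for column, base in zip(matrix, seq))
```

With the hypothetical sites `["TATA", "TATA", "TATT", "TACA"]`, the sequence `"TATA"` scores above zero and `"GCGC"` scores below zero, so a threshold at zero separates them. Choosing that threshold is exactly the discriminant decision whose error rates the paper analyzes.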
Statistical inference of the generation probability of T-cell receptors from sequence repertoires
Stochastic rearrangement of germline DNA by VDJ recombination is at the
origin of immune system diversity. This process is implemented via a series of
stochastic molecular events involving gene choices and random nucleotide
insertions between, and deletions from, genes. We use large sequence
repertoires of the variable CDR3 region of human CD4+ T-cell receptor beta
chains to infer the statistical properties of these basic biochemical events.
Since any given CDR3 sequence can be produced in multiple ways, the probability
distribution of hidden recombination events cannot be inferred directly from
the observed sequences; we therefore develop a maximum likelihood inference
method to achieve this end. To separate the properties of the molecular
rearrangement mechanism from the effects of selection, we focus on
non-productive CDR3 sequences in T-cell DNA. We infer the joint distribution of
the various generative events that occur when a new T-cell receptor gene is
created. We find a rich picture of correlation (and absence thereof), providing
insight into the molecular mechanisms involved. The generative event statistics
are consistent between individuals, suggesting a universal biochemical process.
Our distribution predicts the generation probability of any specific CDR3
sequence by the primitive recombination process, allowing us to quantify the
potential diversity of the T-cell repertoire and to understand why some
sequences are shared between individuals. We argue that the use of formal
statistical inference methods, of the kind presented in this paper, will be
essential for quantitative understanding of the generation and evolution of
diversity in the adaptive immune system.
Comment: 20 pages, including Appendi