1,460 research outputs found
Error correction and diversity analysis of population mixtures determined by NGS
The impetus for this work was the need to analyse nucleotide diversity in a viral mix taken from honeybees. The paper has two findings. First, a method for correction of next-generation sequencing (NGS) error in the distribution of nucleotides at a site is developed. Second, a package of methods for assessment of nucleotide diversity is assembled. The error correction method is statistically based and works at the level of the nucleotide distribution rather than the level of individual nucleotides. The method relies on an error model and a sample of known viral genotypes that is used for model calibration. A compendium of existing and new diversity analysis tools is also presented, allowing hypotheses about diversity and mean diversity to be tested and associated confidence intervals to be calculated. The methods are illustrated using honeybee viral samples. Software in both Excel and Matlab and a guide are available at http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/, the Warwick University Systems Biology Centre software download site.
CodonLogo: a sequence logo-based viewer for codon patterns
Motivation: Conserved patterns across a multiple sequence alignment can be visualized by generating sequence logos. Sequence logos show each column in the alignment as a stack of symbols, where the height of a stack is proportional to its information content and the height of each symbol within the stack is proportional to its frequency in the column. Sequence logos use symbols of either nucleotide or amino acid alphabets. However, certain regulatory signals in messenger RNA (mRNA) act as combinations of codons, yet no tool is available for visualization of conserved codon patterns.
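The stack and symbol heights described above can be sketched as follows. This is a minimal illustration of the standard sequence-logo calculation, not CodonLogo's own code: the function name `column_heights` and the toy columns are assumptions, and no small-sample correction is applied. Passing whole codons with an alphabet size of 64 shows how the same calculation extends from nucleotides to codons.

```python
import math
from collections import Counter


def column_heights(column, alphabet_size):
    """Return (stack_height, symbol_heights) in bits for one alignment column.

    `column` is a sequence of symbols observed at one alignment position:
    single nucleotides (alphabet_size=4) or whole codons (alphabet_size=64).
    Hypothetical helper for illustration; no small-sample correction.
    """
    counts = Counter(column)
    total = sum(counts.values())
    freqs = {symbol: count / total for symbol, count in counts.items()}
    # Shannon entropy of the observed symbol distribution, in bits.
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    # Stack height: maximum possible entropy minus observed entropy.
    stack_height = math.log2(alphabet_size) - entropy
    # Each symbol gets a share of the stack proportional to its frequency.
    symbol_heights = {s: p * stack_height for s, p in freqs.items()}
    return stack_height, symbol_heights


# A nucleotide column and a codon column (toy data, not from the paper).
print(column_heights("AACC", alphabet_size=4))
print(column_heights(["ATG", "ATG", "ATG", "GTG"], alphabet_size=64))
```

A fully conserved column reaches the maximum height (2 bits for nucleotides, 6 bits for codons), while a column with all symbols equally frequent has height 0.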
Recovering motifs from biased genomes: application of signal correction
A significant problem in biological motif analysis arises when the background symbol distribution is biased (e.g. high/low GC content in the case of DNA sequences). This can lead to overestimation of the amount of information encoded in a motif. A motif can be depicted as a signal using information theory (IT). We apply two concepts from IT, distortion and patterned interference (a type of noise), to model genomic and codon bias respectively. This modeling approach allows us to correct a raw signal to recover signals that are weakened by compositional bias. The corrected signal is more likely to be discriminated from a biased background by a macromolecule. We apply this correction technique to recover ribosome-binding site (RBS) signals from available sequenced and annotated prokaryotic genomes having diverse compositional biases. We observed that linear correction was sufficient for recovering signals even at the extremes of these biases. Further comparative genomics studies were made possible upon correction of these signals. We find that the average Euclidean distance between RBS signal frequency matrices of different genomes can be significantly reduced by using the correction technique. Within this reduced average distance, we can find examples of class-specific RBS signals. Our results have implications for motif-based prediction, particularly with regard to the estimation of reliable inter-genomic model parameters.
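Two quantities mentioned in this abstract can be illustrated with a short sketch: the information of a motif position measured against a background distribution, and the Euclidean distance between two frequency matrices. The code below is not the authors' distortion and interference correction. It uses plain relative entropy as a stand-in to show how the assumed background changes the information estimate. The function names, the background values, and the toy matrices are all assumptions.

```python
import math


def relative_entropy_bits(freqs, background):
    """Information (bits) of one motif position relative to a background.

    `freqs` and `background` map symbols to probabilities.
    Hypothetical helper; plain relative entropy, not the paper's correction.
    """
    return sum(
        p * math.log2(p / background[symbol])
        for symbol, p in freqs.items()
        if p > 0
    )


def matrix_distance(matrix_a, matrix_b):
    """Euclidean distance between two frequency matrices of equal length.

    Each matrix is a list of {symbol: frequency} dicts, one per position.
    """
    if len(matrix_a) != len(matrix_b):
        raise ValueError("matrices must have the same number of positions")
    squared = 0.0
    for col_a, col_b in zip(matrix_a, matrix_b):
        for symbol in set(col_a) | set(col_b):
            squared += (col_a.get(symbol, 0.0) - col_b.get(symbol, 0.0)) ** 2
    return math.sqrt(squared)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich = {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}  # assumed background
position = {"G": 1.0}  # a fully conserved G

# The same conserved G carries less information in a GC-rich genome.
print(relative_entropy_bits(position, uniform))
print(relative_entropy_bits(position, gc_rich))
print(matrix_distance([{"A": 1.0}], [{"C": 1.0}]))
```

Scoring the conserved G against a uniform background when the genome is actually GC-rich gives a higher value than scoring it against the true background, which is one simple way to see the overestimation the abstract describes.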
TISs-ST: a web server to evaluate polymorphic translation initiation sites and their reflections on the secretory targets
Background: The nucleotide sequence flanking the translation initiation codon (start codon context) affects the translational efficiency of eukaryotic mRNAs, and may indicate the presence of an alternative translation initiation site (TIS) to produce proteins with different properties. Multi-targeting may reflect the translational variability of these other protein forms. In this paper we present a web server that performs computations to investigate the usage of alternative translation initiation sites for the synthesis of new protein variants that might have different functions.
Results: An efficient web-based tool entitled TISs-ST (Translation Initiation Sites and Secretory Targets) evaluates putative translation initiation sites and indicates the prediction of a signal peptide of the protein encoded from this site. The TISs-ST web server is freely available to both academic and commercial users and can be accessed at http://ipe.cbmeg.unicamp.br/pub/TISs-ST.
Conclusion: The program can be used to evaluate alternative translation initiation site consensus with user-specified sequences, based on their composition or on many position weight matrix models. TISs-ST provides analytical and visualization tools for evaluating the periodic frequency, the consensus pattern and the total information content of a sequence data set. A search option allows for the identification of signal peptides from predicted proteins using the PrediSi software.
Data analysis methods for copy number discovery and interpretation
Copy number variation (CNV) is an important type of genetic variation that can give rise to a wide variety of phenotypic traits. Differences in copy number are thought to play major roles in processes that involve dosage-sensitive genes, providing beneficial, deleterious or neutral modifications to individual phenotypes. Copy number analysis has long been a standard in clinical cytogenetic laboratories. Gene deletions and duplications can often be linked with genetic syndromes such as the 7q11.23 deletion of Williams-Beuren syndrome, the 22q11 deletion of DiGeorge syndrome and the 17q11.2 duplication of Potocki-Lupski syndrome. Interestingly, copy number based genomic disorders often display reciprocal deletion/duplication syndromes, with the latter frequently exhibiting milder symptoms. Moreover, the study of chromosomal imbalances plays a key role in cancer research. The datasets used for the development of analysis methods during this project are generated as part of the cutting-edge translational project, Deciphering Developmental Disorders (DDD). This project, the DDD, is the first of its kind and will directly apply state-of-the-art technologies, in the form of ultra-high resolution microarray and next generation sequencing (NGS), to real-time genetic clinical practice. It is a collaboration between the Wellcome Trust Sanger Institute (WTSI) and the National Health Service (NHS) involving the 24 regional genetic services across the UK and Ireland. Although the application of DNA microarrays for the detection of CNVs is well established, individual change point detection algorithms often display variable performances. The definition of an optimal set of parameters for achieving a certain level of performance is rarely straightforward, especially where data qualities vary ... [cont.]
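To make the idea of change point detection concrete, here is a minimal sketch that finds a single change point in a series of probe log2 ratios by minimizing the within-segment squared error. It is not one of the algorithms evaluated in the thesis: the function name `best_change_point` and the toy values are assumptions, and practical CNV callers must handle multiple segments, noise, and the parameter tuning the abstract refers to.

```python
def best_change_point(values):
    """Return (index, cost) of the best single split of `values`.

    The split separates values[:index] from values[index:] and is chosen to
    minimize the total squared deviation from each segment's mean.
    Hypothetical helper for illustration only.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to place a change point")

    def squared_error(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    def cost(index):
        return squared_error(values[:index]) + squared_error(values[index:])

    index = min(range(1, len(values)), key=cost)
    return index, cost(index)


# Toy log2 ratios: four probes near 0 (two copies), then four near -1
# (consistent with a heterozygous deletion). Values are invented.
log2_ratios = [0.0, 0.1, -0.1, 0.0, -1.0, -0.9, -1.1, -1.0]
print(best_change_point(log2_ratios))
```

Even in this toy form, the result depends on a modeling choice (squared error, exactly one change point), which hints at why different algorithms and parameter settings perform differently on data of varying quality.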
High-quality, high-throughput measurement of protein-DNA binding using HiTS-FLIP
To understand the behavior of transcription factors (TFs) in more depth and on a genome-wide scale, novel quantitative high-throughput experiments are needed.
Recently, HiTS-FLIP (High-Throughput Sequencing-Fluorescent Ligand Interaction Profiling) was invented by the Burge lab at MIT (Nutiu et al. (2011)). Based on an Illumina GA-IIx machine for next-generation sequencing, HiTS-FLIP measures the affinity of fluorescently labeled proteins to millions of DNA clusters at equilibrium in an unbiased and untargeted way, examining the entire sequence space by determining dissociation constants (Kds) for all 12-mer DNA motifs. During my PhD I helped to improve the experimental design of this method so that protein-DNA binding events can be measured at equilibrium without any washing step, by utilizing the TIRF (Total Internal Reflection Fluorescence) based optics of the GA-IIx. In addition, I developed the first versions of XML-based control software that automates the measurement procedure. To process the vast amount of data produced by each run, I developed a sophisticated, high-performance software pipeline that locates DNA clusters and normalizes and extracts the fluorescent signals. Moreover, the k-mer motifs contained in the clusters are ranked and their DNA binding affinities are quantified with high accuracy.
My approach of applying phase correlation to estimate the relative translational offset between the observed tile images and the template images makes resequencing unnecessary and thus allows the flow cell to be reused for several HiTS-FLIP experiments, which greatly reduces cost and time. Nutiu et al. (2011) normalize the cluster intensities using information from the sequencing images, which introduces a nucleotide-specific bias. Instead, I estimate the cluster-related normalization factors directly from the protein images, which captures the uneven illumination bias more accurately and leads to an improved correction for each tile image. My analysis of the ranking algorithm by Nutiu et al. (2011) has revealed that it is unable to rank all measured k-mers. Discarding all the clusters related to previously ranked k-mers has the side effect of eliminating any clusters on which k-mers could be ranked that share submotifs with previously ranked k-mers. This shortcoming affects even strongly binding k-mers that are only one mutation away from the top-ranked k-mer. My findings show that omitting the cluster deletion step in the ranking process overcomes this limitation and allows the full spectrum of all possible k-mers to be ranked. In addition, my insight reduces the run time of the ranking algorithm from quadratic to linear. The experimental improvements combined with the sophisticated processing of the data have led to a very high accuracy of the HiTS-FLIP dissociation constants (Kds), comparable to the Kds measured by the very sensitive HiP-FA assay (Jung et al. (2015)). However, HiTS-FLIP is experimentally a very challenging assay. In total, eight HiTS-FLIP experiments were performed but only one showed saturation; the others exhibited protein aggregation at the amplified DNA clusters. This biochemical issue could not be remedied. GCN4 was chosen as the example TF for studying the details of HiTS-FLIP. It is a dimeric basic leucine zipper TF that acts as the master regulator of the amino acid starvation response in Saccharomyces cerevisiae (Natarajan et al. (2001)). The fluorescent dye was mOrange.
The HiTS-FLIP Kds for the TF GCN4 were validated by the HiP-FA assay, achieving a Pearson correlation coefficient of R=0.99 and a relative error of delta=30.91%. Thus, a unique and comprehensive data set of utmost quantitative precision was obtained that allowed the complex binding behavior of GCN4 to be studied in a new way. My downstream analyses reveal that the known 7-mer consensus motif of GCN4, TGACTCA, is modulated by its 2-mer flanking regions, spanning an affinity range of over two orders of magnitude from Kd=1.56 nM to Kd=552.51 nM. These results suggest that the common 9-mer PWM (Position Weight Matrix) for GCN4 is insufficient to describe the binding behavior of GCN4. Rather, an additional left and right flanking nucleotide is required to extend the 9-mer to an 11-mer. My analyses regarding mutations and related delta delta G values suggest long-range interdependencies between nucleotides of the two dimeric half-sites of GCN4. Consequently, models assuming positional independence, such as a PWM, are insufficient to explain these interdependencies. Instead, the full spectrum of affinity values for all k-mers of appropriate size should be measured and applied in further analyses, as proposed by Nutiu et al. (2011). Another discovery was a set of new binding motifs of GCN4, which can only be detected with a method like HiTS-FLIP that examines the entire sequence space and allows for unbiased, de novo motif discovery. All these new motifs contain GTGT as a submotif, and the data collected suggest that GCN4 binds as a monomer to these new motifs. Therefore, it might even be possible to detect different binding modes with HiTS-FLIP. My results emphasize the binding complexity of GCN4 and demonstrate the advantage of HiTS-FLIP for investigating the complexity of regulatory processes.
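The relationship between a Kd ratio and a delta delta G value can be sketched with the standard thermodynamic relation ΔΔG = RT ln(Kd_variant / Kd_reference). The two Kd values below are the extremes quoted in the abstract. The temperature of 298.15 K and the function name `delta_delta_g` are assumptions, since the abstract does not state the measurement conditions or how the thesis computed its values.

```python
import math

GAS_CONSTANT = 8.314462618  # J / (mol * K)
ASSUMED_TEMPERATURE = 298.15  # K; an assumption, not stated in the abstract


def delta_delta_g(kd_variant, kd_reference, temperature=ASSUMED_TEMPERATURE):
    """Return the binding free energy difference in kJ/mol.

    Positive values mean the variant binds more weakly than the reference.
    Both Kd values must be in the same units (here nM).
    """
    return GAS_CONSTANT * temperature * math.log(kd_variant / kd_reference) / 1000.0


strongest_kd = 1.56   # nM, from the abstract
weakest_kd = 552.51   # nM, from the abstract

fold_change = weakest_kd / strongest_kd
print(f"fold change in Kd: {fold_change:.1f}")
print(f"orders of magnitude: {math.log10(fold_change):.2f}")
print(f"delta delta G: {delta_delta_g(weakest_kd, strongest_kd):.2f} kJ/mol")
```

The base-10 logarithm of the fold change confirms that the reported affinity range spans more than two orders of magnitude.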
The molecular basis of high duty-cycle echolocation in bats, and its role in the divergence of populations and species
PhD thesis. How populations diverge and form new species in the face of gene flow is a key question in evolutionary biology. Recent research suggests this may be possible where the same traits affect the ecological niche and are involved in assortative mating, and that a small number of genes could be involved in driving speciation in these cases. Echolocation call frequency in bats has roles in ecology and social communication. Bats using high duty-cycle (HDC) echolocation have hearing tuned to specific frequencies, with frequency shifts impacting ecological niche and mate recognition, meaning this is a good candidate trait to drive speciation. HDC echolocation has evolved independently in two highly divergent groups of bats, providing a unique opportunity to study the molecular basis of a trait potentially driving speciation. I have combined selection testing of specific loci with genome-wide divergence scans to test hypotheses concerning the evolution of HDC echolocation. Members of the yangochiropteran genus Pteronotus use low duty-cycle echolocation, except for the subgenus Phyllodia. Selection tests on coding sequence data revealed loci associated with hearing under positive selection in Phyllodia and in Pteronotus, including eleven shared with a yinpterochiropteran HDC echolocator, Rhinolophus sinicus. Three size and acoustic morphs of Rhinolophus philippinensis exist in sympatry on Buton Island. Phylogenetic reconstructions revealed population structure between the morphs, though with conflicting topologies based on mitochondrial and nuclear data. Species delimitation identified at least two separate taxa. Genome-wide scans of divergence indicated low background FST between the morphs, punctuated with highly diverged islands featuring an overrepresentation of genes associated with body size and hearing. This thesis represents the first genome-wide investigation of HDC echolocation, highlighting candidate genes related to this trait. It additionally describes a rarely observed case of mammalian ecological speciation, providing support for the claim that species designated R. philippinensis represent a complex across their range.
Sex differences in DNA methylation and expression in zebrafish brain: a test of an extended ‘male sex drive’ hypothesis
The sex drive hypothesis predicts that stronger selection on male traits has resulted in masculinization of the genome. Here we test whether such masculinizing effects can be detected at the level of the transcriptome and methylome in the adult zebrafish brain. Although methylation is globally similar, we identified 914 specific differentially methylated CpGs (DMCs) between males and females (435 were hypermethylated and 479 were hypomethylated in males compared to females). These DMCs were prevalent in gene bodies, intergenic regions and CpG island shores. We also discovered 15 distinct CpG clusters with striking sex-specific DNA methylation differences. In contrast, at the transcriptome level, more female-biased genes than male-biased genes were expressed, giving little support for the male sex drive hypothesis. Our study provides a genome-wide methylome and transcriptome assessment and sheds light on sex-specific epigenetic patterns in zebrafish for the first time.
arrayMap: A Reference Resource for Genomic Copy Number Imbalances in Human Malignancies
Background: The delineation of genomic copy number abnormalities (CNAs) from
cancer samples has been instrumental for identification of tumor suppressor
genes and oncogenes and proven useful for clinical marker detection. An
increasing number of projects have mapped CNAs using high-resolution microarray
based techniques. So far, no single resource provides a global collection
of readily accessible oncogenomic array data.
Methodology/Principal Findings: We here present arrayMap, a curated reference
database and bioinformatics resource targeting copy number profiling data in
human cancer. The arrayMap database provides a platform for meta-analysis and
systems level data integration of high-resolution oncogenomic CNA data. To
date, the resource incorporates more than 40,000 arrays in 224 cancer types
extracted from several resources, including the NCBI's Gene Expression Omnibus
(GEO), EBI's ArrayExpress (AE), The Cancer Genome Atlas (TCGA), publication
supplements and direct submissions. For the majority of the included datasets,
probe level and integrated visualization facilitate gene level and genome wide
data review. Results from multi-case selections can be connected to
downstream data analysis and visualization tools.
Conclusions/Significance: To our knowledge, currently no data source provides
an extensive collection of high resolution oncogenomic CNA data which could
readily be used for genomic feature mining across a representative range of
cancer entities. arrayMap represents our effort to provide a long-term
platform for oncogenomic CNA data independent of specific platform
considerations or specific project dependence. The online database can be
accessed at http://www.arraymap.org.
Comment: 17 pages, 5 inline figures, 3 tables, supplementary figures/tables split into 4 PDF files; manuscript submitted to PLoS ON