24 research outputs found

    Group Normalization for Genomic Data

    No full text
    <div><p>Data normalization is a crucial preliminary step in analyzing genomic datasets. The goal of normalization is to remove global variation to make readings across different experiments comparable. In addition, most genomic loci have non-uniform sensitivity to any given assay because of variation in local sequence properties. In microarray experiments, this non-uniform sensitivity is due to different DNA hybridization and cross-hybridization efficiencies, known as the probe effect. In this paper we introduce a new scheme, called Group Normalization (GN), to remove both global and local biases in one integrated step, whereby we determine the normalized probe signal by finding a set of reference probes with similar responses. Compared to conventional normalization methods such as Quantile normalization and physically motivated probe effect models, our proposed method is general in the sense that it does not require the assumption that the underlying signal distribution be identical for the treatment and control, and is flexible enough to correct for nonlinear and higher order probe effects. The Group Normalization algorithm is computationally efficient and easy to implement. We also describe a variant of the Group Normalization algorithm, called Cross Normalization, which efficiently amplifies biologically relevant differences between any two genomic datasets.</p></div>

    Mouse ChIP-seq results. MotifSpec outperforms DREME when run on ChIP-seq data for 13 transcription factors from mouse embryonic stem cells.

    No full text
    <p>The left panel shows a plot of the AUC for the top motif reported by MotifSpec against the AUC for the top motif reported by DREME, while the right panel shows the improvement in AUC for the MotifSpec motif relative to the DREME motif.</p>
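The AUC compared in this caption can be illustrated with a minimal sketch. It assumes each motif assigns a score to every bound (positive) and unbound (negative) sequence; the function name `auc` and the toy scores are hypothetical and are not taken from the MotifSpec or DREME evaluation.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sequence
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical motif scores for bound and unbound sequences.
bound = [3.0, 2.0]
unbound = [1.0, 2.0]
print(auc(bound, unbound))
```

The quadratic pairwise loop is chosen for clarity; a rank-based computation gives the same value more efficiently on large datasets.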

    The top 5 motifs found by MotifSpec in a genome-wide search of a <i>C</i>. <i>elegans</i> sequence and expression dataset.

    No full text
    <p>Alongside each motif is its specificity score and any Gene Ontology (GO) and Anatomy Ontology (AO) terms that were enriched in the list of target genes.</p>

    Identification of Predictive Cis-Regulatory Elements Using a Discriminative Objective Function and a Dynamic Search Space

    No full text
    <div><p>The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs, in combination with the modeling of motifs using a full PWM (position weight matrix) rather than <i>k</i>-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than, motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in <i>C</i>. <i>elegans</i> using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.</p></div>

    Group Normalization results for nucleosome positioning in yeast.

    No full text
    <p>(A) Probe distribution before (left) and after (right) Group Normalization. (B) Inferred nucleosome pattern at HXT3 promoter before (blue ovals) and after (red ovals) glucose addition. HXT3 is upregulated at high glucose levels and repressed at low glucose levels. (C) Differential nucleosome occupancy in yeast in response to glucose addition: cells are grown on glycerol and then 2% glucose is added. Nucleosome positioning is measured before and 60 min after glucose addition (Zawadzki et al., 2009). The top curves show the spatially averaged raw tiling array data, at time zero (gray dotted) and t = 60 (magenta). The lower plot shows the result of our normalization method. The red curve is the normalized differential nucleosome occupancy for t = 60 min compared to t = 0 (high values imply increase in nucleosome occupancy in response to glucose). The blue dotted curve is the reverse analysis, comparing t = 0 to t = 60. The yellow diamonds indicate ADR1 binding regions from ChIP.</p>

    MotifSpec performs better at recovery of seeded motifs from a synthetic sequence-expression dataset than two-step procedures of k-means clustering and motif-finding using AlignACE, MEME and Weeder.

    No full text
    <p>MotifSpec performs better at recovery of seeded motifs from a synthetic sequence-expression dataset than two-step procedures of k-means clustering and motif-finding using AlignACE, MEME and Weeder.</p>

    Signal Quality measure.

    No full text
    <p>Two tiling array signals corresponding to nucleosome occupancy under two different experimental conditions are shown for the <i>HXT3</i> locus. We use two conditions and a replicate to determine signal and noise, as follows. In condition A (with glucose), the highlighted region is nucleosome free, and in condition B (no glucose), it is nucleosome bound. <i>S</i> is the difference between the tiling array signals under the two conditions and reflects the signal strength. <i>N</i> is a measure of noise and is estimated by comparing the signals of two replicate microarrays under similar experimental conditions. We evaluate <i>S</i> over a set of significantly changed probes (indicated with open circles) and <i>N</i> over all the probes as described in the text. The ratio <i>S/N</i> is a genome-wide measure of Signal Quality.</p>
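The S/N ratio described in this caption can be sketched as follows. The caption defers the exact estimators to the paper text, so the choices here are assumptions: S is taken as the mean absolute difference between conditions over the changed probes, and N as the standard deviation of replicate differences over all probes. The function name `signal_quality` and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def signal_quality(cond_a, cond_b, replicate_a, changed_idx):
    """Return S/N for probe-level signals (lists of equal length).

    S (assumed estimator): mean |A - B| over significantly changed probes.
    N (assumed estimator): standard deviation of the difference between
    condition A and its replicate, taken over all probes.
    """
    s = mean([abs(cond_a[i] - cond_b[i]) for i in changed_idx])
    n = pstdev([a - r for a, r in zip(cond_a, replicate_a)])
    return s / n


# Hypothetical signals for four probes; probe 2 changes between conditions.
cond_a = [1.0, 1.0, 5.0, 1.0]
cond_b = [1.0, 1.0, 1.0, 1.0]
replicate_a = [1.5, 0.5, 5.5, 0.5]
print(signal_quality(cond_a, cond_b, replicate_a, changed_idx=[2]))
```

A higher ratio indicates that condition-dependent changes stand out more clearly from replicate-to-replicate noise, which is how the caption uses S/N to compare normalization schemes.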

    MotifSpec optimizes for specificity rather than over-representation and uses a dynamic search space.

    No full text
    <p>(A) An over-represented motif is found in the search space more often than expected according to some background model. It is not necessarily predictive. A specific motif is found at a much higher frequency in the search space than in the background sequences. A dynamic search space threshold finds the optimal search space such that the motif is most discriminative. (B) A schematic of the MotifSpec algorithm. The PWM model is initialized with a random sequence and position in the search space. The model is iteratively refined and the motif and binding score thresholds are adjusted at convergence to maximize specificity. (C) An example of sequences scored using the model. Each sequence has a motif score and a binding score. The binding score determines if a sequence is in the search space. The motif score determines if the sequence has an instance of the motif. The sequences are color-coded according to the set to which they belong as defined in (B).</p>
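The two-threshold partition described in panel (C) can be sketched as below. This is an illustration only, not the MotifSpec implementation: the log-odds scoring against a uniform background, the function names `pwm_score` and `partition`, and the toy PWM and thresholds are all assumptions. The specificity objective and the iterative refinement are not reproduced.

```python
import math


def pwm_score(seq, pwm, background=0.25):
    """Best log-odds score of any window of seq under the PWM.

    pwm is a list of dicts mapping each base to a nonzero probability.
    """
    width = len(pwm)
    best = -math.inf
    for start in range(len(seq) - width + 1):
        score = sum(
            math.log2(pwm[k][seq[start + k]] / background) for k in range(width)
        )
        best = max(best, score)
    return best


def partition(records, pwm, motif_threshold, binding_threshold):
    """Count sequences by (in/out of search space) x (motif hit/miss).

    records is a list of (sequence, binding_score) pairs.
    """
    counts = {"in_hit": 0, "in_miss": 0, "out_hit": 0, "out_miss": 0}
    for seq, binding in records:
        space = "in" if binding >= binding_threshold else "out"
        hit = "hit" if pwm_score(seq, pwm) >= motif_threshold else "miss"
        counts[space + "_" + hit] += 1
    return counts


# Hypothetical 2-bp PWM favouring "AC", and four toy sequences.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
records = [("GGACGG", 5.0), ("GGGGGG", 4.0), ("TTTTAC", 0.5), ("TTTTTT", 0.2)]
print(partition(records, pwm, motif_threshold=2.0, binding_threshold=1.0))
```

A discriminative motif would concentrate its hits in the search space, so the `in_hit` and `out_miss` counts would dominate; moving either threshold changes the four counts, which is the quantity a specificity objective would be computed from.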

    Genomic assays are often highly reproducible, but have significant efficiency variation across the genome.

    No full text
    <p>(A) Two genomic hybridization signals (biological replicates) from Lee et al. (2007), shown along a portion of Chr III, are highly reproducible but deviate significantly from the expected constant signal. (B) Across the whole genome, these variations are highly reproducible. Two genomic hybridizations for the entire yeast genome are highly correlated (Pearson C = 0.966).</p>

    Flowchart of Group Normalization.

    No full text
    <p>Control arrays are used to generate reference probe sets for each probe. Then we use the reference probe sets to estimate the probe parameters in the treatment arrays and to generate the normalized signal. We propose two distinct methods to normalize the arrays: a Binary method which parameterizes high and low signal for each probe (μ<sub>low</sub>, μ<sub>high</sub>); or a Quantile-based method which uses the rank of each probe in the reference set.</p>
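The Quantile-based variant in this flowchart can be illustrated with a rough sketch. It rests on assumptions the caption does not confirm: the reference set of a probe is taken to be the k probes with the closest control signal, and the normalized value is the fraction of reference probes whose treatment signal is lower. The function names and toy data are hypothetical, and the Binary method is not shown.

```python
def reference_sets(control, k):
    """For each probe, indices of the k other probes closest in control signal."""
    refs = []
    for i, c in enumerate(control):
        others = [j for j in range(len(control)) if j != i]
        others.sort(key=lambda j: abs(control[j] - c))
        refs.append(others[:k])
    return refs


def group_normalize_quantile(control, treatment, k):
    """Rank each probe's treatment signal within its reference set.

    Returns values in [0, 1]: the fraction of reference probes whose
    treatment signal is strictly below the probe's own treatment signal.
    """
    normalized = []
    for i, ref in enumerate(reference_sets(control, k)):
        below = sum(1 for j in ref if treatment[j] < treatment[i])
        normalized.append(below / len(ref))
    return normalized


# Hypothetical data: two groups of probes with different sensitivities.
control = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
treatment = [1.0, 3.0, 0.9, 5.0, 5.2, 9.0]
print(group_normalize_quantile(control, treatment, k=2))
```

Because each probe is ranked only against probes that behave similarly in the control, a weakly responding probe and a strongly responding probe can receive the same normalized value when both rise relative to their own reference sets.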