10 research outputs found

    Differential analysis for high density tiling microarray data

    Background: High density oligonucleotide tiling arrays are an effective and powerful platform for conducting unbiased genome-wide studies. The ab initio probe selection method employed in tiling arrays is unbiased, and thus ensures consistent sampling across coding and non-coding regions of the genome. These arrays are increasingly used to study transcription, transcription factor binding, chromatin structure and the associations among them. Studies of differential expression and/or regulation provide critical insight into the mechanics of transcription and regulation that occur during the developmental program of a cell. The time-course experiment, which comprises an in vivo system and the proposed analyses, is used to determine whether annotated and un-annotated portions of the genome manifest a coordinated differential response to the induced developmental program. Results: We have proposed a novel approach, based on a piece-wise function, to analyze genome-wide differential response. This enables segmentation of the response based on protein-coding and non-coding regions; for genes, the methodology also partitions differential response with a 5' versus 3' versus intra-genic bias. Conclusion: The algorithm, built upon the framework of Significance Analysis of Microarrays, uses a generalized logic to define regions/patterns of coordinated differential change. By not adhering to the gene-centric paradigm, discordant differential expression patterns between exons and introns have been identified at an FDR of less than 12 percent. A co-localization of differential binding between RNA Polymerase II and tetra-acetylated histone has been quantified at a p-value < 0.003; it is most significant at the 5' end of genes, at a p-value < 10^-13. The prototype R code has been made available as supplementary material (Additional file 1: gsam_prototypercode.zip, file name 1471-2105-8-359-S1.zip, a file archive comprising prototype R code for the gSAM implementation, including a readme and examples).
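
The abstract above says the algorithm builds on Significance Analysis of Microarrays (SAM). As a rough illustration only, the following sketch computes a SAM-style d-statistic for a single probe from two groups of replicate intensities. The function name `d_statistic`, the default value of the constant `s0`, and the sample values are assumptions made for this example; it is not the gSAM prototype code and does not reproduce the piece-wise segmentation.

```python
import math
from statistics import mean


def d_statistic(group_a, group_b, s0=0.1):
    """SAM-style relative difference for one probe.

    d = (mean_b - mean_a) / (s + s0), where s is the pooled standard
    error and s0 is a small positive constant (assumed value here) that
    keeps low-variance probes from dominating the ranking.
    """
    n_a, n_b = len(group_a), len(group_b)
    m_a, m_b = mean(group_a), mean(group_b)
    # Pooled sum of squared deviations across both groups.
    ss = sum((x - m_a) ** 2 for x in group_a) + sum((x - m_b) ** 2 for x in group_b)
    # Pooled standard error, as in the usual two-sample t-statistic.
    s = math.sqrt((1.0 / n_a + 1.0 / n_b) * ss / (n_a + n_b - 2))
    return (m_b - m_a) / (s + s0)


# Hypothetical replicate intensities for one probe at two time points.
early = [1.0, 2.0, 3.0]
late = [4.0, 5.0, 6.0]
print(d_statistic(early, late))  # positive: signal increases over time
print(d_statistic(late, early))  # negative: signal decreases over time
```

In SAM, the FDR attached to a given d threshold is usually estimated by recomputing d after permuting the sample labels; that step is omitted here.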

    Unsupervised Classification for Tiling Arrays: ChIP-chip and Transcriptome

    Tiling arrays make possible a large-scale exploration of the genome thanks to probes that cover the whole genome at very high density, up to 2,000,000 probes. The biological questions usually addressed are either the expression difference between two conditions or the detection of transcribed regions. In this work we propose to consider both questions simultaneously as an unsupervised classification problem by modeling the joint distribution of the two conditions. In contrast to previous methods, we account for all available information on the probes as well as biological knowledge such as annotation and spatial dependence between probes. Since probes are not biologically relevant units, we propose a classification rule for non-connected regions covered by several probes. Applications to transcriptomic and ChIP-chip data of Arabidopsis thaliana obtained with a NimbleGen tiling array highlight the importance of precise modeling and of the region classification.

    STATISTICAL METHODS FOR AFFYMETRIX TILING ARRAY DATA

    Tiling arrays are a microarray technology currently being used for a variety of genomic and epigenomic applications, such as the mapping of transcription, DNA methylation, and histone modifications. Tiling arrays provide high-density coverage of a genome, or a genomic region, through the systematic and sequential placement of probes without regard to genome annotation. In this paper we compare the Affymetrix tiling array to the Affymetrix GeneChip® 3’ expression array and propose methods that address statistical and bioinformatic issues that accompany gene expression data that are generated from Affymetrix tiling arrays. Real data from the model organism Arabidopsis thaliana motivate this work and application

    Ratio-Based Analysis of Differential mRNA Processing and Expression of a Polyadenylation Factor Mutant pcfs4 Using Arabidopsis Tiling Microarray

    US National Institutes of Health [1R15GM07719201A1]; US National Science Foundation [IOS-0817818]; Ohio Plant Biotech Consortium; National Natural Science Foundation of China [60774033]; Specialized Research Fund for the Doctoral Program of Higher Educati
    Background: Alternative polyadenylation as a mechanism in gene expression regulation has been widely recognized in recent years. Arabidopsis polyadenylation factor PCFS4 was shown to function in leaf development and in flowering time control. The function of PCFS4 in controlling flowering time was correlated with the alternative polyadenylation of FCA, a flowering time regulator. However, genetic evidence suggested additional targets of PCFS4 that may mediate its function in both flowering time and leaf development. Methodology/Principal Findings: To identify further targets, we investigated the whole transcriptome of a pcfs4 mutant using the Affymetrix Arabidopsis genomic tiling 1.0R array and developed a data analysis pipeline, termed RADPRE (Ratio-based Analysis of Differential mRNA Processing and Expression). In RADPRE, ratios of normalized probe intensities between wild type Columbia and a pcfs4 mutant were first generated. By doing so, one of the major problems of tiling array data, namely variation caused by differential probe affinity, was significantly alleviated. With the probe ratios as inputs, a hierarchy of statistical tests was carried out to identify differentially processed genes (DPG) and differentially expressed genes (DEG). The false discovery rate (FDR) of this analysis was estimated by using balanced random combinations of Col/pcfs4 and pcfs4/Col ratios as inputs. Gene Ontology (GO) analysis of the DPGs and DEGs revealed potential new roles of PCFS4 in stress responses besides flowering time regulation. Conclusion/Significance: We identified 68 DPGs and 114 DEGs with FDR at 1% and 2%, respectively. Most of the 68 DPGs were subjected to alternative polyadenylation, splicing or transcription initiation. Quantitative PCR analysis of a set of DPGs confirmed that most of these genes were truly differentially processed in pcfs4 mutant plants. The enriched GO term "regulation of flower development" among PCFS4 targets further indicated the efficacy of the RADPRE pipeline. This simple but effective program is available upon request.
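
The ratio step and the balanced null set described in the RADPRE abstract can be pictured with a small sketch. It is a loose illustration under stated assumptions, not the RADPRE program: the function names `log_ratios`, `balanced_null` and `probe_means`, the use of log2, and the toy intensities are all hypothetical, and the hierarchy of statistical tests is not reproduced.

```python
import math
import random


def log_ratios(wild_type, mutant):
    """Per-probe log2(wild type / mutant) for one replicate pair."""
    return [math.log2(w / m) for w, m in zip(wild_type, mutant)]


def balanced_null(replicate_ratios, rng):
    """Flip the sign of a random half of the replicates.

    Negating a log ratio turns Col/pcfs4 into pcfs4/Col, so averaging a
    balanced mix cancels a consistent real difference and leaves noise.
    Assumes an even number of replicates.
    """
    n = len(replicate_ratios)
    flipped = set(rng.sample(range(n), n // 2))
    return [
        [-r for r in ratios] if i in flipped else list(ratios)
        for i, ratios in enumerate(replicate_ratios)
    ]


def probe_means(replicate_ratios):
    """Average each probe's log ratio across replicates."""
    return [sum(col) / len(col) for col in zip(*replicate_ratios)]


# Toy data: two probes, four replicate pairs (values are made up).
col = [[8.0, 2.0]] * 4
pcfs4 = [[2.0, 2.0]] * 4
observed = [log_ratios(w, m) for w, m in zip(col, pcfs4)]
null = balanced_null(observed, random.Random(0))
print(probe_means(observed))  # first probe shows a consistent difference
print(probe_means(null))      # the balanced mix averages it away
```

Dividing intensities probe by probe removes a multiplicative probe-affinity effect, because the same factor appears in the numerator and the denominator. Statistics computed on many balanced null sets would then indicate how often a given cutoff is passed by chance.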

    Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger

    Background: Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institute (JGI) and to a predicted protein set from another A. niger sequence. Tandem mass spectra (MS/MS) were acquired from 1D gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). Results: 405 identified peptide sequences were mapped to 214 different A. niger genomic loci, to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model, or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. Conclusion: This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines, much as expressed sequence tag (EST) data has been. A comparison with the published genome of another strain of A. niger, sequenced by DSM, showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method.
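
Reverse database searching, mentioned in the abstract above, is commonly used to estimate an FDR by counting matches to reversed (decoy) sequences. The sketch below shows that general idea only. The function name `reverse_database_fdr`, the score values and the thresholds are assumptions for illustration, and Average Peptide Scoring itself is not reproduced.

```python
def reverse_database_fdr(scored_hits, threshold):
    """Estimate the FDR among hits scoring at or above `threshold`.

    `scored_hits` is a list of (score, is_reverse) pairs, where
    is_reverse marks a match to a reversed (decoy) sequence. The
    estimate assumes that false matches to forward sequences are about
    as frequent as matches to reversed ones.
    """
    forward = sum(1 for score, is_rev in scored_hits if score >= threshold and not is_rev)
    reverse = sum(1 for score, is_rev in scored_hits if score >= threshold and is_rev)
    if forward == 0:
        return 0.0
    return min(1.0, reverse / forward)


# Made-up peptide-spectrum match scores; True marks a reversed-database hit.
hits = [(9.0, False), (8.0, False), (7.5, True),
        (7.0, False), (6.0, False), (3.0, True)]
print(reverse_database_fdr(hits, 5.0))  # 1 reversed vs 4 forward hits
print(reverse_database_fdr(hits, 8.0))  # no reversed hits at this cutoff
```

Raising the score threshold trades fewer identifications for a lower estimated FDR, which is how a cutoff for "an acceptable FDR" can be chosen.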

    The mapping task and its various applications in next-generation sequencing

    The aim of this thesis is the development and benchmarking of computational methods for the analysis of high-throughput data from tiling arrays and next-generation sequencing (NGS). Tiling arrays have been a mainstay of genome-wide transcriptomics, e.g., in the identification of functional elements in the human genome. Because of limitations in existing methods for analyzing these data, a novel statistical approach is presented that identifies expressed segments as significant differences from the background distribution and thus avoids dataset-specific parameters. This method detects differentially expressed segments in biological data with significantly lower false discovery rates and equivalent sensitivities compared to commonly used methods. In addition, it is also clearly superior in the recovery of exon-intron structures. Moreover, the search for local accumulations of expressed segments in tiling array data has led to the identification of very large expressed regions that may constitute a new class of macroRNAs. This thesis proceeds with next-generation sequencing, for which various protocols have been devised to study genomic, transcriptomic, and epigenomic features. One of the first crucial steps in most NGS data analyses is the mapping of sequencing reads to a reference genome. This work introduces algorithmic methods to solve the mapping tasks for three major NGS protocols: DNA-seq, RNA-seq, and MethylC-seq. All methods have been thoroughly benchmarked and integrated into the segemehl mapping suite. First, mapping of DNA-seq data is facilitated by the core mapping algorithm of segemehl. Since the initial publication, it has been continuously updated and expanded. Here, extensive and reproducible benchmarks are presented that compare segemehl to state-of-the-art read aligners on various data sets. 
The results indicate that it is not only more sensitive in finding the optimal alignment with respect to the unit edit distance but also very specific compared to most commonly used alternative read mappers. These advantages are observable for both real and simulated reads and are largely independent of the read length and sequencing technology, but they come at the cost of higher running time and memory consumption. Second, the split-read extension of segemehl, presented by Hoffmann, enables the mapping of RNA-seq data, a computationally more difficult form of the mapping task due to the occurrence of splicing. Here, the novel tool lack is presented, which aims to recover missed RNA-seq read alignments using de novo splice junction information. It performs very well in benchmarks and may thus be a beneficial extension to RNA-seq analysis pipelines. Third, a novel method is introduced that facilitates the mapping of bisulfite-treated sequencing data. This protocol is considered the gold standard in genome-wide studies of DNA methylation, one of the major epigenetic modifications in animals and plants. The treatment of DNA with sodium bisulfite selectively converts unmethylated cytosines to uracils, while methylated ones remain unchanged. The bisulfite extension developed here performs seed searches on a collapsed alphabet followed by bisulfite-sensitive dynamic programming alignments. Thus, it is insensitive to bisulfite-related mismatches and does not rely on post-processing, in contrast to other methods. In comparison to state-of-the-art tools, this method achieves significantly higher sensitivities and is time-competitive in mapping millions of sequencing reads to vertebrate genomes. Remarkably, the increase in sensitivity does not come at the cost of decreased specificity and thus may ultimately result in better performance in calling the methylation rate. 
Lastly, the potential of mapping strategies for de novo genome assemblies is demonstrated with the introduction of a new guided assembly procedure, termed "crystallization". It incorporates mapping as a major component and uses additional information (e.g., annotation) as a guide. With this method, the complete mitochondrial genome of Eulimnogammarus verrucosus has been successfully assembled even though the sequencing library was heavily dominated by nuclear DNA. In summary, this thesis introduces algorithmic methods that significantly improve the analysis of tiling array, DNA-seq, RNA-seq, and MethylC-seq data, and proposes standards for benchmarking NGS read aligners. Moreover, it presents a new guided assembly procedure that has been successfully applied in the de novo assembly of a crustacean mitogenome.
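
The bisulfite mapping idea in the thesis abstract above (a seed search on a collapsed alphabet, followed by a bisulfite-sensitive comparison) can be sketched in a few lines. This is a simplified illustration, not segemehl's implementation: the function names are hypothetical, only one strand is handled, seeds are exact whole-read matches, and a plain ungapped comparison stands in for the dynamic programming alignment. Sequencing reads uracil as T, so a converted cytosine appears as T in the read.

```python
def collapse(seq):
    """Collapse the alphabet by writing every C as T."""
    return seq.upper().replace("C", "T")


def find_seeds(read, reference):
    """Positions where the collapsed read matches the collapsed reference exactly."""
    c_read, c_ref = collapse(read), collapse(reference)
    hits, start = [], c_ref.find(c_read)
    while start != -1:
        hits.append(start)
        start = c_ref.find(c_read, start + 1)
    return hits


def bisulfite_mismatches(read, reference, pos):
    """Count mismatches at `pos`, treating a read T over a reference C as a match.

    A read C over a reference T still counts as a mismatch, since
    bisulfite treatment never turns T into C.
    """
    window = reference[pos:pos + len(read)].upper()
    count = 0
    for r, g in zip(read.upper(), window):
        if r == g or (r == "T" and g == "C"):
            continue
        count += 1
    return count


reference = "AACCGTTACG"
read = "ATTGT"  # hypothetical read: "ACCGT" with both cytosines converted
for pos in find_seeds(read, reference):
    print(pos, bisulfite_mismatches(read, reference, pos))
```

Collapsing both sequences makes the seed search blind to C/T differences, and the second pass restores the asymmetry so that only conversions in the permitted direction are tolerated.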

    A representative density profile of the d-statistic for change in H3K27T histone modification between 0 and 2 hours of retinoic acid treatment for the ENCODE region on chromosome 1

    Copyright information: Taken from "Differential analysis for high density tiling microarray data", http://www.biomedcentral.com/1471-2105/8/359, BMC Bioinformatics 2007;8:359. Published online 24 Sep 2007. PMCID: PMC2231405. The curves of different colors illustrate differential change for the H3K27T modification in exonic (green), intronic (black) and intergenic (blue) regions. The shift of the d-statistic into negative territory for all classes of regions suggests a consistent downward trend for this modification between 0 and 2 hours.

    The histogram summarizes the differential expression profiles in each ENCODE region on each chromosome

    Taken from "Differential analysis for high density tiling microarray data" (BMC Bioinformatics 2007;8:359). Chromosome-region-specific differential expression is observed across the time points, ranging from a 30 percent change on chromosome 8 to no detectable change on chromosome 10. Globally, the highest fraction of differential expression, summarized across all transfrags, is observed between 8–32 hours (53.8 percent). The most statistically significant changes (FDR ≤ 12 percent) are also observed between 8–32 hours.

    D-statistic versus FDR relationship at putative TREs, across the time-series (IGB view)

    Taken from "Differential analysis for high density tiling microarray data" (BMC Bioinformatics 2007;8:359). Examples of enrichment fragments are observed within and upstream of the second intron of the HIC gene (pink). The upstream fragment is possibly un-annotated (UA), in so far as no RefSeq annotation is available. The top four tracks represent the HisH4 p-value graphs at 0 (red), 2 (light-blue), 8 (dark-blue) and 32 (green) hours, scaled appropriately for comparison; the subsequent tracks represent the d-statistic (top) and FDR (bottom) pair for the 0–2 (red), 2–8 (cyan) and 8–32 (blue) hour time intervals. The horizontal lines associated with the FDR data refer to the 5 percent threshold in each case.