92 research outputs found

    Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment

    BACKGROUND: This study analyzes the predictions of a number of promoter predictors on the ENCODE regions of the human genome as part of the ENCODE Genome Annotation Assessment Project (EGASP). The systems analyzed operate on various principles, and we assessed the effectiveness of different conceptual strategies used to correlate produced promoter predictions with the manually annotated 5' gene ends. RESULTS: The predictions were assessed relative to the manual HAVANA annotation of the 5' gene ends. These 5' gene ends were used as the estimated reference transcription start sites. With the maximum allowed distance for predictions of 1,000 nucleotides from the reference transcription start sites, the sensitivity of predictors was in the range 32% to 56%, while the positive predictive value was in the range 79% to 93%. The average distance mismatch of predictions from the reference transcription start sites was in the range 259 to 305 nucleotides. At the same time, using transcription start site estimates from DBTSS and H-Invitational databases as promoter predictions, we obtained a sensitivity of 58%, a positive predictive value of 92%, and an average distance from the annotated transcription start sites of 117 nucleotides. In this experiment, the best performing promoter predictors were those that combined promoter prediction with gene prediction. The main reason for this is the reduced promoter search space that resulted in smaller numbers of false positive predictions. CONCLUSION: The main finding, now supported by comprehensive data, is that the accuracy of human promoter predictors for high-throughput annotation purposes can be significantly improved if promoter prediction is combined with gene prediction. Based on the lessons learned in this experiment, we propose a framework for the preparation of the next similar promoter prediction assessment.
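    The distance-tolerant scoring described in this abstract can be illustrated with a minimal sketch. It is not the EGASP evaluation code: the function names (`nearest_distance`, `evaluate`) are hypothetical, positions are plain integer coordinates on a single strand, and the exact counting rules used in the study may differ. Here a reference TSS counts as found if any prediction lies within `max_dist` nucleotides of it, and a prediction counts as correct if it lies within `max_dist` of any reference TSS.

```python
from bisect import bisect_left


def nearest_distance(sorted_positions, x):
    """Distance from x to the closest value in a sorted list (inf if the list is empty)."""
    if not sorted_positions:
        return float("inf")
    i = bisect_left(sorted_positions, x)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_positions[max(i - 1, 0): i + 1]
    return min(abs(x - c) for c in candidates)


def evaluate(predictions, reference_tss, max_dist=1000):
    """Return (sensitivity, positive predictive value, mean distance of correct predictions)."""
    preds = sorted(predictions)
    refs = sorted(reference_tss)

    # Reference TSSs that have at least one prediction within the tolerance.
    found_refs = [r for r in refs if nearest_distance(preds, r) <= max_dist]
    # Predictions that fall within the tolerance of some reference TSS.
    correct_preds = [p for p in preds if nearest_distance(refs, p) <= max_dist]

    sensitivity = len(found_refs) / len(refs) if refs else 0.0
    ppv = len(correct_preds) / len(preds) if preds else 0.0
    if correct_preds:
        mean_dist = sum(nearest_distance(refs, p) for p in correct_preds) / len(correct_preds)
    else:
        mean_dist = float("nan")
    return sensitivity, ppv, mean_dist


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    refs = [1000, 5000, 20000, 40000]
    preds = [1200, 5900, 12000, 20010]
    print(evaluate(preds, refs))
```

    Restricting `predictions` to candidates near predicted gene starts shrinks the search space, which is one way to see why combining promoter prediction with gene prediction can raise the positive predictive value.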

    High Sensitivity TSS Prediction: Estimates of Locations Where TSS Cannot Occur

    Although transcription in mammalian genomes can initiate from various genomic positions (e.g., 3′UTR, coding exons, etc.), most locations on genomes are not prone to transcription initiation. It is of practical and theoretical interest to be able to estimate such collections of non-TSS locations (NTLs). The identification of large portions of NTLs can help focus the search for TSS locations and thus contribute to promoter and gene finding. It can help in the assessment of 5′ completeness of expressed sequences and contribute to more successful experimental designs, as well as to more accurate gene annotation. Using comprehensive collections of Cap Analysis of Gene Expression (CAGE) and other transcript data from mouse and human genomes, we developed a methodology that allows us, by performing computational TSS prediction with very high sensitivity, to annotate, with high accuracy and in a strand-specific manner, locations of mammalian genomes that are highly unlikely to harbor transcription start sites (TSSs). The properties of the immediate genomic neighborhood of 98,682 accurately determined mouse and 113,814 human TSSs are used to determine features that distinguish genomic transcription initiation locations from those that are not likely to initiate transcription. In our algorithm we utilize various constraining properties of features identified in the upstream and downstream regions around TSSs, as well as statistical analyses of these surrounding regions.

    A Bayesian system for modeling promoter structure: A case study of histone promoters

    Ph.D. (Doctor of Philosophy)

    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet set {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of the numerical representation of a DNA sequence affects how well the biological properties can be reflected in the numerical domain for the detection and identification of the characteristics of special regions of interest within the DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions on the relative merits and demerits of the various methods, experimental results and possible future developments have also been included. Another area of research focus is promoter prediction in human (Homo sapiens) DNA sequences with a neural network based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases the prediction reliability and leads to erroneous results in gene annotation. To improve the prediction accuracy and reliability, DigiPromPred, a numerical representation based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to be able to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that our proposed system maintains a good balance between prediction accuracy and reliability.
To reduce the system architecture and computational complexity compared to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to be able to predict promoters with a sensitivity of 87%, 87%, and 99% while reducing the false prediction rate for non-promoter sequences with a specificity of 92%, 94%, and 99% for Human, Drosophila, and Arabidopsis sequences, respectively, with reconfigurable capability compared to the existing system.
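As an illustration of what a DNA numerical representation looks like, here is a minimal sketch of one widely used scheme, the binary indicator (Voss) mapping, which turns a sequence into four 0/1 signals, one per base. The abstract does not say which representations DigiPromPred or SDigiPromPred use, so this choice and the function name `indicator_representation` are assumptions made purely for illustration.

```python
def indicator_representation(seq):
    """Map a DNA string to four binary indicator sequences (Voss-style).

    Returns a dict keyed by base; each value is a list with 1 where the
    base occurs and 0 elsewhere. Symbols outside A/C/G/T (e.g. N) yield 0
    in all four sequences.
    """
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in "ACGT"}


if __name__ == "__main__":
    rep = indicator_representation("ACGTTA")
    for base, signal in rep.items():
        print(base, signal)
```

Each of the four signals can then be fed to standard signal processing tools or used as classifier input features; other representations trade this sparsity for a single numeric track per sequence.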

    Identification and characterization of TNFalpha responsive genes in human breast cancer cells

    One of the hallmarks of cancer is the escape of the transformed cells from apoptosis. Therefore, the identification of survival genes that allow cancer cells to circumvent programmed cell death could provide new diagnostic markers as well as targets for therapeutic intervention. A well-known transcription factor regulating the balance between pro- and anti-apoptotic factors is NF-kappaB, which is strongly induced by tumor necrosis factor alpha (TNFalpha). When cells are stimulated by TNFalpha, their response is biphasic, with an initial NF-kappaB induction of survival genes that is overridden by the subsequent activation of initiator caspases triggering apoptosis. By combining gene trap mutagenesis with site-specific recombination, a strategy was developed that enriches for genes induced by TNFalpha in the human breast cancer cell line MCF-7. The strategy relies on a one-way gene expression switch based on Cre/loxP-mediated recombination, which uncouples the expression of a marker gene from the trapped cellular promoter, thereby enabling the recovery of genes that are only transiently induced by TNFalpha. The marker gene used in these experiments was a dominant negative variant of the TNFalpha receptor-associated protein FADD (dnFADD), which blocks the apoptotic branch of the TNFalpha-induced signaling pathway. Initial experiments indicated that MCF-7 cells expressing high levels of dnFADD were insensitive to TNFalpha-induced apoptosis and therefore suitable for the installation of a one-way gene expression switch susceptible to Cre/loxP-mediated recombination. An MCF-7 reporter clone harboring the recombinase-dependent gene expression switch was infected with the gene trap retrovirus U3Cre, which inserts the Cre recombinase gene into a large collection of chromosomal sites. Insertion of Cre downstream of an active cellular promoter induces dnFADD expression from the gene expression switch, enabling the cells to block TNFalpha-triggered apoptosis.
From a gene trap integration library containing approximately 2,000,000 unique proviral integrations, 69 unique TNFalpha-inducible gene trap insertion sites were recovered in a two-step selection procedure. Sequencing of the genomic regions adjacent to the insertion sites, which were obtained by inverse PCR (gene trap sequence tags, GTSTs), and database analysis revealed that 42% of the GTSTs belonged to annotated genes, 13% to known cDNAs with open reading frames, 17% to Genscan-predicted genes, 9% to ESTs, 9% to repetitive sequences and 10% to unannotated genomic sequence. Overall, 44% of the annotated genes recovered in this screen were directly or indirectly related to cancer, indicating that the gene trap strategy developed here is suitable for the identification of cancer-relevant genes. Analysis of the expression patterns of the trapped and annotated genes in wild-type cells revealed that 19 out of 24 genes were either up- or down-regulated by a factor of at least 1.45 by TNFalpha. A large fraction of the gene trap insertions were located upstream, in introns or in opposite orientation to annotated transcripts, indicating that the strategy efficiently recovers non-coding RNAs (ncRNAs). While the biological significance of these transcripts still needs to be elucidated, they fall into two main categories. The first category includes gene trap insertions upstream of genes, which could represent either regulatory RNAs interacting with promoter elements or transcripts driven by bidirectional promoters. The second includes inverse-orientation gene trap insertions in introns of annotated genes, suggesting the presence of natural antisense transcripts (NATs). Interestingly, more than 50% of all antisense integrations are located downstream of transcription start sites predicted by different algorithms, supporting the existence of RNAs transcribed from the corresponding genomic regions.
Intronic integrations on the coding strand could be derived from cryptic splicing, alternative promoter usage or additional, so far uncharacterized transcripts. Preliminary functional analysis of two genes recovered in this screen, encoding the transcription factor ZFP67 and the FLJ14451 protein, revealed that FLJ14451 but not ZFP67 inhibited anchorage-independent growth in soft agar, suggesting that FLJ14451 might have some tumor suppressor functions. In summary, besides identifying a putative tumor suppressor protein, the present experiments have shown that gene trapping is useful in identifying non-coding transcripts in living cells and may turn out to be the method of choice in characterizing these transcripts, whose functions are still largely unknown.
A central characteristic of cancer is the suppression of the cellular apoptosis program in the transformed cells. Therefore, the identification of survival genes that allow cells to circumvent programmed cell death could provide new diagnostic markers or target molecules for therapeutic intervention. The balance between pro- and anti-apoptotic signals is regulated by NF-kappaB, a well-characterized transcription factor, which in turn is induced by tumor necrosis factor alpha (TNFalpha). Cells stimulated with TNFalpha show a biphasic response, in which survival genes are initially induced before initiator caspases are subsequently activated and trigger apoptosis. To identify TNFalpha-inducible genes in the human breast cancer cell line MCF-7, a strategy was used that is based on a combination of gene trap mutagenesis and site-specific recombination. This strategy makes use of an irreversible molecular switch that uncouples the expression of a marker gene from the cellular promoter that activates the gene trap.
The marker gene used was a dominant negative variant of the TNFalpha receptor-associated protein FADD ("Fas-associated death domain protein"; dnFADD), which blocks the pro-apoptotic branch of the TNFalpha receptor signaling pathway. An MCF-7 cell clone carrying this recombinase-activatable molecular switch was transduced with the gene trap retrovirus U3Cre, which can insert the Cre recombinase gene into a large number of different chromosomal loci. If Cre thereby comes under the control of an active cellular promoter, dnFADD expression is induced and the induction of apoptosis by TNFalpha is thus prevented. From a gene trap integration library with approximately 2,000,000 independent proviral integrations, 69 cell clones with gene trap integrations in TNFalpha-induced genes were obtained after a two-step selection. Sequencing of the genomic regions adjacent to the gene trap proviruses, amplified by inverse PCR, and subsequent database analyses yielded the following distribution of gene trap integrations: 42% were in annotated genes, 13% in genes with open reading frames of unknown function, 17% in hypothetical genes, 9% in ESTs (expressed sequence tags), 9% in repetitive elements and 10% in unannotated genomic regions. Of the annotated genes obtained from this screen, 44% could be linked directly or indirectly to cancer. This indicates that the experimental approach developed here is suitable for identifying cancer-relevant genes. Expression analysis of selected annotated genes in wild-type MCF-7 cells showed up- or down-regulation by a factor of at least 1.45 for 19 of the 24 genes examined.
A large proportion of the gene trap integrations were located 5' upstream of genes, in introns or in the opposite orientation to annotated transcripts, which suggests that non-coding RNAs can be identified with the chosen strategy. Although these transcripts and their biological relevance remain to be demonstrated, they can be divided into two categories. The first comprises integrations 5' of genes, which represent either regulatory RNAs that interact with promoter elements or transcripts under the control of bidirectional promoters. The second category, insertions on the non-coding strand within introns, suggests the occurrence of natural antisense transcripts (NATs). Interestingly, more than 50% of all antisense integrations lie 3' of potential transcription start sites predicted by various algorithms. This can be taken as an indication that RNAs transcribed from the corresponding genomic regions most likely exist. Intronic integrations on the coding strand may be transcriptionally activated as a result of cryptic splicing events or by alternative promoters; alternatively, the RNAs "captured" by the gene trap could be additional, previously uncharacterized transcripts. Preliminary functional analyses of two genes, encoding the transcription factor ZFP67 and the protein FLJ14451, showed that FLJ14451, but not ZFP67, can inhibit the anchorage-independent growth of MCF-7 cells in soft agar. This points to a possible tumor suppressor function of the protein.
The experiments carried out in this dissertation not only led to the identification of a potential new tumor suppressor gene, but also showed that gene traps can be a useful tool in the search for non-coding RNAs in living cells and that their use may be the method of choice for identifying such transcripts.

    Using signal processing techniques in promoter prediction

    Master's (Master of Engineering)

    Concept Based Knowledge Discovery from Biomedical Literature

    Philosophiae Doctor (PhD). This thesis introduces and describes novel methods for knowledge discovery and presents a software system that is able to extract information from biomedical literature, review interesting connections between various biomedical concepts and, in so doing, generate new hypotheses. The experimental results obtained by using the methods described in this thesis are compared to currently published results obtained by other methods, and a number of case studies are described. This thesis shows how the technology presented can be integrated with the researcher's own knowledge, experimentation and observations for optimal progression of scientific research. South Africa