7 research outputs found

    Statistical significance of cis-regulatory modules

    BACKGROUND: It is becoming increasingly important for researchers to be able to scan large genomic regions for transcription factor binding sites or for clusters of binding sites that form cis-regulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cis-regulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome-scale scanning. RESULTS: We introduce methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled either as clusters of individual binding sites or as combinations of sites with constrained organization. To determine the statistical significance of module sites, we first need a method to determine the statistical significance of single transcription factor binding site matches. We introduce a straightforward method that uses a database of known promoters to produce data structures from which p-values for binding site matches can be estimated. We next introduce a technique to calculate the statistical significance of the arrangement of binding sites within a module using a max-gap model. If the module scanned for has defined organizational parameters, the probability of the module is corrected to account for organizational constraints. The statistical significance of single site matches and the architecture of sites within the module can be combined to provide an overall estimate of the statistical significance of cis-regulatory module sites. CONCLUSION: The methods introduced in this paper allow for the detection and statistical evaluation of single transcription factor binding sites and cis-regulatory modules. The features described are implemented in the Search Tool for Occurrences of Regulatory Motifs (STORM) and MODSTORM software.
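
    The abstract above mentions estimating p-values for single binding site matches from a database of known promoters. One way such an estimate could work is sketched below: score every window of a background sequence set with a position weight matrix, sort the scores once, and look up the fraction of background scores at least as high as an observed match. This is an illustrative sketch only, not the STORM implementation; the function names, the toy matrix and the +1 pseudocount are assumptions.

```python
from bisect import bisect_left

def score_window(pwm, window):
    """Sum the per-position scores of one window (pwm is a list of dicts)."""
    return sum(column[base] for column, base in zip(pwm, window))

def background_scores(pwm, sequences):
    """Score every window of every background sequence; return sorted scores."""
    width = len(pwm)
    scores = []
    for seq in sequences:
        for i in range(len(seq) - width + 1):
            scores.append(score_window(pwm, seq[i:i + width]))
    scores.sort()
    return scores

def empirical_p_value(sorted_scores, observed):
    """Fraction of background scores >= observed, with a +1 pseudocount."""
    n_at_least = len(sorted_scores) - bisect_left(sorted_scores, observed)
    return (n_at_least + 1) / (len(sorted_scores) + 1)

# Toy two-column matrix that favours the dinucleotide "AC" (hypothetical values).
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
toy_background = background_scores(toy_pwm, ["ACGT", "AAAA"])
toy_p = empirical_p_value(toy_background, score_window(toy_pwm, "AC"))
```

    Sorting the background scores once makes each later lookup a binary search, which is the kind of precomputed data structure that suits repeated scanning.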

    A unified statistical model to support local sequence order independent similarity searching for ligand-binding sites and its application to genome-based drug discovery

    Functional relationships between proteins that do not share global structure similarity can be established by detecting their ligand-binding-site similarity. For a large-scale comparison, it is critical to accurately and efficiently assess the statistical significance of this similarity. Here, we report an efficient statistical model that supports local sequence order independent ligand-binding-site similarity searching. Most existing statistical models only take into account the matching vertices between two sites that are defined by a fixed number of points. In reality, the boundary of the binding site is not known or depends on the bound ligand, which limits these approaches. To address these shortcomings and to perform binding-site mapping on a genome-wide scale, we developed a sequence-order independent profile–profile alignment (SOIPPA) algorithm that is able to detect local similarity between binding sites whose boundaries are not known a priori. The SOIPPA scoring integrates geometric, evolutionary and physical information into a unified framework. However, this imposes a significant challenge in assessing the statistical significance of the similarity because the conventional probability model that is based on fixed-point matching cannot be applied. Here we find that scores for binding-site matching by SOIPPA follow an extreme value distribution (EVD). Benchmark studies show that the EVD model is at least two orders of magnitude faster and more accurate than the non-parametric statistical method in the previous SOIPPA version. Efficient statistical analysis makes it possible to apply SOIPPA to genome-based drug discovery. Consequently, we have applied the approach to the structural genome of Mycobacterium tuberculosis to construct a protein–ligand interaction network. The network reveals highly connected proteins, which represent suitable targets for promiscuous drugs.
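
    To illustrate why an extreme value distribution makes significance estimates cheap, the sketch below fits a Gumbel-type EVD to a sample of background scores and evaluates the upper-tail p-value in closed form. The abstract states only that the scores follow an EVD; the method-of-moments fit, the function names and the sample scores used here are assumptions for illustration, not the SOIPPA procedure.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel(scores):
    """Method-of-moments estimate of the Gumbel location (mu) and scale (beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta

def gumbel_p_value(score, mu, beta):
    """P(S >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    z = (score - mu) / beta
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-math.exp(-z))

# Hypothetical background scores from unrelated site pairs.
sample_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
mu_hat, beta_hat = fit_gumbel(sample_scores)
p_high = gumbel_p_value(10.0, mu_hat, beta_hat)
p_low = gumbel_p_value(2.0, mu_hat, beta_hat)
```

    Once the two parameters are fitted, each p-value costs a single formula evaluation rather than a comparison against a large empirical score sample.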

    Large-Scale Discovery of Promoter Motifs in Drosophila melanogaster

    A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes

    Discovery of Protein Phosphorylation Motifs through Exploratory Data Analysis

    BACKGROUND: The need for efficient algorithms to uncover biologically relevant phosphorylation motifs has become very important with the rapid expansion of the proteomic sequence database along with a plethora of new information on phosphorylation sites. Here we present a novel unsupervised method, called Motif Finder (in short, F-Motif), for identification of phosphorylation motifs. F-Motif uses clustering of sequence information represented by numerical features that exploit the statistical information hidden in some foreground data. The identified motifs are then filtered to find "actual" motifs with statistically significant motif scores. RESULTS AND DISCUSSION: We have applied F-Motif to several new and existing data sets and compared its performance with two well-known state-of-the-art methods. In almost all cases F-Motif could identify all statistically significant motifs extracted by the state-of-the-art methods. More importantly, F-Motif also uncovers several novel motifs. We have demonstrated using clues from the literature that most of these new motifs discovered by F-Motif are indeed novel. We have also found some interesting phenomena. For example, for CK2 kinase, the conserved sites appear only on the right side of S. However, for CDK kinase, the adjacent site on the right of S is conserved with residue P. In addition, three different encoding methods, including a novel position contrast matrix (PCM) and the simplest binary coding, are used, and the ability of F-Motif to discover motifs remains quite robust with respect to encoding schemes. CONCLUSIONS: The iterative algorithm proposed here uses exploratory data analysis to discover motifs from phosphorylated data. The effectiveness of F-Motif has been demonstrated using several real data sets as well as a synthetic data set. The method is quite general in nature and can also be used to find other types of motifs. We have also provided a server for F-Motif at http://f-motif.classcloud.org/, http://bio.classcloud.org/f-motif/ or http://ymu.classcloud.org/f-motif/

    Learning the Language of Biological Sequences

    Learning the language of biological sequences is an appealing challenge for the grammatical inference research field. While some early successes have already been recorded, such as the inference of profile hidden Markov models or stochastic context-free grammars, which are now part of the classical bioinformatics toolbox, it remains a source of open and inspiring problems for grammatical inference, allowing us to test our ideas against real, fundamental applications. As an introduction to this field, we survey here the main ideas and concepts behind the approaches developed in pattern/motif discovery and grammatical inference to characterize biological sequences and their specificities.

    CD8+ T cell epitope-enriched HIV-1-Gag antigens with preserved structure and function

    Control of disease progression in certain HIV-1 infected individuals is often associated with CD8+ T cell responses directed towards Gag-derived epitopes presented on HLA class I molecules. This indicates that such responses play a crucial role in combating virus replication. However, both the large variability of HIV-1 and the diversity of HLA alleles impose a challenge on the elicitation of protective CD8+ T cell responses by vaccination. To address this problem, an algorithm was conceived to generate Gag antigens enriched with patient-derived CD8+ T cell epitopes. Since the function of Gag to produce virus-like particles (VLPs) was deemed important for priming of an adequate CD8+ T cell response, the program excluded all epitopes with budding-deleterious properties. To achieve this, all amino acid substitutions (AAS) that had been identified in the epitope set through mapping them to a Gag reference sequence, were assessed using a trained classifier that considers structural-energy- and sequence-conservation-based features to predict whether each AAS is compatible with budding. These predictions were validated experimentally for over 100 variants, showing a precision of 100% regarding classification of budding competence. Next, epitopes that contain only budding-retaining AAS were assigned a score that considers various customizable epitope-specific properties, like frequencies of HLA class I molecules presenting the epitope in a given population, subtype affiliation, and conservation status. Using a genetic algorithm, as many compatible epitopes as possible were combined into a novel Gag antigen sequence, aiming to maximize their cumulative score. After each round of antigen generation, all previously integrated epitopes were eliminated from the input data set. Thus, in subsequent rounds only the remaining epitopes were used, which resulted in a set of complementary antigens. 
To evaluate the performance of the algorithm, a trivalent set of globally applicable CD8+ T cell epitope-enriched Gag antigens (teeGags1-3) was generated and computationally validated in this thesis. It could be shown that the teeGags are superior to any known, naturally found or in silico generated Gag sequence from previously published work regarding the number and quality of epitopes, as well as the population coverage, defined as the average number of epitopes presented per person. The shape and size of teeGag VLPs were examined biochemically and wildtype-like characteristics were observed for teeGag1 and teeGag3. teeGag2, however, exhibited some aberrant, tubular structures and slightly larger particles, probably due to a set of mutations within the p2 region of Gag. To characterize the increased immunological breadth of the teeGags, a method to directly identify HLA-class-I-presented epitopes was conceived. For this, the conditioned supernatant from cells that produce soluble forms of HLA (sHLA) was used for HLA-affinity chromatography. Peptides from the isolated sHLA complexes were further purified and employed for sequencing through LC-MS/MS analysis. It was shown in this thesis that this method can be used to identify sHLA-restricted peptides. However, the sensitivity has to be further increased to allow examination of the immunological breadth of antigens. In conclusion, with the in silico validated enhanced immunological breadth and the biochemically verified structural conservation, the presented designer teeGags qualify as next-generation vaccine antigens that potentially elicit superior CD8+ T cell responses
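
    The round-based procedure described above (combine as many compatible epitopes as possible, then remove the integrated epitopes before the next round) can be sketched as follows. Note that this sketch swaps in a simple greedy selection by descending score where the thesis uses a genetic algorithm, and that the compatibility rule (epitopes must agree at every shared sequence position), the `Epitope` fields and the example data are all hypothetical.

```python
from typing import NamedTuple

class Epitope(NamedTuple):
    start: int      # 0-based position in the reference sequence (assumed)
    residues: str   # amino acid sequence of the epitope
    score: float    # hypothetical precomputed epitope score

def compatible(epitope, antigen):
    """True if the epitope agrees with every residue already fixed in the antigen."""
    return all(antigen.get(epitope.start + i, aa) == aa
               for i, aa in enumerate(epitope.residues))

def build_antigen(epitopes):
    """Greedy stand-in for the genetic algorithm: try epitopes by descending score."""
    antigen, used = {}, []
    for ep in sorted(epitopes, key=lambda e: e.score, reverse=True):
        if compatible(ep, antigen):
            antigen.update((ep.start + i, aa) for i, aa in enumerate(ep.residues))
            used.append(ep)
    return used

def complementary_sets(epitopes, rounds):
    """Run several rounds, removing integrated epitopes after each round."""
    remaining, result = list(epitopes), []
    for _ in range(rounds):
        used = build_antigen(remaining)
        if not used:
            break
        result.append(used)
        remaining = [ep for ep in remaining if ep not in used]
    return result

example_epitopes = [
    Epitope(0, "AAK", 3.0),
    Epitope(1, "AKL", 2.0),
    Epitope(0, "AAR", 1.5),  # conflicts with "AAK" at position 2
    Epitope(5, "GG", 1.0),
]
antigen_sets = complementary_sets(example_epitopes, rounds=3)
```

    In this toy input the conflicting variant "AAR" cannot join the first antigen and is picked up in the second round, which mirrors how successive rounds yield complementary antigens.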

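    Bioinformatische Analysen zur Optimierung von Aptamerbibliotheken: Eine Studie zur Aufklärung bindungsrelevanter Charakteristika in Sequenz und Struktur am Beispiel eines Norovirus-Aptamers [Bioinformatic analyses for the optimization of aptamer libraries: a study elucidating binding-relevant characteristics in sequence and structure, using a norovirus aptamer as an example]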

    As nearly universal binders, aptamers have great application potential in biotechnology and medicine; however, their selection from random sequence libraries yields only suboptimal results. During selection, binding information accumulates in the library that can be used to optimize the procedure. This thesis is devoted to making that information accessible by bioinformatic means. For the initial evaluation of the aptamer sequence data obtained, regression analysis (using n-gram descriptors and affinities) and, at the statistical-symbolic level, pattern search proved to be effective methods. For elucidating the structure of the complex formed by aptamer and target protein, a scoring function was found that allows so-called near-native binding conformations to be identified among the results of a docking simulation. Following this methodological evaluation, an aptamer was selected against the capsid of human norovirus. Based on the sequence data collected, the binding geometry between aptamer and target protein was elucidated by applying the combination of analysis methods laid down in the procedural protocol. The sequence motif identified as binding-relevant in the process can serve as a priori information when generating target-specifically optimized selection libraries.

    Table of contents: List of figures; List of tables; List of formulas; List of abbreviations; Preface
    1 Thematic introduction 1.1 Hypotheses and research questions 1.2 Structure of the thesis
    2 General foundations 2.1 Protein-nucleic acid complexes 2.1.1 Structure of proteins 2.1.2 Structure of nucleic acids 2.1.3 Interactions between proteins and nucleic acids 2.2 Aptamers and how they are obtained 2.2.1 Aptamers as universal binders 2.2.2 The basic aptamer selection procedure 2.2.3 Methodological modifications of the selection procedure
    3 Analysis of the primary and secondary structure of nucleic acids 3.1 Numerical description of nucleic acids 3.1.1 Nucleobase descriptors 3.1.2 Transformation strategies 3.1.3 Directly applicable description concepts 3.2 Design of a strategy for evaluating the description concepts 3.2.1 Description and preprocessing of the data set 3.2.2 Compilation of descriptor sets 3.2.3 Methods used 3.3 Results of the evaluation 3.3.1 Comparison of the description concepts 3.3.2 Plausibility check 3.3.3 Proportionality 3.3.4 Concluding remarks
    4 Pattern search in biological sequences 4.1 Sequence patterns 4.1.1 Definition of sequence patterns 4.1.2 Scoring of concrete pattern hits 4.1.3 Scoring of individual pattern instances 4.1.4 Visualization 4.2 Pattern search algorithm 4.2.1 Searching in suffix trees 4.2.2 Searching the pattern space 4.2.3 Optimization of the search strategy 4.2.4 Ordering of the results 4.2.5 Summary
    5 Analysis of the tertiary structure of protein-nucleic acid complexes 5.1 Overview of scoring models for protein-nucleic acid complexes 5.1.1 Knowledge-based pair potentials 5.1.2 Molecular mechanics scoring 5.1.3 Selection of concepts for further consideration 5.2 Presentation and derivation of the selected scoring models 5.2.1 The SPA-PN potentials 5.2.2 The modified SPA-PN potentials 5.2.3 The ITScore-PR potentials 5.2.4 Molecular mechanics scoring 5.3 Design of the comparison of the scoring models 5.3.1 Selection and presentation of the reference complexes 5.3.2 Generation of decoy structures 5.3.3 Quantification of the structural deviation 5.4 Results of the comparison 5.4.1 Reference complex 4PDB 5.4.2 Reference complex 5CMX 5.4.3 Weighting of the HADDOCK score 5.4.4 Visualization of the scoring 5.5 Summary
    6 Selection and analysis of a norovirus aptamer 6.1 Norovirus as the target structure of aptamer selection 6.1.1 Epidemiological aspects 6.1.2 Structure of norovirus 6.1.3 Detection of norovirus 6.1.4 Norovirus strain used 6.2 Experimental procedure and evaluation 6.2.1 Aptamer selection 6.2.2 Evaluation of the enrichment of aptamer candidates 6.2.3 Experimental verification of binding 6.2.4 Design of a bioinformatic analysis protocol 6.3 Bioinformatic analyses based on primary and secondary structure 6.3.1 Prediction of secondary structures for the aptamer candidates 6.3.2 Investigation of enrichment via cluster analysis 6.3.3 Investigation of the n-gram composition 6.3.4 Carrying out a pattern search 6.3.5 Validation of the pattern hits 6.4 Bioinformatic analysis based on tertiary structure 6.4.1 Determination of the structure of the target protein 6.4.2 Determination of the aptamer structure 6.4.3 Formation of a complex of aptamer and target protein 6.4.4 Assessment of the relevance for the epitopes of the protein surface 6.5 Concepts for the specific enrichment of oligonucleotide libraries 6.5.1 Derivable subspaces of the sequence and structure space 6.5.2 Composition of a library
    7 Concluding remarks 7.1 Summary of the results 7.2 Outlook
    Bibliography
    Versicherun