853 research outputs found

    Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions

    Background: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered. Results: To improve this situation, we have quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to biomedically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-κB, ISGF3, IRF1, STAT1), obesity and lipid metabolism (PPAR, SREBP, HNF4), and the regulation of steroidogenic (SF-1) and cell cycle (E2F) gene expression. We have also gained extra specificity using a method, entitled SiteGA, which takes into account structural interactions within TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies. To ensure higher confidence in our approach, we applied resampling (jackknife and bootstrap) tests for the comparison; optimized PWMs and SiteGA showed similar recognition performance. We then applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for binding sites (BSs) using the web tool, SiteGA. Analysis of dependencies between close and distant LPDs revealed by SiteGA models has shown that the most significant correlations are between close LPDs and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which span both core and flanking regions. Applying SiteGA and optimized PWM models together substantially reduced false positives, at least at higher stringencies. Conclusion: Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.
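As a rough illustration of the PWM scoring this abstract takes as its baseline, the following Python sketch builds a log-odds matrix from a few aligned sites and scans a sequence with it. It is not the authors' optimized PWM procedure or SiteGA: the function names, the pseudocount, the uniform background of 0.25, and the example sites are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites (hypothetical helper)."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def scan(sequence, pwm):
    """Return (offset, score) for every window of the sequence."""
    width = len(pwm)
    return [
        (i, sum(pwm[j][sequence[i + j]] for j in range(width)))
        for i in range(len(sequence) - width + 1)
    ]

def best_hit(sequence, pwm):
    """Return the (offset, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])

if __name__ == "__main__":
    # Made-up example sites, not data from the paper.
    example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
    offset, score = best_hit("CCCCTGACTCACCCC", build_pwm(example_sites))
    print(offset, round(score, 2))
```

A fixed score threshold on such a matrix is what produces the false positives the abstract mentions; the paper's contribution is tuning the matrix length and position and adding dinucleotide dependencies, which this sketch does not attempt.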

    Computational representation and discovery of transcription factor binding sites

    Thesis by compendium of publications. Understanding how, when, and where proteins are produced has been one of the major challenges in molecular biology. Studies of the control of gene expression are essential for a better understanding of protein synthesis. Gene regulation is a highly controlled process that starts with DNA transcription. This process operates at the level of genes, the basic units of heredity, which are copied into primary ribonucleic acid (RNA). This first step is controlled by the binding of specific proteins, called transcription factors (TFs), to a DNA (deoxyribonucleic acid) sequence in the regulatory region of the gene. These DNA sequences are known as binding sites (BSs) and are specific to each protein; the binding of a transcription factor to its binding site initiates transcription. Binding site motifs are usually very short (5 to 20 bp long) and highly degenerate. These sequences are expected to occur at random every few hundred base pairs. In addition, a TF can bind to different sites. Because of this high variability, it is difficult to establish a consensus sequence. The study and identification of binding sites is therefore important for clarifying the control of gene expression. Because of the importance of identifying binding site sequences, projects such as ENCODE (Encyclopedia of DNA Elements) have dedicated efforts to mapping binding sites for a large set of transcription factors in order to identify regulatory regions. Access to genomic sequences and advances in gene expression analysis technologies have also enabled the development of computational methods for motif finding. Thanks to these advances, a large number of algorithms have been applied in recent years to motif finding in prokaryotes and simple eukaryotes. Despite the simplicity of these organisms, the rate of false positives is high relative to true positives, so studying more complex organisms requires more sensitive methods. In this thesis, we have approached the problem of binding site detection from another angle. We have developed a toolkit for binding motif detection based on linear and non-linear models. First, we characterized binding sites using two different approaches. The first is based on the information contained in each binding site position. The second is based on the covariance model of an aligned set of binding site sequences. From these motif characterizations, we have proposed a new set of computational methods to detect binding sites.
First, a new method was developed based on a parametric uncertainty measure (Rényi entropy). This detection algorithm evaluates the variation in the total Rényi entropy of a set of sequences when a candidate sequence is assumed to be a true binding site belonging to the set. This method was found to perform especially well on transcription factors whose binding sites show little or no correlation among positions. The correlation among binding site positions was considered through a linear model, Q-residuals, and two non-linear models, alpha-Divergence and SIGMA. Q-residuals is a novel motif finding method which constructs a subspace based on the covariance of numerical DNA sequences. When the number of available sequences was small, Q-residuals performed significantly better and faster than all the other methodologies. Alpha-Divergence is based on the variation of the total parametric divergence in a set of aligned sequences with binding evidence when a candidate sequence is added. Given an optimal q-value, alpha-Divergence performed better than the other methodologies for most of the transcription factor binding sites studied. Finally, a new computational tool, SIGMA, was developed as a trade-off between the good generalisation properties of pure entropy methods and the ability of position-dependency metrics to improve detection power. In approximately 70% of the cases considered, SIGMA exhibited better performance, at comparable levels of computational resources, than the methods with which it was compared. This toolkit and the models for the detection of a set of transcription factor binding sites (TFBSs) have been included in an R package called MEET. Postprint (published version)
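The entropy-based detection idea in this abstract (score a candidate by how much the total Rényi entropy of the aligned set changes when the candidate is added) can be sketched in Python as follows. This is only an assumed reading of the abstract, not the MEET package's implementation or API: the per-column independence, the order q = 2, the pseudocount, and all function names are illustrative choices.

```python
import math
from collections import Counter

BASES = "ACGT"

def renyi_entropy(probs, q):
    """Rényi entropy of order q (q > 0, q != 1), in bits."""
    return math.log2(sum(p ** q for p in probs if p > 0)) / (1.0 - q)

def total_entropy(sites, q=2.0, pseudocount=0.5):
    """Sum of per-column Rényi entropies over an alignment of equal-length sites."""
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        denom = len(column) + pseudocount * len(BASES)
        probs = [(counts[b] + pseudocount) / denom for b in BASES]
        total += renyi_entropy(probs, q)
    return total

def entropy_shift(sites, candidate, q=2.0):
    """Change in total entropy when the candidate is added to the set.

    A small (or negative) shift means the candidate resembles the known sites.
    """
    return total_entropy(sites + [candidate], q) - total_entropy(sites, q)
```

In use, one would slide a window over a sequence, compute `entropy_shift` for each window, and flag windows whose shift falls below a chosen threshold. Because columns are treated independently here, the sketch corresponds to the case the abstract describes as having no correlation among positions.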

    Most transcription factor binding sites are in a few mosaic classes of the human genome

    Background: Many algorithms for finding transcription factor binding sites have concentrated on the characterisation of the binding site itself, and these algorithms lead to a large number of false positive sites. The DNA sequence which does not bind has been modeled only to the extent necessary to complement this formulation. Results: We find that the human genome may be described by 19 pairs of mosaic classes, each defined by its base frequencies (or more precisely by the frequencies of doublets), so that typically a run of 10 to 100 bases belongs to the same class. Most experimentally verified binding sites are in the same four pairs of classes. In our sample of seventeen transcription factors, taken from different families of transcription factors, the average proportion of sites in this subset of classes was 75%, with values for individual factors ranging from 48% to 98%. By contrast, these same classes contain only 26% of the bases of the genome and only 31% of occurrences of the motifs of these factors, that is, places where one might expect the factors to bind. These results are not a consequence of the class composition in promoter regions. Conclusions: This method of analysis will help to find transcription factor binding sites and assist with the problem of false positives. These results also imply a profound difference between the mosaic classes

    Selected Works in Bioinformatics

    This book consists of nine chapters covering a variety of bioinformatics subjects, ranging from database resources for protein allergens, unravelling genetic determinants of complex disorders, characterization and prediction of regulatory motifs, computational methods for identifying the best classifiers and key disease genes in large-scale transcriptomic and proteomic experiments, functional characterization of inherently unfolded proteins/regions, protein interaction networks and flexible protein-protein docking. The computational algorithms are in general presented in a way that is accessible to advanced undergraduate students, graduate students and researchers in molecular biology and genetics. The book should also serve as a stepping stone for mathematicians, biostatisticians, and computational scientists to cross their academic boundaries into the dynamic and ever-expanding field of bioinformatics.

    Evaluation of protein microarray technology for tumor autoantibody screening in colon cancer

    Colorectal cancer is the third most common cancer type worldwide, with increasing incidence in developed countries. Progression of the disease takes several years, and early detection and diagnosis following population-based screening have increased the survival rate considerably. The screening methods in clinical use have drawbacks: the faecal occult blood test (FOBT) lacks sensitivity, and colonoscopy is unpleasant and invasive. Therefore, serum-based minimally invasive methods are in great demand. Specific molecular signatures in the different stages of tumorigenesis, and the tumor-associated antigens that result from them, are expected to serve as biomarkers in cancer diagnostics. Tumor-associated antigens such as CEA and CA 19.9 are in clinical use but exhibit low sensitivity and specificity. Moreover, many studies have shown that panels of tumor biomarkers detect cancer with higher sensitivity and specificity than individual biomarkers. It is recognized that abnormally expressed proteins in tumors are antigenic and are recognized by the immune system, which consequently produces tumor autoantibodies (TAAs). TAAs in combination with protein microarray technology are a promising approach for tumor biomarker discovery. However, due to the complex nature of proteins, protein microarray experiments have low reproducibility compared with DNA microarrays and require careful optimization. This thesis presents screening for tumor autoantibodies using colon cancer plasma samples from healthy control, low-risk polyp, high-risk polyp, and colon carcinoma groups, utilizing protein microarrays containing several thousand proteins, with the aim of identifying specific TAAs for generating a candidate marker array for subsequent validation in a larger sample set. 
Due to the low performance of conventional assay protocols, several optimizations were carried out to establish a standard operating procedure for this particular type of protein microarray. Several aspects important for protein microarray processing were addressed and tested against possible alternatives, such as protein microarray surfaces, blocking and buffer chemistries, and the reaction conditions of the assays. Most importantly, the possibility of using purified IgG for tumor autoantibody marker screening was tested. As a result, several changes to the protocol were made: probing in rotating chambers rather than horizontal humidity chambers, extension of sample incubation from 2 hours to 4 hours, addition of milk powder to samples, optimization of detection antibody dilutions, replacement of the detergent Tween20 with Triton X-100 in buffers, and use of purified IgG from samples rather than serum. Tumor autoantibody candidate marker screening performance was significantly improved over previous screening with respect to sensitivity and specificity, increasing from about 54% to 97% for distinguishing patients from controls using 25 greedy-pairs gene selection criteria for class prediction. With this improved protocol, 632 classifier clones representing 593 genes were deduced from class prediction analyses as the most predictive TAAs. Additionally, 100 genes were selected from published literature on screening for tumor autoantibodies in colorectal cancer using protein microarray technology. This total of 732 clones will comprise the candidate marker array and will be applied in future studies for validation with 384 samples from 4 sample groups: healthy controls, low-risk polyps, high-risk polyps and colon carcinoma.

    Interpretable Machine Learning Methods for Prediction and Analysis of Genome Regulation in 3D

    With the development of chromosome conformation capture-based techniques, we now know that chromatin is packed in three-dimensional (3D) space inside the cell nucleus. Changes in the 3D chromatin architecture have already been implicated in diseases such as cancer. Thus, a better understanding of this 3D conformation is of interest to help enhance our comprehension of the complex, multipronged regulatory mechanisms of the genome. The work described in this dissertation largely focuses on development and application of interpretable machine learning methods for prediction and analysis of long-range genomic interactions output from chromatin interaction experiments. In the first part, we demonstrate that the genetic sequence information at the genomic loci is predictive of the long-range interactions of a particular locus of interest (LoI). For example, the genetic sequence information at and around enhancers can help predict whether it interacts with a promoter region of interest. This is achieved by building string kernel-based support vector classifiers together with two novel, intuitive visualization methods. These models suggest a potential general role of short tandem repeat motifs in the 3D genome organization. But, the insights gained out of these models are still coarse-grained. To this end, we devised a machine learning method, called CoMIK for Conformal Multi-Instance Kernels, capable of providing more fine-grained insights. When comparing sequences of variable length in the supervised learning setting, CoMIK can not only identify the features important for classification but also locate them within the sequence. Such precise identification of important segments of the whole sequence can help in gaining de novo insights into any role played by the intervening chromatin towards long-range interactions. 
Although CoMIK primarily uses only genetic sequence information, it can also simultaneously utilize other information modalities, such as the numerous functional genomics data, if available. The second part describes our pipeline, pHDee, for easy manipulation of large amounts of 3D genomics data. We used the pipeline to analyze HiChIP experimental data in order to study the 3D architectural changes in Ewing sarcoma (EWS), a rare cancer affecting adolescents. In particular, HiChIP data for two experimental conditions, doxycycline-treated and untreated, and for primary tumor samples are analyzed. We demonstrate that pHDee facilitates processing and easy integration of large amounts of 3D genomics data analysis together with other data-intensive bioinformatics analyses.
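The abstract does not say which string kernel the classifiers use. As one common example of the family, a k-mer spectrum kernel can be sketched in Python as below; the choice of kernel, the value of k, and the function names are assumptions for illustration and are not taken from the dissertation or from CoMIK.

```python
from collections import Counter

def kmer_counts(sequence, k=3):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the two k-mer count vectors (the spectrum kernel)."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers absent from b, so only shared k-mers contribute.
    return sum(count * b[kmer] for kmer, count in a.items())

def gram_matrix(sequences, k=3):
    """Pairwise kernel values, usable as a precomputed kernel for an SVM."""
    return [[spectrum_kernel(s, t, k) for t in sequences] for s in sequences]
```

Because the kernel only compares k-mer counts, it handles sequences of different lengths, which is the setting the abstract describes. It says nothing about where in the sequence the informative k-mers lie; localizing them is the gap the abstract says CoMIK addresses.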

    Bioinformatics

    This book is divided into different research areas relevant to bioinformatics, such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics and intelligent data analysis. Each section of the book introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research presented here.