7 research outputs found

    Large-scale motif discovery using DNA Gray code and equiprobable oligomers

    Motivation: Finding motifs in genome-scale sets of functional sequences, such as all the promoters in a genome, is a challenging problem. Word-based methods count the occurrences of oligomers to detect over-represented ones. This approach is known to be fast and accurate compared with other methods. However, two problems have hampered the application of such methods to large-scale data. One is the computational cost of clustering similar oligomers, and the other is the bias in the frequency of fixed-length oligomers, which complicates the detection of significant words.
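
    The word-counting step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' method: the function names, the toy sequences, and the uniform background model are all assumptions made here, and the uniform model is exactly the kind of naive baseline that the frequency bias mentioned above makes unreliable.

```python
from collections import Counter

def count_oligomers(sequences, k):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Skip words containing ambiguous bases such as N.
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts

def enrichment(counts, k):
    """Ratio of observed count to the count expected under a uniform
    background (an assumed, deliberately naive null model)."""
    total = sum(counts.values())
    expected = total / 4 ** k
    return {word: c / expected for word, c in counts.items()}

# Hypothetical toy "promoters" sharing a TATA-like word.
promoters = ["TATAAAGG", "CCTATAAA", "GTATAAAC"]
counts = count_oligomers(promoters, 4)
ratios = enrichment(counts, 4)
top_words = sorted(ratios, key=ratios.get, reverse=True)[:3]
```

    On real data the expected count would come from a background model fitted to the genome rather than from the uniform assumption used here.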

    A dictionary based informational genome analysis

    Background: In the post-genomic era, several methods of computational genomics are emerging to understand how information is structured within whole genomes. The literature of the last five years reports several alignment-free methods, which arose as alternative metrics for the dissimilarity of biological sequences. Among these, recent approaches are based on empirical frequencies of DNA k-mers in whole genomes. Results: Any set of words (factors) occurring in a genome provides a genomic dictionary. About sixty genomes were analyzed by means of informational indexes based on genomic dictionaries, where a systemic view replaces a local sequence analysis. A software prototype applying the methodology outlined here carried out computations on genomic data. We computed informational indexes and built genomic dictionaries of different sizes, along with their frequency distributions. The software performed three main tasks: computation of informational indexes, storage of these in a database, and index analysis and visualization. The validation was done by investigating genomes of various organisms. A systematic analysis of genomic repeats of several lengths, which is of keen interest in biology (for example, to compute over-represented functional sequences, such as promoters), was discussed, and suggested a method to define synthetic genetic networks. Conclusions: We introduced a methodology based on dictionaries, and an efficient motif-finding software application for comparative genomics. This approach could be extended along many lines of investigation, namely exported to other contexts of computational genomics, as a basis for the discrimination of genomic pathologies.
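
    The idea of a genomic dictionary with an informational index attached can be sketched as below. The abstract does not say which indexes the authors computed, so the empirical Shannon entropy used here is an assumed example, and the function names are hypothetical.

```python
import math
from collections import Counter

def genomic_dictionary(genome, k):
    """Map each k-mer occurring in the genome to its number of occurrences."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

def empirical_entropy(dictionary):
    """Shannon entropy (in bits) of the k-mer frequency distribution.
    Chosen here only as one plausible informational index."""
    total = sum(dictionary.values())
    return -sum((c / total) * math.log2(c / total) for c in dictionary.values())

# Toy sequence standing in for a genome.
d2 = genomic_dictionary("ACGTACGT", 2)
size = len(d2)                      # number of distinct 2-mers
h1 = empirical_entropy(genomic_dictionary("ACGT", 1))
```

    Comparing dictionary size and entropy across values of k, or across genomes, gives the kind of alignment-free, whole-genome view the abstract describes.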

    Allostery of the flavivirus NS3 helicase and bacterial IGPS studied with molecular dynamics simulations

    2020 Spring. Includes bibliographical references. Allostery is a biochemical phenomenon where the binding of a molecule at one site in a biological macromolecule (e.g. a protein) results in a perturbation of activity or function at another, distinct active site in the macromolecule's structure. Allosteric mechanisms are seen throughout biology and play important roles in cell signaling, enzyme activation, and metabolism regulation as well as genome transcription and replication processes. Biochemical studies have identified allosteric effects for numerous proteins, yet our understanding of the molecular mechanisms underlying allostery is still lacking. Molecular-level insights obtained from all-atom molecular dynamics simulations can drive our understanding of, and further experimentation on, the allosteric mechanisms at play in a protein. This dissertation reports three such studies of allostery using molecular dynamics simulations in conjunction with other methods. Specifically, the first chapter introduces allostery and how computational simulation of proteins can provide insight into the mechanisms of allosteric enzymes. The second and third chapters are foundational studies of the flavivirus non-structural 3 (NS3) helicase. This enzyme hydrolyzes nucleoside triphosphate molecules to power the translocation of the enzyme along single-stranded RNA as well as the unwinding of double-stranded RNA; both the hydrolysis and helicase functions (translocation and unwinding) have allosteric mechanisms where the hydrolysis active site's ligand affects the protein-RNA interactions and bound RNA enhances the hydrolysis activity. Specifically, a bound RNA oligomer is seen to affect the behavior and positioning of waters within the hydrolysis active site, which is hypothesized to originate, in part, from the RNA-dependent conformational states of the RNA-binding loop. Additionally, the substrate states of the NTP hydrolysis reaction cycle are seen to affect protein-RNA interactions, which is hypothesized to drive unidirectional translocation of the enzyme along the RNA polymer. Finally, chapter four introduces a novel method to study the biophysical coupling between two active sites in a protein. The short-ranged residue-residue interactions within the protein's three-dimensional structure are used to identify paths that connect the two active sites. This method is used to highlight the paths and residue-residue interactions that are important to the allosteric enhancement observed for the Thermotoga maritima imidazole glycerol phosphate synthase (IGPS) protein. Results from this new quantitative analysis have provided novel insights into the allosteric paths of IGPS. For both the NS3 and IGPS proteins, results presented in this dissertation have highlighted structural regions that may be targeted for small-molecule inhibition or mutagenesis studies. Towards this end, future studies of both allosteric proteins as well as broader impacts of the presented research are discussed in the final chapter.

    Computational analysis of a candidate region for psychosis


    New approaches to unveil the Transcriptional landscape of dopaminergic neurons

    Recent advances in studying the mammalian transcriptome have raised new questions about how genes are organized and what the function of noncoding RNAs is. Furthermore, the discovery of large amounts of poly(A)- transcripts and antisense transcription proved that a portion of the transcriptome has still to be characterized. The complex anatomo-functional organization of the brain has prevented a comprehensive analysis of the transcriptional landscape of this tissue. New techniques must be developed to approach neuronal heterogeneity. In this study we combined Laser Capture Microdissection (LCM) and nanoCAGE, based on Cap Analysis of Gene Expression (CAGE), to describe expressed genes and map their transcription start sites (TSS) in two specific populations, A9 and A10, of mouse mesencephalic dopaminergic cells. Although they share common dopaminergic marker genes, these two populations are part of different midbrain anatomical structures, substantia nigra (SN) for A9 and ventral tegmental area (VTA) for A10, project to relatively distinct areas, participate in distinct ascending dopaminergic pathways, and exhibit different electrophysiological properties and different susceptibility to neurodegeneration in Parkinson's disease. Specific neurons were identified by the expression of Green Fluorescent Protein driven by a cell-type-specific promoter in transgenic mice. High-quality RNAs were purified from 1000-2500 cells collected by LCM. We adapted the CAGE technique to analyze limiting amounts of RNAs (nanoCAGE). We took advantage of the cap-switching properties of the reverse transcriptase to specifically tag the 5' end of transcripts with a sequence containing a class III restriction site for EcoP15I. By creating 32 bp 5' tags, we considerably improved the TSS mapping rate on the genome. A semi-suppressive PCR strategy was used to prevent primer-dimer formation. The use of random priming in the first-strand synthesis allowed the capture of poly(A)- RNAs. 5' tags were sequenced with the Illumina-Solexa platform. Here we show that this new nanoCAGE technology ensures a true high-throughput coverage of the transcriptome of a small number of identified neurons and can be used as an effective means for gene discovery among noncoding RNAs, to uncover putative alternative promoters associated with variants of protein-coding transcripts, and to detect potentially regulatory antisense transcripts. Further experimental validation by 5' RACE (Rapid Amplification of cDNA Ends) and RT-PCR on a few candidate genes confirmed the existence in vivo of alternative TSS for key regulatory genes involved in specifying and maintaining the dopaminergic phenotype of these neurons, such as α-synuclein (Snca), dopamine transporter (Dat), vesicular monoamine transporter 2 (Vmat2), and catechol-O-methyltransferase (Comt). Furthermore, the differential expression of an antisense transcript overlapping the polyubiquitin (Ubc) gene was detected as a potentially interesting candidate accounting for differences in ubiquitin-proteasome system (UPS) function in the two neuron populations. The potential implications of these newly discovered alternative promoters and transcripts are discussed, considering also the potential consequences for the corresponding protein isoforms.

    Genome Informatics for High-Throughput Sequencing Data Analysis: Methods and Applications

    This thesis introduces three different algorithmic and statistical strategies for the analysis of high-throughput sequencing data. First, we introduce a heuristic method based on enhanced suffix arrays to map short sequences to larger reference genomes. The algorithm builds on the idea of an error-tolerant traversal of the suffix array for the reference genome in conjunction with the concept of matching statistics introduced by Chang and a bitvector-based alignment algorithm proposed by Myers. The algorithm supports paired-end and mate-pair alignments, and the implementation offers methods for primer detection and for primer and poly-A trimming. In our own benchmarks as well as independent benchmarks, this tool outcompetes other currently available tools with respect to sensitivity and specificity in simulated and real data sets for a large number of sequencing protocols. Second, we introduce a novel dynamic programming algorithm for the spliced alignment problem. The advantage of this algorithm is its capability to detect not only collinear splice events, i.e. local splice events on the same genomic strand, but also circular and other non-collinear splice events. This succinct and simple algorithm handles all these cases at the same time with high accuracy. While it is on par with other state-of-the-art methods for collinear splice events, it outcompetes other tools for many non-collinear splice events. The application of this method to publicly available sequencing data led to the identification of a novel isoform of the tumor suppressor gene p53. Since this gene is one of the best studied genes in the human genome, this finding is quite remarkable and suggests that the application of our algorithm could help to identify a plethora of novel isoforms and genes. Third, we present a data-adaptive method to call single nucleotide variations (SNVs) from aligned high-throughput sequencing reads. We demonstrate that our method, based on empirical log-likelihoods, automatically adjusts to the quality of a sequencing experiment and thus renders a "decision" on when to call an SNV. In our simulations this method is on par with current state-of-the-art tools. Finally, we present biological results that have been obtained using the special features of the presented alignment algorithm.

    A complex systems approach to education in Switzerland

    The insights gained from the study of complex systems in biological, social, and engineered systems enable us not only to observe and understand, but also to actively design systems capable of coping successfully with complex and dynamically changing situations. The methods and mindset required for this approach have been applied to educational systems with their diverse levels of scale and complexity. Based on the general case made by Yaneer Bar-Yam, this paper applies the complex systems approach to the educational system in Switzerland. It confirms that the complex systems approach is valid. Indeed, many recommendations made for the general case have already been implemented in the Swiss education system. To address existing problems and difficulties, further steps are recommended. This paper contributes to the further establishment of the complex systems approach by shedding light on an area which concerns us all, which is a frequent topic of discussion and dispute among politicians and the public, where billions of dollars have been spent without achieving the desired results, and where it is difficult to directly derive consequences from actions taken. The analysis of the education system's different levels, their complexity and scale will clarify how such a dynamic system should be approached, and how it can be guided towards the desired performance.