226 research outputs found

    On the Informativeness of the DNA Promoter Sequences Domain Theory

    The DNA promoter sequences domain theory and database have become popular for testing systems that integrate empirical and analytical learning. This note reports a simple change and reinterpretation of the domain theory in terms of M-of-N concepts, involving no learning, that results in an accuracy of 93.4% on the 106 items of the database. Moreover, an exhaustive search of the space of M-of-N domain theory interpretations indicates that the expected accuracy of a randomly chosen interpretation is 76.5%, and that a maximum accuracy of 97.2% is achieved in 12 cases. This demonstrates the informativeness of the domain theory, without the complications of understanding the interactions between various learning algorithms and the theory. In addition, our results help characterize the difficulty of learning using the DNA promoters theory. Comment: See http://www.jair.org/ for any accompanying file
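To make the M-of-N idea in this abstract concrete, here is a minimal Python sketch, assuming an M-of-N concept means "at least M of N conditions hold". The motif conditions, the toy sequences, and the names `m_of_n`, `accuracy`, and `search_thresholds` are hypothetical illustrations; they are not the promoter domain theory or the 106-item database.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Condition = Callable[[str], bool]


def m_of_n(m: int, conditions: Sequence[Condition]) -> Condition:
    """Return a concept that holds when at least m of the conditions hold."""
    def concept(example: str) -> bool:
        return sum(1 for cond in conditions if cond(example)) >= m
    return concept


def accuracy(concept: Condition, data: Sequence[Tuple[str, bool]]) -> float:
    """Fraction of labelled examples that the concept classifies correctly."""
    return sum(concept(x) == y for x, y in data) / len(data)


# Hypothetical conditions (illustrative motifs only, not the real theory).
CONDITIONS: List[Condition] = [
    lambda s: "TATA" in s,
    lambda s: "TTGAC" in s,
    lambda s: s.startswith("A"),
]

# Hypothetical labelled toy sequences (True = promoter).
TOY_DATA: List[Tuple[str, bool]] = [
    ("ATTGACGGTATA", True),
    ("ACCTATAGGC", True),
    ("GGTTGACCC", False),
    ("CCGGCCGG", False),
]


def search_thresholds() -> Dict[int, float]:
    """Exhaustively try every threshold M for the fixed set of N conditions."""
    return {
        m: accuracy(m_of_n(m, CONDITIONS), TOY_DATA)
        for m in range(1, len(CONDITIONS) + 1)
    }


if __name__ == "__main__":
    for m, acc in search_thresholds().items():
        print(f"{m}-of-{len(CONDITIONS)}: accuracy {acc:.2f}")
```

On this toy data the 2-of-3 interpretation classifies every example correctly, while 1-of-3 and 3-of-3 each misclassify one. The search reported in the abstract covers a much larger space of interpretations, which this sketch does not attempt to reproduce.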

    Explicit probabilistic models for databases and networks

    Recent work in data mining and related areas has highlighted the importance of the statistical assessment of data mining results. Crucial to this endeavour is the choice of a non-trivial null model for the data, to which the found patterns can be contrasted. The most influential null models proposed so far are defined in terms of invariants of the null distribution. Such null models can be used by computation-intensive randomization approaches in estimating the statistical significance of data mining results. Here, we introduce a methodology to construct non-trivial probabilistic models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt models allow for the natural incorporation of prior information. Furthermore, they satisfy a number of desirable properties of previously introduced randomization approaches. Lastly, they also have the benefit that they can be represented explicitly. We argue that our approach can be used for a variety of data types. However, for concreteness, we have chosen to demonstrate it in particular for databases and networks.
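For readers unfamiliar with the MaxEnt principle, the general form of such a model can be sketched as follows. The notation is an assumption made for illustration and is not taken from the abstract: D is a dataset, the f_i are statistics whose expected values c_i encode the prior information, and the lambda_i are Lagrange multipliers.

```latex
% Maximise entropy subject to expectation constraints (assumed notation):
\max_{P} \; -\sum_{D} P(D)\,\log P(D)
\quad \text{subject to} \quad
\sum_{D} P(D)\, f_i(D) = c_i \;\; (i = 1, \dots, k),
\qquad \sum_{D} P(D) = 1 .

% The solution is an exponential-family distribution:
P(D) = \frac{1}{Z(\lambda)} \exp\!\Big( \sum_{i=1}^{k} \lambda_i f_i(D) \Big),
\qquad
Z(\lambda) = \sum_{D} \exp\!\Big( \sum_{i=1}^{k} \lambda_i f_i(D) \Big).
```

The multipliers are chosen so that the constraints hold. This closed form is what allows the model to be written down explicitly rather than only sampled from.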

    Three-dimensional chromatin organisation in human pancreatic islets

    Diabetes is a group of metabolic diseases that affects millions of people. Despite this, little is known about the underlying molecular mechanisms. Diabetes is characterised by impaired blood-glucose regulation that can lead to severe consequences, such as kidney failure and premature death. Pancreatic islets are one of the key tissues for understanding diabetes pathogenesis, as they produce insulin, a hormone central to blood-glucose homeostasis. Our previous work showed that studying epigenomic regulation is key to giving insight into the molecular mechanisms underlying diabetes, as risk-associated genomic variants are enriched at transcriptional regulatory regions named enhancers. To give further insight into pancreatic islet transcriptional regulation, I aimed to decipher the 3D chromatin organisation, an aspect of epigenomic regulation in human pancreatic islets that remained largely unexplored until now. As part of my PhD project I have studied high-resolution chromatin interaction maps that characterise 3D chromatin organisation at different levels, from single interactions between specific pairs of genomic loci to large genomic topological domains known as TADs. These high-resolution chromatin interaction maps, integrated with a large collection of epigenomic datasets, allowed me to describe several aspects of islet 3D chromatin organisation, such as the identification of islet-selective chromatin structures associated with islet-specific gene expression. Moreover, I identified groups of enhancers that gather in 3D space. These 3D enhancer clusters were frequently found in loci key for islet function and highly enriched in diabetes-associated variants. The results of this thesis allow us to have a more accurate picture of the epigenomic regulation in human pancreatic islets and how non-coding diabetes risk variants could be impairing enhancer-promoter communication.

    Multi-criteria-based active learning for named entity recognition

    Master's thesis, Master of Science

    Iterative Random Forests to detect predictive and stable high-order interactions

    Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF), Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.

    A Unified Analytic Framework for Prioritization of Non-Coding Variants of Uncertain Significance in Heritable Breast and Ovarian Cancer

    Background: Sequencing of both healthy and disease singletons yields many novel and low-frequency variants of uncertain significance (VUS). Complete gene and genome sequencing by next-generation sequencing (NGS) significantly increases the number of VUS detected. While prior studies have emphasized protein coding variants, non-coding sequence variants have also been proven to significantly contribute to high penetrance disorders, such as hereditary breast and ovarian cancer (HBOC). We present a strategy for analyzing different functional classes of non-coding variants based on information theory (IT) and prioritizing patients with large intragenic deletions.
    Methods: We captured and enriched for coding and non-coding variants in genes known to harbor mutations that increase HBOC risk. Custom oligonucleotide baits spanning the complete coding, non-coding, and intergenic regions 10 kb up- and downstream of ATM, BRCA1, BRCA2, CDH1, CHEK2, PALB2, and TP53 were synthesized for solution hybridization enrichment. Unique and divergent repetitive sequences were sequenced in 102 high-risk, anonymized patients without identified mutations in BRCA1/2. Aside from protein coding and copy number changes, IT-based sequence analysis was used to identify and prioritize pathogenic non-coding variants that occurred within sequence elements predicted to be recognized by proteins or protein complexes involved in mRNA splicing, transcription, and untranslated region (UTR) binding and structure. This approach was supplemented by in silico and laboratory analysis of UTR structure.
    Results: 15,311 unique variants were identified, of which 245 occurred in coding regions. With the unified IT-framework, 132 variants were identified and 87 functionally significant VUS were further prioritized. An intragenic 32.1 kb interval in BRCA2 that was likely hemizygous was detected in one patient. We also identified 4 stop-gain variants and 3 reading-frame altering exonic insertions/deletions (indels).
    Conclusions: We have presented a strategy for complete gene sequence analysis followed by a unified framework for interpreting non-coding variants that may affect gene expression. This approach distills large numbers of variants detected by NGS to a limited set of variants prioritized as potential deleterious changes.

    On the Origin of the Treponematoses: A Phylogenetic Approach

    For 500 years, controversy has raged around the origin of T. pallidum subsp. pallidum, the bacterium responsible for syphilis. Did Christopher Columbus and his men introduce this pathogen into Renaissance Europe, after contracting it during their voyage to the New World? Or does syphilis have a much older history in the Old World? This paper represents the first attempt to use a phylogenetic approach to solve this question. In addition, it clarifies the evolutionary relationships between the pathogen that causes syphilis and the other T. pallidum subspecies, which cause the neglected tropical diseases yaws and endemic syphilis. Using a collection of pathogenic Treponema strains that is unprecedented in size, we show that yaws appears to be an ancient infection in humans while venereal syphilis arose relatively recently in human history. In addition, the closest relatives of syphilis-causing strains identified in this study were found in South America, providing support for the Columbian theory of syphilis's origin.

    Exploiting loop transformations for the protection of software

    Software retains most of the know-how required for its development. Because software can nowadays be easily cloned and spread worldwide, the risk of intellectual property infringement on a global scale is high. One of the most viable solutions to this problem is to endow software with a watermark. Good watermarks are required not only to state unambiguously the owner of the software, but also to be resilient and pervasive. In this thesis we base resiliency and pervasiveness on trace semantics. We point out loops as pervasive programming constructs and we introduce loop transformations as the basic building block of pervasive watermarking schemes. We survey several loop transformations, outlining their underlying principles. Then we exploit these principles to build some pervasive watermarking techniques. Resiliency still remains a big and challenging open issue.

    Allelic diversity in the CAD2 and LIM1 lignin biosynthetic genes of Eucalyptus grandis Hill ex Maiden and E. smithii R.T. Baker

    Lignin is a highly abundant aromatic biopolymer deposited during the final stages of secondary cell wall formation in plants and it constitutes a substantial proportion of the dry weight of woody plant stems. Lignin contributes structural support to xylem cell walls and hydrophobicity to water-conducting vessels and forms a defence mechanism against pathogen invasion. Although lignin is an essential part of normal plant cell development, lignin content and composition are targets for tree improvement, because residual lignin in paper pulp has negative effects on paper quality and lignin therefore has to be removed using treatments that are expensive and often detrimental to the environment. At present, little is known about the amount of allelic diversity in lignin biosynthetic genes and whether such diversity may be associated with variation in lignin content and composition. However, the identification of alleles associated with desirable lignin phenotypes is dependent on a detailed understanding of the molecular evolution and population genetics of these genes. This M.Sc. study was aimed at analysing nucleotide and allelic diversity in two lignin biosynthetic genes of Eucalyptus trees. Additionally, the study aimed to develop single nucleotide polymorphism (SNP) markers that could be used to assay allelic diversity for these genes in populations of two target species, E. grandis and E. smithii. Orthologues of the tobacco LIM-domain1 (NtLIM1) transcription factor gene involved in the regulation of lignin biosynthesis were isolated from E. grandis and E. smithii. Approximately 3 kb of genomic sequence including the promoter and full-length gene regions were isolated for the two orthologues, respectively labeled EgrLIM1 and EsLIM1. The predicted amino acid sequences of EgrLIM1 and EsLIM1 were 99.4% identical to each other and indicated that LIM1 is a small protein of only 188 residues in eucalypt trees and has a predicted molecular weight of 21.0 kDa.
Quantitative, real-time RT-PCR analysis confirmed the expression of LIM1 in wood-forming tissues undergoing lignification. Ten putative cis-regulatory elements were observed in the promoter regions of EgrLIM1 and EsLIM1, including a GA-dinucleotide microsatellite that appears to be specific to LIM1 promoters of Eucalyptus tree species. The full-length LIM1 gene sequences could subsequently be used in the assessment of nucleotide and allelic diversity, together with the full-length CAD2 sequences that were already available in the public domain. The level of nucleotide and allelic diversity and the distribution and decay of linkage disequilibrium (LD) were surveyed in 5’ and 3’ derived gene fragments of CAD2 and LIM1 obtained from 20 E. grandis and 20 E. smithii individuals. Each gene displayed a unique genetic diversity profile, but for the most part, nucleotide diversity (π) was estimated at approximately 0.0010 except for the E. grandis LIM1 gene, where a π lower than 0.0040 was observed. Generally, except for the high amounts of LD observed in the CAD2 gene of E. grandis (> 2.5 kb), LD decayed within 500 bp. A large number (13 to 45) of SNP sites (defined as single nucleotide changes with minor allele frequencies of at least 0.10 in each species) were observed in each gene of each species. The high SNP density (ranging from one per 45 to one per 155 bp) observed in the two genes facilitated the efficient development of SNP markers to be used in future aspects of LD mapping, association genetics and marker-assisted breeding. The allele sequences obtained for the CAD2 and LIM1 genes were used as templates for the development of SNP marker panels (a series of six or seven SNP markers analysed together) for the analysis (tagging) of SNP haplotype diversity in species-wide reference populations (100 E. grandis and 137 E. smithii individuals) of the two species. Each tag SNP was assayed using a single base extension assay and capillary gel electrophoresis.
High polymorphism information content (average PIC of 0.836) was observed for the SNP marker panels. Four SNPs in the CAD2 and two in the LIM1 genes were found to be polymorphic in E. grandis and E. smithii (i.e. trans-specific SNPs), suggesting a possible ancestral origin for these polymorphisms. Assessment of candidate gene variation in the genomes of forest trees is of importance to ultimately be able to predict the amount and structure of nucleotide diversity available for the future design of SNP assays at the whole-genome level. Such assays will be useful to study differentiation among tree species and populations, to associate nucleotide polymorphisms with desirable phenotypes and to increase the efficiency of tree improvement approaches. Dissertation (MSc (Genetics)), University of Pretoria, 2009.
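The abstract reports an average PIC for the SNP marker panels but does not say how it was computed. As a hedged illustration, the sketch below assumes the widely used Botstein et al. (1980) definition, PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2, applied to allele or haplotype frequencies. The function name `pic` and the example frequencies are hypothetical and are not taken from the dissertation.

```python
from itertools import combinations
from typing import Sequence


def pic(freqs: Sequence[float]) -> float:
    """Polymorphism information content for allele (or haplotype) frequencies.

    Assumes the Botstein et al. (1980) formula:
    PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2
    """
    if any(p < 0 for p in freqs) or abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("frequencies must be non-negative and sum to 1")
    homozygosity = sum(p * p for p in freqs)
    correction = sum(2 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - correction


if __name__ == "__main__":
    # A single biallelic SNP with equal frequencies (hypothetical values).
    print(pic([0.5, 0.5]))
    # A hypothetical panel with eight equally frequent haplotypes.
    print(pic([1 / 8] * 8))
```

Under this formula a single biallelic marker cannot exceed a PIC of 0.375, so a panel average as high as the one reported is only reachable when several SNPs are scored together as multi-allelic haplotypes, which matches the panel design the abstract describes.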