830 research outputs found

    Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes

    I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more beautiful. Curiosity is the desire to create or discover more non-random, non-arbitrary, regular data that is novel and surprising not in the traditional sense of Boltzmann and Shannon but in the sense that it allows for compression progress because its regularity was not yet known. This drive maximizes interestingness, the first derivative of subjective beauty or compressibility, that is, the steepness of the learning curve. It motivates exploring infants, pure mathematicians, composers, artists, dancers, comedians, yourself, and (since 1990) artificial systems. Comment: 35 pages, 3 figures, based on KES 2008 keynote and ALT 2007 / DS 2007 joint invited lectur
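The notion of compression progress in the abstract above can be illustrated with a small sketch. It is not the author's method: using `zlib` with a preset dictionary as a stand-in for what the observer has "learned" is an assumption made purely for illustration, and the function names `compressed_size` and `compression_progress` are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes, learned: bytes = b"") -> int:
    """Bytes needed to encode `data`, given what the observer has learned.

    Assumption: the observer's knowledge is modelled as a zlib preset
    dictionary. This is an illustrative stand-in, not the paper's model.
    """
    if learned:
        comp = zlib.compressobj(level=9, zdict=learned)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def compression_progress(data: bytes, before: bytes, after: bytes) -> int:
    """Bytes saved on `data` when knowledge improves from `before` to `after`."""
    return compressed_size(data, before) - compressed_size(data, after)


def pseudo_random_bytes(seed: int, n: int = 256) -> bytes:
    """Deterministic byte string that looks incompressible to zlib."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n))


# Data with a regularity the observer has not yet discovered.
pattern = pseudo_random_bytes(seed=1)
# Unrelated noise: learning `pattern` reveals nothing about it.
noise = pseudo_random_bytes(seed=2)

progress_on_pattern = compression_progress(pattern, before=b"", after=pattern)
progress_on_noise = compression_progress(noise, before=b"", after=pattern)
```

Under these assumptions, `progress_on_pattern` is large because the newly learned regularity makes the data much cheaper to encode, while `progress_on_noise` stays near zero. In the abstract's terms, only the first case is "interesting".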

    Finding Statistically Significant Communities in Networks

    Community structure is one of the main structural features of networks, revealing both their internal organization and the similarity of their elementary units. Despite the large variety of methods proposed to detect communities in graphs, there is a strong need for multi-purpose techniques that can handle different types of datasets and the subtleties of community structure. In this paper we present OSLOM (Order Statistics Local Optimization Method), the first method capable of detecting clusters in networks while accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, which is estimated with tools of Extreme and Order Statistics. OSLOM can be used alone or as a refinement procedure for partitions/covers delivered by other techniques. We have also implemented sequential algorithms combining OSLOM with other fast techniques, so that the community structure of very large networks can be uncovered. Our method performs comparably to the best existing algorithms on artificial benchmark graphs. Several applications on real networks are shown as well. OSLOM is implemented in freely available software (http://www.oslom.org), and we believe it will be a valuable tool in the analysis of networks

    Advancing the analysis of bisulfite sequencing data in its application to ecological plant epigenetics

    The aim of this thesis is to bridge the gap between the state-of-the-art bioinformatic tools and resources, currently at the forefront of epigenetic analysis, and their emerging applications to non-model species in the context of plant ecology. New, high-resolution research tools are presented; first in a specific sense, by providing new genomic resources for a selected non-model plant species, and also in a broader sense, by developing new software pipelines to streamline the analysis of bisulfite sequencing data, in a manner which is applicable to a wide range of non-model plant species. The selected species is the annual field pennycress, Thlaspi arvense, which belongs to the same lineage of the Brassicaceae as the closely related model species, Arabidopsis thaliana, and yet does not benefit from such extensive genomic resources. It is one of three key species in a Europe-wide initiative to understand how epigenetic mechanisms contribute to natural variation, stress responses and long-term adaptation of plants. To this end, this thesis provides a high-quality, chromosome-level assembly for T. arvense, alongside a rich complement of feature annotations of particular relevance to the study of epigenetics. The genome assembly uses a hybrid approach, involving both PacBio continuous long reads and circular consensus sequences, alongside Hi-C sequencing, PCR-free Illumina sequencing and genetic maps. The result is a significant improvement in contiguity over the existing draft state from earlier studies. Much of the basis for building an understanding of epigenetic mechanisms in non-model species centres on the study of DNA methylation, and in particular the analysis of bisulfite sequencing data to bring methylation patterns into nucleotide-level resolution. In order to maintain a broad level of comparison between T. arvense and the other selected species under the same initiative, a suite of software pipelines has also been developed, covering mapping, the quantification of methylation values, differential methylation between groups, and epigenome-wide association studies. Furthermore, presented herein is a novel algorithm which can facilitate accurate variant calling from bisulfite sequencing data using conventional approaches, such as FreeBayes or Genome Analysis ToolKit (GATK), which until now was feasible only with specifically adapted software. This enables researchers to obtain high-quality genetic variants, often essential for contextualising the results of epigenetic experiments, without the need for additional sequencing libraries. Each of these aspects is thoroughly benchmarked, integrated into a robust workflow management system, and adheres to the principles of FAIR (Findability, Accessibility, Interoperability and Reusability). Finally, further consideration is given to the unique difficulties presented by population-scale data, and a number of concepts and ideas are explored in order to improve the feasibility of such analyses. In summary, this thesis introduces new high-resolution tools to facilitate the analysis of epigenetic mechanisms, specifically relating to DNA methylation, in non-model plant data. In addition, thorough benchmarking standards are applied, showcasing the range of technical considerations which are of principal importance when developing new pipelines and tools for the analysis of bisulfite sequencing data.
    The complete “Epidiverse Toolkit” is available at https://github.com/EpiDiverse and will continue to be updated and improved in the future.

    ABSTRACT
    ACKNOWLEDGEMENTS
    1 INTRODUCTION; 1.1 ABOUT THIS WORK; 1.2 BIOLOGICAL BACKGROUND; 1.2.1 Epigenetics in plant ecology; 1.2.2 DNA methylation; 1.2.3 Maintenance of 5mC patterns in plants; 1.2.4 Distribution of 5mC patterns in plants; 1.3 TECHNICAL BACKGROUND; 1.3.1 DNA sequencing; 1.3.2 The case for a high-quality genome assembly; 1.3.3 Sequence alignment for NGS; 1.3.4 Variant calling approaches
    2 BUILDING A SUITABLE REFERENCE GENOME; 2.1 INTRODUCTION; 2.2 MATERIALS AND METHODS; 2.2.1 Seeds for the reference genome development; 2.2.2 Sample collection, library preparation, and DNA sequencing; 2.2.3 Contig assembly and initial scaffolding; 2.2.4 Re-scaffolding; 2.2.5 Comparative genomics; 2.3 RESULTS; 2.3.1 An improved reference genome sequence; 2.3.2 Comparative genomics; 2.4 DISCUSSION
    3 FEATURE ANNOTATION FOR EPIGENOMICS; 3.1 INTRODUCTION; 3.2 MATERIALS AND METHODS; 3.2.1 Tissue preparation for RNA sequencing; 3.2.2 RNA extraction and sequencing; 3.2.3 Transcriptome assembly; 3.2.4 Genome annotation; 3.2.5 Transposable element annotations; 3.2.6 Small RNA annotations; 3.2.7 Expression atlas; 3.2.8 DNA methylation; 3.3 RESULTS; 3.3.1 Transcriptome assembly; 3.3.2 Protein-coding genes; 3.3.3 Non-coding loci; 3.3.4 Transposable elements; 3.3.5 Small RNA; 3.3.6 Pseudogenes; 3.3.7 Gene expression atlas; 3.3.8 DNA Methylation; 3.4 DISCUSSION
    4 BISULFITE SEQUENCING METHODS; 4.1 INTRODUCTION; 4.2 PRINCIPLES OF BISULFITE SEQUENCING; 4.3 EXPERIMENTAL DESIGN; 4.4 LIBRARY PREPARATION; 4.4.1 Whole Genome Bisulfite Sequencing (WGBS); 4.4.2 Reduced Representation Bisulfite Sequencing (RRBS); 4.4.3 Target capture bisulfite sequencing; 4.5 BIOINFORMATIC ANALYSIS OF BISULFITE DATA; 4.5.1 Quality Control; 4.5.2 Read Alignment; 4.5.3 Methylation Calling; 4.6 ALTERNATIVE METHODS
    5 FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS; 5.1 INTRODUCTION; 5.2 MATERIALS AND METHODS; 5.2.1 Reference species; 5.2.2 Natural accessions; 5.2.3 Read simulation; 5.2.4 Read alignment; 5.2.5 Mapping rates; 5.2.6 Precision-recall; 5.2.7 Coverage deviation; 5.2.8 DNA methylation analysis; 5.3 RESULTS; 5.4 DISCUSSION; 5.5 A PIPELINE FOR WGBS ANALYSIS
    6 THERE AND BACK AGAIN: INFERRING GENOMIC INFORMATION; 6.1 INTRODUCTION; 6.1.1 Implementing a new approach; 6.2 MATERIALS AND METHODS; 6.2.1 Validation datasets; 6.2.2 Read processing and alignment; 6.2.3 Variant calling; 6.2.4 Benchmarking; 6.3 RESULTS; 6.4 DISCUSSION; 6.5 A PIPELINE FOR SNP VARIANT ANALYSIS
    7 POPULATION-LEVEL EPIGENOMICS; 7.1 INTRODUCTION; 7.2 CHALLENGES IN POPULATION-LEVEL EPIGENOMICS; 7.3 DIFFERENTIAL METHYLATION; 7.3.1 A pipeline for case/control DMRs; 7.3.2 A pipeline for population-level DMRs; 7.4 EPIGENOME-WIDE ASSOCIATION STUDIES (EWAS); 7.4.1 A pipeline for EWAS analysis; 7.5 GENOTYPING-BY-SEQUENCING (EPIGBS); 7.5.1 Extending the epiGBS pipeline; 7.6 POPULATION-LEVEL HAPLOTYPES; 7.6.1 Extending the EpiDiverse/SNP pipeline
    8 CONCLUSION
    APPENDICES; A. SUPPLEMENT: BUILDING A SUITABLE REFERENCE GENOME; B. SUPPLEMENT: FEATURE ANNOTATION FOR EPIGENOMICS; C. SUPPLEMENT: FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS; D. SUPPLEMENT: INFERRING GENOMIC INFORMATION
    BIBLIOGRAPH