7 research outputs found

    Reusable, extensible, and modifiable R scripts and Kepler workflows for comprehensive single set ChIP-seq analysis

    BACKGROUND: There has been an enormous expansion of use of chromatin immunoprecipitation followed by sequencing (ChIP-seq) technologies. Analysis of large-scale ChIP-seq datasets involves a complex series of steps and production of several specialized graphical outputs. A number of systems have emphasized custom development of ChIP-seq pipelines. These systems are primarily based on custom programming of a single, complex pipeline or supply libraries of modules and do not produce the full range of outputs commonly produced for ChIP-seq datasets. It is desirable to have more comprehensive pipelines, in particular ones addressing common metadata tasks, such as pathway analysis, and pipelines producing standard complex graphical outputs. It is advantageous if these are highly modular systems, available as both turnkey pipelines and individual modules, that are easily comprehensible, modifiable and extensible to allow rapid alteration in response to new analysis developments in this growing area. Furthermore, it is advantageous if these pipelines allow data provenance tracking. RESULTS: We present a set of 20 ChIP-seq analysis software modules implemented in the Kepler workflow system; most (18/20) were also implemented as standalone, fully functional R scripts. The set consists of four full turnkey pipelines and 16 component modules. The turnkey pipelines in Kepler allow data provenance tracking. Implementation emphasized use of common R packages and widely-used external tools (e.g., MACS for peak finding), along with custom programming. This software presents comprehensive solutions and easily repurposed code blocks for ChIP-seq analysis and pipeline creation. Tasks include mapping raw reads, peakfinding via MACS, summary statistics, peak location statistics, summary plots centered on the transcription start site (TSS), gene ontology, pathway analysis, and de novo motif finding, among others. 
CONCLUSIONS: These pipelines range from those performing a single task to those performing full analyses of ChIP-seq data. The pipelines are supplied as both Kepler workflows, which allow data provenance tracking, and, in the majority of cases, as standalone R scripts. These pipelines are designed for ease of modification and repurposing.

    Evolutionary Signatures amongst Disease Genes Permit Novel Methods for Gene Prioritization and Construction of Informative Gene-Based Networks

    Genes involved in the same function tend to have similar evolutionary histories, in that their rates of evolution covary over time. This coevolutionary signature, termed Evolutionary Rate Covariation (ERC), is calculated using only gene sequences from a set of closely related species and has demonstrated potential as a computational tool for inferring functional relationships between genes. To further define applications of ERC, we first established that roughly 55% of genetic diseases possess an ERC signature between their contributing genes. At a false discovery rate of 5%, we report 40 such diseases, including cancers, developmental disorders and mitochondrial diseases. Given these coevolutionary signatures between disease genes, we then assessed ERC's ability to prioritize known disease genes out of a list of unrelated candidates. We found that in the presence of an ERC signature, the true disease gene is effectively prioritized to the top 6% of candidates on average. We then apply this strategy to a melanoma-associated region on chromosome 1 and identify MCL1 as a potential causative gene. Furthermore, to gain global insight into disease mechanisms, we used ERC to predict molecular connections between 310 nominally distinct diseases. The resulting “disease map” network associates several diseases with related pathogenic mechanisms and unveils many novel relationships between clinically distinct diseases, such as between Hirschsprung's disease and melanoma. Taken together, these results demonstrate the utility of molecular evolution as a gene discovery platform and show that evolutionary signatures can be used to build informative gene-based networks.

    Microbial ecology of hot desert edaphic systems

    A significant proportion of the Earth's surface is desert or in the process of desertification. The extreme environmental conditions that characterize these areas result in a surface that is essentially barren, with a limited range of higher plants and animals. Microbial communities are probably the dominant drivers of these systems, mediating key ecosystem processes. In this review, we examine the microbial communities of hot desert terrestrial biotopes (including soils, cryptic and refuge niches and plant-root-associated microbes) and the processes that govern their assembly. We also assess the possible effects of global climate change on hot desert microbial communities and the resulting feedback mechanisms. We conclude by discussing current gaps in our understanding of the microbiology of hot deserts and suggest fruitful avenues for future research.
    South African National Research Foundation, the University of Pretoria and the Genomics Research Institute.
    http://femsre.oxfordjournals.org
    2016-03-31

    Evolutionary Machine Learning of Higher-Order Relationships in Genomic Sequence Analysis

    Thesis (Ph.D.) -- Seoul National University Graduate School: Interdisciplinary Program in Bioinformatics, 2014. 2. Byoung-Tak Zhang (장병탁).
    One of the basic research goals in life science is to understand the complex relationships between biological factors and phenotypes, and to identify the various factors affecting the phenotype. In particular, genomic sequences play a significant role in determining phenotypes such as gene expression and susceptibility to disease, so studying the fundamental information stored in the genome is essential to understanding biological processes. Previous genomic sequence analyses mainly focused on identifying a single associated factor or pairwise relationships with significant effects. Recent developments in high-throughput technologies have made it possible to identify causal factors by carrying out genome-wide analysis. However, it remains a challenge to discover higher-order interactions of multiple factors, because this involves huge search spaces and computational costs. In this dissertation, we develop effective methods for identifying the higher-order relationships of sequence elements affecting the phenotype by combining statistical learning with evolutionary computation. The methods are applied to finding the associated combinatorial factors and dysfunctional modules in various genome-wide sequence analysis problems. First, we present statistical learning-based methods to detect co-regulatory sequence motifs and to investigate the combinatorial effects of DNA methylation on downstream gene expression. Next, to examine sequence datasets with a huge number of attributes on the human genome, we apply evolutionary computation approaches. Our methods search the feature space of the problem using machine learning techniques with training datasets in the evolutionary computation process, and are able to find good candidate solutions in computationally expensive optimization problems.
The experimental results show that the approaches are useful for finding higher-order relationships associated with disease using genomic and epigenomic datasets. In conclusion, our studies provide practical methods for analyzing complex interactions among sequence elements in genomic/epigenomic studies.
    Table of contents:
    Abstract
    1 Introduction
        1.1 Motivation
        1.2 Approaches
        1.3 Organization of the dissertation
    2 Genome biology and computational analysis
        2.1 Fundamentals of genome biology
            2.1.1 DNA, gene, chromosomes and cell biology
            2.1.2 Gene expression and regulation
            2.1.3 Genomics
            2.1.4 Epigenomics
        2.2 Evolutionary machine learning
            2.2.1 Machine learning and evolutionary computation
            2.2.2 Evolutionary computation in biology
    3 Identifying co-regulatory sequence motifs
        3.1 Background
        3.2 Methods
            3.2.1 Investigation of the relationship between regulatory sequence motifs and expression profiles
            3.2.2 Preparation of the gene expression datasets
            3.2.3 Preparation of the gene sequence datasets
            3.2.4 Measurement of the effect of motif combinations
        3.3 Results
            3.3.1 Identification of the relationship between gene expression and known motifs
            3.3.2 Identification of cell cycle-related motifs
            3.3.3 Combinational effects of regulatory motifs
        3.4 Discussion
    4 Investigation of combinatorial effects of DNA methylation
        4.1 Background
        4.2 Materials and methods
            4.2.1 Data
            4.2.2 Profiling of DNA methylation patterns
            4.2.3 Identifying differentially methylated/expressed genes by information theoretic analysis
            4.2.4 Identifying downregulated genes in each subtype for integrative analysis
            4.2.5 Correlation between DNA methylation and gene expression
            4.2.6 Combinatorial effects of DNA methylation in various genomic regions
            4.2.7 Analysis of transcription factor binding regions possibly blocked by DNA methylation
        4.3 Results
            4.3.1 DNA methylation in 30 ICBP cell lines
            4.3.2 Information theoretic analysis of phenotype-differentially methylated and expressed genes
            4.3.3 Integrated analysis of DNA methylation and gene expression
            4.3.4 Investigation of the combinatorial effects of DNA methylation in various regions on downstream gene expression levels
            4.3.5 Integrative analysis of transcription factors, DNA methylation and gene expression
        4.4 Discussion
    5 Detecting multiple SNP interaction via evolutionary learning
        5.1 Background
        5.2 Materials and methods
            5.2.1 Identifying higher-order interaction of SNPs
            5.2.2 Algorithm Description
            5.2.3 Dataset
        5.3 Results
            5.3.1 Identifying interaction between features in simulation data
            5.3.2 Identifying higher-order SNP interactions in Korean population
        5.4 Discussion
    6 Identifying DNA methylation modules by probabilistic evolutionary learning
        6.1 Background
        6.2 Methods
            6.2.1 Evolutionary learning procedure to identify a set of DNA methylation sites associated to disease
            6.2.2 Learning dependency graph
            6.2.3 Fitness evaluation in population
            6.2.4 Dataset
        6.3 Results
            6.3.1 DNA methylation modules associated to breast cancer
            6.3.2 Modules associated to colorectal cancer using high-throughput sequencing data
        6.4 Discussion
    7 Conclusion
    Bibliography
    Abstract (in Korean)