9 research outputs found

    A Feature Selection Algorithm to Compute Gene Centric Methylation from Probe Level Methylation Data

    DNA methylation is an important epigenetic event that affects gene expression during development and various diseases such as cancer. Understanding the mechanism of action of DNA methylation is important for downstream analysis. In the Illumina Infinium HumanMethylation 450K array, there are tens of probes associated with each gene. Given methylation intensities of all these probes, it is necessary to compute which of these probes are most representative of the gene centric methylation level. In this study, we developed a feature selection algorithm based on sequential forward selection that utilized different classification methods to compute gene centric DNA methylation using probe level DNA methylation data. We compared our algorithm to other feature selection algorithms such as support vector machines with recursive feature elimination, genetic algorithms and ReliefF. We evaluated all methods based on the predictive power of selected probes on their mRNA expression levels and found that a K-Nearest Neighbors classification using the sequential forward selection algorithm performed better than other algorithms based on all metrics. We also observed that transcriptional activities of certain genes were more sensitive to DNA methylation changes than transcriptional activities of other genes. Our algorithm was able to predict the expression of those genes with high accuracy using only DNA methylation data. Our results also showed that those DNA methylation-sensitive genes were enriched in Gene Ontology terms related to the regulation of various biological processes.
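    The abstract above names sequential forward selection wrapped around a K-Nearest Neighbors classifier but gives no implementation details. The following is only a minimal sketch of that general technique, not the authors' algorithm: the function names, the leave-one-out scoring, the stopping rule, and the toy probe matrix are all assumptions made for illustration, with class labels standing in for hypothetical discretized expression levels.

```python
from collections import Counter
from math import dist


def knn_loo_accuracy(X, y, features, k=3):
    """Leave-one-out accuracy of a k-NN classifier restricted to `features`."""
    correct = 0
    for i, row in enumerate(X):
        point = [row[f] for f in features]
        # Distances from sample i to every other sample, using only the chosen probes.
        neighbours = sorted(
            (dist(point, [other[f] for f in features]), y[j])
            for j, other in enumerate(X)
            if j != i
        )[:k]
        predicted = Counter(label for _, label in neighbours).most_common(1)[0][0]
        correct += predicted == y[i]
    return correct / len(X)


def sequential_forward_selection(X, y, max_features, k=3):
    """Greedily add the probe that most improves leave-one-out k-NN accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scored = [(knn_loo_accuracy(X, y, selected + [f], k), f) for f in remaining]
        # max() keeps the first maximal entry, so ties go to the lowest probe index.
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # assumed stopping rule: stop when no candidate improves the score
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy data (hypothetical): rows are samples, columns are probe methylation values.
X = [
    [0.5, 0.10, 0.9],
    [0.4, 0.20, 0.1],
    [0.6, 0.15, 0.5],
    [0.5, 0.90, 0.8],
    [0.4, 0.80, 0.2],
    [0.6, 0.85, 0.4],
]
y = [0, 0, 0, 1, 1, 1]  # hypothetical low/high expression classes
selected, score = sequential_forward_selection(X, y, max_features=2, k=1)
```

    In this toy example only the second probe separates the two classes, so the search selects it and stops because adding a further probe cannot raise the score.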

    DiseaseMeth: a human disease methylation database

    DNA methylation is an important epigenetic modification for genomic regulation in higher organisms that plays a crucial role in the initiation and progression of diseases. The integration and mining of DNA methylation data by methylation-specific PCR and genome-wide profiling technology could greatly assist the discovery of novel candidate disease biomarkers. However, this is difficult without a comprehensive DNA methylation repository of human diseases. Therefore, we have developed DiseaseMeth, a human disease methylation database (http://bioinfo.hrbmu.edu.cn/diseasemeth). Its focus is the efficient storage and statistical analysis of DNA methylation data sets from various diseases. Experimental information from over 14 000 entries and 175 high-throughput data sets from a wide number of sources have been collected and incorporated into DiseaseMeth. The latest release incorporates the gene-centric methylation data of 72 human diseases from a variety of technologies and platforms. To facilitate data extraction, DiseaseMeth supports multiple search options such as gene ID and disease name. DiseaseMeth provides integrated gene methylation data based on cross-data set analysis for disease and normal samples. These can be used for in-depth identification of differentially methylated genes and the investigation of gene–disease relationship

    Identifying Regulators from Multiple Types of Biological Data in Cancer

    Cancer genomes accumulate alterations that promote cancer cell proliferation and survival. Structural, genetic and epigenetic alterations that have a selective advantage for tumorigenesis affect key regulatory genes and microRNAs that in turn regulate the expression of many target genes. The goal of this dissertation is to leverage the alteration-rich landscape of cancer genomes to detect key regulatory genes and microRNAs. To this end, we designed a feature selection algorithm to identify DNA methylation signals around a gene that would highly predict its expression. We found that genes whose expression could be predicted by DNA methylation accurately were enriched in Gene Ontology terms related to the regulation of various biological processes. This suggests that genes controlled by DNA methylation are regulatory genes. We also developed two tools that infer relationships between regulatory genes and target genes leveraging structural and epigenetic data. The first tool, ProcessDriver integrates copy number alteration and gene expression datasets to identify copy number cancer driver genes, target genes of these drivers and the disrupted biological processes. Our results showed that driver genes selected by ProcessDriver are enriched in known cancer genes. Using survival analysis, we showed that drivers are linked to new tumor events after initial treatment. The second tool was developed to leverage structural and epigenetic data to infer interactions between regulatory genes and targets on a network-level. Our canonical correlation analysis-based approach utilized the DNA methylation or copy number states of potential regulators and the expression states of potential targets to score regulatory interactions. We then incorporated these regulatory interaction scores as prior knowledge in a dynamic Bayesian framework utilizing time series gene expression data. 
Our results indicated that the canonical correlation analysis-based scores reflect the true interactions between genes with high accuracy, and the accuracy can be further increased by using the scores as a prior in the dynamic Bayesian framework. Finally, we are developing an algorithm to detect cancer-related microRNAs, associated targets and disrupted biological processes. Our preliminary results suggest that the modules of miRNAs and target genes identified in this approach are enriched in known microRNA-gene interactions

    MethylC-analyzer: A comprehensive downstream pipeline for the analysis of genome-wide DNA methylation

    DNA methylation is a crucial epigenetic modification involved in multiple biological processes and diseases. Current approaches for measuring genome-wide DNA methylation via bisulfite sequencing (BS-seq) include whole-genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS), and enzymatic methyl-seq (EM-seq). The computational analysis tools available for BS-seq data include customized aligners for mapping bisulfite-converted reads and computational pipelines for downstream data analysis. Current post-alignment methylation tools are specialized for the interpretation of CG methylation, which is known to dominate mammalian genomes; however, non-CG methylation (CHG and CHH, where H refers to A, C, or T) is commonly observed in plants and fungi and is closely associated with gene regulation, transposon silencing, and plant development. Thus, we have developed MethylC-analyzer to analyze and visualize post-alignment WGBS, RRBS, and EM-seq data focusing on CG. The tool can also analyze non-CG sites to help decipher the genomes of plants and fungi. By processing aligned data and gene location files, MethylC-analyzer generates a genome-wide view of methylation levels and methylation in user-specified genomic regions. The meta-plot, for example, allows the investigation of DNA methylation within specific genomic elements. Moreover, our tool identifies differentially methylated regions (DMRs) and investigates the enrichment of genomic features associated with variable methylation. MethylC-analyzer functionality is not limited to specific genomes, and we demonstrated its performance on both plant and human BS-seq data. MethylC-analyzer is a Python- and R-based program designed to perform comprehensive downstream analyses of methylation data, providing an intuitive analysis platform for scientists unfamiliar with DNA methylation analysis. 
It is available as either a standalone version for command-line uses or a graphical user interface (GUI) and is publicly accessible at https://github.com/RitataLU/MethylC-analyzer

    Refining epigenetic prediction of chronological and biological age

    Background: Epigenetic clocks can track both chronological age (cAge) and biological age (bAge). The latter is typically defined by physiological biomarkers and risk of adverse health outcomes, including all-cause mortality. As cohort sample sizes increase, estimates of cAge and bAge become more precise. Here, we aim to develop accurate epigenetic predictors of cAge and bAge, whilst improving our understanding of their epigenomic architecture. Methods: First, we perform large-scale (N = 18,413) epigenome-wide association studies (EWAS) of chronological age and all-cause mortality. Next, to create a cAge predictor, we use methylation data from 24,674 participants from the Generation Scotland study, the Lothian Birth Cohorts (LBC) of 1921 and 1936, and 8 other cohorts with publicly available data. In addition, we train a predictor of time to all-cause mortality as a proxy for bAge using the Generation Scotland cohort (1214 observed deaths). For this purpose, we use epigenetic surrogates (EpiScores) for 109 plasma proteins and the 8 component parts of GrimAge, one of the current best epigenetic predictors of survival. We test this bAge predictor in four external cohorts (LBC1921, LBC1936, the Framingham Heart Study and the Women’s Health Initiative study). Results: Through the inclusion of linear and non-linear age-CpG associations from the EWAS, feature pre-selection in advance of elastic net regression, and a leave-one-cohort-out (LOCO) cross-validation framework, we obtain cAge prediction with a median absolute error equal to 2.3 years. Our bAge predictor was found to slightly outperform GrimAge in terms of the strength of its association to survival (HR_GrimAge = 1.47 [1.40, 1.54] with p = 1.08 × 10^-52, and HR_bAge = 1.52 [1.44, 1.59] with p = 2.20 × 10^-60). Finally, we introduce MethylBrowsR, an online tool to visualise epigenome-wide CpG-age associations. 
Conclusions: The integration of multiple large datasets, EpiScores, non-linear DNAm effects, and new approaches to feature selection has facilitated improvements to the blood-based epigenetic prediction of biological and chronological age.
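    The leave-one-cohort-out (LOCO) cross-validation mentioned in the abstract above holds out one whole cohort per fold, so a predictor is always evaluated on a cohort it never saw during training. The sketch below illustrates only that splitting idea; the function name and the toy cohort labels are assumptions, and the elastic net fitting step is deliberately left out.

```python
def leave_one_cohort_out(cohort_labels):
    """Yield (held_out_cohort, train_indices, test_indices) once per cohort."""
    for held_out in sorted(set(cohort_labels)):
        train = [i for i, c in enumerate(cohort_labels) if c != held_out]
        test = [i for i, c in enumerate(cohort_labels) if c == held_out]
        yield held_out, train, test


# Toy cohort assignment for five samples (hypothetical labels).
labels = ["GS", "GS", "LBC1921", "LBC1936", "GS"]
splits = list(leave_one_cohort_out(labels))
for held_out, train, test in splits:
    print(f"hold out {held_out}: train on {train}, test on {test}")
```

    Each fold's error can then be summarised per cohort, which is what makes a cohort-level median absolute error meaningful.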

    Hybrid Feature Selection Using Genetic Algorithms and PCA in Classification of Thyroid and Chronic Kidney Disease Data

    In this study, genetic algorithms and principal component analysis (PCA) were used in a hybrid way to increase the performance of the k-nearest neighbors classifier in the diagnosis of thyroid and chronic kidney disease, and a new feature selection method was proposed. The results obtained with the hybrid feature selection method were compared with the initial performance on the data sets before feature selection was applied. With the proposed hybrid method, classification accuracy increased from 93.44% to 95.89% for the thyroid data set and from 93.75% to 98.25% for the kidney data set. A 10-fold cross validation was applied to both data sets to ensure consistent results. Keywords: Genetic Algorithms, PCA, Feature Selection, K-nearest neighbors

    Inference and Analysis of Multilayered Mirna-Mediated Networks in Cancer

    MicroRNAs (miRNAs) are small noncoding transcripts that can regulate gene expression, thereby controlling diverse biological processes. Aberrant disruptions of miRNA expression and their interactions with other biological agents (e.g., coding and noncoding transcripts) have been associated with several types of cancer. The goal of this dissertation is to use multidimensional genomic data to model two different gene regulation mechanisms by miRNAs in cancer. This dissertation results from two research projects. The first project investigates a miRNA-mediated gene regulation mechanism called competing endogenous RNA (ceRNA) interactions, which suggests that some transcripts can indirectly regulate one another's activity through their interactions with a common set of miRNAs. Identification of context-specific ceRNA interactions is a challenging task. To address that, we proposed a computational method called Cancerin to identify genome-wide cancer-associated ceRNA interactions. Cancerin incorporates DNA methylation (DM), copy number alteration (CNA), and gene and miRNA expression datasets to construct cancer-specific ceRNA networks. Cancerin was applied to three cancer datasets from the Cancer Genome Atlas (TCGA) project. We found that the RNAs involved in ceRNA interactions were enriched with cancer-related genes and have high prognostic power. Moreover, the ceRNA modules in the inferred ceRNA networks were involved in cancer-associated biological processes. The second project investigates what biological functions are regulated by both miRNAs and transcription factors (TFs). While it has been known that miRNAs and TFs can coregulate common target genes having similar biological functions, it is challenging to associate specific biological functions to specific miRNAs and TFs. In this project, we proposed a computational method called CanMod to identify gene regulatory modules. Each module consists of miRNAs, TFs and their coregulated target genes. 
CanMod was applied on the breast cancer dataset from TCGA. Many hub regulators (i.e., miRNAs and TFs) found in the inferred modules were known cancer genes, and CanMod was able to find experimentally validated regulator-target interactions. In addition, the modules were associated with distinguishable and cancer-related biological processes. Given the biological findings obtained from Cancerin and CanMod, we believe that the two computational methods are valuable tools to explore novel miRNA involvement in cancer

    Unraveling expression and DNA methylation landscapes in cancer

    Cancer is a complex, heterogeneous disease and associated with a pluralism of distinct molecular events occurring on multiple layers of cell activity. It is a disease of genomic regulation driven by genetic and epigenetic mechanisms. Consideration of these regulatory levels is inevitable for understanding cancer genesis and progression. Improved high-throughput techniques developed in the last decades enable a highly resolved view on these mechanisms but at the same time the technologies produce an incredible amount of molecular data. Hence it needs advances in computational methods to master the data. In this thesis we demonstrate how to cope with high-dimensional data to characterize molecular aspects of cancer. The main aim of this thesis is to develop and to apply bioinformatics methods to unravel molecular mechanisms, with special focus on gene expression and epigenetics, underlying cancer. Therefore, we selected two cancer entities, B-cell lymphoma and glioblastoma, for a more detailed, exemplary study. Bioinformatics methods dealing with molecular cancer data have to tackle tasks like data integration, dimension reduction, data compression and proper visualization. One effective method that fulfills the mentioned tasks is self organizing map (SOM) machine learning, a technique to ‘organize’ complex, multivariate data. We present an analytic framework based on SOMs that aims at characterizing single-omics landscapes, here either regarding genome wide expression or methylation, to describe the heterogeneity of cancer on the molecular level. Molecular data of each sample is presented in terms of ‘individual’ maps, which enable their evaluation by visual inspection. The portrayal method also realizes comprehensive downstream analysis tasks such as marker selection and clustering of co-regulated features into modules, stratification of cases into subtypes, knowledge discovery, function mining and pathway analysis. 
Further, we describe how to detect and to correct outlier samples. In a novel combining approach all these analytic tasks of the single-omics SOM are embedded in a workflow to integratively analyze gene expression and DNA methylation data of unmatched patient cohorts. We showed that this approach provides detailed insights into the transcriptome and methylome landscapes of cancer. Furthermore, we developed a new inter-omics method based on SOM machine learning for the combined analysis of gene expression and DNA methylation data obtained from the same patient cohort. The method allows the visual inspection of the data landscapes of each sample on a personalized and class-related level, where the relative contribution of each of both data entities can be tuned either to focus on expression or methylation landscapes or on a combination of both. Using the single-omics SOM approach, we studied molecular subtypes of B-cell lymphoma based on gene expression data. The method disentangles tumor heterogeneity and provides suited markers for the cancer subtypes. We proposed a refined subtyping of B-cell lymphoma into four subtypes, rather than a previously assumed three-group classification. In a second application of the single-omics SOM we studied a gene expression data set concerning glioblastoma for which we confirmed an established four-subtype classification. Our results suggested a similar gene activation pattern as observed in the lymphoma study characterized by an antagonistic switching between transcriptional modes related to immune response and cell division. Our integrative study on a larger lymphoma cohort comprising additional subtypes confirmed previous results about the role of stemness genes during development and maturation of B-cells. 
Their dysfunctions in lymphoma are governed by widespread epigenetic effects altering the promoter methylation of the involved genes, their activity status as moderated by histone modifications, and also by chromatin remodeling. We identified subtype-specific signatures that associate with epigenetic effects such as remodeling from transcriptionally inactive into active chromatin states, differential promoter methylation, and the enrichment of targets of transcription factors such as EZH2 and SUZ12. While studying the transcription of epigenetic modifiers in lymphoma and healthy controls, we found that the expression levels of nearly all modifiers are strongly disturbed in lymphoma and concluded that the epigenetic machinery is highly deregulated. Our results suggested that Burkitt’s lymphoma and diffuse large B-cell lymphoma differ by an imbalance of repressive and poised promoters, which is associated with an imbalance of the activity of histone- and DNA-modifying enzymes. Our inter-omics method was applied to high-grade glioblastomas. Their expression and methylation landscapes were segmented into modes of co-expressed and co-methylated genes, which reflect underlying regulatory modes of cell activity. We found antagonistic methylation and gene expression changes between the IDH1 mutated and IDH1 wild type subtypes, which affect predominantly poised and repressed chromatin states. Therefore we assume that these effects deregulate developmental processes either by their blockage or by aberrant activation. Our methods presented in this thesis enable a holistic view on high-dimensional molecular data collected in large-scale cancer studies. The examples chosen illustrate the mutual dependence of regulatory effects on genetic, epigenetic and transcriptomic levels. Our findings revealed that epigenetic deregulation in cancer must go beyond simple schemes using only a few modes of regulation. 
By applying the tools and methods described above to large-scale cancer cohorts we could confirm and supplement previous findings about underlying cancer biology