7 research outputs found

    A balanced iterative random forest for gene selection from microarray data

    Background: The wealth of gene expression values being generated by high-throughput microarray technologies leads to complex, high-dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging

    Iterative Random Forests to detect predictive and stable high-order interactions

    Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF) and Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology
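
The stability criterion mentioned in the abstract above (an interaction counts as stable when it is returned in more than half of bootstrap replicates) can be illustrated with a minimal Python sketch. This is not the iRF algorithm itself, only the final filtering step. The function name `stable_interactions`, the `threshold` parameter, and the toy feature labels "A", "B", "C" are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def stable_interactions(replicates, threshold=0.5):
    """Return {interaction: fraction} for interactions recovered in more
    than `threshold` of the bootstrap replicates (hypothetical helper)."""
    counts = Counter()
    for recovered in replicates:
        # Count each interaction at most once per replicate.
        counts.update(set(recovered))
    n = len(replicates)
    return {inter: c / n for inter, c in counts.items() if c / n > threshold}


# Toy input: each inner list holds the interactions (as frozensets of
# feature labels) recovered in one bootstrap replicate. Illustrative only.
replicates = [
    [frozenset({"A", "B"}), frozenset({"A", "B", "C"})],
    [frozenset({"A", "B"})],
    [frozenset({"A", "B"}), frozenset({"B", "C"})],
    [frozenset({"A", "B", "C"})],
]

stable = stable_interactions(replicates)
```

In this toy input only the pair {A, B} passes the filter: it appears in three of four replicates, whereas {A, B, C} appears in exactly half and is therefore excluded by the strict "more than half" rule.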

    Genetic and Epigenetic Variations in Asthma and Wheeze Illnesses

    Asthma, a chronic respiratory condition, is common worldwide with no cure and limited effective prevention strategies. It is well recognized that asthma has a multifaceted etiology, though many of the underlying mechanisms involved in asthma development, persistence and remission are still convoluted. Epigenetic mechanisms, such as DNA methylation, regulate gene expression but are not related to changes in the actual DNA sequence. Recently, differential patterns of DNA methylation within many genes have been associated with asthma, particularly within genes involved in the differentiation of pro-inflammatory T-helper 2 (Th2) cells. DNA methylation patterns within less known biologic pathways undoubtedly are involved in asthma pathogenesis as well. The purpose of this dissertation was three-fold. First, we explored whether genetic and epigenetic variations within Th2-genes differed among persons with different phenotypic presentations of wheeze illnesses. Second, we conducted an epigenome-wide association study (EWAS) to identify novel DNA methylation loci associated with asthma. Last, we conducted a follow-up study of our top EWAS findings, to investigate whether the expression of the associated genes was predictive of infant wheeze. We found that DNA methylation (DNA-M) within GATA3 and IL4 varied based on different wheeze-illness phenotypes, suggesting that Th2-genes are under differential epigenetic regulation for different presentations of asthma. We also identified nine novel DNA methylation loci (cg25578728 in CHD7, cg16658191 in HK1, cg00100703 in UNC45B, cg07948085 [intergenic], cg04359558 in LITAF, cg20417424 in ST6GALNAC5, cg19974715 [intergenic], cg01046943 in NUP210 and cg14727512 in DGCR14) associated with asthma at age 18. For two of those genes (HK1 and LITAF), expression levels in cord blood were predictive of infant wheeze. 
Interestingly, the observed methylation and expression patterns of HK1 and LITAF could be consistent with increased resistance to apoptotic signaling. Apoptotic resistance among pro-inflammatory cells can increase the duration of an inflammatory response and is associated with asthmatic pathophysiology. Thus, we may have identified under-studied genes and their epigenetic regulation, which could play important roles in asthma pathophysiology. These genes may offer new insights into the etiology of asthma, be investigated as potential targets for therapy, or be considered for inclusion in algorithms used to predict early-life wheeze and later-life asthma

    Prostate Cancer Epigenetic Mechanism Study and Biomarker Discovery Using Bioinformatics Approaches

    Most screening-detected prostate cancer (PCa) is indolent and not lethal. Biomarkers that can predict aggressive disease independent of clinical features are needed to improve risk stratification of localized PCa patients and reduce overtreatment. Epigenetic biomarkers, especially methylation biomarkers, have better stability in biofluids or samples of below-average quality. We aimed to identify DNA methylation differences in leukocytes between clinically defined aggressive and non-aggressive PCa to identify potential biomarkers for PCa diagnosis. To accomplish this aim, we performed DNA methylation profiling in leukocyte DNA samples obtained from 287 PCa patients with Gleason Score (GS) 6 and ≥8 using Illumina 450k methylation arrays, and 8 PCa patients using whole genome bisulfite sequencing. We observed that DNA methylation levels in the core promoters and the first exon region were significantly higher in GS≥8 patients than in GS=6 patients. We then trained a 5-fold cross-validated random forest model on 1,459 differentially methylated CpG probes (DMPs) between the GS=6 and GS≥8 groups to identify PCa aggressiveness biomarkers. The power of the predictive model was further reinforced by ranking the DMPs by decrease in Gini and retraining the model with the top 97 DMPs (testing AUC=0.920, prediction accuracy=0.847). Similar approaches were used to detect methylation differences between normal and PCa patient leukocyte DNA. Moreover, we analyzed 8 whole genome bisulfite sequencing (WGBS) patient leukocyte DNA specimens from the patient pool with Model based Analysis of Bisulfite Sequencing data (MOABS), an integrated tool for bisulfite sequencing analysis. DNA microarray and WGBS results were highly correlated (r=0.946) and shared biomarkers were identified. 
To make MOABS analysis widely accessible, we also implemented MOABS on the Galaxy platform and validated the power of MOABS-Galaxy with a quick test and public bisulfite sequencing datasets. In summary, we identified a CpG methylation signature in leukocyte DNA that is associated with PCa aggressiveness and biochemical recurrence, and developed the MOABS-Galaxy web service for DNA methylation analysis using bisulfite sequencing data. Our epigenetic mechanism study may provide an alternative option for PCa screening based on epigenetic biomarkers, and the implementation of MOABS could help biologists from non-computational backgrounds analyze bisulfite sequencing data
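
The idea of ranking features by decrease in Gini and keeping only the top-ranked ones, as described in the abstract above, can be sketched in plain Python. This is a simplified illustration under stated assumptions: it scores each feature by the best Gini impurity decrease of a single threshold split, whereas a random forest averages such decreases over many trees. The helper names (`gini`, `gini_decrease`, `best_decrease`), the probe labels, and all values are hypothetical toy data, not taken from the study.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of samples in class 1
    return 2.0 * p * (1.0 - p)


def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting the samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


def best_decrease(values, labels):
    """Largest decrease over all observed values used as thresholds."""
    return max(gini_decrease(values, labels, t) for t in values)


# Toy data: 0 = less aggressive group, 1 = more aggressive group.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "probe_1": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],  # separates the groups
    "probe_2": [0.5, 0.1, 0.9, 0.4, 0.8, 0.2],  # uninformative
}

# Rank features by their best Gini decrease and keep the top k.
ranking = sorted(
    features,
    key=lambda name: best_decrease(features[name], labels),
    reverse=True,
)
top_k = ranking[:1]
```

A model would then be retrained on the features in `top_k` only; with real data, the ranking and the retraining should both sit inside the cross-validation loop to avoid optimistic performance estimates.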

    Experimental vaccination for onchocerciasis and the identification of early markers of protective immunity

    Onchocerciasis, caused by Onchocerca volvulus, remains a major public health and socio-economic problem across the tropics, despite years of mass drug administration (MDA) with ivermectin to reduce disease burden. Through modelling, it has been shown that elimination cannot be achieved with MDA alone and additional tools are needed, such as vaccination, which remains the most cost-effective tool for long-term disease control. The feasibility of vaccination against O. volvulus can be demonstrated in the Litomosoides sigmodontis mouse model, which shows that vaccine-induced protection can be achieved with immunisation using irradiated L3, the infective stage of L. sigmodontis, and with microfilariae (Mf), the transmission stage of the parasite. There is further evidence of protective immunity in humans, as some individuals living in endemic areas show no signs of infection despite being exposed to the parasite (endemic normal). The protective efficacy of promising vaccine candidates was evaluated using an immunisation time course in the L. sigmodontis model, using either DNA plasmid or peptide vaccines. In immunisation experiments in L. sigmodontis, Mf numbers are used as a measure of protection and mark the end of an immunisation time course. However, when changes in gene expression were measured at the end of an immunisation time course, in attempts to identify gene signatures that could be used as markers of protection (correlates of protection) in the blood, no gene signatures were found to be associated with protection. This suggests that at the end of an immunisation time course, when protection is measured (change in Mf numbers), it is too late in infection to measure changes in immune pathways being triggered. Changes in gene expression were therefore measured in blood samples collected throughout an immunisation time course in the L. sigmodontis model, in order to identify the time points in an immunisation experiment that are the most indicative of protection. Two independent immunisation time courses were used, with either irradiated L3 or Mf as the vaccine against L. sigmodontis, as these elicit the greatest protection. This generated a high-dimensional dataset that was too large and complex for a differential fold-change analysis. Therefore, an analysis pipeline was created using machine learning algorithms to detect changes in gene expression throughout the time courses and so identify markers of protection. The 6-hour time point following immunisation showed the greatest change in gene expression, with the analysis pipeline identifying known pathways associated with vaccine-induced immunity. The pipeline was then applied to gene expression data from human samples obtained from individuals living in endemic areas who were either infected with O. volvulus or endemic normal (naturally protected), in order to identify pathways associated with protective immunity in humans. When vaccine-induced immunity seen in mice was compared with natural protective immunity in humans, there was some overlap in the pathways being triggered, suggesting that similar pathways are needed for protection and that if a vaccine can trigger the right pathways in mice, it is likely to be effective in humans. Overall, the machine learning analysis of the gene expression data shows not only that it is feasible to measure change in gene expression in blood during filarial infections, but also that during an immunisation time course it is the early time points following immunisation that are the most predictive of vaccine efficacy (protection outcome). One of the vaccine candidates, cysteine protease inhibitor-2 (CPI), is a known immuno-modulator that inhibits MHC-II antigen presentation on antigen presenting cells such as dendritic cells (DC).
This candidate has consistently been shown to induce protection when its immuno-modulatory active site is modified. In vitro studies showed that modification of the active site of CPI rescues antigen presentation in DC. This shows the importance of DC activation before the onset of infection, demonstrating the importance of triggering protective responses early in infection, and provides insight into how one of the vaccine candidates achieves protection