A balanced iterative random forest for gene selection from microarray data
Background: The wealth of gene expression values being generated by high throughput microarray technologies leads to complex high dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging
An effective mixed-model for screening differentially expressed genes of breast cancer based on LR-RF
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. National Natural Science Foundation of China; International Technology Collaboration Research Program of Chin
Iterative Random Forests to detect predictive and stable high-order interactions
Genomics has revolutionized biology, enabling the interrogation of whole
transcriptomes, genome-wide binding sites for proteins, and many other
molecular processes. However, individual genomic assays measure elements that
interact in vivo as components of larger molecular machines. Understanding how
these high-order interactions drive gene expression presents a substantial
statistical challenge. Building on Random Forests (RF), Random Intersection
Trees (RITs), and through extensive, biologically inspired simulations, we
developed the iterative Random Forest algorithm (iRF). iRF trains a
feature-weighted ensemble of decision trees to detect stable, high-order
interactions with the same order of computational cost as RF. We demonstrate the
utility of iRF for high-order interaction discovery in two prediction problems:
enhancer activity in the early Drosophila embryo and alternative splicing of
primary transcripts in human-derived cell lines. In Drosophila, among the 20
pairwise transcription factor interactions iRF identifies as stable (returned
in more than half of bootstrap replicates), 80% have been previously reported
as physical interactions. Moreover, novel third-order interactions, e.g.
between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order
relationships that are candidates for follow-up experiments. In human-derived
cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated
splicing regulation, and identified novel 5th and 6th order interactions,
indicative of multi-valent nucleosomes with specific roles in splicing
regulation. By decoupling the order of interactions from the computational cost
of identification, iRF opens new avenues of inquiry into the molecular
mechanisms underlying genome biology
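The stability criterion quoted in this abstract (an interaction counts as stable when it is returned in more than half of bootstrap replicates) can be sketched as follows. This is a minimal illustration, not the iRF implementation: the function name `stable_interactions`, the input layout (one collection of feature tuples per replicate), and the toy feature names are all assumptions made for the example.

```python
from collections import Counter


def stable_interactions(replicates, threshold=0.5):
    """Keep interactions recovered in more than `threshold` of the replicates.

    `replicates` is a list with one entry per bootstrap replicate; each entry
    is an iterable of interactions, and each interaction is an iterable of
    feature names. Feature order inside an interaction is ignored.
    (Hypothetical input layout, chosen for illustration only.)
    """
    # Normalise every interaction to a frozenset so ("A", "B") == ("B", "A"),
    # and deduplicate within each replicate.
    normalised = [{frozenset(inter) for inter in rep} for rep in replicates]
    n = len(normalised)
    if n == 0:
        return {}
    # Count in how many replicates each interaction appears.
    counts = Counter(inter for rep in normalised for inter in rep)
    # Strict inequality: "more than half" when threshold is 0.5.
    return {inter: c / n for inter, c in counts.items() if c / n > threshold}


# Toy input, not data from the paper.
toy_replicates = [
    [("A", "B"), ("A", "B", "C")],
    [("B", "A")],
    [("A", "B"), ("C", "D")],
    [("C", "D")],
]
stable = stable_interactions(toy_replicates)
```

In the toy input, the pair A/B appears in three of four replicates and is kept, while C/D appears in exactly half and is dropped because the comparison is strict.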
Genetic and Epigenetic Variations in Asthma and Wheeze Illnesses
Asthma, a chronic respiratory condition, is common worldwide with no cure and limited effective prevention strategies. It is well recognized that asthma has a multifaceted etiology, though many of the underlying mechanisms involved in asthma development, persistence and remission are still unclear. Epigenetic mechanisms, such as DNA methylation, regulate gene expression but are not related to changes in the actual DNA sequence. Recently, differential patterns of DNA methylation within many genes have been associated with asthma, particularly within genes involved in the differentiation of pro-inflammatory T-helper 2 (Th2) cells. DNA methylation patterns within lesser-known biologic pathways undoubtedly are involved in asthma pathogenesis as well. The purpose of this dissertation was three-fold. First, we explored whether genetic and epigenetic variations within Th2-genes differed among persons with different phenotypic presentations of wheeze illnesses. Second, we conducted an epigenome-wide association study (EWAS) to identify novel DNA methylation loci associated with asthma. Last, we conducted a follow-up study of our top EWAS findings to investigate whether the expression of the associated genes was predictive of infant wheeze. We found that DNA methylation (DNA-M) within GATA3 and IL4 varied based on different wheeze-illness phenotypes, suggesting that Th2-genes are under differential epigenetic regulation for different presentations of asthma. We also identified nine novel DNA methylation loci (cg25578728 in CHD7, cg16658191 in HK1, cg00100703 in UNC45B, cg07948085 [intergenic], cg04359558 in LITAF, cg20417424 in ST6GALNAC5, cg19974715 [intergenic], cg01046943 in NUP210 and cg14727512 in DGCR14) associated with asthma at age 18. For two of those genes (HK1 and LITAF), expression levels in cord blood were predictive of infant wheeze.
Interestingly, the observed methylation and expression patterns of HK1 and LITAF could be consistent with increased resistance to apoptotic signaling. Apoptotic resistance among pro-inflammatory cells can increase the duration of an inflammatory response and is associated with asthmatic pathophysiology. Thus, we may have identified under-studied genes and their epigenetic regulation, which could play important roles in asthma pathophysiology. These genes may offer new insights into the etiology of asthma, be investigated as potential targets for therapy, or be considered for inclusion in algorithms used to predict early-life wheeze and later-life asthma
Prostate Cancer Epigenetic Mechanism Study and Biomarker Discovery Using Bioinformatics Approaches
Most screening-detected prostate cancer (PCa) is indolent and not lethal. Biomarkers that can predict aggressive disease independent of clinical features are needed to improve risk stratification of localized PCa patients and reduce overtreatment. Epigenetic biomarkers, especially methylation biomarkers, are more stable in biofluids or in samples of below-average quality. We aimed to identify DNA methylation differences in leukocytes between clinically defined aggressive and non-aggressive PCa to identify potential biomarkers for PCa diagnosis. To accomplish this aim, we performed DNA methylation profiling in leukocyte DNA samples obtained from 287 PCa patients with Gleason Score (GS) 6 and ≥8 using Illumina 450k methylation arrays, and from 8 PCa patients using whole genome bisulfite sequencing. We observed that DNA methylation levels in the core promoters and the first exon region were significantly higher in GS≥8 patients than in GS=6 patients. We then trained a random forest model with 5-fold cross-validation on 1,459 differentially methylated CpG probes (DMPs) between the GS=6 and GS≥8 groups to identify PCa aggressiveness biomarkers. The power of the predictive model was further reinforced by ranking the DMPs by Gini decrease and retraining the model with the top 97 DMPs (testing AUC=0.920, prediction accuracy=0.847). Similar approaches were used to detect methylation differences between normal and PCa patient leukocyte DNA. Moreover, we analyzed 8 whole genome bisulfite sequencing (WGBS) patient leukocyte DNA specimens from the patient pool with Model based Analysis of Bisulfite Sequencing data (MOABS), an integrated tool for bisulfite sequencing analysis. DNA microarray and WGBS results were highly correlated (r=0.946), and biomarkers common to both were identified. To make MOABS analysis widely accessible, we also used bioinformatics methods to implement MOABS on the Galaxy platform and validated the power of MOABS-Galaxy with quick test and public bisulfite sequencing datasets.
In summary, we identified a CpG methylation signature in leukocyte DNA that is associated with PCa aggressiveness and biochemical recurrence, and developed the MOABS-Galaxy web service for DNA methylation analysis using bisulfite sequencing data. Our epigenetic mechanism study may provide an alternative option for PCa screening based on epigenetic biomarkers, and the implementation of MOABS could help biologists without a computational background with bisulfite sequencing data analysis
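The two metrics reported in this abstract, testing AUC and prediction accuracy, can be computed from held-out predictions roughly as sketched below. This is a generic illustration and not the authors' pipeline: the function names, the label coding (1 for GS≥8, 0 for GS=6), the 0.5 cutoff, and the toy scores are all assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    AUC equals the probability that a randomly chosen positive sample gets a
    higher score than a randomly chosen negative one; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, cutoff=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    predictions = [1 if s >= cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy held-out predictions, not data from the study.
# Assumed coding: 1 = GS>=8 (aggressive), 0 = GS=6 (non-aggressive).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_acc = accuracy(toy_labels, toy_scores)
```

AUC depends only on how the scores rank the samples, whereas accuracy also depends on the chosen cutoff, which is why a model can score well on one metric and less well on the other.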
Experimental vaccination for onchocerciasis and the identification of early markers of protective immunity
Onchocerciasis, caused by Onchocerca volvulus, remains a major public health and
socio-economic problem across the tropics, despite years of mass drug administration
(MDA) with Ivermectin to reduce disease burden. Through modelling, it has been
shown that elimination cannot be achieved with MDA alone and additional tools are
needed, such as vaccination, which remains the most cost-effective tool for long-term
disease control. The feasibility of vaccination against O. volvulus can be
demonstrated in the Litomosoides sigmodontis mouse model, which shows that
vaccine-induced protection can be achieved by immunisation with irradiated L3, the
infective stage of L. sigmodontis, and with microfilariae (Mf), the transmission stage
of the parasite. There is further evidence of protective immunity in humans:
individuals living in endemic areas who show no signs of infection despite being
exposed to the parasite (endemic normal).
The protective efficacy of promising vaccine candidates was evaluated using an
immunisation time course in the L. sigmodontis model, using either DNA plasmid or
peptide vaccines. In immunisation experiments in L. sigmodontis, Mf numbers are
used as a measure of protection and mark the end of an immunisation time course.
However, when changes in gene expression were measured at the end of an
immunisation time course, in attempts to identify gene signatures that could be used
as markers of protection (correlates of protection) in the blood, no gene signatures were
found to be associated with protection. This suggests that at the end of an immunisation
time course, when protection is measured (change in Mf numbers), it is too late in
infection to measure changes in immune pathways being triggered.
Changes in gene expression were therefore measured in blood samples collected
throughout an immunisation time course in the L. sigmodontis model, in order to
identify the time points in an immunisation experiment that are the most indicative
of protection. Two independent immunisation time courses were used, either using
irradiated L3 or Mf as vaccine against L. sigmodontis, as these elicit the greatest
protection. This generated a large, high-dimensional dataset that was too large and
complex for a differential fold-change analysis. Therefore, an analysis pipeline was
created using machine learning algorithms to detect changes in gene expression
throughout the time courses and thereby identify markers of protection.
The 6-hour time point following immunisation showed the greatest change in gene
expression, with the analysis pipeline identifying known pathways associated with
vaccine-induced immunity. The pipeline was applied to gene expression data from
human samples obtained from individuals living in endemic areas who were either
infected with O. volvulus or endemic normal (naturally protected), in order to identify
pathways associated with protective immunity in humans. When comparing
vaccine-induced immunity seen in mice and natural protective immunity in humans, there was
some overlap in pathways being triggered, suggesting that similar pathways are needed
for protection and that if a vaccine can trigger the right pathways in mice, it is likely
to be effective in humans.
Overall, the machine learning analysis of the gene expression data shows not only that
it is feasible to measure change in gene expression in blood during filarial infections,
but also that during an immunisation time course it is the early time points following
immunisation that are the most predictive of vaccine efficacy (protection outcome).
One of the vaccine candidates, cysteine protease inhibitor-2 (CPI), is a known
immuno-modulator that inhibits MHC-II antigen presentation on antigen presenting
cells such as dendritic cells (DC). This candidate has consistently been shown to
induce protection when its immuno-modulatory active site is modified. In in vitro
studies, it was shown that modification of the active site of CPI rescues antigen
presentation in DC. This shows the importance of DC activation before the onset of
infection, demonstrating the importance of triggering protective responses early in
infection, and provides insight into how one of the vaccine candidates achieves
protection