3 research outputs found
Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks
Cross-experiment comparisons in public data compendia are challenged by unmatched conditions and technical noise. The ADAGE method, which performs unsupervised integration with denoising autoencoder neural networks, can identify biological patterns, but because ADAGE models, like many neural networks, are over-parameterized, different ADAGE models perform equally well. To enhance model robustness and better build signatures consistent with biological pathways, we developed an ensemble ADAGE (eADAGE) that integrated stable signatures across models. We applied eADAGE to a compendium of Pseudomonas aeruginosa gene expression profiling experiments performed in 78 media. eADAGE revealed a phosphate starvation response controlled by PhoB in media with moderate phosphate and predicted that a second stimulus provided by the sensor kinase, KinB, is required for this PhoB activation. We validated this relationship using both targeted and unbiased genetic approaches. eADAGE, which captures stable biological patterns, enables cross-experiment comparisons that can highlight measured but undiscovered relationships.
Funding: Gordon and Betty Moore Foundation (GBMF 4552); National Institutes of Health (U.S.) (grant R01-AI091702); Cystic Fibrosis Foundation (STANTO15R0
Optimisation and parallelisation of the partitioning around medoids function in R
R is a free statistical programming language commonly used for the analysis of high-throughput microarray and other data. It is currently unable to easily utilise multi-processor architectures without substantial changes to existing R scripts. Further, working with large volumes of data often leads to slow processing and even memory allocation faults. A recent survey highlighted clustering algorithms as both computation- and data-intensive bottlenecks in post-genomic data analyses. These algorithms aim to sort numeric vectors (such as gene expression profiles) into groups by minimising vector distances within groups and maximising them between groups. This paper describes the optimisation and parallelisation of a popular clustering algorithm, partitioning around medoids (PAM), for the Simple Parallel R INTerface (SPRINT). SPRINT allows R users to exploit high performance computing systems without expert knowledge of such systems. This paper reports on a serial optimisation of the original code and a subsequent parallel implementation. The parallel implementation enables the processing of data sets that exceed the available physical memory and can yield, depending on the data set, a more than 100-fold increase in performance.
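To make the clustering idea concrete, here is a minimal, serial sketch of a medoid-swap search in Python. It is an illustration only and is not the SPRINT implementation or the optimised code the paper describes. The function names (`total_cost`, `pam`) and the toy data are hypothetical, and the initialisation simply takes the first k points, whereas the classical PAM algorithm uses a dedicated BUILD phase.

```python
from itertools import product


def total_cost(dist, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Naive medoid-swap search over a precomputed distance matrix.

    Assumption: medoids start as the first k points. Any swap of a medoid
    with a non-medoid that lowers the total cost is accepted immediately,
    and the search stops when a full pass finds no improving swap.
    """
    n = len(dist)
    medoids = list(range(k))
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos, cand in product(range(k), range(n)):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            cost = total_cost(dist, trial)
            if cost < best:
                medoids, best, improved = trial, cost, True
    # Assign every point to its nearest medoid (label = medoid index).
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return sorted(medoids), labels, best


# Toy one-dimensional "profiles" forming two obvious groups (made-up data).
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels, cost = pam(dist, k=2)
```

Each pass evaluates every (medoid, candidate) swap against all points, which is the kind of cost that makes the algorithm a bottleneck on large expression data sets. The full distance matrix held in memory here is also the structure that grows quadratically with the number of profiles.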
Statistical modelling of masked gene regulatory pathway changes across microarray studies of interferon gamma activated macrophages
Interferon gamma (IFN-γ) regulation of macrophages plays an essential role in innate immunity and
pathogenicity of viral infections by directing large and small genome-wide changes in the transcriptional
program of macrophages. Smaller changes at the transcriptional level are difficult to detect but can have
profound biological effects, motivating the hypothesis of this thesis that responses of macrophages to
immune activation by IFN-γ include small quantitative changes that are masked by noise but represent
meaningful transcriptional systems in pathways against infection. To test this hypothesis, statistical
meta-analysis of microarray studies is investigated as a tool to obtain the necessary increase in analysis
sensitivity. Three meta-analysis models (Effect size model, Rank Product model, Fisher’s sum of logs) and three
further modified versions were applied to a heterogeneous set of four microarray studies on the effect of
IFN-γ on murine macrophages. Performance assessments include recovery of known biology and are
followed by development of novel biological hypotheses through secondary analysis of meta-analysis
outcomes in the context of independent biological data sources. A separate network analysis of a microarray
time course study investigates whether gene sets with coordinated time-dependent relationships can
also identify subtle IFN-γ related transcriptional changes in macrophages that match those identified
through meta-analysis.
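As an illustration of how combining studies can raise sensitivity, the sketch below implements the standard Fisher's sum of logs statistic in Python. It is not the thesis code. The function name `fisher_combined_p` and the example p-values are hypothetical. The statistic X = -2 Σ ln p_i is compared against a chi-square distribution with 2k degrees of freedom for k independent studies, and for even degrees of freedom the upper tail has a closed form, so no external library is assumed.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's sum of logs.

    X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees
    of freedom under the null hypothesis. For even degrees of freedom the
    upper-tail probability is exp(-X/2) * sum_{j<k} (X/2)^j / j!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    tail = sum(half_stat ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_stat) * tail


# Hypothetical gene with a weak, consistent signal in four studies:
# no single study reaches p < 0.05, but the combined evidence does.
combined = fisher_combined_p([0.1, 0.1, 0.1, 0.1])
```

With these made-up inputs the combined p-value is roughly 0.018. The method assumes the studies are independent, and it only uses p-values, so it ignores the direction and size of the effect, unlike an effect size model.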
It was found that all meta-analysis models can identify biologically meaningful transcription at
enhanced sensitivity levels, with a slight performance advantage for a non-parametric model
(Rank Product meta-analysis). Meta-analysis yielded consistently regulated genes, hidden in individual
microarray studies, related to sterol biosynthesis (Stard3, Pgrmc1, Galnt6, Rab11a, Golga4, Lrp10),
implicated in cross-talk between type II and type I interferon or IL-10 signalling (Tbk1, Ikbke, Clic4,
Ptpre, Batf), and circadian rhythm (Csnk1e). Further network analysis confirms that meta-analysis
findings are highly concentrated in a distinct immune response cluster of co-expressed genes, and also
identifies global expression modularisation in IFN-γ treated macrophages, pointing to Trafd1 as a
central anti-correlated node topologically linked to interactions with down-regulated sterol biosynthesis
pathway members.
Outcomes from this thesis suggest that small transcriptional changes in IFN-γ activated macrophages
can be detected by enhancing sensitivity through combination of multiple microarray studies. Together
with use of bioinformatical resources, independent data sets and network analysis, further validation
assigns a potential role for low or variable transcription genes in linking type II interferon signalling to
type I and TLR signalling, as well as the sterol metabolic network.