Collective analysis of multiple high-throughput gene expression datasets
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. Modern technologies have resulted in the production of numerous high-throughput biological datasets. However, the development of capable computational methods has not kept up with the pace at which new high-throughput datasets are generated. Amongst the most popular biological high-throughput datasets are gene expression datasets (e.g. microarray datasets). This work targets this aspect by proposing a suite of computational methods which can analyse multiple gene expression datasets collectively. The focal method in this suite is the unification of clustering results from multiple datasets using external specifications (UNCLES). This method applies clustering separately to multiple heterogeneous datasets which measure the expression of the same set of genes and then combines the resulting partitions in accordance with one of two types of external specifications: type A identifies the subsets of genes that are consistently co-expressed in all of the given datasets, while type B identifies the subsets of genes that are consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets. This extends the types of questions which can be addressed by computational methods, because existing clustering, consensus clustering, and biclustering methods are inapplicable to the aforementioned objectives. Moreover, in order to assist in setting some of the parameters required by UNCLES, the M-N scatter plots technique is proposed. These methods, and less mature versions of them, have been validated and applied to numerous real datasets from the biological contexts of budding yeast, bacteria, human red blood cells, and malaria. Carried out in collaboration with biologists, these applications have led to various biological insights.
In yeast, the role of the poorly understood gene CMR1 in the yeast cell-cycle has been further elucidated. Also, a novel subset of poorly understood yeast genes has been discovered with an expression profile consistently negatively correlated with the well-known ribosome biogenesis genes. Bacterial data analysis has identified two clusters of negatively correlated genes. Analysis of data from human red blood cells has produced some hypotheses regarding the regulation of the pathways producing such cells. On the other hand, malarial data analysis is still at a preliminary stage. Taken together, this thesis provides an original integrative suite of computational methods which scrutinise multiple gene expression datasets collectively to address previously unresolved questions, and provides the results and findings of many applications of these methods to real biological datasets from multiple contexts.
National Institute for Health Research (NIHR) and the Brunel College of Engineering, Design and Physical Science
UNCLES: Method for the identification of genes differentially consistently co-expressed in a specific subset of datasets
Background: Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Results: Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method can identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and can identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets as it produces focused clusters that conform well to known biological facts.
Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. Conclusions: The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.
The National Institute for Health Research (NIHR) under its Programme Grants for Applied Research Programme (Grant Reference Number RP-PG-0310-1004)
Consensus clustering framework for analysing fMRI datasets.
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. Neuroimaging of humans has gained a position of status within neuroscience. The modern functional
magnetic resonance imaging (fMRI) technique provides neuroscientists with a powerful tool to
depict the complex architecture of human brains. fMRI generates a large amount of data and many
analysis methods have been proposed to extract useful information from the data. Clustering
has been one of the most popular data-driven techniques to study brain functional connectivity,
which excels when traditional model-based approaches are difficult to implement. However,
the reliability and consistency of many findings are jeopardised by too many analysis methods,
parameters, and sometimes too few samples used. In this thesis, a consensus clustering
analysis framework for analysing fMRI data has been developed, aiming at overcoming the clustering
algorithm selection problem as well as reliability issues in neuroimaging. The framework is
able to identify groups of voxels representing brain regions that consistently exhibit correlated
BOLD activities across many experimental conditions by integrating clustering results from multiple
clustering algorithms and various parameters such as the number of clusters. In the framework,
the individual clustering result generation is aided by high performance grid computing technique
to reduce the overall computational time. The integration of clustering results is implemented
by a technique named binarisation of consensus partition matrix (Bi-CoPaM) adapted and
enhanced for fMRI data analysis. The whole framework has been validated and is robust to participants'
individual variability, yielding the most complete and reproducible clusters compared to the
traditional single clustering approach. This framework has been applied to two real fMRI studies
that investigate brain responses to listening to emotional music with different preferences. In
the first fMRI study, three brain structures related to visual, reward, and auditory processing are
found to have intrinsic temporal patterns of coherent neuroactivity during affective processing;
this is one of the few data-driven studies to have observed such patterns. In the second study, different
levels of engagement, i.e. intentional to unintentional, with music have unique effects on the auditory-limbic
connectivity when listening to music, which has not been investigated and understood well in the neuroscience of music field. We believe the work in this thesis has demonstrated an effective and competent approach to address the reliability and consistency concerns in fMRI data analysis.
Transcriptomic investigation of the adaptation of Streptococcus pneumoniae
Streptococcus pneumoniae colonises the human nasopharynx as a commensal but can translocate to the lungs, meninges, and blood to cause potentially fatal infections. These host niches exhibit diverse physiological environments. Differences in adaptation to these conditions may explain differences between serotypes and genotypes in their ability to colonise the human host, be transmitted, and cause disease. RNA sequencing (RNA-Seq) was used to investigate adaptation of clinical S. pneumoniae strains to different stress environments. In Chapter 3, to establish the optimal experimental conditions, the effects of carbohydrate source, temperature, and iron concentrations on bacterial growth dynamics were evaluated. S. pneumoniae strains, selected on the basis of their ability to be carried and cause disease, showed differential growth phenotypes. In Chapter 4, to facilitate robust transcriptomic analysis, high-quality genome assemblies of S. pneumoniae serotype 1 (highly invasive, rarely found in carriage) and serotype 6B (rarely invasive, highly carried) strains were generated and characterised. A pneumococcal transcriptomic analysis pipeline was developed in Chapter 5 by investigating the transcriptomic response of two single gene knockouts of S. pneumoniae serotype 6B lacking the biosynthesis genes fhs or proABC. These mutants have been shown to be attenuated in vivo and the aim was to identify the transcriptomic basis for this. Adaptation by the fhs mutant included upregulation of pathways involved in secondary metabolite biosynthesis and quorum sensing, while the proABC mutant showed upregulation of carbohydrate metabolism pathways. In Chapters 6 and 7, the transcriptomic adaptations of S. pneumoniae serotype 1 and serotype 6B strains to altered iron and temperature levels, respectively, were delineated, indicating strain-specific gene expression with the majority of differential regulation occurring in core pneumococcal genes.
In Chapter 8, to pave the way for investigating the S. pneumoniae transcriptome in human samples, a challenge in pneumococcal research, an approach to directly isolate high-quality pneumococcal RNA from human carriers was developed. The work in this thesis provides new insights into the gene regulation of clinical S. pneumoniae strains under various environmental exposures.