Paradigm of tunable clustering using binarization of consensus partition matrices (Bi-CoPaM) for gene discovery
Copyright © 2013 Abu-Jamous et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Clustering analysis has a growing role in the study of co-expressed genes for gene discovery. Conventional binary and fuzzy clustering do not embrace the biological reality that some genes may be irrelevant for a problem and not be assigned to a cluster, while other genes may participate in several biological functions and should simultaneously belong to multiple clusters. Also, these algorithms cannot generate tight clusters that focus on their cores or wide clusters that overlap and contain all possibly relevant genes. In this paper, a new clustering paradigm is proposed. In this paradigm, all three eventualities of a gene being exclusively assigned to a single cluster, being assigned to multiple clusters, and being not assigned to any cluster are possible. These possibilities are realised through the primary novelty of the introduction of tunable binarization techniques. Results from multiple clustering experiments are aggregated to generate one fuzzy consensus partition matrix (CoPaM), which is then binarized to obtain the final binary partitions. This is referred to as Binarization of Consensus Partition Matrices (Bi-CoPaM). The method has been tested with a set of synthetic datasets and a set of five real yeast cell-cycle datasets. The results demonstrate its validity in generating relevant tight, wide, and complementary clusters that can meet requirements of different gene discovery studies.
National Institute for Health Researc
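As an illustrative sketch of the aggregate-then-binarize idea described in this abstract, the Python snippet below averages several hard partitions into a fuzzy consensus matrix and then applies one tunable threshold. The function names, the toy data, and the simple value-threshold rule are assumptions made for illustration; the paper's actual binarization techniques may differ, and the sketch assumes that clusters from different runs have already been matched to one another.

```python
def consensus_partition_matrix(partitions):
    """Average aligned membership matrices (clusters x genes) into one fuzzy matrix."""
    n_runs = len(partitions)
    n_clusters = len(partitions[0])
    n_genes = len(partitions[0][0])
    return [
        [sum(p[k][g] for p in partitions) / n_runs for g in range(n_genes)]
        for k in range(n_clusters)
    ]


def binarize_by_threshold(copam, threshold):
    """Hypothetical rule: a gene joins every cluster whose consensus value >= threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in copam]


def clusters_per_gene(binary):
    """Count how many clusters each gene (column) was assigned to."""
    return [sum(column) for column in zip(*binary)]


# Three hypothetical clustering runs: 2 clusters (rows) x 4 genes (columns).
runs = [
    [[1, 1, 0, 0], [0, 0, 1, 1]],
    [[1, 0, 0, 0], [0, 1, 1, 1]],
    [[1, 1, 0, 1], [0, 0, 1, 0]],
]

copam = consensus_partition_matrix(runs)
tight = binarize_by_threshold(copam, 1.0)  # keep only unanimous assignments
wide = binarize_by_threshold(copam, 0.3)   # keep any assignment seen in a run

print(clusters_per_gene(tight))  # [1, 0, 1, 0]
print(clusters_per_gene(wide))   # [1, 2, 1, 2]
```

With the strict threshold, genes 1 and 3 are left unassigned; with the loose threshold, the same two genes belong to both clusters, while genes 0 and 2 stay exclusive to one cluster in either case.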
Yeast gene CMR1/YDL156W is consistently co-expressed with genes participating in DNA-metabolic processes in a variety of stringent clustering experiments
© 2013 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/3.0/, which permits unrestricted use, provided the original author and source are credited.
The binarization of consensus partition matrices (Bi-CoPaM) method has, among its unique features, the ability to perform ensemble clustering over the same set of genes from multiple microarray datasets by using various clustering methods in order to generate tunable tight clusters. Therefore, we have applied the Bi-CoPaM method to the most synchronized 500 cell-cycle-regulated yeast genes from different microarray datasets to produce four tight, specific and exclusive clusters of co-expressed genes. We found that 19 genes formed the tightest of the four clusters, and that this cluster included the gene CMR1/YDL156W, which was an uncharacterized gene at the time of our investigations. Two very recent proteomic and biochemical studies have independently revealed many facets of the CMR1 protein, although the precise functions of the protein remain to be elucidated. Our computational results complement these biological results and add more evidence to their recent findings of CMR1 as potentially participating in many of the DNA-metabolism processes such as replication, repair and transcription. Interestingly, our results demonstrate the close co-expression of CMR1 with the replication protein A (RPA), the cohesion complex and the DNA polymerases α, δ and ε, and suggest functional relationships between CMR1 and the respective proteins. In addition, the analysis provides further substantial evidence that the expression of the CMR1 gene could be regulated by the MBF complex.
In summary, the application of a novel analytic technique to large biological datasets has provided supporting evidence for a gene of previously unknown function, further hypotheses to test, and a more general demonstration of the value of sophisticated methods to explore new large datasets now so readily generated in biological experiments.
National Institute for Health Researc
Clustering consistency in neuroimaging data analysis
Clustering techniques have been applied to neuroscience data analysis for decades. New algorithms keep being developed and applied to address different problems. However, when it comes to the applications of clustering, it is often hard to select the appropriate algorithm and evaluate the quality of clustering results because the ground truth is unknown. Conclusions based on only one specific algorithm might also be biased, because each algorithm makes its own assumptions about the structure of the data, which might not match the real data. In this paper, we explore the benefits of integrating the clustering results from multiple clustering algorithms by a tunable consensus clustering strategy and demonstrate the importance and necessity of consistency in neuroimaging data analysis.
Automatic region-of-interest extraction in low depth-of-field images
PhD Thesis
Automatic extraction of focused regions from images with low depth-of-field
(DOF) is a problem without an efficient solution yet. The capability of
extracting focused regions can help to bridge the semantic gap by integrating
image regions which are meaningfully relevant and generally do not exhibit
uniform visual characteristics. There exist two main difficulties for extracting
focused regions from low DOF images using high-frequency based techniques:
computational complexity and performance.
A novel unsupervised segmentation approach based on ensemble clustering is
proposed to extract the focused regions from low DOF images in two stages.
The first stage is to cluster image blocks in a joint contrast-energy feature space
into three constituent groups. To achieve this, we make use of a normal
mixture-based model along with standard expectation-maximization (EM)
algorithm at two consecutive levels of block size. To avoid the common
problem of local optima experienced in many models, an ensemble EM
clustering algorithm is proposed. As a result, relevant blocks, i.e., block-based
region-of-interest (ROI), closely conforming to image objects are extracted.
In stage two, two different approaches have been developed to extract
pixel-based ROI. In the first approach, a binary saliency map is constructed
from the relevant blocks at the pixel level, which is based on difference of
Gaussian (DOG) and binarization methods. Then, a set of morphological
operations is employed to create the pixel-based ROI from the map.
Experimental results demonstrate that the proposed approach achieves an
average segmentation performance of 91.3% and is computationally 3 times
faster than the best existing approach. In the second approach, a minimal graph
cut is constructed by using the max-flow method and also by using
object/background seeds provided by the ensemble clustering algorithm.
Experimental results demonstrate an average segmentation performance of 91.7%
and an approximately 50% reduction in average computational time for the
proposed colour-based approach compared with existing unsupervised
approaches.
Spectral Analysis Network for Deep Representation Learning and Image Clustering
Deep representation learning is a crucial procedure in multimedia analysis
and attracts increasing attention. Most of the popular techniques rely on
convolutional neural networks and require a large amount of labeled data in the
training procedure. However, it is time consuming or even impossible to obtain
the label information in some tasks due to cost limitations. Thus, it is
necessary to develop unsupervised deep representation learning techniques. This
paper proposes a new network structure for unsupervised deep representation
learning based on spectral analysis, which is a popular technique with solid
theoretical foundations. Compared with the existing spectral analysis methods, the
proposed network structure has at least three advantages. Firstly, it can
identify the local similarities among images at the patch level and is thus more
robust against occlusion. Secondly, through multiple consecutive spectral
analysis procedures, the proposed network can learn more clustering-friendly
representations and is capable of revealing the deep correlations among data
samples. Thirdly, it can elegantly integrate different spectral analysis
procedures, so that each spectral analysis procedure can contribute its individual
strengths in dealing with different data sample distributions. Extensive
experimental results show the effectiveness of the proposed methods on various
image clustering tasks.
UNCLES: Method for the identification of genes differentially consistently co-expressed in a specific subset of datasets
Background: Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets.
Results: Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets, as it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn.
Conclusions: The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.
The National Institute for Health Research (NIHR) under its Programme Grants for Applied Research Programme (Grant Reference Number RP-PG-0310-1004)
Collective analysis of multiple high-throughput gene expression datasets
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London
Modern technologies have resulted in the production of numerous high-throughput biological datasets. However, the pace of development of capable computational methods has not kept up with the pace of generation of new high-throughput datasets. Amongst the most popular biological high-throughput datasets are gene expression datasets (e.g. microarray datasets). This work targets this aspect by proposing a suite of computational methods which can analyse multiple gene expression datasets collectively. The focal method in this suite is the unification of clustering results from multiple datasets using external specifications (UNCLES). This method applies clustering separately to multiple heterogeneous datasets which measure the expression of the same set of genes, and then combines the resulting partitions in accordance with one of two types of external specifications: type A identifies the subsets of genes that are consistently co-expressed in all of the given datasets, while type B identifies the subsets of genes that are consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets. This expands the types of questions which can be addressed by computational methods, because existing clustering, consensus clustering, and biclustering methods cannot address the aforementioned objectives. Moreover, in order to assist in setting some of the parameters required by UNCLES, the M-N scatter plots technique is proposed. These methods, and less mature versions of them, have been validated and applied to numerous real datasets from the biological contexts of budding yeast, bacteria, human red blood cells, and malaria. In collaboration with biologists, these applications have led to various biological insights.
In yeast, the role of the poorly understood gene CMR1 in the yeast cell-cycle has been further elucidated. Also, a novel subset of poorly understood yeast genes has been discovered with an expression profile consistently negatively correlated with the well-known ribosome biogenesis genes. Bacterial data analysis has identified two clusters of negatively correlated genes. Analysis of data from human red blood cells has produced some hypotheses regarding the regulation of the pathways producing such cells. On the other hand, malarial data analysis is still at a preliminary stage. Taken together, this thesis provides an original integrative suite of computational methods which scrutinise multiple gene expression datasets collectively to address previously unresolved questions, and provides the results and findings of many applications of these methods to real biological datasets from multiple contexts.
National Institute for Health Research (NIHR) and the Brunel College of Engineering, Design and Physical Science
The modular structure of brain functional connectivity networks: a graph theoretical approach
Complex networks theory offers a framework for the analysis of brain functional connectivity as measured by magnetic resonance imaging. Within this approach the brain is represented as a graph comprising nodes connected by links, with nodes corresponding to brain regions and the links to measures of inter-regional interaction. A number of graph theoretical methods have been proposed to analyze the modular structure of these networks. The most widely used metric is Newman's Modularity, which identifies modules within which links are more abundant than expected on the basis of a random network. However, Modularity is limited in its ability to detect relatively small communities, a problem known as the "resolution limit". As a consequence, unambiguously identifiable modules, like complete sub-graphs, may be unduly merged into larger communities when they are too small compared to the size of the network. This limit, first demonstrated for Newman's Modularity, is quite general and affects, to a different extent, all methods that seek to identify the community structure of a network through the optimization of a global quality function. Hence, the resolution limit may represent a critical shortcoming for the study of brain networks, and is likely to have affected many of the studies reported in the literature. This work pioneers the use of Surprise and Asymptotical Surprise, two quality functions rooted in probability theory that aim at overcoming the resolution limit for both binary and weighted networks. Heuristics for their optimization are developed and tested, showing that the resulting optimal partitioning can highlight anatomically and functionally plausible modules from brain connectivity datasets, on binary and weighted networks.
This novel approach is applied to the partitioning of two different human brain networks that have been extensively characterized in the literature, to address the resolution-limit issue in the study of the brain modular structure. Surprise maximization in human resting state networks revealed the presence of a rich structure of modules with heterogeneous size distribution undetectable by current methods. Moreover, Surprise led to a different, more accurate classification of the network's connector hubs, the elements that integrate the brain modules into a cohesive structure. In synthetic networks, Asymptotical Surprise showed high sensitivity and specificity in the detection of ground-truth structures, particularly in the presence of noise and variability such as those observed in experimental functional MRI data. Finally, the methodological advances introduced here are shown to be a helpful tool to better discern differences between the modular organization of functional connectivity of healthy subjects and schizophrenic patients. Importantly, these differences may point to new clinical hypotheses on the etiology of schizophrenia, and they would have gone unnoticed with resolution-limited methods. This may call for a revisitation of some of the current models of the modular organization of the healthy and diseased brain.
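The resolution limit mentioned in this abstract can be reproduced on a small toy graph. The Python sketch below is not taken from the thesis: it assumes the standard definition of Newman's Modularity for an unweighted, undirected graph, Q = sum over modules of (l_c / m - (d_c / 2m)^2), and uses a hypothetical ring of 30 five-node cliques joined by single links. All function names and the test graph are illustrative choices.

```python
def ring_of_cliques(n_cliques, clique_size):
    """Edge list for complete sub-graphs arranged in a ring, one link between neighbours."""
    edges = []
    for c in range(n_cliques):
        start = c * clique_size
        nodes = range(start, start + clique_size)
        # All links inside the clique.
        edges += [(u, v) for u in nodes for v in nodes if u < v]
        # One link from the last node of this clique to the first node of the next.
        next_start = ((c + 1) % n_cliques) * clique_size
        edges.append((start + clique_size - 1, next_start))
    return edges


def modularity(edges, labels):
    """Newman's Modularity of a partition; labels[node] is the module of that node."""
    m = len(edges)
    internal = {}    # links with both ends in the same module
    degree_sum = {}  # total degree of the nodes in each module
    for u, v in edges:
        for node in (u, v):
            degree_sum[labels[node]] = degree_sum.get(labels[node], 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


edges = ring_of_cliques(30, 5)
one_module_per_clique = [node // 5 for node in range(150)]
adjacent_cliques_merged = [node // 10 for node in range(150)]

q_single = modularity(edges, one_module_per_clique)    # about 0.876
q_merged = modularity(edges, adjacent_cliques_merged)  # about 0.888
print(q_single, q_merged)
```

Although each clique is an unambiguous module, the partition that merges neighbouring cliques in pairs scores higher, so a method that maximizes Modularity would prefer the merged solution on this graph.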