University of Warwick. Centre for Research in Statistical Methodology
Abstract
Background: An increasing number of microarray experiments produce time series of expression levels for many
genes. Some recent clustering algorithms respect the time ordering of the data and are, importantly, extremely
fast. The focus of this paper is the development of such an algorithm on a microarray data set consisting of
22,810 genes of the plant Arabidopsis thaliana measured at 13 time points over two days. Circadian rhythms
control the timing of various physiological and metabolic processes and are regulated by genes acting in
feedback loops. The aim is to cluster and classify the expression profiles in order to identify genes potentially
involved in, and regulated by, the circadian clock.
Results: A greedy search over time series of expression levels (where series are compared pairwise, the two most
similar are put in the same cluster, and so forth) gives a fast result but explores only a very limited number of
the possible partitions of the profiles. We propose an improved, deterministic method based on a multi-step
application of a conjugate Bayesian clustering algorithm. It allows the entire space to be searched more fully and
intelligently. The values of the summary statistics are used not only to score clusters of genes, but also to guide
the search of the vast partition space. By following this procedure, we are able to cluster genes that are known
to be rhythmically expressed with genes of previously unknown function, thus suggesting potentially interesting
targets for future experiments.
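
To make the greedy baseline described in the Results concrete, the following is a minimal Python sketch of pairwise greedy merging, not the authors' implementation. It makes several assumptions that are not in the original: the names `sq_dist`, `centroid` and `greedy_cluster` are hypothetical, similarity is measured by squared Euclidean distance between cluster mean profiles as a stand-in for the conjugate Bayesian score, and the search stops at a fixed number of clusters. The toy profiles are invented purely for illustration.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(profiles, members):
    """Mean expression profile of the genes whose indices are in `members`."""
    n = len(members)
    length = len(profiles[members[0]])
    return [sum(profiles[i][t] for i in members) / n for t in range(length)]


def greedy_cluster(profiles, n_clusters):
    """Greedy agglomerative search (illustrative stand-in score).

    Start with one cluster per gene, then repeatedly merge the two clusters
    whose mean profiles are closest, until `n_clusters` remain. Only one
    merge path is followed, so most partitions are never examined.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the most similar mean profiles.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: sq_dist(
                centroid(profiles, clusters[pair[0]]),
                centroid(profiles, clusters[pair[1]]),
            ),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Invented toy data: four short "expression profiles" at four time points.
toy = [
    [0.0, 1.0, 0.0, -1.0],
    [0.1, 0.9, 0.0, -1.1],
    [1.0, 0.0, -1.0, 0.0],
    [0.9, 0.1, -1.0, 0.1],
]
result = greedy_cluster(toy, 2)
print(result)
```

On this toy input the two pairs of near-identical profiles end up in separate clusters. Because each step commits to a single best merge, the sketch follows one path through the partition space, which is the limitation the proposed multi-step method is meant to address.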