First CLADAG data mining prize: data mining for longitudinal data with different marketing campaigns
The CLAssification and Data Analysis Group (CLADAG) of the Italian
Statistical Society recently organised a competition, the 'Young Researcher Data
Mining Prize' sponsored by the SAS Institute. This paper was the winning entry
and in it we detail our approach to the problem proposed and our results. The main
methods used are linear regression, mixture models, Bayesian autoregressive and
Bayesian dynamic models.
Orchestrated transcription of biological processes in the marine picoeukaryote Ostreococcus exposed to light/dark cycles
Background: Picoeukaryotes represent an important, yet poorly characterized component of marine phytoplankton. The recent genome availability for two species of Ostreococcus and Micromonas has led to the emergence of picophytoplankton comparative genomics. Sequencing has revealed many unexpected features about genome structure and led to several hypotheses on Ostreococcus biology and physiology. Despite the accumulation of genomic data, little is known about gene expression in eukaryotic picophytoplankton.
Results: We have conducted a genome-wide analysis of gene expression in Ostreococcus tauri cells exposed to light/dark cycles (L/D). A Bayesian Fourier Clustering method was implemented to cluster rhythmic genes according to their expression waveform. In a single L/D condition nearly all expressed genes displayed rhythmic patterns of expression. Clusters of genes were associated with the main biological processes such as transcription in the nucleus and the organelles, photosynthesis, DNA replication and mitosis.
Conclusions: Light/Dark time-dependent transcription of the genes involved in the main steps leading to protein synthesis (transcription basic machinery, ribosome biogenesis, translation and amino acid synthesis) was observed, to an unprecedented extent in eukaryotes, suggesting a major input of transcriptional regulation in Ostreococcus. We propose that the diurnal co-regulation of genes involved in photoprotection, defence against oxidative stress and DNA repair might be an efficient mechanism that protects cells against photo-damage, thereby contributing to the ability of O. tauri to grow under a wide range of light intensities.
Bayesian clustering of curves and the search of the partition space
This thesis is concerned with the study of a Bayesian clustering algorithm, proposed by Heard et al. (2006), used successfully for microarray experiments over time. It focuses not only on the development of new ways of setting hyperparameters so that inferences both reflect the scientific needs and contribute to the inferential stability of the search, but also on the design of new fast algorithms for the search over the partition space. First we use the explicit forms of the associated Bayes factors to demonstrate that such methods can be unstable under common settings of the associated hyperparameters. We then prove that the regions of instability can be removed by setting the hyperparameters in an unconventional way. Moreover, we demonstrate that MAP (maximum a posteriori) search is satisfied when a utility function is defined according to the scientific interest of the clusters. We then focus on the search over the partition space. In model-based clustering a comprehensive search for the highest scoring partition is usually impossible, due to the huge number of partitions of even a moderately sized dataset. We propose two methods for the partition search. One method encodes the clustering as a weighted MAX-SAT problem, while the other views clusterings as elements of the lattice of partitions. Finally, this thesis includes the full analysis of two microarray experiments for identifying circadian genes
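As an illustration of why an exhaustive search over partitions is usually impossible: the number of ways to partition n items is the Bell number B_n, which grows faster than exponentially. The sketch below is not taken from the thesis; the function name `bell_numbers` is a hypothetical helper that computes B_0..B_n with the standard Bell triangle recurrence.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], the number of set partitions of 0..n_max items.

    Illustrative helper (not from the thesis), based on the Bell triangle:
    each row starts with the last entry of the previous row, and every
    further entry is the sum of its left neighbour and the entry above it.
    """
    bells = [1]   # B_0 = 1: the empty set has exactly one partition
    row = [1]     # first row of the Bell triangle
    for _ in range(n_max):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


if __name__ == "__main__":
    counts = bell_numbers(20)
    for n in (5, 10, 20):
        print(f"{n} items: {counts[n]} possible partitions")
```

Even 20 curves already admit more than 5 x 10^13 partitions, so scoring every candidate clustering is infeasible for realistic microarray datasets, which is why guided search strategies over the partition space are needed.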
Beyond Conjugacy for Chain Event Graph Model Selection
Chain event graphs are a family of probabilistic graphical models that
generalise Bayesian networks and have been successfully applied to a wide range
of domains. Unlike Bayesian networks, these models can encode context-specific
conditional independencies as well as asymmetric developments within the
evolution of a process. More recently, new model classes belonging to the chain
event graph family have been developed for modelling time-to-event data to
study the temporal dynamics of a process. However, existing model selection
algorithms for chain event graphs and their variants rely on all parameters
having conjugate priors. This is unrealistic for many real-world applications.
In this paper, we propose a mixture modelling approach to model selection in
chain event graphs that does not rely on conjugacy. Moreover, we also show that
this methodology is more amenable to being robustly scaled than the existing
model selection algorithms used for this family. We demonstrate our techniques
on simulated datasets.
Variance matrix priors for Dirichlet process mixture models with Gaussian kernels
Funding: The first author would like to acknowledge the support of the School of Mathematics and Statistics, as well as CREEM, at the University of St Andrews, and the University of St Andrews St Leonard’s 7th Century Scholarship. Bayesian mixture modelling is widely used for density estimation and clustering. The Dirichlet process mixture model (DPMM) is the most popular Bayesian non-parametric mixture modelling approach. In this manuscript, we study the choice of prior for the variance or precision matrix when Gaussian kernels are adopted. Typically, in the relevant literature, the assessment of mixture models is done by considering observations in a space of only a handful of dimensions. Instead, we are concerned with more realistic problems of higher dimensionality, in a space of up to 20 dimensions. We observe that the choice of prior is increasingly important as the dimensionality of the problem increases. After identifying certain undesirable properties of standard priors in problems of higher dimensionality, we review and implement possible alternative priors. The most promising priors are identified, as well as other factors that affect the convergence of MCMC samplers. Our results show that the choice of prior is critical for deriving reliable posterior inferences. This manuscript offers a thorough overview and comparative investigation into possible priors, with detailed guidelines for their implementation. Although our work focuses on the use of the DPMM in clustering, it is also applicable to density estimation.
Circadian clock components control daily growth activities by modulating cytokinin levels and cell division-associated gene expression in Populus trees
Trees are carbon dioxide sinks and major producers of terrestrial biomass with distinct seasonal growth patterns. Circadian clocks enable the coordination of physiological and biochemical temporal activities, optimally regulating multiple traits including growth. To dissect the clock's role in growth, we analysed Populus tremula x P. tremuloides trees with impaired clock function due to down-regulation of central clock components. late elongated hypocotyl (lhy-10) trees, in which expression of LHY1 and LHY2 is reduced by RNAi, have a short free-running period and show disrupted temporal regulation of gene expression and reduced growth, producing 30-40% less biomass than wild-type trees. Genes important in growth regulation were expressed with an earlier phase in lhy-10, and CYCLIN D3 expression was misaligned and arrhythmic. Levels of cytokinins were lower in lhy-10 trees, which also showed a change in the time of peak expression of genes associated with cell division and growth. However, auxin levels were not altered in lhy-10 trees, and the size of the lignification zone in the stem showed a relative increase. The reduced growth rate and anatomical features of lhy-10 trees were mainly caused by misregulation of cell division, which may have resulted from impaired clock function
A semi-parametric approach to estimate risk functions associated with multi-dimensional exposure profiles: application to smoking and lung cancer
A common characteristic of environmental epidemiology is the multi-dimensional aspect of exposure patterns, frequently reduced to a cumulative exposure for simplicity of analysis. By adopting a flexible Bayesian clustering approach, we explore the risk function linking exposure history to disease. This approach is applied here to study the relationship between different smoking characteristics and lung cancer in the framework of a population-based case-control study.
Bayesian Graphs of Intelligent Causation
Probabilistic Graphical Bayesian models of causation have continued to impact
on strategic analyses designed to help evaluate the efficacy of different
interventions on systems. However, the standard causal algebras upon which
these inferences are based typically assume that the intervened population does
not react intelligently to frustrate an intervention. In an adversarial setting
this is rarely an appropriate assumption. In this paper, we extend an
established Bayesian methodology called Adversarial Risk Analysis to apply it
to settings that can legitimately be designated as causal in this graphical
sense. To embed this technology we first need to generalize the concept of a
causal graph. We then proceed to demonstrate how the predictable intelligent
reactions of adversaries to circumvent an intervention when they hear about it
can be systematically modelled within such graphical frameworks, importing
these recent developments from Bayesian game theory. The new methodologies and
supporting protocols are illustrated through applications associated with an
adversary attempting to infiltrate a friendly state.
