11 research outputs found

    Nonparametric clustering for spatio-temporal data

    Clustering algorithms attempt the identification of distinct subgroups within heterogeneous data and are commonly utilised as an exploratory tool. The definition of a cluster is dependent on the relevant dataset and associated constraints; clustering methods seek to determine homogeneous subgroups that each correspond to a distinct set of characteristics. This thesis focuses on the development of spatial clustering algorithms and the methods are motivated by the complexities posed by spatio-temporal data. The examples in this thesis primarily come from spatial structures described in the context of traffic modelling and are based on occupancy observations recorded over time for an urban road network. Levels of occupancy indicate the extent of traffic congestion and the goal is to identify distinct regions of traffic congestion in the urban road network. Spatial clustering for spatio-temporal data is an increasingly important research problem and the challenges posed by such research problems often demand the development of bespoke clustering methods. Many existing clustering algorithms, with a focus on accommodating the underlying spatial structure, do not generate clusters that adequately represent differences in the temporal pattern across the network. This thesis is primarily concerned with developing nonparametric clustering algorithms that seek to identify spatially contiguous clusters and retain underlying temporal patterns. Broadly, this thesis introduces two clustering algorithms that are capable of accommodating spatial and temporal dependencies that are inherent to the dataset. The first is a functional distributional clustering algorithm that is implemented within an agglomerative hierarchical clustering framework as a two-stage process. The method is based on a measure of distance that utilises estimated cumulative distribution functions over the data and this unique distance is both functional and distributional. 
This notion of distance utilises the differences in densities to identify distinct clusters in the graph, rather than raw recorded observations. However, distinct characteristics may not necessarily be identifiable or distinguishable using a density-based distance measure, as defined within the agglomerative hierarchical clustering framework. In this thesis, we also introduce a formal Bayesian clustering approach that enables the researcher to determine spatially contiguous clusters in a data-driven manner. This framework departs from the assumptions made by the functional distributional clustering algorithm. This flexible Bayesian model employs a binary dependent Chinese restaurant process (binDCRP) to place a prior over the geographical constraints posed by a graph-based network. The binDCRP is a special case of the distance dependent Chinese restaurant process that was first introduced by Blei and Frazier (2011); the binDCRP is modified to account for data that poses spatial constraints. The binDCRP seeks to cluster data such that adjacent or neighbouring regions in a spatial structure are more likely to belong to the same cluster. The binDCRP introduces a large number of singletons within the spatial structure, so we modify it to enable the researcher to restrict the number of clusters in the graph. It is also reasonable to assume that individual junctions within a cluster are spatially correlated with adjacent junctions, due to the nature of traffic and the spread of congestion. In order to fully account for spatial correlation within a cluster structure, the model utilises a type of conditional auto-regressive (CAR) model. The model also accounts for temporal dependencies using a first-order auto-regressive (AR(1)) model.
In this mean-based flexible Bayesian model, the data is assumed to follow a Gaussian distribution and we utilise Kronecker product identities within the definition of the spatio-temporal precision matrix to improve computational efficiency. The model utilises a Metropolis-within-Gibbs sampler to fully explore all possible partition structures within the network and infer the relevant parameters of the spatio-temporal precision matrix. The flexible Bayesian method is also applicable to map-based spatial structures and we describe the model in this context as well. The developed Bayesian model is applied to a simulated spatio-temporal dataset that is composed of three distinct known clusters. The differences in the clusters are reflected by distinct mean values over time associated with spatial regions. The nature of this mean-based comparison differs from the functional distributional clustering approach that seeks to identify differences across the distribution. We demonstrate the ability of the Bayesian model to restrict the number of clusters using a simulated data structure with distinctly defined clusters. The sampler is also able to explore potential cluster structures in an efficient manner and this is demonstrated using a simulated spatio-temporal data structure. The performance of this model is illustrated by an application to a dataset over an urban road network that presents traffic as a process varying continuously across space and time. We also apply this model to an areal unit dataset composed of property prices over a period of time for the county of Avon in England.
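
The role of the Kronecker product identities can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a separable precision matrix of the form Q_time ⊗ Q_space, with a standard AR(1) precision for time and a Leroux-type CAR precision for space. The abstract does not name the CAR variant or any parameter values, so `leroux_car_precision`, `phi`, `rho` and the four-junction path graph are all hypothetical choices made for illustration.

```python
import numpy as np


def ar1_precision(n_times, phi):
    """Tridiagonal precision of a stationary AR(1) process
    with unit innovation variance (requires n_times >= 2)."""
    q = np.zeros((n_times, n_times))
    idx = np.arange(n_times)
    q[idx, idx] = 1.0 + phi**2
    q[0, 0] = q[-1, -1] = 1.0
    off = np.arange(n_times - 1)
    q[off, off + 1] = q[off + 1, off] = -phi
    return q


def leroux_car_precision(adjacency, rho):
    """Assumed CAR variant: rho * (D - W) + (1 - rho) * I,
    positive definite for 0 <= rho < 1."""
    w = np.asarray(adjacency, dtype=float)
    d = np.diag(w.sum(axis=1))
    return rho * (d - w) + (1.0 - rho) * np.eye(w.shape[0])


# Hypothetical network: four junctions joined in a line.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])

q_time = ar1_precision(5, phi=0.6)
q_space = leroux_car_precision(adjacency, rho=0.8)
n_t, n_s = q_time.shape[0], q_space.shape[0]

# The full (n_t * n_s) x (n_t * n_s) precision, built here only to
# check the identities; the point is that it is never needed.
q_full = np.kron(q_time, q_space)

rng = np.random.default_rng(0)
x = rng.normal(size=(n_t, n_s))  # rows: time points, columns: junctions

# Quadratic form x' Q x, computed with and without the full matrix.
quad_full = x.ravel() @ q_full @ x.ravel()
quad_fast = np.sum((q_time @ x @ q_space) * x)

# log det(A kron B) = n_B * log det(A) + n_A * log det(B).
logdet_full = np.linalg.slogdet(q_full)[1]
logdet_fast = (n_s * np.linalg.slogdet(q_time)[1]
               + n_t * np.linalg.slogdet(q_space)[1])
```

Under these assumptions, both terms of a Gaussian log-density can be evaluated from the two small factors alone, which is the kind of saving the abstract attributes to the Kronecker structure.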

    Decision trees in epidemiological research

    Background: In many studies, it is of interest to identify population subgroups that are relatively homogeneous with respect to an outcome. The nature of these subgroups can provide insight into effect mechanisms and suggest targets for tailored interventions. However, identifying relevant subgroups can be challenging with standard statistical methods. Main text: We review the literature on decision trees, a family of techniques for partitioning the population, on the basis of covariates, into distinct subgroups that share similar values of an outcome variable. We compare two decision tree methods, the popular Classification and Regression Tree (CART) technique and the newer Conditional Inference Tree (CTree) technique, assessing their performance in a simulation study and using data from the Box Lunch Study, a randomized controlled trial of a portion size intervention. Both CART and CTree identify homogeneous population subgroups and offer improved prediction accuracy relative to regression-based approaches when subgroups are truly present in the data. An important distinction between CART and CTree is that the latter uses a formal statistical hypothesis testing framework in building decision trees, which simplifies the process of identifying and interpreting the final tree model. We also introduce a novel graphical visualization that provides a more scientifically meaningful characterization of the subgroups identified by decision trees. Conclusions: Decision trees are a useful tool for identifying homogeneous subgroups defined by combinations of individual characteristics. While all decision tree techniques generate subgroups, we advocate the use of the newer CTree technique due to its simplicity and ease of interpretation.
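
The partitioning step that CART performs can be sketched for a single covariate. This is a simplified illustration rather than the procedure used in the paper: it searches for the one threshold that minimises the within-subgroup sum of squared errors, and the names `sse` and `best_cart_split` and the toy data are hypothetical. CTree differs in that it chooses splits through statistical hypothesis tests, which is not shown here.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_cart_split(x, y):
    """Return (threshold, total_sse) for the binary split x <= threshold
    that minimises the summed SSE of the two resulting subgroups.
    The threshold is None if no split improves on the unsplit data."""
    pairs = sorted(zip(x, y))
    best = (None, sse(list(y)))
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical covariate values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [v for _, v in pairs[:k]]
        right = [v for _, v in pairs[k:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Made-up toy data: the outcome jumps between covariate values 3 and 4.
covariate = [1, 2, 3, 4, 5, 6]
outcome = [10, 11, 9, 30, 31, 29]
threshold, total_sse = best_cart_split(covariate, outcome)
```

A full tree applies the same search recursively to each subgroup and over every covariate, then prunes; the sketch covers only a single split.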

    Distance Dependent Chinese restaurant process for Spatio-Temporal Clustering of Urban Traffic Networks

    A novel Bayesian clustering method is presented for spatio-temporal data observed on a network. This method employs a distance dependent Chinese restaurant process (DDCRP) to incorporate the geographic constraints of the network. Unlike the traditional Chinese restaurant process, DDCRPs typically accommodate non-exchangeable distributions as a prior over partitions. In addition, the DDCRP does not exhibit the marginal invariance property, so one can capture the extent of the influence transferred from one node in the network to the next. We do not expect the DDCRP to fully capture the dependency structure of the data and thus a conditional auto-regressive (CAR) model is used to model the spatial dependency within a cluster. We will discuss different strategies for incorporating temporal dependency into a CAR-type model. Inference is carried out using a Metropolis-within-Gibbs sampler and we apply the model to cluster an urban traffic network, using occupancy data recorded at the junction level.
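
How a DDCRP prior turns network constraints into a partition can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' model: it uses a binary decay (weight 1 for adjacent nodes, 0 otherwise), which this abstract does not specify, and it draws from the prior only, with no likelihood and no Metropolis-within-Gibbs step. The function name, the value of `alpha` and the six-node path graph are hypothetical.

```python
import random


def sample_ddcrp_partition(neighbours, alpha, rng):
    """Draw one partition from a DDCRP prior with a binary decay.

    Each node links to itself with weight alpha or to one of its graph
    neighbours with weight 1; clusters are the connected components of
    the resulting link graph."""
    n = len(neighbours)
    links = []
    for i in range(n):
        candidates = [i] + list(neighbours[i])
        weights = [alpha] + [1.0] * len(neighbours[i])
        links.append(rng.choices(candidates, weights=weights, k=1)[0])

    # Union-find over the sampled links to recover the components.
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j in enumerate(links):
        parent[find(i)] = find(j)

    roots = {}
    labels = [roots.setdefault(find(i), len(roots)) for i in range(n)]
    return links, labels


# Hypothetical network: six junctions joined in a line.
path_neighbours = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
links, labels = sample_ddcrp_partition(
    path_neighbours, alpha=1.0, rng=random.Random(1)
)
```

Because links can only follow edges of the network, every sampled cluster is spatially contiguous by construction; larger values of `alpha` favour self-links and therefore more, smaller clusters.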

    Spatio-Temporal Clustering of Urban Traffic Networks

    Heterogeneous urban traffic networks with regions of varying congestion levels have unique fundamental properties that require tailor-made clustering algorithms. We propose a novel Bayesian clustering technique for spatio-temporal network data that combines a distance-dependent Chinese restaurant process (DDCRP) with a spatio-temporal conditional auto-regressive (CAR) model. Our method employs a modified version of the DDCRP to incorporate the geographic constraints of the network and determine the shape and number of clusters. We do not expect the DDCRP to fully capture the dependency structure of the data and thus a CAR model is used to account for the spatial dependency within a cluster. This method is able to identify spatially contiguous clusters that also incorporate changes in levels of occupancy over time. Inference is carried out using a Metropolis-within-Gibbs sampler and we apply this clustering model to an urban traffic network, using occupancy data aggregated at the junction level.

    Spatio-Temporal Clustering of Traffic Networks

    We present a novel Bayesian clustering method for spatio-temporal data observed on a network and apply this model to cluster an urban traffic network. This method employs a distance dependent Chinese restaurant process (DDCRP) to provide a cluster structure by incorporating the observed data and the geographic constraints of the network. However, in order to fully capture the dependency structure of the data, a conditional auto-regressive (CAR) model is used to model the spatial dependency within each cluster.

    MOESM1 of Decision trees in epidemiological research

    Additional file 1. Regression tree representing the relationship between adjusted residuals for energy intake (adjusted for age, sex, and BMI) and 22 baseline covariates.