
    Clustering life trajectories: A new divisive hierarchical clustering algorithm for discrete-valued discrete time series

    A new algorithm for clustering life course trajectories is presented and tested on large register data. Life courses are represented as monthly sequences over the working life, covering ages 16 to 65. A meaningful clustering of this kind of data yields subgroups with similar life course trajectories. The high sampling rate allows precise discrimination between subgroups, but it also produces a large amount of highly correlated data for phases with low variability. The main challenge is to select the variables (points in time) that carry most of the relevant information. The new algorithm addresses this problem by simultaneously clustering and identifying critical junctures for each of the relevant subgroups. The divisive algorithm can handle large amounts of multidimensional data within reasonable time. This is demonstrated on data from the German federal pension insurance.
    Keywords: clustering, measures of association, discrete data, time series
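The idea of splitting a cluster at an informative point in time can be illustrated with a small toy sketch. This is not the algorithm from the paper: the choice of Shannon entropy as the selection criterion, the function names `entropy`, `split_once`, and `divisive_cluster`, and the toy state codes are all assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def split_once(sequences):
    """Split a cluster at the time point with the highest state entropy.

    Assumption: entropy stands in for "carries most information".
    Returns (time_index, {state: [sequences...]}).
    """
    length = len(sequences[0])
    best_t = max(range(length), key=lambda t: entropy([s[t] for s in sequences]))
    groups = {}
    for s in sequences:
        groups.setdefault(s[best_t], []).append(s)
    return best_t, groups


def divisive_cluster(sequences, max_depth=2, min_size=2):
    """Recursively split; returns a list of (junctures, members) leaves."""
    def recurse(members, junctures, depth):
        if depth == max_depth or len(members) < min_size:
            return [(junctures, members)]
        t, groups = split_once(members)
        if len(groups) < 2:
            # No variability left at any time point: stop splitting.
            return [(junctures, members)]
        leaves = []
        for state, group in sorted(groups.items()):
            leaves.extend(recurse(group, junctures + [(t, state)], depth + 1))
        return leaves
    return recurse(list(sequences), [], 0)


# Hypothetical toy data: E = education, W = work, U = unemployed
toy = ["EEWWWW", "EEWWWW", "EEEWWW", "EEUUWW", "EEUUUU", "EEUUUU"]
leaves = divisive_cluster(toy, max_depth=1)
```

Each leaf pairs a list of (time index, state) junctures with the sequences that passed through them, so the split points double as a description of the subgroup. A realistic implementation would need a criterion that accounts for the strong correlation between neighbouring months, which this sketch ignores.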

    The supervised hierarchical Dirichlet process

    We propose the supervised hierarchical Dirichlet process (sHDP), a nonparametric generative model for the joint distribution of a group of observations and a response variable directly associated with that whole group. We compare the sHDP with another leading method for regression on grouped data, the supervised latent Dirichlet allocation (sLDA) model. We evaluate our method on two real-world classification problems and two real-world regression problems. Bayesian nonparametric regression models based on the Dirichlet process, such as the Dirichlet process-generalised linear models (DP-GLM), have previously been explored; these models allow flexibility in modelling nonlinear relationships. However, until now, hierarchical Dirichlet process (HDP) mixtures have not seen significant use in supervised problems with grouped data, since a straightforward application of the HDP to the grouped data results in learnt clusters that are not predictive of the responses. The sHDP solves this problem by allowing clusters to be learnt jointly from the group structure and from the label assigned to each group.
    Comment: 14 pages

    Analysing the relationship between ectomycorrhizal infection and forest decline using marginal models

    This statistical study originates from the problem of determining what relationship exists between root ectomycorrhizal infection and the health status of forest plants. The sampling scheme takes observations from roots in sectors around each tree, which results in a hierarchical association structure among the observations. Marginal regression models are used to analyze the mean effect of the ectomycorrhizal state on a response variable that serves as a proxy for the health of the plants.