    Labor Market Entry and Earnings Dynamics: Bayesian Inference Using Mixtures-of-Experts Markov Chain Clustering

    This paper analyzes patterns in the earnings development of young labor market entrants over their life cycle. We identify four distinctly different types of transition patterns between discrete earnings states in a large administrative data set. Further, we investigate the effects of labor market conditions at the time of entry on the probability of belonging to each transition type. To estimate our statistical model we use a model-based clustering approach. The statistical challenge in our application comes from the difficulty of extending distance-based clustering approaches to the problem of identifying groups of similar time series in a panel of discrete-valued time series. We use Markov chain clustering, proposed by Pamminger and Frühwirth-Schnatter (2010), which is an approach for clustering discrete-valued time series obtained by observing a categorical variable with several states. This method is based on finite mixtures of first-order time-homogeneous Markov chain models. In order to analyze group membership we present an extension to this approach by formulating a probabilistic model for the latent group indicators within the Bayesian classification rule using a multinomial logit model.
    Keywords: Labor Market Entry Conditions, Transition Data, Markov Chain Monte Carlo, Multinomial Logit, Panel Data, Auxiliary Mixture Sampler, Bayesian Statistics
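
    A minimal sketch of the core idea, clustering discrete-valued sequences with a finite mixture of first-order Markov chains. The paper estimates the model with Bayesian MCMC (including an auxiliary mixture sampler for the multinomial logit part); this illustration substitutes a simple EM algorithm, so all function names and defaults below are assumptions rather than the authors' implementation:

```python
import numpy as np

def mc_loglik(seq, logP):
    """Log-likelihood of one discrete sequence under log transition matrix logP."""
    return sum(logP[a, b] for a, b in zip(seq[:-1], seq[1:]))

def em_markov_mixture(seqs, n_states, n_groups, n_iter=100, seed=0):
    """EM for a finite mixture of first-order time-homogeneous Markov chains."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=(n_groups, n_states))  # one transition matrix per group
    w = np.full(n_groups, 1.0 / n_groups)                            # mixture weights
    for _ in range(n_iter):
        # E-step: posterior group-membership probabilities for each sequence
        logp = np.array([[np.log(w[k]) + mc_loglik(s, np.log(P[k]))
                          for k in range(n_groups)] for s in seqs])
        logp -= logp.max(axis=1, keepdims=True)
        tau = np.exp(logp)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: update weights and smoothed, tau-weighted transition counts
        w = tau.mean(axis=0)
        C = np.zeros((n_groups, n_states, n_states))
        for i, s in enumerate(seqs):
            for a, b in zip(s[:-1], s[1:]):
                C[:, a, b] += tau[i]
        P = (C + 1e-6) / (C + 1e-6).sum(axis=2, keepdims=True)
    return w, P, tau
```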

    Model-based clustering using copulas with applications

    The majority of model-based clustering techniques are based on multivariate normal models and their variants. In this paper copulas are used for the construction of flexible families of models for clustering applications. The use of copulas in model-based clustering offers two direct advantages over current methods: (i) the appropriate choice of copulas provides the ability to obtain a range of exotic shapes for the clusters, and (ii) the explicit choice of marginal distributions for the clusters allows the modelling of multivariate data of various modes (either discrete or continuous) in a natural way. This paper introduces and studies the framework of copula-based finite mixture models for clustering applications. Estimation in the general case can be performed using standard EM, and, depending on the mode of the data, more efficient procedures are provided that can fully exploit the copula structure. The closure properties of the mixture models under marginalization are discussed, and for continuous, real-valued data parametric rotations in the sample space are introduced, with a parallel discussion on parameter identifiability depending on the choice of copulas for the components. The exposition of the methodology is accompanied and motivated by the analysis of real and artificial data.
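
    To make the construction concrete, here is a small sketch of a two-component copula-based mixture density via Sklar's theorem, f(x) = c(F_1(x_1), ..., F_d(x_d)) * prod_i f_i(x_i), using a Gaussian copula purely as an example; the paper covers more general copula families and their estimation, so the specific marginals, correlations, and weights below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm, gamma, multivariate_normal

def gaussian_copula_density(u, R):
    """Density of a Gaussian copula with correlation matrix R at u in (0,1)^d."""
    z = norm.ppf(u)
    num = multivariate_normal(mean=np.zeros(len(R)), cov=R).pdf(z)
    den = np.prod(norm.pdf(z))
    return num / den

def component_density(x, marginals, R):
    """Joint density via Sklar's theorem: copula density times marginal densities."""
    u = np.array([m.cdf(xi) for m, xi in zip(marginals, x)])
    return gaussian_copula_density(u, R) * np.prod([m.pdf(xi) for m, xi in zip(marginals, x)])

# A two-component mixture with different marginals and dependence per cluster
marg1 = [norm(0, 1), gamma(2, scale=1)]
marg2 = [norm(3, 0.5), gamma(5, scale=0.5)]
R1 = np.array([[1.0, 0.6], [0.6, 1.0]])
R2 = np.array([[1.0, -0.3], [-0.3, 1.0]])
weights = [0.4, 0.6]

def mixture_density(x):
    return (weights[0] * component_density(x, marg1, R1)
            + weights[1] * component_density(x, marg2, R2))

print(mixture_density(np.array([0.5, 1.2])))
```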

    Model-based clustering of large networks

    We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. The more flexible framework is achieved by introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms are based on variational generalized EM algorithms, where the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates are based on an efficient Monte Carlo network simulation idea. Finally, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.
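
    As a toy analogue of the variational estimation described here, the sketch below gives mean-field variational E- and M-steps for a plain Bernoulli stochastic block model on a small dense adjacency matrix. It is not the paper's parameterization, MM-augmented E-step, or large-scale implementation, only the basic update structure under assumed names:

```python
import numpy as np

def sbm_vem_estep(A, tau, B, pi, n_inner=10):
    """Mean-field updates for node-to-block memberships tau under a Bernoulli SBM.
    A: (n, n) zero-diagonal adjacency matrix, B: (K, K) block edge probabilities,
    pi: (K,) block weights, tau: (n, K) current soft memberships."""
    n, _ = tau.shape
    logB, log1mB = np.log(B), np.log(1 - B)
    for _ in range(n_inner):
        # log tau_ik  ~  log pi_k + sum_{j,l} tau_jl [A_ij log B_kl + (1 - A_ij) log(1 - B_kl)]
        S = A @ tau @ logB.T + (1 - A - np.eye(n)) @ tau @ log1mB.T
        logtau = np.log(pi) + S
        logtau -= logtau.max(axis=1, keepdims=True)
        tau = np.exp(logtau)
        tau /= tau.sum(axis=1, keepdims=True)
    return tau

def sbm_vem_mstep(A, tau):
    """Closed-form M-step: block weights and edge-probability matrix."""
    pi = tau.mean(axis=0)
    num = tau.T @ A @ tau
    den = tau.T @ (1 - np.eye(len(A))) @ tau
    B = np.clip(num / den, 1e-6, 1 - 1e-6)  # keep logs finite in the E-step
    return B, pi
```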

    Maximum Likelihood Estimation of Discrete Log-Concave Distributions with Applications

    Shape-constrained methods specify a class of distributions instead of a single parametric family. This approach increases the robustness of the estimation without much loss of efficiency. Among shape constraints, log-concavity is appealing in distribution modeling because it falls within the popular class of unimodal constraints and many parametric models are log-concave. This is, therefore, the focus of our work. First, we propose a maximum likelihood estimator of discrete log-concave distributions in higher dimensions. We define a new class of log-concave distributions in multidimensional spaces and study its properties. We show how to compute the maximum likelihood estimator from an independent and identically distributed sample, and establish consistency of the estimator even if the class has been incorrectly specified. For finite sample sizes, the proposed estimator outperforms a purely nonparametric approach (the empirical distribution) while remaining comparable to the correct parametric approach. Furthermore, the new class has a natural relationship with log-concave densities when data have been grouped or discretized; we show how this property can be used in a real data example. Second, we apply the one-dimensional discrete log-concave maximum likelihood estimator to a clustering problem, focusing mainly on categorical nominal data. We develop a log-concave mixture model based on the discrete log-concave maximum likelihood estimator and use it in a clustering algorithm, which we compare with two other clustering methods. The comparison shows that the proposed algorithm performs well.
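
    In one dimension, discrete log-concavity of a pmf p is the condition p_k^2 >= p_{k-1} * p_{k+1} on a contiguous support. A minimal checker for that textbook one-dimensional condition (the multivariate class proposed in the work is more involved):

```python
import numpy as np
from scipy.stats import poisson

def is_logconcave_pmf(p, tol=1e-12):
    """Check p_k^2 >= p_{k-1} * p_{k+1} and that the support is contiguous."""
    pos = np.flatnonzero(p > 0)
    if pos.size and not np.all(np.diff(pos) == 1):
        return False  # support has a gap, so p cannot be log-concave
    return bool(np.all(p[1:-1] ** 2 >= p[:-2] * p[2:] - tol))

p_pois = poisson.pmf(np.arange(30), mu=4.0)   # Poisson pmfs are log-concave
print(is_logconcave_pmf(p_pois))              # True

p_bimodal = np.array([0.3, 0.1, 0.3, 0.3])    # fails: 0.1**2 < 0.3 * 0.3
print(is_logconcave_pmf(p_bimodal))           # False
```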

    Supervised Classification Using Finite Mixture Copula

    The use of copulas for statistical classification is recent and gaining popularity. For example, copula-based statistical classification has been proposed for automatic character recognition, medical diagnostics, and most recently in data mining. Classical discrimination rules assume normality, but in the current data-rich era this assumption is often questionable; in fact, the features of a data set may be a mixture of discrete and continuous random variables. In this paper, mixture copula densities are used to model class-conditional distributions. Such densities are useful when the marginal densities of the feature vector are not normally distributed and the variables are of mixed type. Authors have shown that such mixture models are very useful for uncovering hidden structures in data and have used them for clustering in data mining. Under such mixture models, standard maximum likelihood estimation methods are not suitable, and the regular expectation-maximization algorithm is inefficient and may not converge. A new estimation method is proposed to estimate such densities and to build a classifier based on finite mixtures of Gaussian densities. Simulations are used to compare the performance of the copula-based classifier with classical normal-distribution-based models, a logistic-regression-based model, and an independence model. The method is also applied to real data.
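
    Once class-conditional densities have been estimated, the classification step itself is a plug-in Bayes rule. The sketch below assumes the densities (for example, fitted copula mixture densities such as the mixture_density sketch above) are already given as callables; the names and priors are hypothetical:

```python
import numpy as np

def make_bayes_classifier(class_densities, priors):
    """Plug-in Bayes rule: assign x to argmax_c prior_c * f_c(x).
    class_densities: list of callables f_c(x); priors: class prior probabilities."""
    priors = np.asarray(priors)
    def classify(x):
        scores = np.array([f(x) for f in class_densities]) * priors
        return int(np.argmax(scores))
    return classify

# Hypothetical usage with two previously fitted class-conditional densities f0, f1:
# clf = make_bayes_classifier([f0, f1], priors=[0.5, 0.5])
# label = clf(np.array([0.5, 1.2]))
```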

    From here to infinity - sparse finite versus Dirichlet process mixtures in model-based clustering

    In model-based clustering, mixture models are used to group data points into clusters. A useful concept, introduced for Gaussian mixtures by Malsiner Walli et al. (2016), is the sparse finite mixture, where the prior on the weight distribution of a mixture with K components is chosen in such a way that, a priori, the number of clusters in the data is random and is allowed to be smaller than K with high probability. The number of clusters is then inferred a posteriori from the data. The present paper makes the following contributions in the context of sparse finite mixture modelling. First, it is illustrated that the concept of sparse finite mixtures is very generic and easily extended to cluster various types of non-Gaussian data, in particular discrete data and continuous multivariate data arising from non-Gaussian clusters. Second, sparse finite mixtures are compared to Dirichlet process mixtures with respect to their ability to identify the number of clusters. For both model classes, a random hyperprior is considered for the parameters determining the weight distribution. By suitably matching these priors, it is shown that the choice of this hyperprior is far more influential on the cluster solution than whether a sparse finite mixture or a Dirichlet process mixture is used.
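
    The sparsity mechanism can be illustrated with a short simulation, assuming a symmetric Dirichlet prior Dir(e0, ..., e0) on the weights of a mixture with K components: as e0 shrinks, far fewer than K clusters are occupied a priori. The function below is an illustrative sketch, not code from the paper:

```python
import numpy as np

def expected_occupied_clusters(K, e0, n, n_sims=2000, seed=0):
    """Monte Carlo estimate of the prior expected number of non-empty clusters
    when weights ~ Dirichlet(e0, ..., e0) and n observations are allocated."""
    rng = np.random.default_rng(seed)
    occupied = []
    for _ in range(n_sims):
        w = rng.dirichlet(np.full(K, e0))      # weights concentrate as e0 -> 0
        z = rng.choice(K, size=n, p=w)         # cluster allocations
        occupied.append(len(np.unique(z)))
    return np.mean(occupied)

for e0 in (4.0, 0.5, 0.01):
    print(e0, expected_occupied_clusters(K=10, e0=e0, n=100))
# Smaller e0 concentrates the weight vector, so far fewer than K = 10 clusters
# are occupied a priori -- the essence of the sparse finite mixture prior.
```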

    Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs

    Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provably optimal recovery using the algorithm is shown analytically for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed.
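
    A simplified pipeline in the same spirit, embedding nodes with the low eigenvectors of the normalized graph Laplacian and then fitting a finite Gaussian mixture for fuzzy memberships, can be sketched as follows. This is a generic fuzzy spectral clustering recipe under assumed defaults, not the paper's algorithm or its scalable implementation:

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh
from sklearn.mixture import GaussianMixture

def laplacian_mixture_memberships(A, n_clusters, seed=0):
    """Soft cluster memberships from a Laplacian eigenspace embedding.
    A: dense symmetric adjacency matrix (small graphs; a sketch only)."""
    L = laplacian(A, normed=True)
    # Eigenvectors for the n_clusters smallest eigenvalues span the embedding.
    _, vecs = eigh(L, subset_by_index=[0, n_clusters - 1])
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(vecs)
    return gmm.predict_proba(vecs)   # fuzzy membership matrix, rows sum to 1

# Toy graph: two 4-cliques joined by a single edge.
block = np.ones((4, 4)) - np.eye(4)
A = np.block([[block, np.zeros((4, 4))], [np.zeros((4, 4)), block]])
A[3, 4] = A[4, 3] = 1.0
print(laplacian_mixture_memberships(A, n_clusters=2).round(2))
```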

    A Tight Convex Upper Bound on the Likelihood of a Finite Mixture

    The likelihood function of a finite mixture model is a non-convex function with multiple local maxima, and commonly used iterative algorithms such as EM will converge to different solutions depending on initial conditions. In this paper we ask: is it possible to assess how far we are from the global maximum of the likelihood? Since the likelihood of a finite mixture model can grow unboundedly by centering a Gaussian on a single data point and shrinking the covariance, we constrain the problem by assuming that the parameters of the individual models are members of a large discrete set (e.g. estimating a mixture of two Gaussians where the means and variances of both Gaussians are members of a set of a million possible means and variances). For this setting we show that a simple upper bound on the likelihood can be computed using convex optimization, and we analyze conditions under which the bound is guaranteed to be tight. This bound can then be used to assess the quality of solutions found by EM (where the final result is projected onto the discrete set) or any other mixture estimation algorithm. For any dataset our method allows us to find a finite mixture model together with a dataset-specific bound on how far the likelihood of this mixture is from the global optimum of the likelihood.
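
    A simple relaxation in the same spirit, though not the paper's exact bound: with a fixed dictionary of K candidate components, the mixture log-likelihood is concave in the weight vector over the simplex, so its global maximum (with all K candidates available) can be computed reliably and upper-bounds any mixture restricted to a few components from the same dictionary. A sketch, with illustrative names:

```python
import numpy as np

def dictionary_mixture_bound(F, n_iter=500):
    """Upper bound on the best achievable mixture log-likelihood over a fixed
    dictionary of candidate components.
    F: (n, K) matrix with F[i, k] = f_k(x_i), the k-th candidate density at x_i.
    sum_i log((F w)_i) is concave in w on the simplex; EM-style weight updates
    converge to its global maximum, which bounds any mixture restricted to a
    subset of the same K candidates (zero weights on the unused ones)."""
    n, K = F.shape
    w = np.full(K, 1.0 / K)                       # start in the simplex interior
    for _ in range(n_iter):
        mix = F @ w                               # mixture density at each point
        w *= (F / mix[:, None]).mean(axis=0)      # EM update; stays on the simplex
    return np.sum(np.log(F @ w)), w
```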