58 research outputs found

    Novel semi-metrics for multivariate change point analysis and anomaly detection

    This paper proposes a new method for determining similarity and anomalies between time series, most practically effective in large collections of (likely related) time series, by measuring distances between structural breaks within such a collection. We introduce a class of semi-metric distance measures, which we term MJ distances. These semi-metrics provide an advantage over existing options such as the Hausdorff and Wasserstein metrics. We prove they have desirable properties, including better sensitivity to outliers, while experiments on simulated data demonstrate that they uncover similarity within collections of time series more effectively. Semi-metrics carry a potential disadvantage: without the triangle inequality, they may not satisfy a "transitivity property of closeness." We analyse this failure with proof and introduce a computational method to investigate it, with which we demonstrate that our semi-metrics violate transitivity infrequently and mildly. Finally, we apply our methods to cryptocurrency and measles data, introducing a judicious application of eigenvalue analysis.
    Comment: Accepted manuscript. Minor edits since v2. Equal contribution from first two authors

    Multilayer Networks for Text Analysis with Multiple Data Types

    We are interested in the widespread problem of clustering documents and finding topics in large collections of written documents in the presence of metadata and hyperlinks. To tackle the challenge of accounting for these different types of data, we propose a novel framework based on Multilayer Networks and Stochastic Block Models. The main innovation of our approach over other techniques is that it applies the same non-parametric probabilistic framework to the different sources of data simultaneously. The key difference from other multilayer complex networks is the strong imbalance between the layers, with the average degree of different node types scaling differently with system size. We show that the latter observation is due to generic properties of text, such as Heaps' law, and strongly affects the inference of communities. We present and discuss the performance of our method on different datasets (hundreds of Wikipedia documents, thousands of scientific papers, and thousands of emails), showing that taking into account multiple types of information provides a more nuanced view of topic and document clusters and increases the ability to predict missing links.
    Comment: 17 pages, 6 figures

    Spatial risk mapping for rare disease with hidden Markov fields and variational EM

    We recast the disease mapping issue of automatically classifying geographical units into risk classes as a clustering task using a discrete hidden Markov model and Poisson class-dependent distributions. The designed hidden Markov prior is non-standard and consists of a variation of the Potts model where the interaction parameter can depend on the risk classes. The model parameters are estimated using an EM algorithm and the mean field approximation. This provides a way to face the intractability of the standard EM in this spatial context, with a computationally efficient alternative to more intensive simulation-based Markov chain Monte Carlo (MCMC) procedures. We then focus on the issue of dealing with very low risk values and small numbers of observed cases and population sizes. We address the problem of finding good initial parameter values in this context and develop a new initialization strategy appropriate for spatial Poisson mixtures in the case of poorly separated classes, as encountered in animal disease risk analysis. Using both simulated and real data, we compare this strategy to other standard strategies and show that it performs well in many situations.

    A semi-parametric approach to estimate risk functions associated with multi-dimensional exposure profiles: application to smoking and lung cancer

    A common characteristic of environmental epidemiology is the multi-dimensional aspect of exposure patterns, frequently reduced to a cumulative exposure for simplicity of analysis. By adopting a flexible Bayesian clustering approach, we explore the risk function linking exposure history to disease. This approach is applied here to study the relationship between different smoking characteristics and lung cancer in the framework of a population-based case-control study.

    Champs aléatoires de Markov cachés pour la cartographie du risque en épidémiologie

    The analysis of the geographical variations of a disease and their representation on a map is an important step in epidemiology. The goal is to identify homogeneous regions in terms of disease risk and to gain better insights into the mechanisms underlying the spread of the disease. We recast the disease mapping issue of automatically classifying geographical units into risk classes as a clustering task using a discrete hidden Markov model and Poisson class-dependent distributions. The designed hidden Markov prior is non-standard and consists of a variation of the Potts model where the interaction parameter can depend on the risk classes. The model parameters are estimated using an EM algorithm and the mean field approximation. This provides a way to face the intractability of the standard EM in this spatial context, with a computationally efficient alternative to more intensive simulation-based Markov chain Monte Carlo (MCMC) procedures. We then focus on the issue of dealing with very low risk values and small numbers of observed cases and population sizes. We address the problem of finding good initial parameter values in this context and develop a new initialization strategy appropriate for spatial Poisson mixtures in the case of poorly separated classes, as encountered in animal disease risk analysis. We illustrate the performance of the proposed methodology on some animal epidemiological datasets provided by INRA.

    Hidden Markov random fields for risk mapping in epidemiology

    The analysis of the geographical variations of a disease and their representation on a map is an important step in epidemiology. The goal is to identify homogeneous regions in terms of disease risk and to gain better insights into the mechanisms underlying the spread of the disease. We recast the disease mapping issue of automatically classifying geographical units into risk classes as a clustering task using a discrete hidden Markov model and Poisson class-dependent distributions. The designed hidden Markov prior is non-standard and consists of a variation of the Potts model where the interaction parameter can depend on the risk classes. The model parameters are estimated using an EM algorithm and the mean field approximation. This provides a way to face the intractability of the standard EM in this spatial context, with a computationally efficient alternative to more intensive simulation-based Markov chain Monte Carlo (MCMC) procedures. We then focus on the issue of dealing with very low risk values and small numbers of observed cases and population sizes. We address the problem of finding good initial parameter values in this context and develop a new initialization strategy appropriate for spatial Poisson mixtures in the case of poorly separated classes, as encountered in animal disease risk analysis. We illustrate the performance of the proposed methodology on some animal epidemiological datasets provided by INRA.