11 research outputs found

    A fast and recursive algorithm for clustering large datasets with k-medians

    Clustering large samples of high-dimensional data with fast algorithms is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967), who introduced a sequential version of the k-means algorithm, a new class of recursive stochastic gradient algorithms designed for the k-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. Particular attention is paid to the averaged versions, which are known to have better performance, and a data-driven procedure that allows automatic selection of the value of the descent step is proposed. The performance of the averaged sequential estimator is compared in a simulation study, in terms of both computation speed and accuracy of the estimates, with more classical partitioning techniques such as k-means, trimmed k-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated by determining television audience profiles from a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours. Comment: under revision for Computational Statistics and Data Analysis.
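    The recursive scheme described above can be sketched in a few lines: each arriving observation moves its nearest center one step along the unit direction toward the point, which is a stochastic gradient step for the k-medians loss, and the final estimator is the running average of the center trajectory. The step-size schedule and constants below are illustrative assumptions, not the paper's data-driven selection procedure.

```python
import numpy as np

def online_k_medians(stream, k, c0=1.0, alpha=0.75):
    """Recursive stochastic gradient k-medians (illustrative sketch).

    Each point moves its nearest center a step of size c0 / count^alpha
    along the unit direction toward the point (minus the gradient of
    the k-medians loss E[min_j |X - c_j|]); the returned estimator is
    the running average of each center's trajectory.
    """
    stream = iter(stream)
    # seed the centers with the first k observations
    centers = np.array([next(stream) for _ in range(k)], dtype=float)
    averaged = centers.copy()
    counts = np.zeros(k)  # per-center update counters
    for x in stream:
        d = np.linalg.norm(centers - x, axis=1)
        j = int(np.argmin(d))
        counts[j] += 1
        step = c0 / counts[j] ** alpha
        if d[j] > 0:  # unit-direction stochastic gradient step
            centers[j] += step * (x - centers[j]) / d[j]
        # Polyak-Ruppert averaging of the trajectory
        averaged[j] += (centers[j] - averaged[j]) / counts[j]
    return averaged
```

Because each update touches only the nearest center, one pass over the data suffices and nothing needs to be stored, which is what makes the algorithm suitable for sequentially arriving samples.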

    Event risk score and online scoring for heart failure patients

    We present the construction of a short-term event risk score for heart failure patients. We then suppose that patient data arrive continuously and that the score function must be updated online. In particular, we study the online estimation of the parameters of a linear regression model by a stochastic gradient process, using data standardized online instead of the raw data.
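    A minimal sketch of the idea, assuming a plain least-squares model, squared-error loss and an illustrative step-size schedule (none of which are taken from the paper): running means and variances of the covariates are updated online, each new covariate vector is standardized with the current estimates, and the coefficients take one stochastic gradient step per observation.

```python
import numpy as np

def online_standardized_sgd(stream, p, lr=0.1):
    """Online least-squares regression on online-standardized data (sketch)."""
    theta = np.zeros(p + 1)  # intercept + slopes on standardized scale
    mean = np.zeros(p)
    m2 = np.ones(p)          # running sum of squared deviations
                             # (initialized at one to avoid an early
                             #  division by zero; an assumption of this sketch)
    n = 0
    for x, y in stream:
        n += 1
        # Welford-style running mean / variance update
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        sd = np.sqrt(m2 / n)
        # standardize with the *current* online estimates
        z = np.concatenate(([1.0], (x - mean) / np.maximum(sd, 1e-8)))
        grad = (z @ theta - y) * z  # gradient of 0.5 * (z'theta - y)^2
        theta -= (lr / np.sqrt(n)) * grad
        # (the averaged version would also keep a running mean of theta)
    return theta
```

Standardizing with running estimates rather than raw data keeps the gradient steps well conditioned even when covariates live on very different scales.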

    Fast Estimation of the Median Covariation Matrix with Application to Online Robust Principal Components Analysis

    The geometric median covariation matrix is a robust multivariate indicator of dispersion that extends without difficulty to functional data. We define estimators, based on recursive algorithms, that can be updated simply at each new observation and can deal rapidly with large samples of high-dimensional data without storing all the data in memory. Asymptotic convergence properties of the recursive algorithms are studied under weak conditions. The computation of the principal components can also be performed online, and this approach can be useful for online outlier detection. A simulation study clearly shows that this robust indicator is a competitive alternative to the minimum covariance determinant when the dimension of the data is small, and to robust principal components analysis based on projection pursuit and spherical projections for high-dimensional data. An illustration on a large, high-dimensional dataset consisting of individual TV audiences measured at a one-minute scale over a period of 24 hours confirms the interest of robust principal components analysis based on the median covariation matrix. All studied algorithms are available in the R package Gmedian on CRAN.
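    The recursive spirit of these estimators can be sketched as follows, under assumptions of our own (step sizes c / n^alpha and simultaneous recursions for the geometric median m and the median covariation matrix V, with Polyak-Ruppert averaging); the Gmedian package contains the actual implementations.

```python
import numpy as np

def streaming_median_covariation(stream, p, c=1.0, alpha=0.66):
    """Recursive geometric median and median covariation matrix (sketch).

    m is pushed one step along the unit direction toward each new point
    (gradient of E|X - m|); V is pushed toward the rank-one matrix
    (X - m)(X - m)' along a Frobenius-normalized direction. The returned
    estimators are the running averages of the two recursions.
    """
    m = np.zeros(p)
    V = np.eye(p)
    m_bar = m.copy()
    V_bar = V.copy()
    for n, x in enumerate(stream, start=1):
        step = c / n ** alpha
        d = np.linalg.norm(x - m)
        if d > 0:
            m += step * (x - m) / d
        R = np.outer(x - m, x - m) - V
        nR = np.linalg.norm(R)  # Frobenius norm of the residual
        if nR > 0:
            V += step * R / nR
        m_bar += (m - m_bar) / n  # averaged (final) estimates
        V_bar += (V - V_bar) / n
    return m_bar, V_bar
```

Each update costs O(p^2) and touches only the current observation, so the estimates can be refreshed on the fly as data stream in.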

    Material metabolism of residential buildings in Sweden: Material intensity database, stocks and flows, and spatial analysis

    Construction materials are used for the expansion and maintenance of the built environment. In the last century, construction material stock has increased globally 23-fold. Given the current situation, the accumulated stock can be viewed as a repository of anthropogenic resources, which at the end of life could be re-circulated through the economic system to minimize the inflow of raw materials and the outflow of waste. A major step toward increased material circularity is the development of the supporting knowledge infrastructure. For this reason, research has focused on developing methods intended for exposing the material metabolism, namely, estimating the stocks and flows and analyzing the spatial and temporal dynamics of stocks and flows. Residential buildings comprise a large share of the built environment. However, the material metabolism of these structures has remained unknown in many geographical contexts. Therefore, in this thesis, a bottom-up approach is employed to uncover the metabolism of residential buildings in Sweden. This goal is achieved through three methodological steps. First, a material intensity database is assembled based on architectural drawings of 46 residential buildings built within the period 1880–2010 in Sweden. Second, the stocks and flows are modeled with spatial and statistical inventory data and the developed material intensity database. Third, new spatial analysis approaches to the stocks and flows are conducted within urban and national boundaries. For the urban context, material stock indicators defined at the neighborhood level are clustered with well-known algorithms. At the national level, eight settlement types are considered to indicate the spatial dynamics. The developed database indicates historical trends in terms of the material intensity and composition for residential buildings in Sweden. 
    Moreover, the results contribute to establishing a global database and, through an extended international cross-comparison, to understanding how the material intensity and composition of residential buildings differ geographically. Furthermore, the stocks and flows are estimated in millions of metric tons at different administrative boundary levels. Among the six categories considered, mineral-binding materials, such as concrete, comprise the largest share of the accumulated stock. Finally, spatial differences in material stock composition are depicted in urban geography and nationally, among the eight settlement types. At the national level, densely built-up corridors are identified, which should be used for enhancing material circularity. This thesis contributes data source exploration, methodological development, and critical analyses relevant to researchers, policy makers, and practitioners interested in a more sustainable metabolism of construction materials in the built environment.
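    The bottom-up logic of the second methodological step reduces to multiplying inventoried floor areas by cohort-specific material intensities and summing over cohorts. The numbers below are hypothetical placeholders, not values from the thesis's database:

```python
# Minimal bottom-up stock calculation (illustrative numbers only):
# material stock = floor area x material intensity, summed over
# building cohorts for each material category.
material_intensity = {  # kg per m^2 of floor area (hypothetical)
    ("1880-1945", "mineral-binding"): 900.0,
    ("1880-1945", "wood"): 120.0,
    ("1946-2010", "mineral-binding"): 1100.0,
    ("1946-2010", "wood"): 60.0,
}
floor_area_m2 = {  # inventory data per cohort (hypothetical)
    "1880-1945": 2.0e6,
    "1946-2010": 8.0e6,
}

stock_tonnes = {}
for (cohort, material), kg_per_m2 in material_intensity.items():
    # convert kg to metric tons while accumulating
    stock_tonnes[material] = stock_tonnes.get(material, 0.0) + \
        floor_area_m2[cohort] * kg_per_m2 / 1000.0
```

In the thesis the same multiplication is carried out per building archetype and administrative unit, which is what enables the spatial analyses that follow.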

    Widening the scope of an eigenvector stochastic approximation process and application to streaming PCA and related methods

    We prove the almost sure convergence of Oja-type processes to eigenvectors of the expectation B of a random matrix while relaxing the i.i.d. assumption on the observed random matrices (B_n), assuming either that (B_n) converges to B or that (E[B_n | T_n]) converges to B, where T_n is the sigma-field generated by the events before time n. As an application of this generalization, the online PCA of a random vector Z can be performed when there is a data stream of i.i.d. observations of Z, even when both the metric M used and the expectation of Z are unknown and estimated online. Moreover, in order to update the stochastic approximation process at each step, we are no longer bound to using only a mini-batch of observations of Z: all previous observations up to the current step can be used without having to store them. This is useful not only when dealing with streaming data but also with Big Data, since the latter can be processed sequentially as a data stream. In addition, the general framework of this process, unlike other algorithms in the literature, also covers the case of factorial methods related to PCA.
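    A minimal sketch of an Oja-type process for the leading eigenvector, with the observed matrices taken to be rank-one products x x' of centered data points and an illustrative 1/n step-size schedule (both are assumptions of this sketch, not requirements of the paper's more general framework):

```python
import numpy as np

def oja_streaming_pc(stream, p, lr=1.0):
    """Oja-type stochastic approximation of the leading eigenvector (sketch).

    Each observed matrix B_n (here the rank-one x x' of a data point,
    a noisy observation of B = E[x x']) nudges the current direction u,
    which is then projected back to the unit sphere.
    """
    u = np.ones(p) / np.sqrt(p)       # arbitrary unit starting direction
    for n, x in enumerate(stream, start=1):
        B_n = np.outer(x, x)          # noisy observation of B
        u = u + (lr / n) * (B_n @ u)  # Oja update
        u = u / np.linalg.norm(u)     # renormalize
    return u
```

With i.i.d. observations this is the classical setting; the paper's contribution is precisely that convergence still holds when (B_n), or its conditional expectation given the past, merely tends to B.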

    An overview of clustering methods with guidelines for application in mental health research

    Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and increasing popularity, there is little guidance on model choice, analytical framework and reporting requirements. In this paper, we aimed to address this gap by introducing the philosophy, design, advantages/disadvantages and implementation of major algorithms that are particularly relevant in mental health research. Extensions of basic models, such as kernel methods, deep learning, semi-supervised clustering, and clustering ensembles, are subsequently introduced. How to choose algorithms to address common issues, as well as methods for pre-clustering data processing, clustering evaluation and validation, is then discussed. Importantly, we also provide general guidance on clustering workflow and reporting requirements. To facilitate the implementation of different algorithms, we provide information on R functions and libraries.

    Non-hierarchical clustering versus CART and BIPLOT

    INTRODUCTION. Every day we are more deeply immersed in a world in which data grow and grow. Data mining (DM), closely related to Knowledge Discovery in Databases (KDD), allows us to discover information in large volumes of data; such techniques are fundamental for analyzing data effectively while revealing previously unknown patterns (Holsheimer & Siebes, 1994). KDD is a process consisting of a set of phases that includes data pre-processing, mining and post-processing. Data mining is an Artificial Intelligence technique that extracts useful, comprehensible and previously unknown knowledge from large volumes of data through the application of an algorithm that extracts patterns from the data. To analyze data with a focus on knowledge discovery, the field has adapted, and what is called spatial data mining (SDM) has emerged, understood as the automatic process of exploring large quantities of spatial data with the aim of discovering knowledge. In research activity it is of great interest to identify associations, patterns and rules. Among DM techniques is clustering. Data clustering is a fundamental problem in a variety of areas of computer science and related fields, such as data analysis, data compression and statistical data analysis (Aboubi, Drias, & Kamel, 2016). It can be considered the most important problem in unsupervised learning, since it tries to find structure in unlabeled data (Jain & Dubes, 1988; Jain, Murty, & Flynn, 1999). The best-known clustering algorithms are hierarchical methods and partitioning methods, although there are also density-based and grid-based methods.
    There are several reasons why partitioning (unsupervised-learning) clustering methods are of interest: they are fast to implement, they converge quickly, and they allow elements to be categorized, among others. However, these algorithms suffer drawbacks when unsuitable initial parameters are specified, which can lead to poor convergence. Different clustering methods have been developed to address various problems such as computational cost, sensitivity to initialization, unbalanced classes and convergence to a local optimum, among others. Nevertheless, to select a method it is necessary to consider the nature of the data and the conditions of the problem, so as to group similar patterns with a good trade-off between computational cost and effectiveness in class separability. Some partition-based algorithms are the K-means algorithm, the K-medoids algorithm, the Partitioning Around Medoids (PAM) algorithm, and a version of PAM designed for larger datasets called CLARA (Gupta & Panda, 2018). Numerous researchers have proposed K-means and K-medoids algorithms (Borah & Ghose, 2009; Dunham, 2002; Han & Kamber, 2006; Khan & Ahmad, 2004; Park, Lee, & Jun, 2006; Rakhlin & Caponnetto, 2007; Xiong, Wu, & Chen, 2009). Clustering has gained wide use, and its importance has grown in proportion to the ever-increasing amount of data and the exponential increase in computer processing speeds. Its importance can be seen in its wide variety of applications, whether in education, industry, agriculture or economics. Clustering techniques have become very useful for large datasets, including in social networks such as Facebook and Twitter (Soni & Patel, 2017).
    Cluster analysis plays an indispensable role in exploring the underlying structure of a given dataset and is widely used in a variety of engineering and scientific subjects, such as medicine, sociology, psychology and image retrieval, as well as in other areas such as customer segmentation in finance (Abonyi & Feil, 2007), biology (Der & Everitt, 2005; Quinn & Keough, 2002) and ecology (McGarigal, Cushman, & Stanford, 2000), since most of the time it makes no statistical assumptions in carrying out the grouping process (Leiva-Valdebenito & Torres-Avilés, 2010).
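    As a concrete instance of the partitioning methods discussed above, a minimal Lloyd-style K-means can be sketched as follows (random-sample initialization and a fixed iteration budget are simplifying assumptions of this sketch; production code would use a smarter initialization such as k-means++):

```python
import numpy as np

def k_means(X, k, iters=50, seed=0):
    """Plain Lloyd's K-means: alternate assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # initialize centers by sampling k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        labels = np.argmin(
            ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

The sensitivity to initialization mentioned above is visible here: a poor draw of starting centers can leave the alternation stuck in a local optimum, which is what variants like PAM and CLARA try to mitigate for medoid-based criteria.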

    The applications of loyalty card data for social science

    Large-scale consumer datasets have become increasingly abundant in recent years, and many have turned their attention to harnessing them for insights within the social sciences. Whilst commercial organisations have been quick to recognise the benefits of these data as a source of competitive advantage, their emergence has been met with contention in research due to the epistemological, methodological and ethical challenges they present. These issues have seldom been addressed, primarily because such data are hard to obtain outside of the commercial settings in which they are generated. This thesis presents an exploration of a unique loyalty card dataset obtained from one of the most prominent UK high street retailers, and thus an opportunity to study the dynamics, potentialities and limitations of applying such data in a research context. The predominant aims of this work were, first, to address issues of uncertainty surrounding novel consumer datasets by quantifying their inherent representation and data quality issues and, second, to explore the extent to which we may enrich our current knowledge of spatiotemporal population processes through the analysis of consumer activity patterns. Our current understanding of such dynamics has been limited by the data-scarce era, yet loyalty card data provide individual-level, georeferenced population data that are high in velocity. This provided a framework for understanding more detailed interactions between people and places, and what these might indicate for both consumption behaviours and wider societal phenomena. This work endeavoured to provide a substantive contribution to the integration of consumer datasets in social science research, by outlining pragmatic steps to ensure novel data sources can be fit for purpose, and to population geography research, by exploring the extent to which spatiotemporal consumption activities may be used to make broad inferences about the general population.

    Société Francophone de Classification (SFC): Proceedings of the 26th Rencontres

    The proceedings of the meeting of the Société Francophone de Classification (SFC, http://www.sfc-classification.net/) contain all the contributions presented at the meeting held from 3 to 5 September 2019 at the Centre de Recherche Inria Nancy Grand Est/LORIA, Nancy. Classification in all its forms, whether mathematical, computational (machine learning, data mining, knowledge discovery, ...) or statistical, is the theme studied during these days. The idea is to illustrate the different facets of classification that reflect the interests of researchers in the field, coming from mathematics and computer science.