11 research outputs found
A fast and recursive algorithm for clustering large datasets with k-medians
Clustering large samples of high dimensional data with fast algorithms is an
important challenge in computational statistics. Borrowing ideas from MacQueen
(1967), who introduced a sequential version of the k-means algorithm, a new
class of recursive stochastic gradient algorithms designed for the k-medians
loss criterion is proposed. By their recursive nature, these algorithms are
very fast and well adapted to large samples of data that are allowed to arrive
sequentially. It is proved that the stochastic gradient algorithm converges
almost surely to the set of stationary points of the underlying loss criterion.
Particular attention is paid to the averaged versions, which are known to have
better performance, and a data-driven procedure that allows automatic selection
of the descent step size is proposed.
The performance of the averaged sequential estimator is compared in a
simulation study, both in terms of computation speed and accuracy of the
estimates, with more classical partitioning techniques such as k-means,
trimmed k-means and PAM (partitioning around medoids). Finally, this new
online clustering technique is illustrated by determining television audience
profiles from a sample of more than 5000 individual television audiences
measured every minute over a period of 24 hours. Comment: Under revision for Computational Statistics and Data Analysis.
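The recursive k-medians scheme this abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' exact algorithm: the step-size schedule gamma0/n^alpha, the initialization from the first k observations, and all names are assumptions.

```python
import numpy as np

def online_kmedians(stream, k, gamma0=0.5, alpha=0.66):
    """Recursive stochastic-gradient k-medians in the spirit of a
    MacQueen-style sequential pass. Illustrative sketch only."""
    it = iter(stream)
    # initialize centers with the first k observations (an assumption)
    centers = np.array([next(it) for _ in range(k)], dtype=float)
    averaged = centers.copy()  # Polyak-Ruppert averaged version
    counts = np.zeros(k)       # per-center update counters
    for x in it:
        x = np.asarray(x, dtype=float)
        j = int(np.argmin(np.linalg.norm(centers - x, axis=1)))  # nearest center
        counts[j] += 1
        diff = x - centers[j]
        norm = np.linalg.norm(diff)
        if norm > 0:
            # gradient of ||x - c|| with respect to c is -(x - c)/||x - c||
            centers[j] += (gamma0 / counts[j] ** alpha) * diff / norm
        averaged[j] += (centers[j] - averaged[j]) / counts[j]  # running average
    return centers, averaged
```

Each observation is touched once and nothing is stored, which is what makes the approach suitable for data arriving sequentially; the averaged iterates are the estimator the abstract highlights as more accurate.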
Event risk score and online score for heart failure patients
We present the construction of a short-term event risk score for heart failure patients. We then suppose that patient data arrive continuously and that the score function is to be updated online. In particular, we study the online estimation of the parameters of a linear regression model by a stochastic gradient process, using online-standardized data instead of the raw data.
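The combination of online standardization and stochastic gradient regression mentioned above can be sketched as follows. This is a hypothetical reading of the abstract, not the authors' algorithm: the Welford recursion for running moments, the step-size schedule, and every name are assumptions.

```python
import numpy as np

def online_standardized_sgd(stream, dim, gamma0=0.5, alpha=0.6):
    """SGD for linear regression on online-standardized covariates.
    Running means/variances are updated with Welford's recursion, each
    incoming x is standardized with the current estimates, and the
    coefficients take one stochastic gradient step on the squared loss."""
    mean = np.zeros(dim)
    m2 = np.zeros(dim)          # running sum of squared deviations (Welford)
    theta = np.zeros(dim + 1)   # intercept followed by slopes
    n = 0
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        std = np.sqrt(m2 / n)
        std = np.where(std > 0, std, 1.0)              # guard against zero variance
        z = np.concatenate(([1.0], (x - mean) / std))  # standardized regressors
        err = z @ theta - y
        theta -= (gamma0 / n ** alpha) * err * z       # one SGD step
    return theta
```

With covariates that are already roughly standard normal, the returned slopes approximate the raw regression coefficients; in general they are coefficients on the standardized scale.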
Fast Estimation of the Median Covariation Matrix with Application to Online Robust Principal Components Analysis
The geometric median covariation matrix is a robust multivariate indicator of
dispersion which can be extended without any difficulty to functional data. We
define estimators, based on recursive algorithms, that can be simply updated at
each new observation and can deal rapidly with large samples of high-dimensional
data without storing all the data in memory.
Asymptotic convergence properties of the recursive algorithms are studied under
weak conditions. The computation of the principal components can also be
performed online and this approach can be useful for online outlier detection.
A simulation study clearly shows that this robust indicator is a competitive
alternative to the minimum covariance determinant estimator when the dimension
of the data is small, and to robust principal components analysis based on
projection pursuit and spherical projections for high-dimensional data. An illustration on a large
sample and high dimensional dataset consisting of individual TV audiences
measured at a minute scale over a period of 24 hours confirms the interest of
considering the robust principal components analysis based on the median
covariation matrix. All studied algorithms are available in the R package
Gmedian on CRAN.
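The recursive geometric median, the building block behind the median covariation matrix, can be sketched as follows. This is an illustrative NumPy transcription of the core recursion under assumed step sizes; the reference implementation is the Gmedian R package mentioned above.

```python
import numpy as np

def streaming_geometric_median(stream, gamma0=1.0, alpha=0.66):
    """Averaged stochastic-gradient estimate of the geometric median.
    Each observation moves the iterate by a unit direction vector with a
    decaying step; the running average of the iterates is returned."""
    it = iter(stream)
    m = np.asarray(next(it), dtype=float).copy()
    m_bar = m.copy()
    n = 1
    for x in it:
        n += 1
        diff = np.asarray(x, dtype=float) - m
        norm = np.linalg.norm(diff)
        if norm > 0:
            # unit-direction step: bounded influence, hence robust to outliers
            m += (gamma0 / n ** alpha) * diff / norm
        m_bar += (m - m_bar) / n  # averaged iterate, used as the final estimate
    return m_bar
```

Because every observation contributes a step of bounded length, a few gross outliers cannot drag the estimate far, in contrast to the running mean.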
Material metabolism of residential buildings in Sweden: Material intensity database, stocks and flows, and spatial analysis
Construction materials are used for the expansion and maintenance of the built environment. In the last century, construction material stock has increased globally 23-fold. Given the current situation, the accumulated stock can be viewed as a repository of anthropogenic resources, which at the end of life could be re-circulated through the economic system to minimize the inflow of raw materials and the outflow of waste. A major step toward increased material circularity is the development of the supporting knowledge infrastructure. For this reason, research has focused on developing methods intended for exposing the material metabolism, namely, estimating the stocks and flows and analyzing the spatial and temporal dynamics of stocks and flows. Residential buildings comprise a large share of the built environment. However, the material metabolism of these structures has remained unknown in many geographical contexts. Therefore, in this thesis, a bottom-up approach is employed to uncover the metabolism of residential buildings in Sweden. This goal is achieved through three methodological steps. First, a material intensity database is assembled based on architectural drawings of 46 residential buildings built within the period 1880–2010 in Sweden. Second, the stocks and flows are modeled with spatial and statistical inventory data and the developed material intensity database. Third, new spatial analysis approaches to the stocks and flows are conducted within urban and national boundaries. For the urban context, material stock indicators defined at the neighborhood level are clustered with well-known algorithms. At the national level, eight settlement types are considered to indicate the spatial dynamics. The developed database indicates historical trends in terms of the material intensity and composition for residential buildings in Sweden. 
Moreover, the results contribute to establishing a global database and, through an extended international cross-comparison, to the understanding of how the material intensity and composition of residential buildings differ geographically. Furthermore, the stocks and flows are estimated in million metric tons at different administrative boundary levels. Among the six categories considered, mineral-binding materials, such as concrete, comprise the largest share of the accumulated stock. Finally, spatial differences in material stock composition are depicted in urban geography and nationally, among the eight settlement types. At the national level, densely built-up corridors are identified, which should be used for enhancing material circularity. This thesis contributes data source exploration, methodological development, and critical analyses relevant to researchers, policy makers, and practitioners interested in a more sustainable metabolism of construction materials in the built environment.
Widening the scope of an eigenvector stochastic approximation process and application to streaming PCA and related methods
We prove the almost sure convergence of Oja-type processes to eigenvectors of the expectation B of a random matrix while relaxing the i.i.d. assumption on the observed random matrices (B_n), assuming either that (B_n) converges to B or that (E[B_n | T_n]) converges to B, where T_n is the sigma-field generated by the events before time n. As an application of this generalization, the online PCA of a random vector Z can be performed when there is a data stream of i.i.d. observations of Z, even when both the metric M used and the expectation of Z are unknown and estimated online. Moreover, in order to update the stochastic approximation process at each step, we are no longer bound to using only a mini-batch of observations of Z: all previous observations up to the current step can be used without having to store them. This is useful not only when dealing with streaming data but also with Big Data, as the latter can be processed sequentially as a data stream. In addition, the general framework of this process, unlike other algorithms in the literature, also covers the case of factorial methods related to PCA.
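The classical Oja process that this work generalizes can be sketched as follows, here in the plain i.i.d. setting with fixed metric. Step sizes and names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def oja_top_eigenvector(stream, dim, gamma0=0.1, alpha=0.6, seed=0):
    """Oja-type stochastic approximation of the leading eigenvector of
    B = E[x x^T]: w <- normalize(w + gamma_n * x (x^T w)). Minimal sketch."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    for n, x in enumerate(stream, start=1):
        x = np.asarray(x, dtype=float)
        w += (gamma0 / n ** alpha) * x * (x @ w)  # rank-one stochastic step
        w /= np.linalg.norm(w)                    # project back to the unit sphere
    return w
```

The generalization in the abstract replaces the i.i.d. matrices x_n x_n^T by a sequence (B_n) that is only required to converge (conditionally) to B, which is what permits online estimation of the metric and the mean at the same time.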
An overview of clustering methods with guidelines for application in mental health research
Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity
by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and
their increasing popularity, there is little guidance on model choice, analytical framework and reporting requirements.
In this paper, we aimed to address this gap by introducing the philosophy, design, advantages/disadvantages and
implementation of the major algorithms that are particularly relevant in mental health research. Extensions of the basic
models, such as kernel methods, deep learning, semi-supervised clustering and clustering ensembles, are subsequently
introduced. We then discuss how to choose algorithms to address common issues, as well as methods for pre-clustering
data processing, clustering evaluation and validation. Importantly, we also provide general
guidance on clustering workflow and reporting requirements. To facilitate the implementation of the different algorithms,
we provide information on R functions and libraries.
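One internal validation index of the kind such guidelines typically recommend is the silhouette width. As a concrete illustration (a pure-NumPy sketch for small datasets, not taken from the paper), it can be computed as:

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-sample silhouette width s(i) = (b_i - a_i) / max(a_i, b_i),
    where a_i is the mean distance to points in the same cluster and
    b_i the smallest mean distance to another cluster. Needs >= 2 clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    uniq = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0  # mean intra-cluster distance
        b = min(D[i, labels == c].mean() for c in uniq if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s
```

Values near 1 indicate well-separated clusters; values near or below 0 flag samples that may be assigned to the wrong subgroup, which is useful when comparing candidate cluster solutions.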
Non-hierarchical clustering versus CART and BIPLOT
INTRODUCTION. Every day we are more immersed in a world in which data grow and grow. Data mining (DM), closely related to Knowledge Discovery in Databases (KDD), allows us to discover information in large volumes of data; both are fundamental for analyzing data effectively while revealing previously unknown patterns (Holsheimer & Siebes, 1994).
KDD is a process consisting of a set of phases that includes pre-processing, mining and post-processing of the data. Data mining is an Artificial Intelligence technique that extracts useful, comprehensible and previously unknown knowledge from large volumes of data, and consists of applying an algorithm to extract patterns from data. To analyze data with a focus on knowledge discovery, the field has adapted, and what is called spatial data mining (SDM) has emerged: the automatic process of exploring large amounts of spatial data with the aim of discovering knowledge.
In research it is of great interest to identify associations, patterns and rules. Clustering is one of the DM techniques. Data clustering is a fundamental problem in a variety of areas of computer science and related fields, such as data analysis, data compression and statistical data analysis (Aboubi, Drias, & Kamel, 2016). It can be considered the most important unsupervised learning problem, seeking structure in unlabeled data (Jain & Dubes, 1988; Jain, Murty, & Flynn, 1999).
The best-known clustering algorithms are hierarchical methods and partitioning methods, although there are also density-based and grid-based methods. There are several reasons why partitioning (unsupervised learning) methods are of interest: among others, they are fast to implement, converge quickly, and allow elements to be categorized. However, these algorithms suffer from drawbacks when unsuitable initial parameters are specified, which can lead to poor convergence. Different clustering methods have been developed to address problems such as computational cost, sensitivity to initialization, unbalanced classes and convergence to a local optimum, among others. To select a method, however, it is necessary to consider the nature of the data and the conditions of the problem, so as to group similar patterns with a good trade-off between computational cost and effectiveness in separating the classes.
Some partition-based algorithms are the K-means algorithm, the K-medoids algorithm, the Partitioning Around Medoids (PAM) algorithm, and a version of PAM designed for larger datasets called CLARA (Gupta & Panda, 2018). Numerous researchers have proposed K-means and K-medoids algorithms (Borah & Ghose, 2009; Dunham, 2002; Han & Kamber, 2006; Khan & Ahmad, 2004; Park, Lee, & Jun, 2006; Rakhlin & Caponnetto, 2007; Xiong, Wu, & Chen, 2009).
Clustering has gained wide use, and its importance has grown in proportion to the ever-increasing amount of data and the exponential increase in computer processing speeds. The importance of clustering can be understood from its wide variety of applications, whether in education, industry, agriculture or economics. Clustering techniques have become very useful for large datasets, even in social networks such as Facebook and Twitter (Soni & Patel, 2017). Cluster analysis plays an indispensable role in exploring the underlying structure of a given dataset and is widely used in a variety of engineering and scientific topics, such as medicine, sociology, psychology and image retrieval, as well as in other areas such as customer segmentation studies in finance (Abonyi & Feil, 2007), biology (Der & Everitt, 2005; Quinn & Keough, 2002) and ecology (McGarigal, Cushman, & Stanford, 2000), among others, since most of the time it does not rely on any statistical assumption to carry out the clustering process (Leiva-Valdebenito & Torres-Avilés, 2010).
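The K-medoids family mentioned above can be illustrated with a minimal alternating scheme. This is a simplified Voronoi-iteration variant, not the full PAM swap phase; all parameter choices and names are assumptions for the sketch.

```python
import numpy as np

def kmedoids(X, k, n_iter=50, seed=0):
    """Alternate between assigning points to the nearest medoid and
    re-electing each medoid as the cluster member that minimizes the
    total within-cluster distance. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                sub = D[np.ix_(members, members)]
                # member minimizing total distance to its cluster
                new_medoids[j] = members[sub.sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels
```

Because the centers are constrained to be observed points, the method accepts any distance matrix and is less sensitive to outliers than K-means, which is why PAM and its large-sample variant CLARA are popular.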
The applications of loyalty card data for social science
Large-scale consumer datasets have become increasingly abundant in recent years and many have turned their attention to harnessing these for insights within the social sciences. Whilst commercial organisations have been quick to recognise the benefits of these data as a source of competitive advantage, their emergence has been met with contention in research due to the epistemological, methodological and ethical challenges they present. These issues have seldom been addressed, primarily because these data are hard to obtain outside of the commercial settings in which they are often generated. This thesis presents an exploration of a unique loyalty card dataset obtained from one of the most prominent UK high street retailers, and thus an opportunity to study the dynamics, potentialities and limitations of applying such data in a research context. The predominant aims of this work were, firstly, to address issues of uncertainty surrounding novel consumer datasets by quantifying their inherent representation and data quality issues and, secondly, to explore the extent to which we may enrich our current knowledge of spatiotemporal population processes through the analysis of consumer activity patterns. Our current understanding of such dynamics has been limited by the data-scarce era, yet loyalty card data provide individual-level, georeferenced population data that are high in velocity. This provided a framework for understanding more detailed interactions between people and places, and what these might indicate for both consumption behaviours and wider societal phenomena. This work endeavoured to provide a substantive contribution to the integration of consumer datasets in social science research, by outlining pragmatic steps to ensure novel data sources can be fit for purpose, and to population geography research, by exploring the extent to which we may utilise spatiotemporal consumption activities to make broad inferences about the general population.
Société Francophone de Classification (SFC): proceedings of the 26th meeting
The proceedings of the meeting of the Société Francophone de Classification (SFC, http://www.sfc-classification.net/) contain all the contributions presented at the meeting held from 3 to 5 September 2019 at the Inria Nancy Grand Est/LORIA research center in Nancy. Classification in all its forms, whether mathematical, computational (machine learning, data mining and knowledge discovery, ...) or statistical, is the theme studied during these days. The idea is to illustrate the different facets of classification, reflecting the interests of researchers in the field coming from mathematics and computer science.