Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.
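To make the linear-growth assumption concrete, the following minimal sketch simulates the Chinese restaurant process, the prior on partitions that underlies Dirichlet process mixture models, and reports the largest cluster as a fraction of the data set. Informally, a microclustering model would need this fraction to vanish as the data set grows, whereas under the Chinese restaurant process it tends not to. The function names, the concentration value alpha = 1.0 and the seed are illustrative assumptions, not taken from the paper.

```python
import random


def crp_cluster_sizes(n, alpha, rng):
    """Simulate cluster sizes for n points under a Chinese restaurant process."""
    sizes = []
    for i in range(n):
        # With i points already seated, open a new cluster
        # with probability alpha / (i + alpha).
        if rng.random() < alpha / (i + alpha):
            sizes.append(1)
            continue
        # Otherwise join an existing cluster with probability
        # proportional to its current size (the sizes sum to i).
        r = rng.random() * i
        acc = 0
        for k, size in enumerate(sizes):
            acc += size
            if r < acc:
                sizes[k] += 1
                break
    return sizes


def largest_cluster_fraction(n, alpha=1.0, seed=0):
    """Return (size of largest cluster) / n for one simulated partition."""
    sizes = crp_cluster_sizes(n, alpha, random.Random(seed))
    return max(sizes) / n


if __name__ == "__main__":
    # Illustrative run: print the fraction for growing data set sizes.
    for n in (100, 1000, 10000):
        print(n, round(largest_cluster_fraction(n), 3))
```

Running the script for several seeds gives a rough feel for how the largest cluster scales with the data set under this prior; it is a simulation of the baseline behaviour only, not of the model the abstract introduces.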
A Primer on the Data Cleaning Pipeline
The availability of both structured and unstructured databases, such as
electronic health data, social media data, patent data, and surveys that are
often updated in real time, among others, has grown rapidly over the past
decade. With this expansion, the statistical and methodological questions
around data integration, or rather merging multiple data sources, have also
grown. Specifically, the science of the "data cleaning pipeline" contains
four stages that allow an analyst to perform downstream tasks, predictive
analyses, or statistical analyses on "cleaned data." This article provides a
review of this emerging field, introducing technical terminology and commonly
used methods.
Cohesion and Repulsion in Bayesian Distance Clustering
Clustering in high dimensions poses many statistical challenges. While
traditional distance-based clustering methods are computationally feasible,
they lack probabilistic interpretation and rely on heuristics for estimation of
the number of clusters. On the other hand, probabilistic model-based clustering
techniques often fail to scale, and devising algorithms that are able to
effectively explore the posterior space is an open problem. Based on recent
developments in Bayesian distance-based clustering, we propose a hybrid
solution that entails defining a likelihood on pairwise distances between
observations. The novelty of the approach consists in including both cohesion
and repulsion terms in the likelihood, which allows for cluster
identifiability. This implies that clusters are composed of objects which have
small "dissimilarities" among themselves (cohesion) and similar dissimilarities
to observations in other clusters (repulsion). We show how this modelling
strategy has interesting connections with existing proposals in the literature
as well as a decision-theoretic interpretation. The proposed method is
computationally efficient and applicable to a wide variety of scenarios. We
demonstrate the approach in a simulation study and an application in digital
numismatics.
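The idea of placing a likelihood on pairwise distances, with one term for cohesion and one for repulsion, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's specification: the choice of Gamma densities for both terms, the parameter values, and the names gamma_logpdf and distance_loglik are all hypothetical.

```python
import math


def gamma_logpdf(x, shape, rate):
    """Log-density of a Gamma(shape, rate) distribution at x > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def distance_loglik(dist, labels, cohesion=(2.0, 4.0), repulsion=(8.0, 2.0)):
    """Log-likelihood of a partition given a pairwise distance matrix.

    Within-cluster distances are scored by the cohesion density and
    between-cluster distances by the repulsion density. Both default
    (shape, rate) pairs are assumptions chosen so that cohesion favours
    small distances and repulsion favours large ones.
    All off-diagonal distances must be strictly positive.
    """
    total = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            same = labels[i] == labels[j]
            shape, rate = cohesion if same else repulsion
            total += gamma_logpdf(dist[i][j], shape, rate)
    return total


# Toy data: four points on a line forming two well-separated pairs.
points = [0.0, 0.5, 4.0, 4.5]
dist = [[abs(a - b) for b in points] for a in points]

good = distance_loglik(dist, [0, 0, 1, 1])  # pairs grouped together
bad = distance_loglik(dist, [0, 1, 0, 1])   # pairs split apart
```

On this toy data the partition that groups nearby points scores higher than the one that splits them, which is the behaviour the cohesion and repulsion terms are meant to encourage.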
TweeProfiles4: a weighted multidimensional stream clustering algorithm
The emergence of social media made it possible for users to easily share their thoughts on different topics, which constitutes a rich source of information for many fields. Microblogging platforms have experienced large and steady growth over the last few years. Twitter is the most popular microblogging site, making it an interesting source of data for knowledge extraction. One of the main challenges of analyzing social media data is its continuous flow, which makes it hard to apply traditional data mining processes. Therefore, mining stream data has received a lot of attention recently. TweeProfiles is a data mining tool for analyzing and visualizing Twitter data over four dimensions: spatial (the geographic location of the tweet), temporal (the publication date of the tweet), content (the text of the tweet) and social (the relationship graph). This is an ongoing project which still has many aspects that can be improved. For instance, it was recently improved by replacing the original clustering algorithm, which could not handle the continuous flow of data, with a streaming method. The goal of this dissertation is to continue the development of TweeProfiles. First, the stream clustering process will be improved by proposing a new algorithm. This will be achieved by developing an incremental algorithm with support for multi-dimensional streaming data. Moreover, it should make it possible for the user to dynamically change the relative importance of each dimension in the clustering. Additionally, the empirical evaluation of the results will be improved: suitable measures to evaluate the extracted patterns will be identified and implemented. An empirical study will be done using georeferenced tweets obtained from SocialBus.
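A weighted multidimensional distance with user-adjustable dimension weights, combined with a simple incremental assignment rule, might look like the sketch below. It is an illustration under stated assumptions and not the TweeProfiles4 algorithm: the Tweet fields, the normalisation scales, the threshold rule and all names are hypothetical, and the social dimension is omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    lat: float
    lon: float
    timestamp: float   # seconds since some epoch
    words: frozenset   # tokens of the tweet text


def haversine_km(a, b):
    """Great-circle distance between two tweets in kilometres."""
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dlon = math.radians(b.lon - a.lon)
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|, defined as 0 for two empty sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def weighted_distance(a, b, weights, space_scale_km=100.0, time_scale_s=86400.0):
    """Weighted mean of per-dimension distances, each capped to [0, 1]."""
    parts = {
        "spatial": min(haversine_km(a, b) / space_scale_km, 1.0),
        "temporal": min(abs(a.timestamp - b.timestamp) / time_scale_s, 1.0),
        "content": jaccard_distance(a.words, b.words),
    }
    total_weight = sum(weights[k] for k in parts)  # must be positive
    return sum(weights[k] * parts[k] for k in parts) / total_weight


class IncrementalClusterer:
    """Assign each arriving tweet to the nearest cluster or open a new one."""

    def __init__(self, weights, threshold=0.5):
        self.weights = dict(weights)  # may be changed between arrivals
        self.threshold = threshold
        self.representatives = []     # first tweet seen in each cluster

    def add(self, tweet):
        best, best_d = None, float("inf")
        for k, rep in enumerate(self.representatives):
            d = weighted_distance(tweet, rep, self.weights)
            if d < best_d:
                best, best_d = k, d
        if best is not None and best_d <= self.threshold:
            return best
        self.representatives.append(tweet)
        return len(self.representatives) - 1
```

Because the weights are read on every call to add, changing an entry of clusterer.weights between arrivals shifts the relative importance of that dimension for subsequent tweets, which is one simple way to expose the dynamic weighting the abstract describes.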