3 research outputs found
Discovering structure without labels
The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction.
The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine the number of clusters and the assignment of points to clusters. It iteratively moves points towards crisp clusters like Mean Shift but also has close ties to the Multicut problem via its loss function. As a result, it connects signed graph partitioning to clustering in Euclidean space.
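For orientation, the Mean Shift iteration that the abstract compares Mod Shift to can be sketched as follows. This is plain Gaussian-kernel Mean Shift only, not Mod Shift: the repulsion term and the Multicut-related loss are not specified in the abstract and are not reproduced here. The function names, the bandwidth, and the toy points are illustrative assumptions.

```python
import math

def mean_shift(points, bandwidth=1.0, n_iter=50):
    """Plain Mean Shift: move every point to the Gaussian-weighted mean of the data."""
    modes = [tuple(p) for p in points]
    for _ in range(n_iter):
        new_modes = []
        for m in modes:
            # Gaussian attraction weight of each original data point on the current position.
            weights = [
                math.exp(-sum((a - b) ** 2 for a, b in zip(m, p)) / (2 * bandwidth ** 2))
                for p in points
            ]
            total = sum(weights)
            new_modes.append(
                tuple(sum(w * p[d] for w, p in zip(weights, points)) / total
                      for d in range(len(m)))
            )
        modes = new_modes
    return modes

def group_modes(modes, tol=1e-2):
    """Give the same label to points whose final positions lie within `tol` of each other."""
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) < tol:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels = group_modes(mean_shift(points, bandwidth=1.0))
```

In this sketch the number of clusters is not set in advance; it emerges from how many distinct positions the points collapse onto, which is the property the abstract attributes to both Mean Shift and Mod Shift.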
The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and find its actual loss function. It differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset size-dependent tendency to produce overly crisp substructures. Contrary to existing belief, we find that UMAP's high-dimensional similarities are not critical to its success.
Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between two contrastive loss functions: negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain that UMAP embeddings appear more compact than t-SNE plots due to increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods, encompassing both UMAP- and t-SNE-like versions as special cases. Moving from more attraction to more repulsion shifts the focus of the embedding from the continuous, global structure of the data to its more discrete, local structure. Finally, we emphasize the link between contrastive neighbor embeddings and self-supervised contrastive learning. We show that different flavors of contrastive losses can work for both of them with few noise samples.
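To make the two contrastive losses named above concrete, the sketch below writes out their common textbook forms for m noise samples per data sample. It is an illustration under assumptions, not the thesis's derivation: the abstract does not state the exact relation it proves, and the function names and numeric values are hypothetical. In these forms, negative sampling applied to a model q gives the same value as noise-contrastive estimation applied to the rescaled model q · m · ξ, where ξ is the noise density.

```python
import math

def nce_loss(q_pos, q_neg, xi_pos, xi_neg, m):
    """Noise-contrastive estimation with m noise samples per data sample.

    q_* are unnormalized model values; xi_* are noise densities at the same points.
    """
    data_term = -sum(math.log(q / (q + m * xi))
                     for q, xi in zip(q_pos, xi_pos)) / len(q_pos)
    noise_term = -sum(math.log(m * xi / (q + m * xi))
                      for q, xi in zip(q_neg, xi_neg)) / len(q_neg)
    return data_term + m * noise_term

def neg_sampling_loss(q_pos, q_neg, m):
    """Negative sampling: same structure, but the noise term m * xi is replaced by 1."""
    data_term = -sum(math.log(q / (q + 1.0)) for q in q_pos) / len(q_pos)
    noise_term = -sum(math.log(1.0 / (q + 1.0)) for q in q_neg) / len(q_neg)
    return data_term + m * noise_term

# Hypothetical values for a few data (pos) and noise (neg) samples.
m = 5
q_pos, xi_pos = [0.9, 0.5], [0.05, 0.02]
q_neg, xi_neg = [0.1, 0.3, 0.2], [0.04, 0.01, 0.03]

# Rescale the model by m * xi and feed it to NCE.
scaled_pos = [q * m * xi for q, xi in zip(q_pos, xi_pos)]
scaled_neg = [q * m * xi for q, xi in zip(q_neg, xi_neg)]

lhs = nce_loss(scaled_pos, scaled_neg, xi_pos, xi_neg, m)
rhs = neg_sampling_loss(q_pos, q_neg, m)
```

Here `lhs` and `rhs` agree up to floating-point error, because q · m · ξ / (q · m · ξ + m · ξ) simplifies to q / (q + 1).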
User chat clustering using deep learning representations and unsupervised methods for dialog system applications
Dialog systems, commonly called chat bots, are increasingly popular and must
interpret spoken language to understand and communicate with humans. Intent
detection plays a crucial role in developing intelligent conversations in these
systems. Existing implementations require a lot of labeled data, and
acquiring it can be costly and time-consuming. This thesis aims to evaluate existing text
representations, from classical approaches such as Word2Vec and GloVe to current
state-of-the-art pre-trained Transformer models (BERT, RoBERTa, GPT2, and more), for
possibly automating the labeling of unlabeled dialog data through clustering algorithms. The
clustering algorithms tested range from the classical K-Means to more sophisticated approaches
such as HDBSCAN, with dimension reduction techniques (t-SNE, UMAP) as preprocessing steps.
The dataset used for evaluation contains multiple user intents across many domains, with
varying intent taxonomies within the same domain.
Results show that Transformers provide better text representations than the
classical approaches. Nevertheless, ensemble clustering with multiple clustering
algorithms and multiple representations from different sources shows a large
improvement in the final clustering solution. Clustering lower-dimensional UMAP and
t-SNE projections may also perform as well as or even better than clustering
the original embeddings.
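The abstract does not say how the ensemble combines its clusterings. One common option is a co-association consensus, sketched below as an assumption rather than the thesis's method: two utterances end up in the same final cluster when more than a chosen fraction of the base clusterings put them together. The function name, the threshold, and the toy labelings are hypothetical; the label -1 follows the HDBSCAN convention for noise points.

```python
from itertools import combinations

def consensus_clusters(labelings, threshold=0.5):
    """Merge items that share a cluster in more than `threshold` of the base clusterings."""
    n = len(labelings[0])
    parent = list(range(n))  # union-find forest over the items

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Count base clusterings that co-cluster i and j; noise (-1) never counts.
        agree = sum(1 for lab in labelings if lab[i] == lab[j] and lab[i] != -1)
        if agree / len(labelings) > threshold:
            parent[find(i)] = find(j)

    # Relabel union-find roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Hypothetical base clusterings of six utterances, e.g. from different
# embeddings or algorithms; the third one marks the last utterance as noise.
labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, -1],
]
consensus = consensus_clusters(labelings)
```

In this toy case the second clustering disagrees with the other two about the third utterance, and the majority vote overrides it.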
Detective fiction in Cuban society and culture.
PhD
The object of this thesis is to reach towards an understanding of Cuban society through a
study of its detective fiction and, more particularly, of contemporary Cuban society through
the novels of the author and critic Leonardo Padura Fuentes.
The method has been to trace the development of Cuban detective writing and to
read Padura Fuentes in the light of the work of twentieth-century Western European
literary critics and philosophers, including Raymond Williams, Antonio Gramsci, Terry
Eagleton, Roland Barthes, Jean-Paul Sartre, Michel Foucault, Jean-François Lyotard and
Jean Baudrillard, in order to gain a better understanding of the social and historical
context from which this genre emerged.
By concentrating on the literary texts, I have explored readings which lead out into
an analysis of the broader philosophical, political and historical issues raised by the
Cuban revolution. Since it deals primarily with modes of deviance and notions of legality
and justice within the context of the modern state, detective fiction is particularly well
suited to this type of investigation. The intention is to show how this is as valid in the
Cuban context as it is in advanced capitalist societies where such research has already
been carried out with some success.
The thesis comprises an introduction, ten chapters and a conclusion. The chapters
are divided into three sections. Chapters 1 to 3 attempt a broad theoretical, historical and
socio-political analysis of the cultural reality within which the Cuban revolutionary
detective genre emerged. Chapters 4 to 6 analyse the Cuban detective narrative from its
inception in the early part of the twentieth century until the emergence of Leonardo
Padura Fuentes as the foremost exponent of the genre in Cuba after 1991. Chapters 7 to
10 concentrate upon the work of Leonardo Padura Fuentes, offering a reading of his
detective tetralogy informed by the preceding discussion.
The contribution made by the thesis to knowledge of the subject is to build upon the
work of Seymour Menton and Amelia S. Simpson on the development of the Cuban
detective novel and to provide analyses of the pre-Revolutionary Cuban detective
narrative and the work of Leonardo Padura Fuentes for the first time in the English
language. The thesis concludes that the study of this popular genre in Cuba is of crucial
importance to the scholar who wishes to reach as full an understanding of the social
dynamics within that society as possible. In particular, it proves that Cuban detective
fiction provides a useful barometer of social change which records the shifts in the Cuban
Zeitgeist that have taken place over the past century.