3 research outputs found

    Discovering structure without labels

    The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction. The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine the number of clusters and the assignment of points to clusters. It iteratively moves points towards crisp clusters like Mean Shift but also has close ties to the Multicut problem via its loss function. As a result, it connects signed graph partitioning to clustering in Euclidean space. The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and find its actual loss function. It differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset size-dependent tendency to produce overly crisp substructures. Contrary to existing belief, we find that UMAP's high-dimensional similarities are not critical to its success. Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between the contrastive loss functions negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain that UMAP embeddings appear more compact than t-SNE plots due to increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods, encompassing both UMAP- and t-SNE-like versions as special cases. 
Moving from more attraction to more repulsion shifts the focus of the embedding from continuous, global structure to more discrete, local structure of the data. Finally, we emphasize the link between contrastive neighbor embeddings and self-supervised contrastive learning. We show that different flavors of contrastive losses can work for both of them with few noise samples.
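
The abstract compares Mod Shift to Mean Shift, which iteratively moves points towards crisp clusters. The block below is a minimal sketch of plain flat-kernel Mean Shift only; it is not the thesis's Mod Shift method and includes no repulsion term. The function names, the `bandwidth` and `tol` parameters, and the toy points are all assumptions made for illustration.

```python
import math


def mean_shift(points, bandwidth, n_iter=50):
    """Flat-kernel Mean Shift (illustrative sketch, not Mod Shift).

    Each point is repeatedly moved to the mean of the ORIGINAL points that
    lie within `bandwidth` of its current position.
    """
    shifted = [tuple(p) for p in points]
    for _ in range(n_iter):
        updated = []
        for p in shifted:
            neighbors = [q for q in points if math.dist(p, q) <= bandwidth]
            if not neighbors:  # defensive guard against rounding edge cases
                updated.append(p)
                continue
            dim = len(p)
            updated.append(
                tuple(sum(q[d] for q in neighbors) / len(neighbors) for d in range(dim))
            )
        shifted = updated
    return shifted


def label_modes(modes, tol=1e-6):
    """Give the same cluster label to points whose converged positions coincide."""
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= tol:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
toy_labels = label_modes(mean_shift(toy_points, bandwidth=1.0))
```

In this sketch the number of clusters is not passed in. It emerges from how many distinct positions the points converge to, so it depends on the assumed `bandwidth`.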

    User chat clustering using deep learning representations and unsupervised methods for dialog system applications

    Dialog systems, commonly called chatbots, are increasingly popular and must interpret human language to understand and communicate with humans. Intent detection plays a crucial role in developing intelligent conversations in these systems. Existing implementations require a lot of labeled data, and acquiring it can be costly and time-consuming. This thesis aims to evaluate existing text representations, from classical approaches such as Word2Vec and GloVe to current state-of-the-art pre-trained Transformer models (BERT, RoBERTa, GPT2, and more), to assess whether unlabeled dialog data can be processed automatically through clustering algorithms. The clustering algorithms tested range from the classical K-Means to more sophisticated approaches such as HDBSCAN, with dimension reduction techniques (t-SNE, UMAP) as preprocessing steps. The evaluation dataset contains user intents across many domains, with varying intent taxonomies within the same domain. Results show that Transformers provide better text representations than the classical approaches. Nevertheless, ensemble clustering with multiple clustering algorithms and multiple representations from different sources shows a massive improvement in the final clustering solution. Applying UMAP and t-SNE to reduce the embeddings to lower dimensions may also perform as well as, or even better than, clustering on the original embeddings.

    Detective fiction in Cuban society and culture.

    PhD. The object of this thesis is to reach towards an understanding of Cuban society through a study of its detective fiction, and more particularly of contemporary Cuban society through the novels of the author and critic Leonardo Padura Fuentes. The method has been to trace the development of Cuban detective writing and to read Padura Fuentes in the light of the work of twentieth-century Western European literary critics and philosophers, including Raymond Williams, Antonio Gramsci, Terry Eagleton, Roland Barthes, Jean-Paul Sartre, Michel Foucault, Jean-François Lyotard and Jean Baudrillard, in order to gain a better understanding of the social and historical context from which this genre emerged. By concentrating on the literary texts, I have explored readings which lead out into an analysis of the broader philosophical, political and historical issues raised by the Cuban revolution. Since it deals primarily with modes of deviance and notions of legality and justice within the context of the modern state, detective fiction is particularly well suited to this type of investigation. The intention is to show that this is as valid in the Cuban context as it is in advanced capitalist societies, where such research has already been carried out with some success. The thesis comprises an introduction, ten chapters and a conclusion. The chapters are divided into three sections. Chapters 1 to 3 attempt a broad theoretical, historical and socio-political analysis of the cultural reality within which the Cuban revolutionary detective genre emerged. Chapters 4 to 6 analyse the Cuban detective narrative from its inception in the early part of the twentieth century until the emergence of Leonardo Padura Fuentes as the foremost exponent of the genre in Cuba after 1991. Chapters 7 to 10 concentrate upon the work of Leonardo Padura Fuentes, offering a reading of his detective tetralogy informed by the preceding discussion.
The contribution made by the thesis to knowledge of the subject is to build upon the work of Seymour Menton and Amelia S. Simpson on the development of the Cuban detective novel, and to provide analyses of the pre-Revolutionary Cuban detective narrative and the work of Leonardo Padura Fuentes for the first time in the English language. The thesis concludes that the study of this popular genre in Cuba is of crucial importance to the scholar who wishes to reach as full an understanding as possible of the social dynamics within that society. In particular, it argues that Cuban detective fiction provides a useful barometer of social change, recording the shifts in the Cuban Zeitgeist that have taken place over the past century.