9 research outputs found
Unsupervised Learning via Total Correlation Explanation
Learning by children and animals occurs effortlessly and largely without
obvious supervision. Successes in automating supervised learning have not
translated to the more ambiguous realm of unsupervised learning where goals and
labels are not provided. Barlow (1961) suggested that the signal that brains
leverage for unsupervised learning is dependence, or redundancy, in the sensory
environment. Dependence can be characterized using the information-theoretic
multivariate mutual information measure called total correlation. The principle
of Total Cor-relation Ex-planation (CorEx) is to learn representations of data
that "explain" as much dependence in the data as possible. We review some
manifestations of this principle along with successes in unsupervised learning
problems across diverse domains including human behavior, biology, and
language.Comment: Invited contribution for IJCAI 2017 Early Career Spotlight. 5 pages,
1 figur
Deep unsupervised clustering with Gaussian mixture variational autoencoders
We study a variant of the variational autoencoder model with a Gaussian mixture as a prior distribution, with the goal of performing unsupervised clustering through deep generative models. We observe that the standard variational approach in these models is unsuited for unsupervised clustering, and mitigate this problem by leveraging a principled information-theoretic regularisation term known as consistency violation. Adding this term to the standard variational optimisation objective yields networks with both meaningful internal representations and well-defined clusters. We demonstrate the performance of this scheme on synthetic data, MNIST and SVHN, showing that the obtained clusters are distinct, interpretable and result in achieving higher performance on unsupervised clustering classification than previous approaches
Group Minds and the Case of Wikipedia
Group-level cognitive states are widely observed in human social systems, but
their discussion is often ruled out a priori in quantitative approaches. In
this paper, we show how reference to the irreducible mental states and
psychological dynamics of a group is necessary to make sense of large scale
social phenomena. We introduce the problem of mental boundaries by reference to
a classic problem in the evolution of cooperation. We then provide an explicit
quantitative example drawn from ongoing work on cooperation and conflict among
Wikipedia editors, showing how some, but not all, effects of individual
experience persist in the aggregate. We show the limitations of methodological
individualism, and the substantial benefits that come from being able to refer
to collective intentions, and attributions of cognitive states of the form
"what the group believes" and "what the group values".Comment: 21 pages, 6 figures; matches published versio
RefiNym: using names to refine types
Source code is bimodal: it combines a formal, algorithmic channel and a natural language channel of identifiers and comments. In this work, we model the bimodality of code with name flows, an assignment flow graph augmented to track identifier names. Conceptual types are logically distinct types that do not always coincide with program types. Passwords and URLs are example conceptual types that can share the program type string. Our tool, RefiNym, is an unsupervised method that mines a lattice of conceptual types from name flows and reifies them into distinct nominal types. For string, RefiNym finds and splits conceptual types originally merged into a single type, reducing the number of same-type variables per scope from 8.7 to 2.2 while eliminating 21.9% of scopes that have more than one same-type variable in scope. This makes the code more self-documenting and frees the type system to prevent a developer from inadvertently assigning data across conceptual types
An Unsupervised Cluster: Learning Water Customer Behavior Using Variation of Information on a Reconstructed Phase Space
The unsupervised clustering algorithm described in this dissertation addresses the need to divide a population of water utility customers into groups based on their similarities and differences, using only the measured flow data collected by water meters. After clustering, the groups represent customers with similar consumption behavior patterns and provide insight into ‘normal’ and ‘unusual’ customer behavior patterns. This research focuses upon individually metered water utility customers and includes both residential and commercial customer accounts serviced by utilities within North America. The contributions of this dissertation not only represent a novel academic work, but also solve a practical problem for the utility industry. This dissertation introduces a method of agglomerative clustering using information theoretic distance measures on Gaussian mixture models within a reconstructed phase space. The clustering method accommodates a utility’s limited human, financial, computational, and environmental resources. The proposed weighted variation of information distance measure for comparing Gaussian mixture models places emphasis upon those behaviors whose statistical distributions are more compact over those behaviors with large variation and contributes a novel addition to existing comparison options