18,665 research outputs found
A Scalable Algorithm for Metric High-Quality Clustering in Information Retrieval Tasks
We consider the problem of finding efficiently a high quality k-clustering of n points in a (possibly discrete) metric space. Many methods are known when the point are vectors in a real vector space, and the distance function is a standard geometric distance such as L1, L2 (Euclidean) or L2 2 (squared Euclidean distance). In such cases efficiency is often sought via sophisticated multidimensional search structures for speeding up nearest neighbor queries (e.g. variants of kd-trees). Such techniques usually work well in spaces of moderately high dimension say up to 6 or 8). Our target is a scenario in which either the metric space cannot be mapped into a vector space, or, if this mapping is possible, the dimension of such a space is so high as to rule out the use of the above mentioned techniques. This setting is rather typical in Information Retrieval applications. We augment the well known furthest-point-first algorithm for kcenter clustering in metric spaces with a filtering step based on the triangular inequality and we compare this algorithm with some recent fast variants of the classical k-means iterative algorithm augmented with an analogous filtering schemes. We extensively tested the two solutions on synthetic geometric data and real data from Information Retrieval applications. The main conclusion we draw is that our modified furthest-point-first method attains solutions of better or comparable quality within a fraction of the time used by the fast k-means algorithm. Thus our algorithm is valuable when either real time constraints or the large amount of data highlight the poor scalability of traditional clustering methods
Improving speaker turn embedding by crossmodal transfer learning from face embedding
Learning speaker turn embeddings has shown considerable improvement in
situations where conventional speaker modeling approaches fail. However, this
improvement is relatively limited when compared to the gain observed in face
embedding learning, which has been proven very successful for face verification
and clustering tasks. Assuming that face and voices from the same identities
share some latent properties (like age, gender, ethnicity), we propose three
transfer learning approaches to leverage the knowledge from the face domain
(learned from thousands of images and identities) for tasks in the speaker
domain. These approaches, namely target embedding transfer, relative distance
transfer, and clustering structure transfer, utilize the structure of the
source face embedding space at different granularities to regularize the target
speaker turn embedding space as optimizing terms. Our methods are evaluated on
two public broadcast corpora and yield promising advances over competitive
baselines in verification and audio clustering tasks, especially when dealing
with short speaker utterances. The analysis of the results also gives insight
into characteristics of the embedding spaces and shows their potential
applications
Semantic distillation: a method for clustering objects by their contextual specificity
Techniques for data-mining, latent semantic analysis, contextual search of
databases, etc. have long ago been developed by computer scientists working on
information retrieval (IR). Experimental scientists, from all disciplines,
having to analyse large collections of raw experimental data (astronomical,
physical, biological, etc.) have developed powerful methods for their
statistical analysis and for clustering, categorising, and classifying objects.
Finally, physicists have developed a theory of quantum measurement, unifying
the logical, algebraic, and probabilistic aspects of queries into a single
formalism. The purpose of this paper is twofold: first to show that when
formulated at an abstract level, problems from IR, from statistical data
analysis, and from physical measurement theories are very similar and hence can
profitably be cross-fertilised, and, secondly, to propose a novel method of
fuzzy hierarchical clustering, termed \textit{semantic distillation} --
strongly inspired from the theory of quantum measurement --, we developed to
analyse raw data coming from various types of experiments on DNA arrays. We
illustrate the method by analysing DNA arrays experiments and clustering the
genes of the array according to their specificity.Comment: Accepted for publication in Studies in Computational Intelligence,
Springer-Verla
- …