86,276 research outputs found
EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING
This paper provides a solution to the issue: “How can we use Wikipedia based concepts in document\ud
clustering with lesser human involvement, accompanied by effective improvements in result?” In the\ud
devised system, we propose a method to exploit the importance of N-grams in a document and use\ud
Wikipedia based additional knowledge for GAAC based document clustering. The importance of N-grams\ud
in a document depends on several features including, but not limited to: frequency, position of their\ud
occurrence in a sentence and the position of the sentence in which they occur, in the document. First, we\ud
introduce a new similarity measure, which takes the weighted N-gram importance into account, in the\ud
calculation of similarity measure while performing document clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base both, to remove noisy entries from the extracted N-grams and to reduce the information gap between N-grams that are conceptually-related, which do not have a match owing to differences in writing scheme or strategies. Our experimental results on the publicly available text dataset clearly show that our devised system has a significant improvement in performance over bag-of-words based state-of-the-art systems in this area
Latent Class Model with Application to Speaker Diarization
In this paper, we apply a latent class model (LCM) to the task of speaker
diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in
that it uses soft information and avoids premature hard decisions in its
iterations. In contrast to the VB method, which is based on a generative model,
LCM provides a framework allowing both generative and discriminative models.
The discriminative property is realized through the use of i-vector (Ivec),
probabilistic linear discriminative analysis (PLDA), and a support vector
machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid are introduced. In addition, three further improvements are
applied to enhance its performance. 1) Adding neighbor windows to extract more
speaker information for each short segment. 2) Using a hidden Markov model to
avoid frequent speaker change points. 3) Using an agglomerative hierarchical
cluster to do initialization and present hard and soft priors, in order to
overcome the problem of initial sensitivity. Experiments on the National
Institute of Standards and Technology Rich Transcription 2009 speaker
diarization database, under the condition of a single distant microphone, show
that the diarization error rate (DER) of the proposed methods has substantial
relative improvements compared with mainstream systems. Compared to the VB
method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments
on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial
conditions also show that the proposed LCM-Ivec-Hybrid system has the best
overall performance
Fast Algorithm and Implementation of Dissimilarity Self-Organizing Maps
In many real world applications, data cannot be accurately represented by
vectors. In those situations, one possible solution is to rely on dissimilarity
measures that enable sensible comparison between observations. Kohonen's
Self-Organizing Map (SOM) has been adapted to data described only through their
dissimilarity matrix. This algorithm provides both non linear projection and
clustering of non vector data. Unfortunately, the algorithm suffers from a high
cost that makes it quite difficult to use with voluminous data sets. In this
paper, we propose a new algorithm that provides an important reduction of the
theoretical cost of the dissimilarity SOM without changing its outcome (the
results are exactly the same as the ones obtained with the original algorithm).
Moreover, we introduce implementation methods that result in very short running
times. Improvements deduced from the theoretical cost model are validated on
simulated and real world data (a word list clustering problem). We also
demonstrate that the proposed implementation methods reduce by a factor up to 3
the running time of the fast algorithm over a standard implementation
- …