
    An efficient Particle Swarm Optimization approach to cluster short texts

    This is the author’s version of a work that was accepted for publication in Information Sciences. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information Sciences, Vol. 265, May 1, 2014, DOI 10.1016/j.ins.2013.12.010.

    Short texts such as evaluations of commercial products, news, FAQs and scientific abstracts are important resources on the Web because people constantly need to use this online information in real life. In this context, the clustering of short texts is a significant analysis task, and a discrete Particle Swarm Optimization (PSO) algorithm named CLUDIPSO has recently shown promising performance on this type of problem. CLUDIPSO obtained high-quality results with small corpora, although a significant deterioration of performance was observed with larger corpora. This article presents CLUDIPSO*, an improved version of CLUDIPSO, which includes a different representation of particles, a more efficient evaluation of the function to be optimized and some modifications to the mutation operator. Experimental results with corpora containing scientific abstracts, news and short legal documents obtained from the Web show that CLUDIPSO* is an effective clustering method for short-text corpora of small and medium size. (C) 2013 Elsevier Inc. All rights reserved.

    The research work is partially funded by the European Commission as part of the WIQ-EI IRSES research project (Grant No. 269180) within the FP 7 Marie Curie People Framework, and it has been developed in the framework of the Microcluster VLC/Campus (International Campus of Excellence) on Multimodal Intelligent Systems. The research work of the first author is partially funded by the program PAID-02-10 2257 (Universitat Politecnica de Valencia) and CONICET (Argentina).

    Cagnina, L.; Errecalde, M.; Ingaramo, D.; Rosso, P. (2014). An efficient Particle Swarm Optimization approach to cluster short texts. Information Sciences. 265:36-49. https://doi.org/10.1016/j.ins.2013.12.010

    Indexing of Audio Databases: Event Log of Broadcast News

    The amount of non-textual media on the Internet is increasing, which creates a greater need to be able to search this type of media. The goal of this thesis is to enable information search by use of soundtracks in audio databases. To learn the content of an audio file, one wants a system that can automatically extract the necessary information. The first step in building such a system is to record what happens at which time in an event log. This thesis covers the beginning of that process. The experiments dealt with detection of pauses lasting longer than 1 second and detection of speaker changes. The corpus used in the experiments consists of news broadcasts from The Norwegian Broadcasting Corporation (NRK) radio. Each broadcast had a transcription, which was used as a reference when evaluating the results. Another corpus, the HUB-4 1997 evaluation data, was used for comparative tests.

    A lot of work on indexing of audio databases has already been conducted. Because corpora differ, the same methods may give varying results. In this thesis, common segmentation methods have been used, with the parameters adapted to give the best possible results on the given corpus. For pause detection, model-based segmentation was used: a Gaussian mixture model was implemented for each of the two events, sound and long pause. For speaker segmentation, experiments with different metric-based segmentation techniques were performed. The Bayesian information criterion (BIC) and a modified version of this criterion were tested with different options and parameter values. A false alarm compensation based on the symmetric Kullback-Leibler distance was implemented in an attempt to reduce the number of false change points.

    The pause detection was not successful. Using the manual transcription as reference, an F-score of 38.1% was obtained when the settings were adjusted to give about the same number of false alarms and false rejections. However, further investigation showed that the transcription had flaws with respect to the labeling of pauses. An evaluation of the wrongly inserted pauses showed that most of these segments actually contained silence or noise. The number of missed pauses was unknown, however, so it was not possible to get a reliable F-score. An attempt was made to label all pauses in the HUB-4 1997 data. With the modified transcription, an F-score of 81.7% was obtained. It is possible that unlabeled pauses still exist in the transcription, as the labeling was performed only by looking at the audio signal. Classification experiments made it clear that using 1st and 2nd order delta coefficients in the feature vectors gave an improvement over using static MFCCs alone. An F-score of 98.8% was obtained from these experiments, which implies that the models are good when the segment boundaries are known. To get trustworthy results from the recognition task, the transcription must be reviewed.

    When using the modified version of BIC and false alarm compensation for speaker change detection, an F-score of 77.1% was obtained. The average mismatch between correctly detected change points and the reference transcription was 339 milliseconds. As a measure of how good the algorithm is, an F-score of 72.8% was obtained with the HUB-4 1997 data. Ajmera et al. (2002) obtained an F-score of 67% with the same data. It became clear that full covariance matrices gave an improvement over diagonal covariance matrices, and that static MFCCs as feature vectors gave better results than MFCCs including delta coefficients. Including pitch as an additional feature did not improve the results.
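
The metric-based speaker segmentation described above can be illustrated with a minimal sketch of the standard BIC change-point test. This is an assumption-laden illustration, not the thesis's implementation: the abstract does not specify the modified criterion, the window sizes, or the parameter values, so the code below uses the textbook formulation with one full-covariance Gaussian per segment. The names `delta_bic`, `logdet_cov` and `penalty_weight` are hypothetical.

```python
import numpy as np


def logdet_cov(frames):
    """Log-determinant of the maximum-likelihood full covariance of the frames.

    `frames` has shape (n_frames, n_features), e.g. one MFCC vector per row.
    """
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(window, split, penalty_weight=1.0):
    """Textbook delta-BIC for a candidate change point at index `split`.

    Compares modelling the whole window with one Gaussian against modelling
    the two halves with separate Gaussians. A positive value favours a
    speaker change at `split`. `penalty_weight` is the usual tuning factor
    (hypothetical default of 1.0; the thesis's value is not stated).
    """
    n, d = window.shape
    left, right = window[:split], window[split:]

    # Likelihood gain from using two full-covariance Gaussians instead of one.
    gain = 0.5 * (
        n * logdet_cov(window)
        - len(left) * logdet_cov(left)
        - len(right) * logdet_cov(right)
    )

    # Penalty for the extra parameters: one mean vector plus one
    # symmetric covariance matrix.
    n_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n)

    return gain - penalty
```

In a full system this test would be evaluated at many candidate positions inside a sliding window, and accepted change points could then be filtered by a second check such as the symmetric Kullback-Leibler distance mentioned in the abstract. Both steps are omitted here because the abstract gives no detail on how they were configured.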