
    Multi-view constrained clustering with an incomplete mapping between views

    Multi-view learning algorithms typically assume a complete bipartite mapping between the different views in order to exchange information during the learning process. However, many applications provide only a partial mapping between the views, creating a challenge for current methods. To address this problem, we propose a multi-view algorithm based on constrained clustering that can operate with an incomplete mapping. Given a set of pairwise constraints in each view, our approach propagates these constraints using a local similarity measure to those instances that can be mapped to the other views, allowing the propagated constraints to be transferred across views via the partial mapping. It uses co-EM to iteratively estimate the propagation within each view based on the current clustering model, transfer the constraints across views, and then update the clustering model. By alternating the learning process between views, this approach produces a unified clustering model that is consistent with all views. We show that this approach significantly improves clustering performance over several other methods for transferring constraints and allows multi-view clustering to be reliably applied when given a limited mapping between the views. Our evaluation reveals that the propagated constraints have high precision with respect to the true clusters in the data, explaining their benefit to clustering performance in both single- and multi-view learning scenarios.
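
    The constraint-propagation step can be illustrated with a minimal sketch: a pairwise constraint on instance i is extended to any neighbor of i whose Gaussian similarity to i clears a threshold. All names, the similarity choice, and the threshold below are illustrative assumptions, not the paper's actual implementation (which couples propagation with co-EM across views):

    ```python
    import numpy as np

    def propagate_constraints(X, constraints, threshold=0.8):
        """Extend each pairwise constraint (i, j) to (i2, j) whenever
        instance i2 is sufficiently similar to i under a Gaussian local
        similarity. Illustrative sketch, not the paper's API."""
        # Gaussian (RBF) similarity between all pairs of instances
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        sim = np.exp(-sq_dists)

        propagated = set(constraints)
        for (i, j) in constraints:
            for i2 in range(len(X)):
                if i2 != i and sim[i, i2] > threshold:
                    propagated.add((i2, j))
        return propagated

    # Toy data: points 0 and 1 are close, point 2 is far away.
    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
    must_link = {(0, 2)}
    result = propagate_constraints(X, must_link)
    ```

    On this toy input the constraint (0, 2) is propagated to (1, 2), since point 1 lies in point 0's local neighborhood; in the full method such propagated constraints would then be transferred to the other view through the partial mapping.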

    Russian word sense induction by clustering averaged word embeddings

    The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE-2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive: we represented contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups corresponding to the ambiguous word senses. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word sense induction. Comment: Proceedings of the 24th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue-2018)
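
    The pipeline is simple enough to sketch end to end: average the embeddings of each context, then cluster the averages. The toy embeddings and the minimal k-means below are stand-ins for the off-the-shelf pre-trained models and mainstream clustering tools the paper uses:

    ```python
    import numpy as np

    # Toy pre-trained embeddings; a real system would load e.g. word2vec vectors.
    emb = {
        "river": np.array([1.0, 0.0]), "water": np.array([0.9, 0.1]),
        "money": np.array([0.0, 1.0]), "loan":  np.array([0.1, 0.9]),
        "bank":  np.array([0.5, 0.5]),
    }

    def context_vector(words):
        """Represent a context as the average of its word embeddings."""
        return np.mean([emb[w] for w in words if w in emb], axis=0)

    def kmeans(vectors, k=2, iters=20, seed=0):
        """Minimal k-means; any mainstream clustering library would do."""
        rng = np.random.default_rng(seed)
        centers = vectors[rng.choice(len(vectors), k, replace=False)]
        for _ in range(iters):
            dists = ((vectors[:, None] - centers[None]) ** 2).sum(-1)
            labels = np.argmin(dists, axis=1)
            centers = np.array([vectors[labels == c].mean(0) for c in range(k)])
        return labels

    # Four contexts of the ambiguous word "bank": two riverine, two financial.
    contexts = [["bank", "river", "water"], ["bank", "water"],
                ["bank", "money"], ["bank", "loan", "money"]]
    vecs = np.vstack([context_vector(c) for c in contexts])
    labels = kmeans(vecs, k=2)
    ```

    The two river contexts and the two finance contexts end up in separate clusters, i.e. the induced groups correspond to the two senses of "bank".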

    Centroid-Based Lexical Clustering

    Conventional lexical-clustering algorithms treat text fragments as a mixed collection of words, computing the semantic similarity between fragments from how many words they have in common. While this technique is appropriate for clustering large textual collections, it performs poorly when clustering small texts such as sentences, because two sentences may be linguistically similar despite having no words in common. This chapter presents a new version of the k-means method for sentence-level text clustering that relies on related synonyms to construct rich semantic vectors. These vectors represent a sentence using linguistic information drawn from a lexical database, which is used to determine the actual sense of a word based on the context in which it occurs. Thus, while the traditional k-means method relies on calculating the distance between patterns, the proposed version operates by calculating the semantic similarity between sentences. This allows it to capture a higher degree of the semantic or linguistic information present in the clustered sentences. Experimental results illustrate that the proposed clustering algorithm performs favorably against other well-known clustering algorithms on several standard datasets.
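
    The core idea, that two sentences with no words in common can still be semantically close once synonyms are taken into account, can be sketched as follows. The tiny synonym lexicon stands in for a real lexical database such as WordNet, and all names are illustrative:

    ```python
    import math

    # Toy synonym lexicon standing in for a lexical database like WordNet.
    synonyms = {
        "car": {"car", "automobile", "vehicle"},
        "automobile": {"car", "automobile", "vehicle"},
        "fast": {"fast", "quick", "rapid"},
        "quick": {"fast", "quick", "rapid"},
    }

    def semantic_vector(sentence, vocab):
        """Bag-of-concepts vector: a word activates every vocabulary
        dimension that is one of its synonyms, not just exact matches."""
        vec = []
        for term in vocab:
            hit = any(term in synonyms.get(w, {w}) for w in sentence)
            vec.append(1.0 if hit else 0.0)
        return vec

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    vocab = sorted({"car", "automobile", "fast", "quick"})
    s1 = ["the", "car", "is", "fast"]
    s2 = ["a", "quick", "automobile"]
    sim = cosine(semantic_vector(s1, vocab), semantic_vector(s2, vocab))
    ```

    The two sentences share no surface words, so a word-overlap measure scores them zero, yet their synonym-expanded semantic vectors are highly similar; this is the gap the proposed semantic k-means variant exploits.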

    Exploiting extensible background knowledge for clustering-based automatic keyphrase extraction

    Keyphrases are single- or multi-word phrases that describe the essential content of a document. Keyphrase extraction methods often utilize an external knowledge source such as WordNet to obtain relation information about terms and thus improve results, but a sole knowledge source is often limited; this is identified as the coverage limitation problem. In this paper, we introduce SemCluster, a clustering-based unsupervised keyphrase extraction method that addresses the coverage limitation problem through an extensible approach that integrates an internal ontology (i.e., WordNet) with other knowledge sources to gain wider background knowledge. SemCluster is evaluated against three unsupervised methods, TextRank, ExpandRank, and KeyCluster, under the F1 measure. The evaluation results demonstrate that SemCluster has better accuracy and computational efficiency and is more robust when dealing with documents from different domains.
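
    A minimal sketch of the clustering-then-exemplar idea behind such methods: candidate terms are grouped by relatedness scores obtained from background knowledge, and the most central term of each group is taken as a keyphrase. The relatedness table and the single-link grouping below are illustrative assumptions, not SemCluster's actual components:

    ```python
    # Toy relatedness scores standing in for a background knowledge source.
    related = {
        ("neural", "network"): 0.9, ("network", "layer"): 0.8,
        ("corpus", "text"): 0.85, ("text", "document"): 0.8,
    }

    def relatedness(a, b):
        return related.get((a, b)) or related.get((b, a)) or 0.0

    def cluster_terms(terms, threshold=0.5):
        """Group candidate terms whose relatedness to some member of an
        existing group clears a threshold (simple single-link grouping)."""
        clusters = []
        for t in terms:
            for c in clusters:
                if any(relatedness(t, u) >= threshold for u in c):
                    c.append(t)
                    break
            else:
                clusters.append([t])
        return clusters

    def exemplar(cluster):
        """Pick the term most related to the rest of its cluster."""
        return max(cluster, key=lambda t: sum(relatedness(t, u) for u in cluster))

    terms = ["neural", "network", "layer", "corpus", "text", "document"]
    clusters = cluster_terms(terms)
    keyphrases = [exemplar(c) for c in clusters]
    ```

    On this toy input the terms split into a neural-network group and a corpus group, and the most connected term of each is returned as its keyphrase. Integrating several knowledge sources simply widens the set of term pairs for which a nonzero relatedness is available, which is how the coverage limitation problem is mitigated.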

    Graph-based Neural Multi-Document Summarization

    We propose a neural multi-document summarization (MDS) system that incorporates sentence relation graphs. We employ a Graph Convolutional Network (GCN) on the relation graphs, with sentence embeddings obtained from Recurrent Neural Networks as input node features. Through multiple layer-wise propagation, the GCN generates high-level hidden sentence features for salience estimation. We then use a greedy heuristic to extract salient sentences while avoiding redundancy. In our experiments on DUC 2004, we consider three types of sentence relation graphs and demonstrate the advantage of combining sentence relations in graphs with the representation power of deep neural networks. Our model improves upon traditional graph-based extractive approaches and the vanilla GRU sequence model with no graph, and it achieves competitive results against other state-of-the-art multi-document summarization systems. Comment: In CoNLL 201
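
    The two stages, GCN propagation over a sentence relation graph followed by greedy, redundancy-aware extraction, can be sketched as follows. The graph, scores, and similarity values are toy stand-ins, and the single layer shown is only the standard normalized-adjacency propagation, not the paper's full architecture:

    ```python
    import numpy as np

    def gcn_layer(A, H, W):
        """One graph-convolution step: add self-loops, symmetrically
        normalize the adjacency, mix neighbor features, apply ReLU."""
        A_hat = A + np.eye(len(A))
        D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
        return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

    def greedy_select(scores, sim, k=2, max_sim=0.7):
        """Pick top-scoring sentences, skipping any too similar to a pick."""
        chosen = []
        for i in np.argsort(scores)[::-1]:
            if all(sim[i, j] < max_sim for j in chosen):
                chosen.append(int(i))
            if len(chosen) == k:
                break
        return chosen

    # Toy graph of 3 sentences: 0 and 1 are near-duplicates, 2 differs.
    A = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])
    H = np.eye(3)                # stand-in for RNN sentence embeddings
    W = np.ones((3, 2)) * 0.5    # stand-in for learned layer weights
    H2 = gcn_layer(A, H, W)      # propagated hidden sentence features

    sim = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]])
    scores = np.array([0.9, 0.8, 0.5])
    summary = greedy_select(scores, sim, k=2)
    ```

    Even though sentence 1 scores higher than sentence 2, the greedy step skips it as redundant with sentence 0 and selects sentences 0 and 2 instead.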

    Spectral Clustering with Pairwise Constraints Tuned by Gaussian Kernels

    We consider the problem of spectral clustering partially supervised by constraints of the "must-link" and "cannot-link" form. Such constraints arise frequently in various problems, such as coreference resolution in natural language processing. The approach developed in this paper learns a new representation space for the data, together with a new distance in that space. The representation is obtained via a linear transformation of the spectral embedding of the data. The constraints are expressed with Gaussian functions that locally readjust the similarities between objects. This yields a global, non-convex optimization problem, and the model is learned through gradient descent techniques. We evaluate our algorithm on standard datasets and compare it to various state-of-the-art algorithms, such as [14,18,32]. The results on these datasets, as well as on the dataset of the CoNLL-2012 coreference task, show that our algorithm significantly improves the quality of the clusters obtained by previous approaches and scales more robustly.
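
    A simplified sketch of the underlying idea: must-link and cannot-link constraints locally rescale a Gaussian affinity matrix before a standard spectral embedding and k-means step. The paper instead learns a linear transformation of the spectral embedding by gradient descent; the hard reweighting and farthest-first initialization below are illustrative shortcuts:

    ```python
    import numpy as np

    def constrained_spectral(X, must, cannot, k=2, sigma=1.0):
        """Spectral clustering where pairwise constraints locally rescale
        a Gaussian affinity before the spectral embedding. Simplified
        sketch, not the paper's learned linear transformation."""
        d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
        W = np.exp(-d2 / (2 * sigma ** 2))
        for i, j in must:
            W[i, j] = W[j, i] = 1.0      # must-link: pull pair together
        for i, j in cannot:
            W[i, j] = W[j, i] = 0.0      # cannot-link: push pair apart
        L = np.diag(W.sum(1)) - W        # unnormalized graph Laplacian
        _, vecs = np.linalg.eigh(L)
        emb = vecs[:, :k]                # embed in the k smallest eigenvectors
        # k-means on the embedding, with deterministic farthest-first seeding
        centers = [emb[0]]
        for _ in range(k - 1):
            d = np.min([((emb - c) ** 2).sum(1) for c in centers], axis=0)
            centers.append(emb[d.argmax()])
        centers = np.array(centers)
        for _ in range(10):
            labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), 1)
            centers = np.array([emb[labels == c].mean(0) for c in range(k)])
        return labels

    # Two well-separated toy groups plus one constraint of each kind.
    X = np.array([[0.0, 0.0], [0.0, 0.1], [4.0, 4.0], [4.0, 4.1]])
    labels = constrained_spectral(X, must=[(0, 1)], cannot=[(1, 2)])
    ```

    The constraints reshape the affinity graph before the eigendecomposition, so the spectral embedding itself, and not only the final assignment step, reflects the supervision.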