83,597 research outputs found
Clustering documents with active learning using Wikipedia
Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated to a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering towards the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields comparable performance to previous attempts; and adding constraints improves clustering performance further by up to 20%
Switcher-random-walks: a cognitive-inspired mechanism for network exploration
Semantic memory is the subsystem of human memory that stores knowledge of
concepts or meanings, as opposed to life specific experiences. The organization
of concepts within semantic memory can be understood as a semantic network,
where the concepts (nodes) are associated (linked) to others depending on
perceptions, similarities, etc. Lexical access is the complementary part of
this system and allows the retrieval of such organized knowledge. While
conceptual information is stored under certain underlying organization (and
thus gives rise to a specific topology), it is crucial to have an accurate
access to any of the information units, e.g. the concepts, for efficiently
retrieving semantic information for real-time needings. An example of an
information retrieval process occurs in verbal fluency tasks, and it is known
to involve two different mechanisms: -clustering-, or generating words within a
subcategory, and, when a subcategory is exhausted, -switching- to a new
subcategory. We extended this approach to random-walking on a network
(clustering) in combination to jumping (switching) to any node with certain
probability and derived its analytical expression based on Markov chains.
Results show that this dual mechanism contributes to optimize the exploration
of different network models in terms of the mean first passage time.
Additionally, this cognitive inspired dual mechanism opens a new framework to
better understand and evaluate exploration, propagation and transport phenomena
in other complex systems where switching-like phenomena are feasible.Comment: 9 pages, 3 figures. Accepted in "International Journal of
Bifurcations and Chaos": Special issue on "Modelling and Computation on
Complex Networks
Easy Semantification of Bioassays
Biological data and knowledge bases increasingly rely on Semantic Web technologies and the use of knowledge graphs for data integration, retrieval and federated queries. We propose a solution for automatically semantifying biological assays. Our solution contrasts the problem of automated semantification as labeling versus clustering where the two methods are on opposite ends of the method complexity spectrum. Characteristically modeling our problem, we find the clustering solution significantly outperforms a deep neural network state-of-the-art labeling approach. This novel contribution is based on two factors: 1) a learning objective closely modeled after the data outperforms an alternative approach with sophisticated semantic modeling; 2) automatically semantifying biological assays achieves a high performance F1 of nearly 83%, which to our knowledge is the first reported standardized evaluation of the task offering a strong benchmark model
Multi-lingual Common Semantic Space Construction via Cluster-consistent Word Embedding
We construct a multilingual common semantic space based on distributional
semantics, where words from multiple languages are projected into a shared
space to enable knowledge and resource transfer across languages. Beyond word
alignment, we introduce multiple cluster-level alignments and enforce the word
clusters to be consistently distributed across multiple languages. We exploit
three signals for clustering: (1) neighbor words in the monolingual word
embedding space; (2) character-level information; and (3) linguistic properties
(e.g., apposition, locative suffix) derived from linguistic structure knowledge
bases available for thousands of languages. We introduce a new
cluster-consistent correlational neural network to construct the common
semantic space by aligning words as well as clusters. Intrinsic evaluation on
monolingual and multilingual QVEC tasks shows our approach achieves
significantly higher correlation with linguistic features than state-of-the-art
multi-lingual embedding learning methods do. Using low-resource language name
tagging as a case study for extrinsic evaluation, our approach achieves up to
24.5\% absolute F-score gain over the state of the art.Comment: 10 page
NASARI: a novel approach to a Semantically-Aware Representation of items
The semantic representation of individual word senses and concepts is of fundamental importance to several applications in Natural Language Processing. To date, concept modeling techniques have in the main based their representation either on lexicographic resources, such as WordNet, or on encyclopedic resources, such as Wikipedia. We propose a vector representation technique that combines the complementary knowledge of both these types of resource. Thanks to its use of explicit semantics combined with a novel cluster-based dimensionality reduction and an effective weighting scheme, our representation attains state-of-the-art performance on multiple datasets in two standard benchmarks: word similarity and sense clustering. We are releasing our vector representations at http://lcl.uniroma1.it/nasari/
A combined weighting for the feature-based method on topological parameters in semantic taxonomy using social media
The textual analysis has become most important task due to the rapid increase of the number of texts that have been continuously generated in several forms such as posts and chats in social media, emails, articles, and news. The management of these texts requires efficient and effective methods, which can handle the linguistic issues that come from the complexity of natural languages. In recent years, the exploitation of semantic features from the lexical sources has been widely investigated by researchers to deal with the issues of âsynonymy and ambiguityâ in the tasks involved in the Social Media like document clustering. The main challenges of exploiting the lexical knowledge sources such as 1WordNet 3.1 in these tasks are how to integrate the various types of semantic relations for capturing additional semantic evidence, and how to settle the high dimensionality of current semantic representing approaches. In this paper, the proposed weighting of features for a new semantic feature-based method as which combined four things as which is âSynonymy, Hypernym, non-taxonomy, and Glossesâ. Therefore, this research proposes a new knowledge-based semantic representation approach for text mining, which can handle the linguistic issues as well as the high dimensionality issue. Thus, the proposed approach consists of two main components: a feature-based method for incorporating the relations in the lexical sources, and a topic-based reduction method to overcome the high dimensionality issue. The proposed method approach will evaluated using WordNet 3.1 in the text clustering and text classification
Image Clustering with External Guidance
The core of clustering is incorporating prior knowledge to construct
supervision signals. From classic k-means based on data compactness to recent
contrastive clustering guided by self-supervision, the evolution of clustering
methods intrinsically corresponds to the progression of supervision signals. At
present, substantial efforts have been devoted to mining internal supervision
signals from data. Nevertheless, the abundant external knowledge such as
semantic descriptions, which naturally conduces to clustering, is regrettably
overlooked. In this work, we propose leveraging external knowledge as a new
supervision signal to guide clustering, even though it seems irrelevant to the
given data. To implement and validate our idea, we design an externally guided
clustering method (Text-Aided Clustering, TAC), which leverages the textual
semantics of WordNet to facilitate image clustering. Specifically, TAC first
selects and retrieves WordNet nouns that best distinguish images to enhance the
feature discriminability. Then, to improve image clustering performance, TAC
collaborates text and image modalities by mutually distilling cross-modal
neighborhood information. Experiments demonstrate that TAC achieves
state-of-the-art performance on five widely used and three more challenging
image clustering benchmarks, including the full ImageNet-1K dataset
Multi-Mode Clustering for Graph-Based Lifelog Retrieval
As part of the 6th Lifelog Search Challenge, this paper presents an approach to arrange Lifelog data in a multi-modal knowledge graph based on cluster hierarchies. We use multiple sequence clustering approaches to address the multi-modal nature of Lifelogs in relation to temporal, spatial, and visual factors. The resulting clusters, along with semantic metadata captions and augmentations based on OpenCLIP, provide for the semantic structure of a graph including all Lifelogs as entries. Textual queries on this hierarchical graph can be expressed to retrieve individual Lifelogs, as well as clusters of Lifelogs
Combining Clustering techniques and Formal Concept Analysis to characterize Interestingness Measures
Formal Concept Analysis "FCA" is a data analysis method which enables to
discover hidden knowledge existing in data. A kind of hidden knowledge
extracted from data is association rules. Different quality measures were
reported in the literature to extract only relevant association rules. Given a
dataset, the choice of a good quality measure remains a challenging task for a
user. Given a quality measures evaluation matrix according to semantic
properties, this paper describes how FCA can highlight quality measures with
similar behavior in order to help the user during his choice. The aim of this
article is the discovery of Interestingness Measures "IM" clusters, able to
validate those found due to the hierarchical and partitioning clustering
methods "AHC" and "k-means". Then, based on the theoretical study of sixty one
interestingness measures according to nineteen properties, proposed in a recent
study, "FCA" describes several groups of measures.Comment: 13 pages, 2 figure
- âŠ