48,159 research outputs found
Clustering documents with active learning using Wikipedia
Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when only limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated with a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering in the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields performance comparable to previous attempts, and that adding constraints improves clustering performance further by up to 20%.
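The idea of turning concept relatedness into pair-wise instance-level constraints can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the relatedness measure, the thresholds, or the active learning loop. The function names, the averaging rule in `doc_relatedness`, and the `must` and `cannot` thresholds below are all assumptions made for illustration, and the relatedness scores are supplied by hand rather than computed from Wikipedia.

```python
from itertools import combinations


def doc_relatedness(concepts_a, concepts_b, relatedness):
    """Average, over concepts in A, of the best relatedness to any concept in B.

    Hypothetical measure: `relatedness` maps unordered concept pairs to a
    score in [0, 1]; identical concepts score 1.0, unknown pairs score 0.0.
    """
    if not concepts_a or not concepts_b:
        return 0.0

    def rel(x, y):
        if x == y:
            return 1.0
        return relatedness.get((x, y), relatedness.get((y, x), 0.0))

    best = [max(rel(a, b) for b in concepts_b) for a in concepts_a]
    return sum(best) / len(best)


def pairwise_constraints(docs, relatedness, must=0.7, cannot=0.1):
    """Return (must_link, cannot_link) lists of document-id pairs.

    `must` and `cannot` are assumed thresholds, not values from the paper.
    """
    must_link, cannot_link = [], []
    for i, j in combinations(sorted(docs), 2):
        # Symmetrise the directed score so the pair order does not matter.
        score = 0.5 * (
            doc_relatedness(docs[i], docs[j], relatedness)
            + doc_relatedness(docs[j], docs[i], relatedness)
        )
        if score >= must:
            must_link.append((i, j))
        elif score <= cannot:
            cannot_link.append((i, j))
    return must_link, cannot_link


if __name__ == "__main__":
    # Toy concept sets; in the paper each concept maps to a Wikipedia article.
    docs = {
        "d1": {"Jaguar (car)", "Engine"},
        "d2": {"Automobile", "Engine"},
        "d3": {"Rainforest"},
    }
    relatedness = {("Jaguar (car)", "Automobile"): 0.8}
    print(pairwise_constraints(docs, relatedness))
```

The resulting pairs could then be passed to any constrained clustering algorithm; pairs that fall between the two thresholds are left unconstrained, which is where an active learning step might ask for supervision.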
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field.
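As a small illustration of the first of the three matrix classes named above, the following sketch builds a term-document matrix from raw counts and compares documents by the cosine of their column vectors. It is a minimal example under simplifying assumptions (whitespace tokenisation, no weighting such as tf-idf, no smoothing); the function names are hypothetical and not taken from the survey or from any project it discusses.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for tokens in tokenised for t in tokens})
    counts = [Counter(tokens) for tokens in tokenised]
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract the vector for document j (one column of the matrix)."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = ["cat sat mat", "cat sat hat", "dog ran"]
    vocab, m = term_document_matrix(docs)
    print(vocab)
    print(cosine(column(m, 0), column(m, 1)))  # documents sharing two of three terms
    print(cosine(column(m, 0), column(m, 2)))  # documents sharing no terms
```

Reading the same matrix by rows instead of columns gives term vectors, so the same `cosine` function can compare terms by the documents they occur in.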
Semantic industrial categorisation based on search engine index
Analysis of specialist language is one of the most pressing
problems when trying to build an intelligent content analysis
system. Identifying the scope of the language used and then understanding the relationships between the language entities is a key problem. A semantic relationship analysis of the search engine index was devised and evaluated. Using a search engine index provides us with access to the widest database of knowledge in any particular field (if not now, then surely in the future). Social network analysis of a keyword collection seems to generate a viable list of the specialist terms and the relationships among them. This approach has been tested in the engineering and medical sectors.
Novel Metaknowledge-based Processing Technique for Multimedia Big Data clustering challenges
Past research has challenged us with the task of showing relational patterns
between text-based data and then clustering for predictive analysis using the Golay
Code technique. We focus on a novel approach to extract metaknowledge in
multimedia datasets. Our collaboration has been an on-going task of studying
the relational patterns between datapoints based on metafeatures extracted from
metaknowledge in multimedia datasets. Those selected are significant to suit
the mining technique we applied, the Golay Code algorithm. In this research paper
we summarize findings in optimization of metaknowledge representation for
23-bit representation of structured and unstructured multimedia data in order
to
Comment: IEEE Multimedia Big Data (BigMM 2015)