3,802 research outputs found
Document Clustering based on Topic Maps
Importance of document clustering is now widely acknowledged by researchers
for better management, smart navigation, efficient filtering, and concise
summarization of large collection of documents like World Wide Web (WWW). The
next challenge lies in semantically performing clustering based on the semantic
contents of the document. The problem of document clustering has two main
components: (1) to represent the document in such a form that inherently
captures semantics of the text. This may also help to reduce dimensionality of
the document, and (2) to define a similarity measure based on the semantic
representation such that it assigns higher numerical values to document pairs
which have higher semantic relationship. Feature space of the documents can be
very challenging for document clustering. A document may contain multiple
topics, it may contain a large set of class-independent general-words, and a
handful class-specific core-words. With these features in mind, traditional
agglomerative clustering algorithms, which are based on either Document Vector
model (DVM) or Suffix Tree model (STC), are less efficient in producing results
with high cluster quality. This paper introduces a new approach for document
clustering based on the Topic Map representation of the documents. The document
is being transformed into a compact form. A similarity measure is proposed
based upon the inferred information through topic maps data and structures. The
suggested method is implemented using agglomerative hierarchal clustering and
tested on standard Information retrieval (IR) datasets. The comparative
experiment reveals that the proposed approach is effective in improving the
cluster quality
Reconstructing Native Language Typology from Foreign Language Usage
Linguists and psychologists have long been studying cross-linguistic
transfer, the influence of native language properties on linguistic performance
in a foreign language. In this work we provide empirical evidence for this
process in the form of a strong correlation between language similarities
derived from structural features in English as Second Language (ESL) texts and
equivalent similarities obtained from the typological features of the native
languages. We leverage this finding to recover native language typological
similarity structure directly from ESL text, and perform prediction of
typological features in an unsupervised fashion with respect to the target
languages. Our method achieves 72.2% accuracy on the typology prediction task,
a result that is highly competitive with equivalent methods that rely on
typological resources.Comment: CoNLL 201
Recent Developments in Document Clustering
This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed
Document Clustering with K-tree
This paper describes the approach taken to the XML Mining track at INEX 2008
by a group at the Queensland University of Technology. We introduce the K-tree
clustering algorithm in an Information Retrieval context by adapting it for
document clustering. Many large scale problems exist in document clustering.
K-tree scales well with large inputs due to its low complexity. It offers
promising results both in terms of efficiency and quality. Document
classification was completed using Support Vector Machines.Comment: 12 pages, INEX 200
- …