Search CORE

233 research outputs found

An Information Retrieval System for Performing Hierarchical Document Clustering

Author: Hagen Eric
Publication venue: Dartmouth Digital Commons
Publication date: 30/05/1997

This thesis presents a system for web-based information retrieval that supports precise and informative post-query organization (automated document clustering by topic) to decrease real search time on the part of the user. Most existing information retrieval systems depend on the user to perform intelligent, specific queries with Boolean operators in order to minimize the set of returned documents. The user essentially must guess the appropriate keywords before performing the query. Other systems use a vector space model, which is better suited to the document similarity operations that permit hierarchical clustering of returned documents by topic. This allows post-query refinement by the user. The system we propose is a hybrid between these two systems, compatible with the former while providing the enhanced document organization permitted by the latter.
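The vector space idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's system: it assumes raw term-frequency vectors, cosine similarity, and average-link agglomerative merging, and the function names `tf_vector`, `cosine`, and `agglomerate` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(docs):
    """Average-link agglomerative clustering over returned documents.

    Returns the merge history as (left_cluster, right_cluster, similarity)
    tuples, which together describe the cluster hierarchy.
    """
    vectors = [tf_vector(d) for d in docs]
    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[x], vectors[y])
                        for x in clusters[i] for y in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        merges.append((clusters[i], clusters[j], score))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Calling `agglomerate` on the documents returned by a query yields a merge history from which a topic hierarchy can be read off; a production system would likely use weighted terms (for example TF-IDF) and a more efficient merge strategy.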

Dartmouth Digital Commons (Dartmouth College)

High Quality, Efficient Hierarchical Document Clustering Using Closed Interesting Itemsets

Author: Kender John R.
Malik Hassan H.
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2006

High dimensionality remains a significant challenge for document clustering. Recent approaches used frequent itemsets and closed frequent itemsets to reduce dimensionality and to improve the efficiency of hierarchical document clustering. In this paper, we introduce the notion of "closed interesting" itemsets (i.e., closed itemsets with high interestingness). We provide heuristics such as "super item" to efficiently mine these itemsets and show that they provide significant dimensionality reduction over closed frequent itemsets. Using "closed interesting" itemsets, we propose a new hierarchical document clustering method that outperforms state-of-the-art agglomerative, partitioning, and frequent-itemset-based methods in terms of both FScore and Entropy, without requiring dataset-specific parameter tuning. We evaluate twenty interestingness measures on nine standard datasets and show that, when used to generate "closed interesting" itemsets and to select parent nodes, Mutual Information, Added Value, Yule's Q, and Chi-Square offer the best clustering performance, regardless of the characteristics of the underlying dataset. We also show that our method is more scalable and achieves better run-time performance compared to leading approaches. On a dual-processor machine, our method scaled sub-linearly and was able to cluster 200K documents in about 40 seconds.
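To make the terminology concrete, here is a minimal sketch of two ingredients the abstract names: closed frequent itemsets and one interestingness measure, Yule's Q. It is a brute-force illustration, not the paper's mining algorithm or its "super item" heuristic; documents are assumed to be sets of terms, Yule's Q is shown only for a pair of terms, and the function names are hypothetical.

```python
from itertools import combinations


def support(itemset, docs):
    """Number of documents that contain every term in the itemset."""
    return sum(1 for d in docs if itemset <= d)


def closed_frequent_itemsets(docs, min_support):
    """Brute-force enumeration (assumption: small vocabulary).

    An itemset is frequent if its support is at least min_support, and
    closed if no proper superset has the same support.
    """
    vocab = sorted(set().union(*docs))
    frequent = {}
    for size in range(1, len(vocab) + 1):
        found = False
        for combo in combinations(vocab, size):
            itemset = frozenset(combo)
            s = support(itemset, docs)
            if s >= min_support:
                frequent[itemset] = s
                found = True
        if not found:
            # Anti-monotonicity: no larger itemset can be frequent.
            break
    return {i: s for i, s in frequent.items()
            if not any(i < j and s == t for j, t in frequent.items())}


def yules_q(x, y, docs):
    """Yule's Q for two terms, from their 2x2 co-occurrence table."""
    both = sum(1 for d in docs if x in d and y in d)
    only_x = sum(1 for d in docs if x in d and y not in d)
    only_y = sum(1 for d in docs if x not in d and y in d)
    neither = sum(1 for d in docs if x not in d and y not in d)
    denom = both * neither + only_x * only_y
    return (both * neither - only_x * only_y) / denom if denom else 0.0
```

Filtering the closed itemsets by a threshold on such a measure would give one possible reading of "closed interesting" itemsets; how the paper extends the measures beyond term pairs is not stated in the abstract.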

Crossref

Columbia University Academic Commons

Privileged information for hierarchical document clustering: a metric learning approach

Author: Domingues Marcos Aurelio
Hruschka Eduardo Raul
Marcacini Ricardo M.
Rezende Solange Oliveira
Publication venue: Stockholm

Traditional hierarchical text clustering methods assume that the documents are represented only by “technical information”, i.e., keywords, phrases, expressions and named entities that can be directly extracted from the texts. However, in many scenarios there is additional, valuable information about the documents that is usually disregarded during the clustering task, such as user-validated tags, annotations and comments from experts, dictionaries and domain ontologies. Recently, Vapnik introduced a new learning paradigm, called LUPI (Learning Using Privileged Information), which allows the incorporation of this additional (privileged) information in a supervised learning setting. We investigated the incorporation of privileged information in an unsupervised setting. The key idea in our proposed approach is to extract important relationships among documents represented in the privileged information dimensional space to learn a more accurate metric for text clustering in the technical information space. A thorough experimental evaluation indicates that the incorporation of privileged information through metric learning significantly improves the hierarchical clustering accuracy. Funding: São Paulo Research Foundation (FAPESP) (grants 2010/20564-8, 2011/17366-2, 2011/19850-9, 2012/13830-9, 2013/16039-3, 2013/22547-1), PROPP/UFMS, CAPES, CNP

A keyquery-based classification system for CORE

Author: Gollub Tim
Hagen Matthias
Stein Benno
Völske Michael
Publication date: 26/04/2017

We apply keyquery-based taxonomy composition to compute a classification system for the CORE dataset, a shared crawl of about 850,000 scientific papers. Keyquery-based taxonomy composition can be understood as a two-phase hierarchical document clustering technique that utilizes search queries as cluster labels. In the first phase, the document collection is indexed by a reference search engine, and the documents are tagged with the search queries they are relevant for, their so-called keyqueries. In the second phase, a hierarchical clustering is formed from the keyqueries within an iterative process. We use the explicit topic model ESA as the document retrieval model in order to index the CORE dataset in the reference search engine. Under the ESA retrieval model, documents are represented as vectors of similarities to Wikipedia articles, a methodology proven to be advantageous for text categorization tasks. Our paper presents the generated taxonomy and reports on quantitative properties such as document coverage and processing requirements.
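The first phase described in this abstract, tagging documents with the queries they are relevant for, might be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the reference engine here is a toy term-frequency ranker rather than ESA, "relevant" is assumed to mean "appears in the top `k` results", and `search`, `keyqueries`, `k`, and `max_terms` are hypothetical names.

```python
from itertools import combinations


def search(query, docs, k):
    """Toy reference engine (assumption): keep documents containing every
    query term, rank them by summed term frequency, return the top k ids."""
    scored = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        if all(term in tokens for term in query):
            score = sum(tokens.count(term) for term in query)
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def keyqueries(docs, vocabulary, k=2, max_terms=2):
    """Tag each document with every candidate query that retrieves it
    among the top k results of the reference engine."""
    tags = {doc_id: [] for doc_id in docs}
    for n in range(1, max_terms + 1):
        for query in combinations(sorted(vocabulary), n):
            for doc_id in search(query, docs, k):
                tags[doc_id].append(query)
    return tags
```

The resulting query tags can then serve as candidate cluster labels for the second phase; the iterative procedure that builds the hierarchy from them is not detailed in the abstract and is not sketched here.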

Online-Publikationssystem der Bauhaus-Universität Weimar

Digitale Bibliothek Thüringen

Nomenclature and Contemporary Affirmation of the Unsupervised Learning in Text and Document Mining

Author: Annaluri Sreenivasa Rao
Prof. S. Ramakrishna
Publication venue: Global Journals Inc. (US)
Publication date: 21/02/2015

Document clustering is primarily a method for simplifying document search, analysis, and content review; it is also a process of automatically classifying documents of similar type into relevant clusters within a clustering hierarchy. This paper reviews related work in the field of document clustering, from simple word- and phrase-based techniques to present-day complex techniques such as statistical analysis and machine learning, and illustrates their implications for future research.

Global Journal of Computer Science and Technology (GJCST)