27,670 research outputs found
Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology
This paper presents some experiments in clustering homogeneous XMLdocuments
to validate an existing classification or more generally anorganisational
structure. Our approach integrates techniques for extracting knowledge from
documents with unsupervised classification (clustering) of documents. We focus
on the feature selection used for representing documents and its impact on the
emerging classification. We mix the selection of structured features with fine
textual selection based on syntactic characteristics.We illustrate and evaluate
this approach with a collection of Inria activity reports for the year 2003.
The objective is to cluster projects into larger groups (Themes), based on the
keywords or different chapters of these activity reports. We then compare the
results of clustering using different feature selections, with the official
theme structure used by Inria.Comment: (postprint); This version corrects a couple of errors in authors'
names in the bibliograph
Incremental document map formation: multi-stage approach
The paper presents methodology for the incremental map formation in a multi-stage process of a search engine with the map based user interface1. The architecture of the experimental system allows for comparative evaluation of different constituent technologies for various stages of the process. The quality of the map generation process has been investigated based on a number of clustering and classification measures. Some conclusions concerning the impact of various technological solutions on map quality are presented
Evidence Transfer for Improving Clustering Tasks Using External Categorical Evidence
In this paper we introduce evidence transfer for clustering, a deep learning
method that can incrementally manipulate the latent representations of an
autoencoder, according to external categorical evidence, in order to improve a
clustering outcome. By evidence transfer we define the process by which the
categorical outcome of an external, auxiliary task is exploited to improve a
primary task, in this case representation learning for clustering. Our proposed
method makes no assumptions regarding the categorical evidence presented, nor
the structure of the latent space. We compare our method, against the baseline
solution by performing k-means clustering before and after its deployment.
Experiments with three different kinds of evidence show that our method
effectively manipulates the latent representations when introduced with real
corresponding evidence, while remaining robust when presented with low quality
evidence
Digital Image Access & Retrieval
The 33th Annual Clinic on Library Applications of Data Processing, held at the University of Illinois at Urbana-Champaign in March of 1996, addressed the theme of "Digital Image Access & Retrieval." The papers from this conference cover a wide range of topics concerning digital imaging technology for visual resource collections. Papers covered three general areas: (1) systems, planning, and implementation; (2) automatic and semi-automatic indexing; and (3) preservation with the bulk of the conference focusing on indexing and retrieval.published or submitted for publicatio
A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity
Since the emergence in the popularity of XML for data representation and exchange over the Web, the distribution of XML documents has rapidly increased. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm PCXSS that keeps the heterogeneous XML documents into various groups according to their similar structural and semantic representations. We develop a global criterion function CPSim that progressively measures the similarity between a XML document and existing clusters, ignoring the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate
- …