Document Clustering with K-tree
This paper describes the approach taken to the XML Mining track at INEX 2008
by a group at the Queensland University of Technology. We introduce the K-tree
clustering algorithm in an Information Retrieval context by adapting it for
document clustering. Many large-scale problems exist in document clustering.
K-tree scales well with large inputs due to its low complexity. It offers
promising results both in terms of efficiency and quality. Document
classification was completed using Support Vector Machines. (Comment: 12 pages, INEX 2008.)
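K-tree inserts each document vector by descending to the nearest centroid and splitting full nodes with k-means (k = 2). A much-simplified, single-level sketch of that idea; the class name, bucket capacity, and data are illustrative, not taken from the paper:

```python
import random

random.seed(0)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def two_means(vectors, iters=10):
    # k-means with k = 2: the split operation applied to a full bucket.
    c1, c2 = random.sample(vectors, 2)
    g1, g2 = [], []
    for _ in range(iters):
        g1 = [v for v in vectors if sq_dist(v, c1) <= sq_dist(v, c2)]
        g2 = [v for v in vectors if sq_dist(v, c1) > sq_dist(v, c2)]
        if not g1 or not g2:
            return [vectors]
        c1, c2 = centroid(g1), centroid(g2)
    return [g1, g2]

class OneLevelKTree:
    # Single-level sketch: leaves are buckets of document vectors;
    # a bucket that exceeds its capacity is split via 2-means.
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.leaves = [[]]

    def insert(self, v):
        # Descend: choose the bucket whose centroid is nearest to v.
        best = min(self.leaves,
                   key=lambda leaf: sq_dist(v, centroid(leaf)) if leaf else 0.0)
        best.append(v)
        if len(best) > self.capacity:
            self.leaves.remove(best)
            self.leaves.extend(two_means(best))

docs = [(0.1, 0.0), (0.0, 0.2), (0.2, 0.1),
        (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
tree = OneLevelKTree(capacity=3)
for d in docs:
    tree.insert(d)
# tree.leaves now holds two buckets separating the two dense regions
```

The real K-tree keeps splitting internal nodes as well, which is what gives it its low complexity on large inputs.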
Random Indexing K-tree
Random Indexing (RI) K-tree is the combination of two algorithms for
clustering. Many large-scale problems exist in document clustering. RI K-tree
scales well with large inputs due to its low complexity. It also exhibits
features that are useful for managing a changing collection. Furthermore, it
solves previous issues with sparse document vectors when using K-tree. The
algorithms and data structures are defined, explained and motivated. Specific
modifications to K-tree are made for use with RI. Experiments have been
executed to measure quality. The results indicate that RI K-tree improves
document cluster quality over the original K-tree algorithm. (Comment: 8 pages, ADCS 2009; the hyperref and cleveref LaTeX packages conflicted, so cleveref was removed.)
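Random Indexing sidesteps explicit dimensionality reduction: each term gets a sparse random "index vector" (a few ±1 entries in a fixed, reduced dimensionality), and a document vector is simply the sum of its terms' index vectors, which also sidesteps the sparse-vector issue mentioned above. A minimal sketch; the dimensionality, sparsity, and tokens are illustrative:

```python
import random

random.seed(1)
DIM, NNZ = 100, 4  # reduced dimensionality; non-zero entries per index vector

def index_vector():
    # Sparse ternary random vector: NNZ positions set to +1 or -1, rest 0.
    v = [0] * DIM
    for pos in random.sample(range(DIM), NNZ):
        v[pos] = random.choice((-1, 1))
    return v

term_vectors = {}  # term -> its fixed random index vector

def doc_vector(tokens):
    # A document's vector is the sum of its terms' index vectors.
    v = [0] * DIM
    for t in tokens:
        if t not in term_vectors:
            term_vectors[t] = index_vector()
        iv = term_vectors[t]
        for i in range(DIM):
            v[i] += iv[i]
    return v

d1 = doc_vector("xml document clustering with k tree".split())
d2 = doc_vector("document clustering of xml collections".split())
```

Because a new term only adds one more random vector, the representation handles a changing collection gracefully, which is the property RI K-tree exploits.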
Exploiting index pruning methods for clustering XML collections
In this paper, we first employ the well-known Cover-Coefficient Based Clustering Methodology (C3M) for clustering XML documents. Next, we apply index pruning techniques from the literature to reduce the size of the document vectors. Our experiments show that for certain cases, it is possible to prune up to 70% of the collection (or, more specifically, the underlying document vectors) and still generate a clustering structure that yields the same quality as that of the original collection, in terms of a set of evaluation metrics. © 2010 Springer-Verlag Berlin Heidelberg
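The pruning step can be pictured as keeping only the highest-weighted fraction of each document's terms. The weights, terms, and cutoff rule below are made up for illustration; the paper's actual criteria come from the index-pruning literature:

```python
def prune_vector(weights, keep=0.3):
    # Keep only the top `keep` fraction of terms by weight;
    # keep=0.3 corresponds to pruning roughly 70% of the vector.
    k = max(1, round(len(weights) * keep))
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)

doc = {"xml": 4.2, "clustering": 3.1, "pruning": 2.9, "index": 2.5,
       "method": 1.1, "the": 0.2, "a": 0.15, "of": 0.1, "and": 0.05, "in": 0.04}
pruned = prune_vector(doc, keep=0.3)  # 3 of 10 terms survive
```

Low-weight function words are the first to go, which is why clustering quality can survive such aggressive pruning.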
Efficiency and effectiveness of XML keyword search using a full element index
Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2010. Thesis (Master's), Bilkent University, 2010. Includes bibliographical references (leaves 63-67).
In the last decade, both academia and industry have proposed several techniques
to allow keyword search on XML databases and document collections. A common
data structure employed in most of these approaches is an inverted index, which
is the state-of-the-art for conducting keyword search over large volumes of textual
data, such as the World Wide Web. In particular, a full element-index considers (and
indexes) each XML element as a separate document, which is formed of the text
directly contained in it and the textual content of all of its descendants. A major
criticism for a full element-index is the high degree of redundancy in the index
(due to the nested structure of XML documents), which diminishes its usage for
large-scale XML retrieval scenarios.
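The redundancy described above is easy to see in a toy full element-index, where every element is indexed with its own text plus all descendant text. The example document and tags are illustrative:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def full_element_index(xml_text):
    # Treat every element as a document whose content is its own text
    # plus the text of all its descendants (itertext() yields both).
    root = ET.fromstring(xml_text)
    postings = defaultdict(set)
    for eid, elem in enumerate(root.iter()):
        for token in " ".join(elem.itertext()).lower().split():
            postings[token].add((eid, elem.tag))
    return postings

doc = ("<article><sec><title>xml search</title>"
       "<p>keyword search on xml</p></sec></article>")
index = full_element_index(doc)
# "search" is posted for title, p, and every ancestor element: the
# nesting-induced redundancy that makes a full element-index large.
```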
As the first contribution of this thesis, we investigate the efficiency and effectiveness
of using a full element-index for XML keyword search. First, we suggest
that lossless index compression methods can significantly reduce the size of a full
element-index so that query processing strategies, such as those employed in a
typical search engine, can efficiently operate on it. We show that once the most
essential problem of a full element-index, i.e., its size, is remedied, using such
an index can improve both the result quality (effectiveness) and query execution
performance (efficiency) in comparison to other recently proposed techniques in
the literature. Moreover, using a full element-index also allows generating query
results in different forms, such as a ranked list of documents (as expected by a
search engine user) or a complete list of elements that include all of the query
terms (as expected by a DBMS user), in a unified framework.
As a second contribution of this thesis, we propose to use a lossy approach,
static index pruning, to further reduce the size of a full element-index. In this way, we aim to eliminate the repetition of an element's terms at upper levels in an
adaptive manner considering the element's textual content and search system's
ranking function. That is, we attempt to remove the repetitions in the index only
when we expect that removal of them would not reduce the result quality. We
conduct a well-crafted set of experiments and show that pruned index files are
comparable or even superior to the full element-index up to very high pruning
levels for various ad hoc tasks in terms of retrieval effectiveness.
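One way to picture this lossy step, under simplifying assumptions: an element that merely inherits a term from its descendants keeps the posting only when the subtree's aggregated term frequency is high enough. A plain frequency threshold stands in here for the thesis's ranking-function-aware criterion, and the example document is illustrative:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def pruned_element_index(xml_text, min_tf=2):
    # Direct occurrences are always indexed; a posting that exists only
    # because descendants contain the term is kept when the subtree's
    # aggregated term frequency reaches min_tf.
    root = ET.fromstring(xml_text)
    postings = defaultdict(set)
    for eid, elem in enumerate(root.iter()):
        direct = ((elem.text or "") + " " +
                  " ".join(c.tail or "" for c in elem)).lower().split()
        subtree = " ".join(elem.itertext()).lower().split()
        for term in set(subtree):
            if term in direct or subtree.count(term) >= min_tf:
                postings[term].add((eid, elem.tag))
    return postings

doc = ("<article><sec><title>xml search</title>"
       "<p>keyword search on xml</p></sec></article>")
index = pruned_element_index(doc)
# "keyword" occurs once, so its inherited ancestor postings are pruned;
# "xml" occurs twice in the subtree, so ancestors keep it.
```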
As a final contribution of this thesis, we propose to apply index pruning
strategies to reduce the size of the document vectors in an XML collection to
improve the clustering performance of the collection. Our experiments show that
for certain cases, it is possible to prune up to 70% of the collection (or, more
specifically, the underlying document vectors) and still generate a clustering structure
that yields the same quality as that of the original collection, in terms of a set
of evaluation metrics.
Atılgan, Duygu (M.S.)
Evaluating Clusterings by Estimating Clarity
In this thesis I examine clustering evaluation, with a particular focus on text clusterings. The principal work
of this thesis is the development, analysis, and testing of a new internal clustering quality measure called informativeness.
I begin by reviewing clustering in general. I then review current clustering
quality measures, accompanying this with an in-depth discussion of many of the important properties one needs to understand about such measures. This is followed by extensive document clustering experiments that show problems with standard clustering evaluation practices.
I then develop informativeness, my new internal clustering quality measure for estimating the clarity of clusterings. I show that informativeness, which uses classification accuracy as a proxy for human assessment of clusterings, is both theoretically sensible and works empirically. I present a generalization of informativeness that leverages external clustering quality measures. I also show its use in a realistic application: email spam filtering. I show that informativeness can be used to select clusterings which lead to superior spam filters when few true labels are available.
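A toy version of the accuracy-as-clarity idea: a clustering is clearer when a classifier trained on some items' cluster labels predicts the remaining items' labels well. The nearest-centroid classifier, the 50/50 split, and the data below are illustrative choices, not the thesis's exact protocol:

```python
def clarity_score(vectors, labels, train_frac=0.5):
    # Train a nearest-centroid classifier on the cluster labels of the
    # first half of the data; its accuracy on the second half serves as
    # a proxy for how clear (learnable) the clustering is.
    n_train = int(len(vectors) * train_frac)
    train = list(zip(vectors[:n_train], labels[:n_train]))
    test = list(zip(vectors[n_train:], labels[n_train:]))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = {}
    for lab in set(l for _, l in train):
        members = [v for v, l in train if l == lab]
        centroids[lab] = tuple(sum(xs) / len(members) for xs in zip(*members))

    def predict(v):
        return min(centroids, key=lambda lab: sq_dist(v, centroids[lab]))

    return sum(1 for v, l in test if predict(v) == l) / len(test)

vecs = [(0, 0), (9, 9), (0, 1), (9, 10), (1, 0), (10, 9), (1, 1), (10, 10)]
clear = clarity_score(vecs, [0, 1, 0, 1, 0, 1, 0, 1])    # coherent clustering
muddled = clarity_score(vecs, [1, 1, 0, 0, 0, 1, 1, 0])  # scrambled labels
```

The coherent labeling scores higher than the scrambled one, which is the signal informativeness extracts without needing human-assessed ground truth.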
I conclude this thesis with a discussion of clustering evaluation in general, informativeness, and the directions I believe clustering evaluation research should take in the future.