A new approach to search result clustering and labeling
Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2011. Thesis (Master's) -- Bilkent University, 2011. Includes bibliographical references (leaves 58-62).
Search engines present query results as a long ordered list of web snippets divided
into several pages. Post-processing of information retrieval results for easier access
to the desired information is an important research problem. One such post-processing
technique clusters search results by topic and labels these groups to reflect
the topic of each cluster. In this thesis, we present a novel search result clustering
approach to split the long list of documents returned by search engines into
meaningfully grouped and labeled clusters. Our method emphasizes clustering
quality by using cover coefficient-based and sequential k-means clustering algorithms.
Cluster labeling is crucial because meaningless or confusing labels may mislead
users into checking the wrong clusters for a query and wasting time. Additionally,
labels should reflect the contents of documents within the cluster accurately. To
be able to label clusters effectively, a new cluster labeling method based on term
weighting is introduced. We also present a new metric that employs precision and
recall to assess the success of cluster labeling. We adopt a comparative evaluation
strategy to derive the relative performance of the proposed method with respect
to the two prominent search result clustering methods: Suffix Tree Clustering
and Lingo. Moreover, we perform the experiments using the publicly available
Ambient and ODP-239 datasets. Experimental results show that the proposed
method can successfully achieve both clustering and labeling tasks.
Türel, Anıl. M.S.
A new approach to search result clustering and labeling
Search engines present query results as a long ordered list of web snippets divided into several pages. Post-processing of retrieval results for easier access to the desired information is an important research problem. In this paper, we present a novel search result clustering approach to split the long list of documents returned by search engines into meaningfully grouped and labeled clusters. Our method emphasizes clustering quality by using cover coefficient-based and sequential k-means clustering algorithms. A cluster labeling method based on term weighting is also introduced to reflect cluster contents. In addition, we present a new metric that employs precision and recall to assess the success of cluster labeling. We adopt a comparative strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods: Suffix Tree Clustering and Lingo. Experimental results on the publicly available AMBIENT and ODP-239 datasets show that our method can successfully achieve both clustering and labeling tasks. © 2011 Springer-Verlag Berlin Heidelberg
Transforming Graph Representations for Statistical Relational Learning
Relational data representations have become an increasingly important topic
due to the recent proliferation of network datasets (e.g., social, biological,
information networks) and a corresponding increase in the application of
statistical relational learning (SRL) algorithms to these domains. In this
article, we examine a range of representation issues for graph-based relational
data. Since the choice of relational data representation for the nodes, links,
and features can dramatically affect the capabilities of SRL algorithms, we
survey approaches and opportunities for relational representation
transformation designed to improve the performance of these algorithms. This
leads us to introduce an intuitive taxonomy for data representation
transformations in relational domains that incorporates link transformation and
node transformation as symmetric representation tasks. In particular, the
transformation tasks for both nodes and links include (i) predicting their
existence, (ii) predicting their label or type, (iii) estimating their weight
or importance, and (iv) systematically constructing their relevant features. We
motivate our taxonomy through detailed examples and use it to survey and
compare competing approaches for each of these tasks. We also discuss general
conditions for transforming links, nodes, and features. Finally, we highlight
challenges that remain to be addressed.
Distantly Labeling Data for Large Scale Cross-Document Coreference
Cross-document coreference, the problem of resolving entity mentions across
multi-document collections, is crucial to automated knowledge base construction
and data mining tasks. However, the scarcity of large labeled data sets has
hindered supervised machine learning research for this task. In this paper we
develop and demonstrate an approach based on "distantly labeling" a data set
from which we can train a discriminative cross-document coreference model. In
particular, we build a dataset of more than a million people mentions extracted
from 3.5 years of New York Times articles, leverage Wikipedia for distant
labeling with a generative model (and measure the reliability of such
labeling); then we train and evaluate a conditional random field coreference
model that has factors on cross-document entities as well as mention-pairs.
This coreference model obtains high accuracy in resolving mentions and entities
that are not present in the training data, indicating applicability to
non-Wikipedia data. Given the large amount of data, our work is also an
exercise demonstrating the scalability of our approach.
Comment: 16 pages, submitted to ECML 201