From Keyword Search to Exploration: How Result Visualization Aids Discovery on the Web
A key to the Web's success is the power of search. The elegant way in which search results are returned is usually remarkably effective. However, for exploratory search, in which users need to learn, discover, and understand novel or complex topics, there is substantial room for improvement. Human-computer interaction researchers and web browser designers have developed novel strategies to improve Web search by enabling users to conveniently visualize, manipulate, and organize their Web search results. This monograph offers fresh ways to think about search-related cognitive processes and describes innovative design approaches to browsers and related tools. For instance, while keyword search presents users with results for specific information (e.g., what is the capital of Peru), other methods may let users see and explore the contexts of their requests for information (related or previous work, conflicting information), or the properties that associate groups of information assets (group legal decisions by lead attorney). We also consider both the traditional and novel ways in which these strategies have been evaluated. From our review of cognitive processes, browser design, and evaluations, we reflect on the future opportunities and new paradigms for exploring and interacting with Web search results.
Knowledge Management and Cultural Heritage Repositories. Cross-Lingual Information Retrieval Strategies
In recent years, important initiatives such as the development of the European Library and Europeana have aimed to increase the availability of cultural content from various types of providers and institutions. Access to these resources requires environments that can both manage multilingual complexity and preserve semantic interoperability. Natural Language Processing (NLP) applications are developed here with the goal of achieving Cross-Lingual Information Retrieval (CLIR). This paper presents ongoing research on language processing based on the Lexicon-Grammar (LG) approach, with the goal of improving knowledge management in Cultural Heritage repositories. The proposed framework aims to guarantee interoperability between multilingual systems in order to overcome crucial issues such as cross-language and cross-collection retrieval. Indeed, the LG methodology tries to overcome the shortcomings of statistical approaches, as in Google Translate or Bing by Microsoft, concerning Multi-Word Unit (MWU) processing in queries, where the lack of linguistic context represents a serious obstacle to disambiguation. In particular, translation concerning specific domains, as has been widely recognized, is unambiguous, since the meanings of terms are mono-referential and the type of relation that links a given term to its equivalent in a foreign language is biunivocal, i.e. a one-to-one coupling that makes this relation exclusive and reversible. Ontologies are used in CLIR and are considered by several scholars a promising research area for improving the effectiveness of Information Extraction (IE) techniques, particularly for technical-domain queries. Therefore, we present a methodological framework that allows mapping both the data and the metadata among the language-specific ontologies.
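The MWU problem the abstract describes can be illustrated with a minimal sketch (not the authors' system): a multi-word-unit dictionary is consulted before falling back to word-by-word translation, so a domain term keeps its one-to-one (biunivocal) equivalent instead of being translated word by word. All dictionary entries below are hypothetical examples.

```python
# Hypothetical toy dictionaries; a real CLIR system would use
# Lexicon-Grammar resources rather than these examples.
MWU_DICT = {
    ("beni", "culturali"): "cultural heritage",  # Italian MWU -> English
}
WORD_DICT = {"beni": "goods", "culturali": "cultural"}

def translate_query(tokens):
    """Translate a token list, preferring the longest MWU match."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest multi-word unit starting at position i first.
        for j in range(len(tokens), i, -1):
            if tuple(tokens[i:j]) in MWU_DICT:
                out.append(MWU_DICT[tuple(tokens[i:j])])
                i = j
                break
        else:
            # No MWU found: fall back to word-by-word translation.
            out.append(WORD_DICT.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)

print(translate_query(["beni", "culturali"]))  # -> "cultural heritage"
```

Word-by-word lookup alone would produce "goods cultural", which is exactly the kind of context-free mistranslation the abstract argues against.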
TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank
Learning-to-Rank deals with maximizing the utility of a list of examples
presented to the user, with items of higher relevance being prioritized. It has
several practical applications such as large-scale search, recommender systems,
document summarization and question answering. While there is widespread
support for classification and regression based learning, support for
learning-to-rank in deep learning has been limited. We propose TensorFlow
Ranking, the first open source library for solving large-scale ranking problems
in a deep learning framework. It is highly configurable and provides
easy-to-use APIs to support different scoring mechanisms, loss functions and
evaluation metrics in the learning-to-rank setting. Our library is developed on
top of TensorFlow and can thus fully leverage the advantages of this platform.
For example, it is highly scalable, both in training and in inference, and can
be used to learn ranking models over massive amounts of user activity data,
which can include heterogeneous dense and sparse features. We empirically
demonstrate the effectiveness of our library in learning ranking functions for
large-scale search and recommendation applications in Gmail and Google Drive.
We also show that ranking models built using our library scale well for
distributed training, without significant impact on metrics. The proposed
library is available to the open source community, with the hope that it
facilitates further academic research and industrial applications in the field
of learning-to-rank. Comment: KDD 201
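To make the learning-to-rank setting concrete, here is a minimal sketch of a pairwise logistic ranking loss, one of the family of loss functions a library like this supports. This is plain NumPy for illustration, not the TF-Ranking API.

```python
import numpy as np

def pairwise_logistic_loss(scores, labels):
    """Sum of log(1 + exp(-(s_i - s_j))) over pairs where label_i > label_j."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    diff = s[:, None] - s[None, :]   # s_i - s_j for every ordered pair
    mask = y[:, None] > y[None, :]   # pairs the model should order correctly
    return np.sum(np.log1p(np.exp(-diff[mask])))

# A correctly ordered list incurs lower loss than a reversed one.
good = pairwise_logistic_loss([3.0, 2.0, 1.0], [2, 1, 0])
bad = pairwise_logistic_loss([1.0, 2.0, 3.0], [2, 1, 0])
assert good < bad
```

Minimizing such a loss pushes the score of each more-relevant item above the scores of less-relevant ones, which is the core objective the abstract refers to.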
Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification
This paper investigates how citation-based information and structural content (e.g., title, abstract) can be combined to improve classification of text documents into predefined categories. We evaluate different measures of similarity, five derived from the citation structure of the collection, and three measures derived from the structural content, and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our empirical experiments using documents from the ACM digital library and the ACM classification scheme show that we can discover similarity functions that work better than any evidence in isolation and whose combined performance through a simple majority voting is comparable to that of Support Vector Machine classifiers
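The simple majority-voting fusion the abstract mentions can be sketched in a few lines (an illustration, not the paper's GP-discovered functions): each similarity-based classifier casts a vote for a category, and the most common vote wins. The category labels below are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted category label per base classifier."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical votes from three similarity-based classifiers.
votes = ["IR", "IR", "Databases"]
print(majority_vote(votes))  # -> "IR"
```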
Lexical cohesion and term proximity in document ranking
We demonstrate effective new methods of document ranking based on lexical cohesive relationships between query terms. The proposed methods rely solely on the lexical relationships between original query terms, and do not involve query expansion or relevance feedback. Two types of lexical cohesive relationship information between query terms are used in document ranking: short-distance collocation relationship between query terms, and long-distance relationship, determined by the collocation of query terms with other words. The methods are evaluated on TREC corpora, and show improvements over baseline systems.
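A minimal sketch of the short-distance collocation idea (an illustration, not the paper's exact model): score a document by the closest co-occurrence of two query terms, so documents where the terms collocate within a short window rank higher.

```python
def min_pair_distance(tokens, term_a, term_b):
    """Smallest token distance between any occurrence of the two terms."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None  # a query term is missing from the document
    return min(abs(i - j) for i in pos_a for j in pos_b)

doc = "lexical cohesion links query terms across a document".split()
print(min_pair_distance(doc, "lexical", "cohesion"))  # -> 1
```

A ranker built on this signal would prefer documents with small pair distances, capturing the intuition that cohesively related query terms tend to occur close together.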
Simulated evaluation of faceted browsing based on feature selection
In this paper we explore the limitations of facet-based browsing, which uses sub-needs of an information need for querying and organising the search process in video retrieval. The underlying assumption of this approach is that search effectiveness will be enhanced if such an approach is employed for interactive video retrieval using textual and visual features. We explore the performance bounds of a faceted system by carrying out a simulated user evaluation on TRECVid data sets, and also on the logs of a prior user experiment with the system. We first present a methodology to reduce the dimensionality of features by selecting the most important ones. Then, we discuss the simulated evaluation strategies employed in our evaluation and the effect of using both textual and visual features. Facets created by users are simulated by clustering video shots using textual and visual features. The experimental results of our study demonstrate that the faceted browser can potentially improve search effectiveness.
Kernel-Based Ranking: Methods for Learning and Performance Estimation
Machine learning provides tools for automated construction of predictive
models in data-intensive areas of engineering and science. The family of
regularized kernel methods has in recent years become one of the mainstream
approaches to machine learning, due to a number of advantages the
methods share. The approach provides theoretically well-founded solutions
to the problems of under- and overfitting, allows learning from structured
data, and has been empirically demonstrated to yield high predictive performance
on a wide range of application domains. Historically, the problems
of classification and regression have gained the majority of attention in the
field. In this thesis we focus on another type of learning problem, that of
learning to rank.
In learning to rank, the aim is to learn, from a set of past observations,
a ranking function that can order new objects according to how well they
match some underlying criterion of goodness. As an important special case
of the setting, we can recover the bipartite ranking problem, corresponding
to maximizing the area under the ROC curve (AUC) in binary classification.
Ranking applications appear in a large variety of settings; examples
encountered in this thesis include document retrieval in web search, recommender
systems, information extraction and automated parsing of natural
language. We consider the pairwise approach to learning to rank, where
ranking models are learned by minimizing the expected probability of ranking
any two randomly drawn test examples incorrectly. The development
of computationally efficient kernel methods, based on this approach, has in
the past proven to be challenging. Moreover, it is not clear what techniques
for estimating the predictive performance of learned models are the most
reliable in the ranking setting, and how the techniques can be implemented
efficiently.
The contributions of this thesis are as follows. First, we develop
RankRLS, a computationally efficient kernel method for learning to rank,
that is based on minimizing a regularized pairwise least-squares loss. In
addition to training methods, we introduce a variety of algorithms for tasks
such as model selection, multi-output learning, and cross-validation, based
on computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm,
which is one of the most well established methods for learning to
rank. Third, we study the combination of the empirical kernel map and reduced
set approximation, which allows the large-scale training of kernel machines
using linear solvers, and propose computationally efficient solutions
to cross-validation when using the approach. Next, we explore the problem
of reliable cross-validation when using AUC as a performance criterion,
through an extensive simulation study. We demonstrate that the proposed
leave-pair-out cross-validation approach leads to more reliable performance
estimation than commonly used alternative approaches. Finally, we present
a case study on applying machine learning to information extraction from
biomedical literature, which combines several of the approaches considered
in the thesis. The thesis is divided into two parts. Part I provides the background
for the research work and summarizes the most central results; Part
II consists of the five original research articles that are the main contribution
of this thesis.
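The leave-pair-out cross-validation scheme studied in the thesis can be sketched as follows. This is an illustrative implementation under a generic `train`/`score` interface (hypothetical names, not the thesis software): every (positive, negative) pair is held out in turn, and the model trained on the remaining data is checked for ordering that pair correctly; the fraction of correctly ordered pairs estimates the AUC.

```python
from itertools import product

def leave_pair_out_auc(X, y, train, score):
    """Leave-pair-out CV estimate of AUC for binary labels y in {0, 1}."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    correct = 0
    for i, j in product(pos, neg):
        # Hold out the pair (i, j); train on everything else.
        rest = [k for k in range(len(y)) if k not in (i, j)]
        model = train([X[k] for k in rest], [y[k] for k in rest])
        correct += score(model, X[i]) > score(model, X[j])
    return correct / (len(pos) * len(neg))

# Toy 1-D example: training is a no-op here and scoring is just the
# feature value, so the data is perfectly separable.
X, y = [0.1, 0.9, 0.2, 0.8], [0, 1, 0, 1]
auc = leave_pair_out_auc(X, y, train=lambda X_, y_: None,
                         score=lambda m, x: x)
print(auc)  # -> 1.0
```

Because each held-out pair never influences the model that ranks it, this estimator avoids the pessimistic bias that pooled or averaged per-fold AUC estimates can exhibit on small data sets, which is the reliability question the simulation study addresses.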