Toward Entity-Aware Search
As the Web has evolved into a data-rich repository, current search engines, built around the standard "page view," are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., a phone number, a paper PDF, a date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of the data entities inside pages, a significant departure from traditional document retrieval. We study the essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that seamlessly integrates both local and global information in ranking. We also report on a prototype system built to show the initial promise of the proposal. Then, we distill and abstract the essential computation requirements of entity search. From the dual views of reasoning, entity as input and entity as output, we propose a dual-inversion framework with two indexing and partitioning schemes for efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query-log data. The results obtained so far show clear promise for entity-aware search in its usefulness, effectiveness, efficiency, and scalability.
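The core idea of combining local (within-page) and global (cross-page) evidence can be illustrated with a minimal sketch. This is not EntityRank's actual probabilistic model; the corpus, entity strings, and the naive sum-based aggregation below are all illustrative stand-ins.

```python
from collections import defaultdict

def rank_entities(page_observations):
    """Aggregate per-page (local) evidence into a global entity score.

    page_observations: list of dicts mapping entity -> local score in [0, 1],
    e.g. a proximity weight between the entity and the query keywords.
    Returns entities sorted by a naive global score (sum of local scores),
    which stands in for EntityRank's probabilistic aggregation.
    """
    global_score = defaultdict(float)
    for page in page_observations:
        for entity, local in page.items():
            global_score[entity] += local
    return sorted(global_score.items(), key=lambda kv: -kv[1])

# Hypothetical query "Amazon customer service phone": two pages each
# contribute local evidence for phone-number candidates.
pages = [
    {"555-1234": 0.9, "555-9999": 0.2},  # page 1: two candidates near query terms
    {"555-1234": 0.7},                   # page 2 corroborates the first number
]
ranking = rank_entities(pages)
```

The key property the sketch preserves is that an entity corroborated across many pages (global evidence) outranks one seen prominently on a single page.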
BlogForever D3.2: Interoperability Prospects
This report evaluates the interoperability prospects of the BlogForever platform. To that end, existing interoperability models are reviewed; a Delphi study is conducted to identify aspects crucial to the interoperability of web archives and digital libraries; technical interoperability standards and protocols are reviewed for their relevance to BlogForever; a simple approach to considering interoperability in specific usage scenarios is proposed; and a tangible approach to developing a succession plan that would allow a reliable transfer of content from the current digital archive to other digital repositories is presented.
Fidelity-Weighted Learning
Training deep neural networks requires many training samples, but in practice training labels are expensive to obtain and may be of varying quality, as some may come from trusted expert labelers while others might come from heuristics or other sources of weak supervision such as crowd-sourcing. This creates a fundamental quality-versus-quantity trade-off in the learning process. Do we learn from the small amount of high-quality data or the potentially large amount of weakly-labeled data? We argue that if the learner could somehow know and take the label quality into account when learning the data representation, we could get the best of both worlds. To this end, we propose "fidelity-weighted learning" (FWL), a semi-supervised student-teacher approach for training deep neural networks using weakly-labeled data. FWL modulates the parameter updates to a student network (trained on the task we care about) on a per-sample basis, according to the posterior confidence in its label quality as estimated by a teacher (who has access to the high-quality labels). Both student and teacher are learned from the data. We evaluate FWL on two tasks in information retrieval and natural language processing, where we outperform state-of-the-art alternative semi-supervised methods, indicating that our approach makes better use of strong and weak labels and leads to better task-dependent data representations.
Comment: Published as a conference paper at ICLR 201
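The per-sample modulation of parameter updates can be sketched on a toy problem. This is only a minimal illustration of the weighting mechanism, not the paper's method: the student here is a 1-D linear model, and the teacher's confidence values are simply assumed (in FWL they come from a teacher model trained on the strong labels).

```python
import numpy as np

# Toy data: a few trusted (strong) labels and many noisy (weak) labels
# for the same underlying linear relation y = 2x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
true_w = 2.0
strong = x[:20] * true_w                            # exact, expert-quality labels
weak = x[20:] * true_w + rng.normal(0, 2.0, 180)    # noisy weak supervision
y = np.concatenate([strong, weak])

# Assumed per-sample fidelity: high confidence on strong labels, low on weak.
confidence = np.concatenate([np.ones(20), np.full(180, 0.1)])

# Gradient descent where each sample's gradient is scaled by its confidence,
# mimicking FWL's confidence-modulated parameter updates.
w, lr = 0.0, 0.1
for _ in range(500):
    grad = (w * x - y) * x                   # per-sample squared-error gradients
    w -= lr * np.mean(confidence * grad)     # down-weight low-fidelity samples
```

Because the noisy samples contribute only a tenth of their gradient, the learned weight stays close to the true value despite the weak labels dominating in count.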
SEARCHING AS THINKING: THE ROLE OF CUES IN QUERY REFORMULATION
Given the growing volume of information that surrounds us, search, and particularly web search, is now a fundamental part of how people perceive and experience the world. Understanding how searchers interact with search engines is thus an important topic both for designers of information retrieval systems and for educators working in the area of digital literacy. The more established, system-centric approaches in information retrieval (IR), however, offer only limited means of reaching such understanding. While the inherently iterative nature of the search process is generally acknowledged in the field of IR, research on query reformulation typically deals only with the what or the how of the query reformulation process. Any picture of searchers' behavior is thus incomplete without addressing the why of query reformulation, including what pieces of information, or cues, trigger the reformulation process. Unpacking that aspect of searchers' behavior requires a more user-centric approach.
The overall goal of this study is to advance understanding of the reformulation process and the cues that influence it. It was driven by two broad questions about the use of cues (on the search engine result pages or the full web pages) in searchers' decisions regarding query reformulation and the effects of that use on search effectiveness. The study draws on data collected in a lab setting from a sample of students who performed a series of search tasks and then went through a process of stimulated recall focused on their query reformulations. Both the query reformulations recorded during the search tasks and the cues elicited during the stimulated recall exercise were coded and then modeled using the mixed-effects method. The final models capture the relationships between cues and query reformulation strategies as well as between cues and search effectiveness; in both cases some relationships are moderated by search expertise and domain knowledge.
The results demonstrate that searchers systematically elicit and use cues with regard to query reformulation. Some of these relationships are independent of search expertise and domain knowledge, while others manifest themselves differently at different levels of search expertise and domain knowledge. At the same time, because the majority of the reformulations in this study indicated a failure of the preceding query, the results on the relationships between the use of cues and search effectiveness were mixed. As a whole, this work offers two contributions to the field of user-centered information retrieval. First, it reaffirms some of the earlier conceptual work on the role of cues in search behavior, and then expands on it by proposing specific relationships between cues and reformulations. Second, it highlights potential design considerations for search engine results pages and query-term suggestions, as well as training suggestions for educators working on digital literacy.
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on its parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source, target, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improvements in performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
JRC.G.2 - Global security and crisis management
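The expansion step, pairing an unseen morphological variant with a known phrase via a similarity score, can be sketched as follows. Note the assumptions: difflib's character-level `ratio` stands in for the paper's morphosyntactically informed similarity, and the English/French word pairs and threshold are made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical phrase table entries and unseen source-side variants.
known_pairs = {"translate": "traduire", "translated": "traduit"}
variants = ["translates", "translating"]

def best_match(variant, known, threshold=0.8):
    """Return the known source phrase most similar to the variant,
    or None if no candidate clears the similarity threshold."""
    scored = [(SequenceMatcher(None, variant, k).ratio(), k) for k in known]
    score, match = max(scored)
    return match if score >= threshold else None

# Each matched variant inherits the translation of its closest known phrase,
# yielding new phrase associations without any additional parallel data.
new_entries = {v: known_pairs[best_match(v, known_pairs)]
               for v in variants if best_match(v, known_pairs)}
```

A real system would also adapt the target side morphologically and assign scores to the new associations; the sketch only shows how similarity-based matching recovers otherwise out-of-vocabulary phrases.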
Automatic Extraction and Assessment of Entities from the Web
The search for information about entities, such as people or movies, plays an increasingly important role on the Web. This information is still scattered across many Web pages, making it time-consuming for a user to find all the relevant information about an entity. This thesis describes techniques to extract entities and information about them from the Web, such as facts, opinions, questions and answers, interactive multimedia objects, and events. The findings of this thesis are that it is possible to create a large knowledge base automatically using a manually-crafted ontology. The precision of the extracted information was found to be between 75% and 90% (for facts and entities, respectively) after applying assessment algorithms. The algorithms from this thesis can be used to create such a knowledge base, which can then be used in various research fields, such as question answering, named entity recognition, and information retrieval.
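The precision figure reported above is the standard extraction-quality measure: the fraction of extracted items judged correct by an assessment step. A minimal sketch, with made-up fact triples and assessment labels:

```python
# Hypothetical extracted fact triples with a manual/automatic assessment flag;
# records and labels are illustrative only.
extracted = [
    ("Berlin", "capital_of", "Germany", True),   # assessed correct
    ("Paris", "capital_of", "Italy", False),     # assessed incorrect
    ("Tokyo", "capital_of", "Japan", True),
    ("Oslo", "capital_of", "Norway", True),
]

correct = sum(1 for *_, ok in extracted if ok)
precision = correct / len(extracted)  # fraction of extractions that are correct
```

In the thesis, such assessment algorithms filter the raw extractions so that the retained facts reach the 75% precision reported for facts (90% for entities).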