Improving web search results with explanation-aware snippets: an experimental study
In this paper, we focus on a typical web search task in which users want to discover the coherency between two concepts on the Web. In our view, this task can be seen as a retrieval process: starting with some source information, the goal is to find target information by following hyperlinks. Given two concepts, e.g. chemistry and gunpowder, are search engines able to find the coherency and explain it? In this paper, we introduce a novel way of linking two concepts by following paths of hyperlinks and collecting short text snippets. We implemented a proof-of-concept prototype, which extracts paths and snippets from Wikipedia articles. Our goal is to provide the user with an overview of the coherency, enriching the connection with a
short but meaningful description. In our experimental study, we compare the results of our approach with the capability of web search engines. The results show that 72% of the participants find ours better than those of web search engines. (author's abstract)
A Survey of Source Code Search: A 3-Dimensional Perspective
(Source) code search has attracted wide attention from software engineering researchers
because it can improve the productivity and quality of software development.
Given a functionality requirement usually described in a natural language
sentence, a code search system can retrieve code snippets that satisfy the
requirement from a large-scale code corpus, e.g., GitHub. To realize effective
and efficient code search, many techniques have been proposed successively.
These techniques improve code search performance mainly by optimizing three
core components: the query understanding component, the code understanding
component, and the query-code matching component. In this paper, we provide a
3-dimensional perspective survey for code search. Specifically, we categorize
existing code search studies into query-end optimization techniques, code-end
optimization techniques, and match-end optimization techniques according to the
specific components they optimize. Considering that each end can be optimized
independently and contributes to the code search performance, we treat each end
as a dimension. Therefore, this survey is 3-dimensional in nature, and it
provides a comprehensive summary of each dimension in detail. To understand the
research trends of the three dimensions in existing code search studies, we
systematically review 68 relevant studies. Unlike existing code
search surveys that focus only on the query end or the code end, or introduce
various aspects shallowly (including codebase, evaluation metrics, modeling
technique, etc.), our survey provides a more nuanced analysis and review of the
evolution and development of the underlying techniques used in the three ends.
Based on a systematic review and summary of existing work, we outline several
open challenges and opportunities at the three ends that remain to be addressed
in future work.
Comment: submitted to ACM Transactions on Software Engineering and Methodology
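The three ends named above (query understanding, code understanding, query-code matching) can be illustrated with a deliberately minimal sketch. Every function name, the identifier-splitting rule, and the Jaccard-overlap scoring below are illustrative assumptions, not taken from any surveyed system:

```python
# Minimal sketch of the three code-search components: query end, code end,
# and match end. All names and the scoring scheme are illustrative
# assumptions, not any surveyed system's actual design.
import re

def understand_query(query: str) -> set[str]:
    # Query end: normalize the natural-language query into lowercase tokens.
    return set(re.findall(r"[a-z]+", query.lower()))

def understand_code(snippet: str) -> set[str]:
    # Code end: split identifiers (camelCase / snake_case) into word tokens.
    tokens = set()
    for word in re.findall(r"[A-Za-z]+", snippet):
        for part in re.split(r"_|(?<=[a-z])(?=[A-Z])", word):
            if part:
                tokens.add(part.lower())
    return tokens

def match(query_tokens: set[str], code_tokens: set[str]) -> float:
    # Match end: Jaccard overlap between the two token sets.
    if not query_tokens or not code_tokens:
        return 0.0
    return len(query_tokens & code_tokens) / len(query_tokens | code_tokens)

# Tiny invented "corpus" of two snippets, ranked against one query.
corpus = {
    "read_file": "def readFile(path): return open(path).read()",
    "sort_list": "def sort_list(xs): return sorted(xs)",
}
q = understand_query("read the contents of a file")
ranked = sorted(corpus, key=lambda k: match(q, understand_code(corpus[k])),
                reverse=True)
print(ranked[0])
```

Real systems replace each of these stubs with far richer models (e.g., neural encoders at the query and code ends), which is exactly the axis along which the survey organizes the literature.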
Supporting Source Code Search with Context-Aware and Semantics-Driven Query Reformulation
Software bugs and failures cost trillions of dollars every year, and can even lead to deadly accidents (e.g., the Therac-25 accident). During maintenance, software developers fix numerous bugs and implement hundreds of new features by making necessary changes to the existing software code. Once an issue report (e.g., bug report, change request) is assigned to a developer, she chooses a few important keywords from the report as a search query, and then attempts to find the exact locations in the software code that need to be either repaired or enhanced. As a part of this maintenance, developers also often construct ad hoc queries on the fly and attempt to locate reusable code on the Internet that could assist them either in bug fixing or in feature implementation. Unfortunately, even experienced developers often fail to construct the right search queries. Even if the developers come up with a few ad hoc queries, most of them require frequent modifications, which cost significant development time and effort. Thus, constructing an appropriate query for localizing software bugs, programming concepts or even reusable code is a major challenge. In this thesis, we overcome this query construction challenge with six studies, and develop a novel, effective code search solution (BugDoctor) that assists developers in localizing the software code of interest (e.g., bugs, concepts and reusable code) during software maintenance. In particular, we reformulate a given search query (1) by designing novel keyword selection algorithms (e.g., CodeRank) that outperform the traditional alternatives (e.g., TF-IDF), (2) by leveraging the bug report quality paradigm and source document structures, which were previously overlooked, and (3) by exploiting the crowd knowledge and word semantics derived from the Stack Overflow Q&A site, which were previously untapped.
Our experiment using 5000+ search queries (bug reports, change requests, and ad hoc queries) suggests that our proposed approach can improve the given queries significantly through automated query reformulation. Comparison with 10+ existing studies on bug localization, concept location and Internet-scale code search suggests that our approach can outperform the state-of-the-art approaches by a significant margin.
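The abstract contrasts its novel keyword selection (CodeRank) with the traditional TF-IDF alternative. As a point of reference only, here is a textbook TF-IDF keyword-selection sketch for an issue report; the report and corpus texts are invented, and this shows the baseline being compared against, not the thesis's CodeRank method:

```python
# Textbook TF-IDF keyword selection from an issue report: the traditional
# baseline the abstract contrasts with CodeRank. Report and corpus texts
# are invented for illustration.
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def top_keywords(report: str, corpus: list[str], k: int = 3) -> list[str]:
    docs = [tokenize(d) for d in corpus]
    tf = Counter(tokenize(report))      # term frequency in the report
    n = len(docs)
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for d in docs if term in d)      # document frequency
        idf = math.log((n + 1) / (df + 1)) + 1      # smoothed IDF
        scores[term] = freq * idf
    # Highest-scoring terms become the search query keywords.
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

corpus = [
    "null pointer exception when saving file",
    "crash on startup due to missing config",
    "button click handler throws null pointer",
]
report = "app throws null pointer exception on file save button click"
kws = top_keywords(report, corpus)
print(kws)
```

Note how TF-IDF simply rewards report terms that are rare in the corpus; graph-based rankers such as CodeRank instead score terms by their connectivity to other terms, which is the improvement the thesis reports.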
Adaptive Visualization for Focused Personalized Information Retrieval
The new trend on the Web has totally changed today's information access environment. The traditional information overload problem has evolved to a qualitative level beyond the quantitative growth. The mode of producing and consuming information is changing, and we need a new paradigm for accessing information. Personalized search is one of the most promising answers to this problem. However, it still follows the old interaction model and representation method of classic information retrieval approaches. This limitation can harm the potential of personalized search, with which users are intended to interact with the system, learn and investigate the problem, and collaborate with the system to reach the final goal. This dissertation proposes to incorporate interactive visualization into personalized search in order to overcome this limitation. By combining personalized search and interactive visualization, we expect our approach will be able to help users better explore the information space and locate relevant information more efficiently. We extended a well-known visualization framework called VIBE (Visual Information Browsing Environment) and implemented Adaptive VIBE, so that it can fit into the personalized search environment. We tested the effectiveness of this adaptive visualization method and investigated its strengths and weaknesses by conducting a full-scale user study. We also tried to enrich the user models with named entities, considering the possibility that traditional keyword-based user models could harm the effectiveness of the system in the context of interactive information retrieval. The results of the user study showed that Adaptive VIBE could improve the precision of the personalized search system and could help users find a more diverse set of information.
The named-entity-based user model integrated into Adaptive VIBE improved the precision of user annotations while maintaining the level of diverse discovery of information.
Knowledge mining over scientific literature and technical documentation
Abstract This dissertation focuses on the extraction of information implicitly encoded in domain descriptions (technical terminology and related items) and its usage within a restricted-domain question answering system (QA). Since different variants of the same term can be used to refer to the same domain entity, it is necessary to recognize all possible forms of a given term and structure them, so that they can be used in the question answering process. The knowledge about domain descriptions and their mutual relations is leveraged in an extension to an existing QA system, aimed at the technical maintenance manual of a well-known commercial aircraft. The original version of the QA system did not make use of domain descriptions, which are the novelty introduced by the present work. The explicit treatment of domain descriptions provided considerable gains in terms of efficiency, in particular in the process of analysis of the background document collection. Similar techniques were later applied to another domain (biomedical scientific literature), focusing in particular on protein-protein interactions. This dissertation describes in particular: (1) the extraction of domain specific lexical items which refer to entities of the domain; (2) the detection of relationships (like synonymy and hyponymy) among such items, and their organization into a conceptual structure; (3) their usage within a domain restricted question answering system, in order to facilitate the correct identification of relevant answers to a query; (4) the adaptation of the system to another domain, and extension of the basic hypothesis to tasks other than question answering.
Topic indexing and retrieval for open domain factoid question answering
Factoid Question Answering is an exciting area of Natural Language Engineering that
has the potential to replace one major use of search engines today. In this dissertation,
I introduce a new method of handling factoid questions whose answers are proper
names. The method, Topic Indexing and Retrieval, addresses two issues that prevent
current factoid QA systems from realising this potential: they can't satisfy users' demand
for almost immediate answers, and they can't produce answers based on evidence
distributed across a corpus.
The first issue arises because the architecture common to QA systems does not scale
easily to heavy use, since so much of the work is done on-line: text retrieved by
information retrieval (IR) undergoes expensive and time-consuming answer extraction
while the user awaits an answer. If QA systems are to become as heavily used as
popular web search engines, this massive processing bottleneck must be overcome.
The second issue of how to make use of the distributed evidence in a corpus is relevant
when no single passage in the corpus provides sufficient evidence for an answer
to a given question. QA systems commonly look for a text span that contains sufficient
evidence to both locate and justify an answer. But this will fail in the case of questions
that require evidence from more than one passage in the corpus.
The Topic Indexing and Retrieval method developed in this thesis addresses both of these
issues for factoid questions with proper name answers by restructuring the corpus in
such a way that it enables direct retrieval of answers using off-the-shelf IR. The method
has been evaluated on 377 TREC questions with proper name answers and 41 questions
that require multiple pieces of evidence from different parts of the TREC AQUAINT
corpus. With regard to the first evaluation, scores of 0.340 in Accuracy and 0.395 in
Mean Reciprocal Rank (MRR) show that Topic Indexing and Retrieval performs
well for this type of question. A second evaluation compares performance on a corpus
of 41 multi-evidence questions by a question-factoring baseline method that can
be used with the standard QA architecture and by my Topic Indexing and Retrieval
method. The superior performance of the latter (MRR of 0.454 against 0.341) demonstrates
its value in answering such questions.
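Both evaluations above report Mean Reciprocal Rank, which averages the reciprocal rank of the first correct answer across questions. A quick-reference sketch, with invented ranked answer lists standing in for real system output:

```python
# Mean Reciprocal Rank (MRR) as reported in the evaluations above: for each
# question, take 1/rank of the first correct answer (0 if absent), then
# average over all questions. The lists below are invented placeholders.
def mean_reciprocal_rank(ranked_answers, gold):
    total = 0.0
    for answers, correct in zip(ranked_answers, gold):
        for rank, answer in enumerate(answers, start=1):
            if answer == correct:
                total += 1.0 / rank
                break   # only the first correct answer counts
    return total / len(gold)

runs = [
    ["Paris", "Lyon"],             # correct at rank 1 -> contributes 1
    ["Mercury", "Venus", "Mars"],  # correct at rank 3 -> contributes 1/3
    ["Au", "Ag"],                  # correct answer missing -> contributes 0
]
gold = ["Paris", "Mars", "Fe"]
print(mean_reciprocal_rank(runs, gold))  # (1 + 1/3 + 0) / 3
```

An MRR of 0.454, as reported for the multi-evidence questions, thus means the first correct answer sat a little above rank 2 on average.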
- …