111 research outputs found

    Relevance-based language models : new estimations and applications

    Get PDF
    [Abstract] Relevance-Based Language Models introduced into the Language Modelling framework the concept of relevance, which is explicit in other retrieval models such as the Probabilistic models. Relevance Models have mainly been used for a specific task within Information Retrieval called Pseudo-Relevance Feedback, a kind of local query expansion technique where relevance is assumed over the top documents of an initial retrieval and where those documents are used to select expansion terms for the original query and produce a, hopefully more effective, second retrieval. In this thesis we investigate new estimations for Relevance Models, both for Pseudo-Relevance Feedback and for tasks beyond retrieval, in particular constrained text clustering and item recommendation in Recommender Systems. We study the benefits of our proposals for those tasks in comparison with existing estimations. These new modellings are able not only to improve the effectiveness of the existing estimations and methods but also to improve their robustness, a critical factor when dealing with Pseudo-Relevance Feedback methods. These objectives are pursued by different means: promoting divergent terms in the estimation of the Relevance Models, presenting new cluster-based retrieval models, introducing new methods for automatically determining the size of the pseudo-relevant set on a per-query basis, and producing original new modellings under the Relevance-Based Language Modelling framework for the constrained text clustering and item recommendation problems.
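
    The pseudo-relevance feedback loop described above can be made concrete with a small sketch. The snippet below is a minimal, illustrative RM1/RM3-style estimation and expansion, not the thesis's proposed estimations; the inputs (feedback_docs, doc_scores) and the parameters (n_terms, alpha) are assumed toy values introduced only for illustration.

    from collections import defaultdict

    def estimate_relevance_model(feedback_docs, doc_scores):
        # P(w|R) approximated by sum over pseudo-relevant docs of P(w|D) * score(D), then normalised.
        rm = defaultdict(float)
        for doc, score in zip(feedback_docs, doc_scores):
            doc_len = sum(doc.values())
            for term, tf in doc.items():
                rm[term] += (tf / doc_len) * score
        total = sum(rm.values())
        return {t: w / total for t, w in rm.items()}

    def expand_query(query_terms, rm, n_terms=10, alpha=0.6):
        # RM3-style interpolation of the original query with the top expansion terms.
        top = dict(sorted(rm.items(), key=lambda kv: kv[1], reverse=True)[:n_terms])
        norm = sum(top.values())
        expanded = defaultdict(float)
        for t in query_terms:
            expanded[t] += alpha / len(query_terms)
        for t, w in top.items():
            expanded[t] += (1 - alpha) * w / norm
        return dict(expanded)

    # Toy pseudo-relevant documents (term -> frequency) and their initial retrieval scores.
    docs = [{"neural": 3, "retrieval": 5, "model": 2}, {"retrieval": 4, "ranking": 3}]
    scores = [0.7, 0.3]
    rm = estimate_relevance_model(docs, scores)
    print(expand_query(["neural", "retrieval"], rm))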

    ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ์„ ์œ„ํ•œ ํ…์ŠคํŠธ ๋žญํ‚น ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์ •๊ต๋ฏผ.The question answering (QA) system has attracted huge interests due to its applicability in real-world applications. This dissertation proposes novel ranking algorithms for the QA system based on deep neural networks. We first tackle the long-text QA that requires the model to understand the excessively large sequence of text inputs. To solve this problem, we propose a hierarchical recurrent dual encoder that encodes texts from word-level to paragraph-level. We further propose a latent topic clustering method that utilizes semantic information in the target corpus, and thus it increases the performance of the QA system. Secondly, we investigate the short-text QA, where the information in text pairs are limited. To overcome the insufficiency, we combine a pretrained language model and an enhanced latent clustering method to the QA model. This novel architecture enables the model to utilizes additional information, resulting in achieving state-of-the-art performance for the standard answer-selection tasks (i.e., WikiQA, TREC-QA). Finally, we investigate detecting supporting sentences for complex QA system. As opposed to the previous studies, the model needs to understand the relationship between sentences to answer the question. Inspired by the hierarchical nature of the text, we propose a graph neural network-based model that iteratively propagates necessary information between text nodes and achieve the best performance among existing methods.๋ณธ ํ•™์œ„ ๋…ผ๋ฌธ์€ ๋”ฅ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ ๊ธฐ๋ฐ˜ ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ์— ๊ด€ํ•œ ๋ชจ๋ธ์„ ์ œ์•ˆํ•œ๋‹ค. ๋จผ์ € ๊ธด ๋ฌธ์žฅ์— ๋Œ€ํ•œ ์งˆ์˜์‘๋‹ต์„ ํ•˜๊ธฐ ์œ„ํ•ด์„œ ๊ณ„์ธต ๊ตฌ์กฐ์˜ ์žฌ๊ท€์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์ด ์ฃผ์–ด์ง„ ๋ฌธ์žฅ์„ ์งง์€ ์‹œํ€€์Šค ๋‹จ์œ„๋กœ ํšจ์œจ์ ์œผ๋กœ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๊ฒŒ ํ•˜์—ฌ ํฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์–ป์—ˆ๋‹ค. ๋˜ํ•œ ํ•™์Šต ๊ณผ์ •์—์„œ ๋ฐ์ดํ„ฐ ์•ˆ์— ๋‚ดํฌ๋œ ํ† ํ”ฝ์„ ์ž๋™ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋ชจ๋ธ์„ ์ œ์•ˆํ•˜๊ณ , ์ด๋ฅผ ๊ธฐ์กด ์งˆ์˜์‘๋‹ต ๋ชจ๋ธ์— ๋ณ‘ํ•ฉํ•˜์—ฌ ์ถ”๊ฐ€ ์„ฑ๋Šฅ ๊ฐœ์„ ์„ ์ด๋ฃจ์—ˆ๋‹ค. ์ด์–ด์ง€๋Š” ์—ฐ๊ตฌ๋กœ ์งง์€ ๋ฌธ์žฅ์— ๋Œ€ํ•œ ์งˆ์˜์‘๋‹ต ๋ชจ๋ธ์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ๋ฌธ์žฅ์˜ ๊ธธ์ด๊ฐ€ ์งง์•„์งˆ์ˆ˜๋ก ๋ฌธ์žฅ ์•ˆ์—์„œ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ์ •๋ณด์˜ ์–‘๋„ ์ค„์–ด๋“ค๊ฒŒ ๋œ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, ์‚ฌ์ „ ํ•™์Šต๋œ ์–ธ์–ด ๋ชจ๋ธ๊ณผ ์ƒˆ๋กœ์šด ํ† ํ”ฝ ํด๋Ÿฌ์Šคํ„ฐ๋ง ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์˜€๋‹ค. ์ œ์•ˆํ•œ ๋ชจ๋ธ์€ ์ข…๋ž˜ ์งง์€ ๋ฌธ์žฅ ์งˆ์˜์‘๋‹ต ์—ฐ๊ตฌ ์ค‘ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ํš๋“ํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ์—ฌ๋Ÿฌ ๋ฌธ์žฅ ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹ต๋ณ€์„ ์ฐพ์•„์•ผ ํ•˜๋Š” ์งˆ์˜์‘๋‹ต ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ์šฐ๋ฆฌ๋Š” ๋ฌธ์„œ ๋‚ด ๊ฐ ๋ฌธ์žฅ์„ ๊ทธ๋ž˜ํ”„๋กœ ๋„์‹ํ™”ํ•œ ํ›„ ์ด๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋Š” ๊ทธ๋ž˜ํ”„ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ๋ฅผ ์ œ์•ˆํ•˜์˜€๋‹ค. 
์ œ์•ˆํ•œ ๋ชจ๋ธ์€ ๊ฐ ๋ฌธ์žฅ์˜ ๊ด€๊ณ„์„ฑ์„ ์„ฑ๊ณต์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜์˜€๊ณ , ์ด๋ฅผ ํ†ตํ•ด ๋ณต์žก๋„๊ฐ€ ๋†’์€ ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ์—์„œ ๊ธฐ์กด์— ์ œ์•ˆ๋œ ๋ชจ๋ธ๋“ค๊ณผ ๋น„๊ตํ•˜์—ฌ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ํš๋“ํ•˜์˜€๋‹ค.1 Introduction 1 2 Background 8 2.1 Textual Data Representation 8 2.2 Encoding Sequential Information in Text 12 3 Question-Answer Pair Ranking for Long Text 16 3.1 Related Work 18 3.2 Method 19 3.2.1 Baseline Approach 19 3.2.2 Proposed Approaches (HRDE+LTC) 22 3.3 Experimental Setup and Dataset 26 3.3.1 Dataset 26 3.3.2 Consumer Product Question Answering Corpus 30 3.3.3 Implementation Details 32 3.4 Empirical Results 34 3.4.1 Comparison with other methods 35 3.4.2 Degradation Comparison for Longer Texts 37 3.4.3 Effects of the LTC Numbers 38 3.4.4 Comprehensive Analysis of LTC 38 3.5 Further Investigation on Ranking Lengthy Document 40 3.5.1 Problem and Dataset 41 3.5.2 Methods 45 3.5.3 Experimental Results 51 3.6 Conclusion 55 4 Answer-Selection for Short Sentence 56 4.1 Related Work 57 4.2 Method 59 4.2.1 Baseline approach 59 4.2.2 Proposed Approaches (Comp-Clip+LM+LC+TL) 62 4.3 Experimental Setup and Dataset 66 4.3.1 Dataset 66 4.3.2 Implementation Details 68 4.4 Empirical Results 69 4.4.1 Comparison with Other Methods 69 4.4.2 Impact of Latent Clustering 72 4.5 Conclusion 72 5 Supporting Sentence Detection for Question Answering 73 5.1 Related Work 75 5.2 Method 76 5.2.1 Baseline approaches 76 5.2.2 Proposed Approach (Propagate-Selector) 78 5.3 Experimental Setup and Dataset 82 5.3.1 Dataset 82 5.3.2 Implementation Details 83 5.4 Empirical Results 85 5.4.1 Comparisons with Other Methods 85 5.4.2 Hop Analysis 86 5.4.3 Impact of Various Graph Topologies 88 5.4.4 Impact of Node Representation 91 5.5 Discussion 92 5.6 Conclusion 93 6 Conclusion 94Docto

    Efficient and effective retrieval using Higher-Order proximity models

    Get PDF
    Information Retrieval systems are widely used to retrieve documents that are relevant to a user's information need. Systems leveraging proximity heuristics to estimate the relevance of a document have been shown to be effective. However, the computational cost of proximity-based models is rarely considered, which is an important concern over large-scale document collections. Large-scale collections also make collection-based evaluation challenging, since only a small number of documents are judged given the limited budget. Effectiveness, efficiency and reliable evaluation are coherent components that should be considered when developing a good retrieval system.

    This thesis makes several contributions across these three aspects. Many proximity-based retrieval models are effective, but it is also important to find efficient solutions for extracting proximity features, especially for models using higher-order proximity statistics. We therefore propose a one-pass algorithm based on the PlaneSweep approach. We demonstrate that the new one-pass algorithm reduces the cost of capturing the full dependency relations of a query, regardless of the input representations. Although our proposed methods can capture higher-order proximity features efficiently, the trade-offs between effectiveness and efficiency when using proximity-based models remain largely unexplored. We consider different variants of proximity statistics and demonstrate that using local proximity statistics can achieve an improved trade-off between effectiveness and efficiency.

    Another important aspect of IR is reliable system comparison. We conduct a series of experiments that explore the interaction between pooling and evaluation depth, the interaction between evaluation metrics and evaluation depth, and the correlations between two different evaluation metrics. We show that different evaluation configurations on large test collections, where only a limited number of relevance labels are available, can lead to different system comparison conclusions. We also demonstrate the pitfalls of choosing an arbitrary evaluation depth regardless of the metrics employed and the pooling depth of the test collections, and we provide suggestions on evaluation configurations for reliable comparisons of retrieval systems on large test collections.

    On these large test collections, a shallow judgment pool may be employed because the assumed budgets are often limited, which may lead to an imprecise evaluation of system performance, especially when a deep evaluation metric is used. We propose an estimation framework for estimating deep metric scores on shallow judgment pools. Starting from an initial shallow judgment pool, rank-level estimators are designed to estimate the effectiveness gain at each rank. Based on the rank-level estimations, we propose an optimization framework to obtain a more precise score estimate.
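
    As a rough illustration of how ordered proximity statistics can be extracted in a single pass over per-term position lists, the sketch below enumerates candidate windows that cover all query terms by merging the sorted positions with a heap. This is an assumed simplification based on the classic smallest-cover-window idea, not the thesis's one-pass PlaneSweep algorithm; the function name min_cover_windows and the toy positions are introduced only for this example.

    import heapq

    def min_cover_windows(positions):
        # positions: one sorted list of positions per query term.
        # Yields (start, end) candidate windows containing at least one occurrence of every term;
        # the tightest yielded window is the minimum cover window.
        heap = [(plist[0], t, 0) for t, plist in enumerate(positions)]
        heapq.heapify(heap)
        right = max(plist[0] for plist in positions)
        while True:
            left, term, idx = heapq.heappop(heap)
            yield (left, right)
            if idx + 1 == len(positions[term]):
                return  # this term has no more occurrences
            nxt = positions[term][idx + 1]
            right = max(right, nxt)
            heapq.heappush(heap, (nxt, term, idx + 1))

    # Example: positions of three query terms within one document.
    windows = list(min_cover_windows([[1, 10, 20], [4, 15], [7, 30]]))
    print(min(windows, key=lambda w: w[1] - w[0]))  # tightest window covering all terms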

    Neural networks for text matching

    Get PDF

    Structured learning for information retrieval

    Get PDF
    Information retrieval is the area of study concerned with the process of searching, recovering and interpreting information from large amounts of data. In this thesis we show that many problems in information retrieval are instances of structured learning, where the goal is to learn predictors of complex output structures consisting of many interdependent variables. We then attack these problems using principled machine learning methods that are specifically suited to such scenarios. In the process, we develop new models, new model extensions and new algorithms that, when integrated with existing methodology, comprise a new set of tools for solving a variety of information retrieval problems. Firstly, we cover the multi-label classification problem, where we seek to predict a set of labels associated with a given object; the output in this case is structured, as the output variables are interdependent. Secondly, we focus on document ranking, where, given a query and a set of documents associated with it, we want to rank them according to their relevance with respect to the query; here, again, we have a structured output - a ranking of documents. Thirdly, we address topic models, where we are given a set of documents and attempt to find a compact representation of them by learning latent topics and associating a topic distribution to each document; the output is again structured, consisting of word and topic distributions. For all of the above problems, we obtain state-of-the-art solutions, as attested by empirical performance on publicly available real-world datasets.
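
    To make the "structured output" view of document ranking concrete, the toy sketch below scores documents with a simple linear model and penalises mis-ordered pairs with a hinge loss. It is purely illustrative and not the models developed in the thesis; the features, weights and relevance labels are made up for the example.

    import numpy as np

    def rank_documents(w, doc_features):
        # Return document indices sorted by descending score w . x.
        scores = doc_features @ w
        return np.argsort(-scores)

    def pairwise_hinge_loss(w, doc_features, relevance):
        # Sum hinge losses over pairs where a less relevant document outscores a more relevant one.
        scores = doc_features @ w
        loss = 0.0
        for i in range(len(relevance)):
            for j in range(len(relevance)):
                if relevance[i] > relevance[j]:
                    loss += max(0.0, 1.0 - (scores[i] - scores[j]))
        return loss

    X = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # toy query-document features
    rel = np.array([2, 0, 1])                            # graded relevance labels
    w = np.array([1.0, -0.5])
    print(rank_documents(w, X), pairwise_hinge_loss(w, X, rel))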

    Entity-Oriented Search

    Get PDF
    This open access book covers all facets of entity-oriented search—where “search” can be interpreted in the broadest sense of information access—from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in-depth, the goal being to establish fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, containing numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book. The book is divided into three main parts, sandwiched between introductory and concluding chapters. The first two chapters introduce readers to the basic concepts, provide an overview of entity-oriented search tasks, and present the various types and sources of data that will be used throughout the book. Part I deals with the core task of entity ranking: given a textual query, possibly enriched with additional elements or structural hints, return a ranked list of entities. This core task is examined in a number of different variants, using both structured and unstructured data collections, and numerous query formulations. In turn, Part II is devoted to the role of entities in bridging unstructured and structured data. Part III explores how entities can enable search engines to understand the concepts, meaning, and intent behind the query that the user enters into the search box, and how they can provide rich and focused responses (as opposed to merely a list of documents)—a process known as semantic search. The final chapter concludes the book by discussing the limitations of current approaches, and suggesting directions for future research. Researchers and graduate students are the primary target audience of this book. A general background in information retrieval is sufficient to follow the material, including an understanding of basic probability and statistics concepts as well as a basic knowledge of machine learning concepts and supervised learning algorithms.

    Geographic information extraction from texts

    Get PDF
    A large volume of unstructured texts, containing valuable geographic information, is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although significant progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. Therefore, this workshop will provide a timely opportunity to not only discuss recent advances, new ideas, and concepts but also identify research gaps in geographic information extraction.

    The Janus Faced Scholar: a Festschrift in honour of Peter Ingwersen

    Get PDF
    • โ€ฆ
    corecore