Optimizing Two-Stage Bigram Language Models for IR
ABSTRACT Although higher-order language models (LMs) have shown the benefit of capturing word dependencies for information retrieval (IR), tuning the increased number of free parameters remains a formidable engineering challenge. Consequently, in many real-world retrieval systems, applying higher-order LMs is the exception rather than the rule. In this study, we address the parameter-tuning problem using a framework based on a linear ranking model in which different component models are incorporated as features. Using unigram and bigram LMs with two-stage smoothing as examples, we show that our method leads to a bigram LM that significantly outperforms its unigram counterpart and the well-tuned BM25 model.
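As background for the abstract above, two-stage smoothing first applies a Dirichlet prior and then Jelinek-Mercer interpolation with a background model. The sketch below shows the unigram case only; the parameter names (`mu`, `lam`) and values are illustrative defaults, not the tuned settings from the paper, and the collection model stands in for the user background model.

```python
from collections import Counter

def two_stage_prob(word, doc_counts, doc_len, coll_prob, mu=1000.0, lam=0.1):
    """Two-stage smoothed document LM probability: a Dirichlet prior
    (mu) smooths the document estimate toward the collection model,
    then Jelinek-Mercer interpolation (lam) mixes in a background
    model (here, the collection model again for simplicity)."""
    dirichlet = (doc_counts.get(word, 0) + mu * coll_prob) / (doc_len + mu)
    return (1.0 - lam) * dirichlet + lam * coll_prob

# A bigram version is analogous, with counts over adjacent word pairs;
# a linear ranker would then combine the unigram and bigram
# log-likelihoods as two weighted features.
doc = Counter({"language": 3, "model": 2})
p = two_stage_prob("language", doc, doc_len=5, coll_prob=0.01)
```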
Learning to Efficiently Rank
Web search engines allow users to find information on almost any topic imaginable. To be successful, a search engine must return relevant information to the user in a short amount of time. However, efficiency (speed) and effectiveness (relevance) are competing forces that often counteract each other. It is often the case that methods developed for improving effectiveness incur moderate-to-large computational costs; thus, sustained effectiveness gains typically have to be counterbalanced by buying more or faster hardware, implementing caching strategies where possible, or spending additional effort on low-level optimizations.
This thesis describes the "Learning to Efficiently Rank" framework for building highly effective ranking models for Web-scale data without sacrificing run-time efficiency for returning results. It introduces new classes of ranking models that can be simultaneously fast and effective, and discusses how to optimize the models for both speed and effectiveness. More specifically, a series of concrete instantiations of the general "Learning to Efficiently Rank" framework are illustrated in detail. First, given a desired tradeoff between effectiveness and efficiency, efficient linear models are introduced that directly optimize the tradeoff metric and achieve an optimal balance between the two. Second, temporally constrained models for returning the most effective ranked results possible under a time constraint are described. Third, a cascade ranking model for efficient top-K retrieval over Web-scale document collections is proposed, in which ranking effectiveness and efficiency are simultaneously optimized. Finally, a constrained cascade that returns results within time constraints by simultaneously reducing the document set size and pruning unnecessary features is discussed in detail.
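The cascade idea described above can be illustrated with a minimal sketch: each stage applies a progressively more expensive scorer to a shrinking candidate pool, so costly features are only computed for documents that survived the cheaper stages. The function and stage structure below are assumptions for illustration, not the thesis's actual learned cascade.

```python
def cascade_rank(candidates, stages, k=10):
    """Toy cascade ranker. `stages` is a list of (scorer, cutoff)
    pairs, ordered cheap-to-expensive: each stage scores the
    surviving candidates and keeps only the top `cutoff` of them,
    so later (costlier) scorers see far fewer documents."""
    pool = list(candidates)
    for scorer, cutoff in stages:
        pool = sorted(pool, key=scorer, reverse=True)[:cutoff]
    return pool[:k]

# Example: a cheap first-stage filter followed by a pricier reranker.
docs = list(range(100))
stages = [
    (lambda d: -abs(d - 50), 20),  # cheap proxy score: prefer docs near 50
    (lambda d: d % 7, 5),          # stand-in for an expensive scorer
]
top = cascade_rank(docs, stages, k=3)
```

In a learned cascade the cutoffs and scorers would be optimized jointly for the effectiveness/efficiency tradeoff rather than fixed by hand as here.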
Utilizing Knowledge Bases In Information Retrieval For Clinical Decision Support And Precision Medicine
Accurately answering queries that describe a clinical case and aim at finding articles in a collection of medical literature requires utilizing knowledge bases to capture many explicit and latent aspects of such queries. Proper representation of these aspects needs knowledge-based query understanding methods that identify the most important query concepts, as well as knowledge-based query reformulation methods that add new concepts to a query. In the tasks of Clinical Decision Support (CDS) and Precision Medicine (PM), the query and collection documents may have a complex structure with different components, such as diseases and genetic variants, that must be transformed to enable effective information retrieval. In this work, we propose methods for representing domain-specific queries based on weighted concepts of different types, whether they occur in the query itself or are extracted from knowledge bases and top-retrieved documents. In addition, we propose an optimization framework that unifies query analysis and expansion by jointly determining the importance weights of query and expansion concepts depending on their type and source. We also propose a probabilistic model to reformulate the query given genetic information in the query and collection documents. We observe significant improvements in retrieval accuracy for our proposed methods over state-of-the-art baselines on the clinical decision support and precision medicine tasks.
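The weighted-concept representation described above can be sketched schematically: each concept carries a type, and its weight depends on whether it came from the query or from expansion sources. Everything here (names, types, fixed weights) is a hypothetical illustration; the thesis learns these weights jointly in an optimization framework rather than fixing them per type as below.

```python
def expand_query(query_concepts, expansion_concepts, weights):
    """Build a weighted concept query. Each concept is a (name, type)
    pair; `weights` maps a (source, type) pair to an importance
    weight, so concepts contribute according to both where they came
    from and what kind of concept they are."""
    weighted = {}
    for source, concepts in (("query", query_concepts),
                             ("expansion", expansion_concepts)):
        for concept, ctype in concepts:
            weighted[concept] = weighted.get(concept, 0.0) + weights[(source, ctype)]
    return weighted

# Hypothetical PM-style example: a disease from the query, a gene
# variant pulled in from a knowledge base.
w = {("query", "disease"): 1.0,
     ("expansion", "disease"): 0.3,
     ("expansion", "gene"): 0.5}
q = expand_query([("melanoma", "disease")],
                 [("BRAF", "gene"), ("melanoma", "disease")], w)
```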
SEARCHING BASED ON QUERY DOCUMENTS
Searches can start with query documents, where search queries are formulated based on document-level descriptions. This type of search is more common in domain-specific search environments. For example, in patent retrieval, one major search task is finding relevant information for new (query) patents, and search queries are generated from the query patents. One unique characteristic of this search is that the search process can take longer and be more comprehensive compared to general web search. As an example, to complete a single patent retrieval task, a typical user may generate 15 queries and examine more than 100 retrieved documents. In these search environments, searchers need to formulate multiple queries based on query documents that are typically complex and difficult to understand. In this work, we describe methods for automatically generating queries and diversifying search results based on query documents, which can be used for query suggestion and for improving the quality of retrieval results. In particular, we focus on resolving three main issues related to query document-based searches: (1) query generation, (2) query suggestion and formulation, and (3) search result diversification. Automatic query generation helps users by reducing the burden of formulating queries from query documents. Using generated queries as suggestions is investigated as a method of presenting alternative queries. Search result diversification is important in domain-specific search because of the nature of the query documents. Since query documents generally contain long, complex descriptions, diverse query topics can be identified, and a range of relevant documents can be found that are related to these diverse topics. The proposed methods we study in this thesis explicitly address these three issues.
To solve the query generation issue, we use binary decision trees to generate effective Boolean queries and label propagation to formulate more effective phrasal-concept queries. In order to diversify search results, we propose two different approaches: query-side and result-level diversification. To generate diverse queries, we identify important topics from query documents and generate queries based on the identified topics. For result-level diversification, we extract query topics from query documents and apply state-of-the-art diversification algorithms based on the extracted topics. In addition, we devise query suggestion techniques for each query generation method. To demonstrate the effectiveness of our approach, we conduct experiments on various domain-specific search tasks and devise appropriate evaluation measures for domain-specific search environments.
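To make the result-level diversification step concrete, the sketch below uses Maximal Marginal Relevance (MMR), one well-known diversification algorithm; whether the thesis uses MMR specifically is an assumption here, and the similarity functions are toy stand-ins.

```python
def mmr(query_sim, doc_sim, docs, k=5, lam=0.7):
    """Maximal Marginal Relevance re-ranking: greedily pick the
    document that best trades off relevance to the query (query_sim)
    against redundancy with already-selected documents (doc_sim)."""
    selected, pool = [], list(docs)
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda d: lam * query_sim(d)
            - (1 - lam) * max((doc_sim(d, s) for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected

# Toy example: documents as term sets, Jaccard overlap as similarity.
docs = {"a": {"ir", "patent"}, "b": {"ir", "patent"}, "c": {"ir", "web"}}
query_sim = lambda d: len(docs[d] & {"ir", "patent"}) / 2
doc_sim = lambda x, y: len(docs[x] & docs[y]) / len(docs[x] | docs[y])
order = mmr(query_sim, doc_sim, ["a", "b", "c"], k=2, lam=0.5)
```

With `lam=0.5`, the near-duplicate "b" loses to the less relevant but novel "c" in the second slot, which is exactly the behavior diversification is after.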
Discovery of Linguistic Relations Using Lexical Attraction
This work has been motivated by two long-term goals: to understand how humans learn language and to build programs that can understand language. Using a representation that makes the relevant features explicit is a prerequisite for successful learning and understanding. Therefore, I chose to represent relations between individual words explicitly in my model. Lexical attraction is defined as the likelihood of such relations. I introduce a new class of probabilistic language models named lexical attraction models, which can represent long-distance relations between words, and I formalize this new class of models using information theory.
Within the framework of lexical attraction, I developed an unsupervised language acquisition program that learns to identify linguistic relations in a given sentence. The only explicitly represented linguistic knowledge in the program is lexical attraction. There is no initial grammar or lexicon built in, and the only input is raw text. Learning and processing are interdigitated: the processor uses the regularities detected by the learner to impose structure on the input, and this structure enables the learner to detect higher-level regularities. Using this bootstrapping procedure, the program was trained on 100 million words of Associated Press material and was able to achieve 60% precision and 50% recall in finding relations between content words. Using knowledge of lexical attraction, the program can identify the correct relations in syntactically ambiguous sentences such as "I saw the Statue of Liberty flying over New York."
Comment: dissertation, 56 pages
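Since the abstract says lexical attraction is formalized with information theory, a simple co-occurrence-based pointwise mutual information (PMI) estimate gives the flavor of such a measure. This is a stand-in for illustration only; the dissertation's actual estimator over linked word pairs differs.

```python
import math
from collections import Counter

def pmi_table(sentences):
    """Estimate pointwise mutual information for word pairs that
    co-occur in a sentence -- a simple information-theoretic proxy
    for lexical attraction between words."""
    uni, pair = Counter(), Counter()
    n_words = n_pairs = 0
    for sent in sentences:
        n_words += len(sent)
        uni.update(sent)
        for i in range(len(sent)):
            for j in range(i + 1, len(sent)):
                pair[frozenset((sent[i], sent[j]))] += 1
                n_pairs += 1

    def pmi(w1, w2):
        c = pair[frozenset((w1, w2))]
        if c == 0:
            return float("-inf")  # never observed together
        p_xy = c / n_pairs
        return math.log2(p_xy / ((uni[w1] / n_words) * (uni[w2] / n_words)))

    return pmi

pmi = pmi_table([["statue", "of", "liberty"],
                 ["statue", "of", "liberty"],
                 ["new", "york"]])
```

Words that reliably co-occur ("statue"/"liberty") get positive attraction, while unrelated pairs get none, which is the signal the learner can bootstrap structure from.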
Sequeval: an offline evaluation framework for sequence-based recommender systems
Recommender systems have gained a lot of popularity due to their wide adoption in industries such as entertainment and tourism. Numerous research efforts have focused on formulating and advancing the state of the art in systems that recommend the right set of items to the right person. However, these recommender systems are hard to compare, since published evaluation results are computed on diverse datasets and obtained using different methodologies. In this paper, we present an offline evaluation framework called Sequeval that is designed to evaluate recommender systems capable of suggesting sequences of items. We provide a mathematical definition of such sequence-based recommenders, a methodology for performing their evaluation, and the implementation details of eight metrics. We report the lessons learned from using this framework to assess the performance of four baselines and two recommender systems based on Conditional Random Fields (CRF) and Recurrent Neural Networks (RNN), considering two different datasets. Sequeval is publicly available, and it aims to become a focal point for researchers and practitioners experimenting with sequence-based recommender systems, providing comparable and objective evaluation results.
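To illustrate the kind of per-sequence metric such an offline framework aggregates, here is a minimal precision-style measure comparing a recommended sequence against a held-out one. This is a generic sketch, not Sequeval's actual API or one of its eight metrics.

```python
def sequence_precision(recommended, ground_truth):
    """Fraction of recommended items that also appear in the held-out
    sequence -- a minimal example of a per-sequence offline metric.
    An evaluation framework would average this over all test users."""
    if not recommended:
        return 0.0
    truth = set(ground_truth)
    return sum(1 for item in recommended if item in truth) / len(recommended)

score = sequence_precision(["a", "b", "c", "d"], ["b", "d", "e"])
```

Order-sensitive variants (e.g. penalizing items recommended in the wrong position) are a natural refinement for sequence-based recommenders, since item order is exactly what distinguishes them from set-based ones.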