127 research outputs found

    Retrieval for Extremely Long Queries and Documents with RPRS: a Highly Efficient and Effective Transformer-based Re-Ranker

    Retrieval with extremely long queries and documents is a well-known and challenging task in information retrieval and is commonly known as Query-by-Document (QBD) retrieval. Specifically designed Transformer models that can handle long input sequences have not shown high effectiveness in QBD tasks in previous work. We propose a Re-Ranker based on the novel Proportional Relevance Score (RPRS) to compute the relevance score between a query and the top-k candidate documents. Our extensive evaluation shows RPRS obtains significantly better results than the state-of-the-art models on five different datasets. Furthermore, RPRS is highly efficient since all documents can be pre-processed, embedded, and indexed before query time, which gives our re-ranker the advantage of having a complexity of O(N), where N is the total number of sentences in the query and candidate documents. Furthermore, our method solves the problem of low-resource training in QBD retrieval tasks, as it does not need large amounts of training data and has only three parameters with a limited range that can be optimized with a grid search even if only a small amount of labeled data is available. Our detailed analysis shows that RPRS benefits from covering the full length of candidate documents and queries. Comment: Accepted at ACM Transactions on Information Systems (ACM TOIS journal).

    Query refinement for patent prior art search

    A patent is a contract between the inventor and the state, granting the inventor a limited time period to exploit his invention. In exchange, the inventor must put a detailed description of his invention in the public domain. Patents can encourage innovation and economic growth, but in times of economic crisis patents can hamper such growth. The long duration of the application process is a big obstacle that needs to be addressed to maximize the benefit of patents for innovation and the economy. This time can be significantly reduced by changing the way we search the patent and non-patent literature. Despite the recent advancement of general information retrieval and the revolution of Web search engines, there is still a huge gap between the technologies emerging from research labs and adopted by major Internet search engines, and the systems in use by the patent search communities. In this thesis we investigate the problem of patent prior art search in patent retrieval with the goal of finding documents which describe the idea of a query patent. A query patent is a full patent application composed of hundreds of terms which does not represent a single focused information need. Other relevance evidence (e.g. classification tags and bibliographical data) provides additional details about the underlying information need of the query patent. The first goal of this thesis is to estimate a uni-gram query model from the textual fields of a query patent. We then improve the initial query representation using noun phrases extracted from the query patent. We show that expansion in a query-dependent manner is useful. The second contribution of this thesis is to address the term mismatch problem from a query formulation point of view by integrating multiple sources of relevance evidence associated with the query patent. To do this, we enhance the initial representation of the query with the term distribution of the community of inventors related to the topic of the query patent. We then build a lexicon using classification tags and show that query expansion using this lexicon and considering proximity information (between query and expansion terms) can improve the retrieval performance. We perform an empirical evaluation of our proposed models on two patent datasets. The experimental results show that our proposed models can achieve significantly better results than the baseline and other enhanced models.

    Automating the search for a patent's prior art with a full text similarity search

    More than ever, technical inventions are the symbol of our society's advance. Patents guarantee their creators protection against infringement. For an invention to be patentable, its novelty and inventiveness have to be assessed. Therefore, a search for published work that describes similar inventions to a given patent application needs to be performed. Currently, this so-called search for prior art is executed with semi-automatically composed keyword queries, which is not only time consuming, but also prone to errors. In particular, errors may systematically arise from the fact that different keywords for the same technical concepts may exist across disciplines. In this paper, a novel approach is proposed, where the full text of a given patent application is compared to existing patents using machine learning and natural language processing techniques to automatically detect inventions that are similar to the one described in the submitted document. Various state-of-the-art approaches for feature extraction and document comparison are evaluated. In addition, the quality of the current search process is assessed based on ratings of a domain expert. The evaluation results show that our automated approach, besides accelerating the search process, also improves the quality of the search results for prior art.

    Evaluating Information Retrieval and Access Tasks

    This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. They show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students—anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one

    DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding

    Recent advances in computer vision (CV) and natural language processing have been driven by exploiting big data on practical applications. However, these research fields are still limited by the sheer volume, versatility, and diversity of the available datasets. CV tasks, such as image captioning, which has primarily been carried out on natural images, still struggle to produce accurate and meaningful captions on sketched images often included in scientific and technical documents. The advancement of other tasks such as 3D reconstruction from 2D images requires larger datasets with multiple viewpoints. We introduce DeepPatent2, a large-scale dataset, providing more than 2.7 million technical drawings with 132,890 object names and 22,394 viewpoints extracted from 14 years of US design patent documents. We demonstrate the usefulness of DeepPatent2 with conceptual captioning. We further discuss the potential of our dataset to facilitate other research areas such as 3D image reconstruction and image retrieval.

    Construction of a Procedural Ontology Using Images and Text in Patents


    QuantumCLEF: A Shared-Task Proposal to Evaluate the Performance of Quantum Computing for Information Retrieval Systems

    Quantum Computing has been a focus of research for many researchers over the last few years. As a result of technological development, Quantum Computing resources are now becoming available and usable to solve practical problems, including in the Information Retrieval (IR) field. In this work, we first dive into the paradigms of Universal Quantum Computing and, in particular, Quantum Annealing, which is the main focus. We also show how problems such as Feature Selection, a well-known NP-Hard problem, can be formulated as Quadratic Unconstrained Binary Optimization (QUBO) problems and embedded into Quantum Annealers. Then we propose some possible Shared Tasks to evaluate the efficiency and effectiveness of Quantum Computing in the Information Retrieval field. These tasks will be proposed in the future to CLEF in order to start the QuantumCLEF evaluation campaign, whose aim is to acknowledge the potential benefits of Quantum Annealing technologies in the IR field and to create a common ground for the research community to start learning and employing these precious resources to improve the current state-of-the-art solutions. Finally, we design and implement a Submission System that can be employed in order to carry out the Shared Tasks. This system is designed to be scalable, secure and fault-tolerant.
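
As a rough illustration of what casting Feature Selection as a QUBO problem can look like, the sketch below builds a small matrix Q and minimizes x^T Q x over binary vectors x by exhaustive search. It is not the formulation used in the QuantumCLEF proposal, which the abstract does not spell out. The relevance and redundancy scores, the cardinality penalty, and all function names are hypothetical, and a classical brute-force loop stands in for a quantum annealer.

```python
from itertools import product


def qubo_energy(Q, x):
    """Return x^T Q x for a binary vector x and a square matrix Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def build_feature_selection_qubo(relevance, redundancy, k, penalty):
    """Build an upper-triangular QUBO matrix (assumed formulation).

    Minimizes  -sum_i relevance[i]*x_i + sum_{i<j} redundancy[i][j]*x_i*x_j
               + penalty * (sum_i x_i - k)^2   (constant term dropped).
    """
    n = len(relevance)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Since x_i^2 == x_i, the penalty contributes penalty*(1 - 2k) per feature.
        Q[i][i] = -relevance[i] + penalty * (1 - 2 * k)
        for j in range(i + 1, n):
            # Each selected pair pays its redundancy plus 2*penalty from the square.
            Q[i][j] = redundancy[i][j] + 2 * penalty
    return Q


def solve_brute_force(Q):
    """Enumerate all 2^n binary vectors; only feasible for very small n."""
    n = len(Q)
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))


# Hypothetical scores: features 0 and 1 are both relevant but highly redundant.
relevance = [0.9, 0.8, 0.3]
redundancy = [[0.0, 0.7, 0.1],
              [0.0, 0.0, 0.1],
              [0.0, 0.0, 0.0]]
Q = build_feature_selection_qubo(relevance, redundancy, k=2, penalty=2.0)
print(solve_brute_force(Q))  # (1, 0, 1)
```

With these toy scores the minimizer picks features 0 and 2: the redundancy between features 0 and 1 outweighs the extra relevance of feature 1. The exhaustive search grows as 2^n, which is the cost an annealer-based solver is meant to avoid.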