
    Query refinement for patent prior art search

    A patent is a contract between the inventor and the state, granting the inventor a limited time period to exploit his invention. In exchange, the inventor must put a detailed description of his invention in the public domain. Patents can encourage innovation and economic growth, but in times of economic crisis they can also hamper such growth. The long duration of the application process is a major obstacle that needs to be addressed to maximize the benefit of patents for innovation and the economy. This time can be significantly reduced by changing the way we search the patent and non-patent literature. Despite the recent advances in general information retrieval and the revolution in Web search engines, there is still a huge gap between the technologies emerging from research labs and adopted by major Internet search engines, and the systems in use by the patent search communities. In this thesis we investigate the problem of patent prior art search in patent retrieval, with the goal of finding documents which describe the idea of a query patent. A query patent is a full patent application composed of hundreds of terms and does not represent a single focused information need. Other relevance evidence (e.g., classification tags and bibliographical data) provides additional details about the underlying information need of the query patent. The first goal of this thesis is to estimate a uni-gram query model from the textual fields of a query patent. We then improve the initial query representation using noun phrases extracted from the query patent, and show that expansion in a query-dependent manner is useful. The second contribution of this thesis is to address the term mismatch problem from a query formulation point of view by integrating multiple sources of relevance evidence associated with the query patent. To do this, we enhance the initial representation of the query with the term distribution of the community of inventors related to the topic of the query patent. We then build a lexicon using classification tags and show that query expansion using this lexicon, while considering proximity information between query and expansion terms, can improve retrieval performance. We perform an empirical evaluation of our proposed models on two patent datasets. The experimental results show that our proposed models achieve significantly better results than the baseline and other enhanced models.
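
    The thesis works with a weighted uni-gram query model estimated from a patent's textual fields and expanded with noun phrases. The sketch below illustrates that general idea only; it is not the thesis's exact formulation, and the field weights and interpolation parameter are assumed values.

```python
# Illustrative sketch: estimate a uni-gram query model from the textual
# fields of a query patent, then interpolate it with a model built from
# extracted noun phrases. Field weights and lam are assumed values.
import re
from collections import Counter

FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "claims": 1.5, "description": 1.0}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def unigram_query_model(patent_fields):
    """P(t|Q) as a weighted mixture of per-field maximum-likelihood models."""
    model = Counter()
    for field, text in patent_fields.items():
        tokens = tokenize(text)
        if not tokens:
            continue
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term, count in Counter(tokens).items():
            model[term] += weight * count / len(tokens)
    total = sum(model.values())
    return {term: value / total for term, value in model.items()}

def expand_with_noun_phrases(query_model, noun_phrases, lam=0.7):
    """Interpolate the uni-gram model with a noun-phrase term model."""
    phrase_terms = Counter(t for phrase in noun_phrases for t in tokenize(phrase))
    total = sum(phrase_terms.values()) or 1
    terms = set(query_model) | set(phrase_terms)
    return {t: lam * query_model.get(t, 0.0) + (1 - lam) * phrase_terms[t] / total
            for t in terms}
```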

    Neural pseudo-relevance feedback models for information retrieval

    Verbatim queries submitted to search engines often do not sufficiently describe the user’s search intent. Moreover, even with well-formed user queries, retrieval failures can still occur, caused by lexical or semantic mismatches, or both, between the language of the user’s query and that used in the relevant documents. Pseudo-relevance feedback (PRF) techniques, which modify a query’s representation using top-ranked documents, have been shown to overcome such inadequacies and improve retrieval effectiveness. In this thesis, we argue that pseudo-relevance feedback information can be used in neural-based models to improve retrieval effectiveness, for both the sparse retrieval and dense retrieval paradigms. Indeed, recent advancements in pretrained generative language models, such as T5 and FlanT5, have demonstrated their ability to generate textual responses that are relevant to a given prompt. In light of this success, we study the capacity of such models to perform query reformulation and how they compare with long-standing query reformulation methods that use pseudo-relevance feedback. In particular, we investigate two representative query reformulation frameworks, GenQR and GenPRF. Specifically, GenQR directly reformulates the user’s input query, while GenPRF provides additional context for the query by making use of pseudo-relevance feedback information from the top-ranked documents. For each reformulation method, we leverage different techniques, including fine-tuning and direct prompting, to harness the knowledge of language models. The reformulated queries produced by the generative models are demonstrated to markedly benefit the effectiveness of sparse retrieval on various TREC test collections. In addition, dense retrieval models, in both the single representation and multiple representation dense retrieval paradigms, have shown higher effectiveness than traditional sparse retrieval by mitigating the lexical and semantic mismatch issues to some extent. However, underrepresented queries can still cause retrieval failures. In particular, in this thesis, we investigate the potential for multiple representation dense retrieval (exemplified by ColBERT) to be enhanced using pseudo-relevance feedback, and thereby present our proposed approach, ColBERT-PRF. More specifically, ColBERT-PRF extracts representative feedback embeddings from the document embeddings of the pseudo-relevant set and uses the corresponding token statistics to identify good expansion embeddings among the representative embeddings. These expansion embeddings are then appended to the original query representation to form a refined query representation. We show that these additional expansion embeddings benefit the effectiveness of a reranking of the initial query results as well as an additional dense retrieval operation. Evaluation experiments conducted on the MSMARCO passage and document ranking tasks as well as the TREC Robust04 document ranking task demonstrate the effectiveness of our proposed ColBERT-PRF technique. In addition, we study the effectiveness of variants of the ColBERT-PRF model with different weighting methods. Finally, we show that ColBERT-PRF can be made more efficient, with little impact on effectiveness, through the application of approximate scoring and different clustering methods. While PRF techniques are effective in closing the vocabulary gap between the user’s query formulations and the relevant documents, they are typically applied on the same target corpus as the final retrieval.
    In the past, external expansion techniques have sometimes been applied to obtain a high-quality pseudo-relevant feedback set using a high-quality external corpus. However, such external expansion approaches have only been studied for sparse retrieval, and their effectiveness for recent dense retrieval methods remains under-explored. Moreover, dense retrieval approaches such as ANCE and ColBERT have been shown to face challenges in out-of-domain evaluations, due to the knowledge shift between different domains. Therefore, in this thesis, we propose a dense external expansion technique to improve the zero-shot retrieval effectiveness of both single and multiple representation dense retrieval. In particular, we employ the MSMARCO passage collection as the external corpus. Experimental results on two TREC datasets indicate the effectiveness of our proposed external dense query expansion techniques for both sparse retrieval and single or multiple representation dense retrieval. Furthermore, we note that ColBERT has only been applied with the BERT model and its corresponding WordPiece tokeniser. However, the effect of the pre-trained model and the tokenisation method on the contextualised late interaction mechanism used by ColBERT is not well understood. Therefore, in this thesis, we extend ColBERT to Col⋆ and ColBERT-PRF to Col⋆-PRF, by generalising the de facto standard BERT PLM to various different PLMs. As different tokenisation methods can directly impact the matching behaviour within the late interaction mechanism, we study the nature of the matches occurring in different Col⋆ and Col⋆-PRF models, and further quantify the contributions of lexical and semantic matching to retrieval effectiveness. Finally, both the ColBERT-PRF and Col⋆-PRF models perform dense query expansion in an unsupervised manner and might be affected by heuristic techniques such as clustering and IDF statistics. Therefore, in this thesis, we propose a contrastive solution that learns to select the most useful embeddings for expansion. More specifically, a deep language model-based contrastive weighting model, called CWPRF, is trained to discriminate between relevant and non-relevant documents for semantic search. Our experimental results show that our contrastive weighting model can aid in selecting useful expansion embeddings and outperforms various baselines. In particular, CWPRF can further improve nDCG@10 by up to 4.1% compared to our proposed ColBERT-PRF approach while maintaining its efficiency.
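
    As described above, ColBERT-PRF clusters the token embeddings of the pseudo-relevant documents and selects expansion embeddings using token statistics. The following is a simplified sketch of that idea, not the thesis's exact implementation; the hyper-parameters and the token_idf lookup are assumed inputs.

```python
# Simplified sketch of the ColBERT-PRF idea: cluster feedback token
# embeddings, score each centroid by the IDF of its nearest token, and
# append the highest-scoring centroids to the query embeddings.
import numpy as np
from sklearn.cluster import KMeans

def colbert_prf_expand(query_embs, feedback_embs, feedback_tokens, token_idf,
                       K=24, fb_embs=10, beta=1.0):
    """query_embs: (q, d) array of query token embeddings;
    feedback_embs: (n, d) array of token embeddings from the pseudo-relevant
    documents; feedback_tokens: list of n token strings aligned with them."""
    centroids = KMeans(n_clusters=K, n_init=10).fit(feedback_embs).cluster_centers_

    # Score each centroid by the IDF of the feedback token closest to it.
    sims = centroids @ feedback_embs.T                       # (K, n) similarities
    nearest = sims.argmax(axis=1)
    scores = np.array([token_idf.get(feedback_tokens[i], 0.0) for i in nearest])

    top = np.argsort(-scores)[:fb_embs]
    expansion = beta * centroids[top]                         # weighted expansion embeddings
    return np.vstack([query_embs, expansion])                 # refined query representation
```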

    An Approach to Guide Users Towards Less Revealing Internet Browsers

    When browsing the Internet, HTTP headers enable both clients and servers to send extra data in their requests or responses, such as the User-Agent string. This string contains information related to the sender’s device, browser, and operating system. Previous research has shown that numerous privacy and security risks result from exposing sensitive information in the User-Agent string. For example, it enables device and browser fingerprinting as well as user tracking and identification. Our large-scale analysis of thousands of User-Agent strings shows that browsers differ tremendously in the amount of information they include in their User-Agent strings. As such, our work aims at guiding users towards less revealing browsers. In doing so, we propose to assign an exposure score to browsers based on the information they expose and on vulnerability records. Our contribution in this work is twofold: first, we provide a full implementation that is ready to be deployed and used; second, we conduct a user study to identify the effectiveness and limitations of our proposed approach. Our implementation is based on more than 52 thousand unique browsers. Our performance and validation analyses show that our solution is accurate and efficient. The source code and data set are publicly available, and the solution has been deployed.
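
    The exposure score itself is not specified in the abstract; the sketch below is a hypothetical illustration of how presence-weighted User-Agent fields and vulnerability records could be combined into such a score. The field patterns, weights, and vulnerability_count input are assumptions, not the paper's actual formula.

```python
# Hypothetical sketch: count how many identifying fields a User-Agent string
# reveals, weight them, and add a penalty from a vulnerability lookup.
import re

FIELD_PATTERNS = {
    "browser_version": r"(Chrome|Firefox|Safari|Edg|OPR)/[\d.]+",
    "os":              r"(Windows NT [\d.]+|Mac OS X [\d_]+|Android [\d.]+|Linux)",
    "device_model":    r"\((?:[^;)]*;\s*)*([A-Z][\w-]+ Build)",
    "engine_version":  r"(AppleWebKit|Gecko)/[\d.]+",
}
FIELD_WEIGHTS = {"browser_version": 3, "os": 2, "device_model": 4, "engine_version": 1}

def exposure_score(user_agent, vulnerability_count=0):
    """Higher score = more revealing User-Agent string."""
    score = sum(FIELD_WEIGHTS[f] for f, pattern in FIELD_PATTERNS.items()
                if re.search(pattern, user_agent))
    return score + vulnerability_count  # e.g. known CVEs for the exposed versions

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
print(exposure_score(ua))  # browser + OS + engine exposed -> score 6
```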

    Continuous Rationale Management

    Continuous Software Engineering (CSE) is a software life cycle model open to frequent changes in requirements or technology. During CSE, software developers continuously make decisions on the requirements and design of the software or the development process. They establish essential decision knowledge, which they need to document and share so that it supports the evolution of and changes to the software. The management of decision knowledge is called rationale management. Rationale management provides an opportunity to support the change process during CSE. However, rationale management is not well integrated into CSE. The overall goal of this dissertation is to provide workflows and tool support for continuous rationale management. The dissertation contributes an interview study with practitioners from industry, which investigates rationale management problems, current practices, and features for supporting continuous rationale management that are beneficial for practitioners. The problems of rationale management in practice are threefold: First, documenting decision knowledge is intrusive to the development process and requires additional effort. Second, the large amount of distributed decision knowledge documentation is difficult to access and use. Third, the documented knowledge can be of low quality, e.g., outdated, which impedes its use. The dissertation contributes a systematic mapping study on recommendation and classification approaches to treat these rationale management problems. The major contribution of this dissertation is a validated approach for continuous rationale management consisting of the ConRat life cycle model extension and the comprehensive ConDec tool support. To reduce intrusiveness and additional effort, ConRat integrates rationale management activities into existing workflows, such as requirements elicitation, development, and meetings. ConDec integrates into standard development tools instead of providing a separate tool. ConDec enables lightweight capture and use of decision knowledge from various artifacts and reduces the developers' effort through automatic text classification, recommendation, and nudging mechanisms for rationale management. To enable access to and use of distributed decision knowledge documentation, ConRat defines a knowledge model of decision knowledge and other artifacts. ConDec instantiates the model as a knowledge graph and offers interactive knowledge views with useful tailoring, e.g., transitive linking. To operationalize high quality, ConRat introduces the rationale backlog, a definition of done for knowledge documentation, and metrics for intra-rationale completeness and for the decision coverage of requirements and code. ConDec implements these agile concepts for rationale management, together with a knowledge dashboard. ConDec also supports consistent changes through change impact analysis. The dissertation shows the feasibility, effectiveness, and user acceptance of ConRat and ConDec in six case study projects in an industrial setting. In addition, it comprehensively analyses the rationale documentation created in the projects. The validation indicates that ConRat and ConDec benefit CSE projects. Based on the dissertation, continuous rationale management should become a standard part of CSE, like automated testing or continuous integration.
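
    ConDec's knowledge model and metrics are only summarised above; the sketch below is a hypothetical illustration of a decision knowledge graph and a decision-coverage metric, not the ConDec implementation. Node types, link distance, and the example elements are assumed.

```python
# Hypothetical sketch: decision knowledge elements and artifacts as a graph,
# with "decision coverage" = share of requirements reachable from at least
# one documented decision within a given link distance.
import networkx as nx

g = nx.Graph()
g.add_nodes_from(["REQ-1", "REQ-2"], type="requirement")
g.add_nodes_from(["DEC-1"], type="decision")
g.add_nodes_from(["CodeA.java"], type="code")
g.add_edges_from([("DEC-1", "REQ-1"), ("DEC-1", "CodeA.java")])

def decision_coverage(graph, max_distance=2):
    reqs = [n for n, d in graph.nodes(data=True) if d.get("type") == "requirement"]
    decisions = {n for n, d in graph.nodes(data=True) if d.get("type") == "decision"}
    covered = sum(
        any(nx.has_path(graph, r, dec)
            and nx.shortest_path_length(graph, r, dec) <= max_distance
            for dec in decisions)
        for r in reqs)
    return covered / len(reqs) if reqs else 0.0

print(decision_coverage(g))  # 0.5: REQ-1 is covered by DEC-1, REQ-2 is not
```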

    Q(sqrt(-3))-Integral Points on a Mordell Curve

    We use an extension of quadratic Chabauty to number fields, recently developed by the author with Balakrishnan, Besser and Müller, combined with a sieving technique, to determine the integral points over Q(√−3) on the Mordell curve y² = x³ − 4.

    AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model

    © 2020, The Author(s). The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. Previous methods, such as the Bayesian-based and genetic-based optimisation implemented in Auto-Weka, Auto-sklearn and TPOT, evaluate pipelines by executing them. Therefore, pipeline composition and optimisation with these methods requires a tremendous amount of time, which prevents them from exploring complex pipelines to find better predictive models. To explore this research challenge further, we have conducted experiments showing that many of the generated pipelines are invalid, and that it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR). AVATAR accelerates automatic ML pipeline composition and optimisation by quickly discarding invalid pipelines. Our experiments show that AVATAR is more efficient at evaluating complex pipelines than the traditional evaluation approaches that require their execution.
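
    The abstract does not detail the surrogate model; the sketch below illustrates the underlying idea of validating a pipeline without executing it, by propagating data properties through declared component capabilities. The component table and property names are assumed for demonstration and are not the actual AVATAR surrogate.

```python
# Illustrative sketch: each component declares which data properties it
# requires and which it establishes, so pipeline validity can be checked
# by propagating properties instead of executing the pipeline.
COMPONENTS = {
    "Imputer":        {"requires": set(),            "adds": {"no_missing"}},
    "OneHotEncoder":  {"requires": set(),            "adds": {"numeric_only"}},
    "StandardScaler": {"requires": {"numeric_only"}, "adds": set()},
    "LinearSVC":      {"requires": {"numeric_only", "no_missing"}, "adds": set()},
}

def is_valid(pipeline, initial_properties):
    """Propagate data properties through the pipeline; fail fast on a mismatch."""
    props = set(initial_properties)
    for step in pipeline:
        capability = COMPONENTS[step]
        if not capability["requires"] <= props:
            return False                      # unmet precondition: invalid pipeline
        props |= capability["adds"]
    return True

# Raw data with categorical features and missing values (no properties yet):
print(is_valid(["StandardScaler", "LinearSVC"], set()))                               # False
print(is_valid(["Imputer", "OneHotEncoder", "StandardScaler", "LinearSVC"], set()))   # True
```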