
    Query refinement for patent prior art search

    A patent is a contract between the inventor and the state, granting the inventor a limited time period to exploit his invention. In exchange, the inventor must put a detailed description of his invention in the public domain. Patents can encourage innovation and economic growth, but in times of economic crisis they can also hamper such growth. The long duration of the application process is a major obstacle that needs to be addressed to maximize the benefit of patents for innovation and the economy. This time can be significantly reduced by changing the way we search the patent and non-patent literature. Despite the recent advances in general information retrieval and the revolution in Web search engines, there is still a huge gap between the technologies emerging from research labs and adopted by major Internet search engines, and the systems in use by the patent search communities. In this thesis we investigate the problem of patent prior art search in patent retrieval, with the goal of finding documents which describe the idea of a query patent. A query patent is a full patent application composed of hundreds of terms and does not represent a single focused information need. Other relevance evidence (e.g., classification tags and bibliographical data) provides additional details about the underlying information need of the query patent. The first goal of this thesis is to estimate a uni-gram query model from the textual fields of a query patent. We then improve the initial query representation using noun phrases extracted from the query patent, and show that expansion in a query-dependent manner is useful. The second contribution of this thesis is to address the term mismatch problem from a query formulation point of view by integrating multiple sources of relevance evidence associated with the query patent. To do this, we enhance the initial representation of the query with the term distribution of the community of inventors related to the topic of the query patent. We then build a lexicon using classification tags and show that query expansion using this lexicon, while considering proximity information between query and expansion terms, can improve retrieval performance. We perform an empirical evaluation of our proposed models on two patent datasets. The experimental results show that our proposed models achieve significantly better results than the baseline and other enhanced models.
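
    The thesis works with a weighted uni-gram query model estimated from a patent's textual fields and expanded with noun phrases. The sketch below illustrates that general idea only; it is not the thesis's exact formulation, and the field weights and interpolation parameter are assumed values.

```python
# Illustrative sketch: estimate a uni-gram query model from the textual
# fields of a query patent, then interpolate it with a model built from
# extracted noun phrases. Field weights and lam are assumed values.
import re
from collections import Counter

FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "claims": 1.5, "description": 1.0}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def unigram_query_model(patent_fields):
    """P(t|Q) as a weighted mixture of per-field maximum-likelihood models."""
    model = Counter()
    for field, text in patent_fields.items():
        tokens = tokenize(text)
        if not tokens:
            continue
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term, count in Counter(tokens).items():
            model[term] += weight * count / len(tokens)
    total = sum(model.values())
    return {term: value / total for term, value in model.items()}

def expand_with_noun_phrases(query_model, noun_phrases, lam=0.7):
    """Interpolate the uni-gram model with a noun-phrase term model."""
    phrase_terms = Counter(t for phrase in noun_phrases for t in tokenize(phrase))
    total = sum(phrase_terms.values()) or 1
    terms = set(query_model) | set(phrase_terms)
    return {t: lam * query_model.get(t, 0.0) + (1 - lam) * phrase_terms[t] / total
            for t in terms}
```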

    Neural pseudo-relevance feedback models for information retrieval

    Verbatim queries submitted to search engines often do not sufficiently describe the user’s search intent. Moreover, even with well-formed user queries, retrieval failures can still occur, caused by lexical or semantic mismatches, or both, between the language of the user’s query and that used in the relevant documents. Pseudo-relevance feedback (PRF) techniques, which modify a query’s representation using top-ranked documents, have been shown to overcome such inadequacies and improve retrieval effectiveness. In this thesis, we argue that pseudo-relevance feedback information can be used in neural-based models to improve retrieval effectiveness, for both the sparse retrieval and dense retrieval paradigms. Indeed, recent advancements in pretrained generative language models, such as T5 and FlanT5, have demonstrated their ability to generate textual responses that are relevant to a given prompt. In light of this success, we study the capacity of such models to perform query reformulation and how they compare with long-standing query reformulation methods that use pseudo-relevance feedback. In particular, we investigate two representative query reformulation frameworks, GenQR and GenPRF. Specifically, GenQR directly reformulates the user’s input query, while GenPRF provides additional context for the query by making use of pseudo-relevance feedback information from the top-ranked documents. For each reformulation method, we leverage different techniques, including fine-tuning and direct prompting, to harness the knowledge of language models. The reformulated queries produced by the generative models are demonstrated to markedly benefit the effectiveness of sparse retrieval on various TREC test collections. In addition, dense retrieval models, in both the single representation and multiple representation dense retrieval paradigms, have shown higher effectiveness than traditional sparse retrieval by mitigating the lexical and semantic mismatch issues to some extent. However, underrepresented queries can still cause retrieval failures. In particular, in this thesis, we investigate the potential for multiple representation dense retrieval (exemplified by ColBERT) to be enhanced using pseudo-relevance feedback, and thereby present our proposed approach, ColBERT-PRF. More specifically, ColBERT-PRF extracts representative feedback embeddings from the document embeddings of the pseudo-relevant set and uses the corresponding token statistics to identify good expansion embeddings among the representative embeddings. These expansion embeddings are then appended to the original query representation to form a refined query representation. We show that these additional expansion embeddings benefit the effectiveness of a reranking of the initial query results as well as an additional dense retrieval operation. Evaluation experiments conducted on the MSMARCO passage and document ranking tasks as well as the TREC Robust04 document ranking task demonstrate the effectiveness of our proposed ColBERT-PRF technique. In addition, we study the effectiveness of variants of the ColBERT-PRF model with different weighting methods. Finally, we show that ColBERT-PRF can be made more efficient, with little impact on effectiveness, through the application of approximate scoring and different clustering methods. While PRF techniques are effective in closing the vocabulary gap between the user’s query formulations and the relevant documents, they are typically applied on the same target corpus as the final retrieval.
    In the past, external expansion techniques have sometimes been applied to obtain a high-quality pseudo-relevant feedback set using a high-quality external corpus. However, such external expansion approaches have only been studied for sparse retrieval, and their effectiveness for recent dense retrieval methods remains under-explored. Moreover, dense retrieval approaches such as ANCE and ColBERT have been shown to face challenges in out-of-domain evaluations, due to the knowledge shift between different domains. Therefore, in this thesis, we propose a dense external expansion technique to improve the zero-shot retrieval effectiveness of both single and multiple representation dense retrieval. In particular, we employ the MSMARCO passage collection as the external corpus. Experimental results on two TREC datasets indicate the effectiveness of our proposed external dense query expansion techniques for both sparse retrieval and single or multiple representation dense retrieval. Furthermore, we note that ColBERT has only been applied with the BERT model and its corresponding WordPiece tokeniser. However, the effect of the pre-trained model and the tokenisation method on the contextualised late interaction mechanism used by ColBERT is not well understood. Therefore, in this thesis, we extend ColBERT to Col⋆ and ColBERT-PRF to Col⋆-PRF, by generalising the de facto standard BERT PLM to various different PLMs. As different tokenisation methods can directly impact the matching behaviour within the late interaction mechanism, we study the nature of the matches occurring in different Col⋆ and Col⋆-PRF models, and further quantify the contributions of lexical and semantic matching to retrieval effectiveness. Finally, both the ColBERT-PRF and Col⋆-PRF models perform dense query expansion in an unsupervised manner and might be affected by heuristic techniques such as clustering and IDF statistics. Therefore, in this thesis, we propose a contrastive solution that learns to select the most useful embeddings for expansion. More specifically, a deep language model-based contrastive weighting model, called CWPRF, is trained to discriminate between relevant and non-relevant documents for semantic search. Our experimental results show that our contrastive weighting model can aid in selecting useful expansion embeddings and outperforms various baselines. In particular, CWPRF can further improve nDCG@10 by up to 4.1% compared to our proposed ColBERT-PRF approach while maintaining its efficiency.
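
    As described above, ColBERT-PRF clusters the token embeddings of the pseudo-relevant documents and selects expansion embeddings using token statistics. The following is a simplified sketch of that idea, not the thesis's exact implementation; the hyper-parameters and the token_idf lookup are assumed inputs.

```python
# Simplified sketch of the ColBERT-PRF idea: cluster feedback token
# embeddings, score each centroid by the IDF of its nearest token, and
# append the highest-scoring centroids to the query embeddings.
import numpy as np
from sklearn.cluster import KMeans

def colbert_prf_expand(query_embs, feedback_embs, feedback_tokens, token_idf,
                       K=24, fb_embs=10, beta=1.0):
    """query_embs: (q, d) array of query token embeddings;
    feedback_embs: (n, d) array of token embeddings from the pseudo-relevant
    documents; feedback_tokens: list of n token strings aligned with them."""
    centroids = KMeans(n_clusters=K, n_init=10).fit(feedback_embs).cluster_centers_

    # Score each centroid by the IDF of the feedback token closest to it.
    sims = centroids @ feedback_embs.T                       # (K, n) similarities
    nearest = sims.argmax(axis=1)
    scores = np.array([token_idf.get(feedback_tokens[i], 0.0) for i in nearest])

    top = np.argsort(-scores)[:fb_embs]
    expansion = beta * centroids[top]                         # weighted expansion embeddings
    return np.vstack([query_embs, expansion])                 # refined query representation
```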

    An Approach to Guide Users Towards Less Revealing Internet Browsers

    When browsing the Internet, HTTP headers enable both clients and servers to send extra data in their requests or responses, such as the User-Agent string. This string contains information related to the sender’s device, browser, and operating system. Previous research has shown that numerous privacy and security risks result from exposing sensitive information in the User-Agent string. For example, it enables device and browser fingerprinting as well as user tracking and identification. Our large-scale analysis of thousands of User-Agent strings shows that browsers differ tremendously in the amount of information they include in their User-Agent strings. As such, our work aims at guiding users towards less revealing browsers. In doing so, we propose to assign an exposure score to browsers based on the information they expose and on vulnerability records. Our contribution in this work is twofold: first, we provide a full implementation that is ready to be deployed and used; second, we conduct a user study to identify the effectiveness and limitations of our proposed approach. Our implementation is based on more than 52 thousand unique browsers. Our performance and validation analyses show that our solution is accurate and efficient. The source code and data set are publicly available, and the solution has been deployed.
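
    The exposure score itself is not specified in the abstract; the sketch below is a hypothetical illustration of how presence-weighted User-Agent fields and vulnerability records could be combined into such a score. The field patterns, weights, and vulnerability_count input are assumptions, not the paper's actual formula.

```python
# Hypothetical sketch: count how many identifying fields a User-Agent string
# reveals, weight them, and add a penalty from a vulnerability lookup.
import re

FIELD_PATTERNS = {
    "browser_version": r"(Chrome|Firefox|Safari|Edg|OPR)/[\d.]+",
    "os":              r"(Windows NT [\d.]+|Mac OS X [\d_]+|Android [\d.]+|Linux)",
    "device_model":    r"\((?:[^;)]*;\s*)*([A-Z][\w-]+ Build)",
    "engine_version":  r"(AppleWebKit|Gecko)/[\d.]+",
}
FIELD_WEIGHTS = {"browser_version": 3, "os": 2, "device_model": 4, "engine_version": 1}

def exposure_score(user_agent, vulnerability_count=0):
    """Higher score = more revealing User-Agent string."""
    score = sum(FIELD_WEIGHTS[f] for f, pattern in FIELD_PATTERNS.items()
                if re.search(pattern, user_agent))
    return score + vulnerability_count  # e.g. known CVEs for the exposed versions

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
print(exposure_score(ua))  # browser + OS + engine exposed -> score 6
```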

    Continuous Rationale Management

    Continuous Software Engineering (CSE) is a software life cycle model open to frequent changes in requirements or technology. During CSE, software developers continuously make decisions on the requirements and design of the software or the development process. They establish essential decision knowledge, which they need to document and share so that it supports the evolution of and changes to the software. The management of decision knowledge is called rationale management. Rationale management provides an opportunity to support the change process during CSE. However, rationale management is not well integrated into CSE. The overall goal of this dissertation is to provide workflows and tool support for continuous rationale management. The dissertation contributes an interview study with practitioners from industry, which investigates rationale management problems, current practices, and features for supporting continuous rationale management that are beneficial for practitioners. The problems of rationale management in practice are threefold: First, documenting decision knowledge is intrusive to the development process and requires additional effort. Second, the large amount of distributed decision knowledge documentation is difficult to access and use. Third, the documented knowledge can be of low quality, e.g., outdated, which impedes its use. The dissertation contributes a systematic mapping study on recommendation and classification approaches to treat these rationale management problems. The major contribution of this dissertation is a validated approach for continuous rationale management consisting of the ConRat life cycle model extension and the comprehensive ConDec tool support. To reduce intrusiveness and additional effort, ConRat integrates rationale management activities into existing workflows, such as requirements elicitation, development, and meetings. ConDec integrates into standard development tools instead of providing a separate tool. ConDec enables lightweight capture and use of decision knowledge from various artifacts and reduces the developers' effort through automatic text classification, recommendation, and nudging mechanisms for rationale management. To enable access to and use of distributed decision knowledge documentation, ConRat defines a knowledge model of decision knowledge and other artifacts. ConDec instantiates the model as a knowledge graph and offers interactive knowledge views with useful tailoring, e.g., transitive linking. To operationalize high quality, ConRat introduces the rationale backlog, a definition of done for knowledge documentation, and metrics for intra-rationale completeness and for the decision coverage of requirements and code. ConDec implements these agile concepts for rationale management, together with a knowledge dashboard. ConDec also supports consistent changes through change impact analysis. The dissertation shows the feasibility, effectiveness, and user acceptance of ConRat and ConDec in six case study projects in an industrial setting. In addition, it comprehensively analyses the rationale documentation created in the projects. The validation indicates that ConRat and ConDec benefit CSE projects. Based on the dissertation, continuous rationale management should become a standard part of CSE, like automated testing or continuous integration.
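
    ConDec's knowledge model and metrics are only summarised above; the sketch below is a hypothetical illustration of a decision knowledge graph and a decision-coverage metric, not the ConDec implementation. Node types, link distance, and the example elements are assumed.

```python
# Hypothetical sketch: decision knowledge elements and artifacts as a graph,
# with "decision coverage" = share of requirements reachable from at least
# one documented decision within a given link distance.
import networkx as nx

g = nx.Graph()
g.add_nodes_from(["REQ-1", "REQ-2"], type="requirement")
g.add_nodes_from(["DEC-1"], type="decision")
g.add_nodes_from(["CodeA.java"], type="code")
g.add_edges_from([("DEC-1", "REQ-1"), ("DEC-1", "CodeA.java")])

def decision_coverage(graph, max_distance=2):
    reqs = [n for n, d in graph.nodes(data=True) if d.get("type") == "requirement"]
    decisions = {n for n, d in graph.nodes(data=True) if d.get("type") == "decision"}
    covered = sum(
        any(nx.has_path(graph, r, dec)
            and nx.shortest_path_length(graph, r, dec) <= max_distance
            for dec in decisions)
        for r in reqs)
    return covered / len(reqs) if reqs else 0.0

print(decision_coverage(g))  # 0.5: REQ-1 is covered by DEC-1, REQ-2 is not
```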

    Q(sqrt(-3))-Integral Points on a Mordell Curve

    We use an extension of quadratic Chabauty to number fields, recently developed by the author with Balakrishnan, Besser and Müller, combined with a sieving technique, to determine the integral points over Q(√−3) on the Mordell curve y² = x³ − 4.

    AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model

    © 2020, The Author(s). The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. Previous methods, such as the Bayesian-based and genetic-based optimisation implemented in Auto-Weka, Auto-sklearn and TPOT, evaluate pipelines by executing them. Therefore, pipeline composition and optimisation with these methods requires a tremendous amount of time, which prevents them from exploring complex pipelines to find better predictive models. To explore this research challenge further, we have conducted experiments showing that many of the generated pipelines are invalid, and that it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR). AVATAR accelerates automatic ML pipeline composition and optimisation by quickly discarding invalid pipelines. Our experiments show that AVATAR is more efficient at evaluating complex pipelines than the traditional evaluation approaches that require their execution.
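
    The abstract does not detail the surrogate model; the sketch below illustrates the underlying idea of validating a pipeline without executing it, by propagating data properties through declared component capabilities. The component table and property names are assumed for demonstration and are not the actual AVATAR surrogate.

```python
# Illustrative sketch: each component declares which data properties it
# requires and which it establishes, so pipeline validity can be checked
# by propagating properties instead of executing the pipeline.
COMPONENTS = {
    "Imputer":        {"requires": set(),            "adds": {"no_missing"}},
    "OneHotEncoder":  {"requires": set(),            "adds": {"numeric_only"}},
    "StandardScaler": {"requires": {"numeric_only"}, "adds": set()},
    "LinearSVC":      {"requires": {"numeric_only", "no_missing"}, "adds": set()},
}

def is_valid(pipeline, initial_properties):
    """Propagate data properties through the pipeline; fail fast on a mismatch."""
    props = set(initial_properties)
    for step in pipeline:
        capability = COMPONENTS[step]
        if not capability["requires"] <= props:
            return False                      # unmet precondition: invalid pipeline
        props |= capability["adds"]
    return True

# Raw data with categorical features and missing values (no properties yet):
print(is_valid(["StandardScaler", "LinearSVC"], set()))                               # False
print(is_valid(["Imputer", "OneHotEncoder", "StandardScaler", "LinearSVC"], set()))   # True
```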