
    Deeper Text Understanding for IR with Contextual Neural Language Modeling

    Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but little exploration has been done of understanding the text content of a query or a document. This paper studies leveraging a recently proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural language. Combining the text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited. Comment: In proceedings of SIGIR 2019.
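
    To make the contrast with traditional word embeddings concrete, here is a minimal sketch of scoring a query-document pair with a BERT cross-encoder via the Hugging Face `transformers` library. The `bert-base-uncased` checkpoint and the untrained scoring head are illustrative stand-ins, not the authors' fine-tuned model.

```python
# Minimal sketch: scoring a query-document pair with a BERT cross-encoder.
# The checkpoint and linear head are illustrative; the paper's fine-tuned
# weights and passage-handling details are not reproduced here.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # untrained stand-in

def relevance_score(query: str, document: str) -> float:
    # BERT reads the concatenated pair, so query terms attend to document terms.
    inputs = tokenizer(query, document, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] vector
        return score_head(cls).item()

print(relevance_score("deep learning for search", "Neural IR models learn ..."))
```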

    Short Text Pre-training with Extended Token Classification for E-commerce Query Understanding

    E-commerce query understanding is the process of inferring the shopping intent of customers by extracting semantic meaning from their search queries. The recent progress of pre-trained masked language models (MLM) in natural language processing makes them extremely attractive for developing effective query understanding models. Specifically, an MLM learns contextual text embeddings by recovering masked tokens in sentences. Such pre-training relies on sufficient contextual information, however, and is therefore less effective for search queries, which are usually short texts. When masking is applied to short search queries, most of the contextual information is lost and the intent of the query may be changed. To mitigate these issues for MLM pre-training on search queries, we propose a novel pre-training task specifically designed for short text, called Extended Token Classification (ETC). Instead of masking the input text, our approach extends the input by inserting tokens via a generator network, and trains a discriminator to identify which tokens were inserted in the extended input. We conduct experiments in an E-commerce store to demonstrate the effectiveness of ETC.
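
    A toy sketch of the ETC training-example construction may help: the input is extended by inserting tokens, and each position is labeled for the discriminator. In the paper a trained generator network proposes the insertions; random vocabulary draws stand in for it here.

```python
# Toy sketch of the Extended Token Classification idea: extend a short query
# by inserting tokens, then label each position as original (0) or inserted (1).
# A real setup would sample insertions from a trained generator network;
# random draws from a stand-in vocabulary are used purely for illustration.
import random

VOCAB = ["shoes", "red", "wireless", "case", "mini", "pro"]  # stand-in vocab

def make_etc_example(query_tokens, n_insert=2, rng=random):
    tokens, labels = list(query_tokens), [0] * len(query_tokens)
    for _ in range(n_insert):
        pos = rng.randrange(len(tokens) + 1)
        tokens.insert(pos, rng.choice(VOCAB))
        labels.insert(pos, 1)  # discriminator target: this token was inserted
    return tokens, labels

tokens, labels = make_etc_example(["iphone", "charger"])
print(tokens, labels)  # e.g. ['iphone', 'red', 'charger', 'case'] [0, 1, 0, 1]
```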

    Towards Query Logs for Privacy Studies: On Deriving Search Queries from Questions

    Translating verbose information needs into crisp search queries is a phenomenon that is ubiquitous but hardly understood. Insights into this process could be valuable in several applications, including the synthesis of large privacy-friendly query logs from public Web sources that are readily available to the academic research community. In this work, we take a step towards understanding query formulation by tapping into the rich potential of community question answering (CQA) forums. Specifically, we sample natural language (NL) questions spanning diverse themes from the Stack Exchange platform, and conduct a large-scale conversion experiment where crowdworkers submit the search queries they would use when looking for equivalent information. We provide a careful analysis of this data, accounting for possible sources of bias during conversion, along with insights into user-specific linguistic patterns and search behaviors. We release a dataset of 7,000 question-query pairs from this study to facilitate further research on query understanding. Comment: ECIR 2020 Short Paper.
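
    The paper collects human-written queries rather than generating them, but a naive keyword-extraction baseline illustrates the kind of compression a question-to-query conversion involves; the small stopword list below is a hypothetical stand-in.

```python
# Illustrative baseline only, not the paper's method: strip stopwords from a
# verbose question to approximate the crisp query a searcher might type.
import re

STOPWORDS = {"how", "do", "i", "a", "an", "the", "is", "are", "can", "to",
             "of", "in", "on", "for", "what", "my", "it", "that", "with"}

def question_to_query(question: str, max_terms: int = 5) -> str:
    terms = [t for t in re.findall(r"[a-z0-9]+", question.lower())
             if t not in STOPWORDS]
    return " ".join(terms[:max_terms])

print(question_to_query("How do I recover deleted photos on an Android phone?"))
# -> "recover deleted photos android phone"
```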

    Log Exploration and Analytics Using Large Language Models

    Log data are typically stored in databases as schema-based entries in a structured format. Conventionally, log exploration requires an understanding of the fields, schema, and query parameters of the database. This disclosure describes techniques that use tabular large language models (LLMs) to process, mine, and make log data amenable to natural language queries. A relatively unsophisticated user with no database skills can query log files using natural language search. The LLMs can be fine-tuned using prompt engineering and causation information. The conventional, tedious mining of logs across multiple systems using database queries is replaced by a simple natural language interface that provides the ability to determine meaningful relationships and context across events captured within the logs. Natural language queries can enable help desks to do a basic level of troubleshooting, saving time for administrators. As more information is added, querying and analytics of logs are simplified, with a resultant improvement in the speed and quality of troubleshooting.
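
    One plausible way to wire this up is to describe the log schema in a prompt and ask the model for a structured query. The schema and the `complete` callable below are placeholders for whatever LLM client and log store are in use, not a specific product's API.

```python
# Minimal sketch of the natural-language-to-log-query idea: describe the log
# schema in a prompt and ask an LLM to produce SQL. The schema is hypothetical
# and `complete` stands in for any LLM completion client.
from typing import Callable

LOG_SCHEMA = """Table logs(ts TIMESTAMP, host TEXT, level TEXT, service TEXT,
                message TEXT)"""

def nl_to_sql(question: str, complete: Callable[[str], str]) -> str:
    prompt = (
        f"{LOG_SCHEMA}\n"
        "Translate the question into a single SQL query over the table above.\n"
        f"Question: {question}\nSQL:"
    )
    return complete(prompt).strip()

# Example with a stubbed model, just to show the plumbing:
fake_llm = lambda p: "SELECT host, COUNT(*) FROM logs WHERE level='ERROR' GROUP BY host;"
print(nl_to_sql("Which hosts logged the most errors?", fake_llm))
```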

    A Reinforcement Learning-driven Translation Model for Search-Oriented Conversational Systems

    Search-oriented conversational systems rely on information needs expressed in natural language (NL). We focus here on understanding NL expressions for building keyword-based queries. We propose a reinforcement-learning-driven translation model framework able to 1) learn the translation from NL expressions to queries in a supervised way, and 2) overcome the lack of large-scale datasets by framing the translation model as a word-selection approach and injecting relevance feedback into the learning process. Experiments are carried out on two TREC datasets and demonstrate the effectiveness of our approach. Comment: This is the author's pre-print version of the work. It is posted here for your personal use, not for redistribution. Please cite the definitive version, which will be published in Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI - ISBN: 978-1-948087-75-
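
    A schematic sketch of the word-selection framing: a policy keeps or drops each word of the NL expression and is updated with a REINFORCE-style objective from a stubbed relevance-feedback reward. Real features, the retrieval loop, and the TREC evaluation are all omitted; this shows only the shape of the learning signal.

```python
# Schematic word-selection policy with a REINFORCE-style update. The reward
# is a stand-in for relevance feedback from an actual retrieval run.
import torch

vocab = ["find", "papers", "about", "neural", "query", "translation"]
logits = torch.zeros(len(vocab), requires_grad=True)  # one keep-logit per word
opt = torch.optim.SGD([logits], lr=0.1)

def reward(selected):
    # Stand-in for relevance feedback: overlap with terms assumed to retrieve
    # well, minus a small length penalty so the policy favors crisp queries.
    good = {"neural", "query", "translation"}
    return len(good & set(selected)) - 0.1 * len(selected)

for _ in range(300):
    probs = torch.sigmoid(logits)
    keep = torch.bernoulli(probs)  # sample which words to keep
    query = [w for w, k in zip(vocab, keep) if k > 0]
    log_prob = (keep * probs.log() + (1 - keep) * (1 - probs).log()).sum()
    loss = -reward(query) * log_prob  # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print([w for w, p in zip(vocab, torch.sigmoid(logits)) if p > 0.5])
```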

    Query understanding: applying machine learning algorithms for named entity recognition

    The term-frequency inverse-document-frequency (tf-idf) paradigm, often used in general search engines to rank the relevance of documents in a corpus to a given user query, is based on the frequency of occurrence of the search key terms in the corpus. These search terms are mostly expressed in natural language, thus requiring natural language processing methods. But for domain-specific search engines, such as a software download portal, search terms are usually expressed in forms that do not conform to the grammatical rules of natural language, and as such they cannot be tackled using natural language processing techniques. This thesis proposes named entity recognition using supervised machine learning methods as a means of understanding queries for such domain-specific search engines. In particular, our main objective is to apply machine learning techniques to automatically learn to recognize and classify search terms according to the named-entity class, from a set of predefined categories, to which they belong. By doing so, we are able to understand user intents and rank result sets according to their relevance to the named entities detected in the search query. Our approach involved three machine learning algorithms: Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Neural Networks (NN). We followed the supervised learning approach in training these algorithms on labeled training data from sample queries, and then evaluated their performance on new, unseen queries. Our empirical results showed precisions of 93% for the NN, which was based on the distributed representations proposed by Yoshua Bengio, 85.60% for CRF, and 82.84% for HMM. The CRF's precision improved by about 2%, reaching 87.40%, after we generated gazetteer-based and morphological features. From our results, we were able to show that machine learning methods for named entity recognition are useful for understanding query intents.
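
    As a concrete illustration of the CRF variant, here is a small sketch using the third-party sklearn-crfsuite package (an assumption; the thesis does not name its toolkit), with suffix features standing in for the morphological features mentioned above. The sample queries and labels are hypothetical.

```python
# Sketch of CRF-based NER over search queries. sklearn-crfsuite is an assumed
# toolkit choice; features and training data are tiny illustrative stand-ins.
import sklearn_crfsuite

def word_features(tokens, i):
    w = tokens[i]
    return {
        "lower": w.lower(),
        "is_digit": w.isdigit(),
        "suffix3": w[-3:],  # simple morphological cue
        "prev": tokens[i - 1].lower() if i else "<BOS>",
    }

# Hypothetical labeled queries for a software-download portal.
queries = [["download", "firefox", "64", "bit"], ["vlc", "media", "player"]]
labels = [["O", "B-SOFT", "B-VER", "I-VER"], ["B-SOFT", "I-SOFT", "I-SOFT"]]

X = [[word_features(q, i) for i in range(len(q))] for q in queries]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict([[word_features(["download", "vlc"], i) for i in range(2)]]))
```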

    Automatic web translators as part of a multilingual question-answering (QA) system: translation of questions

    Publisher's article: http://translationjournal.net/journal/51webtranslators.htm
    The traditional model of information retrieval entails some implicit restrictions, including: a) the assumption that users search for documents, not answers, and that the documents per se will respond to and satisfy the query; and b) the assumption that the queries and the document that will satisfy the particular information need are written in the same language. However, users will often need specific data in response to the queries put forth. Cross-language question-answering (QA) systems can be the solution, as they pursue the search for a minimal fragment of text, not a complete document, that answers the query, regardless of the language in which the question is formulated or the language in which the answer is found. Cross-language QA calls for some sort of underlying translation process. At present there are many types of software for natural language translation, several of them available online for free. In this paper we describe the main features of multilingual question-answering (QA) systems, and then analyze the effectiveness of the translations obtained through three of the most popular online translation tools (Google Translator, Promt and Worldlingo). The methodology used for evaluation, on the basis of automatic and subjective measures, is specifically oriented here to obtaining a translation that will serve as input to a QA system. The results obtained contribute to the realm of innovative search systems by enhancing our understanding of online translators and their potential in the context of multilingual information retrieval.
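
    Structurally, such a pipeline translates the question, retrieves in the target language, and extracts an answer. The sketch below shows only that plumbing; `translate`, `retrieve`, and `extract_answer` are placeholders for whichever MT service and QA components are plugged in.

```python
# Structural sketch of a translate-then-answer cross-language QA pipeline.
# All three components are injected placeholders, not real service APIs.
from typing import Callable, List

def cross_language_qa(question: str,
                      translate: Callable[[str], str],
                      retrieve: Callable[[str], List[str]],
                      extract_answer: Callable[[str, str], str]) -> str:
    translated = translate(question)  # MT quality bounds the whole pipeline
    passages = retrieve(translated)   # search in the target language
    return extract_answer(translated, passages[0]) if passages else ""

# Wiring with stand-in components, just to show the flow:
answer = cross_language_qa(
    "¿Quién escribió Don Quijote?",
    translate=lambda q: "Who wrote Don Quixote?",
    retrieve=lambda q: ["Don Quixote was written by Miguel de Cervantes."],
    extract_answer=lambda q, p: "Miguel de Cervantes",
)
print(answer)
```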

    Conclave: ontology-driven measurement of semantic relatedness between source code elements and problem domain concepts

    Software maintainers are often challenged with source code changes to improve software systems or eliminate defects in unfamiliar programs. To undertake these tasks, a sufficient understanding of the system (or at least a small part of it) is required. One of the most time-consuming parts of this process is locating which parts of the code are responsible for some key functionality or feature. Feature (or concept) location techniques address this problem. This paper introduces Conclave, an environment for software analysis, and in particular the Conclave-Mapper tool, which provides a feature location facility. The tool explores the natural language terms used in programs (e.g. function and variable names) and, using textual analysis and a collection of Natural Language Processing techniques, computes synonymous sets of terms. These sets are used to score relatedness between program elements and search queries or problem domain concepts, producing sorted ranks of program elements that address the search criteria or concepts. An empirical study is also discussed to evaluate the underlying feature location technique.
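
    A rough sketch of the relatedness idea: split identifier names into terms, expand them with synonym sets, and rank program elements by overlap with the concept terms. The synonym table and scoring function here are simplified stand-ins for Conclave's ontology-driven computation.

```python
# Simplified feature-location scoring: identifier terms expanded via synonym
# sets, then scored by overlap with a concept query. Synonym table is a toy.
import re

SYNONYMS = {"remove": {"delete", "erase"}, "user": {"account", "member"}}

def identifier_terms(name):
    # Split camelCase and snake_case identifiers into lowercase terms.
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", name).replace("_", " ")
    return set(spaced.lower().split())

def expand(terms):
    out = set(terms)
    for t in terms:
        out |= SYNONYMS.get(t, set())
    return out

def relatedness(identifier, concept_terms):
    return len(expand(identifier_terms(identifier)) & expand(set(concept_terms)))

for fn in ["removeUser", "parse_config", "deleteAccount"]:
    print(fn, relatedness(fn, ["delete", "user"]))
```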