
    Deeper Text Understanding for IR with Contextual Neural Language Modeling

    Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but little exploration has been done of understanding the text content of a query or a document. This paper studies leveraging a recently proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural language. Combining the text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited. Comment: In proceedings of SIGIR 2019.
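
    To make the contrast with traditional word embeddings concrete, here is a minimal sketch of scoring a query-document pair with a BERT cross-encoder via the Hugging Face `transformers` library. The `bert-base-uncased` checkpoint and the untrained scoring head are illustrative stand-ins, not the authors' fine-tuned model.

```python
# Minimal sketch: scoring a query-document pair with a BERT cross-encoder.
# The checkpoint and linear head are illustrative; the paper's fine-tuned
# weights and passage-handling details are not reproduced here.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # untrained stand-in

def relevance_score(query: str, document: str) -> float:
    # BERT reads the concatenated pair, so query terms attend to document terms.
    inputs = tokenizer(query, document, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] vector
        return score_head(cls).item()

print(relevance_score("deep learning for search", "Neural IR models learn ..."))
```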

    Short Text Pre-training with Extended Token Classification for E-commerce Query Understanding

    E-commerce query understanding is the process of inferring the shopping intent of customers by extracting semantic meaning from their search queries. The recent progress of pre-trained masked language models (MLM) in natural language processing makes them extremely attractive for developing effective query understanding models. Specifically, an MLM learns contextual text embeddings by recovering masked tokens in sentences. Such pre-training relies on sufficient contextual information, however, and is therefore less effective for search queries, which are usually short texts. When masking is applied to short search queries, most of the contextual information is lost and the intent of the query may be changed. To mitigate these issues for MLM pre-training on search queries, we propose a novel pre-training task specifically designed for short text, called Extended Token Classification (ETC). Instead of masking the input text, our approach extends the input by inserting tokens via a generator network, and trains a discriminator to identify which tokens were inserted in the extended input. We conduct experiments in an E-commerce store to demonstrate the effectiveness of ETC.
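
    A toy sketch of the ETC training-example construction may help: the input is extended by inserting tokens, and each position is labeled for the discriminator. In the paper a trained generator network proposes the insertions; random vocabulary draws stand in for it here.

```python
# Toy sketch of the Extended Token Classification idea: extend a short query
# by inserting tokens, then label each position as original (0) or inserted (1).
# A real setup would sample insertions from a trained generator network;
# random draws from a stand-in vocabulary are used purely for illustration.
import random

VOCAB = ["shoes", "red", "wireless", "case", "mini", "pro"]  # stand-in vocab

def make_etc_example(query_tokens, n_insert=2, rng=random):
    tokens, labels = list(query_tokens), [0] * len(query_tokens)
    for _ in range(n_insert):
        pos = rng.randrange(len(tokens) + 1)
        tokens.insert(pos, rng.choice(VOCAB))
        labels.insert(pos, 1)  # discriminator target: this token was inserted
    return tokens, labels

tokens, labels = make_etc_example(["iphone", "charger"])
print(tokens, labels)  # e.g. ['iphone', 'red', 'charger', 'case'] [0, 1, 0, 1]
```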

    Towards Query Logs for Privacy Studies: On Deriving Search Queries from Questions

    Translating verbose information needs into crisp search queries is a phenomenon that is ubiquitous but hardly understood. Insights into this process could be valuable in several applications, including the synthesis of large privacy-friendly query logs from public Web sources that are readily available to the academic research community. In this work, we take a step towards understanding query formulation by tapping into the rich potential of community question answering (CQA) forums. Specifically, we sample natural language (NL) questions spanning diverse themes from the Stack Exchange platform, and conduct a large-scale conversion experiment where crowdworkers submit the search queries they would use when looking for equivalent information. We provide a careful analysis of this data, accounting for possible sources of bias during conversion, along with insights into user-specific linguistic patterns and search behaviors. We release a dataset of 7,000 question-query pairs from this study to facilitate further research on query understanding. Comment: ECIR 2020 Short Paper.
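
    The paper collects human-written queries rather than generating them, but a naive keyword-extraction baseline illustrates the kind of compression a question-to-query conversion involves; the small stopword list below is a hypothetical stand-in.

```python
# Illustrative baseline only, not the paper's method: strip stopwords from a
# verbose question to approximate the crisp query a searcher might type.
import re

STOPWORDS = {"how", "do", "i", "a", "an", "the", "is", "are", "can", "to",
             "of", "in", "on", "for", "what", "my", "it", "that", "with"}

def question_to_query(question: str, max_terms: int = 5) -> str:
    terms = [t for t in re.findall(r"[a-z0-9]+", question.lower())
             if t not in STOPWORDS]
    return " ".join(terms[:max_terms])

print(question_to_query("How do I recover deleted photos on an Android phone?"))
# -> "recover deleted photos android phone"
```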

    Log Exploration and Analytics Using Large Language Models

    Log data are typically stored in databases as schema-based entries in a structured format. Conventionally, log exploration requires an understanding of the fields, schema, and query parameters of the database. This disclosure describes techniques that use tabular large language models (LLMs) to process, mine, and make log data amenable to natural language queries. A relatively unsophisticated user with no database skills can query log files using natural language search. The LLMs can be fine-tuned using prompt engineering and causation information. The conventional, tedious mining of logs across multiple systems using database queries is replaced by a simple natural language interface that provides the ability to determine meaningful relationships and context across events captured within the logs. Natural language queries can enable help desks to do a basic level of troubleshooting, saving time for administrators. As more information is added, querying and analytics of logs are simplified, with a resultant improvement in the speed and quality of troubleshooting.
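
    One plausible way to wire this up is to describe the log schema in a prompt and ask the model for a structured query. The schema and the `complete` callable below are placeholders for whatever LLM client and log store are in use, not a specific product's API.

```python
# Minimal sketch of the natural-language-to-log-query idea: describe the log
# schema in a prompt and ask an LLM to produce SQL. The schema is hypothetical
# and `complete` stands in for any LLM completion client.
from typing import Callable

LOG_SCHEMA = """Table logs(ts TIMESTAMP, host TEXT, level TEXT, service TEXT,
                message TEXT)"""

def nl_to_sql(question: str, complete: Callable[[str], str]) -> str:
    prompt = (
        f"{LOG_SCHEMA}\n"
        "Translate the question into a single SQL query over the table above.\n"
        f"Question: {question}\nSQL:"
    )
    return complete(prompt).strip()

# Example with a stubbed model, just to show the plumbing:
fake_llm = lambda p: "SELECT host, COUNT(*) FROM logs WHERE level='ERROR' GROUP BY host;"
print(nl_to_sql("Which hosts logged the most errors?", fake_llm))
```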

    A Reinforcement Learning-driven Translation Model for Search-Oriented Conversational Systems

    Search-oriented conversational systems rely on information needs expressed in natural language (NL). We focus here on understanding NL expressions for building keyword-based queries. We propose a reinforcement-learning-driven translation model framework able to 1) learn the translation from NL expressions to queries in a supervised way, and 2) overcome the lack of large-scale datasets by framing the translation model as a word-selection approach and injecting relevance feedback into the learning process. Experiments are carried out on two TREC datasets and demonstrate the effectiveness of our approach. Comment: This is the author's pre-print version of the work. It is posted here for your personal use, not for redistribution. Please cite the definitive version, which will be published in Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI - ISBN: 978-1-948087-75-
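
    A schematic sketch of the word-selection framing: a policy keeps or drops each word of the NL expression and is updated with a REINFORCE-style objective from a stubbed relevance-feedback reward. Real features, the retrieval loop, and the TREC evaluation are all omitted; this shows only the shape of the learning signal.

```python
# Schematic word-selection policy with a REINFORCE-style update. The reward
# is a stand-in for relevance feedback from an actual retrieval run.
import torch

vocab = ["find", "papers", "about", "neural", "query", "translation"]
logits = torch.zeros(len(vocab), requires_grad=True)  # one keep-logit per word
opt = torch.optim.SGD([logits], lr=0.1)

def reward(selected):
    # Stand-in for relevance feedback: overlap with terms assumed to retrieve
    # well, minus a small length penalty so the policy favors crisp queries.
    good = {"neural", "query", "translation"}
    return len(good & set(selected)) - 0.1 * len(selected)

for _ in range(300):
    probs = torch.sigmoid(logits)
    keep = torch.bernoulli(probs)  # sample which words to keep
    query = [w for w, k in zip(vocab, keep) if k > 0]
    log_prob = (keep * probs.log() + (1 - keep) * (1 - probs).log()).sum()
    loss = -reward(query) * log_prob  # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print([w for w, p in zip(vocab, torch.sigmoid(logits)) if p > 0.5])
```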

    Query understanding: applying machine learning algorithms for named entity recognition

    The term-frequency inverse-document-frequency (tf-idf) paradigm, often used in general search engines to rank the relevance of documents in a corpus to a given user query, is based on the frequency of occurrence of the search key terms in the corpus. These search terms are mostly expressed in natural language, thus requiring natural language processing methods. But for domain-specific search engines, such as a software download portal, search terms are usually expressed in forms that do not conform to the grammatical rules of natural language, and as such they cannot be tackled using natural language processing techniques. This thesis proposes named entity recognition using supervised machine learning methods as a means of understanding queries for such domain-specific search engines. In particular, our main objective is to apply machine learning techniques to automatically learn to recognize and classify search terms according to the named-entity class, from a set of predefined categories, to which they belong. By doing so, we are able to understand user intents and rank result sets according to their relevance to the named entities detected in the search query. Our approach involved three machine learning algorithms: Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Neural Networks (NN). We followed the supervised learning approach in training these algorithms on labeled training data from sample queries, and then evaluated their performance on new, unseen queries. Our empirical results showed precisions of 93% for the NN, which was based on the distributed representations proposed by Yoshua Bengio, 85.60% for CRF, and 82.84% for HMM. The CRF's precision improved by about 2%, reaching 87.40%, after we generated gazetteer-based and morphological features. From our results, we were able to show that machine learning methods for named entity recognition are useful for understanding query intents.
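
    As a concrete illustration of the CRF variant, here is a small sketch using the third-party sklearn-crfsuite package (an assumption; the thesis does not name its toolkit), with suffix features standing in for the morphological features mentioned above. The sample queries and labels are hypothetical.

```python
# Sketch of CRF-based NER over search queries. sklearn-crfsuite is an assumed
# toolkit choice; features and training data are tiny illustrative stand-ins.
import sklearn_crfsuite

def word_features(tokens, i):
    w = tokens[i]
    return {
        "lower": w.lower(),
        "is_digit": w.isdigit(),
        "suffix3": w[-3:],  # simple morphological cue
        "prev": tokens[i - 1].lower() if i else "<BOS>",
    }

# Hypothetical labeled queries for a software-download portal.
queries = [["download", "firefox", "64", "bit"], ["vlc", "media", "player"]]
labels = [["O", "B-SOFT", "B-VER", "I-VER"], ["B-SOFT", "I-SOFT", "I-SOFT"]]

X = [[word_features(q, i) for i in range(len(q))] for q in queries]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict([[word_features(["download", "vlc"], i) for i in range(2)]]))
```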

    Automatic web translators as part of a multilingual question-answering (QA) system: translation of questions

    Publisher's article: http://translationjournal.net/journal/51webtranslators.htm
    The traditional model of information retrieval entails some implicit restrictions, including: a) the assumption that users search for documents, not answers, and that the documents per se will respond to and satisfy the query; and b) the assumption that the queries and the document that will satisfy the particular information need are written in the same language. However, users will often need specific data in response to the queries put forth. Cross-language question-answering (QA) systems can be the solution, as they pursue the search for a minimal fragment of text, not a complete document, that answers the query, regardless of the language in which the question is formulated or the language in which the answer is found. Cross-language QA calls for some sort of underlying translation process. At present there are many types of software for natural language translation, several of them available online for free. In this paper we describe the main features of multilingual question-answering (QA) systems, and then analyze the effectiveness of the translations obtained through three of the most popular online translation tools (Google Translator, Promt and Worldlingo). The methodology used for evaluation, on the basis of automatic and subjective measures, is specifically oriented here to obtaining a translation that will serve as input to a QA system. The results obtained contribute to the realm of innovative search systems by enhancing our understanding of online translators and their potential in the context of multilingual information retrieval.
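
    Structurally, such a pipeline translates the question, retrieves in the target language, and extracts an answer. The sketch below shows only that plumbing; `translate`, `retrieve`, and `extract_answer` are placeholders for whichever MT service and QA components are plugged in.

```python
# Structural sketch of a translate-then-answer cross-language QA pipeline.
# All three components are injected placeholders, not real service APIs.
from typing import Callable, List

def cross_language_qa(question: str,
                      translate: Callable[[str], str],
                      retrieve: Callable[[str], List[str]],
                      extract_answer: Callable[[str, str], str]) -> str:
    translated = translate(question)  # MT quality bounds the whole pipeline
    passages = retrieve(translated)   # search in the target language
    return extract_answer(translated, passages[0]) if passages else ""

# Wiring with stand-in components, just to show the flow:
answer = cross_language_qa(
    "¿Quién escribió Don Quijote?",
    translate=lambda q: "Who wrote Don Quixote?",
    retrieve=lambda q: ["Don Quixote was written by Miguel de Cervantes."],
    extract_answer=lambda q, p: "Miguel de Cervantes",
)
print(answer)
```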

    Conclave: ontology-driven measurement of semantic relatedness between source code elements and problem domain concepts

    Software maintainers are often challenged with source code changes to improve software systems or eliminate defects in unfamiliar programs. To undertake these tasks, a sufficient understanding of the system (or at least a small part of it) is required. One of the most time-consuming parts of this process is locating which parts of the code are responsible for some key functionality or feature. Feature (or concept) location techniques address this problem. This paper introduces Conclave, an environment for software analysis, and in particular the Conclave-Mapper tool, which provides a feature location facility. The tool explores the natural language terms used in programs (e.g. function and variable names) and, using textual analysis and a collection of Natural Language Processing techniques, computes synonymous sets of terms. These sets are used to score relatedness between program elements and search queries or problem domain concepts, producing sorted ranks of program elements that address the search criteria or concepts. An empirical study is also discussed to evaluate the underlying feature location technique.
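
    A rough sketch of the relatedness idea: split identifier names into terms, expand them with synonym sets, and rank program elements by overlap with the concept terms. The synonym table and scoring function here are simplified stand-ins for Conclave's ontology-driven computation.

```python
# Simplified feature-location scoring: identifier terms expanded via synonym
# sets, then scored by overlap with a concept query. Synonym table is a toy.
import re

SYNONYMS = {"remove": {"delete", "erase"}, "user": {"account", "member"}}

def identifier_terms(name):
    # Split camelCase and snake_case identifiers into lowercase terms.
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", name).replace("_", " ")
    return set(spaced.lower().split())

def expand(terms):
    out = set(terms)
    for t in terms:
        out |= SYNONYMS.get(t, set())
    return out

def relatedness(identifier, concept_terms):
    return len(expand(identifier_terms(identifier)) & expand(set(concept_terms)))

for fn in ["removeUser", "parse_config", "deleteAccount"]:
    print(fn, relatedness(fn, ["delete", "user"]))
```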