
    Studying the Effect and Treatment of Misspelled Queries in Cross-Language Information Retrieval

    The performance of Information Retrieval systems is limited by the linguistic variation present in natural language texts. Word-level Natural Language Processing techniques have been shown to be useful in reducing this variation. In this article, we summarize our work on the extension of these techniques for dealing with phrase-level variation in European languages, taking Spanish as a case in point. We propose the use of syntactic dependencies as complex index terms in an attempt to solve the problems deriving from both syntactic and morpho-syntactic variation and, in this way, to obtain more precise index terms. Such dependencies are obtained through a shallow parser based on cascades of finite-state transducers in order to reduce as far as possible the overhead due to this parsing process. The use of different sources of syntactic information, queries or documents, has also been studied, as has the restriction of the dependencies to those obtained from noun phrases. Our approaches have been tested using the CLEF corpus, obtaining consistent improvements with regard to classical word-level non-linguistic techniques. Results show, on the one hand, that syntactic information extracted from documents is more useful than that from queries. On the other hand, it has been demonstrated that by restricting dependencies to those corresponding to noun phrases, important reductions of storage and management costs can be achieved, albeit at the expense of a slight reduction in performance.
    Funding: Ministerio de Economía y Competitividad; FFI2014-51978-C2-1-R. Rede Galega de Procesamento da Linguaxe e Recuperación de Información; CN2014/034. Ministerio de Economía y Competitividad; BES-2015-073768. Ministerio de Economía y Competitividad; FFI2014-51978-C2-2-
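    The dependency-based indexing idea above can be illustrated with a minimal sketch. This is not the authors' transducer cascade: a hypothetical, already-parsed list of (head, modifier) pairs stands in for the shallow parser, and the compound-term encoding is an assumption made only for illustration.

```python
# Minimal sketch: index head-modifier dependency pairs as complex terms
# alongside classical word-level terms. The dependency extraction is
# hypothetical; the article obtains dependencies with a cascade of
# finite-state transducers over Spanish text, which is not reproduced here.

def build_index_terms(tokens, dependencies):
    """Combine word-level index terms with compound dependency terms."""
    word_terms = [t.lower() for t in tokens]
    # Encode each (head, modifier) pair as a single compound index term.
    dependency_terms = [f"{head.lower()}#{mod.lower()}" for head, mod in dependencies]
    return word_terms + dependency_terms

# Toy example: a noun phrase yields word terms plus its internal dependencies.
tokens = ["information", "retrieval", "systems"]
deps = [("retrieval", "information"), ("systems", "retrieval")]
print(build_index_terms(tokens, deps))
```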

    A comparison of automatic search query enhancement algorithms that utilise Wikipedia as a source of a priori knowledge

    This paper describes the benchmarking and analysis of five Automatic Search Query Enhancement (ASQE) algorithms that utilise Wikipedia as the sole source for a priori knowledge. The contributions of this paper include: 1) a comprehensive review of current ASQE algorithms that utilise Wikipedia as the sole source for a priori knowledge; 2) benchmarking of five existing ASQE algorithms using the TREC-9 Web Topics on the ClueWeb12 data set; and 3) analysis of the results from the benchmarking process to identify the strengths and weaknesses of each algorithm. During the benchmarking process, 2,500 relevance assessments were performed. Results of these tests are analysed using the Average Precision @10 per query and Mean Average Precision @10 per algorithm. From this analysis we show that the scope of a priori knowledge utilised during enhancement and the term weighting methods available from Wikipedia can further aid the ASQE process. Although the approaches taken by the algorithms are still relevant, an over-dependence on the weighting schemes and data sources used can easily impact the results of an ASQE algorithm.
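    For reference, the two evaluation measures mentioned above (Average Precision @10 per query and Mean Average Precision @10 per algorithm) can be sketched as follows. The toy run and relevance judgments are illustrative only, and normalising by min(|relevant|, k) is one common convention rather than necessarily the paper's.

```python
# Minimal sketch of AP@10 for a single query and MAP@10 across queries.

def average_precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Precision is accumulated at each rank (<= k) where a relevant doc appears."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant_ids), k) if relevant_ids else 0.0

def mean_average_precision_at_k(runs, qrels, k=10):
    """Average the per-query AP@k over all queries in the run."""
    scores = [average_precision_at_k(runs[q], qrels.get(q, set()), k) for q in runs]
    return sum(scores) / len(scores) if scores else 0.0

# Toy data: two queries, their ranked results, and relevance judgments.
runs = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d9"]}
qrels = {"q1": {"d3", "d1"}, "q2": {"d5"}}
print(mean_average_precision_at_k(runs, qrels))
```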

    "Does Vinegar Kill Coronavirus?" - Using Search Log Analysis to Estimate the Extent of COVID-19-Related Misinformation Searching Behaviour in the United States

    Health experts and government authorities' actions to combat the coronavirus outbreak are strongly compromised by the misinformation infodemic that evolved in parallel to the COVID-19 pandemic. When people get misled by unscientific and unsubstantiated claims regarding the origin or cures for COVID-19, public health response efforts get undermined and people might be less likely to comply with official guidance and thus spread the virus or even harm themselves. To prevent this from happening, a first step is to reveal the prevalence of misinformation ideas in the public. In this study, we use search log analysis to investigate the extent and characteristics of misinformation seeking behaviour in the US using the Bing Search Data-set for Coronavirus Intent. We train a machine learning model to distinguish between regular and misinformation queries and find that only around 1% of queries are related to misinformation myths or conspiracy theories. The query term "qanon", which connects the conspiracy theory to many different origin myths of COVID-19, is the most frequent and steadily increasing misinformation-related query in the dataset.
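    A classifier of the kind described (separating regular from misinformation-related queries) could be sketched as below. The study's actual model and features are not given here, so a TF-IDF plus logistic-regression baseline with made-up example queries is assumed.

```python
# Minimal sketch: train a query classifier to flag misinformation-related
# searches. Model choice and the tiny training set are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

queries = [
    "covid 19 symptoms",              # regular
    "coronavirus testing near me",    # regular
    "does vinegar kill coronavirus",  # misinformation myth
    "covid 19 5g towers",             # conspiracy theory
]
labels = [0, 0, 1, 1]  # 1 = misinformation-related

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(queries, labels)

# Classify new, unseen queries.
print(model.predict(["qanon virus origin", "covid vaccine schedule"]))
```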

    A Wikipedia powered state-based approach to automatic search query enhancement

    This paper describes the development and testing of a novel Automatic Search Query Enhancement (ASQE) algorithm, the Wikipedia N Sub-state Algorithm (WNSSA), which utilises Wikipedia as the sole data source for prior knowledge. This algorithm is built upon the concept of iterative states and sub-states, harnessing the power of Wikipedia's data set and link information to identify and utilise recurring terms to aid term selection and weighting during enhancement. This algorithm is designed to prevent query drift by making callbacks to the user's original search intent, persisting the original query between internal states alongside the additional selected enhancement terms. The developed algorithm has been shown to improve both short and long queries by providing a better understanding of the query and available data. The proposed algorithm was compared against five existing ASQE algorithms that utilise Wikipedia as the sole data source, showing an average Mean Average Precision (MAP) improvement of 0.273 over the tested existing ASQE algorithms.
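    The query-persistence idea (anchoring each enhancement state to the original query while appending selected terms) can be sketched as follows. This is not the WNSSA itself: the candidate-term scores, which WNSSA would derive from Wikipedia states and sub-states, are assumed to be supplied by the caller.

```python
# Minimal sketch: iterative query enhancement that always keeps the original
# query terms, limiting query drift. Scoring of candidate terms is hypothetical.

def enhance_query(original_terms, candidate_scores, states=3, terms_per_state=2):
    """Add the best-scoring candidate terms state by state, persisting the
    original query so later states stay anchored to the user's intent."""
    selected = []
    remaining = dict(candidate_scores)
    for _ in range(states):
        best = sorted(remaining, key=remaining.get, reverse=True)[:terms_per_state]
        selected.extend(best)
        for term in best:
            remaining.pop(term)
    # The original query is kept verbatim, with enhancement terms appended.
    return list(original_terms) + selected

# Toy candidate terms with assumed weights.
scores = {"treaty": 0.9, "armistice": 0.7, "1918": 0.6, "battle": 0.3}
print(enhance_query(["world", "war", "one"], scores, states=2, terms_per_state=1))
```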

    Understanding Children’s Help-Seeking Behaviors: Effects of Domain Knowledge

    This dissertation explores children’s help-seeking behaviors and use of help features when they formulate search queries and evaluate search results in IR systems. This study was conducted with 30 children who were 8 to 10 years old. The study was designed to answer three research questions with two parts each: 1(a) What are the types of help-seeking situations experienced by children (8-10 years old) when they formulate search queries in a search engine and a kid-friendly web portal?, 1(b) What are the types of help-seeking situations experienced by children (8-10 years old) when they evaluate search results in a search engine and a kid-friendly web portal?, 2(a) What types of help features do children (8-10 years old) use and desire when they formulate search queries in a search engine and a kid-friendly web portal?, 2(b) What types of help features do children (8-10 years old) use and desire when they evaluate search results in a search engine and a kid-friendly web portal?, 3(a) How does children’s (8-10 years old) domain knowledge affect their help seeking and use of help features when they formulate search queries in a search engine and a kid-friendly web portal?, 3(b) How does children’s (8-10 years old) domain knowledge affect their help seeking and use of help features when they evaluate search results in a search engine and a kid-friendly web portal? This study used multiple data collection methods, including performance-based domain knowledge quizzes as direct measurement, domain knowledge self-assessments as indirect measurement, pre-questionnaires, transaction logs, think-aloud protocols, observations, and post-interviews. Open coding analysis was used to examine children’s help-seeking situations. Children’s cognitive, physical, and emotional types of help-seeking situations when using Google and Kids.gov were identified. To explore the help features children use and desire when they formulate search queries and evaluate results in Google and Kids.gov, open coding analysis was conducted. Additional descriptive statistics summarized the frequency of help features children used when they formulated search queries and evaluated results in Google and Kids.gov. Finally, this study investigated the effect of children’s domain knowledge on their help seeking and use of help features when using Google and Kids.gov, based on linear regression. The level of children’s self-assessed domain knowledge affected occurrences of their help-seeking situations when they formulated search queries in Google. Similarly, children’s domain knowledge quiz scores showed a statistically significant effect on occurrences of their help-seeking situations when they formulated keywords in Google. In the stage of result evaluation, the level of children’s self-assessed domain knowledge influenced their use of help features in Kids.gov. Furthermore, children’s domain knowledge quiz scores affected their use of help features when they evaluated search results in Kids.gov. Theoretical and practical implications for reducing children’s cognitive, physical, and emotional help-seeking situations when they formulate search queries and evaluate search results in IR systems were discussed based on the results.

    Video Content Understanding Using Text

    The rise of the social media and video streaming industries has provided us with a plethora of videos and their corresponding descriptive information in the form of concepts (words) and textual video captions. Due to the massive amount of available videos and textual data, today is the best time ever to study the Computer Vision and Machine Learning problems related to videos and text. In this dissertation, we tackle multiple problems associated with the joint understanding of videos and text. We first address the task of multi-concept video retrieval, where the input is a set of words as concepts, and the output is a ranked list of full-length videos. This approach deals with multi-concept input and the prolonged length of videos by incorporating multi-latent variables to tie the information within each shot (a short clip of a full video) and across shots. Secondly, we address the problem of video question answering, in which the task is to answer a question, in the form of Fill-In-the-Blank (FIB), given a video. Answering a question is a task of retrieving a word from a dictionary (all possible words suitable for an answer) based on the input question and video. Following the FIB problem, we introduce a new problem, called Visual Text Correction (VTC), i.e., detecting and replacing an inaccurate word in the textual description of a video. We propose a deep network that can simultaneously detect an inaccuracy in a sentence, benefiting from 1D-CNNs/LSTMs to encode short/long-term dependencies, and fix it by replacing the inaccurate word(s). Finally, as the last part of the dissertation, we propose to tackle the problem of video generation using user-input natural language sentences. Our proposed video generation method constructs two distributions out of the input text, corresponding to the latent representations of the first and last frames. We generate high-fidelity videos by interpolating the latent representations and applying a sequence of CNN-based up-pooling blocks.
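    The last step, producing per-frame latents by interpolating between the first- and last-frame representations, can be sketched as below. The text encoder and the CNN-based up-pooling decoder are outside the scope of this sketch, and the latent dimensionality and random sampling are placeholders rather than the dissertation's actual setup.

```python
# Minimal sketch: interpolate between two latent vectors to obtain a sequence
# of per-frame latents, which a decoder would turn into video frames.

import numpy as np

def interpolate_latents(z_first, z_last, num_frames):
    """Linearly interpolate between the first- and last-frame latents."""
    alphas = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - alphas) * z_first + alphas * z_last  # shape (num_frames, latent_dim)

rng = np.random.default_rng(0)
z_first = rng.normal(size=128)  # stand-in for a sample from the "first frame" distribution
z_last = rng.normal(size=128)   # stand-in for a sample from the "last frame" distribution
frame_latents = interpolate_latents(z_first, z_last, num_frames=16)
print(frame_latents.shape)  # each row would be decoded into one video frame
```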

    Enhancing a new design for subject access to online catalogs

    In this report, we describe the enhancement of a new design for subject access to online catalogs. The purpose of this research project was to enhance the search trees with new subject searching approaches to enable online catalogs to respond with useful information for the most difficult user queries.
    http://deepblue.lib.umich.edu/bitstream/2027.42/57991/1/Enhancing_a_new_design_for_subject_access_to_online_catalogs.pd

    Characteristics of Web-based textual communications

    Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2012. Thesis (Ph.D.) -- Bilkent University, 2012. Includes bibliographical references.
    In this thesis, we analyze different aspects of Web-based textual communications and argue that all such communications share some common properties. In order to provide practical evidence for the validity of this argument, we focus on two common properties by examining them on various types of Web-based textual communications data. These properties are: all Web-based communications contain features attributable to their author and receiver; and all Web-based communications exhibit similar heavy-tailed distributional properties. In order to provide practical proof for the validity of our claims, we present three practical, real-life research problems and exploit the proposed common properties of Web-based textual communications to find practical solutions to them. In this work, we first provide a feature-based result caching framework for real-life search engines. To this end, we mined attributes from user queries in order to classify queries and estimate a quality metric for making admission and eviction decisions for the query result cache. Second, we analyzed messages of an online chat server in order to predict user and message attributes. Our results show that several user- and message-based attributes can be predicted with significant accuracy using both chat-message- and writing-style-based features of the chat users. Third, we provide a parallel framework for in-memory construction of term-partitioned inverted indexes. In this work, in order to minimize the total communication time between processors, we provide a bucketing scheme that is based on term-based distributional properties of Web page contents.
    Küçükyılmaz, Tayfun. Ph.D.
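    The third contribution, assigning the terms of a term-partitioned inverted index to processors, can be illustrated with a small sketch. The thesis's actual bucketing scheme, driven by distributional properties of Web page contents to minimise communication, is not reproduced here; a simple greedy balance by posting-list size is assumed instead.

```python
# Minimal sketch: partition index terms across processors by assigning the
# heaviest posting lists first to the currently least-loaded processor.

import heapq

def partition_terms(posting_list_sizes, num_processors):
    """Greedy load balancing of terms over processors."""
    heap = [(0, p) for p in range(num_processors)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for term, size in sorted(posting_list_sizes.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(heap)
        assignment[term] = proc
        heapq.heappush(heap, (load + size, proc))
    return assignment

# Toy posting-list sizes (number of postings per term).
sizes = {"the": 100_000, "query": 5_000, "cache": 2_500, "chat": 1_200}
print(partition_terms(sizes, num_processors=2))
```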