
    Accelerating scientific research in the digital era: intelligent assessment and retrieval of research content

    Efficient, effective, and timely access to the scientific literature is crucial for accelerating scientific research and discovery. Nowadays, research articles are almost exclusively published in digital form and stored in digital libraries accessible over the Web. Storing scientific literature in digital libraries is advantageous because it enables access to articles at any time and place. Furthermore, digital libraries can leverage information management systems and artificial intelligence techniques to manage, retrieve, and analyze research content. Because these libraries are large and growing fast, the development of intelligent systems that can effectively retrieve and analyze research content is crucial for improving the productivity of researchers. In this thesis, we focus on improving literature search engines by addressing some of their limitations.
    One limitation of current literature search engines is that they mainly treat articles as the retrieval units and do not support direct search for an article's elements, such as figures, tables, and formulas. We study how to enable researchers to access research collections using the figures of articles. Figures play an essential role in scientific communication, so literature systems can use them directly to facilitate and accelerate research. As the first step in this direction, we propose and study the novel task of figure retrieval from collections of research articles, where the goal is to retrieve research article figures using keyword queries. We focus on the textual bag-of-words representation of search queries and figures, and study the effectiveness of different retrieval models for the task and various ways to represent figures using text data. The empirical study shows the benefit of using multiple textual inputs for representing a figure and of combining different retrieval models. The results also shed light on the challenges in addressing this novel task. Next, we address the limitations of the text-based bag-of-words representation of research figures by proposing and studying a new view of representation, namely deep neural network-based distributed representations. Specifically, we focus on using image data and text for learning figure representations with different model architectures and loss functions to understand how sensitive the embeddings are to the learning approach and the features used. We also develop a novel weak supervision technique for training neural networks for this task that leverages the citation network of articles to generate large quantities of training examples. The experimental results show that figure representations learned using our weak supervision approach are effective and outperform the representations produced by the bag-of-words technique and by pre-trained neural networks.
    Current systems also have minimal support for queries on which a search engine performs poorly because the user formulated them ineffectively. Such poor-performing queries may occur when a researcher faces a new or fast-evolving research topic, resulting in a significant vocabulary gap between the user's query and the relevant articles. We address this problem by developing a novel strategy for collaborative query construction. According to this strategy, the search engine actively engages users in an iterative process to continuously revise a query. We propose a specific implementation of this strategy in which the search engine and the user work together to expand a search query. Specifically, in every iteration the system uses the history of the user's interactions with it to generate expansion terms that the user can add to the search query, with the aim of reaching an "ideal query". The experimental results attest to the effectiveness of this approach in improving poor-performing search queries with minimal effort from the user.
    The last limitation that we address is that current systems usually do not leverage any content analysis for the quality assessment of articles and instead rely on citation counts. We study the task of automatic quality assessment of research articles, where the goal is to assess the quality of an article in different aspects such as clarity, originality, and soundness. Automating the quality assessment of articles could improve current literature systems, which could leverage the generated quality scores to support the search and analysis of research articles. Previous works have applied supervised machine learning to automate the assessment by learning from examples of articles reviewed by humans. We study the effectiveness of using topics for the task and propose a novel strategy for constructing multi-view topical features. Experimental results show that such features are effective for this task compared to deep neural network-based features and bag-of-words features.
    Finally, to facilitate further evaluation of the approaches suggested in this thesis using real users and realistic user tasks, we developed AcademicExplorer, a novel general system that supports the retrieval and exploration of research articles using several new functions enabled by the proposed algorithms, such as exploring research collections using figure embeddings, sorting research articles based on automatically generated review scores, and interactive query formulation. As an open-source system, AcademicExplorer can help advance the research, evaluation, and development of applications in this area.
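
    The bag-of-words figure retrieval task described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy collection, the field names `caption` and `mention`, and the simple TF-IDF scoring are all assumptions chosen for illustration. Each figure is represented by one or more text fields, and a keyword query is scored against the merged bag of words.

```python
import math
from collections import Counter

# Hypothetical toy collection: each figure has two assumed text fields,
# its caption and a sentence from the article body that mentions it.
FIGURES = {
    "fig1": {"caption": "Precision-recall curves for the retrieval models",
             "mention": "Figure 1 compares BM25 and the language model on precision and recall"},
    "fig2": {"caption": "Architecture of the convolutional neural network",
             "mention": "The network in Figure 2 encodes the figure image"},
    "fig3": {"caption": "Citation network of the article collection",
             "mention": "Figure 3 shows how articles cite each other"},
}

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def build_index(figures, fields):
    """Merge the chosen text fields of each figure into one bag of words."""
    bags = {fid: Counter(tok for f in fields for tok in tokenize(fig[f]))
            for fid, fig in figures.items()}
    # Document frequency: number of figures whose bag contains the term.
    df = Counter(tok for bag in bags.values() for tok in bag)
    return bags, df

def score(query, bag, df, n_docs):
    """Sum a log-scaled TF times a smoothed IDF over matching query terms."""
    total = 0.0
    for tok in tokenize(query):
        if tok in bag:
            idf = math.log((n_docs + 1) / (df[tok] + 0.5))
            total += (1 + math.log(bag[tok])) * idf
    return total

def search(query, figures, fields=("caption", "mention")):
    """Return (figure id, score) pairs sorted by descending score."""
    bags, df = build_index(figures, fields)
    scored = [(fid, score(query, bag, df, len(bags))) for fid, bag in bags.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    print(search("bm25", FIGURES, fields=("caption",)))  # captions only
    print(search("bm25", FIGURES))                       # captions plus mentions
```

    In this toy setup the query "bm25" matches nothing when only captions are indexed, but matches fig1 once the mentioning sentence is added, which is one way multiple textual inputs can help.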

    Word sense disambiguation and information retrieval

    It has often been thought that word sense ambiguity is a cause of poor performance in Information Retrieval (IR) systems. The belief is that if ambiguous words can be correctly disambiguated, IR performance will increase. However, recent research into the application of a word sense disambiguator to an IR system failed to show any performance increase. From these results it has become clear that more basic research is needed to investigate the relationship between sense ambiguity, disambiguation, and IR. Using a technique that introduces additional sense ambiguity into a collection, this paper presents research that goes beyond previous work in this field to reveal the influence that ambiguity and disambiguation have on a probabilistic IR system. We conclude that word sense ambiguity is only problematic to an IR system when it is retrieving from very short queries. In addition, we argue that if a word sense disambiguator is to be of any use to an IR system, the disambiguator must be able to resolve word senses to a high degree of accuracy.

    EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets

    This article introduces a new language-independent approach for creating a large-scale high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevancy and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology to Arabic tweets resulted in EveTAR, the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets.
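
    For readers unfamiliar with the agreement figure quoted above, the following sketch shows how a kappa value can be computed. The abstract does not say which kappa variant was used, so Cohen's kappa for two annotators is assumed here, and the relevance labels are invented toy data rather than EveTAR judgments.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    # Toy relevance judgments (1 = relevant, 0 = not relevant), invented for illustration.
    annotator_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
    annotator_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    print(round(cohen_kappa(annotator_a, annotator_b), 2))
```

    Here the annotators agree on 8 of 10 items while chance agreement is 0.5, giving a kappa of about 0.6. Values between 0.61 and 0.80 are conventionally described as substantial agreement.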

    Query expansion with naive bayes for searching distributed collections

    The proliferation of online information resources increases the importance of effective and efficient distributed searching. However, the problem of word mismatch seriously hurts the effectiveness of distributed information retrieval. Automatic query expansion has been suggested as a technique for dealing with the fundamental issue of word mismatch. In this paper, we propose a method, query expansion with Naive Bayes, to address the problem, discuss its implementation in the IISS system, and present experimental results demonstrating its effectiveness. This technique not only enhances the discriminatory power of typical queries for choosing the right collections but also significantly improves retrieval results.

    GeoCLEF 2006: the CLEF 2006 cross-language geographic information retrieval track overview

    After being a pilot track in 2005, GeoCLEF advanced to be a regular track within CLEF 2006. The purpose of GeoCLEF is to test and evaluate cross-language geographic information retrieval (GIR): retrieval for topics with a geographic specification. For GeoCLEF 2006, twenty-five search topics were defined by the organizing groups for searching English, German, Portuguese and Spanish document collections. Topics were translated into English, German, Portuguese, Spanish and Japanese. Several topics in 2006 were significantly more geographically challenging than in 2005. Seventeen groups submitted 149 runs (up from eleven groups and 117 runs in GeoCLEF 2005). The groups used a variety of approaches, including geographic bounding boxes, named entity extraction, and external knowledge bases (geographic thesauri, ontologies, and gazetteers).
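
    One of the approaches named above, geographic bounding boxes, can be sketched as follows. This is an illustrative guess at the general idea and not any participating group's system. The `BoundingBox` class, the rough box around the Iberian Peninsula, and the toy documents with gazetteer-style coordinates are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundingBox:
    """Latitude/longitude rectangle in decimal degrees."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        if not (self.south <= lat <= self.north):
            return False
        if self.west <= self.east:
            return self.west <= lon <= self.east
        # The box crosses the antimeridian, so longitude wraps around.
        return lon >= self.west or lon <= self.east

def filter_by_region(docs, box):
    """Keep documents with at least one resolved place inside the box."""
    return [doc_id for doc_id, places in docs
            if any(box.contains(lat, lon) for lat, lon in places)]

if __name__ == "__main__":
    # Rough box around the Iberian Peninsula, assumed for illustration.
    iberia = BoundingBox(south=36.0, west=-10.0, north=44.0, east=3.5)
    # Toy documents: (id, coordinates of place names a gazetteer lookup might return).
    docs = [
        ("d1", [(38.72, -9.14)]),   # Lisbon
        ("d2", [(52.52, 13.40)]),   # Berlin
        ("d3", [(40.42, -3.70)]),   # Madrid
    ]
    print(filter_by_region(docs, iberia))
```

    In a full system this filter would typically be combined with text ranking, since a geographic match alone says nothing about topical relevance.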
