
    Division of labour and sharing of knowledge for synchronous collaborative information retrieval


    Pretrained Transformers for Text Ranking: BERT and Beyond

    The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.
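
    The two families of techniques mentioned above can be sketched in a few lines. The snippet below is a minimal illustration, not the survey's reference implementation: it assumes the Hugging Face transformers library with PyTorch, the checkpoint names are placeholders, and batching, indexing, and first-stage candidate generation are omitted.

```python
# Minimal sketch of the two families of transformer ranking described above.
# Assumes: pip install transformers torch; checkpoint names are placeholders.
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

query = "what causes rainbows"
passages = ["Rainbows are caused by refraction of light in water droplets.",
            "The stock market closed higher today."]

# (1) Reranking with a cross-encoder: query and passage are scored jointly.
ce_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"     # placeholder checkpoint
ce_tok = AutoTokenizer.from_pretrained(ce_name)
ce_model = AutoModelForSequenceClassification.from_pretrained(ce_name)
with torch.no_grad():
    inputs = ce_tok([query] * len(passages), passages,
                    padding=True, truncation=True, return_tensors="pt")
    rerank_scores = ce_model(**inputs).logits.squeeze(-1)

# (2) Dense retrieval with a bi-encoder: query and passages are embedded
# independently and ranked by dot product (here, mean-pooled token vectors).
be_name = "sentence-transformers/all-MiniLM-L6-v2"   # placeholder checkpoint
be_tok = AutoTokenizer.from_pretrained(be_name)
be_model = AutoModel.from_pretrained(be_name)

def embed(texts):
    enc = be_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = be_model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)      # mean pooling

dense_scores = embed([query]) @ embed(passages).T

print(rerank_scores.tolist(), dense_scores.squeeze(0).tolist())
```

    In a multi-stage architecture, a fast first-stage retriever (lexical or dense) would supply a short candidate list and the cross-encoder would rerank only those candidates, which is one way of managing the effectiveness/efficiency tradeoff the abstract refers to.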

    Search Relevance based on the Semantic Web

    In this thesis, we explore the challenge of search relevance in the context of semantic search. Specifically, the notion of semantic relevance can be distinguished from other types of relevance in Information Retrieval (IR) by its reliance on an underlying semantic model. We propose the emerging Semantic Web data on the Web, represented in RDF graph structures, as a strong candidate for such a semantic model in the search process.
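
    As a toy illustration of what such an underlying semantic model could look like in practice, the sketch below uses the rdflib library with an invented two-triple RDF graph; the entities, predicates and query are hypothetical and serve only to show relevance assessed against entities and relations rather than term overlap.

```python
# Toy illustration of using an RDF graph as the semantic model behind search.
# Assumes: pip install rdflib; the tiny graph and query are invented examples.
from rdflib import Graph

turtle_data = """
@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:BarackObama   rdfs:label "Barack Obama" ;
                 ex:spouse  ex:MichelleObama .
ex:MichelleObama rdfs:label "Michelle Obama" .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# A keyword query like "Obama wife" could be mapped onto a structured graph
# query, so that relevance is judged against entities and their relations
# rather than bag-of-words overlap with document text.
results = g.query("""
    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?name WHERE {
        ?person rdfs:label "Barack Obama" ;
                ex:spouse  ?spouse .
        ?spouse rdfs:label ?name .
    }
""")
for row in results:
    print(row.name)    # -> "Michelle Obama"
```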

    Complexity, Emergent Systems and Complex Biological Systems: Complex Systems Theory and Biodynamics. [Edited book by I.C. Baianu, with listed contributors (2011)]

    An overview is presented of system dynamics (the study of the behaviour of complex systems), dynamical systems in mathematics, dynamic programming in computer science and control theory, complex systems biology, neurodynamics, and psychodynamics.

    Semi-Supervised Learning For Identifying Opinions In Web Content

    Thesis (Ph.D.), Indiana University, Information Science, 2011. Opinions published on the World Wide Web (Web) offer opportunities for detecting personal attitudes regarding topics, products, and services. The opinion detection literature indicates that both a large body of opinions and a wide variety of opinion features are essential for capturing subtle opinion information. Although a large amount of opinion-labeled data is preferable for opinion detection systems, opinion-labeled data is often limited, especially at sub-document levels, and manual annotation is tedious, expensive and error-prone. This shortage of opinion-labeled data is less challenging in some domains (e.g., movie reviews) than in others (e.g., blog posts). While a simple method for improving accuracy in challenging domains is to borrow opinion-labeled data from a non-target data domain, this approach often fails because of the domain transfer problem: opinion detection strategies designed for one data domain generally do not perform well in another domain. However, while it is difficult to obtain opinion-labeled data, unlabeled user-generated opinion data are readily available. Semi-supervised learning (SSL) requires only limited labeled data to automatically label unlabeled data and has achieved promising results in various natural language processing (NLP) tasks, including traditional topic classification; but SSL has been applied in only a few opinion detection studies. This study investigates the application of four different SSL algorithms in three types of Web content: edited news articles, semi-structured movie reviews, and the informal and unstructured content of the blogosphere. SSL algorithms are also evaluated for their effectiveness in sparse data situations and domain adaptation. Research findings suggest that, when there is limited labeled data, SSL is a promising approach for opinion detection in Web content. Although the contributions of SSL varied across data domains, significant improvement was demonstrated for the most challenging data domain--the blogosphere--when a domain transfer-based SSL strategy was implemented.
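
    As a rough illustration of the kind of setup evaluated in such work (not the thesis's actual data or its four SSL algorithms), the sketch below applies scikit-learn's self-training wrapper to a toy opinion-detection task; unlabeled examples carry the conventional label -1, and the texts and threshold are invented.

```python
# Minimal self-training sketch for opinion vs. factual sentence detection.
# Assumes: pip install scikit-learn; texts, labels and threshold are toy stand-ins.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

texts = [
    "I absolutely loved this movie, it was wonderful",       # labeled: opinion
    "The film runs 120 minutes and was released in 2010",    # labeled: factual
    "What a waste of time, the acting was terrible",          # unlabeled
    "The director was born in Chicago",                       # unlabeled
]
labels = np.array([1, 0, -1, -1])    # -1 marks unlabeled examples (sklearn convention)

vec = TfidfVectorizer()
X = vec.fit_transform(texts)         # sparse term-document matrix

# Self-training: fit on the labeled rows, pseudo-label unlabeled rows whose
# predicted probability exceeds the threshold, then refit and repeat.
ssl = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.6)
ssl.fit(X, labels)

print(ssl.predict(vec.transform(["the plot was boring and predictable"])))
```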

    Spoken content retrieval beyond pipeline integration of automatic speech recognition and information retrieval

    The dramatic increase in the creation of multimedia content is leading to the development of large archives in which a substantial amount of the information is in spoken form. Efficient access to this information requires effective spoken content retrieval (SCR) methods. Traditionally, SCR systems have focused on a pipeline integration of two fundamental technologies: transcription using automatic speech recognition (ASR) and search supported using text-based information retrieval (IR). Existing SCR approaches estimate the relevance of a spoken retrieval item based on the lexical overlap between a user’s query and the textual transcriptions of the items. However, the speech signal contains other potentially valuable non-lexical information that remains largely unexploited by SCR approaches. In particular, acoustic correlates of speech prosody, which have been shown to be useful for identifying salient words and determining topic changes, have not been exploited by existing SCR approaches. In addition, the temporal nature of multimedia content means that accessing content is a user-intensive, time-consuming process. In order to minimise user effort in locating relevant content, SCR systems could suggest playback points in retrieved content, indicating the locations where the system believes relevant information may be found. This typically requires adopting a segmentation mechanism for splitting documents into smaller “elements” to be ranked and from which suitable playback points could be selected. Existing segmentation approaches do not generalise well to every possible information need or provide robustness to ASR errors. This thesis extends SCR beyond the standard ASR and IR pipeline approach by: (i) exploring the utilisation of prosodic information as complementary evidence of topical relevance to enhance current SCR approaches; (ii) determining elements of content that, when retrieved, minimise user search effort and provide increased robustness to ASR errors; and (iii) developing enhanced evaluation measures that could better capture the factors that affect user satisfaction in SCR.
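
    A pipeline-style SCR ranker of the kind described above can be sketched compactly. In the snippet below, the transcript segments, the per-segment prosodic salience values, and the linear score combination are all invented for illustration; they stand in for ASR output and acoustic feature extraction and are not the thesis's models.

```python
# Sketch of a pipeline-style SCR ranker over ASR transcript segments, with an
# optional prosody-based boost. Assumes: pip install scikit-learn numpy.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each item: (start_time_seconds, ASR transcript of the segment) -- toy data.
segments = [
    (0.0,   "welcome everyone to the lecture on climate policy"),
    (42.5,  "carbon pricing is the central instrument we will discuss"),
    (118.0, "any questions before we move on to the case study"),
]
# Hypothetical per-segment prosodic salience in [0, 1] (e.g. derived from
# pitch range and energy); in a real system this would come from the audio.
prosody = np.array([0.2, 0.9, 0.4])

query = "carbon pricing instruments"

vec = TfidfVectorizer()
seg_vecs = vec.fit_transform([text for _, text in segments])
lexical = cosine_similarity(vec.transform([query]), seg_vecs).ravel()

alpha = 0.3                                     # invented mixing weight
score = (1 - alpha) * lexical + alpha * prosody

best = int(np.argmax(score))
print(f"suggested playback point: {segments[best][0]:.1f}s")
```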

    Using language models in question answering

    In this thesis, we describe a language-model-based approach to parts of a complete Question Answering (QA) system. It includes the processing of the natural language query as well as the retrieval of relevant documents, passages and sentences. The results show that the language-model-based modules in our QA system perform as well as or even better than current state-of-the-art systems. Owing to the heavy use of fast statistical algorithms, the main advantage of our system is an efficiency gain compared to the slower deep linguistic analysis methods used in other approaches. A second benefit of using language models is the ability to train them for new languages.
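
    A standard way to use language models for the retrieval step described above is query-likelihood ranking with Dirichlet smoothing, where score(Q, D) = sum over query terms q of log[(tf(q, D) + mu * P(q|C)) / (|D| + mu)]. The sketch below implements that textbook formulation on toy data; the thesis's exact models and parameters may differ.

```python
# Query-likelihood ranking with Dirichlet-smoothed unigram language models.
# Documents, query and the prior mu are toy values for illustration only.
import math
from collections import Counter

docs = {
    "d1": "the capital of france is paris".split(),
    "d2": "berlin is the capital of germany".split(),
}
query = "capital of france".split()
mu = 200.0                                      # Dirichlet prior (toy value)

collection = Counter(t for terms in docs.values() for t in terms)
coll_len = sum(collection.values())

def score(query_terms, doc_terms):
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    s = 0.0
    for q in query_terms:
        p_coll = collection[q] / coll_len       # background model P(q|C)
        s += math.log((tf[q] + mu * p_coll) / (dlen + mu))
    return s

ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranked)    # d1 should outrank d2 for this query
```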

    Putting the Pieces Together: Exons and piRNAs: A Dissertation

    Analysis of gene expression has undergone a technological revolution. What was impossible 6 years ago is now routine. High-throughput DNA sequencing machines capable of generating hundreds of millions of reads allow, indeed force, a major revision toward the study of the genome’s functional output: the transcriptome. This thesis examines the history of DNA sequencing, measurement of gene expression by sequencing, isoform complexity driven by alternative splicing, and mammalian piRNA precursor biogenesis. Examination of these topics is framed around the development of a novel RNA-templated DNA-DNA ligation assay (SeqZip) that allows for efficient analysis of abundant, complex, and functional long RNAs. The discussion focuses on the future of transcriptome analysis, the development and applications of SeqZip, and the challenges presented to biomedical researchers by extremely large and rich datasets.