6 research outputs found

    Vector Space Proximity Based Document Retrieval For Document Embeddings Built By Transformers

    Internet publications stay atop of local and international events, generating hundreds, sometimes thousands of news articles per day, making it difficult for readers to navigate this stream of information without assistance. Competition for the reader’s attention has never been greater. One strategy to keep readers’ attention on a specific article and help them better understand its content is news recommendation, which automatically provides readers with references to relevant complementary articles. However, to be effective, news recommendation needs to select, from a large collection of candidate articles, only a handful of articles that are relevant yet provide diverse information. In this thesis, we propose and experiment with three methods for news recommendation and evaluate them in the context of the NIST News Track. Our first approach is based on the classic BM25 information retrieval model and assumes that relevant articles will share common keywords with the current article. Our second approach is based on novel document embedding representations and uses various proximity measures to retrieve the closest documents. For this approach, we experimented with a substantial number of models, proximity measures, and hyperparameters, yielding a total of 47,332 distinct models. Finally, our third approach combines the BM25 and the embedding models to increase the diversity of the results. The results on the 2020 TREC News Track show that the performance of the BM25 model (nDCG@5 of 0.5924) greatly exceeds the TREC median performance (nDCG@5 of 0.5250) and achieves the highest score at the shared task. The performance of the embedding model alone (nDCG@5 of 0.4541) is lower than both the TREC median and BM25. The performance of the combined model (nDCG@5 of 0.5873) is rather close to that of the BM25 model; however, an analysis of the results shows that the recommended articles differ from those proposed by BM25, and the combined model may therefore constitute a promising approach to achieving diversity without much loss in relevance.
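
    The core of the embedding-based approach described above is to encode the current article and all candidate articles with a transformer model and rank candidates by vector proximity. The Python sketch below illustrates this idea with cosine similarity; it assumes the sentence-transformers library, and the model name is an arbitrary placeholder rather than one of the models evaluated in the thesis, which swept many models and proximity measures.

        # Minimal sketch of proximity-based news recommendation over document embeddings.
        # Assumptions: the sentence-transformers library is available; the model name is a
        # placeholder, and cosine similarity stands in for the many proximity measures tested.
        import numpy as np
        from sentence_transformers import SentenceTransformer

        def recommend(query_article: str, candidates: list[str], k: int = 5) -> list[int]:
            """Return indices of the k candidate articles closest to the query article."""
            model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
            vectors = model.encode([query_article] + candidates)
            query, docs = vectors[0], vectors[1:]
            # Cosine similarity between the query embedding and every candidate embedding.
            sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
            return np.argsort(-sims)[:k].tolist()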

    Contributions to information extraction for spanish written biomedical text

    Healthcare practice and clinical research produce vast amounts of digitised, unstructured data in multiple languages that are currently underexploited, despite their potential applications in improving healthcare experiences, supporting trainee education, or enabling biomedical research, for example. To automatically transform those contents into relevant, structured information, advanced Natural Language Processing (NLP) mechanisms are required. In NLP, this task is known as Information Extraction. Our work takes place within this growing field of clinical NLP for the Spanish language, as we tackle three distinct problems. First, we compare several supervised machine learning approaches to the problem of sensitive data detection and classification. Specifically, we study the different approaches and their transferability across two corpora, one synthetic and the other authentic. Second, we present and evaluate UMLSmapper, a knowledge-intensive system for biomedical term identification based on the UMLS Metathesaurus. This system recognises and codifies terms without relying on annotated data or external Named Entity Recognition tools. Although technically naive, it performs on par with more evolved systems and does not exhibit a considerable deviation from other approaches that rely on oracle terms. Finally, we present and exploit a new corpus of real health records manually annotated with negation and uncertainty information: NUBes. This corpus is the basis for two sets of experiments, one on cue and scope detection, and the other on assertion classification. Throughout the thesis, we apply and compare techniques of varying levels of sophistication and novelty, which reflects the rapid advancement of the field.
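
    As a rough illustration of the knowledge-intensive strategy behind a system like UMLSmapper, the sketch below matches text spans against a small terminology dictionary and returns the matched spans with concept codes. The term list, codes, and matching rules here are toy placeholders, not the actual UMLS Metathesaurus resources or UMLSmapper logic.

        # Minimal sketch of dictionary-based biomedical term identification.
        # Assumption: the terminology entries below are illustrative placeholders;
        # a real system would draw its terms and codes from the UMLS Metathesaurus.
        import re

        TERMINOLOGY = {
            "diabetes mellitus": "C0011849",   # hypothetical (term -> concept code) entries
            "hipertensión": "C0020538",
        }

        def identify_terms(text: str) -> list[tuple[str, str, int, int]]:
            """Return (surface form, code, start, end) for each terminology match."""
            hits = []
            lowered = text.lower()
            for term, code in TERMINOLOGY.items():
                for m in re.finditer(re.escape(term), lowered):
                    hits.append((text[m.start():m.end()], code, m.start(), m.end()))
            return hits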

    Stopping and Resuming: How and Why Do People Search Across Sessions for Complex Tasks?

    Cross-session searches (XSS) occur when people look for information online over multiple sessions to complete complex task goals over time. Previous studies have explored aspects of XSS, including the reasons that lead to it; for example, the Multiple Information Seeking Episode (MISE) model highlights eight causes. However, less is known about how these reasons manifest in real-life XSS and their relationship with task characteristics. I conducted a diary study with 25 participants engaging in XSS for real-life tasks. Participants reported on at least three search sessions spanning at least two days, and 15 participants attended an interview after they completed the diary study. We used qualitative methods to explore motivations for expected XSS, goal complexity, session resuming and stopping reasons, types of found information, cognitive activities, and the non-search task activities that happened during the XSS process. Our results validated and refined the MISE session resuming and stopping reasons and distinguished subcategories and reasons unique to real-life XSS tasks. We discerned task-oriented and cognition-oriented motivations for XSS. We identified seven types of non-search task activities and three popular modes describing how people intertwine search and non-search activities during XSS. We assessed relationships among factors, including session goal complexity, information types, cognitive activities, and session resuming and stopping reasons, using quantitative methods. Our results show significant associations between information types, cognitive activities, session goal complexity, and session resuming and stopping reasons. Furthermore, task stages significantly correlate with perceived overall task difficulty and the difficulty of finding enough information. We also identified five XSS-specific challenges. Our results have implications for tailoring future search engines to customize search results according to session resuming reasons and for designing tools to assist task management and preparation for session stops. Methodologically, our results offer insights into designing tasks and subtasks and controlling the reasons that can lead to successive searches for tasks with varying complexity.

    Making Certain: Information and Social Reality

    This dissertation identifies and explains the phenomenon of the production of certainty in information systems. I define this phenomenon pragmatically as instances where practices of justification end upon information systems or their contents. Cases where information systems seem able to produce social reality without reference to the external world indicate that these systems contain facts for determining truth, rather than propositions rendered true or false by the world outside the system. The No Fly list is offered as a running example that both clearly exemplifies the phenomenon and announces the stakes of my project. After an operationalization of key terms and a review of relevant literature, I articulate a research program aimed at characterizing the phenomenon, its major components, and its effects. Notable contributions of the dissertation include:
    • the identification of the production of certainty as a unitary, trans-disciplinary phenomenon;
    • the synthesis of a sociolinguistic method capable of unambiguously identifying a) the presence of this phenomenon and b) distinguishing the respective contributions of systemic and social factors to it; and
    • the development of a taxonomy of certainty that can distinguish between types of certainty production and/or certainty-producing systems.
    The analysis of certainty proposed and advanced here is a potential complement to several existing methods of sociotechnical research. This is demonstrated by applying the analysis of certainty to the complex assemblage of computational timekeeping alongside a more traditional infrastructural inversion. Three subsystems, the tz database, Network Time Protocol, and International Atomic Time, are selected from the assemblage of computational timekeeping for analysis. Each system employs a distinct memory practice, in Bowker’s sense, which licenses the forgetting inherent in the production of the information it contains. The analysis of certainty expands upon the insights provided by infrastructural inversion to show how the production of certainty through modern computational timekeeping practices shapes the social reality of time. This analysis serves as an example for scholars who encounter the phenomenon of the production of certainty in information systems, showing how to use the proposed theoretical framework to more easily account for, understand, and engage with it in their work. The dissertation concludes by identifying other sites amenable to this kind of analysis, including the algorithmic assemblages commonly referred to as Artificial Intelligence.

    Catalogue of the public documents of the 54th Congress, 1st session and of all departments of the Government of the United States for the period from July 1, 1895, to June 30, 1896.

    Catalogue of Public Documents. 1 Nov. HD 355, 54-2, v76, 692p. [3552] For the 54th Congress, 1st session; U.S. publications on Indians are listed.

    Queensland government gazette
