
    Entity Linking in Low-Annotation Data Settings

    Recent advances in natural language processing have focused on applying and adapting large pretrained language models to specific tasks. These models, such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020a), are pretrained on massive amounts of unlabeled text across a variety of domains. The impact of these pretrained models is visible in the task of entity linking, where a mention of an entity in unstructured text is matched to the relevant entry in a knowledge base. State-of-the-art linkers, such as Wu et al. (2020) and De Cao et al. (2021), leverage pretrained models as a foundation for their systems. However, these models are also trained on large amounts of annotated data, which is crucial to their performance. These large datasets often come from domains that are easily annotated, such as Wikipedia or newswire text. However, tailoring NLP tools to a narrow variety of textual domains severely restricts their use in the real world. Many other domains, such as medicine or law, do not have large amounts of entity linking annotations available. Entity linking, which bridges the gap between massive amounts of unstructured text and structured repositories of knowledge, is equally crucial in these domains. Yet tools trained on newswire or Wikipedia annotations are unlikely to be well suited to identifying medical conditions mentioned in clinical notes. As most annotation efforts focus on English, similar challenges arise in building systems for non-English text. These domains often have relatively small amounts of annotated data, so it is necessary to draw on other types of domain-specific data, such as unannotated text or highly curated structured knowledge bases. In these settings, it is crucial to translate lessons from tools tailored for high-annotation domains into algorithms suited to low-annotation domains. This requires both leveraging broader types of data and understanding the unique challenges present in each domain.
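A minimal sketch of the entity-linking step described above: a mention plus its surrounding context is matched against knowledge-base entries by bag-of-words cosine similarity. The toy knowledge base, entry ids, and scoring below are illustrative stand-ins, not any system cited in the abstract:

```python
from collections import Counter
import math

def _vec(text):
    # Bag-of-words term frequencies over a lowercased token stream.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(n * b[t] for t, n in a.items() if t in b)
    na = math.sqrt(sum(n * n for n in a.values()))
    nb = math.sqrt(sum(n * n for n in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def link(mention, context, kb):
    # Score every KB entry against the mention plus its context and
    # return the id of the best-matching entry.
    query = _vec(mention + " " + context)
    return max(kb, key=lambda eid: cosine(query, _vec(kb[eid])))

# Hypothetical two-entry knowledge base.
kb = {
    "Q1": "myocardial infarction heart attack cardiac muscle damage",
    "Q2": "migraine severe recurring headache neurological disorder",
}
best = link("heart attack", "patient admitted with chest pain", kb)
```

A real linker such as those cited above would replace the count vectors with pretrained-model embeddings; the retrieve-and-score loop is otherwise the same shape.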

    Appropriation of a connected flash glucose meter among people living with diabetes in the context of therapeutic education

    International cotutelle with the Laboratoire Éducations et Promotion de la Santé (Santé publique - UR 3412) of Université Sorbonne Paris Nord. Self-monitoring of blood glucose is essential for people living with diabetes to assess their blood glucose levels and adapt their treatment or behaviour.
In France, since 2017, the FreeStyle Libre (FSL) flash glucose meter has been offered to people living with diabetes on the condition that they attend a specific education program within facilities accustomed to diabetes and therapeutic education. The scientific literature has shown the efficacy of self-monitoring with this system, but there are few studies on its appropriation and impact. This research aims to describe and understand the phenomenon of appropriation of FreeStyle Libre by identifying how it has been implemented, how it is operationalized, according to which interventions, in whom it works, in which contexts, and what mechanisms are at work. A realist evaluation was carried out based on a middle-range theory. This research was conducted in four settings in the Paris area involving 48 people living with diabetes and healthcare professionals. First of all, the results show that over time, the programmes have evolved in their modalities and contents, in the way they were organized, but also that the implemented educational interventions differed from those that were supposed to take place. Next, to explain the appropriation of FreeStyle Libre, 114 context-mechanism-effect chains were constructed that shed light on the acceptance of FreeStyle Libre, the conditions and modalities of its use, and the effects produced through it. The context-mechanism-effect chains highlight contexts that are more favourable to appropriation (high digital literacy, pre-existing empowerment, commitment to self-management, etc.) and less favourable contexts (compulsive personality trait, low general or digital literacy, lack of education and support, etc.). The mechanisms that are generated involve knowledge, lack of fear about confidentiality and privacy, motivation, and personal norms. Acceptance of the FSL is strong and involves the perception that the technology can contribute to the performance of self-monitoring of blood glucose and that it is easy to use. 
The analysis then distinguished several modalities of use according to quantitative and qualitative usage indicators. The effects of appropriation are identified in the improvement of quality of life with diabetes, the improvement of the interpersonal relationship between caregivers and cared-for persons, the reduction of diabetes-related anxiety, the adaptation of treatments and behaviours, and finally in knowledge of the disease and the reasoning of the persons. The final middle-range theory built on these results provides a global model of the appropriation of FreeStyle Libre. This study shows that there are many variations of appropriation. It establishes that education in the use of FreeStyle Libre is needed to get the most out of the device, and it identifies a lack of integration of connected technology into therapeutic education programmes, which is a particular challenge for the future.

    Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement

    Mentions of new concepts appear regularly in texts and require automated approaches to harvest them and place them into Knowledge Bases (KBs), e.g., ontologies and taxonomies. Existing datasets suffer from three issues: (i) they mostly assume that a new concept is pre-discovered and cannot support out-of-KB mention discovery; (ii) they use only the concept label as input alongside the KB, and thus lack the contexts in which a concept is mentioned; and (iii) they mostly focus on concept placement w.r.t. a taxonomy of atomic concepts rather than complex concepts, i.e., those with logical operators. To address these issues, we propose a new benchmark that adapts the MedMentions dataset (PubMed abstracts) with the 2014 and 2017 versions of SNOMED CT under the Diseases sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product. We also describe how to use the dataset to evaluate out-of-KB mention discovery and concept placement, adapting recent Large Language Model based methods. Comment: 5 pages, 1 figure, accepted for CIKM 2023. The dataset, data construction scripts, and baseline implementation are available at https://zenodo.org/record/8228005 (Zenodo) and https://github.com/KRR-Oxford/OET (GitHub).
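Out-of-KB mention discovery as framed above can be sketched as thresholding the best in-KB match: a mention whose best match is still weak is flagged as a candidate new concept needing placement. The Jaccard scorer and the 0.5 threshold below are illustrative assumptions, not the benchmark's method:

```python
def discover_out_of_kb(mention, kb_labels, threshold=0.5):
    # Compare the mention against every KB concept label with Jaccard
    # token overlap; if even the best match falls below the threshold,
    # flag the mention as a candidate new (out-of-KB) concept.
    tokens = set(mention.lower().split())
    def jaccard(label):
        other = set(label.lower().split())
        return len(tokens & other) / len(tokens | other)
    best = max(kb_labels, key=jaccard)
    if jaccard(best) < threshold:
        return None, mention      # out-of-KB: needs placement
    return best, None             # resolved to an existing concept
```

A real system would score candidates with a trained encoder and calibrate the threshold on held-out data rather than fixing it by hand.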

    CamemBERT-bio: a Tasty French Language Model Better for your Health

    Clinical data in hospitals are increasingly accessible for research through clinical data warehouses; however, these documents are unstructured. It is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances, especially for named entity recognition. However, these models are trained on general-domain language and are less effective on biomedical data. This is why we propose a new French public biomedical dataset on which we have continued the pre-training of CamemBERT. We thus introduce a first version of CamemBERT-bio, a specialized public model for the French biomedical domain that shows an average improvement of 2.54 points of F1 score across different biomedical named entity recognition tasks.

    Computational acquisition of knowledge in small-data environments: a case study in the field of energetics

    The UK’s defence industry is accelerating its implementation of artificial intelligence, including expert systems and natural language processing (NLP) tools designed to supplement human analysis. This thesis examines the limitations of NLP tools in small-data environments (common in defence), focusing on the defence-related energetic-materials domain. A literature review identifies the domain-specific challenges of developing an expert system (specifically an ontology). The absence of domain resources such as labelled datasets and, most significantly, the preprocessing of text resources are identified as challenges. To address the latter, a novel general-purpose preprocessing pipeline tailored to the energetic-materials domain is developed and its effectiveness evaluated. A study of the subjective concept of importance examines the boundary between using NLP tools in data-limited environments to supplement human analysis and using them to replace it entirely. A methodology for directly comparing the ability of NLP tools and experts to identify important points in a text is presented. The results show that the study participants exhibit little agreement, even on which points in the text are important. The NLP tools, the expert (the author of the text being examined), and the participants agree only on general statements; however, as a group, the participants agreed with the expert. In data-limited environments, the extractive-summarisation tools examined cannot identify the important points in a technical document as effectively as an expert. A methodology for classifying journal articles by the technology readiness level (TRL) of the described technologies in a data-limited environment is proposed. Techniques to overcome challenges with real-world data, such as class imbalance, are investigated. A methodology to evaluate the reliability of human annotations is presented. Analysis identifies a lack of agreement and consistency in the expert evaluation of document TRLs.
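Annotation reliability of the kind evaluated above is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. The sketch below is a generic implementation of that statistic, not the thesis's own methodology:

```python
from collections import Counter

def cohens_kappa(a, b):
    # Chance-corrected agreement between two annotators' label lists.
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent labelling with each
    # annotator's marginal label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa is 1 for perfect agreement, 0 for chance-level agreement, and negative when annotators systematically disagree.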

    S2F-NER: Exploring Sequence-to-Forest Generation for Complex Entity Recognition

    Named Entity Recognition (NER) remains challenging because of complex entities, such as nested, overlapping, and discontinuous entities. Existing approaches, such as sequence-to-sequence (Seq2Seq) generation and span-based classification, have shown impressive performance on various NER subtasks, but they are difficult to scale to datasets with longer input text because of either the exposure bias issue or inefficient computation. In this paper, we propose a novel Sequence-to-Forest generation paradigm, S2F-NER, which can directly extract entities from a sentence via a Forest decoder that decodes multiple entities in parallel rather than sequentially. Specifically, our model generates each path of each tree in the forest autoregressively, where the maximum depth of each tree is three (the shortest feasible length for complex NER, and far smaller than the decoding length of Seq2Seq). Based on this novel paradigm, our model elegantly mitigates the exposure bias problem while keeping the simplicity of Seq2Seq. Experimental results show that our model significantly outperforms the baselines on three discontinuous NER datasets and two nested NER datasets, especially for discontinuous entity recognition.

    Semantic-aware Retrieval Standards based on Dirichlet Compound Model to Rank Notifications by Level of Urgency

    There is a growing number of notifications generated from a wide range of sources. However, to our knowledge, there is no well-known generalizable standard for detecting the most urgent notifications. Establishing reusable standards is crucial for applications in which the recommendation (notification) is critical due to its level of urgency and sensitivity (e.g. the medical domain). To tackle this problem, this thesis aims to establish Information Retrieval (IR) standards for the notification (recommendation) task by taking semantic dimensions (terms, opinions, concepts, and user interaction) into consideration. The technical research contributions of this thesis include, but are not limited to, the development of a semantic IR framework based on the Dirichlet Compound Model (DCM), namely FDCM; the extension of FDCM to the recommendation scenario (RFDCM); and the proposal of novel opinion-aware ranking models. Transparency, explainability, and generalizability are among the benefits that a mathematically well-defined solution such as the DCM offers. The FDCM framework is based on a robust aggregation parameter that effectively combines the semantic retrieval scores using Query Performance Predictors (QPPs). Our experimental results confirm the effectiveness of this approach in recommendation systems and semantic retrieval. One of the main findings of this thesis is that the concept-based extension (term-only + concept-only) of FDCM consistently outperformed both term-only and concept-only baselines on biomedical data. Moreover, we show that semantic IR is beneficial for collaborative filtering, and it could therefore help data scientists develop hybrid, consolidated IR systems comprising content-based and collaborative filtering aspects of recommendation.
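The Dirichlet Compound Model underlying FDCM has a closed-form likelihood (the multivariate Pólya distribution). The sketch below implements that generic likelihood in log space; any concrete pseudo-count parameters are invented for illustration and are not values from the thesis:

```python
from math import lgamma, exp

def dcm_log_prob(counts, alphas):
    # Log-probability of an observed count vector under the Dirichlet
    # Compound Multinomial (multivariate Polya) distribution.
    n = sum(counts)
    a0 = sum(alphas)
    # Multinomial coefficient: log n! - sum_i log x_i!
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Ratio of gamma functions left after integrating out the Dirichlet.
    log_ratio = lgamma(a0) - lgamma(a0 + n)
    log_terms = sum(lgamma(a + x) - lgamma(a) for a, x in zip(alphas, counts))
    return log_coef + log_ratio + log_terms

# With symmetric pseudo-counts (1, 1) and a single observation, the two
# outcomes are equally likely by symmetry.
p = exp(dcm_log_prob([1, 0], [1.0, 1.0]))
```

Working in log space via `lgamma` avoids the overflow that direct gamma-function ratios would cause for realistic document lengths.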

    Generating Natural Language Queries for More Effective Systematic Review Screening Prioritisation

    Screening prioritisation in medical systematic reviews aims to rank the set of documents retrieved by complex Boolean queries. The goal is to prioritise the most important documents so that subsequent review steps can be carried out more efficiently and effectively. The current state of the art uses the final title of the review to rank documents with BERT-based neural rankers. However, the final title is only formulated at the end of the review process, which makes this approach impractical, as it relies on ex post facto information. At the time of screening, only a rough working title is available, with which the BERT-based ranker is significantly less effective than with the final title. In this paper, we explore alternative sources of queries for screening prioritisation, such as the Boolean query used to retrieve the set of documents to be screened, and queries generated by instruction-based generative large language models such as ChatGPT and Alpaca. Our best approach is not only practical given the information available at screening time, but also similar in effectiveness to the final title. Comment: Preprint for accepted paper in SIGIR-AP-202
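Screening prioritisation ultimately reduces to ranking the retrieved set against a query string. The toy ranker below substitutes plain token overlap for the BERT-based rankers discussed above; the query and documents are invented:

```python
def prioritise(query, docs):
    # Rank documents by the fraction of query tokens each contains,
    # highest first, so likely-relevant abstracts are screened early.
    q = set(query.lower().split())
    def score(doc):
        return len(q & set(doc.lower().split())) / len(q)
    return sorted(docs, key=score, reverse=True)

# Invented working-title query and retrieved documents.
docs = [
    "survey of hospital catering quality",
    "trial of statin therapy and cardiovascular outcomes",
]
ranked = prioritise("statin therapy cardiovascular outcomes", docs)
```

Swapping the query string here (working title, Boolean query terms, or an LLM-generated title) while holding the ranker fixed mirrors the comparison the paper describes.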