    Unified Representation for Non-compositional and Compositional Expressions

    Accurate processing of non-compositional language relies on generating good representations for such expressions. In this work, we study the representation of language non-compositionality by proposing a language model, PIER, that builds on BART and can create semantically meaningful and contextually appropriate representations for English potentially idiomatic expressions (PIEs). PIEs are characterized by their non-compositionality and contextual ambiguity in their literal and idiomatic interpretations. Via intrinsic evaluation on embedding quality and extrinsic evaluation on PIE processing and NLU tasks, we show that representations generated by PIER result in a 33% higher homogeneity score for embedding clustering than BART, and in gains of 3.12% and 3.29% in accuracy and sequence accuracy for PIE sense classification and span detection, respectively, compared to the state-of-the-art IE representation model, GIEA. These gains are achieved without sacrificing PIER's performance on NLU tasks (+/- 1% accuracy) compared to BART. Comment: This work is accepted to EMNLP 2023 Findings

    Heterogeneous data to knowledge graphs matching

    Many applications rely on the existence of reusable data. The FAIR (Findability, Accessibility, Interoperability, and Reusability) principles identify detailed descriptions of data and metadata as the core ingredients for achieving reusability. However, creating descriptive data requires massive manual effort. One way to ensure that data is reusable is to integrate it into Knowledge Graphs (KGs). The semantic foundation of these graphs provides the necessary description for reuse. The Open Research KG proposes to model artifacts of scientific endeavors, including publications and their key messages. Datasets supporting these publications are essential carriers of scientific knowledge and should be included in KGs. We focus on biodiversity research as an example domain to develop and evaluate our approach. Biodiversity is the assortment of life on earth, covering evolutionary, ecological, biological, and social forms. Understanding such a domain and its mechanisms is essential to preserving this vital foundation of human well-being. It is imperative to monitor the current state of biodiversity and its change over time and to understand the forces driving and preserving life in all its variety and richness. This need has resulted in numerous works being published in this field. For example, large amounts of tabular data (datasets), textual data (publications), and metadata (e.g., dataset descriptions) have been generated. Biodiversity research is therefore a data-rich domain with an exceptionally high need for data reuse. Managing and integrating these heterogeneous data remains a major challenge. Our core research problem is how to enable the reusability of tabular data, which is one aspect of the FAIR data principles. In this thesis, we provide an answer to this research problem