Unified Representation for Non-compositional and Compositional Expressions
Accurate processing of non-compositional language relies on generating good
representations for such expressions. In this work, we study the representation
of language non-compositionality by proposing a language model, PIER, that
builds on BART and can create semantically meaningful and contextually
appropriate representations for English potentially idiomatic expressions
(PIEs). PIEs are characterized by their non-compositionality and contextual
ambiguity in their literal and idiomatic interpretations. Via intrinsic
evaluation on embedding quality and extrinsic evaluation on PIE processing and
NLU tasks, we show that representations generated by PIER yield a 33% higher
homogeneity score for embedding clustering than BART, along with gains of 3.12%
in accuracy and 3.29% in sequence accuracy for PIE sense classification and span
detection, respectively, compared to the state-of-the-art IE representation
model, GIEA. These gains are achieved without sacrificing PIER's performance on
NLU tasks (+/- 1% accuracy) compared to BART.
Comment: This work is accepted to EMNLP 2023 Findings
Heterogeneous data to knowledge graphs matching
Many applications rely on the existence of reusable data. The FAIR (Findability, Accessibility, Interoperability, and Reusability) principles identify detailed descriptions of data and metadata as the core ingredients for achieving reusability. However, creating descriptive data requires massive manual effort. One way to ensure that data is reusable is to integrate it into Knowledge Graphs (KGs). The semantic foundation of these graphs provides the necessary description for reuse. The Open Research KG proposes to model artifacts of scientific endeavors, including publications and their key messages. Datasets supporting these publications are essential carriers of scientific knowledge and should be included in KGs. We focus on biodiversity research as an example domain to develop and evaluate our approach. Biodiversity is the assortment of life on Earth, covering evolutionary, ecological, biological, and social forms. Understanding such a domain and its mechanisms is essential to preserving this vital foundation of human well-being. It is imperative to monitor the current state of biodiversity and its change over time, and to understand the forces driving and preserving life in all its variety and richness. This need has resulted in numerous works being published in this field. For example, large amounts of tabular data (datasets), textual data (publications), and metadata (e.g., dataset descriptions) have been generated. Biodiversity research is therefore a data-rich domain with an exceptionally high need for data reuse. Managing and integrating the heterogeneous data of biodiversity research remains a major challenge. Our core research problem is how to enable the reusability of tabular data, which is one aspect of the FAIR data principles. In this thesis, we provide an answer to this research problem