I run as fast as a rabbit, can you? A Multilingual Simile Dialogue Dataset
A simile is a figure of speech that compares two different things (called the
tenor and the vehicle) via shared properties. The tenor and the vehicle are
usually connected by comparator words such as "like" or "as". Simile phenomena
are uniquely complex in real-life dialogue, where the tenor and the vehicle can
be verbal phrases or sentences, may be mentioned by different speakers, may
appear in different sentences, or may occur in reversed order. However, current
simile research usually focuses on similes as a (tenor, property, vehicle)
triplet or within a single sentence, where the tenor and vehicle are usually
entities or noun phrases, which cannot reflect the complex simile phenomena of
real scenarios. In this paper, we propose a novel, high-quality
multilingual simile dialogue (MSD) dataset to facilitate the study of complex
simile phenomena. MSD is the largest manually annotated simile dataset (20K)
and contains both English and Chinese data. The MSD data can also be used for
dialogue tasks, testing the ability of dialogue systems to handle similes. We
design 3 simile tasks (recognition, interpretation, and generation) and 2
dialogue tasks (retrieval and generation) with MSD. For each task, we provide
experimental results from strong pre-trained or state-of-the-art models. The
experiments demonstrate the difficulty of MSD, and we have released the data
and code on GitHub.
Comment: 13 Pages, 1 Figure, 12 Tables, ACL 2023 Findings
On the Impact of Temporal Representations on Metaphor Detection
State-of-the-art approaches for metaphor detection compare a word's literal (or
core) meaning with its contextual meaning, using metaphor classifiers based on
neural networks. However, metaphorical expressions evolve over time due to
various reasons, such as cultural and societal impact. Metaphorical expressions
are known to co-evolve with language and literal word meanings, and even drive,
to some extent, this evolution. This poses the question of whether different,
possibly time-specific, representations of literal meanings may impact the
metaphor detection task. To the best of our knowledge, this is the first study
that examines the metaphor detection task with a detailed exploratory analysis
where different temporal and static word embeddings are used to account for
different representations of literal meanings. Our experimental analysis is
based on three popular benchmarks used for metaphor detection and word
embeddings extracted from different corpora and temporally aligned using
different state-of-the-art approaches. The results suggest that the choice of
static word embedding method does impact the metaphor detection task, and that
some temporal word embeddings slightly outperform static methods. However, the
results also suggest that temporal word embeddings may yield representations of
a metaphor's core meaning that are too close to its contextual meaning, thus
confusing the classifier. Overall, the interaction between temporal language
evolution and metaphor detection appears small in the benchmark datasets used
in our experiments. This suggests that future work on the computational
analysis of this important linguistic phenomenon should start by creating a new
dataset where this interaction is better represented.
Comment: 12 pages, 4 figures
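One way to picture the failure mode described above: if a temporal embedding of a word's literal meaning drifts toward its contextual meaning, the similarity gap that signals a metaphor shrinks. A minimal sketch of that similarity check, with made-up toy vectors — real detectors use learned, high-dimensional embeddings and a trained neural classifier, not a fixed threshold:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def metaphor_signal(literal_vec, contextual_vec, threshold=0.5):
    """Flag a token as a metaphor candidate when its contextual
    representation is far from its literal (core) representation."""
    return cosine(literal_vec, contextual_vec) < threshold

# Toy 3-d vectors; real embeddings have hundreds of dimensions.
literal = [1.0, 0.0, 0.0]      # static/temporal embedding of the word
context_far = [0.1, 0.9, 0.2]  # contextual use far from the core meaning
context_near = [0.9, 0.1, 0.0] # contextual use close to the core meaning

print(metaphor_signal(literal, context_far))   # True
print(metaphor_signal(literal, context_near))  # False: the gap has vanished
```

If the literal-meaning vector itself moves close to the contextual vector (as the abstract suggests some temporal embeddings do), the second case becomes the norm and the classifier loses its signal.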
Extended Parallel Corpus for Amharic-English Machine Translation
This paper describes the acquisition, preprocessing, segmentation, and
alignment of an Amharic-English parallel corpus, which will be useful for
machine translation of Amharic, an under-resourced language. The corpus is
larger than previously compiled corpora and is released for research purposes.
We trained neural machine translation and phrase-based statistical machine
translation models using the corpus. In the automatic evaluation, the neural
machine translation models outperform the phrase-based statistical machine
translation models.
Comment: Accepted to 2nd AfricanNLP workshop at EACL 202
Entity Linking for the Semantic Annotation of Italian Tweets
Linking entity mentions in Italian tweets to concepts in a knowledge base is a challenging task, due to the short and noisy nature of these messages and the lack of specific resources for Italian. This paper proposes an adaptation of a general-purpose Named Entity Linking algorithm that exploits a similarity measure computed over a Distributional Semantic Model in the context of Italian tweets. In order to evaluate the proposed algorithm, we introduce a new dataset of tweets for entity linking that we have developed specifically for the Italian language.
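The core idea — choosing the knowledge-base entity whose distributional vector best matches the tweet context — can be sketched as follows. The entity names and toy vectors are our own illustration, not the actual Distributional Semantic Model or candidate set used in the paper:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def link_mention(context_vec, candidates):
    """Pick the candidate entity whose distributional vector is most
    similar to the vector built from the mention's tweet context."""
    return max(candidates, key=lambda name: cosine(context_vec, candidates[name]))

# Toy example: the ambiguous mention "Roma" in a football tweet.
candidates = {
    "AS_Roma": [0.9, 0.1, 0.0],     # the football club
    "Rome_(city)": [0.1, 0.9, 0.1], # the city
}
tweet_context = [0.8, 0.2, 0.1]     # e.g., averaged vectors of context words

print(link_mention(tweet_context, candidates))  # AS_Roma
```

A sports-heavy context vector lands closer to the club's vector than to the city's, so the club wins the ranking.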
Conversational Browsing
How can we better understand the mechanisms behind multi-turn information
seeking dialogues? How can we use these insights to design a dialogue system
that does not require explicit query formulation upfront as in question
answering? To answer these questions, we collected observations of human
participants performing a comparable task to obtain inspiration for the system
design. Then, we studied the structure of the conversations that occurred in
these settings and used the resulting insights to develop a grounded theory and
to design and evaluate a first system prototype. Evaluation results show that our
approach is effective and can complement query-based information retrieval
approaches. We contribute new insights about information-seeking behavior by
analyzing and providing automated support for a type of information-seeking
strategy that is effective when the clarity of the information need and
familiarity with the collection content are low.
Robust Entity Linking in Heterogeneous Domains
Entity Linking is the task of mapping terms in arbitrary documents to entities in a knowledge base by identifying the correct semantic meaning. It is applied in the extraction of structured data in RDF (Resource Description Framework) from textual documents, but equally in facilitating artificial intelligence applications such as Semantic Search, Reasoning, and Question Answering. Most existing Entity Linking systems were optimized for specific domains (e.g., general domain, biomedical domain), knowledge base types (e.g., DBpedia, Wikipedia), or document structures (e.g., tables) and types (e.g., news articles, tweets). This led to very specialized systems that lack robustness and are only applicable to very specific tasks. In this regard, this work focuses on the research and development of a robust Entity Linking system in terms of domains, knowledge base types, and document structures and types.
To create a robust Entity Linking system, we first analyze the following three crucial components of an Entity Linking algorithm in terms of robustness criteria: (i) the underlying knowledge base, (ii) the entity relatedness measure, and (iii) the textual context matching technique. Based on the analyzed components, our scientific contributions are three-fold. First, we show that a federated approach leveraging knowledge from various knowledge base types can significantly improve robustness in Entity Linking systems. Second, we propose a new state-of-the-art, robust entity relatedness measure for topical coherence computation based on semantic entity embeddings. Third, we present the neural-network-based approach Doc2Vec as a textual context matching technique for robust Entity Linking.
Based on our previous findings and outcomes, our main contribution in this work is DoSeR (Disambiguation of Semantic Resources). DoSeR is a robust, knowledge-base-agnostic Entity Linking framework that extracts relevant entity information from multiple knowledge bases in a fully automatic way. The integrated algorithm represents a collective, graph-based approach that utilizes semantic entity and document embeddings for entity relatedness and textual context matching computation. Our evaluation shows that DoSeR achieves state-of-the-art results over a wide range of document structures (e.g., tables), document types (e.g., news documents), and domains (e.g., general domain, biomedical domain). In this context, DoSeR outperforms all other (publicly available) Entity Linking algorithms on most datasets.
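The collective, embedding-based disambiguation idea can be sketched as a search over candidate assignments that trades off pairwise entity coherence against similarity to a document embedding. This is a toy illustration of the general scheme under our own simplifications (tiny hand-made vectors, exhaustive search), not DoSeR's actual graph algorithm:

```python
from itertools import product
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def disambiguate(mentions, doc_vec):
    """Choose one candidate entity per mention, maximizing the sum of
    pairwise entity relatedness (topical coherence) and similarity to
    the document embedding (textual context matching)."""
    names = list(mentions)
    best, best_score = None, float("-inf")
    for assignment in product(*(mentions[m].items() for m in names)):
        vecs = [vec for _, vec in assignment]
        coherence = sum(cosine(a, b)
                        for i, a in enumerate(vecs) for b in vecs[i + 1:])
        context = sum(cosine(doc_vec, v) for v in vecs)
        if coherence + context > best_score:
            best_score = coherence + context
            best = dict(zip(names, (name for name, _ in assignment)))
    return best

# Toy document about programming: both mentions should resolve to languages.
mentions = {
    "Java": {"Java_(language)": [1.0, 0.0, 0.0],
             "Java_(island)": [0.0, 1.0, 0.0]},
    "Python": {"Python_(language)": [0.9, 0.1, 0.0],
               "Python_(snake)": [0.0, 0.2, 1.0]},
}
doc_vec = [0.7, 0.1, 0.1]  # document embedding (e.g., Doc2Vec-style)

print(disambiguate(mentions, doc_vec))
```

The two programming-language candidates reinforce each other through the coherence term while also matching the document vector, so the collective score picks both, even though each mention is ambiguous in isolation.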