Cross-domain Relation Extraction
Language technologies are widely spreading over a diverse range of applications. Therefore, the ability of computational systems to easily adapt to new unseen situations is becoming more and more important. In this thesis, we explore the task of Relation Extraction (RE) from a cross-domain perspective, in order to push the boundaries of model robustness across domains of application. RE is a key task in the automatic extraction of structured information from unstructured text. The goal of RE is the extraction of semantic triplets where two entities mentioned in the input text are connected by a semantic relation. The main challenge to the robustness of RE across domains is that depending on the downstream application the relevant information to extract differs (i.e., the entities and the types of semantic connections between them). The work of this thesis covers the whole experimental pipeline for RE: First, given the lack of previous work in cross-domain RE, we outline several challenges characterizing the research area, from the scarcity of available resources for studying cross-domain RE, to the lack of standards in annotation guidelines and experimental settings. Second, to address the aforementioned challenges, we describe the creation of CrossRE, a multi-domain dataset for RE in English, and its subsequent expansion to 26 languages. Third, we propose two methodologies to boost the performance of RE in this multi-domain setup. Last, we present two frameworks for the analysis of the RE pipeline in terms of model performance and presence of socio-demographic biases.
Matching Theory and Data with Personal-ITY: What a Corpus of Italian YouTube Comments Reveals About Personality
As a contribution to personality detection in languages other than English,
we rely on distant supervision to create Personal-ITY, a novel corpus of
YouTube comments in Italian, where authors are labelled with personality
traits. The traits are derived from one of the mainstream personality theories
in psychology research, named MBTI. Using personality prediction experiments,
we (i) study the task of personality prediction in itself on our corpus as well
as on TwiSty, a Twitter dataset also annotated with MBTI labels; (ii) carry out
an extensive, in-depth analysis of the features used by the classifier, and
view them specifically in light of the original theory that we used to
create the corpus in the first place. We observe that no single model is best
at personality detection, and that while some traits are easier than others to
detect, and also to match back to theory, for other, less frequent traits the
picture is much more blurred.
Comment: 12 pages, Accepted at PEOPLES 2020 (workshop COLING 2020). arXiv
admin note: text overlap with arXiv:2011.0568
Personal-ITY: A Novel YouTube-based Corpus for Personality Prediction in Italian
We present a novel corpus for personality prediction in Italian, containing a larger number of authors and a different genre compared to previously available resources. The corpus is built exploiting Distant Supervision, assigning Myers-Briggs Type Indicator (MBTI) labels to YouTube comments, and can lend itself to a variety of experiments. We report on preliminary experiments on Personal-ITY, which can serve as a baseline for future work, showing that some types are easier to predict than others, and discussing the perks of cross-dataset prediction.
Dissecting Biases in Relation Extraction: A Cross-Dataset Analysis on People’s Gender and Origin
Relation Extraction (RE) is at the core of many Natural Language Understanding tasks, including knowledge-base population and Question Answering. However, any Natural Language Processing system is exposed to biases, and the analysis of these has not received much attention in RE. We propose a new method for inspecting bias in the RE pipeline, which is completely transparent in terms of interpretability. Specifically, in this work we analyze biases related to gender and place of birth. Our methodology includes (i) obtaining semantic triplets (subject, object, semantic relation) involving ‘person’ entities from RE resources, (ii) collecting meta-information (‘gender’ and ‘place of birth’) using Entity Linking technologies, and then (iii) analyzing the distribution of triplets across different groups (e.g., men versus women). We investigate bias at two levels: in the training data of three commonly used RE datasets (SREDFM, CrossRE, NYT), and in the predictions of a state-of-the-art RE approach (ReLiK). To enable cross-dataset analysis, we introduce a taxonomy of relation types mapping the label sets of different RE datasets to a unified label space. Our findings reveal that bias is a compounded issue affecting underrepresented groups within data and predictions for RE.
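Step (iii) of the methodology above, comparing how relation types are distributed across groups, can be illustrated with a minimal sketch. This is not the authors' implementation: the triplets, the person identifiers, the `gender_of` lookup (a stand-in for metadata obtained through Entity Linking), and the function name `relation_distribution` are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy triplets (subject, relation, object) with 'person' subjects.
triplets = [
    ("person_a", "occupation", "mathematician"),
    ("person_a", "spouse", "person_x"),
    ("person_b", "occupation", "mathematician"),
    ("person_b", "occupation", "engineer"),
    ("person_c", "occupation", "engineer"),
    ("person_d", "occupation", "writer"),  # no metadata available, skipped below
]

# Hypothetical metadata; in the described pipeline this would come from Entity Linking.
gender_of = {"person_a": "female", "person_b": "male", "person_c": "female"}


def relation_distribution(triplets, group_of):
    """Return, for each group, the relative frequency of each relation type."""
    counts = defaultdict(Counter)
    for subject, relation, _obj in triplets:
        group = group_of.get(subject)
        if group is None:
            continue  # entities without metadata cannot be assigned to a group
        counts[group][relation] += 1

    distribution = {}
    for group, relation_counts in counts.items():
        total = sum(relation_counts.values())
        distribution[group] = {
            relation: n / total for relation, n in relation_counts.items()
        }
    return distribution


dist = relation_distribution(triplets, gender_of)
for group, relations in sorted(dist.items()):
    print(group, relations)
```

Normalizing the counts within each group makes groups of different sizes comparable; differences between the resulting per-group distributions are the kind of signal such an analysis would then inspect.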