15 research outputs found
Can vectors read minds better than experts? Comparing data augmentation strategies for the automated scoring of children's mindreading ability
In this paper we implement and compare 7 different data augmentation
strategies for the task of automatic scoring of children's ability to
understand others' thoughts, feelings, and desires (or "mindreading").
We recruit in-domain experts to re-annotate augmented samples and determine
to what extent each strategy preserves the original rating. We also carry out
multiple experiments to measure how much each augmentation strategy improves
the performance of automatic scoring systems. To determine the capabilities of
automatic systems to generalize to unseen data, we create UK-MIND-20 - a new
corpus of children's performance on tests of mindreading, consisting of 10,320
question-answer pairs.
We obtain a new state-of-the-art performance on the MIND-CA corpus, improving
macro-F1-score by 6 points. Results indicate that both the number of training
examples and the quality of the augmentation strategies affect the performance
of the systems. The task-specific augmentations generally outperform
task-agnostic augmentations. Automatic augmentations based on vectors (GloVe,
FastText) perform the worst.
We find that systems trained on MIND-CA generalize well to UK-MIND-20. We
demonstrate that data augmentation strategies also improve the performance on
unseen data.
Comment: The paper will be presented at ACL-IJCNLP 2021.
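The vector-based augmentation strategies the abstract mentions (GloVe, FastText) boil down to replacing words with their nearest neighbours in embedding space. A minimal sketch of that idea, using a tiny hand-made embedding table with invented values rather than real pretrained vectors, and not the paper's actual implementation:

```python
import math
import random

# Toy embedding table standing in for GloVe/FastText vectors (hypothetical values).
EMBEDDINGS = {
    "happy":    [0.9, 0.1, 0.0],
    "glad":     [0.85, 0.15, 0.05],
    "sad":      [-0.8, 0.2, 0.1],
    "thinks":   [0.1, 0.9, 0.2],
    "believes": [0.15, 0.85, 0.25],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbour(word):
    """Most similar other word in the embedding table, or None if out of vocabulary."""
    if word not in EMBEDDINGS:
        return None
    target = EMBEDDINGS[word]
    candidates = [(cosine(target, vec), w) for w, vec in EMBEDDINGS.items() if w != word]
    return max(candidates)[1]

def augment(tokens, rate=0.5, seed=0):
    """Replace a fraction of in-vocabulary tokens with their nearest neighbour."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        sub = nearest_neighbour(tok)
        out.append(sub if sub is not None and rng.random() < rate else tok)
    return out
```

The paper's finding that such automatic substitutions perform worst is plausible from this sketch: nearest neighbours in embedding space (e.g. antonyms, inflections) need not preserve the rating experts assigned to the original answer.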
Comparing distributional semantic models for identifying groups of semantically related words
Distributional Semantic Models (DSM) are growing in popularity in Computational Linguistics. DSM use corpora of language use to automatically induce formal representations of word meaning. This article focuses on one of the applications of DSM: identifying groups of semantically related words. We compare two models for obtaining formal representations: a well-known approach (CLUTO) and a more recently introduced one (Word2Vec). We compare the two models with respect to the PoS coherence and the semantic relatedness of the words within the obtained groups. We also propose a way to improve the results obtained by Word2Vec through corpus preprocessing. The results show that: a) CLUTO outperforms Word2Vec in both criteria for corpora of medium size; b) the preprocessing largely improves the results for Word2Vec with respect to both criteria.
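The core task here, grouping semantically related words from their vectors, can be illustrated with a minimal sketch. This is not the CLUTO or Word2Vec pipeline itself: the vectors are invented two-dimensional toys, and the grouping is a greedy single-link pass over cosine similarities.

```python
import math

# Hypothetical toy vectors standing in for induced word representations.
VECTORS = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "bus": [0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def group_words(vectors, threshold=0.9):
    """Greedy single-link grouping: a word joins the first group containing
    any member whose cosine similarity exceeds the threshold."""
    groups = []
    for word, vec in vectors.items():
        for group in groups:
            if any(cosine(vec, vectors[m]) >= threshold for m in group):
                group.append(word)
                break
        else:
            groups.append([word])
    return groups
```

Real DSM pipelines cluster high-dimensional vectors induced from corpora, but the evaluation criteria the article uses (PoS coherence and semantic relatedness within each group) apply to output of exactly this shape.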
DISCOver: DIStributional approach based on syntactic dependencies for discovering COnstructions
One of the goals in Cognitive Linguistics is the automatic identification and analysis of constructions, since they are fundamental linguistic units for understanding language. This article presents DISCOver, an unsupervised methodology for the automatic discovery of lexico-syntactic patterns that can be considered candidates for constructions. The methodology follows a distributional semantic approach. Concretely, it is based on our proposed pattern-construction hypothesis: contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions. Our proposal uses Distributional Semantic Models to model context, taking syntactic dependencies into account. After a clustering process, we link all clusters with strong relationships and use them as a source of information for deriving lexico-syntactic patterns, obtaining a total of 220,732 candidates from a 100-million-token corpus of Spanish. We evaluated the patterns intrinsically, applying statistical association measures, and qualitatively through expert review. Our results were superior to the baseline in both quality and quantity in all cases. While our experiments were carried out on a Spanish corpus, the methodology is language-independent and only requires a large corpus annotated with parts of speech and dependencies.
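The pattern-construction hypothesis can be made concrete with a toy sketch: given a cluster of semantically related words and the contexts they occur in, contexts shared by several cluster members become construction candidates. All words and patterns below are invented for illustration; the actual method works over dependency-annotated corpora, not flat strings.

```python
# Hypothetical (word, pattern-with-slot) observations, e.g. the pattern
# "take a <SLOT>" seen with different nouns in its slot.
OBSERVATIONS = [
    ("walk",  "take a <SLOT>"),
    ("break", "take a <SLOT>"),
    ("nap",   "take a <SLOT>"),
    ("walk",  "go for a <SLOT>"),
    ("run",   "go for a <SLOT>"),
    ("idea",  "have an <SLOT>"),
]

def candidate_patterns(cluster, observations, min_members=2):
    """Keep patterns attested with at least `min_members` distinct words of a
    semantic cluster as lexico-syntactic construction candidates."""
    members = {}
    for word, pattern in observations:
        if word in cluster:
            members.setdefault(pattern, set()).add(word)
    return [p for p, words in members.items() if len(words) >= min_members]
```

In the article the filtering is done with statistical association measures rather than a raw member count, but the flow is the same: cluster first, then mine the shared contexts.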
"What is on your mind?" Automated Scoring of Mindreading in Childhood and Early Adolescence
In this paper we present the first work on the automated scoring of
mindreading ability in middle childhood and early adolescence. We create
MIND-CA, a new corpus of 11,311 question-answer pairs in English from 1,066
children aged 7 to 14. We perform machine learning experiments and carry out
extensive quantitative and qualitative evaluation. We obtain promising results,
demonstrating the applicability of state-of-the-art NLP solutions to a new
domain and task.
Comment: Accepted at COLING 2020.
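Both mindreading-scoring papers report macro-F1, which averages per-class F1 with equal weight so that rare score categories count as much as frequent ones. A minimal pure-Python implementation of the metric (the papers themselves presumably use standard library implementations):

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: compute F1 per class, then average with equal weight."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because every class contributes equally, a 6-point macro-F1 gain (as reported on MIND-CA above) can reflect better handling of infrequent answer categories rather than just more correct predictions overall.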
Fairly Accurate: Learning Optimal Accuracy vs. Fairness Tradeoffs for Hate Speech Detection
Recent work has emphasized the importance of balancing competing objectives
in model training (e.g., accuracy vs. fairness, or competing measures of
fairness). Such trade-offs reflect a broader class of multi-objective
optimization (MOO) problems in which optimization methods seek Pareto optimal
trade-offs between competing goals. In this work, we first introduce a
differentiable measure that enables direct optimization of group fairness
(specifically, balancing accuracy across groups) in model training. Next, we
demonstrate two model-agnostic MOO frameworks for learning Pareto optimal
parameterizations over different groups of neural classification models. We
evaluate our methods on the specific task of hate speech detection, in which
prior work has shown lack of group fairness across speakers of different
English dialects. Empirical results across convolutional, sequential, and
transformer-based neural architectures show superior empirical accuracy vs.
fairness trade-offs over prior work. More significantly, our measure enables
the Pareto machinery to ensure that each architecture achieves the best
possible trade-off between fairness and accuracy w.r.t. the dataset, given
user-prescribed error tolerance bounds.
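One simple way to make a group-fairness gap differentiable, sketching the general idea rather than the paper's exact measure, is to replace hard accuracy with the mean probability assigned to the correct label, then penalize the squared gap between per-group means. All names and values below are illustrative:

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def soft_accuracy(logits, labels):
    """Differentiable surrogate for accuracy: mean probability the model
    assigns to the correct binary label."""
    probs = [sigmoid(z) if y == 1 else 1.0 - sigmoid(z) for z, y in zip(logits, labels)]
    return sum(probs) / len(probs)

def fairness_penalty(logits, labels, groups):
    """Squared gap between the soft accuracies of two groups (0 and 1).
    Smooth in the logits, so it can be optimized jointly with the task loss."""
    acc = {}
    for g in (0, 1):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        acc[g] = soft_accuracy([logits[i] for i in idx], [labels[i] for i in idx])
    return (acc[0] - acc[1]) ** 2
```

A combined objective of the form `task_loss + lam * fairness_penalty(...)` then exposes exactly the accuracy-fairness trade-off that MOO methods sweep over when tracing a Pareto front.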
Comparación de dos modelos de semántica distribucional para identificar grupos de palabras semánticamente relacionadas
Distributional Semantic Models (DSM) are growing in popularity in Computational Linguistics. DSM use corpora of language use to automatically induce formal representations of word meaning. This article focuses on one of the applications of DSM: identifying groups of semantically related words. We compare two models for obtaining formal representations: a well-known approach (CLUTO, a standard clustering tool) and a more recently introduced one (Word2Vec). We compare the two models with respect to the PoS coherence and the semantic relatedness of the words within the obtained groups. We also propose a way to improve the results obtained by Word2Vec through morphosyntactic corpus preprocessing. The results show that: a) CLUTO outperforms Word2Vec in both criteria for corpora of medium size; b) the preprocessing largely improves the results for Word2Vec with respect to both criteria.
This work was supported by projects TIN2012-38603-C02-02, SGR-2014-623 and TIN2015-71147-C2-2.