Measuring Harmful Representations in Scandinavian Language Models
Scandinavian countries are perceived as role-models when it comes to gender
equality. With the advent of pre-trained language models and their widespread
usage, we investigate to what extent gender-based harmful and toxic content
exist in selected Scandinavian language models. We examine nine models,
covering Danish, Swedish, and Norwegian, by manually creating template-based
sentences and probing the models for completion. We evaluate the completions
using two methods for measuring harmful and toxic completions and provide a
thorough analysis of the results. We show that Scandinavian pre-trained
language models contain harmful and gender-based stereotypes with similar
values across all languages. This finding goes against the general expectations
related to gender equality in Scandinavian countries and shows the possible
problematic outcomes of using such models in real-world settings.
Comment: Accepted at the 5th Workshop on Natural Language Processing and Computational Social Science (NLP+CSS) at EMNLP 2022 in Abu Dhabi, December 7, 2022.
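The template-based probing described above can be sketched as follows. The templates, demographic groups, stand-in completion function, and toxicity lexicon are all invented for illustration; a real setup would query one of the Scandinavian masked language models for its top-k completions and use a proper toxicity classifier.

```python
# Hypothetical sketch of template-based probing for harmful completions.
# All templates, groups, and lexicon entries are illustrative, not the
# ones used in the paper.

TEMPLATES = ["The {group} is very [MASK].", "All {group}s are [MASK]."]
GROUPS = ["woman", "man"]

def build_prompts(templates, groups):
    """Instantiate each template with each demographic group."""
    return [t.format(group=g) for t in templates for g in groups]

def toy_completions(prompt):
    """Stand-in for a masked LM: a real setup would return the model's
    top-k fills for the [MASK] token in this prompt."""
    return ["kind", "stupid", "strong"]

TOXIC_LEXICON = {"stupid", "ugly", "weak"}  # toy lexicon-based scorer

def toxicity_rate(prompts, complete=toy_completions):
    """Fraction of completions flagged by the lexicon matcher."""
    fills = [w for p in prompts for w in complete(p)]
    return sum(w in TOXIC_LEXICON for w in fills) / len(fills)

prompts = build_prompts(TEMPLATES, GROUPS)
print(len(prompts))            # 4 prompts (2 templates x 2 groups)
print(toxicity_rate(prompts))  # 1/3 of toy completions flagged
```

Comparing this rate across groups (e.g. "woman" vs. "man" prompts) is what surfaces gender-based differences in the completions.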
ADIOS LDA: When Grammar Induction Meets Topic Modeling
We explore the interplay between grammar induction and topic modeling approaches to unsupervised text processing. These two methods complement each other: one identifies local structures centered around certain key terms, while the other generates a document-wide context of expressed topics. This combination gives us access to semantic structures that would be hard to discover using only one of the two methods. Using our approach, we provide a deeper understanding of topic structure by examining inferred information structures characteristic of given topics, and we capture differences in word usage that would be hard to detect with standard disambiguation methods. We perform our exploration on an extensive corpus of blog posts centered around the surveillance discussion, focusing on the debate around the Snowden affair. We show how our approach can be used for (semi-)automated content classification and the extraction of semantic features from large textual corpora.
Constructions: a new unit of analysis for corpus-based discourse analysis
We propose and assess the novel idea of using automatically induced constructions as a unit of analysis for corpus-based discourse analysis. Automated techniques are needed in order to elucidate important characteristics of corpora for social science research into topics, framing and argument structures. Compared with current techniques (keywords, n-grams, and collocations), constructions capture more linguistic patterning, including some grammatical phenomena. Recent advances in natural language processing mean that it is now feasible to automatically induce some constructions from large unannotated corpora. In order to assess how well constructions characterise the content of a corpus and how well they elucidate interesting aspects of different discourses, we analysed a corpus of climate change blogs. The utility of constructions for corpus-based discourse analysis was compared qualitatively with keywords, n-grams and collocations. We found that the unusually frequent constructions gave interesting and different insights into the content of the discourses and enabled better comparison of sub-corpora.
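For contrast with the construction-based unit, two of the simpler baseline units mentioned above, keywords (by frequency) and collocations (here scored with pointwise mutual information), can be computed in a few lines. The toy corpus and the PMI scoring choice are illustrative, not the paper's exact setup.

```python
from collections import Counter
import math

# Invented toy corpus standing in for the climate change blog data.
corpus = ("climate change is real and climate change is urgent "
          "the climate debate is heated").split()

unigrams = Counter(corpus)                  # keyword candidates
bigrams = Counter(zip(corpus, corpus[1:]))  # collocation candidates
total = len(corpus)

def pmi(w1, w2):
    """Pointwise mutual information of a bigram: how much more often
    the pair co-occurs than chance would predict."""
    p_xy = bigrams[(w1, w2)] / (total - 1)
    p_x = unigrams[w1] / total
    p_y = unigrams[w2] / total
    return math.log2(p_xy / (p_x * p_y))

print(unigrams.most_common(1)[0][0])      # most frequent keyword
print(round(pmi("climate", "change"), 2)) # strong collocation score
```

Constructions extend this idea from adjacent word pairs to longer, partially grammatical patterns, which is what the abstract argues captures more of the discourse.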
JSEEGraph: Joint Structured Event Extraction as Graph Parsing
We propose JSEEGraph, a graph-based event extraction framework that approaches event extraction as general graph parsing in the tradition of Meaning Representation Parsing. It explicitly encodes entities and events in a single semantic graph, and it has the flexibility to encode a wider range of additional IE relations and to jointly infer individual tasks. JSEEGraph operates end-to-end via general graph parsing: (1) instead of flat sequence labelling, nested structures between entities/triggers are efficiently encoded as separate nodes in the graph, allowing for nested and overlapping entities and triggers; (2) entities, relations, and events can be encoded in the same graph, where entities and event triggers are represented as nodes and entity relations and event arguments are constructed via edges; (3) joint inference avoids error propagation and enhances the interplay between different IE tasks. We experiment on two benchmark datasets of varying structural complexity, ACE05 and Rich ERE, covering three languages: English, Chinese, and Spanish. Experimental results show that JSEEGraph can handle nested event structures, that it is beneficial to solve different IE tasks jointly, and that event argument extraction in particular benefits from entity extraction. Our code and models are released as open source.
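The joint encoding in point (2) can be illustrated with a minimal data structure: entities and event triggers as nodes, event arguments as labelled edges. The sentence, labels, and node ids below are invented for illustration and do not reproduce JSEEGraph's actual graph format.

```python
# Illustrative sketch of a single semantic graph holding both entities
# and an event; all labels and ids are hypothetical.
graph = {"nodes": [], "edges": []}

def add_node(node_id, span, label):
    graph["nodes"].append({"id": node_id, "span": span, "label": label})

def add_edge(source, target, role):
    graph["edges"].append({"src": source, "tgt": target, "role": role})

# Toy sentence: "The CEO fired an employee in Oslo."
add_node("e1", "CEO", "PER")             # entity node
add_node("e2", "employee", "PER")        # entity node
add_node("e3", "Oslo", "LOC")            # entity node
add_node("t1", "fired", "End-Position")  # event trigger, same graph

add_edge("t1", "e1", "Agent")   # event argument edges link the
add_edge("t1", "e2", "Person")  # trigger node to entity nodes
add_edge("t1", "e3", "Place")

print(len(graph["nodes"]), len(graph["edges"]))  # 4 3
```

Because everything lives in one graph, nested or overlapping spans simply become additional nodes, and a parser predicting nodes and edges jointly performs entity, relation, and event extraction in one pass.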
Automated Claim Detection for Fact-checking: A Case Study using Norwegian Pre-trained Language Models
We investigate to what extent pre-trained language models can be used for automated claim detection for fact-checking in a low-resource setting. We explore this idea by fine-tuning four Norwegian pre-trained language models to perform the binary classification task of determining whether a claim should be discarded or upheld for further processing by human fact-checkers. We conduct a set of experiments to compare the performance of the language models and provide a simple baseline model using an SVM with tf-idf features. Since we focus on claim detection, the recall score for the upheld class is emphasized over other performance measures. Our experiments indicate that the language models are superior to the baseline system in terms of F1, while the baseline model achieves the highest precision. The two Norwegian models NorBERT2 and NB-BERT_large give the best F1 and recall values, respectively. We argue that large language models can be successfully employed to solve the automated claim detection problem; the choice of model depends on the desired end goal. Moreover, our error analysis shows that the language models are generally less sensitive than the SVM model to changes in claim length and source.
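A minimal version of the tf-idf + SVM baseline could look like the following. The example claims, labels, and scikit-learn configuration are invented stand-ins, not the paper's actual data or setup.

```python
# Sketch of a tf-idf + SVM claim detection baseline; the four claims
# and their labels are hypothetical examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

claims = [
    "The unemployment rate doubled last year.",
    "I think the weather was nice yesterday.",
    "The new bridge cost 2 billion kroner.",
    "What a lovely day this is!",
]
# "upheld" = check-worthy, passed on to human fact-checkers.
labels = ["upheld", "discarded", "upheld", "discarded"]

baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
baseline.fit(claims, labels)

# Likely classified "upheld" given the vocabulary overlap with claim 3.
print(baseline.predict(["The bridge cost 2 billion kroner."])[0])
```

In the paper's setting, recall on the upheld class is the key metric: discarding a check-worthy claim is costlier than forwarding a harmless one.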
Learning Horn Envelopes via Queries from Large Language Models
We investigate an approach for extracting knowledge from trained neural
networks based on Angluin's exact learning model with membership and
equivalence queries to an oracle. In this approach, the oracle is a trained
neural network. We consider Angluin's classical algorithm for learning Horn
theories and study the necessary changes to make it applicable to learn from
neural networks. In particular, we have to consider that trained neural
networks may not behave as Horn oracles, meaning that their underlying target
theory may not be Horn. We propose a new algorithm that aims at extracting the
"tightest Horn approximation" of the target theory and that is guaranteed to
terminate in exponential time (in the worst case) and in polynomial time if the
target has polynomially many non-Horn examples. To showcase the applicability
of the approach, we perform experiments on pre-trained language models and
extract rules that expose occupation-based gender biases.
Comment: 35 pages, 2 figures; manuscript accepted for publication in the International Journal of Approximate Reasoning (IJAR).
Learning Horn envelopes via queries from language models
We present an approach for systematically probing a trained neural network to extract a symbolic abstraction of it, represented as a Boolean formula. We formulate this task within Angluin's exact learning framework, where a learner attempts to extract information from an oracle (in our work, the neural network) by posing membership and equivalence queries. We adapt Angluin's algorithm for Horn formulas to the case where the examples are labelled w.r.t. an arbitrary Boolean formula in CNF (rather than a Horn formula). In this setting, the goal is to learn the smallest representation of all the Horn clauses implied by a Boolean formula, called its Horn envelope, which in our case corresponds to the rules obeyed by the network. Our algorithm terminates in exponential time in the worst case and in polynomial time if the target Boolean formula can be closely approximated by its envelope. We also show that extracting Horn envelopes in polynomial time is as hard as learning CNFs in polynomial time. To showcase the applicability of the approach, we perform experiments on BERT-based language models and extract Horn envelopes that expose occupation-based gender biases.
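One way to make the Horn envelope concrete: a Boolean function is Horn exactly when its set of models is closed under component-wise intersection (bitwise AND), so the envelope's models are the intersection closure of the function's positive examples. The toy oracle below, XOR over two variables, is a deliberately non-Horn function chosen for illustration; enumerating all assignments stands in for membership queries, which is only feasible at this tiny scale.

```python
from itertools import product

def oracle(x):
    """Toy membership oracle: XOR, a non-Horn Boolean function.
    In the paper's setting the oracle is a trained neural network."""
    return x[0] ^ x[1]

def intersection_closure(models):
    """Close a set of models under component-wise AND; the result is
    exactly the model set of the function's Horn envelope."""
    closed = set(models)
    changed = True
    while changed:
        changed = False
        for a in list(closed):
            for b in list(closed):
                c = tuple(ai & bi for ai, bi in zip(a, b))
                if c not in closed:
                    closed.add(c)
                    changed = True
    return closed

n = 2
positives = {x for x in product((0, 1), repeat=n) if oracle(x)}
envelope_models = intersection_closure(positives)
print(sorted(envelope_models))  # [(0, 0), (0, 1), (1, 0)]
```

Here the closure adds (0, 0), so the envelope of XOR is the single Horn clause ¬x0 ∨ ¬x1: every assignment except (1, 1) is a model. The paper's algorithm recovers such envelopes via queries rather than exhaustive enumeration.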
