MELA: Multilingual Evaluation of Linguistic Acceptability
Recent benchmarks for Large Language Models (LLMs) have mostly focused on
application-driven tasks such as complex reasoning and code generation, and
this has led to a scarcity of purely linguistic evaluation of LLMs. Against
this background, we introduce Multilingual Evaluation of Linguistic
Acceptability -- MELA, the first multilingual benchmark on linguistic
acceptability with 48K samples covering 10 languages from a diverse set of
language families. We establish baselines of commonly used LLMs along with
supervised models, and conduct cross-lingual transfer and multi-task learning
experiments with XLM-R. In pursuit of multilingual interpretability, we analyze
the weights of fine-tuned XLM-R to explore the possibility of identifying
transfer difficulty between languages. Our results show that ChatGPT benefits
substantially from in-context examples but still lags behind fine-tuned XLM-R,
while GPT-4 performs on par with fine-tuned XLM-R even in the zero-shot setting.
Cross-lingual and multi-task learning experiments show that, unlike for semantic
tasks, in-language training data is crucial for acceptability judgments.
Results in layerwise probing indicate that the upper layers of XLM-R become a
task-specific but language-agnostic region for multilingual acceptability
judgment. We also introduce the concept of conflicting weight, which could be a
potential indicator for the difficulty of cross-lingual transfer between
languages. Our data will be available at https://github.com/sjtu-compling/MELA.
Comment: Work in progress.
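The supervised baseline described above amounts to a standard sequence-classification fine-tune. Below is a minimal sketch of that setup, assuming a hypothetical JSONL file with "sentence" and integer "label" fields and illustrative hyperparameters; the authors' exact configuration may differ.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # acceptable vs. unacceptable
)

# Hypothetical data file; field names are assumptions for illustration.
dataset = load_dataset("json", data_files={"train": "mela_train.jsonl"})

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="xlmr-acceptability",
    per_device_train_batch_size=32,  # illustrative hyperparameters
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```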
The challenging task of summary evaluation: an overview
Evaluation is crucial in the research and development of automatic summarization applications, in order to determine the appropriateness of a summary based on different criteria, such as the content it contains and the way it is presented. Performing an adequate evaluation is of great relevance to ensure that automatic summaries can be useful for the context and/or application they are generated for. To this end, researchers must be aware of the evaluation metrics, approaches, and datasets that are available, in order to decide which of them would be the most suitable to use, or to be able to propose new ones that overcome the limitations existing methods may present. In this article, a critical and historical analysis of evaluation metrics, methods, and datasets for automatic summarization systems is presented; the strengths and weaknesses of evaluation efforts are discussed, and the major challenges still to be solved are identified. The article thus provides a clear, up-to-date overview of the evolution and progress of summarization evaluation, giving the reader useful insights into the past, present, and latest trends in the automatic evaluation of summaries.

This research is partially funded by the European Commission under the Seventh (FP7, 2007-2013) Framework Programme for Research and Technological Development through the SAM (FP7-611312) project; by the Spanish Government through the projects VoxPopuli (TIN2013-47090-C3-1-P) and Vemodalen (TIN2015-71785-R); by the Generalitat Valenciana through the project DIIM2.0 (PROMETEOII/2014/001); and by the Universidad Nacional de Educación a Distancia through the project "Modelado y síntesis automática de opiniones de usuario en redes sociales" (2014-001-UNED-PROY).
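As a concrete anchor for the content-overlap metrics this survey covers, here is a minimal sketch of ROUGE-1 over whitespace tokens. Real implementations (e.g. the rouge-score package) add stemming, stopword handling, and multi-reference support.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """ROUGE-1: unigram overlap with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("the cat sat on the mat", "a cat was sitting on the mat"))
```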
Beyond Extractive: Advancing Abstractive Automatic Text Summarization in Norwegian with Transformers
Automatic summarization is a key area in natural language processing (NLP) and machine learning that attempts to generate informative summaries of articles and documents. Despite its evolution since the 1950s, research on automatically summarizing Norwegian text has remained relatively underdeveloped. Although some strides have been made with extractive systems, which generate summaries by selecting and condensing key phrases directly from the source material, abstractive summarization remains unexplored for the Norwegian language. Abstractive summarization is distinct in that it generates summaries incorporating new words and phrases not present in the original text.
This Master's thesis revolves around one key question: Is it possible to create a machine learning system capable of performing abstractive summarization in Norwegian? To answer this question, we generate and release the first two Norwegian datasets for creating and evaluating Norwegian summarization models. One of these datasets is a web scrape of Store Norske Leksikon (SNL), and the other is a machine-translated version of CNN/Daily Mail. Using these datasets, we fine-tune two Norwegian T5 language models, with 580M and 1.2B parameters, to create summaries. To assess the quality of the models, we employ both automatic ROUGE scores and human evaluations of the generated summaries. In an effort to better understand model behaviour, we measure how a model generates summaries with various metrics, including our own novel contribution, which we name "Match Ratio": a measure of sentence similarity between summaries and articles based on Levenshtein distance.
The top-performing models achieved ROUGE-1 scores of 35.07 and 34.02 on SNL and CNN/DM, respectively. In terms of human evaluation, the best model yielded an average score of 3.96/5.00 for SNL and 4.64/5.00 for CNN/Daily Mail across various criteria. Based on these results, we conclude that it is possible to perform abstractive summarization of Norwegian text with high-quality summaries. With this research, we have laid a foundation that will hopefully facilitate future work, empowering others to build upon our findings and contribute further to the development of Norwegian summarization models.
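The thesis's exact definition of Match Ratio is not reproduced here, so the following is only one plausible reading of the description above: for each summary sentence, take the normalised Levenshtein distance to its closest article sentence and count how many fall under a threshold. The threshold value and max-length normalisation are assumptions for illustration.

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def match_ratio(summary_sents: List[str], article_sents: List[str],
                threshold: float = 0.2) -> float:
    """Fraction of summary sentences whose normalised distance to the
    closest article sentence is at most `threshold` (our assumption)."""
    matched = 0
    for s in summary_sents:
        best = min(
            levenshtein(s, a) / max(len(s), len(a), 1)
            for a in article_sents
        )
        if best <= threshold:
            matched += 1
    return matched / max(len(summary_sents), 1)
```

Under this reading, a high Match Ratio signals near-extractive behaviour (summary sentences copied almost verbatim), while a low one suggests genuinely abstractive generation.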
Symbolic Knowledge Distillation: from General Language Models to Commonsense Models
The common practice for training commonsense models has gone
from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in
order to train commonsense models. In this work, we investigate an alternative,
from-machine-to-corpus-to-machine: general language models author these
commonsense knowledge graphs to train commonsense models. Our study leads to a
new framework, Symbolic Knowledge Distillation. As with prior art in Knowledge
Distillation (Hinton et al., 2015), our approach uses larger models to teach
smaller models. A key difference is that we distill knowledge symbolically, as
text, in addition to the neural model. We also distill only one aspect, the
commonsense of a general language model teacher, allowing the student to be a
different type, a commonsense model. Altogether, we show that careful prompt
engineering and a separately trained critic model allow us to selectively
distill high-quality causal commonsense from GPT-3, a general language model.
Empirical results demonstrate that, for the first time, a human-authored
commonsense knowledge graph is surpassed by our automatically distilled variant
in all three criteria: quantity, quality, and diversity. In addition, it
results in a neural commonsense model that surpasses the teacher model's
commonsense capabilities despite being 100x smaller. We apply this approach to
the ATOMIC resource, and share our new symbolic knowledge graph and commonsense
models.
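The generate-then-filter loop described above can be sketched as follows, with small stand-in models in place of GPT-3 and the paper's trained critic. The prompt format, model choices, and acceptance cutoff are illustrative assumptions, not the paper's exact setup.

```python
from transformers import pipeline

# Stand-in teacher and critic: the paper uses GPT-3 as the teacher and a
# critic classifier trained on human quality judgments of generated triples.
teacher = pipeline("text-generation", model="gpt2")
critic = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def distill(event: str, n: int = 10, cutoff: float = 0.9):
    """Sample n candidate inferences for an event and keep only those
    the critic accepts with confidence >= cutoff (our assumption)."""
    prompt = f"{event}. As a result, PersonX"  # hypothetical prompt format
    candidates = teacher(
        prompt, max_new_tokens=12, num_return_sequences=n, do_sample=True
    )
    kept = []
    for cand in candidates:
        inference = cand["generated_text"][len(prompt):].strip()
        verdict = critic(f"{event} -> {inference}")[0]
        if verdict["label"] == "POSITIVE" and verdict["score"] >= cutoff:
            kept.append(inference)
    return kept

print(distill("PersonX pays for PersonY's coffee"))
```

The kept (event, inference) pairs form the distilled symbolic knowledge graph, which in turn serves as training data for the smaller student commonsense model.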
Monolingual Sentence Rewriting as Machine Translation: Generation and Evaluation
In this thesis, we investigate approaches to paraphrasing entire sentences within the constraints of a given task, which we call monolingual sentence rewriting. We introduce a unified framework for monolingual sentence rewriting, and apply it to three representative tasks: sentence compression, text simplification, and grammatical error correction. We also perform a detailed analysis of the evaluation methodologies for each task, identify bias in common evaluation techniques, and propose more reliable practices.
Monolingual rewriting can be thought of as translating between two types of English (such as from complex to simple), and therefore our approach is inspired by statistical machine translation. In machine translation, a large quantity of parallel data is necessary to model the transformations from input to output text. Parallel bilingual data naturally occurs between common language pairs (such as English and French), but for monolingual sentence rewriting, there is little existing parallel data and annotation is costly. We modify the statistical machine translation pipeline to harness monolingual resources and insights into task constraints in order to drastically diminish the amount of annotated data necessary to train a robust system. Our method generates more meaning-preserving and grammatical sentences than earlier approaches and requires less task-specific data.
Once candidate sentences are generated, it is crucial to have reliable evaluation methods. Sentential paraphrases must fulfill a variety of requirements: preserve the meaning of the original sentence, be grammatical, and meet any stylistic or task-specific constraints. We analyze common evaluation practices and propose better methods that more accurately measure the quality of output. Though often overlooked, robust automatic evaluation methodology is necessary for improving systems, and this work presents new metrics and outlines important considerations for reliably measuring the quality of generated text.
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language
Generation (NLG), defined as the task of generating text or speech from
non-linguistic input. A survey of NLG is timely in view of the changes that the
field has undergone over the past decade or so, especially in relation to new
(usually data-driven) methods, as well as new applications of NLG technology.
This survey therefore aims to (a) give an up-to-date synthesis of research on
the core tasks in NLG and the architectures within which such tasks are
organised; (b) highlight a number of relatively recent research topics that
have arisen partly as a result of growing synergies between NLG and other areas
of artificial intelligence; (c) draw attention to the challenges in NLG
evaluation, relating them to similar challenges faced in other areas of Natural
Language Processing, with an emphasis on different evaluation methods and the
relationships between them.
Comment: Published in the Journal of AI Research (JAIR), volume 61, pp. 75-170. 118 pages, 8 figures, 1 table.