When silver glitters more than gold: Bootstrapping an Italian part-of-speech tagger for Twitter
We bootstrap a state-of-the-art part-of-speech tagger to tag Italian Twitter
data, in the context of the Evalita 2016 PoSTWITA shared task. We show that
training the tagger on native Twitter data enriched with small amounts of
specifically selected gold data and additional silver-labelled data scraped
from Facebook yields better results than using large amounts of manually
annotated data from a mix of genres.
Comment: Proceedings of the 5th Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian (EVALITA 2016)
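The paper's own tagger and data-selection pipeline are not reproduced here. As a minimal sketch of the general idea (training on a small gold set mixed with a larger silver-labelled set), assuming NLTK's averaged-perceptron tagger as a stand-in and invented example sentences:

```python
# Minimal sketch: train a tagger on gold + silver data mixed together.
# NLTK's PerceptronTagger stands in for the paper's actual tagger;
# the tiny example sentences below are invented for illustration.
from nltk.tag.perceptron import PerceptronTagger

# Hand-annotated (gold) Twitter-like sentences: lists of (token, tag) pairs.
gold_sents = [
    [("che", "DET"), ("bella", "ADJ"), ("giornata", "NOUN"), ("!", "PUNCT")],
]

# Automatically labelled (silver) sentences, e.g. scraped from Facebook
# and tagged by an existing model.
silver_sents = [
    [("domani", "ADV"), ("piove", "VERB"), (".", "PUNCT")],
]

# Simple concatenation; the paper selects its gold data more carefully.
train_sents = gold_sents + silver_sents

tagger = PerceptronTagger(load=False)   # start from an untrained model
tagger.train(train_sents, nr_iter=5)

print(tagger.tag(["che", "giornata", "!"]))
```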
The Impact of Annotation on the Performance of Protein Tagging in Biomedical Text
In this paper we discuss five different corpora annotated for protein names. We present several within- and cross-dataset protein tagging experiments showing that different annotation schemes severely affect the portability of statistical protein taggers. By means of a detailed error analysis we identify crucial annotation issues that future annotation projects should take into careful consideration.
IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation
The T5 model and its unified text-to-text paradigm contributed to advancing
the state of the art for many natural language processing tasks. While some
multilingual variants of the T5 model have recently been introduced, their
performance was found to be suboptimal for languages other than English
compared to monolingual variants. We are motivated by these
findings to introduce IT5, the first family of encoder-decoder transformer
models pretrained specifically on Italian. We perform a thorough cleaning of a
web-crawled Italian corpus including more than 40 billion words and use it to
pretrain three IT5 models of different sizes. The performance of IT5 models and
their multilingual counterparts is then evaluated on a broad range of natural
language understanding and generation benchmarks for Italian. We find the
monolingual IT5 models to provide the best scale-to-performance ratio across
tested models, consistently outperforming their multilingual counterparts and
setting a new state-of-the-art for most Italian conditional language generation
tasks.
Comment: 13 pages, 7 tables, 1 figure. Code and checkpoints available:
https://github.com/gsarti/it
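A minimal usage sketch with the Hugging Face transformers library; the checkpoint name gsarti/it5-base is an assumption (the repository link above is truncated), and the task prefix is purely illustrative:

```python
# Sketch: load an IT5 checkpoint and run conditional generation.
# The checkpoint name "gsarti/it5-base" is an assumption based on the
# (truncated) repository link; adjust it to the actual released model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "gsarti/it5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Task prefixes depend on how the model was fine-tuned; this one is made up.
text = "riassumi: L'Italia è una repubblica parlamentare dell'Europa meridionale."
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```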
mCoT: Multilingual Instruction Tuning for Reasoning Consistency in Language Models
Large language models (LLMs) with Chain-of-Thought (CoT) prompting have recently emerged as a powerful technique for eliciting reasoning to improve various downstream tasks. As most research mainly focuses on English, with few explorations in a multilingual context, the question of how reliable this reasoning capability is in different languages remains open. To address it directly, we study multilingual reasoning consistency across multiple languages, using popular open-source LLMs. First, we compile the first large-scale multilingual math reasoning dataset, mCoT-MATH, covering eleven diverse languages. Then, we introduce multilingual CoT instruction tuning to boost reasoning capability across languages, thereby improving model consistency. While existing LLMs show substantial variation across the languages we consider, and especially low performance for less-resourced languages, our 7B-parameter model mCoT achieves impressive consistency across languages, and superior or comparable performance to closed- and open-source models even of much larger sizes.
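The paper's exact consistency metric is not reproduced here; as a simplified sketch of the underlying notion, one can measure how often a model's final answers to the same math problems agree across language pairs (the answers below are made up):

```python
# Sketch: a simple pairwise notion of cross-lingual reasoning consistency,
# i.e. how often final answers to the same problem agree across languages.
# The paper's actual metric may differ; the example answers are invented.
from itertools import combinations

def consistency(answers_by_lang: dict[str, list[str]]) -> float:
    """answers_by_lang maps a language code to the model's final answers
    for the same ordered list of problems."""
    langs = list(answers_by_lang)
    pairs = list(combinations(langs, 2))
    n_problems = len(next(iter(answers_by_lang.values())))
    agree = sum(
        answers_by_lang[a][i] == answers_by_lang[b][i]
        for a, b in pairs
        for i in range(n_problems)
    )
    return agree / (len(pairs) * n_problems)

answers = {
    "en": ["12", "7", "30"],
    "it": ["12", "7", "28"],
    "sw": ["12", "9", "30"],
}
print(f"pairwise consistency: {consistency(answers):.2f}")
```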
Breeding Fillmore's Chickens and Hatching the Eggs: Recombining Frames and Roles in Frame-Semantic Parsing
IT5: Text-to-text Pretraining for Italian Language Understanding and Generation
We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.
Syntactic Features and Word Similarity for Supervised Metonymy Resolution
We present a supervised machine learning algorithm for metonymy resolution, which exploits the similarity between examples of conventional metonymy. We show that syntactic head-modifier relations are a high-precision feature for metonymy recognition but suffer from data sparseness.
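The paper uses its own parser and feature set; purely as an illustration of the kind of head-modifier feature described, a sketch using spaCy's dependency parser (the en_core_web_sm model must be installed separately):

```python
# Sketch: extract head-modifier relations for a target word with spaCy.
# This is not the paper's pipeline; it only illustrates the feature type
# (dependency relation plus the lemma of the syntactic head).
import spacy

nlp = spacy.load("en_core_web_sm")

def head_modifier_features(sentence: str, target: str) -> list[str]:
    """Return features of the form relation|head-lemma for every
    occurrence of the target word in the sentence."""
    doc = nlp(sentence)
    return [
        f"{tok.dep_}|{tok.head.lemma_}"
        for tok in doc
        if tok.lower_ == target.lower()
    ]

# "England" used metonymically (the team, not the place).
print(head_modifier_features("England won the match yesterday.", "England"))
# e.g. ['nsubj|win'] -> being the subject of "win" favours a metonymic reading
```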
Multi-Figurative Language Generation
Figurative language generation is the task of reformulating a given text in the desired figure of speech while remaining faithful to the original context. We take the first step towards multi-figurative language modelling by providing a benchmark for the automatic generation of five common figurative forms in English. We train mFLAG by employing a scheme for multi-figurative language pre-training on top of BART and a mechanism for injecting the target figurative information into the encoder; this enables the generation of text with the target figurative form from another figurative form without parallel figurative-figurative sentence pairs. Our approach outperforms all strong baselines. We also offer some qualitative analysis and reflections on the relationship between the different figures of speech.
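mFLAG's encoder-injection mechanism is not reproduced here. As a simplified stand-in under stated assumptions, one could control the target figure of speech by prepending a hypothetical control token to a plain BART input before fine-tuning; this is not the paper's method, only a sketch of form-conditioned generation:

```python
# Sketch: a simplified stand-in for figurative-form control. mFLAG injects
# the target form into the encoder; here we merely prepend a control token
# (e.g. <hyperbole>) to a plain BART input, which is NOT the paper's method.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Hypothetical control tokens, one per target figure of speech.
control_tokens = ["<hyperbole>", "<idiom>", "<irony>", "<metaphor>", "<simile>"]
tokenizer.add_tokens(control_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# After fine-tuning on (source form, target form) pairs, generation would be:
inputs = tokenizer("<hyperbole> The food was good.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```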
- …