118 research outputs found

Translational Divergences and Their Alignment in a Parallel Multilingual Treebank

Author: Bosco Cristina
Sanguinetti Manuela
Publication venue: Edições Colibri
Publication date: 01/01/2012

The usefulness of parallel corpora in translation studies and machine translation depends closely on the availability of aligned data. In this paper we discuss the issues related to the design of a tool for aligning data from a parallel treebank, one that takes into account the morphological, syntactic and semantic knowledge annotated in this kind of resource. We present a preliminary analysis based on a case study: ParTUT, a parallel treebank for Italian, English and French. The paper focuses in particular on translational divergences and their implications for the development of a tool for aligning parallel parse trees that, benefitting from the linguistic information provided in ParTUT, could properly deal with such divergences.

Archivio istituzionale della ricerca - Università di Cagliari

Institutional Research Information System University of Turin

Exploring sentiment in social media and official statistics: A general framework

Author: Lai Mirko
Sanguinetti Manuela
Sulis Emilio
Vinai Manuela
Publication venue: CEUR-WS
Publication date: 01/01/2015

Institutional Research Information System University of Turin

The Parallel-TUT: a multilingual and multiformat treebank

Author: Bosco Cristina
Lesmo Leonardo
Sanguinetti Manuela
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2012

The paper introduces an ongoing project for the development of a parallel treebank for Italian, English and French, i.e. Parallel-TUT, or simply ParTUT. For the development of this resource, both the dependency and constituency-based formats of the Italian Turin University Treebank (TUT) have been applied to a preliminary dataset, which includes the whole text of the Universal Declaration of Human Rights, sentences from the JRC-Acquis Multilingual Parallel Corpus and the Creative Commons licence. The project focuses mainly on the quality of the annotation and on investigating some issues related to the alignment of data that the TUT formats allow, also taking into account the availability of conversion tools for displaying data in standard formats such as Tiger-XML and CoNLL. We believe that increasing the portability of our treebank could give us access to resources and tools provided by other research groups, especially at this stage of the project, when no tool compatible with the TUT format is available to tackle the alignment problems.

Archivio istituzionale della ricerca - Università di Cagliari

Institutional Research Information System University of Turin

Long-term social media data collection at the University of Turin

Author: Basile Valerio
Lai Mirko
Sanguinetti Manuela
Publication venue: Torino
Publication date: 01/01/2019

We report on the collection of Italian-language social media messages, from Twitter in particular, that has been ongoing at the University of Turin since 2012. A number of smaller datasets have been extracted from the main collection and enriched with different kinds of annotations for linguistic purposes. Moreover, a few extra datasets have been collected independently and are now being merged with the main collection. We aim to make the resource available to the community as far as possible, in accordance with the Terms of Service of the platforms from which the data were gathered.

Archivio istituzionale della ricerca - Università di Cagliari

Exploiting Catenae in a Parallel Treebank Alignment

Author: Bosco Cristina
Cupi Loredana
Sanguinetti Manuela
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2014

This paper introduces the issues related to the syntactic alignment of a dependency-based multilingual parallel treebank, ParTUT. Our approach to the task starts from a lexical mapping and then attempts to expand it using dependency relations. In developing the system, however, we realized that dependency relations between individual nodes alone were not sufficient to overcome some translation divergences, or shifts, especially where a direct lexical mapping is absent and the syntactic realization differs. For this purpose, we explored the use of a novel syntactic notion introduced in the dependency-theoretic framework, that of the catena (Latin for "chain"), understood as a group of words that are continuous with respect to dominance. In aligning parallel dependency structures, catenae can be used to explain and identify the one-to-many or many-to-many correspondences, typical of several translation shifts, that cannot be detected by means of direct word-based mappings or bare syntactic relations. The paper describes the overall structure of the alignment system as currently designed, how catenae are extracted from the parallel resource, and their potential relevance to the completion of tree alignment in ParTUT sentences.
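
As an illustration of the catena notion described in this abstract, the following minimal sketch enumerates the catenae of a toy dependency tree. It is not the authors' extraction procedure: the function names (`is_catena`, `catenae`), the head-map representation and the example analysis of "She gave up smoking" are all assumptions made here for illustration. It relies on the fact that, in a tree, a set of nodes is connected with respect to dominance exactly when one and only one of its members has its head outside the set.

```python
from itertools import combinations

# Toy dependency tree (hypothetical analysis, not taken from ParTUT).
# Token ids map to their head's id; 0 marks the artificial root.
WORDS = {1: "She", 2: "gave", 3: "up", 4: "smoking"}
HEAD = {1: 2, 2: 0, 3: 2, 4: 2}


def is_catena(nodes, head):
    """Return True if `nodes` is continuous with respect to dominance.

    Each connected piece of the selection has exactly one topmost node,
    i.e. one node whose head lies outside the selection. The selection is
    therefore a catena iff it contains exactly one such node.
    """
    selected = set(nodes)
    if not selected:
        return False
    tops = [n for n in selected if head[n] not in selected]
    return len(tops) == 1


def catenae(head, max_size):
    """List every catena of up to `max_size` words as a sorted tuple of ids."""
    ids = sorted(head)
    return [
        combo
        for size in range(1, max_size + 1)
        for combo in combinations(ids, size)
        if is_catena(combo, head)
    ]


if __name__ == "__main__":
    for combo in catenae(HEAD, max_size=2):
        print(" ".join(WORDS[i] for i in combo))
```

In this toy tree, "gave up" is a catena because "up" depends directly on "gave", whereas "She up" is not, since both words attach to a head outside the pair. Multi-word units of this kind are candidates for the one-to-many correspondences the abstract mentions. The brute-force enumeration is exponential in sentence length and is only meant to make the definition concrete.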

Archivio istituzionale della ricerca - Università di Cagliari

Institutional Research Information System University of Turin

Long-term Social Media Data Collection at the University of Turin

Author: Basile Valerio
Lai Mirko
Sanguinetti Manuela
Publication venue: CEUR-WS
Publication date: 01/01/2018

Institutional Research Information System University of Turin

Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

Author: Bosco Cristina
Cignarella Alessandra Teresa
Sanguinetti Manuela
Publication date: 01/01/2022

Institutional Research Information System University of Turin

Marking Irony Activators in a Universal Dependencies Treebank: The Case of an Italian Twitter Corpus

Author: Bosco Cristina
Cignarella Alessandra Teresa
Rosso Paolo
Sanguinetti Manuela
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2020

Institutional Research Information System University of Turin

Is this an effective way to annotate irony activators?

Author: Bosco Cristina
Cignarella Alessandra Teresa
Rosso Paolo
Sanguinetti Manuela
Publication venue: Aachen
Publication date: 01/01/2019

In this article we describe the first steps in annotating specific irony activators in TWITTIRÒ-UD, a treebank of Italian tweets annotated with fine-grained labels for irony on the one hand, and according to the Universal Dependencies scheme on the other. We discuss in particular the annotation scheme adopted to identify irony activators and some of the issues that emerged during the first annotation phase. This helped us design the guidelines and allowed us to outline future research directions.

Archivio istituzionale della ricerca - Università di Cagliari

The Evalita 2014 Dependency Parsing task

Author: Bosco Cristina
Dell’Orletta Felice
Montemagni Simonetta
Sanguinetti Manuela
Simi Maria
Publication date: 01/01/2014

SUMMARY. The Parsing Task is among the “historical” tasks of Evalita, and in all editions its main objective has been to define and improve state-of-the-art technologies for parsing Italian. The 2014 edition of the shared task features several novelties that mainly concern the data set and the subtasks. The paper therefore focuses on these two closely interrelated aspects and presents an overview of the participants' systems and results. RIASSUNTO. The “Parsing Task”, one of the historical tasks of Evalita, has in all editions had the main aim of defining and extending the state of the art in automatic syntactic analysis of Italian. In the 2014 edition of the evaluation campaign it features some significant novelties, relating in particular to the data used for training and to its internal organization. The article therefore focuses on these two closely interrelated aspects and presents an overview of the participating systems and of the results achieved.

Archivio della Ricerca - Università di Pisa

UnipiEprints

Institutional Research Information System University of Turin