88 research outputs found
Development of the multilingual semantic annotation system
This paper reports on our research to generate multilingual semantic lexical resources and develop multilingual semantic annotation software, which assigns each word in running text to a semantic category based on a lexical semantic classification scheme. Such tools have an important role in developing intelligent multilingual NLP, text mining and ICT systems. In this work, we aim to extend an existing English semantic annotation tool to cover a range of languages, namely Italian, Chinese and Brazilian Portuguese, by bootstrapping new semantic lexical resources via automatically translating existing English semantic lexicons into these languages. We used a set of bilingual dictionaries and word lists for this purpose. In our experiment, with minor manual improvement of the automatically generated semantic lexicons, the prototype tools based on the new lexicons achieved an average lexical coverage of 79.86% and an average annotation precision of 71.42% (if only precise annotations are considered) or 84.64% (if partially correct annotations are included) on the three languages. Our experiment demonstrates that it is feasible to rapidly develop prototype semantic annotation tools for new languages by automatically bootstrapping new semantic lexicons based on existing ones
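The lexicon-bootstrapping idea described above can be illustrated with a minimal sketch: an English semantic lexicon is projected into a target language through a bilingual dictionary, so that each translation inherits the semantic tags of its English source word. All names and data below are invented for illustration (the tags mimic USAS-style category codes); they are not the authors' actual resources.

```python
# Hypothetical sketch: bootstrap a target-language semantic lexicon by
# translating an English semantic lexicon with a bilingual dictionary.

def bootstrap_lexicon(english_lexicon, bilingual_dict):
    """Map each translation of an English word to that word's semantic tags."""
    target_lexicon = {}
    for en_word, tags in english_lexicon.items():
        for target_word in bilingual_dict.get(en_word, []):
            # A target word may translate several English words,
            # so tag sets are merged rather than overwritten.
            target_lexicon.setdefault(target_word, set()).update(tags)
    return target_lexicon

# Toy example with invented entries and USAS-style codes.
english_lexicon = {"dog": {"L2"}, "bank": {"I1", "W3"}}
bilingual_dict = {"dog": ["cane"], "bank": ["banca", "riva"]}

print(bootstrap_lexicon(english_lexicon, bilingual_dict))
```

In practice such automatically generated lexicons are noisy (translation ambiguity transfers every sense of the English word), which is why the paper reports a manual improvement step before evaluation.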
Exploiting Multiword Expressions to Solve “La Ghigliottina”
The paper describes UNIOR4NLP, a system developed to solve the “La Ghigliottina” game, which took part in the NLP4FUN task of the Evalita 2018 evaluation campaign. The system is the best performing one in the competition and achieves better results than human players
Lexical emergentism and the "frequency-by-regularity" interaction
In spite of considerable converging evidence on the role of inflectional paradigms in word acquisition and processing, little effort has been put so far into providing detailed, algorithmic models of the interaction between lexical token frequency, paradigm frequency, and paradigm regularity. We propose a neurocomputational account of this interaction, and discuss some theoretical implications of preliminary experimental results
Antonymy and Canonicity: Experimental and Distributional Evidence
The present paper investigates the phenomenon of antonym canonicity by providing new behavioural and distributional evidence on Italian adjectives. Previous studies have shown that some pairs of antonyms are perceived to be better examples of opposition than others, and are thus considered representative of the whole category (e.g., Deese, 1964; Murphy, 2003; Paradis et al., 2009). Our goal is to further investigate why such canonical pairs (Murphy, 2003) exist and how they come to be associated. In the literature, two different approaches have dealt with this issue. The lexical-categorical approach (Charles and Miller, 1989; Justeson and Katz, 1991) finds the cause of canonicity in the high co-occurrence frequency of the two adjectives. The cognitive-prototype approach (Paradis et al., 2009; Jones et al., 2012) instead claims that two adjectives form a canonical pair because they are aligned along a simple and salient dimension. Our empirical evidence, while supporting the latter view, shows that the paradigmatic distributional properties of adjectives can also contribute to explaining the phenomenon of canonicity, providing a corpus-based correlate of the cognitive notion of salience
Word Embeddings in Sentiment Analysis
In recent years, sentiment analysis and its applications have reached growing popularity. In this field of research, machine learning and word representation learning derived from distributional semantics (i.e. word embeddings) have proven to be very successful in performing sentiment analysis tasks. In this paper we describe a set of experiments aimed at evaluating the impact of word embedding-based features in sentiment analysis tasks
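One common way to turn word embeddings into features for a sentiment classifier is to represent a text as the average of its word vectors. The sketch below illustrates only this general idea; the tiny 3-dimensional embeddings are invented for illustration, whereas the paper's experiments rely on real pre-trained embeddings and a fuller feature set.

```python
# Minimal sketch: a sentence vector as the average of its word embeddings.
# Toy 3-dimensional vectors; real systems use pre-trained embeddings.

def sentence_vector(tokens, embeddings, dim=3):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        # No known word: fall back to a zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

toy_embeddings = {
    "good": [0.9, 0.1, 0.0],
    "movie": [0.2, 0.5, 0.3],
}
print(sentence_vector(["good", "movie"], toy_embeddings))  # ≈ [0.55, 0.3, 0.15]
```

The resulting fixed-length vector can then be fed to any standard classifier alongside other lexical features.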
PARSEME-It: an Italian corpus annotated with verbal multiword expressions
The paper describes the PARSEME-It corpus, developed within the PARSEME-It project, which aims at the development of methods, tools and resources for multiword expression (MWE) processing for the Italian language. The project is a spin-off of a larger multilingual project covering more than 20 languages from several language families, namely the PARSEME COST Action. The first phase of the project was devoted to verbal multiword expressions (VMWEs). They are a particularly interesting lexical phenomenon because of frequent discontinuity and long-distance dependency. Besides, they are very challenging for deep parsing and other Natural Language Processing (NLP) tasks. Notably, MWEs are pervasive in natural languages but are particularly difficult for NLP tools to handle because of their characteristics and idiomaticity. They pose many challenges to correct identification and processing: they are a linguistic phenomenon on the edge between lexicon and grammar, their meaning is not simply the sum of the meanings of their single constituents, and they are ambiguous since in several cases their reading can be literal or idiomatic. Although several studies have been devoted to this topic, to the best of our knowledge, our study is the first attempt to provide a general framework for the identification of VMWEs in running texts and a comprehensive corpus for the Italian language
The CoLing Lab system for Sentiment Polarity Classification of tweets
This paper describes the CoLing Lab system for the EVALITA 2014 SENTIment POLarity Classification (SENTIPOLC) task. Our system is based on a SVM classifier trained on the rich set of lexical, global and twitter-specific features described in these pages. Overall, our system reached a 0.63 weighted F-score on the test set provided by the task organizers
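The kind of lexical and Twitter-specific features mentioned above can be sketched as simple counts extracted from the raw tweet. The feature names below are hypothetical and chosen for illustration; the actual CoLing Lab feature set is the one described in the paper.

```python
# Illustrative sketch of lexical and Twitter-specific surface features
# for tweet polarity classification (feature names are hypothetical).
import re

def extract_features(tweet):
    tokens = tweet.split()
    return {
        "n_tokens": len(tokens),                              # tweet length
        "n_hashtags": sum(t.startswith("#") for t in tokens), # topic markers
        "n_mentions": sum(t.startswith("@") for t in tokens), # user mentions
        "n_exclamations": tweet.count("!"),                   # emphasis cue
        "has_url": int(bool(re.search(r"https?://", tweet))), # link presence
    }

print(extract_features("Che bello! #happy @friend http://t.co/x"))
```

Feature dictionaries of this shape can be vectorized and passed to an SVM classifier like the one the system is based on.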
“I hate this, I want bones”: an initial survey of the linguistic characteristics of Italian-language pro-ana social pages
This paper presents the first linguistic profile of Anorexia Nervosa (AN) for the Italian language, based on the analysis of pro-ana web pages (i.e., accounts promoting potentially life-threatening eating behaviors, such as starvation, self-induced vomiting and laxative abuse, as life choices). The analysis focuses on the lexical features of usernames and bios, the usage of concretized metaphors and the selection of both personal deictics and tense morphemes in the texts. The proposed findings aim to shed light on the feasibility of turning linguistic insights into a large-scale computational screening tool
“Il Mago della Ghigliottina” @ Ghigliottin-AI: When Linguistics meets Artificial Intelligence
This paper describes Il mago della Ghigliottina, a bot which took part in the Ghigliottin-AI task of the Evalita 2020 evaluation campaign. The aim of the task is to build a system able to solve the TV game “La Ghigliottina”. Our system had already participated in the Evalita 2018 task NLP4FUN; compared to that occasion, it improved its accuracy from 61% to 68.6%
Building Web Corpora for Minority Languages
Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the “Finno-Ugric Languages and the Internet” (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project
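Language identification during crawling, as used in the pipeline above, can be sketched with a toy character-trigram model: each candidate page is scored against per-language trigram profiles and kept only if the best-scoring language is the target. The profiles below are built from a few invented sample phrases; the Suki project used a dedicated language identification system, not this toy.

```python
# Toy sketch of language identification via character-trigram overlap.
# Profiles are built from tiny invented samples, purely for illustration.
from collections import Counter

def trigram_profile(text):
    """Count all character trigrams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def identify(text, profiles):
    """Return the language whose profile shares the most trigrams with text."""
    grams = trigram_profile(text)
    def overlap(profile):
        return sum(min(n, profile.get(g, 0)) for g, n in grams.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

profiles = {
    "fin": trigram_profile("hyvää päivää kiitos paljon"),
    "eng": trigram_profile("good day thank you very much"),
}
print(identify("kiitos hyvää", profiles))  # → "fin"
```

A crawler can call such a classifier on each downloaded page and discard pages whose predicted language is not the minority language being collected.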