
    Development of the multilingual semantic annotation system

    This paper reports on our research to generate multilingual semantic lexical resources and develop multilingual semantic annotation software, which assigns each word in running text to a semantic category based on a lexical semantic classification scheme. Such tools have an important role in developing intelligent multilingual NLP, text mining and ICT systems. In this work, we aim to extend an existing English semantic annotation tool to cover a range of languages, namely Italian, Chinese and Brazilian Portuguese, by bootstrapping new semantic lexical resources via automatic translation of existing English semantic lexicons into these languages. We used a set of bilingual dictionaries and word lists for this purpose. In our experiment, with minor manual improvement of the automatically generated semantic lexicons, the prototype tools based on the new lexicons achieved an average lexical coverage of 79.86% and an average annotation precision of 71.42% (if only precise annotations are considered) or 84.64% (if partially correct annotations are included) across the three languages. Our experiment demonstrates that it is feasible to rapidly develop prototype semantic annotation tools for new languages by automatically bootstrapping new semantic lexicons from existing ones.
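The bootstrapping step described above can be sketched as a simple dictionary projection. This is a minimal illustration, not the actual tool: the words, semantic categories, and dictionary entries below are invented for the example.

```python
# Minimal sketch of bootstrapping a semantic lexicon by translation.
# All entries here are invented for illustration.

# Existing English semantic lexicon: word -> semantic category.
english_lexicon = {"dog": "animal", "house": "building", "run": "movement"}

# Bilingual dictionary used for the automatic translation step.
english_to_italian = {"dog": "cane", "house": "casa", "run": "correre"}

def bootstrap_lexicon(src_lexicon, bilingual_dict):
    """Project each word's semantic category onto its translation."""
    return {bilingual_dict[word]: category
            for word, category in src_lexicon.items()
            if word in bilingual_dict}

def annotate(tokens, lexicon, unknown="unmatched"):
    """Assign every token in running text a semantic category."""
    return [(token, lexicon.get(token, unknown)) for token in tokens]

italian_lexicon = bootstrap_lexicon(english_lexicon, english_to_italian)
```

Under this scheme, lexical coverage would be the fraction of tokens not tagged `unmatched`, which is where the manual improvement of the generated lexicons comes in.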

    Exploiting Multiword Expressions to Solve “La Ghigliottina”

    The paper describes UNIOR4NLP, a system developed to solve the “La Ghigliottina” game, which took part in the NLP4FUN task of the Evalita 2018 evaluation campaign. The system is the best performing one in the competition and achieves better results than human players.
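In the game, a single solution word must connect with each of five clue words. As an illustration of the general idea only (not the UNIOR4NLP implementation), a solver can rank candidate words by how strongly they collocate with the clues; the collocation counts below are invented.

```python
# Illustrative sketch: rank candidate solutions by summed collocation
# strength with the clue words. Counts are invented for the example.

collocation_counts = {
    ("colpo", "sole"): 55,    # e.g. "colpo di sole"
    ("colpo", "scena"): 80,   # e.g. "colpo di scena"
    ("colpo", "grazia"): 60,  # e.g. "colpo di grazia"
    ("luce", "sole"): 120,
}

def score(candidate, clues, counts):
    """Sum collocation counts linking the candidate to each clue word."""
    return sum(counts.get((candidate, clue), 0) + counts.get((clue, candidate), 0)
               for clue in clues)

def solve(candidates, clues, counts):
    """Return the candidate best connected to all the clues."""
    return max(candidates, key=lambda c: score(c, clues, counts))
```

The intuition matches the paper's premise: multiword expressions and strong collocations are what tie a solution word to otherwise unrelated clues.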

    Lexical emergentism and the "frequency-by-regularity" interaction

    In spite of considerable converging evidence of the role of inflectional paradigms in word acquisition and processing, little effort has been put so far into providing detailed, algorithmic models of the interaction between lexical token frequency, paradigm frequency, and paradigm regularity. We propose a neurocomputational account of this interaction, and discuss some theoretical implications of preliminary experimental results.

    Antonymy and Canonicity: Experimental and Distributional Evidence

    The present paper investigates the phenomenon of antonym canonicity by providing new behavioural and distributional evidence on Italian adjectives. Previous studies have shown that some pairs of antonyms are perceived to be better examples of opposition than others, and are thus considered representative of the whole category (e.g., Deese, 1964; Murphy, 2003; Paradis et al., 2009). Our goal is to further investigate why such canonical pairs (Murphy, 2003) exist and how they come to be associated. In the literature, two different approaches have dealt with this issue. The lexical-categorical approach (Charles and Miller, 1989; Justeson and Katz, 1991) finds the cause of canonicity in the high co-occurrence frequency of the two adjectives. The cognitive-prototype approach (Paradis et al., 2009; Jones et al., 2012) instead claims that two adjectives form a canonical pair because they are aligned along a simple and salient dimension. Our empirical evidence, while supporting the latter view, shows that the paradigmatic distributional properties of adjectives can also contribute to explaining the phenomenon of canonicity, providing a corpus-based correlate of the cognitive notion of salience.

    Word Embeddings in Sentiment Analysis

    In recent years, sentiment analysis and its applications have reached growing popularity. In this field of research, machine learning and word representation learning derived from distributional semantics (i.e. word embeddings) have proven very successful in performing sentiment analysis tasks. In this paper we describe a set of experiments aimed at evaluating the impact of word embedding-based features in sentiment analysis tasks.
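A common way to turn word embeddings into features for such experiments is to average the word vectors of a sentence. This is a toy sketch of that idea: the 3-dimensional vectors below are invented, whereas real systems use pretrained embeddings of a few hundred dimensions.

```python
# Toy sketch of embedding-based features for sentiment classification:
# a sentence vector is the average of its known word vectors.
# The embeddings here are invented for illustration.

toy_embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "bad":   [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.5, 0.5],
}

def sentence_vector(tokens, embeddings, dim=3):
    """Average the embeddings of the known tokens into one feature vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]
```

The resulting fixed-length vectors can then be fed to any standard classifier alongside conventional lexical features.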

    PARSEME-It: an Italian corpus annotated with verbal multiword expressions

    The paper describes the PARSEME-It corpus, developed within the PARSEME-It project, which aims at the development of methods, tools and resources for multiword expression (MWE) processing for the Italian language. The project is a spin-off of a larger multilingual project covering more than 20 languages from several language families, namely the PARSEME COST Action. The first phase of the project was devoted to verbal multiword expressions (VMWEs). They are a particularly interesting lexical phenomenon because of frequent discontinuity and long-distance dependency. Besides, they are very challenging for deep parsing and other Natural Language Processing (NLP) tasks. Notably, MWEs are pervasive in natural languages but particularly difficult for NLP tools to handle because of their characteristics and idiomaticity. They pose many challenges to correct identification and processing: they are a linguistic phenomenon on the edge between lexicon and grammar, their meaning is not simply the sum of the meanings of their single constituents, and they are ambiguous, since in several cases their reading can be literal or idiomatic. Although several studies have been devoted to this topic, to the best of our knowledge, our study is the first attempt to provide a general framework for the identification of VMWEs in running texts and a comprehensive corpus for the Italian language.

    The CoLing Lab system for Sentiment Polarity Classification of tweets

    This paper describes the CoLing Lab system for the EVALITA 2014 SENTIment POLarity Classification (SENTIPOLC) task. Our system is based on an SVM classifier trained on a rich set of lexical, global and Twitter-specific features described in these pages. Overall, our system reached a 0.63 weighted F-score on the test set provided by the task organizers.
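Twitter-specific features of the kind the abstract mentions are typically simple surface counts. The sketch below is hedged: it does not reproduce the actual CoLing Lab feature set, and the feature names are invented for illustration.

```python
# Hedged sketch of Twitter-specific surface features; feature names
# and selection are illustrative, not the CoLing Lab feature set.

def tweet_features(text):
    """Extract a few simple lexical and Twitter-specific features."""
    tokens = text.split()
    return {
        "n_tokens": len(tokens),
        "n_hashtags": sum(t.startswith("#") for t in tokens),
        "n_mentions": sum(t.startswith("@") for t in tokens),
        "n_exclamations": text.count("!"),
        "has_positive_emoticon": any(e in text for e in (":)", ":-)", ":D")),
    }
```

Feature dictionaries like these would then be vectorized and passed to the SVM alongside the lexical and global features.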

    “ODIO TUTTO CIÒ, VOGLIO LE OSSA”: UNA PRIMA INDAGINE SULLE CARATTERISTICHE LINGUISTICHE DELLE PAGINE SOCIAL PRO-ANA IN LINGUA ITALIANA

    “I hate this, I want bones”: an initial survey of the linguistic characteristics of Italian-language pro-ana social pages. This paper presents the first linguistic profile of Anorexia Nervosa (AN) for the Italian language, starting from the analysis of pro-ana web pages (i.e., accounts promoting potentially life-threatening eating behaviors, such as starvation, self-induced vomiting and laxative abuse, as life choices). The analysis focuses on the lexical features of usernames and bios, the usage of concretized metaphors, and the selection of both personal deictics and tense morphemes in the texts. The proposed findings aim to shed light on the feasibility of turning linguistic insights into a large-scale computational screening tool.

    “Il Mago della Ghigliottina” @ Ghigliottin-AI: When Linguistics meets Artificial Intelligence

    This paper describes Il mago della Ghigliottina, a bot which took part in the Ghigliottin-AI task of the Evalita 2020 evaluation campaign. The aim of the task is to build a system able to solve the TV game “La Ghigliottina”. Our system had already participated in the Evalita 2018 NLP4FUN task; compared to that occasion, its accuracy improved from 61% to 68.6%.

    Building Web Corpora for Minority Languages

    Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how, and at which stage, to make sure the texts gathered are in the desired language. In the “Finno-Ugric Languages and the Internet” (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system, in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project. Peer reviewed.
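The crawl-plus-identify strategy described above can be sketched schematically. This is a minimal illustration under stated assumptions: the fetcher, link extractor, and language identifier are stand-in callables, not the actual tools used in the Suki project.

```python
# Schematic sketch of language-filtered crawling: keep only pages the
# language identifier assigns to the target language. The fetch,
# extract_links and identify_language callables are stand-ins.

from collections import deque

def crawl(seed_urls, fetch, extract_links, identify_language,
          target_lang, limit=100):
    """Breadth-first crawl that keeps only pages in the target language."""
    queue, seen, kept = deque(seed_urls), set(seed_urls), []
    while queue and len(kept) < limit:
        url = queue.popleft()
        text = fetch(url)
        if identify_language(text) == target_lang:
            kept.append((url, text))
        # Follow links even from off-language pages: minority-language
        # pages often link to (and from) dominant-language ones.
        for link in extract_links(text):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept
```

Note the design choice in the comment: discarding off-language pages entirely would also discard their outgoing links, which is exactly where minority-language content often hides.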