
    LiMoSiNe pipeline: Multilingual UIMA-based NLP platform

    We present a robust and efficient parallelizable multilingual UIMA-based platform for automatically annotating textual inputs with different layers of linguistic description, ranging from surface-level phenomena all the way down to deep discourse-level information. In particular, given an input text, the pipeline extracts: sentences and tokens; entity mentions; syntactic information; opinionated expressions; relations between entity mentions; co-reference chains; and wikified entities. The system is available in two versions: a standalone distribution enables the design and optimization of user-specific sub-modules, whereas a server-client distribution allows for straightforward high-performance NLP processing, reducing the engineering cost for higher-level tasks.

    VALICO-UD: annotating an Italian learner corpus

    Previous work on learner language has highlighted the importance of annotated resources for describing the development of interlanguage. Despite this, few learner resources, mainly for English L2, feature error and syntactic annotation. This thesis describes the development of a novel parallel learner Italian treebank, VALICO-UD. Its name reflects two main points: where the data comes from (the VALICO corpus, a collection of non-native Italian texts elicited by comic strips) and what formalism is used for linguistic annotation (Universal Dependencies, UD). It is a parallel treebank because, for each learner sentence (LS), the resource provides a target hypothesis (TH), i.e., a parallel corrected version written by a native speaker of Italian, which is in turn annotated in UD. We developed this treebank to be exploitable for interlanguage research and comparable with the resources employed in Natural Language Processing tasks such as Native Language Identification or Grammatical Error Identification and Correction. VALICO-UD is composed of 237 texts written by English, French, German and Spanish native speakers, which correspond to 2,234 LSs, each associated with a single TH. While all LSs and THs were automatically annotated using UDPipe, only a portion of the treebank, consisting of 398 LSs plus the corresponding THs, has been manually corrected; it was released in May 2021 in the UD repository. This core section also features an explicit XML-based annotation of the errors occurring in each sentence. Thus, the treebank is currently organized in two sections: the core gold standard, comprising 398 LSs and their corresponding THs, and the silver standard, consisting of 1,836 LSs and their corresponding THs.
    To contribute to the computational investigation of the peculiar type of texts included in VALICO-UD, this thesis describes the annotation schema of the resource, provides some preliminary tests of the performance of UDPipe models on this treebank, reports inter-annotator agreement results for both error and linguistic annotation, and suggests some possible applications.

    Primary and secondary discourse connectives: definitions and lexicons

    Starting from the perspective that discourse structure arises from the presence of coherence relations, we provide a map of linguistic discourse structuring devices (DRDs), and focus on those for written text. We propose to structure these items by differentiating between primary and secondary connectives on the one hand, and free connecting phrases on the other. For the former, we propose that their behavior can be described by lexicons, and we show one concrete proposal that has so far been applied to three languages, with others being added in ongoing work. The lexical representations can be useful both for humans (theoretical investigations, transfer to other languages) and for machines (automatic discourse parsing and generation).

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall, "Cavallerizza Reale". The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    ENHANCING SENTIMENT LEXICA WITH NEGATION AND MODALITY FOR SENTIMENT ANALYSIS OF TWEETS

    Sentiment analysis has become one of the core tasks in the field of Natural Language Processing, especially with the rise of social media. Public opinion is important for many domains such as commerce, politics, sociology, psychology, or finance. As an important player in social media, Twitter is the most frequently used microblogging platform for public opinion on any topic. In recent years, sentiment analysis in Twitter has turned into a recognized shared task challenge. In this thesis, we propose to enhance sentiment lexica with the linguistic notions of negation and modality for this challenge. We test the interoperability of various sentiment lexica with each other and with negation and modality, and add some Twitter-specific ad-hoc features. The performance of different combinations of these features is analyzed in comprehensive ablation experiments. We participated in two challenges of the International Workshop on Semantic Evaluations (SemEval 2015). Our system performed robustly and reliably in the sentiment classification of tweets task, where it ranked 9th among 40 participants. However, it proved to be the state of the art for measuring the degree of sentiment of tweets with figurative language, where it ranked 1st among 35 systems.

    Korreferentzia-ebazpena euskarazko testuetan [Coreference resolution in Basque texts]

    203 p. Nowadays, automatic coreference resolution can be considered key to understanding texts; consequently, it is essential in many Natural Language Processing (NLP) tasks that require a deep understanding of discourse. When two textual expressions in a text denote or refer to the same object, a coreference relation is said to hold between them. The task that aims to resolve the coreference relations among the textual expressions that can appear in a text is called coreference resolution. This thesis belongs to the field of computational linguistics and addresses automatic coreference resolution for texts written in Basque; more precisely, it aims to fill the gap in resources and tools for performing automatic coreference resolution in Basque. The thesis first describes the rule-based tool we developed to automatically identify the textual expressions that can appear in Basque texts. It then presents how the rule-based coreference resolution system designed for English at Stanford University was adapted to the characteristics of Basque, and how we improved it using semantic knowledge bases. Finally, it describes the work done to adapt the machine-learning-based BART coreference resolution system to Basque and to improve it.