    Computational Models of Dialectal Variation and Underlying Linguistic Features

    The report illustrates the results of the joint research activity carried out from June 13th to July 4th 2010 at the University of Groningen, Faculty of Arts, Center for Language and Cognition Groningen (CLCG), directed by Prof. John Nerbonne. In particular, it illustrates the application and specialization of the technique of "hierarchical bipartite spectral graph partitioning" (Wieling and Nerbonne, 2010) with respect to the dialectal corpus of the Atlante Lessicale Toscano ('Lexical Atlas of Tuscany', henceforth ALT) and discusses the results achieved. The analysis focuses on phonetic variation: this is the level for which an aggregate analysis of the ALT dialectal corpus has yielded results that diverge from the analyses by Giannelli (1976, 2000) and Pellegrini (1977), as documented in Montemagni (2007, 2008). Phonetic variation in Tuscany thus provides a particularly challenging case study for testing the potential of this new technique in the study of models of linguistic variation.
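
    As a rough illustration of the technique named above, the following minimal sketch performs a single spectral bipartition of a bipartite graph, assuming the input is a binary matrix whose rows are dialect varieties and whose columns are linguistic features. The function name `bipartition`, the toy matrix, and the choice of splitting on the sign of the second singular vectors (instead of running k-means on them) are assumptions made for this sketch, not details taken from the report.

```python
import numpy as np

def bipartition(A):
    """Split rows and columns of a bipartite adjacency matrix into two co-clusters.

    A[i, j] > 0 means variety i is linked to feature j. Every row and column
    is assumed to have at least one link, so all degrees are non-zero.
    """
    A = np.asarray(A, dtype=float)
    row_deg = A.sum(axis=1)
    col_deg = A.sum(axis=0)
    # Degree-normalized matrix D1^(-1/2) A D2^(-1/2).
    An = A / np.sqrt(row_deg)[:, None] / np.sqrt(col_deg)[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    # The second left/right singular vectors carry the two-way split.
    z_rows = U[:, 1] / np.sqrt(row_deg)
    z_cols = Vt[1, :] / np.sqrt(col_deg)
    # Simplification: split on sign. The overall sign is arbitrary, so only
    # the grouping is meaningful, not which group is labelled 1.
    return (z_rows > 0).astype(int), (z_cols > 0).astype(int)

# Hypothetical toy data: 4 varieties x 4 features, two blocks joined by one link.
toy = [[1, 1, 0, 0],
       [1, 1, 1, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
rows, cols = bipartition(toy)
```

    On the toy matrix, varieties 0 and 1 end up in one group together with features 0 and 1, and the remaining varieties and features form the other group. A hierarchical variant could be obtained by applying the same function again to the sub-matrix of each group.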

    Analisi linguistico-computazionali del corpus dialettale dell'Atlante Lessicale Toscano. Primi risultati sul rapporto toscano-italiano

    "A discussion of the highly nuanced territory that lies between dialect and language in Tuscany, and of the Italian of the region, may perhaps appear marginal in a work whose primary aim is to clarify the areal organization of the Tuscan lexical heritage": so Teresa Poggi Salani opens her contribution entitled Dialetto e lingua a confronto, speaking of the Atlante Lessicale Toscano project, which at that time was still in its early stages. She goes on to note that "in Italy, in no other land as in this one does one discover, on the whole, so frequently that what is dialect (elicited here precisely as such) is also Italian". Once completed, the work was expected to include "a great deal of" Tuscan Italian as well.

    Bootstrapping enhanced universal dependencies for Italian

    The paper presents an extension of the Italian Universal Dependencies Treebank with an "enhanced" representation level (e-IUDT), aimed at simplifying the information extraction process. The modules developed to build e-IUDT semi-automatically were delexicalized so that they could perform cross-language enhancements: preliminary experiments in this direction led to promising results.

    The Italian dependency annotated corpus developed for the CoNLL-2007 Shared Task

    This document illustrates the Italian dependency annotated corpus developed for the CoNLL-2007 Shared Task (henceforth referred to as ISST-CoNLL). In particular, it provides information on the background resource, describes how the CoNLL Italian resource was designed and developed, and documents the adopted annotation scheme.

    Tecnologie linguistico-computazionali per la valutazione delle competenze linguistiche in ambito scolastico

    The presentation will illustrate whether and how linguistic technologies can be used to monitor the language learning process. While computational-linguistic technologies now play an undisputed role in accessing textual content, whether this takes the form of domain-specific knowledge or of the underlying linguistic knowledge (e.g. collocations, argument structures, lexical-semantic relations between words, etc.), the same cannot be taken for granted when it comes to their role in assessing learners' linguistic competence. The presentation investigates this question, and in particular whether and to what extent computational-linguistic technologies can be a valid aid in assessing the Italian language competence of students in a school setting.

    The BOOTStrep BioLexicon: a Lexical Resource for Biomedical Text Mining

    The BOOTStrep BioLexicon is a large-scale lexical resource developed to address Text Mining requirements in the biomedical domain. It is a collective achievement by different teams, namely EMBL-EBI, CNR-ILC, and the University of Manchester. The talk will illustrate the different aspects of the whole BioLexicon building cycle: the design and implementation of the resource, which follows the ISO/DIS 24613 "Lexical Mark-up Framework" standard; its population, carried out both by leveraging existing bio-resources and by employing advanced natural language technologies to discover new terms, relations and linguistic information from the scientific literature; and its evaluation with respect to both domain-specific and general-purpose lexical resources.

    Design, Construction and Use of an Italian Dependency Treebank: Methodological Issues and Empirical Results

    Treebanks allow for multiple uses: by linguists, who may search for examples (or counter-examples) for a given theory or hypothesis; by psycholinguists, interested in computing construction frequencies and comparing them with human preferences; by computational linguists, for tasks such as lexicon and grammar induction or parser evaluation. Treebanks can also be explored to determine the typology of factors playing a role in specific natural language learning and processing tasks as well as their relative salience. In all cases, the results of Treebank mining are heavily influenced by the design principles underlying the adopted annotation scheme. For Treebanks to be successfully exploited for these manifold uses, the design of the annotation scheme should meet a list of basic requirements, ranging from usability both in real applications and for research purposes, through compatibility with different approaches to syntax (adopted in either theoretical or applied frameworks), to applicability on a wide scale, in a coherent and replicable way, and to different language varieties (e.g. both written and spoken language). The presentation will focus on methodological issues connected with the design, construction and use of an Italian dependency Treebank. The Treebank originates from the functional annotation layer of ISST, a multi-layered annotated corpus of Italian that is one of the main outcomes of an Italian national project (SI-TAL, 1999-2001), and has undergone several revisions aimed at making it compliant with the de facto CoNLL representation standard and at meeting the basic annotation requirements mentioned above.
    The Italian dependency Treebank, in its different versions, has been exploited for different purposes, ranging from the induction of subcategorization frames and the discovery and assessment of typologically relevant and linguistically motivated grammatical constraints to the training of dependency parsers and their evaluation in the framework of international parsing evaluation campaigns (CoNLL-2007 and Evalita-2009). The results achieved will be discussed in the light of different annotation choices, as a contribution to the debate on general issues such as whether and to what extent the richness of the annotation scheme, or the distinction between head-complement and head-modifier (or head-adjunct) relations found in many contemporary syntactic theories, represents a real advantage for the multiple uses to which Treebanks are put.

    Towards an NLP-based approach for measuring syntactic complexity: preliminary experiments with Italian texts from different registers

    In this paper, we explore how NLP can be used to automatically identify relevant syntactic complexity features in texts, with the aim of assessing their correlation with specific linguistic registers. Our final goal is twofold. On the one hand, we demonstrate that automatic morpho-syntactic and syntactic annotation of texts provides sufficiently accurate output for use in the automatic extraction and measurement of syntactic complexity features. On the other hand, we identify the set of syntactic features that correlate strongly with the linguistic registers considered.
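
    To give a concrete idea of what extracting syntactic complexity features from automatic annotation can look like, the sketch below computes two commonly used measures from the head indices of a dependency-parsed sentence: the depth of the parse tree and the mean length of dependency links. These two measures, the function names, and the example sentence are assumptions chosen for illustration; the abstract does not list the features actually used.

```python
def tree_depth(heads):
    """Largest number of arcs between any token and the root.

    heads[i - 1] is the 1-based index of the head of token i; 0 marks the root.
    The input is assumed to be a well-formed tree (no cycles).
    """
    def depth(token):
        steps = 0
        while heads[token - 1] != 0:
            token = heads[token - 1]
            steps += 1
        return steps
    return max(depth(t) for t in range(1, len(heads) + 1))

def mean_dependency_length(heads):
    """Average linear distance (in tokens) between a dependent and its head."""
    links = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(links) / len(links) if links else 0.0

# Hypothetical parse of "Il gatto dorme": Il -> gatto -> dorme (root).
heads = [2, 3, 0]
depth = tree_depth(heads)
mean_length = mean_dependency_length(heads)
```

    For the three-token example, the tree depth is 2 and the mean dependency length is 1.0. Averaging such values over all sentences of a text would give one feature vector per text, which could then be compared across registers.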

    Harmonization and Merging of two Italian Dependency Treebanks

    The paper describes the methodology which is currently being defined for the construction of a "Merged Italian Dependency Treebank" (MIDT) starting from already existing resources. In particular, it reports the results of a case study carried out on two available dependency treebanks, i.e. TUT and ISST-TANL. The issues raised during the comparison of the annotation schemes underlying the two treebanks are discussed and investigated, with particular emphasis on the definition of a set of linguistic categories to be used as a "bridge" between the specific schemes. As an encoding format, the CoNLL de facto standard is used.
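
    The idea of "bridge" categories can be sketched as a relabelling step over CoNLL-encoded tokens, assuming the 10-column tab-separated CoNLL-X layout in which the dependency relation is the eighth column. The label tables below are placeholders invented for this sketch and do not reproduce the mapping defined in the paper; `to_bridge` is likewise a hypothetical helper name.

```python
# Placeholder tables: the source labels and bridge labels are illustrative
# assumptions, not the actual TUT / ISST-TANL mapping.
BRIDGE = {
    "tut": {"VERB-SUBJ": "SUBJ", "VERB-OBJ": "OBJ"},
    "isst-tanl": {"subj": "SUBJ", "obj": "OBJ"},
}

DEPREL = 7  # zero-based position of the DEPREL column in a CoNLL-X token line

def to_bridge(conll_line, scheme):
    """Rewrite the dependency label of one CoNLL token line to its bridge label.

    Labels with no entry in the table are left unchanged.
    """
    cols = conll_line.rstrip("\n").split("\t")
    cols[DEPREL] = BRIDGE[scheme].get(cols[DEPREL], cols[DEPREL])
    return "\t".join(cols)

# Hypothetical token line in the ISST-TANL-style scheme.
line = "1\tMaria\tMaria\tS\tSP\t_\t2\tsubj\t_\t_"
converted = to_bridge(line, "isst-tanl")
```

    Applying the same function with the other scheme's table to the other treebank would bring both resources to a shared label set, which is the role the abstract assigns to the bridge categories. Real harmonization would also need to handle structural differences between the schemes, which a label table alone cannot express.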
