    Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions

    We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.
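    The notion of an "unseen" VMWE can be illustrated with a minimal sketch. It assumes, for illustration only, that each VMWE is represented by the list of lemmas of its components and that two VMWEs count as the same expression when their multisets of lemmas match, regardless of word order. The function names and the toy data are hypothetical and are not taken from the shared task tooling.

```python
from collections import Counter


def vmwe_key(lemmas):
    """Order-independent identity of a VMWE: the multiset of its lemmas.

    Assumption: identity is decided on lemmas only, ignoring word order.
    """
    return frozenset(Counter(lemmas).items())


def unseen_vmwes(train_vmwes, test_vmwes):
    """Return the test VMWEs whose lemma multiset never occurs in training."""
    seen = {vmwe_key(v) for v in train_vmwes}
    return [v for v in test_vmwes if vmwe_key(v) not in seen]


# Toy data (hypothetical): each VMWE is a list of component lemmas.
train = [["pull", "string"], ["take", "place"]]
test = [["string", "pull"], ["make", "decision"], ["take", "place"]]

unseen = unseen_vmwes(train, test)
print(unseen)
```

    Under this assumption, a corpus split could be checked by counting the entries returned by `unseen_vmwes` and comparing the count against the target of around 300 unseen VMWEs per test corpus.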

    PARSEME corpus release 1.3

    We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions (VMWEs). Since the previous version, new languages have joined the undertaking of creating such a resource, some of the existing corpora have been enriched with new annotated texts, and others have been enhanced in various ways. The PARSEME multilingual corpus now covers 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split following the PARSEME v.1.2 standard, which places emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

    Creation and Simulation of Methodologies for the Analysis, Classification and Integration of New Requirements into Proprietary Software

    Prioritizing the new requirements to be implemented in proprietary software is fundamental to its maintenance, to preserving quality, and to observing the company's business rules and standards. Although prioritization tools based on proven and recognized techniques exist, they require each requirement to be rated beforehand. When a company receives requests from several customers of the same product, the number of factors affecting the company grows; the available tools do not take these aspects into account, which makes the rating task much more complex. This research comprises a survey of the methods for prioritizing and selecting new requirements used by companies in the Rosario area, and the definition of a methodology for selecting a new requirement, which involves analyzing and evaluating all the implications for the software product and the company while respecting its business rules. The resulting methodology leads to the definition of the processes for building a tool for rating and prioritizing new requirements in proprietary software that receives requests from several customers at the same time. The tool will include rating instruments that consider all related aspects, provide current prioritization techniques, and issue reports customized to different perspectives within the company. Track: Software Engineering. Red de Universidades con Carreras en Informática (RedUNCI).

    Identification and translation of verb+noun Multiword Expressions: a Spanish-Basque analysis

    This is a summary of the PhD thesis written by Uxoa Iñurrieta under the supervision of Dr. Gorka Labaka and Dr. Itziar Aduriz. Full title of the PhD thesis in Basque: Izena+aditza Unitate Fraseologikoak gaztelaniatik euskarara: azterketa eta tratamendu konputazionala. The defense was held in Donostia-San Sebastián on November 29, 2019. The doctoral committee consisted of Ricardo Etxepare (Centre National de la Recherche Scientifique), Margarita Alonso (Universidad de Coruña) and Miren Azkarate (University of the Basque Country, UPV/EHU). Funding: the Spanish Ministry of Economy and Competitiveness awarded Uxoa Iñurrieta a predoctoral fellowship (BES-2013-066372) to conduct research within the SKATeR project (TIN2012-38584-C06-02).

    General and specialised corpora to raise linguistic awareness in a language undergoing the normalisation process: academic writing in Basque [Innovation and digital technologies in Languages for Specific Purposes]

    Academic writing is challenging for many university students in any language, but it is especially difficult for students whose language of instruction is still undergoing normalisation and has an unstable academic discourse, such as Basque. This paper explains how corpora can be exploited to raise these students' linguistic awareness. To that end, learning objectives are defined, corpus-based exercises are designed, and the difficulties that students overcome are observed. The paper focuses on students of scientific and technological degrees taking courses on Basque for Academic Purposes, where they are taught how to resolve lexical, grammatical, stylistic and register-related doubts. The final aim of the course is for these students to become aware of the functional development of Basque, so that they contribute to it in their professional careers.

    Literal Occurrences of Multiword Expressions: Rare Birds That Cause a Stir

    Multiword expressions (MWEs) can have both idiomatic and literal occurrences. For instance, "pulling strings" can be understood either as making use of one's influence, or literally. Distinguishing these two cases has been addressed in linguistic and psycholinguistic studies, and is also considered one of the major challenges in MWE processing. We suggest that literal occurrences should be considered in both semantic and syntactic terms, which motivates their study in a treebank. We propose heuristics to automatically pre-identify candidate sentences that might contain literal occurrences of verbal multiword expressions (VMWEs), and we apply them to existing treebanks in five typologically different languages: Basque, German, Greek, Polish and Portuguese. We also perform a linguistic study of the literal occurrences extracted by the different heuristics. The results suggest that literal occurrences constitute a rare phenomenon. We also identify some properties that may distinguish them from their idiomatic counterparts. This article is a largely extended version of Savary and Cordeiro (2018).

    Multilingual corpus of literal occurrences of multiword expressions

    The corpus contains sentences with idiomatic, literal and coincidental occurrences of verbal multiword expressions (VMWEs) in Basque, German, Greek, Polish and Portuguese. The source corpus is the PARSEME multilingual corpus of VMWEs v 1.1 (cf. http://hdl.handle.net/11372/LRT-2842). The sentences with VMWEs were extracted from the source corpus, and potential co-occurrences of the same lexemes were automatically extracted from the same corpus. These candidates were then manually annotated by native experts into 6 classes, including literal and coincidental occurrences, as well as various annotation errors. The construction of the corpus is described in the following publication: Agata Savary, Silvio Ricardo Cordeiro, Timm Lichte, Carlos Ramisch, Uxoa Iñurrieta, Voula Giouli (forthcoming) "Literal occurrences of multiword expressions: Rare birds that cause a stir", to appear in Prague Bulletin of Mathematical Linguistics.
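    The automatic extraction of potential co-occurrences can be sketched as follows. This is a simplified illustration and not the procedure used to build the corpus: it assumes that sentences are already lemmatized, that a candidate is any sentence in which all lemmas of a known VMWE co-occur without that VMWE being annotated there, and that no syntactic constraints are applied. The data layout (`lemmas`, `annotated`) and the function name are hypothetical.

```python
from collections import Counter


def candidate_sentences(sentences, known_vmwes):
    """Find sentences where the lemmas of a known VMWE co-occur unannotated.

    sentences:   list of dicts with keys
                 'lemmas'    -> list of lemmas of the sentence
                 'annotated' -> list of VMWEs (lemma lists) annotated in it
    known_vmwes: list of VMWEs, each a list of component lemmas
    Returns a list of (sentence_index, vmwe_lemmas) pairs to be checked
    manually as potential literal or coincidental occurrences.
    """
    candidates = []
    for idx, sent in enumerate(sentences):
        available = Counter(sent["lemmas"])
        annotated = {frozenset(Counter(v).items()) for v in sent["annotated"]}
        for vmwe in known_vmwes:
            needed = Counter(vmwe)
            # Simplification: skip the sentence for this VMWE as soon as one
            # occurrence of it is already annotated there.
            if frozenset(needed.items()) in annotated:
                continue
            # All component lemmas must be present often enough.
            if all(available[lemma] >= n for lemma, n in needed.items()):
                candidates.append((idx, tuple(vmwe)))
    return candidates


# Toy data (hypothetical).
sentences = [
    {"lemmas": ["he", "pull", "some", "string", "to", "get", "the", "job"],
     "annotated": [["pull", "string"]]},
    {"lemmas": ["she", "pull", "the", "string", "of", "the", "puppet"],
     "annotated": []},
    {"lemmas": ["the", "string", "be", "red"], "annotated": []},
]
known_vmwes = [["pull", "string"]]

candidates = candidate_sentences(sentences, known_vmwes)
print(candidates)
```

    In this sketch only the second sentence is returned as a candidate; whether it is a literal occurrence, a coincidental one, or an annotation error would still have to be decided by a human annotator.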

    A bilingual analysis of noun+verb combinations, with a view to advanced language applications

    Taking bilingual dictionaries as a basis, in this work we have studied Basque and Spanish noun+verb combinations. We examined the morphosyntactic and semantic features of the combinations and their equivalents, and compared the two languages side by side to analyze their differences and similarities. This article shows how complex such constructions are and, consequently, how important it is to treat them appropriately in Natural Language Processing applications, for example in machine translation. In addition, we have made all the results of the analysis available in a public interface, so that anyone can search the combinations we have worked on.