25 research outputs found

    Pronominal anaphora in Basque: annotation of a real corpus

    This paper describes the process followed in the annotation of pronominal anaphora in the Eus3LB corpus of Basque. Our aim is to use this annotation as the basis for later computational treatment of our language. We present the linguistic analysis carried out, the criteria defined for the tagging and some relevant linguistic conclusions about the features of the antecedents needed to link them correctly to their anaphoric elements.

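    Corrección gramatical para euskera mediante una arquitectura neuronal seq2seq y ejemplos sintéticos (Grammatical error correction for Basque using a seq2seq neural architecture and synthetic examples)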

    Sequence-to-sequence neural architectures are the state of the art for grammatical error correction, but they require large training datasets. This paper studies the use of sequence-to-sequence neural models for correcting grammatical errors in Basque. As no training data exists for this language, we developed a rule-based method that generates grammatically incorrect sentences from a collection of correct sentences extracted from a corpus of 500,000 Basque news items. We built several training datasets according to different strategies for combining the synthetic examples. Models based on the Transformer architecture were trained on these datasets and evaluated in terms of precision, recall and F0.5 score. The best model reaches an F0.5 score of 0.87.
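    The F0.5 score mentioned in this abstract is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. The sketch below shows the standard formula only; the function name `f_beta` and the input values are illustrative assumptions and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    With beta < 1 precision counts more than recall; beta = 1 gives F1.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        # Both precision and recall are zero, so the score is defined as zero.
        return 0.0
    return (1 + b2) * precision * recall / denom


# Hypothetical inputs for illustration, not results reported in the paper.
high_precision = f_beta(0.9, 0.6)  # precision-heavy system
high_recall = f_beta(0.6, 0.9)     # recall-heavy system
```

    With these made-up inputs the precision-heavy system scores higher than the recall-heavy one, which is the reason F0.5 is a common choice when false corrections are considered more harmful than missed errors.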

    Pronominal Anaphora in Basque: computational point of view and the development of a corpus

    This paper describes the process of annotating pronominal anaphora in a Basque corpus of 54,000 words. Our aim is to use this annotation as a basis for later computational processing. The linguistic study carried out and the criteria defined for the tagging process are also presented in the paper.

    Euskarazko anafora pronominala: ikuspuntu konputazionala eta corpus baten garapena (Pronominal anaphora in Basque: computational point of view and the development of a corpus)


    LINGUATEC: Development of linguistic resources to advance the digitisation of the languages of the Pyrenees

    The goal of the project is to develop, test and disseminate new linguistic resources, tools and innovative applications to improve the level of digitalization of Aragonese, Basque and Occitan. Expected results include: (1) a roadmap for the digitalization of Aragonese; (2) new linguistic resources, among them monolingual and bilingual lexicons and morphosyntactic and syntactic analysers for Occitan; (3) linguistic tools, including speech synthesis for Occitan, Aragonese and Northern Basque (the Basque of the French Basque Country), a Northern Basque speech recognition system, a detector of Occitan text and its variants, and improved machine translation from French to Occitan, for the Spanish-Basque pair, and from Spanish to Aragonese; and (4) innovative applications for the languages of the Pyrenees.
    This research was carried out as part of the project “LINGUATEC: Desarrollo de la cooperación transfronteriza y transferencia de conocimiento en tecnologías de la lengua” (POCTEFA EFA227/16, FEDER), funded by the Ministerio de Economía y Competitividad (Ministry of Economy and Competitiveness) and the European Regional Development Fund (FEDER).

    State-of-the-art on monolingual lexicography for Basque (Basque)

    In this article, we give an overview of the evolution of Basque lexicography to the present, pointing out its main achievements and shortcomings, as well as its challenges for the future. Basque lexicography has a relatively short history, but a considerable number of resources have been produced in the last 50 years, since the standardisation process began. After years of lexicographic work by different groups and publishers, a remarkable achievement is the Dictionary of the Academy (Euskaltzaindiaren Hiztegia), a recently published, up-to-date prescriptive dictionary based on historical and contemporary corpora. Although the number of monolingual products has increased noticeably in recent years, Basque dictionary making has been especially productive for bilingual purposes, probably owing to the sociolinguistic status of the language. Specialized lexicography and terminology, for their part, have been very active since the beginning of the standardisation process. Since the beginning of the 21st century, the use of corpora has gained increasing momentum. Many Basque dictionaries are freely available on the Internet.

    Pronominal anaphora in Basque: annotation of a real corpus

    This paper describes the manual annotation of pronominal anaphora in the Eus3LB corpus, a 54,000-word corpus of written Basque text annotated at the syntactic level. Our aim is to use this annotation as the basis for later computational treatment of our language. We present the linguistic analysis carried out, the criteria defined for the tagging and some relevant linguistic conclusions about the features of the antecedents needed to link them correctly to their anaphoric elements.
