Search CORE

46 research outputs found

A Cascaded Syntactic Analyser for Basque

Author: Aduriz Itziar,
Aranzabe Maxux,
Arriola Jose Maria,
Díaz De Ilarraza Arantza
Gojenola Koldo,
Oronoz Maite,
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2004
Field of study

This article presents a robust syntactic analyser for Basque and the different modules it contains. Each module is structured in different analysis layers for which each layer takes the information provided by the previous layer as its input; thus creating a gradually deeper syntactic analysis in cascade. This analysis is carried out using the Constraint Grammar (CG) formalism. Moreover, the article describes the standardisation process of the parsing formats using XML

Identificación de cláusulas y chunks para el Euskera, usando Filtrado y Ranking con el Perceptron

Author: Alegría Loinaz Iñaki
Arrieta Cortajarena Bertol
Carreras Pérez Xavier
Díaz de Ilarraza Sánchez Arantza
Uria Garin Larraitz
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/01/2008
Field of study

Este artículo presenta sistemas de identificación de chunks y cláusulas para el euskera, combinando gramáticas basadas en reglas con técnicas de aprendizaje automático. Más concretamente, se utiliza el modelo de Filtrado y Ranking con el Perceptron (Carreras, Màrquez y Castro, 2005): un modelo de aprendizaje que permite identificar estructuras sintácticas parciales en la oración, con resultados óptimos para estas tareas en inglés. Este modelo permite incorporar nuevos atributos, y posibilita así el uso de información de diferentes fuentes. De esta manera, hemos añadido información lingüística en los algoritmos de aprendizaje. Así, los resultados del identificador de chunks han mejorado considerablemente y se ha compensado la influencia del relativamente pequeño corpus de entrenamiento que disponemos para el euskera. En cuanto a la identificación de cláusulas, los primeros resultados no son demasiado buenos, debido probablemente al orden libre del euskera y al pequeño corpus del que disponemos actualmente.This paper presents systems for syntactic chunking and clause identification for Basque, combining rule-based grammars with machine-learning techniques. Precisely, we used Filtering-Ranking with Perceptrons (Carreras, Màrquez and Castro, 2005): a learning model that recognizes partial syntactic structures in sentences, obtaining state-of-the-art performance for these tasks in English. This model allows incorporating a rich set of features to represent syntactic phrases, making possible to use information from different sources. We used this property in order to include more linguistic features in the learning model and the results obtained in chunking have been improved greatly. This way, we have made up for the relatively small training data available for Basque to learn a chunking model. In the case of clause identification, our preliminary results are low, which suggest that this is due to the free order of Basque and to the small corpus available.Research partly funded by the Basque Government (Department of Education, University and Research, IT-397-07), the Spanish Ministry of Education and Science (TIN2007-63173) and the ETORTEK-ANHITZ project from the Basque Government (Department of Culture and Industry, IE06- 185)

Repositorio Institucional de la Universidad de Alicante

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Pronominal anaphora in Basque: annotation of a real corpus

Author: Aduriz Itziar
Ceberio Klara
Díaz de Ilarraza Sánchez Arantza
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN)
Publication date: 25/02/2019
Field of study

This paper describes the process followed in the annotation of pronominal anaphora in the Eus3LB corpus1 of Basque. Our aim is to use this annotation as the basis for later computational treatment of our language. We present the linguistic analysis carried out, the criteria defined for the tagging and some relevant linguistic conclusions about the features of the antecedents needed to link them correctly to their anaphoric elements

Diposit Digital de la Universitat de Barcelona

Corpusen etiketatze linguistikoa

Author: Aldezabal Roteta Izaskun
Aranzabe Urruzola María Jesús
Díaz de Ilarraza Sánchez Arantza
Estarrona Ibarloza Ainara
Ezeiza Ramos Nerea
Uria Garin Larraitz
Publication venue: Servicio Editorial de la Universidad del País Vasco/Euskal Herriko Unibertsitatearen Argitalpen Zerbitzua
Publication date: 01/01/2009
Field of study

In this article, we shall comment on the steps that have to be taken to give a linguistic label to a corpus and the difficulties that appear in this process. Our main objective was to highlight the importance of the labelling when preparing a corpus that is useful for linguistic research, and the need to establish criteria and to take the decisions that this entails. We also explain how semi-automatic methods are applied and how the manual revision that guarantees the quality of the corpus is carried out. Once the corpus has been revised and labelled, it will be useful both for carrying out linguistic analyses and for improving or assessing the linguistic tools and resources, and also for channelling automatic study

Archivo Digital para la Docencia y la Investigación

Natural Language Processing and Language Technologies for the Basque Language

Author: Begoña Altuna
Itziar Gonzalez-Dios
Publication venue: 'University of Deusto'
Publication date: 01/07/2022
Field of study

The presence of a language in the digital domain is crucial for its survival, as online communication and digital language resources have become the standard in the last decades and will gain more importance in the coming years. In order to develop advanced systems that are considered the basics for an efficient digital communication (e.g. machine translation systems, text-to-speech and speech-to-text converters and digital assistants), it is necessary to digitalise linguistic resources and create tools. In the case of Basque, scholars have studied the creation of digital linguistic resources and the tools that allow the development of those systems for the last forty years. In this paper, we present an overview of the natural language processing and language technology resources developed for Basque, their impact in the process of making Basque a “digital language” and the applications and challenges in multilingual communication. More precisely, we present the well-known products for Basque, the basic tools and the resources that are behind the products we use every day. Likewise, we would like that this survey serves as a guide for other minority languages that are making their way to digitalisation. Received: 05 April 2022 Accepted: 20 May 202

Directory of Open Access Journals

Corpusen etiketatze linguistikoa

Author: Aldezabal Roteta Izaskun
Aranzabe Urruzola María Jesús
Díaz de Ilarraza Sánchez Arantza
Estarrona Ibarloza Ainara
Ezeiza Ramos Nerea
Uria Garin Larraitz
Publication venue: Servicio Editorial de la Universidad del País Vasco/Euskal Herriko Unibertsitatearen Argitalpen Zerbitzua
Publication date: 01/01/2009
Field of study

Archivo Digital para la Docencia y la Investigación

Universidad del País Vasco / Euskal Herriko Unibertsitatea: Ciencia - Portal de revistas digitales de la UPV/EHU

Readability assessment and automatic text simplification, the analysis of basque complex structures

Author: González Dios Itziar
Publication venue
Publication date: 23/06/2016
Field of study

301 p.(eus); 217 (eng)Tesi-lan honetan, euskarazko testuen konplexutasuna eta sinplifikazioa automatikoki aztertzeko lehen urratsak egin ditugu. Testuen konplexutasuna aztertzeko, testuen sinplifikazio automatikoa helburu duten beste hizkuntzetako lanetan eta euskarazko corpusetan egindako azterketa linguistikoan oinarritu gara. Azterketa horietatik testuak automatikoki sinplifikatzeko oinarri linguistikoak ezarri ditugu. Konplexutasuna automatikoki analizatzeko, ezaugarri linguistikoetan eta ikasketa automatikoko tekniketan oinarrituta ErreXail sistema sortu eta inplementatu dugu.Horretaz gain, testuak automatikoki sinplifikatuko dituen Euskarazko Testuen Sinplifikatzailea (EuTS) sistemaren arkitektura diseinatu dugu, sistemaren modulu bakoitzean egingo diren eragiketak definituz eta, kasu-azterketa bezala,informazio biografikoa duten egitura parentetikoak sinplifikatuko dituen Biografix tresna eleaniztuna inplementatuz.Amaitzeko, Euskarazko Testu Sinplifikatuen Corpusa (ETSC) corpusa osatu dugu. Corpus hau baliatu dugu gure sinplifikaziorako azterketetatik ateratako hurbilpena beste batzuekin erkatzeko. Konparazio horiek egiteko, etiketatze-eskema bat ere definitu dugu

Archivo Digital para la Docencia y la Investigación

Evaluation of automatic break insertion for an agglutinative and inflected language

Author: Agresti
Allen
Bachenko
Blum
Breiman
Carletta
Eva Navas
Frazier
Hirschberg
Inmaculada Hernáez
Iñaki Sainz
Liberman
Maragoudakis
Oparin
Ostendorf
Read
Salton
Sangho
Siegel
Stone
Taylor
van Rijsbergen
Wang
Yoon
Zellner
Zervas
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref

Euskarazko hitz anitzeko unitate lexikalen tratamendu konputazionala

Author: Alegría Loinaz Iñaki
Ezeiza Ramos Nerea
Odriozola Pereira Juan Carlos
Urízar Enbeitia Rubén
Publication venue: Servicio Editorial de la Universidad del País Vasco/Euskal Herriko Unibertsitatearen Argitalpen Zerbitzua
Publication date: 01/01/2009
Field of study

Multi-word Lexical Units (MWLU) are of great importance in language in general, and in Natural Language Processing in particular, since they are not governed by the free rules of the system. In this article, we give an overview of the different types of phraseological units, explaining briefly each one's features. Our priority being to process idioms automatically in Basque texts, we concisely analyze several approaches for the inflectional description of MWLUs, and then, we explain the system we have developed for Basque: (i) a general representation for describing MWLUs in the lexical database for Basque (EDBL), (ii) HABIL, a tool capable of detecting and analyzing them based on the features described in the database, and (iii) a constraint grammar for disambiguating ambiguous MWLUs

Archivo Digital para la Docencia y la Investigación

Universidad del País Vasco / Euskal Herriko Unibertsitatea: Ciencia - Portal de revistas digitales de la UPV/EHU