8 research outputs found

The Corpus of Basque Simplified Texts (CBST)

Author: Aranzabe Urruzola María Jesús
Díaz de Ilarraza Sánchez María Aranzazu
González Dios Itziar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/03/2018

In this paper we present the Corpus of Basque Simplified Texts. The corpus compiles 227 original sentences from the science popularisation domain and two simplified versions of each sentence. The simplified versions were created following two different approaches: the structural approach, by a court translator who follows easy-to-read guidelines, and the intuitive approach, by a teacher drawing on her experience. The aim of this corpus is to enable a comparative analysis of simplified texts. To that end, we also present the annotation scheme we have created to annotate the corpus. The annotation scheme is divided into eight macro-operations: delete, merge, split, transformation, insert, reordering, no operation and other. Each macro-operation can be further classified into more specific operations. We also relate our work and results to other languages. This corpus will be used to corroborate the decisions taken and to improve the design of the automatic text simplification system for Basque.

Itziar Gonzalez-Dios's work was funded by a Ph.D. grant from the Basque Government and a postdoctoral grant for new doctors from the Vice-rectory of Research of the University of the Basque Country (UPV/EHU). We are very grateful to the translator and the teacher who simplified the texts. We also want to thank Dominique Brunato, Felice Dell'Orletta and Giulia Venturi for their help with the Italian annotation scheme and their suggestions when analysing the corpus, and Oier Lopez de Lacalle for his help with the statistical analysis. We also want to express our gratitude to the anonymous reviewers for their comments and suggestions. This research was supported by the Basque Government (IT344-10), and the Spanish Ministry of Economy and Competitiveness, EXTRECM Project (TIN2013-46616-C2-1-R)
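
As a rough illustration of the annotation scheme summarised in the abstract above, the eight macro-operations could be modelled as an enumeration and tallied per simplified version. This is only a sketch: the names `MacroOperation`, `count_operations` and `structural_labels` are hypothetical, the sample labels are invented for illustration and are not data from the corpus, and the finer-grained operations are not modelled because the abstract does not list them.

```python
from collections import Counter
from enum import Enum


class MacroOperation(Enum):
    """The eight macro-operations named in the abstract."""
    DELETE = "delete"
    MERGE = "merge"
    SPLIT = "split"
    TRANSFORMATION = "transformation"
    INSERT = "insert"
    REORDERING = "reordering"
    NO_OPERATION = "no operation"
    OTHER = "other"


def count_operations(labels):
    """Count how often each macro-operation occurs in a list of label strings.

    Raises ValueError if a label is not one of the eight macro-operations.
    """
    return Counter(MacroOperation(label) for label in labels)


# Hypothetical labels for one simplified version (assumption, not corpus data).
structural_labels = ["split", "delete", "split", "reordering"]
counts = count_operations(structural_labels)
```

Counts produced this way for the structural and the intuitive versions could then be compared side by side, which is the kind of comparative analysis the abstract describes.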

Archivo Digital para la Docencia y la Investigación

Euskarazko denbora-egiturak etiketatzeko gidalerroak v2.0

Author: Altuna Begoña
Aranzabe Urruzola María Jesús
Díaz de Ilarraza Sánchez María Aranzazu
Publication date: 11/02/2016

To interpret the temporal information in texts, a mark-up language that encodes that information is needed, so that the information becomes automatically accessible. The most widely used mark-up language is TimeML (Pustejovsky et al., 2003), which has also been chosen for Basque. In these guidelines we present the Basque version of ISO-TimeML (ISO-TimeML working group, 2008). After analysing the tags, attributes and values created for English, we describe the ones most appropriate for representing the information in Basque temporal structures.

Euskarazko denbora-egiturak etiketatzeko gidalerroak v1.0

Author: Altuna Begoña
Aranzabe Urruzola María Jesús
Díaz de Ilarraza Sánchez María Aranzazu
Publication date: 01/12/2014

To interpret the temporal information in texts, a mark-up language that encodes that information is needed, so that the information becomes automatically accessible. The most widely used mark-up language is TimeML (Pustejovsky et al., 2003), which has also been chosen for Basque. In these guidelines we present the Basque version of ISO-TimeML (ISO-TimeML working group, 2008). After analysing the tags, attributes and values created for English, we describe the ones most appropriate for representing the information in Basque temporal structures.

Perpaus adberbialen agerpena, maiztasuna eta kokapena EPEC‐DEP corpusean

Author: Aranzabe Urruzola María Jesús
Díaz de Ilarraza Sánchez María Aranzazu
González-Dios Itziar
Publication date: 26/02/2015

In this report we present the results of analysing the occurrence, frequency and position of adverbial clauses in Basque. The analysis was performed on the Basque Dependency Treebank (BDT, EPEC‐DEP). We also used the descriptive grammars of Euskaltzaindia, the Royal Academy of the Basque Language.

A methodology for the semiautomatic annotation of EPEC-RolSem, a Basque corpus labeled at predicative level following the PropBank-VerbNet model

Author: Aldezabal Roteta Izaskun
Aranzabe Urruzola María Jesús
Díaz de Ilarraza Sánchez María Aranzazu
Estarrona Ibarloza Ainara
Publication date: 01/01/2013

In this article we describe the methodology developed for the semiautomatic annotation of EPEC-RolSem, a Basque corpus labeled at predicate level following the PropBank-VerbNet model. The methodology presented is the product of a detailed theoretical study of the semantic nature of verbs in Basque and of their similarities to and differences from verbs in other languages. As part of the proposed methodology, we are creating a Basque lexicon on the PropBank-VerbNet model that we have named the Basque Verb Index (BVI). Our work thus dovetails with the general trend, evident in work conducted for other languages, toward building lexicons from tagged corpora. EPEC-RolSem and BVI are two important resources for the computational semantic processing of Basque; as far as the authors are aware, they are also the first resources of their kind developed for Basque. In addition, each entry in BVI is linked to the corresponding verb entry in well-known resources like PropBank, VerbNet, WordNet, Levin’s Classification and FrameNet. We have also implemented several automatic processes to aid in creating and annotating the BVI, including processes designed to facilitate the task of manual annotation.

Itzulpen automatikorako gaztelania-euskara patroiak : lehen urratsak

Author: Aranberri Monasterio Nora
Díaz de Ilarraza Sánchez María Aranzazu
Iñurrieta Usoa
Sarasola Gabiola Kepa Mirena
Publication date: 21/11/2013

In this work, we have created a set of example-based patterns with the aim of improving a rule-based machine translation system. To select only the examples that would yield the most useful patterns, we considered their frequency of use and the adequacy of their machine translations. We then generalised the named entities and numbers in the examples, so that the patterns can still be applied when those elements change.

Erreferentziakidetasun-sareen etiketatze-metodologia EPEC Corpusean tratamendu konputazionalari begira

Author: Aduriz Itziar
Ceberio Berger Klara
Díaz de Ilarraza Sánchez María Aranzazu
García Azkoaga Inés Mª
Publication venue: Servicio Editorial de la Universidad del País Vasco / Euskal Herriko Unibertsitateko Argitalpen Zerbitzua
Publication date: 01/01/2015

Commemorative volume (Festschrift) edited by Mª José Ezeizabarrena and Ricardo Góme

Towards a top-down approach for an automatic discourse analysis for Basque: Segmentation and Central Unit detection tool

Author: Atutxa Salazar Aitziber
Bengoetxea Kortazar Kepa Xabier
Díaz de Ilarraza Sánchez María Aranzazu
Iruskieta Quintian Mikel
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 04/09/2019

Lately, discourse structure has received considerable attention due to the benefits its application offers in several NLP tasks such as opinion mining, summarization, question answering and text simplification, among others. When automatically analyzing texts, discourse parsers typically perform two different tasks: i) identifying basic discourse units (text segmentation), and ii) linking discourse units by means of discourse relations, building structures such as trees or graphs. The resulting discourse structures are, in general terms, accurate for intra-sentence discourse relations; however, they fail to capture the correct inter-sentence relations. Detecting the main discourse unit (the Central Unit) helps discourse analyzers (and also manual annotators) improve their results in rhetorical labeling. Bearing this in mind, we set out to build the first two steps of a discourse parser following a top-down strategy: i) to find discourse units, and ii) to detect the Central Unit. The final step, i.e. assigning rhetorical relations, remains to be worked on in the immediate future. In accordance with this strategy, our paper presents a tool consisting of a discourse segmenter and an automatic Central Unit detector.

This study was carried out within the framework of the following projects: IXA Group: natural language processing IT1343-19 (Basque Government), DL4NLP KK-2019/00045 (Basque Government), PROSA-MED TIN2016-77820-C3-1-R (MINECO) and DeepReading: RTI2018-096846-B-C21 (MCIU/AEI/FEDER, UE)

Archivo Digital para la Docencia y la Investigación