
19 research outputs found

СУЧАСНІ МЕТОДИ ВИРІШЕННЯ ПРОБЛЕМИ ГРАМАТИЧНОЇ ОМОНІМІЇ В ТЕКСТІ (Modern methods for resolving grammatical homonymy in text)

Author: Буньо Г. (H. Bun'o)
Publication venue: Видавництво Національного університету «Острозька академія» (Publishing House of the National University of Ostroh Academy)
Publication date: 01/01/2014

The article examines grammatical homonymy, specifically morphological homonymy, from a text-centered perspective. It analyzes the main approaches, experience, and prospects for resolving this ambiguity during automatic morphological analysis of text, in particular for Ukrainian and other morphologically complex languages.

Цифровий архів Острозької академії (Digital Repository of Ostroh Academy)

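Corrección gramatical para euskera mediante una arquitectura neuronal seq2seq y ejemplos sintéticos (Grammatical error correction for Basque using a seq2seq neural architecture and synthetic examples)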

Author: Beloki Leiza Zuhaitz
Ceberio Berger Klara
Corral Ander
Saralegi Urizar Xabier
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/09/2020

Sequence-to-sequence neural architectures are the state of the art for grammatical error correction, but they require large training datasets. This paper studies the use of sequence-to-sequence neural models for correcting grammatical errors in Basque. As there is no training data for this language, the authors developed a rule-based method that generates grammatically incorrect sentences from a collection of correct sentences extracted from a corpus of 500,000 news items in Basque. Several training datasets were built using different strategies for combining the synthetic examples. Models based on the Transformer architecture were trained on these datasets, then evaluated and compared in terms of precision, recall and F0.5 score. The best model reaches an F0.5 score of 0.87.
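For readers unfamiliar with the F0.5 metric mentioned in this abstract, the following is a minimal sketch of the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). With beta = 0.5, precision weighs more than recall. The function name `f_beta` and the input values are illustrative assumptions and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Return the F-beta score, a weighted harmonic mean of precision and recall.

    beta < 1 favours precision, beta > 1 favours recall.
    The name and signature are hypothetical, chosen for illustration only.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Both precision and recall are zero, so the score is defined as zero.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Illustrative values only (not results from the paper): the same pair of
# numbers scores higher when the larger one is the precision.
high_precision = f_beta(precision=1.0, recall=0.5)
high_recall = f_beta(precision=0.5, recall=1.0)
print(round(high_precision, 3), round(high_recall, 3))
```

The asymmetry between the two printed values reflects why F0.5 is commonly preferred for error correction: a wrong correction is usually considered more harmful than a missed one.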

Repositorio Institucional de la Universidad de Alicante

Identificación de cláusulas y chunks para el Euskera, usando Filtrado y Ranking con el Perceptron

Author: Alegría Loinaz Iñaki
Arrieta Cortajarena Bertol
Carreras Pérez Xavier
Díaz de Ilarraza Sánchez Arantza
Uria Garin Larraitz
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/01/2008

This paper presents systems for syntactic chunking and clause identification for Basque, combining rule-based grammars with machine-learning techniques. Specifically, it uses Filtering-Ranking with Perceptrons (Carreras, Màrquez and Castro, 2005), a learning model that recognizes partial syntactic structures in sentences and obtains state-of-the-art performance on these tasks in English. The model can incorporate a rich set of features to represent syntactic phrases, which makes it possible to use information from different sources. The authors used this property to add more linguistic features to the learning model. As a result, chunking results improved considerably, compensating for the relatively small training corpus available for Basque. For clause identification, the preliminary results are low, probably because of the free word order of Basque and the small corpus currently available.

Research partly funded by the Basque Government (Department of Education, University and Research, IT-397-07), the Spanish Ministry of Education and Science (TIN2007-63173) and the ETORTEK-ANHITZ project from the Basque Government (Department of Culture and Industry, IE06-185).

Repositorio Institucional de la Universidad de Alicante

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

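Teknologia garatzeko estrategiak baliabide urriko hizkuntzetarako: euskararen eta Ixa taldearen adibidea (Strategies for developing technology for less-resourced languages: the example of Basque and the Ixa group)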

Author: Aduriz Itziar
Alegria Iñaki
Artola Xabier
Díaz De Ilarraza Arantza
Sarasola Kepa
Publication venue: HAL CCSD
Publication date: 01/06/2011

The article begins by presenting several figures that describe the situation of the Basque language, and then proposes a classification of the world's languages according to their presence on the Internet and in language technology. The body of the article presents the work done by the Ixa group in the automatic processing of Basque, identifying its seven main milestones and describing the strategy that has guided this development. The authors suggest that this strategy can serve as a reference for 190 languages that, according to the proposed classification, have no language technology resources yet but do have a minimal significant presence on the Internet.

ArtXiker - @HAL

Improving the automatic segmentation of subtitles through conditional random field

Author: Alvarez Aitor
Arzelus Haritz
Balenciaga Marina
del Pozo Arantza
Martínez-Hinarejos Carlos-D.
Publication venue: Elsevier BV
Publication date: 01/01/2017

Automatic segmentation of subtitles is a novel research field that has not been studied extensively to date. However, high-quality automatic subtitling is a real need for broadcasters, who are looking for automatic solutions given the demanding European audiovisual legislation. This article presents a method based on Conditional Random Fields for automatic subtitle segmentation. It continues previous work in the field, which proposed a method based on a Support Vector Machine classifier to generate candidate break points. Two corpora, in Basque and Spanish, were used for the experiments, and the performance of the new method was compared with the previous solution and with two rule-based systems using several evaluation metrics. Finally, an experiment with human evaluators was carried out to measure the productivity gain in post-editing automatic subtitles generated with the new method.

This work was partially supported by the project CoMUN-HaT - TIN2015-70924-C2-1-R (MINECO/FEDER).

Alvarez, A.; Martínez-Hinarejos, C.; Arzelus, H.; Balenciaga, M.; Del Pozo, A. (2017). Improving the automatic segmentation of subtitles through conditional random field. Speech Communication. 88:83-95. https://doi.org/10.1016/j.specom.2017.01.010

Crossref

RiuNet