8 research outputs found

Learning cross-lingual phonological and orthagraphic adaptations: a case study in improving neural machine translation between low-resource languages

Author: Jha Saurav
Singh Anil Kumar
Sudhakar Akhilesh
Publication date: 01/01/2019

Out-of-vocabulary (OOV) words can pose serious challenges for machine translation (MT) tasks, and in particular, for low-resource language (LRL) pairs, i.e., language pairs for which few or no parallel corpora exist. Our work adapts variants of seq2seq models to perform transduction of such words from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs built from a bilingual dictionary of Hindi--Bhojpuri words. We demonstrate that our models can be effectively used for language pairs that have limited parallel corpora; our models work at the character level to grasp phonetic and orthographic similarities across multiple types of word adaptations, whether synchronic or diachronic, loan words or cognates. We describe the training aspects of several character-level NMT systems that we adapted to this task and characterize their typical errors. Our method improves BLEU score by 6.3 on the Hindi-to-Bhojpuri translation task. Further, we show that such transductions can generalize well to other languages by applying them successfully to Hindi--Bangla cognate pairs. Our work can be seen as an important step in the process of: (i) resolving the OOV words problem arising in MT tasks, (ii) creating effective parallel corpora for resource-constrained languages, and (iii) leveraging the enhanced semantic knowledge captured by word-level embeddings to perform character-level tasks.
Comment: 47 pages, 4 figures, 21 tables (including Appendices)

arXiv.org e-Print Archive

Implementación de un lematizador para una lengua de escasos recursos: caso shipibo-konibo [Implementation of a lemmatizer for a low-resource language: the case of Shipibo-Konibo]

Author: Pereira Noriega José Humberto
Publication venue: 'Baishideng Publishing Group Inc.'
Publication date: 01/01/2019

Since the Ministry of Education made the Shipibo-Konibo alphabet official, there has been a need to produce a large number of educational and official documents for speakers of this language, which at present are produced only with the help of translators or bilingual speakers. However, the field of computational linguistics offers tools that can ease this work, such as a lemmatizer, which obtains the lemma or base form of a word from its inflected form. Lemmatizers are commonly built using two methods: morphological rules and dictionaries. Accordingly, the main objective of this project is to develop a lemmatization tool for Shipibo-Konibo using a corpus of words, based on the annotation standards used for other languages and easy to use through a function library and a web service. The final tool was built mainly with the k-nearest neighbors classification method, which estimates the class of a new case by comparing its features with those of previously classified cases and returning the most frequent class among similar cases. Finally, the lemmatization tool achieved a precision of 0.736, thereby outperforming tools used for other languages. Thesis.

Repositorio institucional - Concytec

Robust part-of-speech tagging of social media text

Author: Horsmann Tobias
Publication date: 27/04/2018

Part-of-speech (PoS) taggers are an important processing component in many Natural Language Processing (NLP) applications, which has led to a variety of taggers for tackling this task. Recent work in this field has shown that tagging accuracy on informal text domains is poor in comparison to formal text domains. In particular, social media text, which is inherently different from formal standard text, leads to a drastically increased error rate. These challenges originate in a lack of robustness of taggers towards domain transfers.
This increased error rate has an impact on NLP applications that depend on PoS information. The main contribution of this thesis is the exploration of the concept of robustness under the following three aspects: (i) domain robustness, (ii) language robustness and (iii) long-tail robustness. Regarding (i), we start with an analysis of the phenomena found in informal text that make tagging this kind of text challenging. Furthermore, we conduct a comprehensive robustness comparison of many commonly used taggers for English and German by evaluating them on text from several domains. We find that the tagging of informal text is poorly supported by available taggers. A review and analysis of currently used methods to adapt taggers to informal text showed that these methods improve tagging accuracy but offer no satisfactory solution. We propose an alternative tagging approach that achieves increased multi-domain tagging robustness. This approach is based on tagging in two steps: the first step tags on a coarse-grained level, and the second step refines these tags to fine-grained tags. Regarding (ii), we investigate whether each language requires a language-tailored PoS tagger or whether the construction of a competitive language-independent tagger is feasible. We explore the technical details that contribute to a tagger's language robustness by comparing taggers based on different algorithms to learn models of 21 languages. We find that language robustness is a less severe issue and that the impact of the tagger choice depends more on the granularity of the tagset to be learned than on the language. Regarding (iii), we investigate methods to improve tagging of infrequent phenomena for which insufficient annotated training data is available, which is a common challenge in the social media domain. We propose a new method to overcome this lack of data that offers an inexpensive way of producing more training data.
In a field study, we show that the quality of the produced data suffices to train tagger models that can recognize these under-represented phenomena. Furthermore, we present two software tools, FlexTag and DeepTC, which we developed in the course of this thesis. These tools provide the necessary flexibility for conducting all the experiments in this thesis and ensure their reproducibility.

Duisburg-Essen Publications Online

EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

Publication venue: 'OpenEdition'
Publication date: 10/06/2022

Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it).

Categorisation of Arabic Twitter Text

Author: Altamimi Mohammed Hamed R
Publication date: 26/02/2020

EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

Author: Agerri Rodrigo
Aliprandi Carlo
Alkhalifa Rabab
Alzetta Chiara
Angel Jason
Anselmi Guido
Appiah Balaji Nitin Nikamanth
Aroyehun Segun Taofeek
Artigas Herold Maria Fernanda
Attanasio Giuseppe
Attardi Giuseppe
Badryzlova Yulia
Bai Yang
Baldissin Gioia
Ballarè Silvia
Barrón-Cedeño Alberto
Bartle Anna-Sophie
Basile Pierpaolo
Basile Valerio
Basili Roberto
Belotti Federico
Bennici Mauro
Bharathi B.
Bhuvana J.
Bianchi Federico
Bisconti Elia
Bolanos Luis
Bondielli Alessandro
Bosco Cristina
Breazzano Claudia
Brivio Matteo
Brunato Dominique
Cafagna Michele
Caputo Annalina
Caselli Tommaso
Cassotti Pierluigi
Castañeda Enrique
Castro Castro Daniel
Centeno Roberto
Cercel Dumitru-Clementin
Cerruti Massimo
Chandrabose Aravindan
Chesi Cristiano
Chiarello Filippo
Cignarella Alessandra Teresa
Cimino Andrea
Comandini Gloria
Croce Danilo
Dai Hongbing
Dascalu Mihai
Dell’Orletta Felice
Delmonte Rodolfo
Deng Tao
De Francesco Nazareno
De Martino Graziella
De Mattei Lorenzo
Di Buccio Emanuele
Di Maro Maria
di Nuovo Elisa
Di Rosa Emanuele
dos S.R. da Silva Adriano
Durante Alberto
El Abassi Samer
Espinosa María S.
Fabrizi Samuel
Fantoni Gualtiero
Ferilli Stefano
Ferraccioli Federico
Fersini Elisabetta
Finos Livio
Fiorucci Stefano
Fontana Michele
Frenda Simona
Gambino Giuseppe
Gatt Albert
Gelbukh Alexander
Giorgi Giulia
Giorgioni Simone
Girardi Paolo
Goria Eugenio
Gregori Lorenzo
Hoffmann Julia
Iacono Maria
Iovine Andrea
Izzi Giovanni Luca
Jimenez Sergio
Kaiser Jens
Kayalvizhi S.
Kivlichan Ian
Klaus Svea
Koceva Frosina
Kovács György
Kruschwitz Udo
Labadie Tamayo Roberto
Lai Mirko
Laicher Severin
Lapesa Gabriella
Lavergne Eric
Lebani Gianluca E.
Lees Alyssa
Lenci Alessandro
Leonardelli Elisa
Li Hongling
Liakata Maria
Lovetere Marco
Madonna Domenico
Massidda Riccardo
Mattei Lorenzo De
Mauri Caterina
Mele Francesco
Melucci Massimo
Menini Stefano
Miaschi Alessio
Miliani Martina
Moggio Alessio
Montagnani Matteo
Montefinese Maria
Montemagni Simonetta
Monti Johanna
Moraca Maurizio
Moretti Giovanni
Morra Simone
Murphy Killian
Muti Arianna
Nakov Preslav
Nisioi Sergiu
Nissim Malvina
Nozza Debora
Occhipinti Daniela
Ortega Bueno Reynier
Ou Xiaozhi
Palmonari Matteo
Parizzi Andrea
Pascucci Antonio
Passaro Lucia C.
Pastor Eliana
Patti Viviana
Pirrone Roberto
Polignano Marco
Politi Marcello
Pont Mattia Da
Pražák Ondřej
Proisl Thomas
Puccetti Giovanni
Přibáň Pavel
Radicioni Daniele P.
Rama Ilir
Rambelli Giulia
Ravelli Andrea Amelio
Rodrigo Alvaro
Rodriguez-Diaz Carlos A.
Rodriguez Cisnero Mariano Jason
Roman Norton T.
Roman Norton Trevisan
Rossmann Daniela
Rosso Paolo
Rotaru Armand Stefan
Rubino Edoardo
Russo Irene
Sabella Gianluca
Saini Rajkumar
Salman Samir
Sangati Federico
Sanguinetti Manuela
Sarti Gabriele
Schlechtweg Dominik
Schulte im Walde Sabine
Sciandra Andrea
Setpal Jinen
Siciliani Lucia
Solari Dario
Sorensen Jeffrey
Sorgente Antonio
Sprugnoli Rachele
Stranisci Marco
Tamburini Fabio
Taylor Stephen
Tesei Andrea
Thenmozhi D.
Tonelli Sara
Torre Ilaria
Tsakalidis Adam
Varvara Rossella
Venturi Giulia
Vettigli Giuseppe
Vlad George-Alexandru
Wang Benyou
Zaharia George-Eduard
Zamparelli Roberto
Zubiaga Arkaitz
Publication venue: 'OpenEdition'
Publication date: 11/05/2021

OpenEdition