Search CORE

1,517 research outputs found

Character-level and syntax-level models for low-resource and multilingual natural language processing

Author: Severini Silvia
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 05/07/2023
Field of study

There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between Named Entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter. In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs since they are simple to extract, effective for bootstrapping the mapping of BWEs, and overcome the failure of unsupervised methods. The fourth paper shows our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part of speech tagging task. Part of speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries)

Digitale Hochschulschriften der LMU

TRANSLIT : a large-scale name transliteration resource

Author: Benites de Azevedo e Souza Fernando
Cieliebak Mark
Duivesteijn Gilbert François
von Däniken Pius
Publication venue: European Language Resources Association
Publication date: 01/05/2020
Field of study

Transliteration is the process of expressing a proper name from a source language in the characters of a target language (e.g. from Cyrillic to Latin characters). We present TRANSLIT, a large-scale corpus with approx. 1.6 million entries in more than 180 languages with about 3 million variations of person and geolocation names. The corpus is based on various public data sources, which have been transformed into a unified format to simplify their usage, plus a newly compiled dataset from Wikipedia. In addition, we apply several machine learning methods to establish baselines for automatically detecting transliterated names in various languages. Our best systems achieve an accuracy of 92\% on identification of transliterated pairs

ZHAW digitalcollection

ParaNames: A Massively Multilingual Entity Name Corpus

Author: Lignos Constantine
Sälevä Jonne
Publication venue
Publication date: 31/03/2022
Field of study

This preprint describes work in progress on ParaNames, a multilingual parallel name resource consisting of names for approximately 14 million entities. The included names span over 400 languages, and almost all entities are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to-date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English. Our resource is released at \url{https://github.com/bltlab/paranames} under a Creative Commons license (CC BY 4.0)

arXiv.org e-Print Archive

Combining Word Embeddings with Bilingual Orthography Embeddings for Bilingual Dictionary Induction

Author: Fraser Alexander
Hangya Viktor
Schütze Hinrich
Severini Silvia
Publication venue
Publication date: 01/01/2020
Field of study

Crossref

Open Access LMU

Slav-NER : the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic languages

Author: Babych Bogdan
Kancheva Zara
Kanishcheva Olga
Lebedeva Maria
Marcinczuk Michał
Nakov Preslav
Osenova Petya
Piskorski Jakub
Pivovarova Lidia
Pollak Senja
Přibáň Pavel
Radev Ivaylo
Robnik-Šikonja Marko
Starko Vasyl
Steinberger Josef
Yangarber Roman
Publication venue
Publication date: 01/01/2021
Field of study

Non peer reviewe

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Helsingin yliopiston digitaalinen arkisto

Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

Author: Makazhanov Aibek
Myrzakhmetov Bagdat
Yessenbayev Zhandos
Publication venue: The IEEE 12th International Conference Application of Information and Communication Technologies
Publication date: 01/10/2018
Field of study

We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh or Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breach of spelling conventions, which aggravates data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy

Crossref

Nazarbayev University Repository

Exploiting Cross-Lingual Representations For Natural Language Processing

Author: Upadhyay Shyam
Publication venue: ScholarlyCommons
Publication date: 01/01/2019
Field of study

Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is di cult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive to annotate resources which are not available for most languages. In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals and study algorithmic approaches of using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combining supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest

ScholarlyCommons@Penn

Sub-Character Tokenization for Chinese Pretrained Language Models

Author: Chen Yingfa
Liu Qun
Liu Zhiyuan
Qi Fanchao
Si Chenglei
Sun Maosong
Wang Xiaozhi
Wang Yasheng
Zhang Zhengyan
Publication venue
Publication date: 22/12/2021
Field of study

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word tokenization. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to all homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code at https://github.com/thunlp/SubCharTokenization to facilitate future work.Comment: This draft supersedes the previous version named "SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining

arXiv.org e-Print Archive

Directory of Open Access Journals