479 research outputs found

    2kenize: Tying Subword Sequences for Chinese Script Conversion

    Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have poor performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities. Comment: Accepted to ACL 2020.
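
    The disambiguation problem described above can be made concrete with a minimal sketch (not the authors' 2kenize implementation): enumerate the traditional-character candidates licensed by a simplified-to-traditional mapping table and keep the one preferred by a language-model score. The toy mapping table and placeholder scorer below are illustrative assumptions; 2kenize itself ties subword sequences and scores them with two language models rather than scoring whole strings.

```python
from itertools import product

# Toy one-to-many simplified-to-traditional mapping (illustrative assumption;
# the real table covers thousands of characters).
S2T = {
    "发": ["發", "髮"],  # "emit" vs. "hair"
    "头": ["頭"],
}

def candidates(simplified: str):
    """Enumerate every traditional-character sequence the input can map to."""
    options = [S2T.get(ch, [ch]) for ch in simplified]
    for combo in product(*options):
        yield "".join(combo)

def lm_score(sequence: str) -> float:
    """Placeholder for a language-model score over traditional-script text.
    2kenize instead scores tied subword sequences with two language models."""
    fluent_fragments = {"頭髮"}  # toy "fluent" fragment (assumption)
    return sum(1.0 for frag in fluent_fragments if frag in sequence)

def convert(simplified: str) -> str:
    """Return the candidate preferred by the (placeholder) language model."""
    return max(candidates(simplified), key=lm_score)

print(convert("头发"))  # -> 頭髮 under the toy scorer ("头发" = hair)
```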

    ν•œκ΅­μ–΄ μžμ—°μ–΄ 처리λ₯Ό μœ„ν•œ ν† ν¬λ‚˜μ΄μ €μ— λŒ€ν•œ 뢄석

    Master's thesis -- Seoul National University, Graduate School of Data Science, Department of Data Science, February 2023. Advisor: Jaejin Lee (μ΄μž¬μ§„). Although Korean tokenizers have been studied intensively, few studies reflect the grammatical characteristics of Korean, which is classified as an agglutinative language. Unlike inflectional languages such as English, where each word segment is a single word, a Korean word segment is not a single word but a combination of several words, or of a word and its corresponding grammatical elements. It is therefore unreasonable to use tokenizers developed for English in Korean NLP (Natural Language Processing). By comparing and analyzing seven tokenizers currently in mainstream use for Korean NLP (including Mecab-ko, BPE, WordPiece, and Unigram), we argue for a new Korean tokenizer that reflects these grammatical characteristics, and we summarize the characteristics such a tokenizer should reflect. Contents: 1 Introduction; 1.1 Purpose of Research; 2 Background and Related Work; 2.1 Tokenizer; 2.2 GPT-2; 2.3 Related Work; 3 Experiments; 3.1 Dataset; 3.2 Tokenizer Training; 3.3 GPT Pretraining; 3.4 GPT Finetuning; 3.5 Results; 4 Conclusion; 4.1 Analysis.
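
    As an illustration of the kind of comparison the thesis describes, the sketch below trains toy BPE and Unigram models with the sentencepiece library and segments the same Korean sentence with each. The corpus file name, vocabulary size, and example sentence are assumptions, and Mecab-ko is omitted because it requires an external morpheme dictionary.

```python
import sentencepiece as spm

# Train toy BPE and Unigram models on a small Korean corpus; the file name
# "corpus_ko.txt" and vocab_size are illustrative assumptions.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus_ko.txt",
        model_prefix=f"ko_{model_type}",
        vocab_size=4000,
        model_type=model_type,
        character_coverage=0.9995,
    )

sentence = "λ‚˜λŠ” ν•™κ΅μ—μ„œ ν•œκ΅­μ–΄λ₯Ό 배웠닀"  # "I learned Korean at school"

for model_type in ("bpe", "unigram"):
    sp = spm.SentencePieceProcessor(model_file=f"ko_{model_type}.model")
    # An eojeol such as "ν•™κ΅μ—μ„œ" (school + locative particle) should ideally
    # split at the morpheme boundary "ν•™κ΅ / μ—μ„œ"; purely statistical subword
    # models often split elsewhere, which is what the thesis compares.
    print(model_type, sp.encode(sentence, out_type=str))
```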

    From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation

    Subword tokenization has been widely successful in text-based natural language processing (NLP) tasks with Transformer-based models. As Transformer models become increasingly popular in symbolic music-related studies, it is imperative to investigate the efficacy of subword tokenization in the symbolic music domain. In this paper, we explore subword tokenization techniques, such as byte-pair encoding (BPE), in symbolic music generation and their impact on the overall structure of generated songs. Our experiments are based on three types of MIDI datasets: single-track melody only, multi-track with a single instrument, and multi-track and multi-instrument. We apply subword tokenization on top of existing musical tokenization schemes and find that it enables the generation of longer songs at the same time and improves the overall structure of the generated music in terms of objective metrics such as the structure indicator (SI) and Pitch Class Entropy. We also compare two subword tokenization methods, BPE and Unigram, and observe that both lead to consistent improvements. Our study suggests that subword tokenization is a promising technique for symbolic music generation and may have broader implications for music composition, particularly in cases involving complex data such as multi-track songs.
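
    A minimal sketch of the general idea of applying BPE on top of an existing musical tokenization: treat each MIDI event token as a symbol and greedily merge the most frequent adjacent pair into a compound token, shortening sequences so that longer songs fit into the same context window. The REMI-like event names and the tiny corpus below are assumptions, not the paper's pipeline or datasets.

```python
from collections import Counter

# Toy corpus of already-tokenized MIDI events; the REMI-like event names
# are illustrative assumptions, not the paper's exact vocabulary.
corpus = [
    ["Bar", "Pos_1", "Pitch_60", "Dur_4", "Pos_5", "Pitch_64", "Dur_4"],
    ["Bar", "Pos_1", "Pitch_60", "Dur_4", "Pos_5", "Pitch_67", "Dur_4"],
    ["Bar", "Pos_1", "Pitch_62", "Dur_2", "Pos_3", "Pitch_60", "Dur_4"],
]

def merge_pair(seq, pair, new_token):
    """Replace every adjacent occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(corpus, num_merges=3):
    """Greedy BPE over event tokens: repeatedly fuse the most frequent pair."""
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in corpus:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        new_token = "+".join(best)
        merges.append(best)
        corpus = [merge_pair(seq, best, new_token) for seq in corpus]
    return merges, corpus

merges, merged_corpus = learn_bpe(corpus)
print(merges)             # learned merges, e.g. frequent (Pitch, Dur) pairs
print(merged_corpus[0])   # shorter sequences -> longer songs per context window
```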

    Better Word Embeddings by Disentangling Contextual n-Gram Information

    Pre-trained word vectors are ubiquitous in Natural Language Processing applications. In this paper, we show how training word embeddings jointly with bigram and even trigram embeddings results in improved unigram embeddings. We claim that training word embeddings along with higher n-gram embeddings helps remove contextual information from the unigrams, resulting in better stand-alone word embeddings. We empirically show the validity of our hypothesis by outperforming other competing word representation models by a significant margin on a wide variety of tasks. We make our models publicly available. Comment: NAACL 2019.
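
    A rough sketch of the training idea, assuming a CBOW-style objective: the context is represented by averaging unigram and higher-order n-gram vectors, so the n-gram vectors can absorb contextual information and leave cleaner stand-alone unigram vectors. The dimensions, random lookup table, and scoring function below are placeholders, not the authors' released model.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # embedding size (illustrative assumption)

# Lookup table shared by unigrams, bigrams and trigrams; in real training
# these vectors are learned jointly, here they are random placeholders.
table = {}

def vec(key):
    if key not in table:
        table[key] = rng.normal(scale=0.1, size=DIM)
    return table[key]

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def context_vector(context_tokens, max_n=3):
    """CBOW-style context representation: average unigram AND higher n-gram
    vectors, so the n-gram vectors can absorb the contextual information."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(context_tokens, n))
    return np.mean([vec(f) for f in feats], axis=0)

def score(target, context_tokens):
    """Dot-product compatibility between a target word and its context."""
    return float(vec(target) @ context_vector(context_tokens))

print(score("york", ["he", "moved", "to", "new"]))
```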

    Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation

    There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; and subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the pair of target languages are related, with one being a very low-resource and the other a higher-resource language. We test various methods on three artificially restricted translation tasks: English to Estonian (low-resource) and Finnish (high-resource), English to Slovak and Czech, and English to Danish and Swedish; and on one real-world task, Norwegian to North SΓ‘mi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, denoising autoencoding, and subword sampling. Peer reviewed.
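
    Of the ingredients listed above, subword sampling is the easiest to show in isolation. The sketch below uses the sampling mode of a sentencepiece unigram model so that the same sentence is segmented differently on each pass, which acts as regularization for the translation model; the model file "en.model", the example sentence, and the hyperparameters are illustrative assumptions, and the scheduled multi-task learning and denoising components are not reproduced here.

```python
import sentencepiece as spm

# A unigram SentencePiece model trained on the source-language data;
# "en.model" and all hyperparameters below are illustrative assumptions.
sp = spm.SentencePieceProcessor(model_file="en.model")

sentence = "transfer learning helps the low-resource target language"

# Deterministic (best) segmentation, as a plain pipeline would use.
print(sp.encode(sentence, out_type=str))

# Subword sampling: a different segmentation is drawn on every call, so the
# translation model sees many segmentations of the same training sentence.
for epoch in range(3):
    sampled = sp.encode(
        sentence,
        out_type=str,
        enable_sampling=True,
        alpha=0.1,      # temperature-like smoothing of the sampling distribution
        nbest_size=-1,  # sample from the full segmentation lattice
    )
    print(epoch, sampled)
```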