479 research outputs found
2kenize: Tying Subword Sequences for Chinese Script Conversion
Simplified-to-Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a single simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that disambiguates between these mappings and converts between the two scripts. The model combines subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert a pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths lie in dealing with code-mixing and named entities.
Comment: Accepted to ACL 202
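The one-to-many mapping problem the abstract describes can be illustrated with a toy sketch. The mapping table and bigram counts below are tiny hand-made examples, and scoring candidates with a character bigram language model is a simplification of (not a reconstruction of) the paper's subword-based model:

```python
# Toy illustration: disambiguating one-to-many Simplified -> Traditional
# Chinese mappings by scoring candidate sequences with a bigram model.
from itertools import product

# Simplified character -> candidate Traditional characters (hypothetical subset).
S2T = {"发": ["發", "髮"], "头": ["頭"], "出": ["出"]}

# Hand-made bigram counts standing in for a real Traditional-Chinese LM.
BIGRAM = {("頭", "髮"): 50, ("出", "發"): 80, ("頭", "發"): 1, ("出", "髮"): 1}

def convert(simplified: str) -> str:
    """Pick the candidate traditional sequence with the highest bigram score."""
    candidates = [S2T.get(ch, [ch]) for ch in simplified]

    def score(seq):
        return sum(BIGRAM.get(pair, 0) for pair in zip(seq, seq[1:]))

    return "".join(max(product(*candidates), key=score))

print(convert("头发"))  # 頭髮 (hair), not the incorrect 頭發
print(convert("出发"))  # 出發 (depart)
```

The point of the example is only that the correct character for 发 depends on context, which is why context-free per-character substitution fails.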
Analysis of Tokenizers for Korean Natural Language Processing
Master's thesis -- Seoul National University, Graduate School of Data Science, Department of Data Science, February 2023. Advisor: 이재진 (Jaejin Lee).
Although Korean tokenizers have been studied intensively, few studies reflect the grammatical characteristics of Korean, which is classified as an agglutinative language. Unlike in inflectional languages such as English, where each word segment is a single word, a Korean word segment (eojeol) is a combination of several words, or of a word and its corresponding grammatical elements. It is therefore unreasonable to use tokenizers developed for English in Korean NLP (natural language processing).
By comparing and analyzing 7 tokenizers (Mecab-ko, BPE, WordPiece, Unigram) that are currently mainly used in Korean NLP, we establish the need for a new Korean tokenizer that reflects Korean grammatical characteristics, and we summarize the characteristics that such a tokenizer should reflect.
1 Introduction
1.1 Purpose of Research
2 Background and Related Work
2.1 Tokenizer
2.2 GPT-2
2.3 Related Work
3 Experiments
3.1 Dataset
3.2 Tokenizer Training
3.3 GPT Pretraining
3.4 GPT Finetuning
3.5 Results
4 Conclusion
4.1 Analysis
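The agglutination point made in the abstract can be sketched with a toy stand-in. The particle list and the suffix-stripping rule below are hypothetical and far simpler than real morphological analysis (e.g. Mecab-ko), but they show why whitespace tokenization treats 학교, 학교에, and 학교는 as unrelated types:

```python
# Toy eojeol splitter: an eojeol like "학교에" is noun "학교" + particle "에".
# The particle list is a tiny hypothetical subset, not real Korean grammar coverage.
PARTICLES = ["은", "는", "이", "가", "을", "를", "에"]

def split_eojeol(eojeol: str):
    """Split off a trailing particle if one matches; otherwise keep the eojeol whole."""
    for p in PARTICLES:
        if eojeol.endswith(p) and len(eojeol) > len(p):
            return [eojeol[: -len(p)], p]
    return [eojeol]

print(split_eojeol("학교에"))  # ['학교', '에']
print(split_eojeol("학교"))    # ['학교']
```

A tokenizer that never performs this kind of split cannot share the stem 학교 across its inflected forms, which is the thesis's core criticism of English-oriented tokenizers.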
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation
Subword tokenization has been widely successful in text-based natural language processing (NLP) tasks with Transformer-based models. As Transformer models become increasingly popular in symbolic-music studies, it is important to investigate the efficacy of subword tokenization in the symbolic music domain. In this paper, we explore subword tokenization techniques, such as byte-pair encoding (BPE), in symbolic music generation and their impact on the overall structure of generated songs. Our experiments are based on three types of MIDI datasets: single-track melody only, multi-track with a single instrument, and multi-track with multiple instruments. We apply subword tokenization on top of musical tokenization schemes and find that it enables the generation of longer songs while also improving the overall structure of the generated music in terms of objective metrics such as the structure indicator (SI) and pitch-class entropy. We also compare two subword tokenization methods, BPE and Unigram, and observe that both lead to consistent improvements. Our study suggests that subword tokenization is a promising technique for symbolic music generation and may have broader implications for music composition, particularly for complex data such as multi-track songs.
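As a rough sketch of how BPE compresses symbolic-music token sequences, the example below runs a greedy pair-merge loop over toy REMI-style tokens. The token names and the two "songs" are hypothetical, and this is a sketch of the general technique, not the paper's exact pipeline:

```python
# Toy BPE over symbolic-music-style token sequences.
from collections import Counter

def merge_seq(seq, a, b):
    """Replace every adjacent (a, b) pair with a single merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + "+" + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(sequences, num_merges):
    """Greedily merge the most frequent adjacent token pair, num_merges times."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        seqs = [merge_seq(s, a, b) for s in seqs]
    return merges, seqs

songs = [
    ["Bar", "Pos_0", "Pitch_60", "Dur_4", "Pos_4", "Pitch_64", "Dur_4"],
    ["Bar", "Pos_0", "Pitch_60", "Dur_4", "Pos_4", "Pitch_67", "Dur_4"],
]
merges, encoded = learn_bpe(songs, 3)
# Shorter encoded sequences mean more musical content fits in the same
# model context window, which is one way BPE can yield "longer" songs.
print([len(s) for s in songs], "->", [len(s) for s in encoded])
```

The compression effect, not the specific merges, is the point: recurring note/duration patterns collapse into single vocabulary items.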
Better Word Embeddings by Disentangling Contextual n-Gram Information
Pre-trained word vectors are ubiquitous in natural language processing applications. In this paper, we show that training word embeddings jointly with bigram and even trigram embeddings results in improved unigram embeddings. We claim that training word embeddings alongside higher-order n-gram embeddings helps remove contextual information from the unigrams, resulting in better stand-alone word embeddings. We empirically validate this hypothesis by outperforming other competing word representation models by a significant margin on a wide variety of tasks. We make our models publicly available.
Comment: NAACL 201
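The intuition that frequent n-grams carry contextual meaning of their own can be sketched with a word2phrase-style preprocessing pass, where frequent bigrams become single tokens so their meaning does not leak into the unigram vectors. This is a hypothetical simplification for illustration only; the paper instead trains unigram and n-gram embeddings jointly:

```python
# Toy sketch: promote frequent bigrams to single tokens before embedding training.
from collections import Counter

def promote_bigrams(tokens, min_count=2):
    """Greedily fuse adjacent token pairs that occur at least min_count times."""
    counts = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and counts[(tokens[i], tokens[i + 1])] >= min_count:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

corpus = "new york is big new york is far".split()
print(promote_bigrams(corpus))
# ['new_york', 'is', 'big', 'new_york', 'is', 'far']
```

Here the bigram "new york" gets its own vector, so the standalone vectors for "new" and "york" need not absorb that phrasal meaning.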
Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation
There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; and subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the pair of target languages are related, with one being very low-resource and the other a higher-resource language. We test various methods on three artificially restricted translation tasks (English to Estonian (low-resource) and Finnish (high-resource); English to Slovak and Czech; English to Danish and Swedish) and one real-world task, Norwegian to North Sámi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, denoising autoencoders, and subword sampling.
Peer reviewed
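The subword sampling mentioned in the experiments can be sketched in a BPE-dropout flavour: each learned merge is skipped with some probability, so the same word receives different segmentations across training epochs. The merge list below is a toy, and this sketches the general regularization idea rather than the exact method used in the paper:

```python
# BPE-dropout-style subword sampling sketch with a toy learned merge list.
import random

MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merges for "lower"

def segment(word: str, dropout: float, rng: random.Random):
    """Apply BPE merges in order, randomly skipping each with probability `dropout`."""
    seq = list(word)
    for a, b in MERGES:
        if rng.random() < dropout:
            continue  # skipping a merge yields a different segmentation
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

rng = random.Random(0)
print(segment("lower", 0.0, rng))  # deterministic: ['low', 'er']
print({tuple(segment("lower", 0.5, rng)) for _ in range(20)})  # several variants
```

Exposing the model to many segmentations of the same word is what improves vocabulary coverage in low-resource settings.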