479 research outputs found

    2kenize: Tying Subword Sequences for Chinese Script Conversion

    Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have poor performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities. Comment: Accepted to ACL 2020.
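
    The disambiguation problem described above can be made concrete with a minimal sketch (not the authors' 2kenize implementation): enumerate the traditional-character candidates licensed by a simplified-to-traditional mapping table and keep the one preferred by a language-model score. The toy mapping table and placeholder scorer below are illustrative assumptions; 2kenize itself ties subword sequences and scores them with two language models rather than scoring whole strings.

```python
from itertools import product

# Toy one-to-many simplified-to-traditional mapping (illustrative assumption;
# the real table covers thousands of characters).
S2T = {
    "发": ["發", "髮"],  # "emit" vs. "hair"
    "头": ["頭"],
}

def candidates(simplified: str):
    """Enumerate every traditional-character sequence the input can map to."""
    options = [S2T.get(ch, [ch]) for ch in simplified]
    for combo in product(*options):
        yield "".join(combo)

def lm_score(sequence: str) -> float:
    """Placeholder for a language-model score over traditional-script text.
    2kenize instead scores tied subword sequences with two language models."""
    fluent_fragments = {"頭髮"}  # toy "fluent" fragment (assumption)
    return sum(1.0 for frag in fluent_fragments if frag in sequence)

def convert(simplified: str) -> str:
    """Return the candidate preferred by the (placeholder) language model."""
    return max(candidates(simplified), key=lm_score)

print(convert("头发"))  # -> 頭髮 under the toy scorer ("头发" = hair)
```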

    ν•œκ΅­μ–΄ μžμ—°μ–΄ 처리λ₯Ό μœ„ν•œ ν† ν¬λ‚˜μ΄μ €μ— λŒ€ν•œ 뢄석

    Master's thesis -- Seoul National University, Graduate School of Data Science, Department of Data Science, February 2023. Advisor: Jaejin Lee (μ΄μž¬μ§„). Although Korean tokenizers have been studied intensively, few studies reflect the grammatical characteristics of Korean, which is classified as an agglutinative language. Unlike inflectional languages such as English, where each word segment is a single word, a Korean word segment is not a single word but a combination of several words, or of a word and its corresponding grammatical elements. It is therefore unreasonable to use tokenizers developed for English in Korean NLP (Natural Language Processing). By comparing and analyzing seven tokenizers currently in mainstream use for Korean NLP (including Mecab-ko, BPE, WordPiece, and Unigram), we argue for a new Korean tokenizer that reflects these grammatical characteristics, and we summarize the characteristics such a tokenizer should reflect. Contents: 1 Introduction; 1.1 Purpose of Research; 2 Background and Related Work; 2.1 Tokenizer; 2.2 GPT-2; 2.3 Related Work; 3 Experiments; 3.1 Dataset; 3.2 Tokenizer Training; 3.3 GPT Pretraining; 3.4 GPT Finetuning; 3.5 Results; 4 Conclusion; 4.1 Analysis.
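
    As an illustration of the kind of comparison the thesis describes, the sketch below trains toy BPE and Unigram models with the sentencepiece library and segments the same Korean sentence with each. The corpus file name, vocabulary size, and example sentence are assumptions, and Mecab-ko is omitted because it requires an external morpheme dictionary.

```python
import sentencepiece as spm

# Train toy BPE and Unigram models on a small Korean corpus; the file name
# "corpus_ko.txt" and vocab_size are illustrative assumptions.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus_ko.txt",
        model_prefix=f"ko_{model_type}",
        vocab_size=4000,
        model_type=model_type,
        character_coverage=0.9995,
    )

sentence = "λ‚˜λŠ” ν•™κ΅μ—μ„œ ν•œκ΅­μ–΄λ₯Ό 배웠닀"  # "I learned Korean at school"

for model_type in ("bpe", "unigram"):
    sp = spm.SentencePieceProcessor(model_file=f"ko_{model_type}.model")
    # An eojeol such as "ν•™κ΅μ—μ„œ" (school + locative particle) should ideally
    # split at the morpheme boundary "ν•™κ΅ / μ—μ„œ"; purely statistical subword
    # models often split elsewhere, which is what the thesis compares.
    print(model_type, sp.encode(sentence, out_type=str))
```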

    From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation

    Subword tokenization has been widely successful in text-based natural language processing (NLP) tasks with Transformer-based models. As Transformer models become increasingly popular in symbolic music-related studies, it is imperative to investigate the efficacy of subword tokenization in the symbolic music domain. In this paper, we explore subword tokenization techniques, such as byte-pair encoding (BPE), in symbolic music generation and their impact on the overall structure of generated songs. Our experiments are based on three types of MIDI datasets: single-track melody only, multi-track with a single instrument, and multi-track and multi-instrument. We apply subword tokenization on top of existing musical tokenization schemes and find that it enables the generation of longer songs at the same time and improves the overall structure of the generated music in terms of objective metrics such as the structure indicator (SI) and Pitch Class Entropy. We also compare two subword tokenization methods, BPE and Unigram, and observe that both lead to consistent improvements. Our study suggests that subword tokenization is a promising technique for symbolic music generation and may have broader implications for music composition, particularly in cases involving complex data such as multi-track songs.
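
    A minimal sketch of the general idea of applying BPE on top of an existing musical tokenization: treat each MIDI event token as a symbol and greedily merge the most frequent adjacent pair into a compound token, shortening sequences so that longer songs fit into the same context window. The REMI-like event names and the tiny corpus below are assumptions, not the paper's pipeline or datasets.

```python
from collections import Counter

# Toy corpus of already-tokenized MIDI events; the REMI-like event names
# are illustrative assumptions, not the paper's exact vocabulary.
corpus = [
    ["Bar", "Pos_1", "Pitch_60", "Dur_4", "Pos_5", "Pitch_64", "Dur_4"],
    ["Bar", "Pos_1", "Pitch_60", "Dur_4", "Pos_5", "Pitch_67", "Dur_4"],
    ["Bar", "Pos_1", "Pitch_62", "Dur_2", "Pos_3", "Pitch_60", "Dur_4"],
]

def merge_pair(seq, pair, new_token):
    """Replace every adjacent occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(corpus, num_merges=3):
    """Greedy BPE over event tokens: repeatedly fuse the most frequent pair."""
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in corpus:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        new_token = "+".join(best)
        merges.append(best)
        corpus = [merge_pair(seq, best, new_token) for seq in corpus]
    return merges, corpus

merges, merged_corpus = learn_bpe(corpus)
print(merges)             # learned merges, e.g. frequent (Pitch, Dur) pairs
print(merged_corpus[0])   # shorter sequences -> longer songs per context window
```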

    Better Word Embeddings by Disentangling Contextual n-Gram Information

    Pre-trained word vectors are ubiquitous in Natural Language Processing applications. In this paper, we show how training word embeddings jointly with bigram and even trigram embeddings results in improved unigram embeddings. We claim that training word embeddings along with higher n-gram embeddings helps remove contextual information from the unigrams, resulting in better stand-alone word embeddings. We empirically show the validity of our hypothesis by outperforming other competing word representation models by a significant margin on a wide variety of tasks. We make our models publicly available. Comment: NAACL 2019.
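
    A rough sketch of the training idea, assuming a CBOW-style objective: the context is represented by averaging unigram and higher-order n-gram vectors, so the n-gram vectors can absorb contextual information and leave cleaner stand-alone unigram vectors. The dimensions, random lookup table, and scoring function below are placeholders, not the authors' released model.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # embedding size (illustrative assumption)

# Lookup table shared by unigrams, bigrams and trigrams; in real training
# these vectors are learned jointly, here they are random placeholders.
table = {}

def vec(key):
    if key not in table:
        table[key] = rng.normal(scale=0.1, size=DIM)
    return table[key]

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def context_vector(context_tokens, max_n=3):
    """CBOW-style context representation: average unigram AND higher n-gram
    vectors, so the n-gram vectors can absorb the contextual information."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(context_tokens, n))
    return np.mean([vec(f) for f in feats], axis=0)

def score(target, context_tokens):
    """Dot-product compatibility between a target word and its context."""
    return float(vec(target) @ context_vector(context_tokens))

print(score("york", ["he", "moved", "to", "new"]))
```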

    Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation

    There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; and subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the pair of target languages are related, with one being a very low-resource and the other a higher-resource language. We test various methods on three artificially restricted translation tasks: English to Estonian (low-resource) and Finnish (high-resource), English to Slovak and Czech, and English to Danish and Swedish; and on one real-world task, Norwegian to North SΓ‘mi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, denoising autoencoding, and subword sampling. Peer reviewed.
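
    Of the ingredients listed above, subword sampling is the easiest to show in isolation. The sketch below uses the sampling mode of a sentencepiece unigram model so that the same sentence is segmented differently on each pass, which acts as regularization for the translation model; the model file "en.model", the example sentence, and the hyperparameters are illustrative assumptions, and the scheduled multi-task learning and denoising components are not reproduced here.

```python
import sentencepiece as spm

# A unigram SentencePiece model trained on the source-language data;
# "en.model" and all hyperparameters below are illustrative assumptions.
sp = spm.SentencePieceProcessor(model_file="en.model")

sentence = "transfer learning helps the low-resource target language"

# Deterministic (best) segmentation, as a plain pipeline would use.
print(sp.encode(sentence, out_type=str))

# Subword sampling: a different segmentation is drawn on every call, so the
# translation model sees many segmentations of the same training sentence.
for epoch in range(3):
    sampled = sp.encode(
        sentence,
        out_type=str,
        enable_sampling=True,
        alpha=0.1,      # temperature-like smoothing of the sampling distribution
        nbest_size=-1,  # sample from the full segmentation lattice
    )
    print(epoch, sampled)
```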