1,274 research outputs found

    CHJ-WLSP : Annotation of 'Word List by Semantic Principles' Labels for the Corpus of Historical Japanese

    National Institute for Japanese Language and Linguistics / Tokyo University of Foreign Studies; Saitama University; University of Tokyo; Kyoto Prefectural University; University of Tokyo; Mejiro University; National Institute for Japanese Language and Linguistics
    This article presents a word-sense annotation for the Corpus of Historical Japanese: a mashed-up Japanese lexicon based on the 'Word List by Semantic Principles' (WLSP). The WLSP is a large-scale Japanese thesaurus that includes 98,241 entries with syntactic and hierarchical semantic categories. The historical WLSP is also compiled for the words in ancient Japanese. We utilized a morpheme-word sense alignment table to extract all possible word sense candidates for each word appearing in the target corpus. Then, we manually disambiguated the word senses for 647,751 words in the texts from the 10th century to 1910.

    Word Sense Disambiguation of Corpus of Historical Japanese Using Japanese BERT Trained with Contemporary Texts

    Tokyo University of Agriculture and Technology; Tokyo University of Agriculture and Technology; National Institute for Japanese Language and Linguistics
    https://aclanthology.org/2022.paclic-1.49/
    journal article

    Generation and Evaluation of Concept Embeddings Via Fine-Tuning Using Automatically Tagged Corpus

    Ibaraki University; National Institute for Japanese Language and Linguistics; Ibaraki University

    Diachronic Domain Adaptation of Word Sense Disambiguation in Corpus of Historical Japanese Using Word Embeddings

    Tokyo University of Agriculture and Technology; Ibaraki University; Ibaraki University
    There have been many studies on word sense disambiguation (WSD) in contemporary Japanese that use sense-tagged corpora. However, it is difficult to achieve high WSD performance in historical Japanese because few sense-tagged corpora are available. Diachronic domain adaptation using contemporary Japanese text could therefore be one solution. We investigated the effectiveness of fine-tuning word embeddings, a domain adaptation technique, for WSD in historical Japanese. A variety of fine-tuning scenarios were examined, including the case where the word embeddings of contemporary Japanese (NWJC2vec) are fine-tuned with historical Japanese and the case where word embeddings trained with historical Japanese are fine-tuned with contemporary Japanese. Moreover, when NWJC2vec was fine-tuned with a historical corpus, we also tested fine-tuning the word embeddings gradually in chronological order. The word embeddings of the two words before and the two words after the target word were used as features for a support vector machine, which served as the WSD classifier. The following three scenarios were compared: (1) all the examples from the contemporary Japanese corpus and 80% of the examples from the historical corpus are used as training data, and the remaining 20% of the historical examples are used for testing, (2) 5-fold cross-validation using only the examples from the historical Japanese corpus, and (3) all the examples from the contemporary corpus are used as training data, and all the examples from the historical corpus are used for testing. The best accuracy was achieved in the 5-fold cross-validation scenario, using word embeddings trained on a historical corpus and fine-tuned with a contemporary corpus.

    Semi-supervised learning for all-words WSD using self-learning and fine-tuning

    Design of BCCWJ-EEG : Balanced Corpus with Human Electroencephalography

    Waseda University; National Institute for Japanese Language and Linguistics
    The past decade has witnessed the happy marriage between natural language processing (NLP) and the cognitive science of language. Moreover, given the historical relationship between biological and artificial neural networks, the advent of deep learning has rekindled strong interest in the fusion of NLP and the neuroscience of language. Importantly, this cross-fertilization between NLP, on the one hand, and the cognitive (neuro)science of language, on the other, has been driven by language resources annotated with human language processing data. However, those language resources still have several limitations in their annotations, genres, languages, etc. In this paper, we describe the design of a novel language resource called BCCWJ-EEG, the Balanced Corpus of Contemporary Written Japanese (BCCWJ) experimentally annotated with human electroencephalography (EEG). Specifically, after extensively reviewing the language resources currently available in the literature with special focus on eye-tracking and EEG, we summarize the details concerning (i) participants, (ii) stimuli, (iii) procedure, (iv) data preprocessing, (v) corpus evaluation, (vi) resource release, and (vii) compilation schedule. In addition, potential applications of BCCWJ-EEG to neuroscience and NLP are discussed.

    Reading Time and Vocabulary Rating in the Japanese Language : Large-Scale Reading Time Data Collection Using Crowdsourcing

    National Institute for Japanese Language and Linguistics / Tokyo University of Foreign Studies
    This study examined the effect of differences in human vocabulary on reading time. We conducted a word familiarity survey and applied a generalised linear mixed model to the participant ratings, treating vocabulary as a random effect of the participants. The participants then took part in a self-paced reading task, and their reading times were recorded. The results clarified the effect of vocabulary differences on reading time.

    Proceedings

    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 98 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893