84 research outputs found

    An Approach toward Register Classification of Book Samples in the Balanced Corpus of Contemporary Written Japanese

    Design of BCCWJ-EEG : Balanced Corpus with Human Electroencephalography

    Waseda University; National Institute for Japanese Language and Linguistics
    The past decade has witnessed a happy marriage between natural language processing (NLP) and the cognitive science of language. Moreover, given the historical relationship between biological and artificial neural networks, the advent of deep learning has rekindled strong interest in the fusion of NLP and the neuroscience of language. Importantly, this cross-fertilization between NLP, on the one hand, and the cognitive (neuro)science of language, on the other, has been driven by language resources annotated with human language processing data. However, those language resources remain limited in terms of annotations, genres, languages, etc. In this paper, we describe the design of a novel language resource called BCCWJ-EEG, the Balanced Corpus of Contemporary Written Japanese (BCCWJ) experimentally annotated with human electroencephalography (EEG). Specifically, after extensively reviewing the language resources currently available in the literature, with a special focus on eye-tracking and EEG, we summarize the details concerning (i) participants, (ii) stimuli, (iii) procedure, (iv) data preprocessing, (v) corpus evaluation, (vi) resource release, and (vii) compilation schedule. In addition, potential applications of BCCWJ-EEG to neuroscience and NLP are discussed.

    Word Sense Disambiguation of Corpus of Historical Japanese Using Japanese BERT Trained with Contemporary Texts

    Tokyo University of Agriculture and Technology; Tokyo University of Agriculture and Technology; National Institute for Japanese Language and Linguistics
    https://aclanthology.org/2022.paclic-1.49/ (journal article)

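    Construction of Russian Translation Data for the "Balanced Corpus of Contemporary Written Japanese" and Its Potential Use in Japanese-Russian Contrastive Studies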

    The University of Tokyo; Ph.D. Student, Tokyo University of Foreign Studies
    Part of the data of the "Balanced Corpus of Contemporary Written Japanese" (BCCWJ) has already been translated into English, Italian, Indonesian, and Chinese. We have newly constructed Russian translation data for 16 samples of the BCCWJ newspaper (PN) core data. The Japanese source text totals 16,657 words counted in short unit words, and the Russian target text totals 13,070 words. The translation from Japanese into Russian was done manually by a native Russian speaker. During the translation, difficulties arose in some passages because of structural and expressive differences between Japanese and Russian. This paper describes the data construction method and the points that required attention during translation. We also manually aligned the Japanese source text with the Russian translation at the sentence level and assigned an ID to each sentence; this workflow is described as well. With translation and alignment complete, the source and target texts can be used as a simple Japanese-Russian parallel corpus, which should be useful for Japanese-Russian contrastive studies and typological research. To illustrate this potential, we take Japanese sentence-final expressions as a case study and discuss their similarities to and differences from their Russian counterparts.

    Information Structure Annotation of the "Balanced Corpus of Contemporary Written Japanese" and Its Analysis

    Ph.D. Student, Tokyo University of Foreign Studies; Center for Corpus Development, NINJAL; Research Fellow, Graduate School of Humanities, Chiba University; Adjunct Researcher, Center for Corpus Development, NINJAL
    This paper reports the results of annotating noun phrases in texts from the "Balanced Corpus of Contemporary Written Japanese" (16 samples of newspaper (PN) core data) with labels for grammatical information related to information structure: information status, commonness, definiteness, specificity, animacy, sentience, and agentivity. It focuses on an overview of the annotation and its basic statistics. Evaluating the correspondence between labels with the Kappa value shows moderate agreement (0.41 or higher) between definiteness/specificity and information status, which is based on co-reference information already annotated in previous work. In contrast, there is almost perfect agreement (0.81 or higher) between definiteness/specificity and commonness, which was newly annotated in this research. Definiteness and specificity strongly affect article selection in languages with articles, but annotating them is complex and difficult because they are concepts that lean further toward the speaker's side; estimating them from other grammatical information is therefore considered easier. The evaluation indicates that information status based on co-reference information is not sufficient on its own for estimating definiteness and specificity, and that commonness, which takes the hearer's or reader's viewpoint into account, is important. In addition, because the distinction between the Japanese particles wa and ga has been linked to information structure, we examined the relation between the labels and the particles heading each phrase. The particles ga, o, and ni mostly occur with new information, while wa occurs slightly more often with old information. The particles wa and no frequently occur with definite and specific noun phrases, while o frequently occurs with indefinite and unspecific ones.
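
The Kappa-based comparison of label sets mentioned in this abstract can be illustrated with a minimal sketch. It assumes Cohen's kappa over two label sequences for the same noun phrases, with both label sets mapped to shared codes (an assumption; the abstract does not say which kappa variant or mapping was used). The function names and the toy data are hypothetical. The agreement bands follow the commonly cited Landis and Koch scale, which matches the 0.41 and 0.81 thresholds quoted above.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (illustrative)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label in both sequences.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both sequences use a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def landis_koch_band(kappa):
    """Map a kappa value to the Landis and Koch verbal scale."""
    if kappa >= 0.81:
        return "almost perfect"
    if kappa >= 0.61:
        return "substantial"
    if kappa >= 0.41:
        return "moderate"
    if kappa >= 0.21:
        return "fair"
    if kappa >= 0.0:
        return "slight"
    return "poor"


# Hypothetical toy data: 1 = hearer-old / definite, 0 = hearer-new / indefinite.
commonness = [1, 1, 0, 0, 1, 0, 1, 1]
definiteness = [1, 1, 0, 0, 1, 0, 1, 0]
kappa = cohen_kappa(commonness, definiteness)
print(round(kappa, 2), landis_koch_band(kappa))
```

The toy labels are not taken from the corpus; real use would read the annotated labels for each noun phrase and decide explicitly how categories from the two label sets correspond.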

    Generation and Evaluation of Concept Embeddings Via Fine-Tuning Using Automatically Tagged Corpus

    Ibaraki University; National Institute for Japanese Language and Linguistics; Ibaraki University

    Speech corpora in NINJAL, Japan demonstration of corpus concordance systems : Chunagon and Kotonoha

    National Institute for Japanese Language and Linguistics
    The National Institute for Japanese Language and Linguistics, Japan (NINJAL, Japan) provides a demonstration site at the LPSS 2019 conference. This manuscript presents an overview of the demonstration of three corpora: Corpus of Spontaneous Japanese, Corpus of Everyday Japanese Conversation, and Corpus of Japanese Dialects. NINJAL also demonstrates two concordance systems. The first is "Chunagon (中納言)", a morpheme-based concordance system that was made publicly available in 2011. The second is "Kotonoha", a system released in 2018 and still under development, which enables querying multiple corpora by register type and period.

    On Reading Time and Information Structure (A Slightly Longer Version)

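    Conference: Language Resources Workshop 2016; Venue: National Institute for Japanese Language and Linguistics; Dates: March 7-8, 2017; Organizer: Center for Corpus Development, National Institute for Japanese Language and Linguistics
    In this study, we carried out a contrastive analysis of BCCWJ-EyeTrack, the sentence reading-time data annotated on the "Balanced Corpus of Contemporary Written Japanese", and information structure annotation data such as the definiteness of noun phrases. Analyzing data from 24 native speakers of Japanese with linear mixed models, we found that specificity, sentience, and commonness affect sentence reading time, and that each causes a different pattern of reading-time delay. For commonness in particular, new information (hearer-new) and inferable information (bridging) differed at a distinguishable level. This suggests that it may be possible to estimate from reading-time data whether a noun phrase is new or inferable for the language recipient, which could be applied to tasks such as user adaptation in document summarization.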