38 research outputs found

    An Approach toward Register Classification of Book Samples in the Balanced Corpus of Contemporary Written Japanese


    Design of BCCWJ-EEG : Balanced Corpus with Human Electroencephalography

    Waseda University; National Institute for Japanese Language and Linguistics
    The past decade has witnessed a happy marriage between natural language processing (NLP) and the cognitive science of language. Moreover, given the historical relationship between biological and artificial neural networks, the advent of deep learning has rekindled strong interest in the fusion of NLP and the neuroscience of language. Importantly, this cross-fertilization between NLP, on the one hand, and the cognitive (neuro)science of language, on the other, has been driven by language resources annotated with human language processing data. However, those language resources remain limited in several respects, including annotations, genres, and languages. In this paper, we describe the design of a novel language resource called BCCWJ-EEG, the Balanced Corpus of Contemporary Written Japanese (BCCWJ) experimentally annotated with human electroencephalography (EEG). Specifically, after extensively reviewing the language resources currently available in the literature, with a special focus on eye-tracking and EEG, we summarize the details concerning (i) participants, (ii) stimuli, (iii) procedure, (iv) data preprocessing, (v) corpus evaluation, (vi) resource release, and (vii) compilation schedule. Potential applications of BCCWJ-EEG to neuroscience and NLP are also discussed.

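    Constructing a Database of Japanese Lexical Properties: An Overview of Its Basic Framework and Initial Core Components (日本語語彙特性のデータベースの構築―その基礎枠組み及び主要中核要素の概観―)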

    To conduct meaningful research into all aspects of language, it is essential for language science and cognitive science researchers to have practical access to an increasingly wide range of detailed and contemporary information about their target languages. Against that background, this paper presents a short overview of an ongoing project to construct a large-scale database of Japanese lexical properties (JLP). More specifically, after outlining the concurrent construction of the ontology of Japanese lexical properties (JLP-O; Joyce & Hodošček, 2014), which provides the basic guiding framework for the JLP database construction project, the paper outlines the initial core components of the JLP database, with particular emphasis on two of those components: a database of semantic transparency (ST) ratings for approximately 10,000 two-kanji compound words, and some initial results for the extraction and automatic analysis of the word structures of both three- and four-kanji compound words.

    Speech corpora in NINJAL, Japan demonstration of corpus concordance systems : Chunagon and Kotonoha

    National Institute for Japanese Language and Linguistics
    The National Institute for Japanese Language and Linguistics, Japan (NINJAL, Japan) provides a demonstration site at the LPSS 2019 conference. This manuscript presents an overview of the demonstration of three corpora: the Corpus of Spontaneous Japanese, the Corpus of Everyday Japanese Conversation, and the Corpus of Japanese Dialects. NINJAL also demonstrates two concordance systems. The first is "Chunagon (中納言)", a morpheme-based concordance system that was made publicly available in 2011. The second is "Kotonoha", a system released in 2018 and still under development, which enables queries across multiple corpora by register type and period.

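    Information Structure Annotation of the "Balanced Corpus of Contemporary Written Japanese" and Its Analysis (『現代日本語書き言葉均衡コーパス』への情報構造アノテーションとその分析)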

    Ph.D. Student, Tokyo University of Foreign Studies; Center for Corpus Development, NINJAL; Research Fellow, Graduate School of Humanities, Chiba University; Adjunct Researcher, Center for Corpus Development, NINJAL
    This paper reports the results of annotating noun phrases in texts from the "Balanced Corpus of Contemporary Written Japanese" (16 newspaper (PN) core-data samples) with labels for grammatical information related to information structure: information status, commonness, definiteness, specificity, animacy, sentience, and agentivity. It describes the annotation schema and basic statistics. Evaluating the correspondence between labels with Kappa values shows moderate agreement (0.41 or higher) between definiteness/specificity and the information status derived from co-reference information annotated in previous work, and almost perfect agreement (0.81 or higher) between definiteness/specificity and commonness, which is newly annotated in this research. Definiteness and specificity strongly affect article selection in languages with articles, but annotating them is complex and difficult because they depend more heavily on the speaker's perspective, so estimating them from other grammatical information is considered easier. The evaluation indicates that information status based on co-reference alone is not sufficient for estimating definiteness and specificity, and that commonness, which takes the hearer's or reader's perspective into account, is important. In addition, because information structure has been linked to the choice between the particles wa and ga in Japanese, we examine the relation between particles and the labels: ga, o, and ni mostly occur with discourse-new noun phrases, wa occurs slightly more often with discourse-old ones, wa and no mostly occur with definite and specific noun phrases, and o mostly occurs with indefinite and non-specific ones.
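The Kappa-based evaluation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the statistic is Cohen's kappa and that the 0.41 and 0.81 thresholds follow the commonly used Landis and Koch bands; the abstract states neither explicitly. The function names and the toy label sequences are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items on which the two labelings coincide.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both labelings use one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Map a kappa value to a verbal band (assumed Landis-Koch convention)."""
    if kappa >= 0.81:
        return "almost perfect"
    if kappa >= 0.61:
        return "substantial"
    if kappa >= 0.41:
        return "moderate"
    if kappa >= 0.21:
        return "fair"
    if kappa >= 0.0:
        return "slight"
    return "poor"


# Hypothetical toy data: two label layers recoded to a shared 0/1 coding,
# e.g. 1 = "shared / definite", 0 = "not shared / indefinite".
layer_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
layer_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(layer_a, layer_b)
print(round(kappa, 3), agreement_band(kappa))
```

Comparing two different label types in this way requires first recoding them onto a common label set; how the paper performs that mapping is not stated in the abstract, so the binary recoding here is only an assumption.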

    Text Simplification without Simplified Corpora (平易なコーパスを用いないテキスト平易化)

    Tokyo Metropolitan University, 2018-03-25, Doctor of Engineering, 首都大学東

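    An Analysis of Information Structure Annotation of the "Balanced Corpus of Contemporary Written Japanese" (『現代日本語書き言葉均衡コーパス』への情報構造アノテーションの分析)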

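    Conference: Language Resources Workshop 2016 (言語資源活用ワークショップ2016); Venue: National Institute for Japanese Language and Linguistics; Dates: March 7-8, 2017; Organizer: Center for Corpus Development, National Institute for Japanese Language and Linguistics
    Japanese is a language without articles. Translation from Japanese into a language that has articles therefore raises the problem of article selection, whether the translation is done by humans or by machines. Article selection is influenced by the information structure (definiteness, specificity, etc.) of noun phrases in the source language. To mitigate the article selection problem in translation, this paper reports the results of annotating noun phrases in texts from the "Balanced Corpus of Contemporary Written Japanese" (16 newspaper (PN) core-data samples) with tags for grammatical information related to information structure. In particular, it describes the outline of the annotation and basic statistics.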

    An experimental framework for designing document structure for users' decision making -- An empirical study of recipes

    Textual documents need to be of good quality to ensure effective asynchronous communication in remote settings, especially during the COVID-19 pandemic. However, defining a preferred document structure (content and arrangement) for improving lay readers' decision-making is challenging. First, the types of content useful to various readers cannot be determined simply by gathering expert knowledge. Second, methodologies for evaluating a document's usefulness from the user's perspective have not been established. This study proposed an experimental framework to identify useful contents of documents by aggregating lay readers' insights. The study used 200 online recipes as research subjects and recruited 1,340 amateur cooks as lay readers. The proposed framework identified six useful contents of recipes. Multi-level modeling then showed that, among the six identified contents, suitable ingredients or notes arranged with a subheading at the end of each cooking step significantly increased recipes' usefulness. Our framework contributes to communication design via documents.