Word Familiarity Rate Estimation Using a Bayesian Linear Mixed Model
National Institute for Japanese Language and Linguistics. This paper presents research on word familiarity rate estimation using the 'Word List by Semantic Principles'. We collected rating information on 96,557 words in the 'Word List by Semantic Principles' via Yahoo! crowdsourcing. We asked 3,392 participants to use their introspection to rate the familiarity of words from the five perspectives of 'KNOW', 'WRITE', 'READ', 'SPEAK', and 'LISTEN', and each word was rated by at least 16 participants. We used Bayesian linear mixed models to estimate the word familiarity rates. We also explored the ratings with the semantic labels used in the 'Word List by Semantic Principles'.
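The pooling behaviour of such a mixed model, where per-word ratings are shrunk toward the overall average, can be illustrated with a small stdlib-only sketch; the words, ratings, and prior weight below are invented for illustration and are not taken from the survey:

```python
from statistics import mean

# Hypothetical toy ratings: each word rated by several participants on a
# 1-5 familiarity scale (words and values are illustrative only).
ratings = {
    "taiyou":  [5, 5, 4, 5, 5],   # highly familiar
    "sougi":   [2, 3, 2, 1, 2],   # less familiar
    "kiretsu": [3, 4, 3, 3, 4],
}

def shrunken_means(ratings, prior_weight=4.0):
    """Shrink each word's mean rating toward the grand mean.

    This mimics the partial pooling a (Bayesian) linear mixed model
    performs when word is treated as a random effect: words with few
    or noisy ratings are pulled toward the overall average.
    """
    grand = mean(r for rs in ratings.values() for r in rs)
    out = {}
    for word, rs in ratings.items():
        n = len(rs)
        out[word] = (sum(rs) + prior_weight * grand) / (n + prior_weight)
    return out

est = shrunken_means(ratings)
for word, value in est.items():
    print(f"{word}: {value:.2f}")
```

The estimated rates preserve the ordering of the raw means but sit closer to the grand mean, which is what makes mixed-model estimates robust for words with few raters.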
Reading Time and Vocabulary Rating in the Japanese Language: Large-Scale Reading Time Data Collection Using Crowdsourcing
National Institute for Japanese Language and Linguistics / Tokyo University of Foreign Studies. This study examined the effect of differences in human vocabulary on reading time. We conducted a word familiarity survey and applied a generalised linear mixed model to the participant ratings, treating vocabulary as a random effect of the participants. The participants then took part in a self-paced reading task, and their reading times were recorded. The results clarified the effect of vocabulary differences on reading time.
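The core of such an analysis, estimating a familiarity effect on reading time while absorbing each participant's individual baseline, can be sketched with the stdlib alone; the data points below are invented and this demeaning step is only a crude stand-in for the paper's generalised linear mixed model:

```python
from statistics import mean

# Hypothetical self-paced reading data:
# (participant, word_familiarity, reading_time_ms). Values are illustrative.
data = [
    ("p1", 5.0, 310), ("p1", 2.0, 420), ("p1", 3.5, 360),
    ("p2", 5.0, 450), ("p2", 2.0, 560), ("p2", 3.5, 500),
]

def familiarity_slope(data):
    """Estimate the familiarity effect on reading time after removing
    per-participant intercepts (a rough analogue of treating the
    participant as a random effect)."""
    by_participant = {}
    for p, x, y in data:
        by_participant.setdefault(p, []).append((x, y))
    xs, ys = [], []
    for rows in by_participant.values():
        mx = mean(x for x, _ in rows)
        my = mean(y for _, y in rows)
        for x, y in rows:            # demean within each participant
            xs.append(x - mx)
            ys.append(y - my)
    # Ordinary least-squares slope on the demeaned data.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

slope = familiarity_slope(data)
print(f"reading-time change per familiarity point: {slope:.1f} ms")
```

In this toy data the slope comes out negative, matching the intuition that more familiar words are read faster, even though the two participants have very different baseline speeds.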
Word Sense Disambiguation of Corpus of Historical Japanese Using Japanese BERT Trained with Contemporary Texts
Tokyo University of Agriculture and Technology / National Institute for Japanese Language and Linguistics. https://aclanthology.org/2022.paclic-1.49/ (journal article)
UD_Japanese-CEJC: Dependency Relation Annotation on Corpus of Everyday Japanese Conversation
Conference name: the 24th Meeting of the Special Interest Group on Discourse and Dialogue; Conference place: Prague, Czechia; Session period: 2023/09/11-15; Organizer: Association for Computational Linguistics. National Institute for Japanese Language and Linguistics / Tohoku University / Megagon Labs, Tokyo, Recruit Co., Ltd. In this study, we have developed Universal Dependencies (UD) resources for spoken Japanese in the Corpus of Everyday Japanese Conversation (CEJC). The CEJC is a large corpus of spoken language that encompasses various everyday conversations in Japanese and includes word delimitation and part-of-speech annotation. We have newly annotated Long Word Unit delimitation and Bunsetsu (Japanese phrase)-based dependencies, including Bunsetsu boundaries, for the CEJC. The UD resources for Japanese were constructed in accordance with hand-maintained conversion rules from the CEJC, with two types of word delimitation, part-of-speech tags, and Bunsetsu-based syntactic dependency relations. Furthermore, we examined various issues pertaining to the construction of UD for the CEJC by comparing it with a written Japanese corpus and evaluating UD parsing accuracy. (conference paper)
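UD resources such as this one are distributed in the 10-column CoNLL-U format, where each token row records its head index and dependency relation; a minimal stdlib-only reader might look like the following (the two-token Japanese fragment is invented for illustration and is not from the CEJC):

```python
# A hypothetical two-token CoNLL-U fragment ("the cat sleeps"-style);
# tokens and labels are made up for illustration.
conllu = (
    "1\t猫\t猫\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\t寝る\t寝る\tVERB\t_\t_\t0\troot\t_\t_\n"
)

def parse_conllu(text):
    """Parse 10-column CoNLL-U rows into simple token dicts."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),   # 0 marks the sentence root
            "deprel": cols[7],
        })
    return tokens

for tok in parse_conllu(conllu):
    print(tok["form"], tok["deprel"], "-> head", tok["head"])
```

A real reader would also handle multiword-token ranges and sentence-level comment lines, which this sketch skips.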
Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning
Temporal relation classification is a pairwise task for identifying the relation of a temporal link (TLINK) between two mentions, i.e., events, times, and the document creation time (DCT). This setup leads to two crucial limitations: (1) two TLINKs involving a common mention do not share information, and (2) existing models with independent classifiers for each TLINK category (E2E, E2T, and E2D) cannot make use of the whole data. This paper presents an event-centric model that manages dynamic event representations across multiple TLINKs. Our model handles the three TLINK categories with multi-task learning to leverage the full data. The experimental results show that our proposal outperforms state-of-the-art models and two transfer-learning baselines on both the English and Japanese data. (Comment: EMNLP 2020 Findings)
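The event-centric idea, one shared representation per event that every TLINK touching that event reads and updates, can be caricatured in a few lines; the update rule, vectors, and numbers below are illustrative stand-ins, not the paper's model:

```python
# Shared representation per event: information from one TLINK becomes
# visible to later TLINKs involving the same event.
events = {"e1": [0.0, 0.0], "e2": [0.0, 0.0]}

def classify_and_update(event_id, context_vec, lr=0.5):
    """Toy 'classifier' step: mix the TLINK context into the shared
    event representation, so subsequent TLINKs see the updated state."""
    rep = events[event_id]
    events[event_id] = [(1 - lr) * r + lr * c
                        for r, c in zip(rep, context_vec)]
    return events[event_id]

# Three TLINK categories touching the same event e1 in sequence:
classify_and_update("e1", [1.0, 0.0])   # E2E link context
classify_and_update("e1", [0.0, 1.0])   # E2T link context
classify_and_update("e1", [1.0, 1.0])   # E2D (DCT) link context
print(events["e1"])
```

The point of the sketch is only the data flow: because all three categories write into the same entry of `events`, training them jointly (multi-task) lets each category benefit from the others' evidence, which independent per-category classifiers cannot do.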
Design of BCCWJ-EEG: Balanced Corpus with Human Electroencephalography
Waseda University / National Institute for Japanese Language and Linguistics. The past decade has witnessed the happy marriage between natural language processing (NLP) and the cognitive science of language. Moreover, given the historical relationship between biological and artificial neural networks, the advent of deep learning has re-sparked strong interest in the fusion of NLP and the neuroscience of language. Importantly, this cross-fertilization between NLP, on the one hand, and the cognitive (neuro)science of language, on the other, has been driven by language resources annotated with human language processing data. However, those language resources still have several limitations regarding annotations, genres, languages, etc. In this paper, we describe the design of a novel language resource called BCCWJ-EEG: the Balanced Corpus of Contemporary Written Japanese (BCCWJ) experimentally annotated with human electroencephalography (EEG). Specifically, after extensively reviewing the language resources currently available in the literature, with special focus on eye-tracking and EEG, we summarize the details concerning (i) participants, (ii) stimuli, (iii) procedure, (iv) data preprocessing, (v) corpus evaluation, (vi) resource release, and (vii) the compilation schedule. In addition, potential applications of BCCWJ-EEG to neuroscience and NLP are also discussed.
Coreference based event-argument relation extraction on biomedical text
This paper presents a new approach to exploiting coreference information for extracting event-argument (E-A) relations from biomedical documents. This approach has two advantages: (1) it can extract a large number of valuable E-A relations based on the concept of salience in discourse; (2) it enables us to identify E-A relations over sentence boundaries (cross-links) using the transitivity of coreference relations. We propose two coreference-based models: a pipeline based on Support Vector Machine (SVM) classifiers, and a joint Markov Logic Network (MLN). We show the effectiveness of these models on a biomedical event corpus. Both models outperform systems that do not use coreference information. When the two proposed models are compared to each other, the joint MLN outperforms the pipeline SVM with gold coreference information.
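The transitivity argument in (2) can be sketched with a small union-find over coreference chains: if event E takes argument A, and A corefers with a mention A' in another sentence, then E-A' becomes a candidate cross-link. The mentions and relations below are invented for illustration, not drawn from the corpus:

```python
def find(parent, x):
    """Find the representative of x's coreference chain."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the chains containing a and b."""
    parent[find(parent, a)] = find(parent, b)

# Hypothetical coreferent mentions across sentences:
mentions = ["IL-2", "it", "the cytokine"]
parent = {m: m for m in mentions}
union(parent, "it", "IL-2")
union(parent, "the cytokine", "IL-2")

# A direct E-A relation found within one sentence:
direct = [("activates", "it")]

# Propagate the relation to every mention in the same chain,
# yielding cross-sentence E-A candidates.
propagated = set()
for event, arg in direct:
    chain = find(parent, arg)
    for m in mentions:
        if find(parent, m) == chain:
            propagated.add((event, m))

print(sorted(propagated))
```

A single in-sentence link ("activates", "it") thus yields cross-sentence candidates for every coreferent mention, which is how transitivity recovers cross-links that a sentence-local extractor would miss.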