Word Familiarity Rate Estimation Using a Bayesian Linear Mixed Model
National Institute for Japanese Language and Linguistics. This paper presents research on word familiarity rate estimation using the 'Word List by Semantic Principles'. We collected rating information on 96,557 words in the 'Word List by Semantic Principles' via Yahoo! crowdsourcing. We asked 3,392 participants to use their introspection to rate the familiarity of words from the five perspectives of 'KNOW', 'WRITE', 'READ', 'SPEAK', and 'LISTEN', and each word was rated by at least 16 participants. We used Bayesian linear mixed models to estimate the word familiarity rates. We also explored the ratings with the semantic labels used in the 'Word List by Semantic Principles'.
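The pooling behaviour of such a mixed model, where per-word ratings are shrunk toward the overall average, can be illustrated with a small stdlib-only sketch; the words, ratings, and prior weight below are invented for illustration and are not taken from the survey:

```python
from statistics import mean

# Hypothetical toy ratings: each word rated by several participants on a
# 1-5 familiarity scale (words and values are illustrative only).
ratings = {
    "taiyou":  [5, 5, 4, 5, 5],   # highly familiar
    "sougi":   [2, 3, 2, 1, 2],   # less familiar
    "kiretsu": [3, 4, 3, 3, 4],
}

def shrunken_means(ratings, prior_weight=4.0):
    """Shrink each word's mean rating toward the grand mean.

    This mimics the partial pooling a (Bayesian) linear mixed model
    performs when word is treated as a random effect: words with few
    or noisy ratings are pulled toward the overall average.
    """
    grand = mean(r for rs in ratings.values() for r in rs)
    out = {}
    for word, rs in ratings.items():
        n = len(rs)
        out[word] = (sum(rs) + prior_weight * grand) / (n + prior_weight)
    return out

est = shrunken_means(ratings)
for word, value in est.items():
    print(f"{word}: {value:.2f}")
```

The estimated rates preserve the ordering of the raw means but sit closer to the grand mean, which is what makes mixed-model estimates robust for words with few raters.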
Reading Time and Vocabulary Rating in the Japanese Language: Large-Scale Reading Time Data Collection Using Crowdsourcing
National Institute for Japanese Language and Linguistics / Tokyo University of Foreign Studies. This study examined the effect of differences in human vocabulary on reading time. We conducted a word familiarity survey and applied a generalised linear mixed model to the participant ratings, treating vocabulary as a random effect of the participants. The participants then took part in a self-paced reading task, and their reading times were recorded. The results clarified the effect of vocabulary differences on reading time.
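The core of such an analysis, estimating a familiarity effect on reading time while absorbing each participant's individual baseline, can be sketched with the stdlib alone; the data points below are invented and this demeaning step is only a crude stand-in for the paper's generalised linear mixed model:

```python
from statistics import mean

# Hypothetical self-paced reading data:
# (participant, word_familiarity, reading_time_ms). Values are illustrative.
data = [
    ("p1", 5.0, 310), ("p1", 2.0, 420), ("p1", 3.5, 360),
    ("p2", 5.0, 450), ("p2", 2.0, 560), ("p2", 3.5, 500),
]

def familiarity_slope(data):
    """Estimate the familiarity effect on reading time after removing
    per-participant intercepts (a rough analogue of treating the
    participant as a random effect)."""
    by_participant = {}
    for p, x, y in data:
        by_participant.setdefault(p, []).append((x, y))
    xs, ys = [], []
    for rows in by_participant.values():
        mx = mean(x for x, _ in rows)
        my = mean(y for _, y in rows)
        for x, y in rows:            # demean within each participant
            xs.append(x - mx)
            ys.append(y - my)
    # Ordinary least-squares slope on the demeaned data.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

slope = familiarity_slope(data)
print(f"reading-time change per familiarity point: {slope:.1f} ms")
```

In this toy data the slope comes out negative, matching the intuition that more familiar words are read faster, even though the two participants have very different baseline speeds.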
Word Sense Disambiguation of Corpus of Historical Japanese Using Japanese BERT Trained with Contemporary Texts
Tokyo University of Agriculture and Technology / National Institute for Japanese Language and Linguistics. https://aclanthology.org/2022.paclic-1.49/ (journal article)
UD_Japanese-CEJC: Dependency Relation Annotation on Corpus of Everyday Japanese Conversation
Conference name: the 24th Meeting of the Special Interest Group on Discourse and Dialogue; Conference place: Prague, Czechia; Session period: 2023/09/11-15; Organizer: Association for Computational Linguistics. National Institute for Japanese Language and Linguistics / Tohoku University / Megagon Labs, Tokyo, Recruit Co., Ltd. In this study, we have developed Universal Dependencies (UD) resources for spoken Japanese in the Corpus of Everyday Japanese Conversation (CEJC). The CEJC is a large corpus of spoken language that encompasses various everyday conversations in Japanese and includes word delimitation and part-of-speech annotation. We have newly annotated Long Word Unit delimitation and Bunsetsu (Japanese phrase)-based dependencies, including Bunsetsu boundaries, for the CEJC. The UD resources for Japanese were constructed in accordance with hand-maintained conversion rules from the CEJC, with two types of word delimitation, part-of-speech tags, and Bunsetsu-based syntactic dependency relations. Furthermore, we examined various issues pertaining to the construction of UD for the CEJC by comparing it with a written Japanese corpus and evaluating UD parsing accuracy. (conference paper)
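UD resources such as this one are distributed in the 10-column CoNLL-U format, where each token row records its head index and dependency relation; a minimal stdlib-only reader might look like the following (the two-token Japanese fragment is invented for illustration and is not from the CEJC):

```python
# A hypothetical two-token CoNLL-U fragment ("the cat sleeps"-style);
# tokens and labels are made up for illustration.
conllu = (
    "1\t猫\t猫\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\t寝る\t寝る\tVERB\t_\t_\t0\troot\t_\t_\n"
)

def parse_conllu(text):
    """Parse 10-column CoNLL-U rows into simple token dicts."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),   # 0 marks the sentence root
            "deprel": cols[7],
        })
    return tokens

for tok in parse_conllu(conllu):
    print(tok["form"], tok["deprel"], "-> head", tok["head"])
```

A real reader would also handle multiword-token ranges and sentence-level comment lines, which this sketch skips.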
Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning
Temporal relation classification is a pairwise task for identifying the relation of a temporal link (TLINK) between two mentions, i.e., events, times, and the document creation time (DCT). This setup leads to two crucial limitations: (1) two TLINKs involving a common mention do not share information, and (2) existing models with independent classifiers for each TLINK category (E2E, E2T, and E2D) cannot make use of the whole data. This paper presents an event-centric model that manages dynamic event representations across multiple TLINKs. Our model handles the three TLINK categories with multi-task learning to leverage the full data. The experimental results show that our proposal outperforms state-of-the-art models and two transfer-learning baselines on both the English and Japanese data. (Comment: EMNLP 2020 Findings)
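The event-centric idea, one shared representation per event that every TLINK touching that event reads and updates, can be caricatured in a few lines; the update rule, vectors, and numbers below are illustrative stand-ins, not the paper's model:

```python
# Shared representation per event: information from one TLINK becomes
# visible to later TLINKs involving the same event.
events = {"e1": [0.0, 0.0], "e2": [0.0, 0.0]}

def classify_and_update(event_id, context_vec, lr=0.5):
    """Toy 'classifier' step: mix the TLINK context into the shared
    event representation, so subsequent TLINKs see the updated state."""
    rep = events[event_id]
    events[event_id] = [(1 - lr) * r + lr * c
                        for r, c in zip(rep, context_vec)]
    return events[event_id]

# Three TLINK categories touching the same event e1 in sequence:
classify_and_update("e1", [1.0, 0.0])   # E2E link context
classify_and_update("e1", [0.0, 1.0])   # E2T link context
classify_and_update("e1", [1.0, 1.0])   # E2D (DCT) link context
print(events["e1"])
```

The point of the sketch is only the data flow: because all three categories write into the same entry of `events`, training them jointly (multi-task) lets each category benefit from the others' evidence, which independent per-category classifiers cannot do.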
Design of BCCWJ-EEG: Balanced Corpus with Human Electroencephalography
Waseda University / National Institute for Japanese Language and Linguistics. The past decade has witnessed the happy marriage between natural language processing (NLP) and the cognitive science of language. Moreover, given the historical relationship between biological and artificial neural networks, the advent of deep learning has re-sparked strong interest in the fusion of NLP and the neuroscience of language. Importantly, this cross-fertilization between NLP, on the one hand, and the cognitive (neuro)science of language, on the other, has been driven by language resources annotated with human language processing data. However, those language resources still have several limitations regarding annotations, genres, languages, etc. In this paper, we describe the design of a novel language resource called BCCWJ-EEG: the Balanced Corpus of Contemporary Written Japanese (BCCWJ) experimentally annotated with human electroencephalography (EEG). Specifically, after extensively reviewing the language resources currently available in the literature, with special focus on eye-tracking and EEG, we summarize the details concerning (i) participants, (ii) stimuli, (iii) procedure, (iv) data preprocessing, (v) corpus evaluation, (vi) resource release, and (vii) the compilation schedule. In addition, potential applications of BCCWJ-EEG to neuroscience and NLP are also discussed.
Coreference based event-argument relation extraction on biomedical text
This paper presents a new approach to exploiting coreference information for extracting event-argument (E-A) relations from biomedical documents. This approach has two advantages: (1) it can extract a large number of valuable E-A relations based on the concept of salience in discourse; (2) it enables us to identify E-A relations over sentence boundaries (cross-links) using the transitivity of coreference relations. We propose two coreference-based models: a pipeline based on Support Vector Machine (SVM) classifiers, and a joint Markov Logic Network (MLN). We show the effectiveness of these models on a biomedical event corpus. Both models outperform systems that do not use coreference information. When the two proposed models are compared to each other, the joint MLN outperforms the pipeline SVM with gold coreference information.
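The transitivity argument in (2) can be sketched with a small union-find over coreference chains: if event E takes argument A, and A corefers with a mention A' in another sentence, then E-A' becomes a candidate cross-link. The mentions and relations below are invented for illustration, not drawn from the corpus:

```python
def find(parent, x):
    """Find the representative of x's coreference chain."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the chains containing a and b."""
    parent[find(parent, a)] = find(parent, b)

# Hypothetical coreferent mentions across sentences:
mentions = ["IL-2", "it", "the cytokine"]
parent = {m: m for m in mentions}
union(parent, "it", "IL-2")
union(parent, "the cytokine", "IL-2")

# A direct E-A relation found within one sentence:
direct = [("activates", "it")]

# Propagate the relation to every mention in the same chain,
# yielding cross-sentence E-A candidates.
propagated = set()
for event, arg in direct:
    chain = find(parent, arg)
    for m in mentions:
        if find(parent, m) == chain:
            propagated.add((event, m))

print(sorted(propagated))
```

A single in-sentence link ("activates", "it") thus yields cross-sentence candidates for every coreferent mention, which is how transitivity recovers cross-links that a sentence-local extractor would miss.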