74 research outputs found

    Language Resources for Japanese Corpus Linguistics: Development and Applications of an Electronic Dictionary for Morphological Analysis

    Chiba University; The National Institute for Japanese Language; The National Institute for Japanese Language; ASTEM; The University of Tokyo; National Institute of Information and Communications Technology; The National Institute for Japanese Language
    In this paper, we describe the design and implementation of UniDic, an electronic dictionary for morphological analysis aimed particularly at applications in Japanese corpus linguistics. Building a large-scale corpus requires an automatic morphological analyzer. Existing dictionaries for morphological analyzers, however, cause problems when used in corpus linguistics: word units are defined unevenly (sometimes long, sometimes short), and allomorphs and orthographic variants are not assigned a common index. Our dictionary, in contrast, addresses the uniformity of units and the identity of indexes, which are important requirements for linguistic analysis of corpora. We adopt a multi-level definition of word units, consisting of short-, middle-, and long-unit words, and a structured representation of indexes, composed of lemma, word form, orthography, and pronunciation. We develop a database system that straightforwardly implements this design, together with a user-friendly interface that lets dictionary builders search and register entries while keeping track of the complex structure of the indexes. We also show how this structured representation helps in analyzing morphologically annotated corpora, presenting case studies that investigate variation in word form in a spoken-language corpus and variation in orthography in a written-language corpus.
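    The four-level index described in this abstract (lemma, word form, orthography, pronunciation) can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions: the `Entry` class, the helper functions, and the sample entries are hypothetical and are not taken from the dictionary's actual schema or data. The sketch shows how resolving written variants to a shared lemma makes it possible to count word-form variation in a tokenized corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One surface variant, indexed at four levels (hypothetical layout)."""
    lemma: str          # headword shared by all variants
    word_form: str      # phonological variant of the lemma
    orthography: str    # written variant of the word form
    pronunciation: str  # spoken realization


# Illustrative entries only; not copied from the actual dictionary.
ENTRIES = [
    Entry("矢張り", "ヤハリ", "やはり", "ヤハリ"),
    Entry("矢張り", "ヤハリ", "矢張り", "ヤハリ"),
    Entry("矢張り", "ヤッパリ", "やっぱり", "ヤッパリ"),
    Entry("矢張り", "ヤッパ", "やっぱ", "ヤッパ"),
]


def build_index(entries):
    """Map each written form to its entry so variants resolve to one lemma."""
    return {e.orthography: e for e in entries}


def count_by_word_form(tokens, index):
    """Count word-form variants per lemma for tokens found in the index."""
    counts = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        entry = index.get(token)
        if entry is not None:
            counts[entry.lemma][entry.word_form] += 1
    return {lemma: dict(forms) for lemma, forms in counts.items()}


index = build_index(ENTRIES)
tokens = ["やっぱり", "やはり", "やっぱ", "矢張り", "やっぱり"]
print(count_by_word_form(tokens, index))
```

    Because every written variant points back to one lemma, the counts group by lemma first and by word form second, which is the kind of variation analysis the abstract mentions. A real dictionary would also need to disambiguate written forms shared by several lemmas, which this sketch ignores.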

    Maximum Entropy Models for Japanese Text Analysis and Generation

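    Kyoto University (0048). New-system doctorate by dissertation, Doctor of Informatics. Degree no. Otsu 11479; Ron-Jo-Haku no. 50; call number 新制||情||28 (University Library); UT51-2004-G974. Examiners: Prof. Takashi Matsuyama (chair), Prof. Tatsuya Kawahara, Assoc. Prof. Satoshi Sato. Conferred under Article 4, Paragraph 2 of the Degree Regulations.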

    SENSEVAL-2 Japanese Translation Task

    This paper describes the Senseval-2 Japanese translation task. In this task, word senses are defined according to distinct translations in a given target language. A translation memory (TM) was constructed that contains, for each Japanese head word, a list of typical Japanese expressions and their English translations. For each test word instance, participants were required to submit either the TM record best approximating that usage or actual target word translations. There were 9 system entries from a total of 7 organizations.

    Morphological Annotation of a Large Spontaneous Speech Corpus in Japanese

    We propose an efficient framework for human-aided morphological annotation of a large spontaneous speech corpus such as the Corpus of Spontaneous Japanese. In this framework, even when word units have several definitions in a given corpus and not all words are found in a dictionary or in a training corpus, we can morphologically analyze the corpus with high accuracy and low labor costs by detecting words not found in the dictionary and adding them to it. We can further reduce labor costs by expanding the training corpora through active learning.

    Enhancing the ELP for the 21st Century

    The Council of Europe’s European Language Portfolio (ELP) was originally designed as a paper-based document in three distinct parts (Passport, Biography, Dossier). In the past few years there have been some attempts to produce an electronic version of the original paper model; however, very little has been done to explore the full potential of the ELP in a digital environment. The project presented in this paper is based on the development and use of an electronic ELP by adult distance-learning students at the Open University (OU) in the United Kingdom. The paper reflects on two main issues concerning the current format of the ELP and its potential in the digital era, based on technical appropriateness and an analysis of learners’ needs and interests. First, it assesses the suitability of the original ELP template’s structure and navigation in a digital environment, and reports on the technical challenges of recreating it in an electronic format, as well as on the solutions that had to be found for a coherent electronic design. Second, it suggests how the content of the ELP could be enhanced and expanded by taking advantage of the technology. The project at the OU piloted an additional section on learning styles within the Biography and also provided a link to the newly developed Autobiography of Intercultural Encounters (AIE). The conclusions of the study suggest that the potential of the ELP as a learning guide, self-reflection instrument and self-assessment tool can be significantly enhanced by using a virtual environment instead of a paper format.