81 research outputs found

    Syllable Segmentation, Normalization, and Lexicographic Sorting of Myanmar-Language Text Using Formal Methods

    Nagaoka University of Technology

    Methods for Efficient Ontology Lexicalization for Non-Indo-European Languages: The Case of Japanese

    Lanser B. Methods for Efficient Ontology Lexicalization for Non-Indo-European Languages: The Case of Japanese. Bielefeld: Universität Bielefeld; 2017. In order to make the growing amount of conceptual knowledge available through ontologies and datasets accessible to humans, NLP applications need access to information on how this knowledge can be verbalized in natural language. One way to provide this kind of information is through ontology lexicons, which, apart from the actual verbalizations in a given target language, can provide further rich linguistic information about them. Compiling such lexicons manually is a very time-consuming task that requires expertise both in Semantic Web technologies and in lexicon engineering, as well as a very good knowledge of the target language at hand. In this thesis we present two alternative approaches to generating ontology lexicons: by means of crowdsourcing on the one hand, and through the framework M-ATOLL on the other. So far, M-ATOLL has been used with a number of Indo-European languages that share a large set of common characteristics. Another focus of this work is therefore the generation of ontology lexicons specifically for non-Indo-European languages. To explore these two topics, we use both approaches to generate Japanese ontology lexicons for the DBpedia ontology. First, we use CrowdFlower to generate a small Japanese ontology lexicon for ten exemplary ontology elements according to a two-stage workflow, the main underlying idea of which is to turn the task of generating lexicon entries into a translation task; the starting point of this translation task is a manually created English lexicon for DBpedia. Next, we adapt M-ATOLL's corpus-based approach to Japanese, and use the adapted system to generate two lexicons for five example properties each. Aspects of the system that require modification for use with Japanese include the dependency patterns M-ATOLL employs to extract candidate verbalizations from corpus data, and the templates used to generate the actual lexicon entries. Comparison of the lexicons generated by both approaches against manually created gold standards shows that both are viable options for generating ontology lexicons for non-Indo-European languages as well.
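
    M-ATOLL's corpus-based step matches dependency patterns against parsed sentences in which both arguments of an ontology property co-occur, and harvests the connecting material as candidate verbalizations. The following is a minimal, hypothetical sketch of that idea; the token structure, the single pattern, and the example parse are illustrative assumptions, not M-ATOLL's actual data model or pattern inventory.

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str   # surface form
    lemma: str  # dictionary form
    pos: str    # coarse part of speech
    head: int   # index of head token, -1 for root
    dep: str    # dependency relation to the head

def noun_candidates(sent, subj_idx, obj_idx):
    """Toy pattern: nouns that dominate both property arguments
    in the dependency tree are candidate verbalizations."""
    def ancestors(i):
        chain = []
        while i != -1:
            chain.append(i)
            i = sent[i].head
        return chain
    shared = set(ancestors(subj_idx)) & set(ancestors(obj_idx))
    return [sent[i].lemma for i in shared if sent[i].pos == "NOUN"]

# Hypothetical parse of 東京は日本の首都だ ("Tokyo is the capital of Japan").
sent = [
    Token("東京", "東京", "PROPN", 4, "nsubj"),  # subject argument
    Token("は",   "は",   "ADP",   0, "case"),
    Token("日本", "日本", "PROPN", 4, "nmod"),   # object argument
    Token("の",   "の",   "ADP",   2, "case"),
    Token("首都", "首都", "NOUN", -1, "root"),   # candidate verbalization
    Token("だ",   "だ",   "AUX",   4, "cop"),
]
print(noun_candidates(sent, subj_idx=0, obj_idx=2))  # ['首都']
```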

    The extraction, introduction, transfer, diffusion and integration of loanwords in Japan: loanwords in a literate society.

    This doctoral thesis seeks primarily to establish a model which shows how loanwords in Japanese evolve through a stepwise process. The process starts well before the actual borrowing itself, when Japanese school children acquire a stratum of English morphemes to which conventional pronunciations have been ascribed. This stratum could be said to be composed of a large set of orthography-pronunciation analogies. Foreign words are then extracted from foreign word stocks by agents of introduction, typically advertising copywriters or magazine journalists. However, since these words are unsuitable for use in Japanese as is, the agents then proceed to domesticate them according to Japanese rules of phonology, orthography, morphology, syntax and semantics. The next step involves transference into the public zone, crucially via the written word, before the words are disseminated and finally integrated. A few researchers have hinted that such a process exists but have taken it no further. Here, proof is evinced by interviews with the agents themselves; together with documentary and quantitative corpus analyses, it is shown that lexical borrowing of western words in Japanese proceeds in accordance with such a model. It is furthermore shown that these agents adhere to one of three broad cultural environments and borrow and domesticate words within this genre. The words then pass along channels of transference, dissemination and integration in accordance with genre-specific patterns. Investigation of these genre-specific channels of evolution constitutes the second research objective. Three further research objectives are addressed within the framework of this model: genre-specific patterns of transference and dissemination, the point at which a word changes from being a foreign word to being an integrated loanword, and the factors governing the displacement of native words by loanwords.

    Document Meta-Information as Weak Supervision for Machine Translation

    Data-driven machine translation has advanced considerably since the first pioneering work in the 1990s, with recent systems claiming human parity on sentence translation for high-resource tasks. However, performance degrades for low-resource domains with no available sentence-parallel training data. Machine translation systems also rarely incorporate document context beyond the sentence level, ignoring knowledge that is essential in some situations. In this thesis, we aim to address these two issues by examining ways to incorporate document-level meta-information into data-driven machine translation. Examples of document meta-information include document authorship and categorization information, as well as cross-lingual correspondences between documents, such as hyperlinks or citations. As this meta-information is much more coarse-grained than reference translations, it constitutes a source of weak supervision for machine translation. We present four cumulatively conducted case studies in which we devise and evaluate methods to exploit these sources of weak supervision, both in low-resource scenarios where no task-appropriate supervision from parallel data exists and in a full-supervision scenario where weak supervision from document meta-information supplements supervision from sentence-level reference translations. All case studies show improved translation quality when incorporating document meta-information.
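
    One simple, widely used way to inject such coarse document-level signals into a standard sequence-to-sequence system is to prepend them to each source sentence as pseudo-tokens the model can condition on. The sketch below illustrates only this general tagging idea; the tag format and metadata fields are assumptions for illustration, not the specific methods of the thesis.

```python
def tag_source(src_sentence: str, meta: dict) -> str:
    """Prepend document meta-information as pseudo-tokens so that a
    standard sequence-to-sequence model can condition on it. Every
    sentence of a document inherits that document's metadata."""
    tags = [f"<{key}:{value}>" for key, value in sorted(meta.items())]
    return " ".join(tags + [src_sentence])

# Hypothetical metadata for one training document.
doc_meta = {"domain": "medical", "author": "agency_x"}
print(tag_source("Die Dosis beträgt 5 mg .", doc_meta))
# -> <author:agency_x> <domain:medical> Die Dosis beträgt 5 mg .
```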

    Evaluating Information Retrieval and Access Tasks

    This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was the early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. The chapters show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students: anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one.
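
    The idea that some documents are more important than others is what graded-relevance evaluation measures capture. As a concrete illustration (a standard example added here, not taken from the book), one common formulation of normalized discounted cumulative gain (nDCG) looks like this:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: the gain at rank r (0-based)
    is discounted by log2(r + 2)."""
    return sum(g / math.log2(r + 2) for r, g in enumerate(gains))

def ndcg(ranked_gains, k):
    """DCG of the system ranking divided by DCG of the ideal
    (gain-sorted) ranking; 1.0 means a perfect ordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal > 0 else 0.0

# Graded relevance of a system's top five documents
# (2 = highly relevant, 1 = partially relevant, 0 = not relevant).
print(round(ndcg([2, 0, 1, 2, 0], k=5), 3))  # 0.894
```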

    Learner Modelling for Individualised Reading in a Second Language

    Extensive reading is an effective language learning technique that involves fast reading of large quantities of easy and interesting second language (L2) text. However, graded readers used by beginner learners are expensive and often dull. The alternative is text written for native speakers (authentic text), which is generally too difficult for beginners. The aim of this research is to overcome this problem by developing a computer-assisted approach that enables learners of all abilities to perform effective extensive reading using freely available text on the web. This thesis describes the research, development and evaluation of a complex software system called FERN that combines learner modelling and iCALL with narrow reading of electronic text. The system incorporates four key components: (1) automatic glossing of difficult words in texts, (2) an individualised search engine for locating interesting texts of appropriate difficulty, (3) supplementary exercises for introducing key vocabulary and reviewing difficult words, and (4) reliable monitoring of reading and reporting of progress. FERN was optimised for English speakers learning Spanish, but is easily adapted for learners of other languages. The suitability of the FERN system was evaluated through corpus analysis, machine translation analysis and a year-long study with a second-year university Spanish class. The machine translation analysis, combined with the classroom study, demonstrated that the word and phrase error rate of the glosses generated in FERN is low enough to validate the use of machine translation for automatic glossing, but high enough that a translation dictionary is required as a backup. The classroom study demonstrated that, when aided by glosses, students can read at over 100 words per minute if they know 95% of the words, compared with the 98% word knowledge required for effective unaided extensive reading. A corpus analysis demonstrated that beginner learners of Spanish can do effective narrow reading of news articles using FERN after learning only 200–300 high-frequency word families, in addition to familiarity with English-Spanish cognates and proper nouns. FERN also reliably monitors reading speeds and word counts, and provides motivating progress reports, which enable teachers to set concrete reading goals that dramatically increase the quantity that students read, as demonstrated in the user study.
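
    A quantity underlying both the individualised search engine and the 95%/98% thresholds above is the lexical coverage of a text with respect to a learner's known vocabulary. The sketch below is a deliberately naive illustration of that computation; FERN's own model works with word families and learner data rather than a flat set of known forms.

```python
import re

def lexical_coverage(text: str, known_words: set) -> float:
    """Fraction of running words in `text` the learner knows.
    Naive tokenization; a real system would lemmatize tokens
    to word families before looking them up."""
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in known_words) / len(tokens)

known = {"el", "la", "de", "es", "capital", "ciudad"}
text = "Madrid es la capital de España."
print(f"coverage = {lexical_coverage(text, known):.0%}")  # 67%
# Texts under the ~95% threshold would trigger glossing support.
```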