Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences

Author: Ando Rie Kubota
Lee Lillian
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 10/05/2002

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data; but these are labor-intensive, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introduce a novel, more robust statistical method utilizing unsegmented training data. Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics. The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese. Additionally, we present a two-level annotation scheme for Japanese to incorporate multiple segmentation granularities, and introduce two novel evaluation metrics, both based on the notion of a compatible bracket, that can account for multiple granularities simultaneously. Comment: 22 pages. To appear in Natural Language Engineering.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Text Augmentation: Inserting markup into natural language text with PPM Models

Author: Yeates Stuart Andrew
Publication venue: The University of Waikato
Publication date: 01/01/2006

This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPM models. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14682 bibliographies laid out in seven standard styles using the BIBTEX system and marked up in XML with every field from the original BIBTEX. Other corpora include the ROCLING Chinese text segmentation corpus, the Computists’ Communique corpus and the Reuters’ corpus. A detailed examination is presented of the methods of evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.

CiteSeerX

Research Commons@Waikato

Unsupervised Statistical Segmentation of Japanese Kanji Strings

Author: Ando Rie
Lee Lillian
Publication venue: 'SAGE Publications'
Publication date: 01/01/1999

Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method to segment Japanese kanji sequences into words based solely on character n-gram counts from an unannotated corpus. The performance was often better than that of rule-based morphological analyzers over a variety of both standard and novel error metrics.

CiteSeerX

eCommons@Cornell

A history and theory of textual event detection and recognition

Author: Chen Yanping
Ding Zehua
Huang Ruizhang
Qin Yongbin
Shah Nazaraf
Zheng Qinghua
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 17/11/2020

Coventry University Pure Portal

Revisiting the challenges and surveys in text similarity matching and detection methods

Author: Kusrini Kusrini
Muhammad Alva Hendi
Oyong Irwan
Publication venue: 'Universitas Ahmad Dahlan, Kampus 3'
Publication date: 30/09/2022

The massive amount of information from the internet has revolutionized the field of natural language processing. One of the challenges was estimating the similarity between texts. This has been an open research problem although various studies have proposed new methods over the years. This paper surveyed and traced the primary studies in the field of text similarity. The aim was to give a broad overview of existing issues, applications, and methods of text similarity research. This paper identified four issues and several applications of text similarity matching. It classified current studies based on intrinsic, extrinsic, and hybrid approaches. Then, we identified the methods and classified them into lexical-similarity, syntactic-similarity, semantic-similarity, structural-similarity, and hybrid. Furthermore, this study also analyzed and discussed method improvement, current limitations, and open challenges on this topic for future research directions.

Journal of Education and Learning (EduLearn)