Segmenting DNA sequence into words based on statistical language model

Author: Wang Liang
Publication date: 26/02/2012

This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA “words” are 12 to 15 bp long. The bound of the language entropy of DNA sequences is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised ‘probability approach to word segmentation’ method to segment the DNA sequences. A benchmark for the segmentation method is also proposed. In a cross-segmentation test, we find that different genomes may use a similar language but belong to different branches, much like English and French/Latin. Finally, we present some possible applications of this method.
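As a rough illustration of what a probability-based word segmentation can look like, the sketch below picks the split of a sequence that maximizes the product of word probabilities, using dynamic programming. It is not the paper's method: it assumes a simple unigram word model rather than the n-gram model the abstract describes, and the `segment` function, the `toy_probs` vocabulary, and its probabilities are all hypothetical. The `max_len=15` default only echoes the 12 to 15 bp word length reported above.

```python
import math


def segment(sequence, word_probs, max_len=15):
    """Return the most probable split of `sequence` into known words, or None.

    Assumes a unigram model: the score of a split is the sum of the
    log-probabilities of its words (hypothetical simplification).
    """
    n = len(sequence)
    # best[i] = (log-probability of the best split of sequence[:i],
    #            start index of the last word in that split)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the vocabulary cannot cover the whole sequence
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(sequence[start:end])
        end = start
    return words[::-1]


# Hypothetical toy vocabulary; the probabilities are made up for illustration.
toy_probs = {"ATG": 0.4, "AT": 0.1, "GC": 0.2, "G": 0.05, "CGC": 0.15, "C": 0.1}
print(segment("ATGCGC", toy_probs))
```

In an unsupervised setting the word probabilities would themselves have to be estimated from unsegmented sequence data; here they are fixed by hand to keep the example short.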

Nature Precedings

Fast and Accurate Neural Word Segmentation for Chinese

Authors: Cai Deng, Huang Feiyue, Wu Yongjian, Xin Yuan, Zhang Zhisong, Zhao Hai
Publication date: 01/01/2017

Neural models with minimal feature engineering have achieved competitive performance against traditional methods for the task of Chinese word segmentation. However, both the training and working procedures of current neural models are computationally inefficient. This paper presents a greedy neural word segmenter with balanced word and character embedding inputs to alleviate these drawbacks. Our segmenter is truly end-to-end, capable of performing segmentation much faster and even more accurately than state-of-the-art neural models on Chinese benchmark datasets. Comment: To appear in ACL201

arXiv.org e-Print Archive

Crossref

Bootstrapping word alignment via word packing

Authors: Ma Yanjun, Stroppa Nicolas, Way Andy
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2007

We introduce a simple method to pack words for statistical word alignment. Our goal is to simplify the task of automatic word alignment by packing several consecutive words together when we believe they correspond to a single word in the opposite language. This is done using the word aligner itself, i.e. by bootstrapping on its output. We evaluate the performance of our approach on a Chinese-to-English machine translation task, and report a 12.2% relative increase in BLEU score over a state-of-the-art phrase-based SMT system.

Irish Universities

DCU Online Research Access Service