Segmenting DNA sequence into words based on statistical language model
This paper presents a novel method to segment DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA “words” are 12 to 15 bp long, and that the language entropy of DNA sequences is bounded at about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised probabilistic word-segmentation method for DNA sequences and propose a benchmark for evaluating it. In cross-segmentation tests, we find that different genomes may use similar languages while belonging to different branches, much as English relates to French and Latin. We conclude with some possible applications of the method.
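The core of such a probabilistic word-segmentation approach can be sketched with a Viterbi-style dynamic program that picks the split maximizing the total log-probability of the words. The toy vocabulary and probabilities below are invented for illustration (the paper learns them from genome data):

```python
import math

def segment(seq, probs, max_len=15):
    """Segment `seq` to maximize the sum of log word probabilities.
    `probs` maps candidate words to probabilities; out-of-vocabulary
    chunks get a small smoothing probability."""
    n = len(seq)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-prob of seq[:i]
    back = [0] * (n + 1)             # back[i]: start of last word in best split
    floor = math.log(1e-9)           # smoothing for unseen chunks
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = seq[j:i]
            score = best[j] + (math.log(probs[w]) if w in probs else floor)
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n
    while i > 0:
        words.append(seq[back[i]:i])
        i = back[i]
    return words[::-1]

# Purely illustrative "vocabulary" with made-up probabilities
probs = {"ATG": 0.2, "GATTACA": 0.1, "TGA": 0.15, "CCG": 0.1}
print(segment("ATGGATTACATGA", probs))  # → ['ATG', 'GATTACA', 'TGA']
```

The same recurrence extends to higher-order n-gram models by conditioning each word's probability on its predecessors.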
Senti-Lexicon and Analysis for Restaurant Reviews of Myanmar Text
Social media has become influential with the rapidly growing number of online customer reviews, written in informal language with emoticons, available on social sites. These reviews are very helpful to new customers and for decision making. Sentiment analysis identifies the feelings and opinions expressed in people's reviews. Most research applies sentiment analysis to English; no prior work has provided sentiment analysis for Myanmar text. To tackle this problem, we propose a Myanmar-language resource for mining food and restaurant reviews. This paper aims to build a language resource that overcomes language-specific problems and supports opinion-word extraction from Myanmar-language consumer reviews. We adopt a dictionary-based approach to lexicon-based sentiment analysis for opinion-word extraction in the food and restaurant domain. This research also assesses the challenges faced in sentiment analysis of the Myanmar language for future work.
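A minimal sketch of dictionary-based, lexicon-driven sentiment scoring looks like the following. The lexicon entries, weights, and negation words below are invented English stand-ins for illustration; the paper's actual resource is a Myanmar-language senti-lexicon:

```python
# Hypothetical toy lexicon: word -> polarity weight
LEXICON = {"delicious": +2, "friendly": +1, "slow": -1, "terrible": -2}
NEGATORS = {"not", "never"}

def score_review(tokens):
    """Sum lexicon weights over tokens, flipping the sign of a
    sentiment word immediately preceded by a negator."""
    total, negate = 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        if tok in LEXICON:
            total += -LEXICON[tok] if negate else LEXICON[tok]
        negate = False
    return total

print(score_review("the food was delicious but service not friendly".split()))  # → 1
```

Real lexicon-based systems add intensifiers, emoticon entries, and language-specific tokenization, which is the hard part for Myanmar script.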
Hybrid Technique for Arabic Text Compression
Arabic content on the Internet and other digital media is increasing exponentially, and the number of Arab users of these media has grown more than twentyfold over the past five years. There is a real need to reduce the space allocated to this content and to allow more efficient search and retrieval operations on it. Techniques borrowed from other languages, and general-purpose data compression techniques that ignore the specific features of Arabic, have had limited success in terms of compression ratio. In this paper, we present a hybrid technique that uses the linguistic features of the Arabic language to improve the compression ratio of Arabic texts. The technique works in phases: in the first phase, the text file is split into four files using a multilayer model-based approach; in the second phase, each of the four files is compressed with the Burrows-Wheeler compression algorithm.
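The Burrows-Wheeler transform at the heart of the second phase can be sketched in a few lines: it sorts all rotations of the input and keeps the last column, which clusters similar characters and makes the text more compressible. This naive version (quadratic memory) is for illustration only; production compressors use suffix-array-based constructions:

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: last column of the sorted rotations
    of `text` with a unique end-of-string sentinel appended."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # → "annb$aa"
```

The transform is reversible, so a move-to-front pass and an entropy coder can then exploit the character runs it produces.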
Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences
Given the lack of word delimiters in written Japanese, word segmentation is
generally considered a crucial first step in processing Japanese texts. Typical
Japanese segmentation algorithms rely either on a lexicon and syntactic
analysis or on pre-segmented data; but these are labor-intensive, and the
lexico-syntactic techniques are vulnerable to the unknown word problem. In
contrast, we introduce a novel, more robust statistical method utilizing
unsegmented training data. Despite its simplicity, the algorithm yields
performance on long kanji sequences comparable to and sometimes surpassing that
of state-of-the-art morphological analyzers over a variety of error metrics.
The algorithm also outperforms another mostly-unsupervised statistical
algorithm previously proposed for Chinese.
Additionally, we present a two-level annotation scheme for Japanese to
incorporate multiple segmentation granularities, and introduce two novel
evaluation metrics, both based on the notion of a compatible bracket, that can
account for multiple granularities simultaneously.
Comment: 22 pages. To appear in Natural Language Engineering.
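The idea of statistical segmentation from unsegmented data can be sketched with a count-based boundary test: propose a boundary where the character n-grams on either side are more frequent in raw corpus data than the n-gram straddling it. This is a simplified stand-in (function name and voting rule invented here), not the paper's exact algorithm:

```python
from collections import Counter

def boundary_votes(corpus, text, n=2):
    """Propose a boundary at position k of `text` when the n-grams
    flanking k occur more often in the unsegmented `corpus` than the
    n-gram straddling k."""
    counts = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
    cuts = []
    for k in range(n, len(text) - n + 1):
        left = counts[text[k - n:k]]
        right = counts[text[k:k + n]]
        straddle = counts[text[k - n // 2: k - n // 2 + n]]
        if min(left, right) > straddle:
            cuts.append(k)
    return cuts

# "ab" and "cd" are frequent units; "bc" straddles their boundary
print(boundary_votes("ababcdcdababcdcd", "abcd"))  # → [2]
```

The full method votes over several n-gram orders before committing to a boundary, which makes it robust on long kanji sequences.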
New Word Detection Algorithm for Chinese Based on Extraction of Local Context Information
Chinese word segmentation is an important issue in Chinese text processing. Traditional segmentation methods that depend on an existing dictionary suffer when they encounter unknown words. This paper proposes a segmentation algorithm for Chinese based on extracting local context information: it adds context information from the text under test into a local PPM statistical model to guide the detection of new words. The algorithm, which focuses on online segmentation and new word detection, achieves good results in both closed and open tests, and outperforms some well-known Chinese segmentation systems to a certain extent.
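A common context-based heuristic for new word detection, offered here only as a simplified stand-in for the paper's PPM model, scores a candidate string by the entropy of the characters adjacent to it: a true word appears in many different contexts, so both its left and right neighbor distributions have high entropy. The function name and corpus below are invented for illustration:

```python
import math
from collections import Counter

def context_entropy(corpus, word):
    """Minimum of left- and right-neighbor character entropies for
    `word` in `corpus`; high values suggest a free-standing unit."""
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(word, start + 1)

    def h(c):
        total = sum(c.values())
        return -sum(v / total * math.log2(v / total) for v in c.values()) if total else 0.0

    return min(h(left), h(right))

# "xy" recurs in varied contexts, so it scores higher than "ax"
print(context_entropy("axyb cxyd exyf", "xy") > context_entropy("axyb cxyd exyf", "ax"))  # → True
```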
"LazImpa": Lazy and Impatient neural agents learn to communicate efficiently
Previous work has shown that artificial neural agents naturally develop
surprisingly inefficient codes. This is illustrated by the fact that, in a
referential game where speaker and listener networks are optimized for
accurate transmission over a discrete channel, the emergent messages fail to
achieve an optimal length. Furthermore, frequent messages tend to be longer
than infrequent ones, a pattern contrary to the Zipf Law of Abbreviation (ZLA)
observed in all natural languages. Here, we show that near-optimal and
ZLA-compatible messages can emerge, but only if both the speaker and the
listener are modified. We hence introduce a new communication system,
"LazImpa", where the speaker is made increasingly lazy, i.e. avoids long
messages, and the listener impatient, i.e.,~seeks to guess the intended content
as soon as possible.Comment: Accepted to CoNLL 202