26 research outputs found

    Dual Long Short-Term Memory Networks for Sub-Character Representation Learning

    Full text link
    Characters have commonly been regarded as the minimal processing unit in Natural Language Processing (NLP). But many non-Latin languages have logographic writing systems with large inventories of thousands of characters, each composed of even smaller parts that previous work has often ignored. In this paper, we propose a novel architecture employing two stacked Long Short-Term Memory networks (LSTMs) to learn sub-character level representations and capture deeper levels of semantic meaning. To ground the study and substantiate the efficiency of our neural architecture, we take Chinese Word Segmentation as a case study: every Chinese character contains several components called radicals. Our networks employ a shared radical-level embedding to solve both Simplified and Traditional Chinese Word Segmentation without an extra Traditional-to-Simplified conversion step, an end-to-end design that significantly simplifies word segmentation compared to previous work. Radical-level embeddings also capture semantic meaning below the character level and improve learning performance. By tying radical and character embeddings together, the parameter count is reduced while semantic knowledge is shared and transferred between the two levels, substantially boosting performance. On 3 out of 4 Bakeoff 2005 datasets, our method surpasses state-of-the-art results by up to 0.4%. Our results are reproducible; source code and corpora are available on GitHub.
    Comment: Accepted & forthcoming at ITNG-201
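
    A minimal sketch of the architecture described above, assuming a PyTorch implementation in which each character is supplied as a short sequence of radical indices; the layer sizes, tag set, and class name are illustrative assumptions, not the authors' released code:

    ```python
    # Sketch: stacked LSTMs over a shared radical-level embedding.
    # A radical-level LSTM summarises each character from its radicals,
    # and a character-level BiLSTM tags the sentence for segmentation.
    import torch
    import torch.nn as nn

    class SubCharSegmenter(nn.Module):
        def __init__(self, n_radicals, emb_dim=64, hidden=128, n_tags=4):
            super().__init__()
            self.radical_emb = nn.Embedding(n_radicals, emb_dim)        # shared radical embedding
            self.radical_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.char_lstm = nn.LSTM(hidden, hidden, batch_first=True,
                                     bidirectional=True)                # sentence-level context
            self.tagger = nn.Linear(2 * hidden, n_tags)                 # e.g. B/M/E/S segmentation tags

        def forward(self, radical_ids):
            # radical_ids: (batch, sent_len, radicals_per_char), LongTensor
            b, s, r = radical_ids.shape
            rads = self.radical_emb(radical_ids.reshape(b * s, r))
            _, (h, _) = self.radical_lstm(rads)          # last hidden state summarises each character
            char_vecs = h[-1].reshape(b, s, -1)
            ctx, _ = self.char_lstm(char_vecs)
            return self.tagger(ctx)                      # per-character tag scores
    ```

    Because the character representations are built from the same radical embedding table, parameters are shared across the two levels, which is the tying effect the abstract refers to.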

    Word Boundary Decision with CRF for Chinese Word Segmentation

    Get PDF
    PACLIC 23 / City University of Hong Kong / 3-5 December 200

    Automatically Generating a Large, Culture-Specific Blocklist for China

    Full text link
    Internet censorship measurements rely on lists of websites to be tested, or "block lists," that are curated by third parties. Unfortunately, many of these lists are not public, and those that are tend to focus on a small group of topics, leaving other types of sites and services untested. To increase and diversify the set of sites on existing block lists, we use natural language processing and search engines to automatically discover a much wider range of websites that are censored in China. Using these techniques, we create a list of 1,125 websites outside the Alexa Top 1,000 that cover Chinese politics, minority human rights organizations, oppressed religions, and more. Importantly, none of the sites we discover are present on the current largest block list. The list that we develop not only vastly expands the set of sites that current Internet measurement tools can test, but it also deepens our understanding of the nature of content that is censored in China. We have released both this new block list and the code for generating it.
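
    A hedged sketch of the discovery step described above: expand topic phrases through a search engine and keep candidate domains that fall outside the Alexa Top 1,000. The `search_engine_results` helper and the way seeds are supplied are assumptions for illustration, not the authors' released pipeline:

    ```python
    # Sketch: discover candidate websites from topic phrases via search results,
    # then filter out domains already covered by the Alexa Top 1,000.
    from urllib.parse import urlparse

    def search_engine_results(query: str) -> list[str]:
        """Placeholder: return result URLs for a query (e.g. via a search API)."""
        raise NotImplementedError

    def discover_candidates(seed_phrases, alexa_top_1000):
        candidates = set()
        for phrase in seed_phrases:
            for url in search_engine_results(phrase):
                domain = urlparse(url).netloc.lower()
                if domain and domain not in alexa_top_1000:
                    candidates.add(domain)
        return sorted(candidates)
    ```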

    Which Is Essential for Chinese Word Segmentation: Character versus Word

    Get PDF
    PACLIC 20 / Wuhan, China / 1-3 November, 200

    Which is More Suitable for Chinese Word Segmentation, the Generative Model or the Discriminative One?

    Get PDF
    PACLIC 23 / City University of Hong Kong / 3-5 December 200

    Towards Feasible Instructor Intervention in MOOC discussion forums

    Get PDF
    Massive Open Online Courses (MOOCs) allow numerous people from around the world to access knowledge that they otherwise would not have. However, the high student-to-instructor ratio in MOOCs restricts instructors' ability to facilitate student learning by intervening in discussion forums, as they do in face-to-face classrooms. Instructors need automated guidance on when and how to intervene in discussion forums. Using a typology of pedagogical interventions derived from prior research, we annotate a large corpus of discussion forum contents to enable supervised machine learning to automatically identify interventions that promote student learning. Such machine learning models may allow the building of dashboards that automatically prompt instructors on when and how to intervene in discussion forums. In the longer term, it may be possible to automate these interventions, relieving instructors of this effort. Such automated approaches are essential for allowing good pedagogical practices to scale in the context of MOOC discussion forums.
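
    A minimal sketch of the supervised-learning step, assuming a scikit-learn TF-IDF plus logistic-regression pipeline over the annotated posts; the example posts, label names, and model choice are hypothetical, since the abstract only specifies that annotated forum content feeds a supervised learner:

    ```python
    # Sketch: classify forum posts by the kind of instructor intervention they call for.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical annotated examples: post text paired with an intervention label
    # drawn from the typology of pedagogical interventions.
    posts = ["How do I submit assignment 2?",
             "I think the proof in lecture 3 has a gap in the induction step."]
    labels = ["logistics", "content-clarification"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(posts, labels)
    print(model.predict(["Where can I find the slides for week 4?"]))
    ```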

    Tibetan Word Segmentation as Syllable Tagging Using Conditional Random Field

    Get PDF
    In this paper, we propose a novel approach to Tibetan word segmentation using conditional random fields. We reformulate segmentation as a syllable tagging problem: the approach labels each syllable with a word-internal position tag and combines syllables into words according to their tags. As there is no publicly available Tibetan word segmentation corpus, the training corpus is generated by another segmenter, which has an F-score of 96.94% on the test set. Two feature template sets, namely TMPT-6 and TMPT-10, are used and compared, and the results show that the former is better. Experiments also show that a larger training set improves performance significantly. Trained on a set of 131,903 sentences, the segmenter achieves an F-score of 95.12% on a test set of 1,000 sentences. © 2011 by Huidan Liu, Minghua Nuo, Longlong Ma, Jian Wu, and Yeping He.
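
    A small sketch of the syllable-tagging idea: once a CRF (e.g. trained with a library such as sklearn-crfsuite over feature templates) has assigned a word-internal position tag to each syllable, the syllables are stitched back into words. The B/M/E/S tag names and the helper below are illustrative assumptions, not the paper's TMPT-6/TMPT-10 templates:

    ```python
    # Sketch: combine syllables into words according to their position tags.
    def tags_to_words(syllables, tags):
        """B = word begins, M = word continues, E = word ends, S = single-syllable word."""
        words, current = [], []
        for syl, tag in zip(syllables, tags):
            if tag == "S":
                words.append(syl)
            elif tag == "B":
                current = [syl]
            elif tag == "M":
                current.append(syl)
            else:  # "E"
                current.append(syl)
                words.append("".join(current))
                current = []
        return words

    # Example with placeholder syllables:
    print(tags_to_words(["a", "b", "c", "d"], ["B", "E", "S", "S"]))  # ['ab', 'c', 'd']
    ```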