Dual Long Short-Term Memory Networks for Sub-Character Representation Learning
Characters have commonly been regarded as the minimal processing unit in Natural Language Processing (NLP). However, many non-Latin languages use logographic writing systems with very large character inventories, numbering in the thousands, and each character is composed of even smaller parts, which previous work has often ignored. In this paper, we propose a novel architecture that employs two stacked Long Short-Term Memory networks (LSTMs) to learn sub-character-level representations and capture deeper levels of semantic meaning. To ground the study and demonstrate the effectiveness of our neural architecture, we take Chinese Word Segmentation as a case study: Chinese is a typical such language, in which every character contains several components called radicals. Our networks employ a shared radical-level embedding to solve both Simplified and Traditional Chinese Word Segmentation without extra Traditional-to-Simplified conversion, in an end-to-end fashion that significantly simplifies word segmentation compared to previous work. Radical-level embeddings also capture semantic meaning below the character level and improve learning performance. By tying radical and character embeddings together, the parameter count is reduced while semantic knowledge is shared and transferred between the two levels, substantially boosting performance. On 3 out of 4 Bakeoff 2005 datasets, our method surpasses state-of-the-art results by up to 0.4%. Our results are reproducible; source code and corpora are available on GitHub.
Comment: Accepted & forthcoming at ITNG-201
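The architecture described above lends itself to a compact sketch. The following is a minimal PyTorch illustration (not the authors' released code) of the core idea: a lower LSTM composes each character from its radical sequence, an upper bidirectional LSTM contextualizes characters across the sentence, and both levels index into one shared embedding table. All hyperparameters and the four-tag scheme are illustrative assumptions.

```python
# Minimal sketch of dual stacked LSTMs with a shared radical/character
# embedding table; shapes and hyperparameters are illustrative only.
import torch
import torch.nn as nn

class SubCharSegmenter(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_tags=4):
        super().__init__()
        # Shared embedding table: radicals and characters share parameters,
        # so semantic knowledge transfers between the two levels.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Lower LSTM: composes a character vector from its radical sequence.
        self.radical_lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        # Upper LSTM: contextualizes characters across the sentence.
        self.char_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
        # Per-character segmentation tags, e.g. {B, M, E, S}.
        self.tagger = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, radicals):
        # radicals: (batch, sent_len, max_radicals) indices into the shared table
        b, n, r = radicals.shape
        rad_emb = self.embed(radicals).view(b * n, r, -1)
        _, (h, _) = self.radical_lstm(rad_emb)       # final hidden state
        char_repr = h[-1].view(b, n, -1)             # one vector per character
        ctx, _ = self.char_lstm(char_repr)
        return self.tagger(ctx)                      # (batch, sent_len, num_tags)

model = SubCharSegmenter(vocab_size=5000)
logits = model(torch.randint(1, 5000, (2, 10, 5)))   # 2 sentences, 10 chars, 5 radicals
print(logits.shape)                                  # torch.Size([2, 10, 4])
```

Tying the two levels to one `nn.Embedding` is what keeps the parameter count down while letting radical-level and character-level signals inform each other.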
Word Boundary Decision with CRF for Chinese Word Segmentation
PACLIC 23 / City University of Hong Kong / 3-5 December 200
Automatically Generating a Large, Culture-Specific Blocklist for China
Internet censorship measurements rely on lists of websites to be tested, or
"block lists", that are curated by third parties. Unfortunately, many of these
lists are not public, and those that are tend to focus on a small group of
topics, leaving other types of sites and services untested. To increase and
diversify the set of sites on existing block lists, we use natural language
processing and search engines to automatically discover a much wider range of
websites that are censored in China. Using these techniques, we create a list
of 1,125 websites outside the Alexa Top 1,000 that cover Chinese politics,
minority human rights organizations, oppressed religions, and more. The list
that we develop not only vastly expands the set of sites that current Internet
measurement tools can test, but also deepens our understanding of the nature
of content that is censored in China. We have released both this new block
list and the code for generating it.
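The abstract outlines a discovery pipeline: mine phrases that characterize known-censored content, feed them to a search engine, and treat the results as candidates for censorship testing. Below is a minimal sketch of that loop under stated assumptions; the seed texts are invented and the `search` function is a hypothetical placeholder, since the paper's actual NLP and measurement tooling is not reproduced here.

```python
# Sketch: phrase extraction over seed (already-censored) pages, then
# search-engine expansion to candidate sites. Seed data is toy; `search`
# is a hypothetical stand-in for a real search-engine API.
from sklearn.feature_extraction.text import TfidfVectorizer

seed_pages = [
    "news and commentary on Chinese politics and human rights",
    "religious community site blocked inside the mainland",
]

# Rank terms that characterize the seed content.
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", max_features=20)
tfidf = vec.fit_transform(seed_pages)
terms = sorted(zip(vec.get_feature_names_out(), tfidf.sum(axis=0).A1),
               key=lambda t: -t[1])

def search(query):
    """Hypothetical placeholder for a search-engine API returning result URLs."""
    return [f"https://example.org/result?q={query.replace(' ', '+')}"]

# Each high-weight phrase becomes a query; results become candidate sites
# to be handed off to censorship-measurement tools.
candidates = {url for term, _ in terms[:5] for url in search(term)}
print(sorted(candidates))
```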
Which Is Essential for Chinese Word Segmentation: Character versus Word
PACLIC 20 / Wuhan, China / 1-3 November, 200
Which is More Suitable for Chinese Word Segmentation, the Generative Model or the Discriminative One?
PACLIC 23 / City University of Hong Kong / 3-5 December 200
Towards Feasible Instructor Intervention in MOOC discussion forums
Massive Open Online Courses (MOOCs) allow numerous people from around the world to access knowledge that they otherwise would not have. However, the high student-to-instructor ratio in MOOCs restricts instructors' ability to facilitate student learning by intervening in discussion forums, as they do in face-to-face classrooms. Instructors need automated guidance on when and how to intervene in discussion forums. Using a typology of pedagogical interventions derived from prior research, we annotate a large corpus of discussion forum contents to enable supervised machine learning to automatically identify interventions that promote student learning. Such machine learning models may allow the building of dashboards that automatically prompt instructors on when and how to intervene in discussion forums. In the longer term, it may be possible to automate these interventions, relieving instructors of this effort. Such automated approaches are essential for allowing good pedagogical practices to scale in the context of MOOC discussion forums.
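The supervised setup described here, annotated posts in, intervention decisions out, can be sketched with a standard text classifier. The toy posts and the two label names below are illustrative assumptions, not the paper's annotation typology.

```python
# Sketch: train a text classifier on annotated forum posts to flag threads
# that warrant instructor intervention. Data and labels are toy examples.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

posts = [
    "I still don't understand how gradient descent converges, can anyone help?",
    "Thanks everyone, the study group link worked great.",
    "The assignment 3 autograder rejects every submission, is it broken?",
    "Great lecture this week!",
]
labels = ["intervene", "no_action", "intervene", "no_action"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(posts, labels)

# A dashboard could surface threads the model flags for intervention.
print(clf.predict(["Nobody has answered my question about quiz 2 for days."]))
```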
Tibetan Word Segmentation as Syllable Tagging Using Conditional Random Field
In this paper, we propose a novel approach to Tibetan word segmentation using a conditional random field. We reformulate segmentation as a syllable tagging problem: the approach labels each syllable with a word-internal position tag and combines syllables into words according to their tags. As there is no publicly available Tibetan word segmentation corpus, the training corpus is generated by another segmenter, which has an F-score of 96.94% on the test set. Two feature template sets, TMPT-6 and TMPT-10, are used and compared, and the results show that the former is better. Experiments also show that a larger training set improves performance significantly. Trained on a set of 131,903 sentences, the segmenter achieves an F-score of 95.12% on a test set of 1,000 sentences. © 2011 by Huidan Liu, Minghua Nuo, Longlong Ma, Jian Wu, and Yeping He.
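The syllable-tagging formulation is straightforward to sketch. The example below assumes the third-party sklearn-crfsuite package; the window features follow the spirit of a small template set rather than the paper's exact TMPT-6 templates, and the romanized syllables and tags are toy illustrations.

```python
# Sketch: Tibetan word segmentation as syllable tagging with a CRF.
# Assumes the third-party `sklearn-crfsuite` package; features and toy
# training data are illustrative, not the paper's TMPT-6/TMPT-10 templates.
import sklearn_crfsuite

def syllable_features(sent, i):
    feats = {"syl": sent[i], "bias": 1.0}
    if i > 0:
        feats["prev_syl"] = sent[i - 1]      # left-context template
    if i < len(sent) - 1:
        feats["next_syl"] = sent[i + 1]      # right-context template
    return feats

# Toy training data: sentences as syllable lists with B/M/E/S position tags.
X = [[syllable_features(s, i) for i in range(len(s))]
     for s in [["bod", "skad", "yig"], ["nga", "tsho"]]]
y = [["B", "E", "S"], ["B", "E"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))    # predicted tags are recombined into words downstream
```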