Hierarchical Character Embeddings: Learning Phonological and Semantic Representations in Languages of Logographic Origin using Recursive Neural Networks
Logographs (Chinese characters) have recursive structures (i.e., hierarchies
of sub-units within logographs) that carry phonological and semantic
information; developmental psychology literature suggests that native speakers
leverage these structures to learn how to read. Exploiting these structures
could potentially lead to better embeddings that benefit many downstream tasks.
We propose building hierarchical logograph (character) embeddings from
logograph recursive structures using treeLSTM, a recursive neural network.
Using a recursive neural network imposes a prior on the mapping from
logographs to embeddings, since the network must read in the sub-units of a
logograph in the order specified by its recursive structure. Based on human
behavior in language learning and reading, we hypothesize that modeling
logographs' structures with a recursive neural network should be beneficial.
To verify this claim, we consider two tasks: (1) predicting logographs'
Cantonese pronunciation from logographic structures, and (2) language
modeling. Empirical
results show that the proposed hierarchical embeddings outperform baseline
approaches. Diagnostic analysis suggests that hierarchical embeddings
constructed using treeLSTM are less sensitive to distractors and are thus
more robust, especially on complex logographs.

Comment: Accepted by IEEE Transactions on Audio, Speech and Language
Processing. Copyright 2019 IEEE.
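To make the composition idea concrete, the sketch below shows a minimal child-sum TreeLSTM cell (in the style of Tai et al., 2015) applied bottom-up to a character's sub-unit tree. This is an illustrative NumPy implementation, not the authors' code: the class name, dimensions, and random sub-unit embeddings are assumptions for the example; the actual paper builds embeddings from real logograph decompositions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ChildSumTreeLSTMCell:
    """Minimal child-sum TreeLSTM cell: composes a node's sub-unit
    embedding with the states of its children (illustrative sketch)."""

    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        # One (W, U, b) triple per gate: input (i), forget (f),
        # output (o), and candidate update (u).
        self.W = {g: rng.normal(0, s, (hid_dim, in_dim)) for g in "ifou"}
        self.U = {g: rng.normal(0, s, (hid_dim, hid_dim)) for g in "ifou"}
        self.b = {g: np.zeros(hid_dim) for g in "ifou"}

    def __call__(self, x, children):
        """x: sub-unit embedding of this node; children: list of (h, c)
        state pairs already computed for the node's sub-units."""
        hs = [h for h, _ in children]
        # Sum of children's hidden states (zero vector at the leaves).
        h_tilde = np.sum(hs, axis=0) if hs else np.zeros_like(self.b["i"])
        i = sigmoid(self.W["i"] @ x + self.U["i"] @ h_tilde + self.b["i"])
        o = sigmoid(self.W["o"] @ x + self.U["o"] @ h_tilde + self.b["o"])
        u = np.tanh(self.W["u"] @ x + self.U["u"] @ h_tilde + self.b["u"])
        c = i * u
        # One forget gate per child, conditioned on that child's state.
        for h_k, c_k in children:
            f_k = sigmoid(self.W["f"] @ x + self.U["f"] @ h_k + self.b["f"])
            c = c + f_k * c_k
        h = o * np.tanh(c)
        return h, c

# Usage: compose a two-component logograph bottom-up; the root's hidden
# state h serves as the hierarchical character embedding.
cell = ChildSumTreeLSTMCell(in_dim=8, hid_dim=16)
rng = np.random.default_rng(1)
left = cell(rng.normal(size=8), [])           # leaf sub-unit
right = cell(rng.normal(size=8), [])          # leaf sub-unit
root_h, root_c = cell(rng.normal(size=8), [left, right])
```

Because the cell must consume sub-units in the order the tree dictates, the network's reading order itself encodes the structural prior the abstract describes.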