Subword Encoding in Lattice LSTM for Chinese Word Segmentation
We investigate a lattice LSTM network for Chinese word segmentation (CWS) to
utilize words or subwords. It integrates character sequence features with the
information of all subsequences matched from a lexicon. The matched
subsequences serve as information shortcut tunnels that link their start and
end characters
directly. Gated units are used to control the contribution of multiple input
links. Through formula derivation and comparison, we show that the lattice LSTM
is an extension of the standard LSTM with the ability to take multiple inputs.
The previous lattice LSTM model takes word embeddings as the lexicon input;
we show that subword encoding gives comparable performance and has the
benefit of not relying on any external segmenter. The contribution of the
lattice LSTM comes from both the lexicon and the pretrained embeddings;
through controlled experiments, we find that the lexicon information
contributes more than the pretrained embeddings. Our experiments show that
the lattice structure with subword encoding gives results competitive with or
better than previous state-of-the-art methods on four segmentation
benchmarks.
Detailed analyses are conducted to compare the performance of word encoding and
subword encoding in the lattice LSTM. We also investigate the performance of
the lattice LSTM structure under different circumstances, and when this model
works or fails.
Comment: 8 pages
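To make the multiple-input extension concrete, here is a minimal NumPy sketch
of the gated merge at one character position: the standard candidate cell
from the character LSTM and the cells arriving through lexicon-matched
shortcut links each receive a gate conditioned on the current character
input, and the gates are normalized so the contributions sum to one. The
weight shapes and the normalization scheme are illustrative assumptions, not
the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H = 4                                     # toy hidden size
rng = np.random.default_rng(0)
W_g = rng.normal(size=(H, 2 * H)) * 0.1   # gate weights (assumed shape)

def merge_cells(char_candidate, shortcut_cells, x_char):
    """Gated merge of the character candidate cell with the cell states
    arriving through lexicon-matched shortcut links. Each incoming cell
    gets its own gate conditioned on the current character input; gates
    are normalized across inputs so contributions sum to one."""
    inputs = [char_candidate] + shortcut_cells
    gates = np.array([sigmoid(W_g @ np.concatenate([x_char, c]))
                      for c in inputs])   # one gate vector per input
    gates = gates / gates.sum(axis=0)     # normalize across inputs
    return sum(g * c for g, c in zip(gates, inputs))

x = rng.normal(size=H)                    # current character embedding
char_cand = np.tanh(rng.normal(size=H))   # candidate cell, char LSTM
shortcuts = [np.tanh(rng.normal(size=H))] # cells from matched subwords
print(merge_cells(char_cand, shortcuts, x))
```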
CASICT Tibetan Word Segmentation System for MLWS2017
We participated in the MLWS 2017 Tibetan word segmentation task. Our system
is trained in an unrestricted way, using a baseline system together with
760,000 segmented Tibetan sentences of our own. In our system, the character
sequence is first processed by the baseline system into a word sequence; a
subword unit (the BPE algorithm) then splits rare words into subwords with
their corresponding features; a neural network classifier tags each subword
with a "B, M, E, S" label; and in the decoding step a simple rule recovers
the final word sequence. The candidate system for submission is selected by
evaluating the F-score on a development set pre-extracted from the 760,000
sentences. Experiments show that this method can fix segmentation errors of
the baseline system and results in a significant performance gain.
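The abstract does not spell out the recovery rule; as a plausible sketch, the
following merges labeled subwords back into words, closing a word whenever an
"E" or "S" label appears (the dangling-buffer handling is an assumption).

```python
def recover_words(subwords, labels):
    """Merge subwords into words from "B, M, E, S" labels: B begins a
    word, M continues it, E ends it, S is a single-subword word."""
    words, buf = [], []
    for piece, tag in zip(subwords, labels):
        if tag == "S":
            if buf:                       # close any unfinished word
                words.append("".join(buf)); buf = []
            words.append(piece)
        elif tag == "B":
            if buf:
                words.append("".join(buf))
            buf = [piece]
        else:                             # "M" or "E"
            buf.append(piece)
            if tag == "E":
                words.append("".join(buf)); buf = []
    if buf:                               # dangling buffer at the end
        words.append("".join(buf))
    return words

print(recover_words(["ab", "cd", "e", "fg"], ["B", "E", "S", "S"]))
# -> ['abcd', 'e', 'fg']
```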
Lattice-Based Transformer Encoder for Neural Machine Translation
Neural machine translation (NMT) takes deterministic sequences for source
representations. However, both word-level and subword-level segmentation
admit multiple ways to split a source sequence, depending on the word
segmenter or the subword vocabulary size. We hypothesize that the diversity in
segmentations may affect the NMT performance. To integrate different
segmentations with the state-of-the-art NMT model, Transformer, we propose
lattice-based encoders to explore effective word or subword representation in
an automatic way during training. We propose two methods: 1) lattice positional
encoding and 2) lattice-aware self-attention. These two methods can be used
together and prove complementary to each other, further improving translation
performance. Experimental results show the superiority of lattice-based
encoders over the conventional Transformer encoder for both word-level and
subword-level representations.
Comment: Accepted by ACL 2019
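As an illustration of the first method, one plausible scheme for lattice
positional encoding (the indexing used in the paper may differ) is to give
every lattice token the standard sinusoidal encoding of its start-character
index, so that alternative segmentations of the same span stay positionally
aligned.

```python
import numpy as np

def sinusoidal(pos, d_model):
    """Standard Transformer sinusoidal encoding for one position."""
    i = np.arange(d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# A lattice token is (surface, start_char, end_char); toy lattice.
lattice = [("中", 0, 1), ("中国", 0, 2), ("国人", 1, 3), ("人", 2, 3)]

d_model = 8
# Index every lattice token by the position of its first character.
pos_enc = np.stack([sinusoidal(start, d_model)
                    for _, start, _ in lattice])
print(pos_enc.shape)                      # (4, 8)
```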
Glyce: Glyph-vectors for Chinese Character Representations
It is intuitive that NLP tasks for logographic languages like Chinese should
benefit from the use of the glyph information in those languages. However, due
to the lack of rich pictographic evidence in glyphs and the weak generalization
ability of standard computer vision models on character data, an effective way
to utilize the glyph information remains to be found. In this paper, we address
this gap by presenting Glyce, the glyph-vectors for Chinese character
representations. We make three major innovations: (1) We use historical Chinese
scripts (e.g., bronzeware script, seal script, and traditional Chinese) to
enrich the pictographic evidence in characters; (2) We design CNN structures
(called tianzege-CNN) tailored to Chinese character image processing; and (3)
We use image-classification as an auxiliary task in a multi-task learning setup
to increase the model's ability to generalize. We show that glyph-based models
are able to consistently outperform word/char ID-based models in a wide range
of Chinese NLP tasks. We are able to set new state-of-the-art results for a
variety of Chinese NLP tasks, including tagging (NER, CWS, POS), sentence pair
classification, single sentence classification tasks, dependency parsing, and
semantic role labeling. For example, the proposed model achieves an F1 score of
80.6 on the OntoNotes dataset of NER, +1.5 over BERT; it achieves an almost
perfect accuracy of 99.8% on the Fudan corpus for text classification. Code
is available at https://github.com/ShannonAI/glyce.
Comment: Accepted by NeurIPS 2019
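For intuition, here is a minimal PyTorch sketch in the spirit of a glyph
encoder with an auxiliary image-classification head; the layer sizes, the
24x24 input, and the architecture itself are assumptions, not the released
tianzege-CNN.

```python
import torch
import torch.nn as nn

class GlyphCNN(nn.Module):
    """Small CNN over character glyph images whose pooled output serves
    both as a glyph vector for the downstream task and as features for
    an auxiliary head that predicts the character identity."""

    def __init__(self, n_chars, d_glyph=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                  # 24x24 -> 12x12
            nn.Conv2d(32, d_glyph, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global average pooling
        )
        self.aux_head = nn.Linear(d_glyph, n_chars)

    def forward(self, images):                # (B, 1, 24, 24)
        glyph = self.conv(images).flatten(1)  # (B, d_glyph)
        return glyph, self.aux_head(glyph)    # glyph vec, aux logits

model = GlyphCNN(n_chars=5000)
imgs = torch.rand(2, 1, 24, 24)               # toy glyph bitmaps
glyph_vec, logits = model(imgs)
# Multi-task loss: task_loss + lambda * cross_entropy(logits, char_ids)
print(glyph_vec.shape, logits.shape)          # (2, 64) (2, 5000)
```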
Is Word Segmentation Necessary for Deep Learning of Chinese Representations?
Segmenting a chunk of text into words is usually the first step of processing
Chinese text, but its necessity has rarely been explored. In this paper, we ask
the fundamental question of whether Chinese word segmentation (CWS) is
necessary for deep learning-based Chinese Natural Language Processing. We
benchmark neural word-based models which rely on word segmentation against
neural char-based models which do not involve word segmentation in four
end-to-end NLP benchmark tasks: language modeling, machine translation,
sentence matching/paraphrase and text classification. Through direct
comparisons between these two types of models, we find that char-based models
consistently outperform word-based models. Based on these observations, we
conduct comprehensive experiments to study why word-based models underperform
char-based models in these deep learning-based NLP tasks. We show that it is
because word-based models are more vulnerable to data sparsity and the presence
of out-of-vocabulary (OOV) words, and thus more prone to overfitting. We hope
this paper could encourage researchers in the community to rethink the
necessity of word segmentation in deep learning-based Chinese Natural Language
Processing. (Yuxian Meng and Xiaoya Li contributed equally to this paper.)
Comment: To appear at ACL 2019
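The data-sparsity argument is easy to check on toy data: a word vocabulary
built from a small corpus misses many test words, while the character
inventory is much closer to closed. A minimal sketch (the two tiny corpora
are made up):

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens absent from the training vocabulary."""
    vocab = set(train_tokens)
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

def chars(tokens):
    return [c for t in tokens for c in t]

train = "我 喜欢 自然 语言 处理".split()
test = "我 不 喜欢 语言学".split()

print(oov_rate(train, test))                  # word-level: higher OOV
print(oov_rate(chars(train), chars(test)))    # char-level: lower OOV
```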
Lexicon-constrained Copying Network for Chinese Abstractive Summarization
The copy mechanism allows sequence-to-sequence models to choose words from
the input and put them directly into the output, and is finding increasing
use in abstractive summarization. However, since there is no explicit
delimiter in Chinese sentences, most existing models for Chinese abstractive
summarization can only perform character-level copying, which is inefficient.
To solve this
problem, we propose a lexicon-constrained copying network that models
multi-granularity in both encoder and decoder. On the source side, words and
characters are aggregated into the same input memory using a Transformer-based
encoder. On the target side, the decoder can copy either a character or a
multi-character word at each time step, and the decoding process is guided by a
word-enhanced search algorithm that facilitates the parallel computation and
encourages the model to copy more words. Moreover, we adopt a word selector to
integrate keyword information. Experimental results on a Chinese social media
dataset show that our model can work standalone or with the word selector.
Both forms outperform previous character-based models and achieve competitive
performance.
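To illustrate the copying idea, here is a pointer-generator-style sketch
rather than the paper's exact model: the output distribution mixes vocabulary
generation with copying from source units pointed to by attention, and the
source units may include lexicon-matched multi-character words alongside
single characters. All scores below are toy values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def copy_distribution(gen_logits, attn_scores, p_gen, src_units, vocab):
    """Mix a vocabulary distribution (weight p_gen) with a copy
    distribution over source units (weight 1 - p_gen)."""
    p_vocab = softmax(gen_logits)
    p_attn = softmax(attn_scores)
    out = {w: p_gen * p for w, p in zip(vocab, p_vocab)}
    for unit, p in zip(src_units, p_attn):    # copy candidates
        out[unit] = out.get(unit, 0.0) + (1 - p_gen) * p
    return out

vocab = ["北京", "天气", "好"]
src = ["北", "京", "北京", "天气"]            # chars + matched words
dist = copy_distribution(np.zeros(3), np.array([0.1, 0.2, 2.0, 1.0]),
                         p_gen=0.3, src_units=src, vocab=vocab)
print(max(dist, key=dist.get))                # likely a copied word
```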
Neural Lattice Language Models
In this work, we propose a new language modeling paradigm that has the
ability to perform both prediction and moderation of information flow at
multiple granularities: neural lattice language models. These models construct
a lattice of possible paths through a sentence and marginalize across this
lattice to calculate sequence probabilities or optimize parameters. This
approach allows us to seamlessly incorporate linguistic intuitions - including
polysemy and the existence of multi-word lexical items - into our language model.
Experiments on multiple language modeling tasks show that English neural
lattice language models that utilize polysemous embeddings are able to improve
perplexity by 9.95% relative to a word-level baseline, and that a Chinese model
that handles multi-character tokens is able to improve perplexity by 20.94%
relative to a character-level baseline.
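The marginalization can be pictured as a forward pass over the lattice:
alpha[j] accumulates the log-probability of all paths covering the first j
characters, so alpha[n] is the sequence probability summed over
segmentations. A minimal sketch with made-up edge scores follows; a real
model conditions each edge score on its history rather than using fixed
values.

```python
import math

def logaddexp(a, b):
    if a == -math.inf: return b
    if b == -math.inf: return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def lattice_log_prob(n_chars, edges):
    """edges[j] lists (i, log_p): a token spans characters [i, j) with
    score log_p. Returns the log-prob summed over all lattice paths."""
    alpha = [-math.inf] * (n_chars + 1)
    alpha[0] = 0.0
    for j in range(1, n_chars + 1):
        for i, log_p in edges.get(j, []):
            alpha[j] = logaddexp(alpha[j], alpha[i] + log_p)
    return alpha[n_chars]

# "abc" with tokens a, b, c, ab, bc (toy log-probabilities).
edges = {1: [(0, math.log(0.5))],
         2: [(1, math.log(0.4)), (0, math.log(0.3))],  # "b" | "ab"
         3: [(2, math.log(0.5)), (1, math.log(0.2))]}  # "c" | "bc"
print(lattice_log_prob(3, edges))             # log(0.25 + 0.1)
```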
Lattice Transformer for Speech Translation
Recent advances in sequence modeling have highlighted the strengths of the
transformer architecture, especially in achieving state-of-the-art machine
translation results. However, depending on the upstream systems, e.g., speech
recognition or word segmentation, the input to the translation system can vary
greatly. The goal of this work is to extend the attention mechanism of the
transformer to naturally consume the lattice in addition to the traditional
sequential input. We first propose a general lattice transformer for speech
translation, where the input is the output of an automatic speech recognition
(ASR) system and contains multiple paths and posterior scores. To leverage the
extra
information from the lattice structure, we develop a novel controllable lattice
attention mechanism to obtain latent representations. On the LDC
Spanish-English speech translation corpus, our experiments show that the
lattice transformer generalizes significantly better and outperforms both a
transformer
baseline and a lattice LSTM. Additionally, we validate our approach on the WMT
2017 Chinese-English translation task with lattice inputs from different BPE
segmentations. In this task, we also observe improvements over strong
baselines.
Comment: Accepted to ACL 2019
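For intuition, here is a rough NumPy sketch of attention over lattice nodes,
not the paper's exact controllable formulation: a reachability mask restricts
attention to node pairs that share a lattice path, and ASR posterior scores
bias the logits so that low-confidence hypotheses contribute less. The mask
and posteriors are toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lattice_attention(Q, K, V, reachable, log_posterior):
    """Scaled dot-product attention with a lattice reachability mask
    and an additive bias from ASR posterior scores."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + log_posterior[None, :]
    scores = np.where(reachable, scores, -1e9)  # mask off-path pairs
    return softmax(scores) @ V

n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
reachable = np.array([[1, 1, 0, 1],             # toy reachability
                      [1, 1, 0, 1],
                      [0, 0, 1, 1],
                      [1, 1, 1, 1]], dtype=bool)
log_post = np.log(np.array([0.9, 0.6, 0.4, 1.0]))  # ASR posteriors
print(lattice_attention(Q, K, V, reachable, log_post).shape)  # (4, 8)
```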
FGN: Fusion Glyph Network for Chinese Named Entity Recognition
Chinese NER is a challenging task. As pictographs, Chinese characters contain
latent glyph information, which is often overlooked. In this paper, we propose
FGN, a Fusion Glyph Network for Chinese NER. Besides adding glyph
information, this method also adds extra interactive information through its
fusion mechanism. The major innovations of FGN include: (1) a novel CNN
structure called CGS-CNN, proposed to capture both glyph information and
interactive information between the glyphs of neighboring characters; and (2)
a method with a sliding window and Slice-Attention to fuse the BERT
representation and glyph representation of a character, which may capture
potential interactive knowledge between context and glyph. Experiments are
conducted on four NER datasets, showing that FGN with an LSTM-CRF tagger
achieves new state-of-the-art performance for Chinese NER. Further
experiments investigate the influence of various components and settings in
FGN.
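As a rough sketch of slice-wise fusion (not FGN's released Slice-Attention,
and without the sliding window over neighboring characters), the following
cuts the BERT vector and glyph vector of one character into aligned slices,
projects the glyph slices to the BERT slice width, lets each BERT slice
attend over them, and concatenates the attended glyph content back on. All
sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceFusion(nn.Module):
    """Cut both vectors into n_slices slices, let each BERT slice
    attend over the projected glyph slices, then concatenate the
    attended glyph content onto the BERT slices."""

    def __init__(self, d_bert=768, d_glyph=64, n_slices=4):
        super().__init__()
        assert d_bert % n_slices == 0 and d_glyph % n_slices == 0
        self.n = n_slices
        self.proj = nn.Linear(d_glyph // n_slices, d_bert // n_slices)

    def forward(self, bert_vec, glyph_vec):
        b = bert_vec.view(self.n, -1)             # (S, d_b/S)
        g = self.proj(glyph_vec.view(self.n, -1)) # (S, d_b/S)
        attn = F.softmax(b @ g.T / b.shape[-1] ** 0.5, dim=-1)
        return torch.cat([b, attn @ g], dim=-1).flatten()

fuse = SliceFusion()
out = fuse(torch.randn(768), torch.randn(64))
print(out.shape)                                  # torch.Size([1536])
```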
Chinese Spelling Error Detection Using a Fusion Lattice LSTM
Spelling error detection serves as a crucial preprocessing in many natural
language processing applications. Due to the characteristics of the Chinese
language, Chinese spelling error detection is more challenging than error
detection in English. Existing methods mainly follow a pipeline framework,
which artificially divides the error detection process into two steps. Thus,
these methods suffer from error propagation and cannot always work well given
the complexity of the language environment. Besides, existing methods adopt
only character or word information, ignoring the positive effect of fusing
character, word, and pinyin information together. We propose the LF-LSTM-CRF
model,
which is an extension of the LSTM-CRF with word lattices and
character-pinyin-fusion inputs. Our model takes advantage of the end-to-end
framework to detect errors as a whole process, and dynamically integrates
character, word, and pinyin information. Experiments on the SIGHAN data show
that our LF-LSTM-CRF consistently outperforms existing methods with similar
external resources, and confirm the feasibility of the end-to-end framework
and the benefit of integrating character, word, and pinyin information.
Comment: 8 pages, 5 figures
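As a sketch of what a fused input layer might look like (the embedding sizes
and the concatenate-then-project fusion are assumptions, not the paper's
specification), each position's character embedding and pinyin embedding are
combined before the word-lattice links would be attached on top.

```python
import torch
import torch.nn as nn

class CharPinyinFusion(nn.Module):
    """Concatenate a character embedding with its pinyin embedding and
    project the pair to a fused input vector per position."""

    def __init__(self, n_chars, n_pinyin,
                 d_char=64, d_pinyin=32, d_out=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char)
        self.pinyin_emb = nn.Embedding(n_pinyin, d_pinyin)
        self.fuse = nn.Linear(d_char + d_pinyin, d_out)

    def forward(self, char_ids, pinyin_ids):      # (B, T) each
        x = torch.cat([self.char_emb(char_ids),
                       self.pinyin_emb(pinyin_ids)], dim=-1)
        return torch.tanh(self.fuse(x))           # (B, T, d_out)

layer = CharPinyinFusion(n_chars=6000, n_pinyin=420)
chars = torch.randint(0, 6000, (2, 5))
pinyins = torch.randint(0, 420, (2, 5))
print(layer(chars, pinyins).shape)                # (2, 5, 64)
```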