Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization
Existing methods for Chinese word segmentation (CWS) usually rely on a large
number of labeled sentences, which are expensive and time-consuming to
annotate, to train segmentation models. Fortunately, unlabeled data is usually
easy to collect, and many high-quality Chinese lexicons are available off the
shelf; both can provide useful information for CWS. In this paper, we propose a
neural approach for CWS that can exploit both lexicons and unlabeled data.
Our approach is based on a variant of the posterior regularization algorithm, and
the unlabeled data and lexicon are incorporated into model training as indirect
supervision by regularizing the prediction space of CWS models. Extensive
experiments on multiple benchmark datasets in both in-domain and cross-domain
scenarios validate the effectiveness of our approach.
Comment: 7 pages, 11 figures, accepted by the 2019 World Wide Web Conference (WWW '19)
Dual Long Short-Term Memory Networks for Sub-Character Representation Learning
Characters have commonly been regarded as the minimal processing unit in
Natural Language Processing (NLP). But many non-Latin languages have
hieroglyphic writing systems, involving a big alphabet with thousands or
millions of characters. Each character is composed of even smaller parts, which
are often ignored by previous work. In this paper, we propose a novel
architecture employing two stacked Long Short-Term Memory Networks (LSTMs) to
learn sub-character level representations and capture deeper levels of semantic
meaning. To build a concrete study and substantiate the efficiency of our
neural architecture, we take Chinese Word Segmentation as a case study.
Among those languages, Chinese is a typical case, in which every
character contains several components called radicals. Our networks employ a
shared radical level embedding to solve both Simplified and Traditional Chinese
Word Segmentation without extra Traditional to Simplified Chinese conversion;
in such a highly end-to-end way, word segmentation can be significantly
simplified compared to previous work. Radical level embeddings can also
capture deeper semantic meaning below the character level and improve
learning performance. By tying radical and character embeddings together,
the parameter count is reduced while semantic knowledge is shared and
transferred between the two levels, substantially boosting performance. On 3 out of 4
Bakeoff 2005 datasets, our method surpassed state-of-the-art results by up to
0.4%. Our results are reproducible; source code and corpora are available on
GitHub.
Comment: Accepted & forthcoming at ITNG-201
Image Forgery Localization Based on Multi-Scale Convolutional Neural Networks
In this paper, we propose to utilize Convolutional Neural Networks (CNNs) and
the segmentation-based multi-scale analysis to locate tampered areas in digital
images. First, to deal with color input sliding windows of different scales, a
unified CNN architecture is designed. Then, we carefully design the training
procedures of CNNs on sampled training patches. With a set of robust
multi-scale tampering detectors based on CNNs, complementary tampering
possibility maps can be generated. Finally, a segmentation-based
method is proposed to fuse the maps and generate the final decision map. By
exploiting the benefits of both the small-scale and large-scale analyses, the
segmentation-based multi-scale analysis can lead to a performance leap in
forgery localization of CNNs. Numerous experiments are conducted to demonstrate
the effectiveness and efficiency of our method.
Comment: 7 pages, 6 figures
Learning Spatial-Semantic Context with Fully Convolutional Recurrent Network for Online Handwritten Chinese Text Recognition
Online handwritten Chinese text recognition (OHCTR) is a challenging problem
as it involves a large-scale character set, ambiguous segmentation, and
variable-length input sequences. In this paper, we exploit the outstanding
capability of path signature to translate online pen-tip trajectories into
informative signature feature maps using a sliding window-based method,
successfully capturing the analytic and geometric properties of pen strokes
with strong local invariance and robustness. A multi-spatial-context fully
convolutional recurrent network (MCFCRN) is proposed to exploit the multiple
spatial contexts from the signature feature maps and generate a prediction
sequence while completely avoiding the difficult segmentation problem.
Furthermore, an implicit language model is developed to make predictions based
on semantic context within a predicting feature sequence, providing a new
perspective for incorporating lexicon constraints and prior knowledge about a
certain language in the recognition procedure. Experiments on two standard
benchmarks, Dataset-CASIA and Dataset-ICDAR, yielded outstanding results, with
correct rates of 97.10% and 97.15%, respectively, which are significantly
better than the best result reported thus far in the literature.
Comment: 14 pages, 9 figures