Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization
Existing methods for Chinese word segmentation (CWS) usually rely on a large
number of labeled sentences to train word segmentation models, which are
expensive and time-consuming to annotate. Fortunately, unlabeled data is
usually easy to collect, and many high-quality Chinese lexicons are available
off-the-shelf; both can provide useful information for CWS. In this paper, we
propose a neural approach for
Chinese word segmentation which can exploit both lexicon and unlabeled data.
Our approach is based on a variant of the posterior regularization algorithm:
the unlabeled data and the lexicon are incorporated into model training as
indirect supervision by regularizing the prediction space of CWS models. Extensive
experiments on multiple benchmark datasets in both in-domain and cross-domain
scenarios validate the effectiveness of our approach.
Comment: 7 pages, 11 figures, accepted by the 2019 World Wide Web Conference (WWW '19).
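To make the idea of regularizing the prediction space concrete, here is a
minimal sketch of lexicon-driven indirect supervision, under my own
assumptions: BMES tagging, a toy lexicon, and model posteriors given as a
NumPy array; all names are illustrative, not the authors' code. Full
posterior regularization projects the posterior onto a constraint set via a
KL divergence; this simplified penalty only captures the flavor of that
indirect supervision.

```python
# Sketch: penalize tag posteriors on unlabeled text that disagree with
# segmentations implied by lexicon matches (simplified stand-in for
# posterior regularization; names and structure are hypothetical).
import numpy as np

TAGS = {"B": 0, "M": 1, "E": 2, "S": 3}

def lexicon_constraints(sentence, lexicon):
    """Yield (position, tag) pairs implied by maximal lexicon matches."""
    n = len(sentence)
    for i in range(n):
        for j in range(n, i, -1):          # prefer the longest match first
            word = sentence[i:j]
            if word in lexicon:
                if len(word) == 1:
                    yield i, TAGS["S"]
                else:
                    yield i, TAGS["B"]
                    for k in range(i + 1, j - 1):
                        yield k, TAGS["M"]
                    yield j - 1, TAGS["E"]
                break                      # skip overlapping shorter matches

def pr_penalty(posteriors, sentence, lexicon):
    """Negative log-probability the model assigns to lexicon-implied tags.
    Added to the training loss, this regularizes the prediction space and
    acts as indirect supervision on unlabeled sentences."""
    penalty = 0.0
    for pos, tag in lexicon_constraints(sentence, lexicon):
        penalty -= np.log(posteriors[pos, tag] + 1e-12)
    return penalty

# Toy usage: a 4-character sentence with one 2-character lexicon word.
probs = np.full((4, 4), 0.25)              # uniform posteriors
print(pr_penalty(probs, "中文分词", {"分词"}))
```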
Neural Word Segmentation with Rich Pretraining
Neural word segmentation research has benefited from large-scale raw texts by
leveraging them for pretraining character and word embeddings. On the other
hand, statistical segmentation research has exploited richer sources of
external information, such as punctuation, automatic segmentation and POS. We
investigate the effectiveness of a range of external training sources for
neural word segmentation by building a modular segmentation model, pretraining
the most important submodule using rich external sources. Results show that
such pretraining significantly improves the model, leading to accuracies
competitive with the best methods on six benchmarks.
Comment: Accepted by ACL 2017.
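One external source the abstract mentions, punctuation, yields free boundary
labels in raw text: a character immediately before punctuation must end a
word, and one immediately after must start one. Below is a minimal sketch
(my own illustration, not the paper's code) of mining such labels to
pretrain a boundary-prediction submodule.

```python
# Sketch: derive cheap word-boundary pretraining targets from punctuation
# in raw Chinese text (function name and details are hypothetical).
PUNCT = set("，。！？；：、“”‘’（）")

def punctuation_boundaries(text):
    """Return (char, is_boundary_after) pairs, skipping punctuation itself.
    Characters adjacent to punctuation carry a known boundary label; the
    rest would be left unlabeled for the pretraining task."""
    pairs = []
    for i, ch in enumerate(text):
        if ch in PUNCT:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else None
        pairs.append((ch, nxt in PUNCT or nxt is None))
    return pairs

print(punctuation_boundaries("研究很有趣，结果也好。"))
```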
Efficient Multi-Template Learning for Structured Prediction
Conditional random fields (CRFs) and structural support vector machines
(structural SVMs) are two state-of-the-art methods for structured
prediction, which capture the interdependencies among output variables. The
success of these methods is attributed to the fact that their
discriminative models can account for overlapping features over the whole
input observation. These
features are usually generated by applying a given set of templates on labeled
data, but improper templates may lead to degraded performance. To alleviate
this issue, in this paper we propose a novel multiple-template learning
paradigm that learns the structured prediction model and the importance of
each template simultaneously, so that hundreds of arbitrary templates can be
added to the model without careful manual selection. This paradigm can be
formulated as a special multiple kernel learning problem with an exponential
number of constraints. We then introduce an efficient cutting plane
algorithm to solve this problem in the primal and establish its convergence.
We also evaluate the proposed
learning paradigm on two widely-studied structured prediction tasks,
i.e., sequence labeling and dependency parsing. Extensive experimental
results show that the proposed method outperforms CRFs and structural SVMs
because it exploits the importance of each template. Our complexity
analysis and
empirical results also show that our proposed method is more efficient than
OnlineMKL on very sparse and high-dimensional data. We further extend this
paradigm for structured prediction using generalized $\ell_p$-block norm
regularization with $p \geq 1$, and experiments show competitive performance
when $p \in [1, 2)$.
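To make the template notion concrete, the sketch below (illustrative only;
the paper's actual formulation is a multiple kernel learning problem solved
by cutting planes) generates features from a few CRF-style templates and
scores a tag with a per-template importance weight, so that uninformative
templates can be driven toward zero weight.

```python
# Sketch: template-generated features for sequence labeling with a
# learnable importance weight per template (names are hypothetical).
from collections import defaultdict

# Each template extracts a feature string from position i of sentence x.
TEMPLATES = {
    "U00": lambda x, i: f"w[-1]={x[i-1] if i > 0 else '<s>'}",
    "U01": lambda x, i: f"w[0]={x[i]}",
    "U02": lambda x, i: f"w[+1]={x[i+1] if i + 1 < len(x) else '</s>'}",
}

def template_features(x, i):
    """Apply every template at position i, grouped by template id."""
    return {tid: tpl(x, i) for tid, tpl in TEMPLATES.items()}

def score(x, i, tag, weights, beta):
    """Score a tag: per-template feature weights scaled by the template
    importance beta, mirroring the idea of weighting whole templates."""
    s = 0.0
    for tid, feat in template_features(x, i).items():
        s += beta[tid] * weights[tid].get((feat, tag), 0.0)
    return s

# Toy usage: one learned feature weight, one down-weighted template.
x = ["John", "lives", "here"]
beta = {"U00": 1.0, "U01": 1.0, "U02": 0.5}
weights = defaultdict(dict)
weights["U01"][("w[0]=John", "NNP")] = 2.0
print(score(x, 0, "NNP", weights, beta))   # 2.0
```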
GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts
In the context of the rapid development of large language models, we have
meticulously trained and introduced the GujiBERT and GujiGPT language models,
which are foundational models specifically designed for intelligent information
processing of ancient texts. These models have been trained on an extensive
dataset that encompasses both simplified and traditional Chinese characters,
allowing them to effectively handle various natural language processing tasks
related to ancient books, including but not limited to automatic sentence
segmentation, punctuation, word segmentation, part-of-speech tagging, entity
recognition, and automatic translation. Notably, these models have exhibited
exceptional performance across a range of validation tasks using publicly
available datasets. Our research findings highlight the efficacy of employing
self-supervised methods to further train the models using classical text
corpora, thus enhancing their capability to tackle downstream tasks. Moreover,
it is worth emphasizing that the choice of font, the scale of the corpus, and
the initial model selection all exert significant influence over the ultimate
experimental outcomes. To cater to the diverse text processing preferences of
researchers in digital humanities and linguistics, we have developed three
distinct categories comprising a total of nine model variations. We believe
that by sharing these foundational language models specialized in the domain of
ancient texts, we can facilitate the intelligent processing and scholarly
exploration of ancient literary works and, consequently, contribute to the
global dissemination of China's rich and esteemed traditional culture in this
new era.
Comment: 22 pages, 0 figures.
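For readers who want to try such a model, here is a hedged sketch of
masked-token prediction on classical Chinese with the Hugging Face
transformers API. The checkpoint path below is a placeholder, not a
confirmed model id; substitute the identifier actually released by the
authors.

```python
# Sketch: load a BERT-style ancient-text model and predict a masked
# character (checkpoint name is a placeholder, not a real model id).
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

CHECKPOINT = "path/to/gujibert-checkpoint"  # placeholder, replace me

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForMaskedLM.from_pretrained(CHECKPOINT)

text = "學而時習之，不亦[MASK]乎。"          # classical Chinese, one mask
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and list the top candidate characters.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top = logits[0, mask_pos].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top.tolist()))
```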
Seq-UPS: Sequential Uncertainty-aware Pseudo-label Selection for Semi-Supervised Text Recognition
This paper looks at semi-supervised learning (SSL) for image-based text
recognition. One of the most popular SSL approaches is pseudo-labeling (PL). PL
approaches assign labels to unlabeled data before re-training the model with a
combination of labeled and pseudo-labeled data. However, PL methods are
severely degraded by noise and are prone to over-fitting to noisy labels,
because poorly calibrated models generate erroneous high-confidence
pseudo-labels, rendering threshold-based selection ineffective. Moreover,
the combinatorial complexity of the hypothesis space and the error
accumulation over multiple incorrect autoregressive steps make
pseudo-labeling challenging for sequence models. To this end, we propose a
pseudo-label generation and an uncertainty-based data selection framework for
semi-supervised text recognition. We first use beam-search inference to
yield highly probable hypotheses, which are assigned as pseudo-labels to
the unlabeled examples.
Then we adopt an ensemble of models, sampled by applying dropout, to obtain a
robust estimate of the uncertainty associated with the prediction, considering
both the character-level and word-level predictive distributions to select
good-quality pseudo-labels. Extensive experiments on several benchmark handwriting
and scene-text datasets show that our method outperforms the baseline
approaches and the previous state-of-the-art semi-supervised text-recognition
methods.
Comment: Accepted at WACV 2023.
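A minimal sketch of the selection step follows, under my own simplifying
assumptions (T dropout-sampled decodings per image, with agreement rates
standing in for the paper's uncertainty estimates; names are illustrative,
not the authors' code).

```python
# Sketch: keep a pseudo-label only if a dropout ensemble of decodings
# agrees at both the word level and the character level.
from collections import Counter

def select_pseudo_label(samples, word_thresh=0.9, char_thresh=0.8):
    """samples: list of T decoded strings from stochastic forward passes.
    Returns the candidate label, or None if uncertainty is too high."""
    candidate, votes = Counter(samples).most_common(1)[0]
    word_agreement = votes / len(samples)

    # Character-level agreement: fraction of samples matching the
    # candidate at each position (length mismatch counts as disagreement).
    char_rates = []
    for i, ch in enumerate(candidate):
        agree = sum(1 for s in samples if i < len(s) and s[i] == ch)
        char_rates.append(agree / len(samples))
    char_agreement = min(char_rates) if char_rates else 0.0

    if word_agreement >= word_thresh and char_agreement >= char_thresh:
        return candidate
    return None

# Toy usage: 10 dropout samples of one decoded word image.
samples = ["hello"] * 9 + ["hallo"]
print(select_pseudo_label(samples))        # "hello" passes at 0.9 / 0.9
```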