Fast and Accurate Neural Word Segmentation for Chinese
Neural models with minimal feature engineering have achieved competitive
performance against traditional methods for the task of Chinese word
segmentation. However, both training and working procedures of the current
neural models are computationally inefficient. This paper presents a greedy
neural word segmenter with balanced word and character embedding inputs to
alleviate the existing drawbacks. Our segmenter is truly end-to-end, capable of
performing segmentation much faster and even more accurately than
state-of-the-art neural models on Chinese benchmark datasets.
Comment: To appear in ACL201
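The greedy decoding idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a left-to-right search that, at each position, commits to the highest-scoring candidate word. The function `greedy_segment`, the `max_len` limit, and the toy lexicon scorer are hypothetical stand-ins; the paper's segmenter scores candidates with a neural model built on word and character embeddings, which is not reproduced here.

```python
def greedy_segment(text, score, max_len=4):
    """Segment text left to right, committing to the best-scoring word each step.

    `score` is any callable mapping a candidate word to a number. Here it is a
    placeholder for a learned scorer (an assumption, not the paper's model).
    """
    words = []
    i = 0
    while i < len(text):
        # All candidate words starting at position i, up to max_len characters.
        end = min(i + max_len, len(text))
        candidates = [text[i:j] for j in range(i + 1, end + 1)]
        best = max(candidates, key=score)
        words.append(best)
        i += len(best)
    return words


# Toy scorer: known words score high, unknown single characters score low,
# unknown multi-character strings are penalised.
LEXICON = {"中国": 2.0, "人民": 2.0, "银行": 2.0, "中国人": 1.0}


def toy_score(word):
    return LEXICON.get(word, 0.1 if len(word) == 1 else -1.0)


segmented = greedy_segment("中国人民银行", toy_score)
```

Because each decision is final, the search runs in time linear in the sentence length (times `max_len`), which is the kind of speed advantage greedy decoding offers over beam search.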
Estimating the granularity coefficient of a Potts-Markov random field within an MCMC algorithm
This paper addresses the problem of estimating the Potts parameter B jointly
with the unknown parameters of a Bayesian model within a Markov chain Monte
Carlo (MCMC) algorithm. Standard MCMC methods cannot be applied to this problem
because performing inference on B requires computing the intractable
normalizing constant of the Potts model. In the proposed MCMC method the
estimation of B is conducted using a likelihood-free Metropolis-Hastings
algorithm. Experimental results obtained for synthetic data show that
estimating B jointly with the other unknown parameters leads to estimation
results that are as good as those obtained with the actual value of B. On the
other hand, assuming that the value of B is known can degrade estimation
performance significantly if this value is incorrect. To illustrate the
interest of this method, the proposed algorithm is successfully applied to real
bidimensional SAR and tridimensional ultrasound images.
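A minimal sketch of one likelihood-free Metropolis-Hastings scheme for the Potts parameter is given below. It is an ABC-style variant written under several assumptions that are not taken from the abstract: the label field `z_obs` is treated as observed, B has a uniform prior on `[0, beta_max]`, the proposal is a symmetric Gaussian random walk, and the auxiliary field is drawn approximately with a few Gibbs sweeps started from `z_obs`. All function names and default values are hypothetical.

```python
import numpy as np


def potts_stat(z):
    """Number of equal 4-neighbour pairs: the Potts sufficient statistic."""
    return int(np.sum(z[1:, :] == z[:-1, :]) + np.sum(z[:, 1:] == z[:, :-1]))


def gibbs_sweeps(z, beta, n_states, n_sweeps, rng):
    """Run single-site Gibbs sweeps on a Potts field with 4-neighbourhoods."""
    z = z.copy()
    height, width = z.shape
    for _ in range(n_sweeps):
        for i in range(height):
            for j in range(width):
                # Count how many neighbours hold each label.
                counts = np.zeros(n_states)
                if i > 0:
                    counts[z[i - 1, j]] += 1
                if i < height - 1:
                    counts[z[i + 1, j]] += 1
                if j > 0:
                    counts[z[i, j - 1]] += 1
                if j < width - 1:
                    counts[z[i, j + 1]] += 1
                p = np.exp(beta * counts)
                z[i, j] = rng.choice(n_states, p=p / p.sum())
    return z


def abc_mh_beta(z_obs, n_states, n_iter=100, step=0.1, eps=10,
                beta_max=2.0, n_sweeps=2, seed=0):
    """ABC-style Metropolis-Hastings chain for the Potts parameter.

    The intractable normalizing constant is never evaluated: a proposal is
    accepted when a field simulated at the proposed value has a sufficient
    statistic within `eps` of the observed one.
    """
    rng = np.random.default_rng(seed)
    s_obs = potts_stat(z_obs)
    beta = 0.5 * beta_max
    chain = np.empty(n_iter)
    for t in range(n_iter):
        proposal = beta + step * rng.normal()
        # Uniform prior: proposals outside [0, beta_max] are rejected outright.
        if 0.0 <= proposal <= beta_max:
            z_aux = gibbs_sweeps(z_obs, proposal, n_states, n_sweeps, rng)
            if abs(potts_stat(z_aux) - s_obs) <= eps:
                beta = proposal
        chain[t] = beta
    return chain
```

With a flat prior and a symmetric proposal the usual acceptance ratio reduces to the tolerance check, which is why no ratio appears in the loop. In a full Bayesian model the label field would itself be sampled at each iteration rather than held fixed.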
Filtered Semi-Markov CRF
Semi-Markov CRF has been proposed as an alternative to the traditional Linear
Chain CRF for text segmentation tasks such as Named Entity Recognition (NER).
Unlike CRF, which treats text segmentation as token-level prediction, Semi-CRF
considers segments as the basic unit, making it more expressive. However,
Semi-CRF suffers from two major drawbacks: (1) quadratic complexity over
sequence length, as it operates on every span of the input sequence, and (2)
inferior performance compared to CRF for sequence labeling tasks like NER. In
this paper, we introduce Filtered Semi-Markov CRF, a variant of Semi-CRF that
addresses these issues by incorporating a filtering step to eliminate
irrelevant segments, reducing complexity and search space. Our approach is
evaluated on several NER benchmarks, where it outperforms both CRF and Semi-CRF
while being significantly faster. The implementation of our method is available
on GitHub: https://github.com/urchade/Filtered-Semi-Markov-CRF
Comment: EMNLP 2023 (Findings)
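The filter-then-decode idea can be sketched as follows. This is a simplified illustration, not the released implementation: it assumes each candidate span already carries a single score, drops spans at or below a threshold, and then picks the highest-scoring set of non-overlapping spans by dynamic programming. A real Semi-CRF would also use label transition scores, which are omitted here. The names `filter_spans` and `decode` are hypothetical.

```python
from bisect import bisect_right


def filter_spans(spans, threshold=0.0):
    """Keep only spans whose score exceeds the threshold.

    Each span is (start, end, label, score) with `end` exclusive.
    """
    return [s for s in spans if s[3] > threshold]


def decode(spans):
    """Return the best-scoring set of non-overlapping spans and its score."""
    spans = sorted(spans, key=lambda s: s[1])
    ends = [s[1] for s in spans]
    n = len(spans)
    best = [0.0] * (n + 1)   # best[k]: best score using the first k spans
    take = [False] * n       # whether span i is used in best[i + 1]
    prev = [0] * n           # how many earlier spans end before span i starts
    for i, (start, _end, _label, score) in enumerate(spans):
        p = bisect_right(ends, start, 0, i)
        prev[i] = p
        if best[p] + score > best[i]:
            best[i + 1] = best[p] + score
            take[i] = True
        else:
            best[i + 1] = best[i]
    # Backtrack to recover the chosen spans.
    chosen = []
    i = n
    while i > 0:
        if take[i - 1]:
            chosen.append(spans[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1], best[n]


candidates = [
    (0, 2, "PER", 2.0),
    (1, 3, "ORG", 1.5),
    (3, 5, "LOC", 1.0),
    (2, 5, "ORG", -0.3),
    (4, 6, "MISC", 0.4),
]
selected, total = decode(filter_spans(candidates))
```

After filtering, decoding costs O(m log m) in the number m of surviving spans, so the quadratic number of raw spans only affects the filtering pass.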
Which Is Essential for Chinese Word Segmentation: Character versus Word
PACLIC 20 / Wuhan, China / 1-3 November, 200
Learning Spatial-Semantic Context with Fully Convolutional Recurrent Network for Online Handwritten Chinese Text Recognition
Online handwritten Chinese text recognition (OHCTR) is a challenging problem
as it involves a large-scale character set, ambiguous segmentation, and
variable-length input sequences. In this paper, we exploit the outstanding
capability of path signature to translate online pen-tip trajectories into
informative signature feature maps using a sliding window-based method,
successfully capturing the analytic and geometric properties of pen strokes
with strong local invariance and robustness. A multi-spatial-context fully
convolutional recurrent network (MCFCRN) is proposed to exploit the multiple
spatial contexts from the signature feature maps and generate a prediction
sequence while completely avoiding the difficult segmentation problem.
Furthermore, an implicit language model is developed to make predictions based
on semantic context within a predicting feature sequence, providing a new
perspective for incorporating lexicon constraints and prior knowledge about a
certain language in the recognition procedure. Experiments on two standard
benchmarks, Dataset-CASIA and Dataset-ICDAR, yielded outstanding results, with
correct rates of 97.10% and 97.15%, respectively, which are significantly
better than the best result reported thus far in the literature.
Comment: 14 pages, 9 figures
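For readers unfamiliar with path signatures, the sketch below computes the signature of a pen trajectory truncated at level 2, treating the trajectory as a piecewise-linear path and combining segments with Chen's identity. The truncation level, the function names, and the flat feature vectors returned by `sliding_signatures` are assumptions made for illustration; the paper builds signature feature maps, which this sketch does not reproduce.

```python
import numpy as np


def signature_level2(points):
    """Level-1 and level-2 signature terms of a piecewise-linear path.

    points: array of shape (n_points, dim). Returns (s1, s2) where s1 has
    shape (dim,) and s2 has shape (dim, dim).
    """
    pts = np.asarray(points, dtype=float)
    dim = pts.shape[1]
    s1 = np.zeros(dim)
    s2 = np.zeros((dim, dim))
    for d in np.diff(pts, axis=0):
        # Chen's identity for appending a straight segment with increment d.
        s2 += np.outer(s1, d) + 0.5 * np.outer(d, d)
        s1 += d
    return s1, s2


def sliding_signatures(points, window):
    """Signature features (1, s1, s2 flattened) for each window of points."""
    pts = np.asarray(points, dtype=float)
    feats = []
    for i in range(len(pts) - window + 1):
        s1, s2 = signature_level2(pts[i:i + window])
        feats.append(np.concatenate(([1.0], s1, s2.ravel())))
    return np.array(feats)


# An L-shaped stroke: right one unit, then up one unit.
stroke = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
s1, s2 = signature_level2(stroke)
levy_area = 0.5 * (s2[0, 1] - s2[1, 0])
```

The level-1 term is the total displacement of the stroke, and the antisymmetric part of the level-2 term is the signed (Lévy) area, which distinguishes strokes with the same endpoints but different curvature.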