Text authorship identified using the dynamics of word co-occurrence networks
The identification of authorship in disputed documents still requires human
expertise, which is now unfeasible for many tasks owing to the large volumes of
text and authors in practical applications. In this study, we introduce a
methodology based on the dynamics of word co-occurrence networks representing
written texts to classify a corpus of 80 texts by 8 authors. The texts were
divided into sections with an equal number of linguistic tokens, from which time
series were created for 12 topological metrics. The series were found to be
stationary (p-value>0.05), which permits the use of distribution moments as
learning attributes. With an optimized supervised learning procedure using a
Radial Basis Function Network, 68 out of 80 texts were correctly classified,
i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in
purely dynamic network metrics were found to characterize authorship, thus
opening the way for the description of texts in terms of small evolving
networks. Moreover, the approach introduced allows for comparison of texts with
diverse characteristics in a simple, fast fashion.
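The pipeline described in this abstract (equal-sized sections, one network per section, a metric time series, then distribution moments as features) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how edges are defined or which 12 metrics are used, so linking adjacent tokens and using average degree as the single metric are both assumptions, and all function names are hypothetical. The Radial Basis Function Network classifier is not shown.

```python
from statistics import mean, pstdev


def cooccurrence_edges(tokens):
    """Undirected edges between adjacent, distinct tokens (assumed edge rule)."""
    return {
        frozenset(pair)
        for pair in zip(tokens, tokens[1:])
        if pair[0] != pair[1]
    }


def average_degree(tokens):
    """Average node degree of the co-occurrence network of one section."""
    nodes = set(tokens)
    edges = cooccurrence_edges(tokens)
    return 2 * len(edges) / len(nodes) if nodes else 0.0


def metric_series(tokens, section_size):
    """One metric value per consecutive section of `section_size` tokens.

    A trailing section shorter than `section_size` is dropped so that
    every section has the same number of tokens.
    """
    n_sections = len(tokens) // section_size
    return [
        average_degree(tokens[i * section_size:(i + 1) * section_size])
        for i in range(n_sections)
    ]


def moment_features(series):
    """First two distribution moments of a metric series (mean, std. dev.)."""
    return (mean(series), pstdev(series))
```

Calling `moment_features(metric_series(tokens, section_size))` on each text yields a small fixed-length feature vector, which is the kind of input a supervised classifier could then be trained on. Using moments this way relies on the series being stationary, as the abstract reports.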
A fine-grained approach to scene text script identification
This paper focuses on the problem of script identification in unconstrained
scenarios. Script identification is an important prerequisite to recognition,
and an indispensable condition for automatic text understanding systems
designed for multi-language environments. Although widely studied for document
images and handwritten documents, it remains an almost unexplored territory for
scene text images.
We detail a novel method for script identification in natural images that
combines convolutional features and the Naive-Bayes Nearest Neighbor
classifier. The proposed framework efficiently exploits the discriminative
power of small stroke-parts, in a fine-grained classification framework.
In addition, we propose a new public benchmark dataset for the evaluation of
joint text detection and script identification in natural scenes. Experiments
on this new dataset demonstrate that the proposed method yields state-of-the-art
results, while generalizing well to different datasets and a variable
number of scripts. The evidence provided shows that multi-lingual scene text
recognition in the wild is a viable proposition. Source code of the proposed
method is made available online.
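The Naive-Bayes Nearest Neighbor decision rule named in this abstract can be illustrated with a short sketch. It shows only the generic rule (for each class, sum the squared distance from every query descriptor to its nearest descriptor in that class, then pick the class with the smallest total). The toy 2-D vectors stand in for the convolutional stroke-part features, and the function names are hypothetical; the paper's own variant may weight or prune descriptors differently.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nbnn_classify(query_descriptors, class_descriptors):
    """Return the label whose descriptor pool is closest to the query.

    query_descriptors: list of feature vectors extracted from one image.
    class_descriptors: dict mapping label -> list of feature vectors
                       pooled from all training images of that class.
    """
    totals = {}
    for label, pool in class_descriptors.items():
        # Image-to-class distance: nearest neighbour per query descriptor.
        totals[label] = sum(
            min(squared_distance(d, p) for p in pool)
            for d in query_descriptors
        )
    return min(totals, key=totals.get)
```

Because distances are computed from an image to a whole class rather than to individual training images, a few distinctive local parts can dominate the decision, which fits the fine-grained setting the abstract describes.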
Facilitating translation using source language paraphrase lattices
For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out-of-vocabulary words; 2) the same information in an input sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small, medium and large-scale English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resource-limited language pairs, but also for resource-sufficient pairs to some extent.
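The idea of combining original phrases and their paraphrases in one weighted lattice can be sketched as below. This is a hypothetical illustration only: the abstract does not say how edge weights are estimated, so original-word edges are given weight 1.0 and paraphrase edges reuse a score supplied with the paraphrase table. The function name and data layout are assumptions, and a multi-word paraphrase is kept as a single labelled edge rather than expanded into word-level nodes.

```python
def build_lattice(tokens, paraphrases):
    """Return lattice edges as (start, end, phrase, weight) tuples.

    tokens:      the input sentence as a list of words; node i sits
                 before token i, so the sentence spans nodes 0..len(tokens).
    paraphrases: dict mapping a source phrase (tuple of words) to a list
                 of (paraphrase, score) pairs, where paraphrase is a tuple
                 of words and score is an assumed, precomputed weight.
    """
    # Original path: one edge per input word, with an assumed weight of 1.0.
    edges = [(i, i + 1, (tok,), 1.0) for i, tok in enumerate(tokens)]

    # Alternative paths: every span that has a known paraphrase gets an
    # extra edge between the same two nodes.
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            span = tuple(tokens[start:end])
            for alternative, score in paraphrases.get(span, []):
                edges.append((start, end, alternative, score))
    return edges
```

A decoder that accepts lattice input can then choose among the original wording and its paraphrases path by path, instead of being restricted to the single surface form of the input sentence.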
Natural language processing
Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems.
- …