Czech Text Document Corpus v 2.0
This paper introduces "Czech Text Document Corpus v 2.0", a collection of
text documents for automatic document classification in the Czech language. It is
composed of the text documents provided by the Czech News Agency and is freely
available for research purposes at http://ctdc.kiv.zcu.cz/. This corpus was
created in order to facilitate a straightforward comparison of the document
classification approaches on Czech data. It is particularly dedicated to
evaluation of multi-label document classification approaches, because one
document is usually labelled with more than one label. Besides the information
about the document classes, the corpus is also annotated at the morphological
layer. This paper further shows the results of selected state-of-the-art
methods on this corpus to offer the possibility of an easy comparison with
these approaches.
Comment: Accepted for LREC 201
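Because each document in such a corpus can carry several labels, predictions are compared as label sets rather than single classes. The abstract does not say which metric is used, so the following is only a minimal sketch of one common choice, micro-averaged F1. The function name and the label strings are hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over documents, each with a set of labels.

    Assumption: micro-F1 is used here as an illustrative metric only;
    the corpus paper's own evaluation protocol may differ.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels correctly predicted
        fp += len(p - g)   # labels predicted but not in the gold set
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical label sets for two documents.
gold = [{"politics", "economy"}, {"sport"}]
pred = [{"politics"}, {"sport", "culture"}]
score = micro_f1(gold, pred)
```

Micro-averaging pools the counts over all documents and labels, so frequent labels weigh more than rare ones.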
Efficient Scene Text Localization and Recognition with Local Character Refinement
An unconstrained end-to-end text localization and recognition method is
presented. The method detects initial text hypotheses in a single pass by an
efficient region-based method and subsequently refines the text hypotheses
using a more robust local text model, which deviates from the common assumption
of region-based methods that all characters are detected as connected
components.
Additionally, a novel feature based on character stroke area estimation is
introduced. The feature is efficiently computed from a region distance map, is
invariant to scaling and rotation, and allows text regions to be detected
efficiently regardless of what portion of the text they capture.
The method runs in real time and achieves state-of-the-art text localization
and recognition results on the ICDAR 2013 Robust Reading dataset.
Benchmark of machine learning methods for classification of a Sentinel-2 image
Thanks mainly to ESA and USGS, a large volume of free images of the Earth is now readily available. One of the main goals of
remote sensing is to label images according to a set of semantic categories, i.e. image classification. This is a very challenging issue
since land cover of a specific class may present a large spatial and spectral variability and objects may appear at different scales and
orientations.
In this study, we report the results of benchmarking 9 machine learning algorithms tested for accuracy and speed in training and
classification of land-cover classes in a Sentinel-2 dataset. The following machine learning methods (MLM) have been tested: linear
discriminant analysis, k-nearest neighbour, random forests, support vector machines, multi layered perceptron, multi layered
perceptron ensemble, ctree, boosting, logarithmic regression. The validation is carried out using a control dataset which consists of an
independent classification into 11 land-cover classes of an area of about 60 km², obtained by manual visual interpretation of high-resolution
images (20 cm ground sampling distance) by experts. In this study, five of the eleven classes are used since the others have too few
samples (pixels) for the testing and validation subsets. The classes used are the following: (i) urban, (ii) sowable areas, (iii) water, (iv) tree
plantations, and (v) grasslands.
Validation is carried out using three different approaches: (i) using pixels from the training dataset (train), (ii) using pixels from the
training dataset and applying cross-validation with the k-fold method (kfold) and (iii) using all pixels from the control dataset. Five
accuracy indices are calculated for the comparison between the values predicted with each model and control values over three sets of
data: the training dataset (train), the whole control dataset (full) and with k-fold cross-validation (kfold) with ten folds. Results from
validation of predictions of the whole dataset (full) show that the random forests method achieves the highest values, with the kappa index ranging from
0.55 to 0.42 for the largest and smallest numbers of training pixels, respectively. The two neural networks (multi layered perceptron and its
ensemble) and the support vector machines (with the default radial basis function kernel) follow closely with comparable
performanc
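The kappa index and the ten-fold split mentioned in this abstract can be illustrated with a short sketch. It assumes the standard definition of Cohen's kappa computed from a confusion matrix and a simple interleaved fold assignment. The function names and the example matrix are hypothetical and are not taken from the study.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are reference (control) classes, columns are predicted classes.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    observed = sum(confusion[i][i] for i in range(n)) / total
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all samples share one class")
    return (observed - expected) / (1.0 - expected)


def kfold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Assumption: folds are built by interleaving indices; real studies
    usually shuffle or stratify the samples first.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical two-class confusion matrix (pixel counts).
example = [[20, 5],
           [10, 15]]
kappa = cohen_kappa(example)
```

Kappa corrects the raw agreement for the agreement expected by chance, which is why it is often preferred over overall accuracy when class sizes are unbalanced.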
Cross-Lingual Adaptation using Structural Correspondence Learning
Cross-lingual adaptation, a special case of domain adaptation, refers to the
transfer of classification knowledge between two languages. In this article we
describe an extension of Structural Correspondence Learning (SCL), a recently
proposed algorithm for domain adaptation, for cross-lingual adaptation. The
proposed method uses unlabeled documents from both languages, along with a word
translation oracle, to induce cross-lingual feature correspondences. From these
correspondences a cross-lingual representation is created that enables the
transfer of classification knowledge from the source to the target language.
The main advantages of this approach over other approaches are its resource
efficiency and task specificity.
We conduct experiments in the area of cross-language topic and sentiment
classification involving English as the source language and German, French, and
Japanese as target languages. The results show a significant improvement of the
proposed method over a machine translation baseline, reducing the relative
error due to cross-lingual adaptation by an average of 30% (topic
classification) and 59% (sentiment classification). We further report on
empirical analyses that reveal insights into the use of unlabeled data, the
sensitivity with respect to important hyperparameters, and the nature of the
induced cross-lingual correspondences
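The reported reductions in relative error can be read as a ratio of accuracy gaps. The abstract does not define the quantity, so the sketch below assumes that "error due to cross-lingual adaptation" means the gap between a monolingual upper bound and the cross-lingual accuracy. The function name and all numbers are hypothetical and are not results from the article.

```python
def relative_error_reduction(upper_bound_acc, baseline_acc, method_acc):
    """Fraction of the baseline's adaptation gap that the method closes.

    Assumption: the adaptation error of a system is the difference between
    a monolingual upper-bound accuracy and that system's accuracy.
    """
    baseline_gap = upper_bound_acc - baseline_acc
    method_gap = upper_bound_acc - method_acc
    if baseline_gap <= 0:
        raise ValueError("baseline must be below the upper bound")
    return (baseline_gap - method_gap) / baseline_gap


# Hypothetical accuracies in percent: upper bound 90, baseline 80, method 83.
# The baseline loses 10 points to adaptation and the method loses 7,
# so the method removes 3 of the 10 points.
reduction = relative_error_reduction(90.0, 80.0, 83.0)
```

Under this reading, a value of 0.30 corresponds to the 30% average reduction quoted for topic classification.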