Kannada Character Recognition System A Review
Intensive research has been done on optical character recognition (OCR), and a
large number of articles have been published on this topic during the last few
decades. Many commercial OCR systems are now available in the market, but most
of these systems work for Roman, Chinese, Japanese and Arabic characters. There
is still too little work on Indian-language character recognition, especially
for Kannada script, one of 12 major scripts in India. This paper presents
a review of existing work on printed Kannada script and the reported results. The
characteristics of Kannada script and the Kannada Character Recognition (KCR)
system are discussed in detail. Finally, fusion at the classifier level is
proposed to increase the recognition accuracy.
Comment: 12 pages, 8 figures
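The abstract proposes fusion at the classifier level without saying which fusion rule is meant. As a minimal sketch, assuming plain majority voting over the labels returned by several independent classifiers for one character image (the function name and labels are hypothetical, not taken from the paper):

```python
from collections import Counter


def fuse_by_majority(predictions):
    """Return the label predicted by the most classifiers.

    `predictions` holds one label per classifier for a single character
    image. Ties go to the earliest classifier among the tied labels.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label
```

For example, `fuse_by_majority(["ka", "ka", "ga"])` returns `"ka"`. The sketch assumes at least one prediction; an empty list raises `ValueError`.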
Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems
Visual modifications to text are often used to obfuscate offensive comments
in social media (e.g., "!d10t") or as a writing style ("1337" in "leet speak"),
among other scenarios. We consider this a new type of adversarial attack in
NLP, a setting to which humans are very robust, as our experiments with both
simple and more difficult visual input perturbations demonstrate. We then
investigate the impact of visual adversarial attacks on current NLP systems on
character-, word-, and sentence-level tasks, showing that both neural and
non-neural models are, in contrast to humans, extremely sensitive to such
attacks, suffering performance decreases of up to 82%. We then explore three
shielding methods (visual character embeddings, adversarial training, and
rule-based recovery), which substantially improve the robustness of the models.
However, the shielding methods still fall short of the performance achieved in
non-attack scenarios, which demonstrates the difficulty of dealing with visual
attacks.
Comment: Accepted as long paper at NAACL-2019; fixed one ungrammatical
sentence
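The attack and the rule-based recovery shield described above can be illustrated with a toy example. This is a minimal sketch, assuming a small hand-written look-alike table; the table and the function names `perturb` and `recover` are hypothetical and are not the paper's perturbation sets or implementation.

```python
# Hypothetical look-alike table (assumption, not from the paper).
LOOKALIKES = {"i": "!", "o": "0", "e": "3", "l": "1", "t": "7", "a": "4"}
# Inverse table used by the rule-based recovery step.
RECOVERY = {sub: letter for letter, sub in LOOKALIKES.items()}


def perturb(text, every=1):
    """Replace every `every`-th substitutable character with its look-alike."""
    out = []
    seen = 0
    for ch in text:
        sub = LOOKALIKES.get(ch.lower())
        if sub is not None:
            seen += 1
            if seen % every == 0:
                out.append(sub)
                continue
        out.append(ch)
    return "".join(out)


def recover(text):
    """Rule-based recovery: map each known look-alike back to a letter."""
    return "".join(RECOVERY.get(ch, ch) for ch in text)
```

With this table, `perturb("idiot")` gives `"!d!07"` and `recover` maps it back to `"idiot"`. The recovery rule is deliberately naive: it also rewrites genuine digits and exclamation marks and does not restore letter case, which hints at why such shields may not fully close the gap to non-attack performance.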
Learning Character-level Compositionality with Visual Features
Previous work has modeled the compositionality of words by creating
character-level models of meaning, reducing problems of sparsity for rare
words. However, in many writing systems compositionality has an effect even at
the character level: the meaning of a character is derived from the sum of its
parts. In this paper, we model this effect by creating embeddings for
characters based on their visual characteristics, creating an image for the
character and running it through a convolutional neural network to produce a
visual character embedding. Experiments on a text classification task
demonstrate that such a model allows for better processing of instances with rare
characters in languages such as Chinese, Japanese, and Korean. Additionally,
qualitative analyses demonstrate that our proposed model learns to focus on the
parts of characters that carry semantic content, resulting in embeddings that
are coherent in visual space.
Comment: Accepted to ACL 201
Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data
Similarity-based approaches represent a promising direction for time series
analysis. However, many such methods rely on parameter tuning, and some have
shortcomings if the time series are multivariate (MTS), due to dependencies
between attributes, or the time series contain missing data. In this paper, we
address these challenges within the powerful context of kernel methods by
proposing the robust time series cluster kernel (TCK). The approach
taken leverages the missing data handling properties of Gaussian mixture models
(GMM) augmented with informative prior distributions. An ensemble learning
approach is exploited to ensure robustness to parameters by combining the
clustering results of many GMMs to form the final kernel.
We evaluate the TCK on synthetic and real data and compare to other
state-of-the-art techniques. The experimental results demonstrate that the TCK
is robust to parameter choices, provides competitive results for MTS without
missing data and outstanding results for missing data.
Comment: 23 pages, 6 figures
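The abstract says the final kernel combines the clustering results of many GMMs but does not state the combination rule. As a minimal sketch, assume each ensemble member has already produced soft cluster-membership probabilities for every series, and that members are combined by summing the inner products of those probabilities. The name `ensemble_kernel` and the toy data are hypothetical; GMM fitting and missing-data handling are omitted.

```python
def ensemble_kernel(posteriors):
    """Build a similarity matrix from an ensemble of soft clusterings.

    `posteriors` is a list with one entry per ensemble member. Each entry
    is a list of per-series membership probability vectors. Two series
    score high when many members place them in the same clusters.
    """
    n = len(posteriors[0])
    kernel = [[0.0] * n for _ in range(n)]
    for member in posteriors:
        for i in range(n):
            for j in range(n):
                kernel[i][j] += sum(p * q for p, q in zip(member[i], member[j]))
    return kernel


# Toy ensemble: two members, three series, two clusters each (assumed data).
example = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5], [0.0, 1.0]],
]
K = ensemble_kernel(example)
```

Here series 0 and 1 receive identical assignments in both members, so `K[0][1]` is larger than `K[0][2]`. Each member contributes a Gram matrix of its membership vectors, so the sum is symmetric and positive semidefinite, which is what a kernel method requires.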
A Cross-Lingual Similarity Measure for Detecting Biomedical Term Translations
Bilingual dictionaries for technical terms such as biomedical terms are an important resource for machine translation systems as well as for humans who would like to understand a concept described in a foreign language. Often a biomedical term is first proposed in English and later manually translated into other languages. Although there are large monolingual lexicons of biomedical terms, only a fraction of those term lexicons are translated into other languages. Manually compiling large-scale bilingual dictionaries for technical domains is a challenging task because it is difficult to find a sufficiently large number of bilingual experts. We propose a cross-lingual similarity measure for detecting the most similar translation candidates for a biomedical term specified in one language (source) from another language (target). Specifically, a biomedical term in a language is represented using two types of features: (a) intrinsic features that consist of character n-grams extracted from the term under consideration, and (b) extrinsic features that consist of unigrams and bigrams extracted from the contextual windows surrounding the term under consideration. We propose a cross-lingual similarity measure using each of those feature types. First, to reduce the dimensionality of the feature space in each language, we propose prototype vector projection (PVP), a non-negative lower-dimensional vector projection method. Second, we propose a method to learn a mapping between the feature spaces in the source and target language using partial least squares regression (PLSR). The proposed method requires only a small number of training instances to learn a cross-lingual similarity measure. The proposed PVP method outperforms popular dimensionality reduction methods such as the singular value decomposition (SVD) and non-negative matrix factorization (NMF) in a nearest neighbor prediction task. Moreover, our experimental results covering several language pairs such as English–French, English–Spanish, English–Greek, and English–Japanese show that the proposed method outperforms several other feature projection methods in biomedical term translation prediction tasks.
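The intrinsic features described above (character n-grams of a term) can be sketched as follows. This is a minimal illustration: the boundary markers, the n-gram range, and the direct cosine comparison are assumptions made here. The paper's PVP projection and PLSR mapping between languages are not reproduced, so the comparison is only meaningful for terms that share a script.

```python
from collections import Counter
from math import sqrt


def char_ngrams(term, n_min=2, n_max=3):
    """Count character n-grams of a term, with ^ and $ as boundary markers."""
    padded = f"^{term.lower()}$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

For instance, `cosine(char_ngrams("hepatitis"), char_ngrams("hépatite"))` is positive because the two terms share n-grams such as `pat` and `ati`, while comparing "hepatitis" with "cardiology" gives 0.0 because they share no 2-grams or 3-grams. For pairs such as English–Japanese the scripts do not overlap at all, which is where a learned mapping between feature spaces becomes necessary.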