Unconstrained Scene Text and Video Text Recognition for Arabic Script
Building robust recognizers for Arabic has always been challenging. We
demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid
architecture in recognizing Arabic text in videos and natural scenes. We
outperform the previous state of the art on two publicly available video text
datasets, ALIF and ACTIV. For the scene text recognition task, we introduce a
new Arabic scene text dataset and establish baseline results. For scripts like
Arabic, a major challenge in developing robust recognizers is the lack of large
quantities of annotated data. We overcome this by synthesising millions of Arabic
text images from a large vocabulary of Arabic words and phrases. Our
implementation is built on top of the model introduced in [37], which has
proven quite effective for English scene text recognition. The model follows a
segmentation-free, sequence to sequence transcription approach. The network
transcribes a sequence of convolutional features from the input image to a
sequence of target labels. This does away with the need to segment the input
image into constituent characters/glyphs, which is often difficult for Arabic
script. Further, the ability of RNNs to model contextual dependencies yields
superior recognition results.
Comment: 5 page
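The segmentation-free transcription step described in this abstract can be illustrated with a minimal sketch. It assumes the network emits one score vector per feature-sequence frame and that the per-frame predictions are collapsed CTC-style (merge repeats, drop a blank symbol); the abstract does not name the decoding rule, so the blank convention, the function name `greedy_decode`, and the toy alphabet are all hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumption: class index 0 is reserved for a "no label" symbol


def greedy_decode(frame_scores: Sequence[Sequence[float]],
                  alphabet: Sequence[str]) -> str:
    """Collapse per-frame class scores into a label string.

    frame_scores[t][k] is the score of class k at frame t. Class 0 is the
    blank; class k >= 1 maps to alphabet[k - 1]. No character boundaries
    are needed: the frames are simply read left to right.
    """
    # Pick the best-scoring class independently for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]

    out: List[str] = []
    prev = BLANK
    for k in best:
        # Emit a label only when it is not blank and differs from the
        # previous frame's class, so runs of repeated frames collapse.
        if k != BLANK and k != prev:
            out.append(alphabet[k - 1])
        prev = k
    return "".join(out)
```

Calling `greedy_decode` on five frames whose best classes are `a, a, blank, a, b` yields a three-character string: the first two frames merge, while the blank keeps the second `a` separate. For a right-to-left script such as Arabic the same collapse applies; only the mapping from class index to glyph changes.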
Two-Stream RNN/CNN for Action Recognition in 3D Videos
The recognition of actions from video sequences has many applications in
health monitoring, assisted living, surveillance, and smart homes. Despite
advances in sensing, in particular related to 3D video, the methodologies to
process the data are still subject to research. We demonstrate superior results
by a system which combines recurrent neural networks with convolutional neural
networks in a voting approach. The gated-recurrent-unit-based neural networks
are particularly well-suited to distinguish actions based on long-term
information from optical tracking data; the 3D-CNNs focus more on detailed,
recent information from video data. The resulting features are merged in an SVM
which then classifies the movement. In this architecture, our method improves
recognition rates of state-of-the-art methods by 14% on standard data sets.
Comment: Published in 2017 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS)
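The fusion step in this abstract, where features from the recurrent and the convolutional stream are merged in an SVM, can be sketched as follows. This is only an illustration under several assumptions not stated in the abstract: the two feature vectors are concatenated, the SVM is linear and binary (labels +1 and -1), and it is trained with a hand-written subgradient loop rather than a library. The names `fuse`, `train_linear_svm`, and `predict` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fuse(rnn_feat: Sequence[float], cnn_feat: Sequence[float]) -> List[float]:
    """Merge the two streams by concatenation (an assumed fusion rule)."""
    return list(rnn_feat) + list(cnn_feat)


def _score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Linear decision value w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(xs: Sequence[Sequence[float]], ys: Sequence[int],
                     epochs: int = 200, lr: float = 0.1,
                     reg: float = 0.01) -> Tuple[List[float], float]:
    """Subgradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * _score(w, b, x)
            # The regulariser always shrinks the weights slightly.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # The sample violates the margin: move the boundary.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on the side of the decision boundary."""
    return 1 if _score(w, b, x) >= 0 else -1
```

In use, each training sample would be `fuse(rnn_features, cnn_features)` for one clip, and the multi-class setting of the paper would need a one-vs-rest or similar extension that this sketch leaves out.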
E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a key technology for emerging
applications such as automatic speech recognition, machine translation or image
description. Long Short Term Memory (LSTM) networks are the most successful RNN
implementation, as they can learn long term dependencies to achieve high
accuracy. Unfortunately, the recurrent nature of LSTM networks significantly
constrains the amount of parallelism and, hence, multicore CPUs and many-core
GPUs exhibit poor efficiency for RNN inference. In this paper, we present
E-PUR, an energy-efficient processing unit tailored to the requirements of LSTM
computation. The main goal of E-PUR is to support large recurrent neural
networks for low-power mobile devices. E-PUR provides an efficient hardware
implementation of LSTM networks that is flexible to support diverse
applications. One of its main novelties is a technique that we call Maximizing
Weight Locality (MWL), which improves the temporal locality of the memory
accesses for fetching the synaptic weights, reducing the memory requirements to
a large extent. Our experimental results show that E-PUR achieves real-time
performance for different LSTM networks, while reducing energy consumption by
orders of magnitude with respect to general-purpose processors and GPUs, and it
requires a very small chip area. Compared to a modern mobile SoC, an NVIDIA
Tegra X1, E-PUR provides an average energy reduction of 92x.
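The limited parallelism mentioned in this abstract follows from the recurrence itself. As a reminder, the widely used LSTM formulation is given below; this is the textbook form, and the abstract does not say which variant (for example, with peephole connections) E-PUR implements. The symbols $W$, $U$, and $b$ denote input weights, recurrent weights, and biases.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Every gate at step $t$ reads $h_{t-1}$, so the steps of one sequence must run in order, while the same weight matrices are fetched again at every step.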
Learning Fashion Compatibility with Bidirectional LSTMs
The ubiquity of online fashion shopping demands effective recommendation
services for customers. In this paper, we study two types of fashion
recommendation: (i) suggesting an item that matches existing components in a
set to form a stylish outfit (a collection of fashion items), and (ii)
generating an outfit with multimodal (images/text) specifications from a user.
To this end, we propose to jointly learn a visual-semantic embedding and the
compatibility relationships among fashion items in an end-to-end fashion. More
specifically, we consider a fashion outfit to be a sequence (usually from top
to bottom and then accessories) and each item in the outfit as a time step.
Given the fashion items in an outfit, we train a bidirectional LSTM (Bi-LSTM)
model to sequentially predict the next item conditioned on previous ones to
learn their compatibility relationships. Further, we learn a visual-semantic
space by regressing image features to their semantic representations aiming to
inject attribute and category information as a regularization for training the
LSTM. The trained network can not only perform the aforementioned
recommendations effectively but also predict the compatibility of a given
outfit. We conduct extensive experiments on our newly collected Polyvore
dataset, and the results provide strong qualitative and quantitative evidence
that our framework outperforms alternative methods.
Comment: ACM MM 1
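The sequential prediction described in this abstract can be written as a pair of objectives. This is a hedged sketch, not necessarily the exact loss used in the paper: it assumes an outfit is a sequence of item representations $x_1, \dots, x_N$, that the forward LSTM predicts each next item, and that the backward LSTM predicts each preceding item.

```latex
\begin{aligned}
E_f &= -\sum_{t=1}^{N-1} \log P(x_{t+1} \mid x_1, \dots, x_t) \\
E_b &= -\sum_{t=1}^{N-1} \log P(x_t \mid x_N, \dots, x_{t+1})
\end{aligned}
```

Under this reading, a low value of $E_f + E_b$ for a given outfit indicates that each item is well predicted by its neighbours, which is one way to score the compatibility of the outfit as a whole.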