Rotational Rectification Network: Enabling Pedestrian Detection for Mobile Vision
Across a majority of pedestrian detection datasets, it is typically assumed
that pedestrians will be standing upright with respect to the image coordinate
system. This assumption, however, is not always valid for many vision-equipped
mobile platforms such as mobile phones, UAVs or construction vehicles on rugged
terrain. In these situations, the motion of the camera can cause images of
pedestrians to be captured at extreme angles. This can lead to very poor
pedestrian detection performance when using standard pedestrian detectors. To
address this issue, we propose a Rotational Rectification Network (R2N) that
can be inserted into any CNN-based pedestrian (or object) detector to adapt it
to significant changes in camera rotation. The rotational rectification network
uses a 2D rotation estimation module that passes rotational information to a
spatial transformer network to undistort image features. To enable robust
rotation estimation, we propose a Global Polar Pooling (GP-Pooling) operator to
capture rotational shifts in convolutional features. Through our experiments,
we show how our rotational rectification network can be used to improve the
performance of state-of-the-art pedestrian detectors under heavy image
rotation by up to 45%.
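The rectification step can be pictured as a plain inverse in-plane rotation of an image or feature map. The snippet below is a minimal stand-in, assuming the 2D rotation module has already produced an angle estimate; it uses nearest-neighbor sampling where a spatial transformer would use differentiable bilinear sampling, and the `rectify` helper name is hypothetical:

```python
import numpy as np

def rectify(feature_map, angle_deg):
    """Undo an estimated in-plane camera rotation by resampling the
    feature map through the rotation (nearest-neighbor sampling)."""
    h, w = feature_map.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    out = np.zeros_like(feature_map)
    ys, xs = np.mgrid[0:h, 0:w]
    # For each output pixel, sample the input at the rotated location.
    src_x = cos_t * (xs - cx) - sin_t * (ys - cy) + cx
    src_y = sin_t * (xs - cx) + cos_t * (ys - cy) + cy
    sx = np.round(src_x).astype(int)
    sy = np.round(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out[ys[valid], xs[valid]] = feature_map[sy[valid], sx[valid]]
    return out
```

Inserted between a backbone and a detection head, a warp of this kind would let a standard upright-pedestrian detector see approximately upright features again.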
Beyond Gaussian Pyramid: Multi-skip Feature Stacking for Action Recognition
Most state-of-the-art action feature extractors involve differential
operators, which act as highpass filters and tend to attenuate low frequency
action information. This attenuation introduces bias to the resulting features
and generates ill-conditioned feature matrices. The Gaussian Pyramid has been
used as a feature enhancing technique that encodes scale-invariant
characteristics into the feature space in an attempt to deal with this
attenuation. However, at the core of the Gaussian Pyramid is a convolutional
smoothing operation, which makes it incapable of generating new features at
coarse scales. In order to address this problem, we propose a novel feature
enhancing technique called Multi-skIp Feature Stacking (MIFS), which stacks
features extracted using a family of differential filters parameterized with
multiple time skips and encodes shift-invariance into the frequency space. MIFS
compensates for information lost from using differential operators by
recapturing information at coarse scales. This recaptured information allows us
to match actions at different speeds and ranges of motion. We prove that MIFS
enhances the learnability of differential-based features exponentially. The
resulting feature matrices from MIFS have much smaller condition numbers and
variances than those from conventional methods. Experimental results show
significantly improved performance on challenging action recognition and event
detection tasks. Specifically, our method exceeds the state of the art on the
Hollywood2, UCF101 and UCF50 datasets and is comparable to the state of the
art on the HMDB51 and Olympic Sports datasets. MIFS can also be used as a
speedup strategy for feature extraction with minimal or no accuracy cost.
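The stacking idea can be sketched by differencing per-frame features at several time skips and concatenating the results, so slow motion attenuated by a skip-1 differential filter is recaptured at coarser skips. This is an illustrative sketch, not the authors' exact filter family; `mifs_stack` and the default skip set are assumptions:

```python
import numpy as np

def mifs_stack(frames, skips=(1, 2, 4)):
    """Difference per-frame features at several time skips and stack
    the results along the feature axis (trimmed to a common length)."""
    feats = [frames[s:] - frames[:-s] for s in skips]  # skip-s differential filter
    n = min(f.shape[0] for f in feats)
    return np.concatenate([f[:n] for f in feats], axis=-1)
```

Each extra skip acts like the same differential filter applied to a temporally subsampled sequence, which is how slower actions re-enter the feature space.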
Rotated Feature Network for multi-orientation object detection
General detectors follow a pipeline in which feature maps extracted from
ConvNets are shared between the classification and regression tasks. However,
multi-orientation object detection imposes obviously conflicting requirements:
classification should be insensitive to orientation, while regression is quite
sensitive to it. To address this issue, we provide an Encoder-Decoder architecture,
called Rotated Feature Network (RFN), which produces rotation-sensitive feature
maps (RS) for regression and rotation-invariant feature maps (RI) for
classification. Specifically, the Encoder unit assigns weights to the rotated
feature maps. The Decoder unit extracts RS and RI by applying a resuming
operator to the rotated and reweighted feature maps, respectively. To make the
rotation-invariant characteristics more reliable, we adopt a metric that
quantitatively evaluates rotation-invariance, added as a constraint term in
the loss, yielding promising detection performance. Compared with the
state-of-the-art methods, our method can achieve significant improvement on
NWPU VHR-10 and RSOD datasets. We further evaluate the RFN on the scene
classification in remote sensing images and object detection in natural images,
demonstrating its good generalization ability. The proposed RFN can be
integrated into an existing framework, leading to great performance with only a
slight increase in model complexity.
Comment: 9 pages, 7 figures
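The rotation-invariance constraint can be pictured as a penalty on how much a feature extractor's output changes when its input is rotated. The sketch below is an illustrative stand-in for such a metric, not the paper's formulation; `feat_fn`, the angle set, and the squared-error form are all assumptions:

```python
import numpy as np

def rotation_invariance_penalty(feat_fn, x, angles=(90, 180, 270)):
    """Mean squared change of feat_fn's output under input rotations;
    adding this to the loss pushes features toward rotation-invariance."""
    base = feat_fn(x)
    penalty = 0.0
    for a in angles:
        rotated = np.rot90(x, k=a // 90)  # right-angle rotations only
        penalty += np.mean((feat_fn(rotated) - base) ** 2)
    return penalty / len(angles)
```

A rotation-invariant extractor (e.g. a global sum) scores zero, while orientation-sensitive features are penalized.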
The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers
Solving mathematical word problems (MWPs) automatically is challenging,
primarily due to the semantic gap between human-readable words and
machine-understandable logic. Despite a long history dating back to the 1960s,
MWPs have regained intensive attention in the past few years with the
advancement of Artificial Intelligence (AI). Solving MWPs successfully is
considered a milestone towards general AI. Many systems have claimed promising
results on self-crafted, small-scale datasets. However, when applied to large
and diverse datasets, none of the methods proposed in the literature achieves
high precision, revealing that current MWP solvers still have much room for
improvement. This motivated us to present a comprehensive survey that delivers
a clear and complete picture of automatic math problem solvers. In this
survey, we focus on algebraic word problems, summarize their extracted
features and the techniques proposed to bridge the semantic gap, and compare
their performance on publicly accessible datasets. We also cover
automatic solvers for other types of math problems such as geometric problems
that require the understanding of diagrams. Finally, we identify several
emerging research directions for readers interested in MWPs.
Comment: 18 pages, 5 figures
Estimation of Static and Dynamic Urban Populations with Mobile Network Metadata
Communication-enabled devices routinely carried by individuals have become
pervasive, opening unprecedented opportunities for collecting digital metadata
about the mobility of large populations. In this paper, we propose a novel
methodology for the estimation of people density at metropolitan scales, using
subscriber presence metadata collected by a mobile operator. Our approach suits
the estimation of static population densities, i.e., of the distribution of
dwelling units per urban area contained in traditional censuses. More
importantly, it enables the estimation of dynamic population densities, i.e.,
the time-varying distributions of people in a conurbation. By leveraging
substantial real-world mobile network metadata and ground-truth information, we
demonstrate that the accuracy of our solution is superior to that granted by
state-of-the-art methods in practical heterogeneous urban scenarios.
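As a heavily simplified illustration of the general idea (not the paper's estimator), per-cell subscriber presence counts can be scaled by the operator's market share and the cell area to yield a crude people-per-km² figure; all names and numbers below are hypothetical:

```python
def estimate_density(subscriber_counts, cell_area_km2, market_share):
    """Scale per-cell subscriber counts up by the operator's market
    share, then normalize by cell area to get people per km^2."""
    return {cell: count / market_share / cell_area_km2[cell]
            for cell, count in subscriber_counts.items()}
```

Repeating such an estimate per time slot gives a time-varying (dynamic) density, while aggregating over night hours approximates a static, census-like distribution.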
Enhanced image approximation using shifted rank-1 reconstruction
Low rank approximation has been extensively studied in the past. It is most
suitable for reproducing rectangular-like structures in the data. In this work
we introduce a generalization using shifted rank-1 matrices to approximate a
matrix $A \in \mathbb{C}^{M \times N}$. These matrices are of the form
$S_{\lambda}(uv^*)$, where $u \in \mathbb{C}^M$, $v \in \mathbb{C}^N$, and
$\lambda \in \mathbb{Z}^N$. The operator $S_{\lambda}$ circularly shifts the
$k$-th column of $uv^*$ by $\lambda_k$. These kinds of shifts naturally appear
in applications where an object $u$ is observed in $N$ measurements at
different positions indicated by the shift $\lambda$. The vector $v$ gives the
observation intensity. For example, a seismic wave can be recorded at $N$
sensors with different times of arrival $\lambda$; or a car moves through a
video, changing its position in every frame. We present theoretical results as
well as an efficient algorithm to calculate a shifted rank-1 approximation in
$O(NM \log M)$. The benefit of the proposed method is demonstrated in numerical
experiments. A comparison to other sparse approximation methods is given.
Finally, we illustrate the utility of the extracted parameters for direct
information extraction in several applications, including video processing and
non-destructive testing.
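The shifted rank-1 structure described above — a rank-1 matrix whose k-th column is circularly shifted by its own offset — can be built directly. A minimal numpy sketch, with the `shifted_rank1` helper name being an assumption:

```python
import numpy as np

def shifted_rank1(u, v, shifts):
    """Build S_lambda(u v*): column k of the rank-1 matrix u v* is
    circularly shifted down by shifts[k]."""
    A = np.outer(u, np.conj(v))
    for k, s in enumerate(shifts):
        A[:, k] = np.roll(A[:, k], s)
    return A
```

With all shifts zero this reduces to an ordinary rank-1 matrix, which is why the construction is a strict generalization of low-rank approximation.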
Robust Visual Tracking using Multi-Frame Multi-Feature Joint Modeling
It remains a huge challenge to design effective and efficient trackers under
complex scenarios, including occlusions, illumination changes and pose
variations. To cope with this problem, a promising solution is to integrate the
temporal consistency across consecutive frames and multiple feature cues in a
unified model. Motivated by this idea, we propose a novel correlation
filter-based tracker in this work, in which the temporal relatedness is
reconciled under a multi-task learning framework and the multiple feature cues
are modeled using a multi-view learning approach. We demonstrate the resulting
regression model can be efficiently learned by exploiting the structure of a
blockwise diagonal matrix. A fast blockwise diagonal matrix inversion algorithm
is developed thereafter for efficient online tracking. Meanwhile, we
incorporate an adaptive scale estimation mechanism to strengthen the stability
of scale variation tracking. We implement our tracker using two types of
features and test it on two benchmark datasets. Experimental results
demonstrate the superiority of our proposed approach when compared with other
state-of-the-art trackers. Project homepage:
http://bmal.hust.edu.cn/project/KMF2JMTtracking.html
Comment: This paper has been accepted by IEEE Transactions on Circuits and
Systems for Video Technology. The MATLAB code of our method is available from
our project homepage http://bmal.hust.edu.cn/project/KMF2JMTtracking.html
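The speedup behind the fast blockwise diagonal inversion comes from a standard fact: a block-diagonal matrix is inverted by inverting each diagonal block independently. A minimal sketch of that fact (not the authors' implementation; helper names are assumptions):

```python
import numpy as np

def block_diag_inverse(blocks):
    """Invert a block-diagonal matrix block by block: the cost is a sum
    of small inversions rather than one large dense inversion."""
    return [np.linalg.inv(b) for b in blocks]

def assemble(blocks):
    """Place square blocks on the diagonal of a dense zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    M = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        M[i:i + k, i:i + k] = b
        i += k
    return M
```

For b blocks of size k, this costs O(b·k³) instead of O((bk)³) for a dense inverse, which is what makes the online tracking update cheap.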
Supervised Community Detection with Line Graph Neural Networks
Traditionally, community detection in graphs can be solved using spectral
methods or posterior inference under probabilistic graphical models. Focusing
on random graph families such as the stochastic block model, recent research
has unified both approaches and identified both statistical and computational
detection thresholds in terms of the signal-to-noise ratio. By recasting
community detection as a node-wise classification problem on graphs, we can
also study it from a learning perspective. We present a novel family of Graph
Neural Networks (GNNs) for solving community detection problems in a supervised
learning setting. We show that, in a data-driven manner and without access to
the underlying generative models, they can match or even surpass the
performance of the belief propagation algorithm on binary and multi-class
stochastic block models, which is believed to reach the computational
threshold. In particular, we propose to augment GNNs with the non-backtracking
operator defined on the line graph of edge adjacencies. Our models also achieve
good performance on real-world datasets. In addition, we perform the first
analysis of the optimization landscape of training linear GNNs for community
detection problems, demonstrating that under certain simplifications and
assumptions, the loss values at local and global minima are not far apart.
Comment: Published at the International Conference on Learning Representations
(ICLR 2019)
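The non-backtracking operator is defined on directed edges: entry ((i,j),(k,l)) is 1 exactly when j = k and l ≠ i, so a walk may continue through a node but not immediately reverse. A small sketch, assuming an undirected edge list as input:

```python
import numpy as np

def non_backtracking(edges):
    """Non-backtracking operator B on the directed edges of a graph:
    B[(i,j),(k,l)] = 1 iff j == k and l != i."""
    directed = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    m = len(directed)
    B = np.zeros((m, m))
    for a, (i, j) in enumerate(directed):
        for b, (k, l) in enumerate(directed):
            if j == k and l != i:  # continue through j, but do not return to i
                B[a, b] = 1.0
    return B, directed
```

Because B lives on directed edges (the line graph), augmenting a GNN with it passes messages that cannot trivially echo back, which is what makes it effective near the detection threshold.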
HyperFusion-Net: Densely Reflective Fusion for Salient Object Detection
Salient object detection (SOD), which aims to find the most important region
of interest and segment the relevant object/item in that area, is an important
yet challenging vision task. This problem is inspired by the fact that humans
seem to perceive the main scene elements with high priority. Thus, accurate
detection of salient objects in complex scenes is critical for human-computer
interaction. In this paper, we present a novel feature learning framework for
SOD, in which we cast SOD as a pixel-wise classification problem. The proposed
framework utilizes a densely hierarchical feature fusion network, named
HyperFusion-Net, which automatically predicts the most important area and
segments the associated objects in an end-to-end manner. Specifically, inspired
by the human perception system and image reflection separation, we first
decompose input images into reflective image pairs by content-preserving
transforms. Then, the complementary information of reflective image pairs is
jointly extracted by an interweaved convolutional neural network (ICNN) and
hierarchically combined with a hyper-dense fusion mechanism. Based on the fused
multi-scale features, our method finally achieves a promising way of predicting
SOD. As shown in our extensive experiments, the proposed method consistently
outperforms other state-of-the-art methods on seven public datasets by a large
margin.
Comment: Submitted to ECCV 2018, 16 pages, including 6 figures and 4 tables.
arXiv admin note: text overlap with arXiv:1802.0652
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines.
From a socio-economic perspective, we take stock of the impact and legal consequences of these technical advances and point out future directions of research.