EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis
Data clustering has received a lot of attention and numerous methods,
algorithms and software packages are available. Among these techniques,
parametric finite-mixture models play a central role due to their interesting
mathematical properties and to the existence of maximum-likelihood estimators
based on expectation-maximization (EM). In this paper we propose a new mixture
model that associates a weight with each observed point. We introduce the
weighted-data Gaussian mixture and we derive two EM algorithms. The first one
considers a fixed weight for each observation. The second one treats each
weight as a random variable following a gamma distribution. We propose a model
selection method based on a minimum message length criterion, provide a weight
initialization strategy, and validate the proposed algorithms by comparing them
with several state-of-the-art parametric and non-parametric clustering
techniques. We also demonstrate the effectiveness and robustness of the
proposed clustering technique in the presence of heterogeneous data, namely
audio-visual scene analysis.
Comment: 14 pages, 4 figures, 4 tables
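The fixed-weight EM variant described above can be illustrated with a minimal one-dimensional sketch. It assumes that a weight `w` acts by dividing the component variance (so a low-weight point is treated as less reliable); the abstract does not spell out this parameterization, so it is an assumption here, as are the function names `gauss_pdf` and `weighted_em_1d`. The gamma-distributed-weight variant and the model selection step are not covered.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def weighted_em_1d(xs, ws, means, variances, priors, n_iter=50):
    """EM for a 1-D Gaussian mixture with one fixed weight per observation.

    Assumption: observation i is modeled by N(x_i; mean_k, var_k / w_i).
    """
    n, k = len(xs), len(means)
    means, variances, priors = list(means), list(variances), list(priors)
    for _ in range(n_iter):
        # E-step: responsibilities, using the weight-scaled variance.
        resp = []
        for x, w in zip(xs, ws):
            row = [priors[j] * gauss_pdf(x, means[j], variances[j] / w)
                   for j in range(k)]
            total = sum(row)
            # Fall back to uniform responsibilities if every density underflows.
            resp.append([r / total for r in row] if total > 0.0 else [1.0 / k] * k)
        # M-step: weights enter the mean and variance updates, not the priors.
        for j in range(k):
            r_sum = sum(resp[i][j] for i in range(n))
            wr_sum = sum(ws[i] * resp[i][j] for i in range(n))
            means[j] = sum(ws[i] * resp[i][j] * xs[i] for i in range(n)) / wr_sum
            scatter = sum(ws[i] * resp[i][j] * (xs[i] - means[j]) ** 2
                          for i in range(n))
            variances[j] = max(scatter / r_sum, 1e-6)  # floor avoids collapse
            priors[j] = r_sum / n
    return means, variances, priors
```

With all weights equal to 1 this reduces to ordinary EM for a Gaussian mixture. The sketch does not guard against a component that loses all of its responsibility, which would cause a division by zero.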
Forecasting People Trajectories and Head Poses by Jointly Reasoning on Tracklets and Vislets
In this work, we explore the correlation between people trajectories and
their head orientations. We argue that people trajectory and head pose
forecasting can be modelled as a joint problem. Recent approaches on trajectory
forecasting leverage short-term trajectories (aka tracklets) of pedestrians to
predict their future paths. In addition, sociological cues, such as expected
destination or pedestrian interaction, are often combined with tracklets. In
this paper, we propose MiXing-LSTM (MX-LSTM) to capture the interplay between
positions and head orientations (vislets) thanks to a joint unconstrained
optimization of full covariance matrices during the LSTM backpropagation. We
additionally exploit the head orientations as a proxy for the visual attention,
when modeling social interactions. MX-LSTM predicts pedestrians' future locations
and head poses, extending the capabilities of current approaches
on long-term trajectory forecasting. Compared to the state of the art, our
approach performs better on an extensive set of public benchmarks.
MX-LSTM is particularly effective when people move slowly, i.e. the most
challenging scenario for all other models. The proposed approach also allows
for accurate predictions on a longer time horizon.
Comment: Accepted at IEEE Transactions on Pattern Analysis and Machine Intelligence 2019. arXiv admin note: text overlap with arXiv:1805.0065
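Optimizing a full covariance matrix without constraints requires a parameterization that always yields a valid (symmetric, positive-definite) matrix. One standard way to do this, shown below as a sketch, is a log-Cholesky mapping: the unconstrained parameters fill a lower-triangular factor whose diagonal is exponentiated. Whether MX-LSTM uses exactly this packing order is an assumption, and the function name `covariance_from_unconstrained` is hypothetical.

```python
import math


def covariance_from_unconstrained(params, dim):
    """Map dim*(dim+1)/2 unconstrained reals to a positive-definite matrix.

    Parameters are read row by row into a lower-triangular factor L;
    diagonal entries pass through exp() so they are strictly positive.
    The result is L @ L^T.
    """
    expected = dim * (dim + 1) // 2
    if len(params) != expected:
        raise ValueError(f"expected {expected} parameters, got {len(params)}")
    lower = [[0.0] * dim for _ in range(dim)]
    idx = 0
    for i in range(dim):
        for j in range(i + 1):
            lower[i][j] = math.exp(params[idx]) if i == j else params[idx]
            idx += 1
    return [[sum(lower[i][k] * lower[j][k] for k in range(dim))
             for j in range(dim)]
            for i in range(dim)]
```

Because any real-valued input maps to a valid covariance, a network can emit these parameters directly and be trained with plain gradient descent.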
Pixel Features for Self-organizing Map Based Detection of Foreground Objects in Dynamic Environments
Among current foreground detection algorithms for video sequences, methods based on self-organizing maps are gaining relevance. In this work we propose a probabilistic self-organizing map based model, which uses a uniform distribution to represent the foreground. A suitable set of characteristic pixel features is chosen to train the probabilistic model. Our approach has been compared to some competing methods on a test set of benchmark videos, with favorable results.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
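Representing the foreground with a uniform distribution leads naturally to a Bayes decision between a background density and a constant. The sketch below assumes that pixel features are scaled to the unit cube (so the uniform density is 1) and that the background is an equally weighted mixture of isotropic Gaussians centered at the map's prototypes. Both choices, and the names `background_density` and `foreground_probability`, are illustrative assumptions rather than the model from the paper.

```python
import math


def background_density(pixel, prototypes, var):
    """Equal-weight mixture of isotropic Gaussians centered at the prototypes."""
    dim = len(pixel)
    norm = (2.0 * math.pi * var) ** (-dim / 2.0)
    total = 0.0
    for proto in prototypes:
        sq_dist = sum((p - m) ** 2 for p, m in zip(pixel, proto))
        total += norm * math.exp(-0.5 * sq_dist / var)
    return total / len(prototypes)


def foreground_probability(pixel, prototypes, var=0.01, prior_fg=0.5):
    """Posterior probability that a pixel belongs to the foreground.

    Assumption: features lie in [0, 1]^d, so the uniform foreground
    density equals 1 everywhere in the feature space.
    """
    uniform_density = 1.0
    fg = prior_fg * uniform_density
    bg = (1.0 - prior_fg) * background_density(pixel, prototypes, var)
    return fg / (fg + bg)
```

A pixel close to a learned prototype gets a low foreground probability, while a pixel far from every prototype is explained better by the uniform component.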
Human Motion Trajectory Prediction: A Survey
With growing numbers of intelligent autonomous systems in human environments,
the ability of such systems to perceive, understand and anticipate human
behavior becomes increasingly important. Specifically, predicting future
positions of dynamic agents and planning considering such predictions are key
tasks for self-driving vehicles, service robots and advanced surveillance
systems. This paper provides a survey of human motion trajectory prediction. We
review, analyze and structure a large selection of work from different
communities and propose a taxonomy that categorizes existing methods based on
the motion modeling approach and level of contextual information used. We
provide an overview of the existing datasets and performance metrics. We
discuss limitations of the state of the art and outline directions for further
research.
Comment: Submitted to the International Journal of Robotics Research (IJRR), 37 pages
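The abstract does not name specific performance metrics. As an illustration, two metrics commonly reported in trajectory prediction work are the average displacement error (ADE, mean Euclidean distance over all predicted time steps) and the final displacement error (FDE, distance at the last step). The function name `displacement_errors` is a hypothetical choice for this sketch.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for two equally long sequences of (x, y) positions."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(predicted, ground_truth)]
    return sum(dists) / len(dists), dists[-1]
```

For example, a prediction that stays on the x-axis while the true path moves diagonally accumulates error at each step, so its FDE exceeds its ADE.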
SUR-Net: Predicting the Satisfied User Ratio Curve for Image Compression with Deep Learning
The Satisfied User Ratio (SUR) curve for a lossy image compression scheme, e.g., JPEG, characterizes the probability distribution of the Just Noticeable Difference (JND) level, the smallest distortion level that can be perceived by a subject. We propose the first deep learning approach to predict such SUR curves. Instead of the direct approach of regressing the SUR
curve itself for a given reference image, our model is trained on pairs of images, original and compressed. Relying on a Siamese
Convolutional Neural Network (CNN), feature pooling, a fully connected regression-head, and transfer learning, we achieved
a good prediction performance. Experiments on the MCL-JCI dataset showed a mean Bhattacharyya distance between the
predicted and the original JND distributions of only 0.072
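The two quantities in this abstract can be sketched for discrete distributions. The Bhattacharyya distance between probability vectors p and q is -ln(sum_i sqrt(p_i * q_i)). The SUR curve is taken here as the complementary cumulative distribution of the JND level, under the assumed convention that a user is satisfied at a distortion level strictly below their JND. Whether the paper evaluates discrete histograms or fitted parametric distributions is not stated in the abstract, and both function names are hypothetical.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    coefficient = min(coefficient, 1.0)  # guard against rounding above 1
    if coefficient == 0.0:
        return math.inf  # no overlap at all
    return -math.log(coefficient)


def sur_curve(jnd_pmf):
    """SUR at each distortion level k, taken as P(JND > k).

    Assumption: a user is satisfied at level k when their JND lies above k.
    """
    curve, cumulative = [], 0.0
    for prob in jnd_pmf:
        cumulative += prob
        curve.append(max(0.0, 1.0 - cumulative))
    return curve
```

Identical distributions give a distance of 0, and the distance grows as the overlap between the two distributions shrinks.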
Single Image Depth Prediction Made Better: A Multivariate Gaussian Take
Neural-network-based single image depth prediction (SIDP) is a challenging
task where the goal is to predict the scene's per-pixel depth at test time.
Since the problem, by definition, is ill-posed, the fundamental goal is to come
up with an approach that can reliably model the scene depth from a set of
training examples. In the pursuit of perfect depth estimation, most existing
state-of-the-art learning techniques predict a single scalar depth value
per-pixel. Yet, it is well-known that the trained model has accuracy limits and
can predict imprecise depth. Therefore, an SIDP approach must be mindful of the
expected depth variations in the model's prediction at test time. Accordingly,
we introduce an approach that performs continuous modeling of per-pixel depth,
where we can predict and reason about the per-pixel depth and its distribution.
To this end, we model per-pixel scene depth using a multivariate Gaussian
distribution. Moreover, unlike existing uncertainty modeling methods in the
same spirit, which assume that per-pixel depths are independent, we
introduce per-pixel covariance modeling that encodes each pixel's depth dependency w.r.t.
all the scene points. Unfortunately, per-pixel depth covariance modeling leads
to a computationally expensive continuous loss function, which we solve
efficiently using the learned low-rank approximation of the overall covariance
matrix. Notably, when tested on benchmark datasets such as KITTI, NYU, and
SUN-RGB-D, the SIDP model obtained by optimizing our loss function shows
state-of-the-art results. Our method's accuracy (named MG) is among the top on
the KITTI depth-prediction benchmark leaderboard.
Comment: Accepted to IEEE/CVF CVPR 2023. Draft info: 17 pages, 13 figures, 9 tables
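A low-rank covariance keeps the Gaussian negative log-likelihood cheap because the inverse and determinant never require a dense n-by-n factorization. The sketch below uses the simplest case, a covariance of the form sigma2 * I + u u^T (isotropic diagonal plus a rank-1 term), and applies the Sherman-Morrison formula and the matrix determinant lemma. The paper's learned approximation is presumably of higher rank and parameterized differently; this structure and the name `low_rank_gaussian_nll` are assumptions for illustration.

```python
import math


def low_rank_gaussian_nll(residual, sigma2, u):
    """NLL of a zero-mean Gaussian with covariance sigma2 * I + u u^T.

    residual: list of (target - prediction) values, one per pixel.
    Runs in O(n) time instead of the O(n^3) of a dense solve.
    """
    n = len(residual)
    rr = sum(r * r for r in residual)
    ur = sum(a * r for a, r in zip(u, residual))
    uu = sum(a * a for a in u)
    # Sherman-Morrison: r^T Sigma^{-1} r without forming Sigma.
    quad = rr / sigma2 - ur * ur / (sigma2 * (sigma2 + uu))
    # Matrix determinant lemma: log det(sigma2 * I + u u^T).
    logdet = n * math.log(sigma2) + math.log1p(uu / sigma2)
    return 0.5 * (quad + logdet + n * math.log(2.0 * math.pi))
```

Setting `u` to all zeros recovers the usual independent per-pixel Gaussian loss, which makes the role of the covariance term easy to see.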
PeopleNet: A Novel People Counting Framework for Head-Mounted Moving Camera Videos
Traditional crowd counting techniques (optical flow or feature matching) have largely been replaced by deep learning (DL) models because they lack automatic feature extraction and give low-precision results. Most of these models were tested on surveillance-scene crowd datasets captured by stationary cameras. Counting people in videos shot with a head-mounted moving camera is very challenging, mainly because the temporal information of the moving crowd is mixed with the induced camera motion. This study proposes a transfer-learning-based PeopleNet model to tackle this problem. To this end, we modify the standard VGG16 model by disabling its top convolutional blocks and replacing its standard fully connected layers with new fully connected and dense layers. The strong transfer learning capability of the VGG16 network lets PeopleNet produce good-quality density maps, resulting in highly accurate crowd estimation. The performance of the proposed model was tested on a self-generated image database prepared from moving-camera video clips, as no public benchmark dataset exists for this task. The proposed framework gives promising results on various crowd categories such as dense, sparse, and average. To assess versatility, we performed self- and cross-evaluation on various crowd counting models and datasets, which demonstrates the importance of the PeopleNet model in the defense of society