Benchmark evaluation of object segmentation proposal
Abstract. In this research, we provide an in-depth analysis and evaluation of four recent segmentation proposal algorithms on the PASCAL VOC benchmark. The principal goal of this study is to investigate these object detection proposal methods in an unbiased evaluation framework.
Despite their widespread application, the relative strengths and weaknesses of different segmentation proposal methods are not fully clear from previous work. This thesis provides additional insights into segmentation proposal methods. To evaluate the quality of the proposals, we plot recall as a function of the average number of regions per image. We also discuss the PASCAL VOC 2012 object categories on which the methods perform well and the instances in which they suffer from low recall. Experimental evaluation reveals that, despite operating differently, all segmentation proposal methods generally share similar strengths and weaknesses. The analysis also shows how one could select a proposal generation method based on object attributes.
Finally, we show that recall can be improved by merging the proposals of different algorithms. Experimental evaluation shows that this merging approach outperforms the individual algorithms in terms of both precision and recall.
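The evaluation protocol described in this abstract (recall as a function of the number of regions, and merging proposals from several methods) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the 0.5 Jaccard threshold, the toy 4x4 masks, and the round-robin merging rule are not taken from the thesis, which does not specify them in the abstract.

```python
import numpy as np

def mask_iou(a, b):
    """Jaccard overlap between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def recall_at_k(gt_masks, proposals, k, iou_thresh=0.5):
    """Fraction of ground-truth objects matched by one of the top-k proposals."""
    if len(gt_masks) == 0:
        return 0.0
    top_k = proposals[:k]
    hits = sum(
        any(mask_iou(gt, p) >= iou_thresh for p in top_k)
        for gt in gt_masks
    )
    return hits / len(gt_masks)

def merge_proposals(*ranked_lists):
    """Interleave ranked proposal lists round-robin (a hypothetical merging rule)."""
    merged = []
    for rank in range(max(len(r) for r in ranked_lists)):
        for r in ranked_lists:
            if rank < len(r):
                merged.append(r[rank])
    return merged

# Toy image with one object: method A misses it, method B finds it.
gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True
miss = np.zeros((4, 4), dtype=bool)
miss[2:, 2:] = True
method_a, method_b = [miss], [gt.copy()]

recall_a = recall_at_k([gt], method_a, k=1)
recall_merged = recall_at_k([gt], merge_proposals(method_a, method_b), k=2)
print(recall_a, recall_merged)
```

Sweeping k and averaging over images would give a recall curve of the kind the abstract refers to. Note that merging raises the number of regions per image, so a fair comparison is made at equal k.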
Transformer Networks for Trajectory Forecasting
Most recent successes in forecasting people's motion are based on LSTM
models, and most recent progress has been achieved by modelling the social
interaction among people and people's interaction with the scene. We question
the use of the LSTM models and propose the novel use of Transformer Networks
for trajectory forecasting. This is a fundamental switch from the sequential
step-by-step processing of LSTMs to the only-attention-based memory mechanisms
of Transformers. In particular, we consider both the original Transformer
Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art
on all natural language processing tasks. Our proposed Transformers predict the
trajectories of the individual people in the scene. These are "simple" models
because each person is modelled separately, without any complex human-human or
scene interaction terms. In particular, the TF model without bells and whistles
yields the best score on the largest and most challenging trajectory
forecasting benchmark of TrajNet. Additionally, its extension which predicts
multiple plausible future trajectories performs on par with more engineered
techniques on the 5 datasets of ETH + UCY. Finally, we show that Transformers
can deal with missing observations, as may be the case with real sensor
data. Code is available at https://github.com/FGiuliari/Trajectory-Transformer.
Comment: 18 pages, 3 figures
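To make the contrast with step-by-step LSTM processing concrete, the following is a minimal sketch of single-head scaled dot-product self-attention applied to one person's observed trajectory. It is not the authors' model: the embedding size, the random weights, and the toy input are assumptions, and positional encoding, multiple heads, and the decoder are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x of shape (T, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 16                                   # hypothetical embedding size
obs = rng.normal(size=(8, 2))                  # 8 observed (x, y) steps, toy data
w_embed = rng.normal(size=(2, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(obs @ w_embed, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Every time step attends to every other observed step in a single matrix product, rather than passing a hidden state forward one step at a time.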
Head Pose Estimation and Trajectory Forecasting
Human activity recognition and forecasting can be used as a primary cue for scene understanding. Acquiring details from the scene has vast applications in different fields such as computer vision, robotics and, more recently, smart lighting. In this work, we present the use of the Visual Frustum of Attention (VFOA) for scene understanding and activity forecasting. The VFOA identifies the volume of a scene where fixations of a person may occur; it can be inferred from the head pose estimation, and it is crucial in situations where precise gaze information cannot be retrieved, such as unconstrained indoor scenes or surveillance scenarios. Here we present a framework based on Faster RCNN, which introduces a branch in the network architecture dedicated to head pose estimation. The key idea is to leverage the presence of the person's body to better infer the head pose, through a joint optimization process. Additionally, we enrich the Town Center dataset with head pose labels, promoting further study on this topic. Results on this novel benchmark and ablation studies on other task-specific datasets support our idea and confirm the importance of body cues in contextualizing head pose estimation. Secondly, we illustrate the use of the VFOA in more general trajectory forecasting. We present two approaches: 1) a handcrafted energy-function-based approach and 2) a data-driven approach. First, drawing on social theories, we propose a prediction model for estimating the future movement of pedestrians by leveraging their head orientation. This cue, when produced by an oracle and injected into a novel socially-based energy minimization approach, yields state-of-the-art performance on four different forecasting benchmarks, without relying on additional information such as expected destination and desired speed, which most current forecasting techniques assume to be known beforehand.
Our approach uses the head pose estimation for two aims: 1) to define a view frustum of attention, highlighting the people a given subject is more interested in, in order to avoid collisions; 2) to give a short-term estimate of the desired destination point. Moreover, we show that when the head pose estimation is given by a real detector, the performance decreases but still remains at the level of the top-scoring forecasting systems. Secondly, recent approaches to trajectory forecasting use tracklets to predict the future positions of pedestrians, exploiting Long Short-Term Memory (LSTM) architectures. This paper shows that adding vislets, that is, short sequences of head pose estimations, significantly improves trajectory forecasting performance. We then propose to use vislets in a novel framework called MX-LSTM, which captures the interplay between tracklets and vislets through a joint unconstrained optimization of full covariance matrices during the LSTM backpropagation. At the same time, MX-LSTM predicts the future head poses, extending the standard capabilities of long-term trajectory forecasting approaches. Finally, we illustrate a practical application by implementing an Invisible Light Switch (ILS). Inside ILS, detection, head pose estimation and recognition of current and forecast human activities will allow advanced occupancy detection, i.e. a control switch which turns lights on when people are in the environment or about to enter it. Furthermore, this work joins research in smart lighting and computer vision towards the ILS, which will bring both technologies together. The resulting light management system will be aware of the 3D geometry, light calibration, and current and forecast activity maps. The user will be allowed to set up an illumination pattern and move around in the environment (e.g. through office rooms or warehouse aisles).
The system will maintain the lighting (given available light sources) for the user across the scene parts and across the daylight changes. Importantly, the system will turn lights off in areas not visible by the user, therefore providing energy saving in the invisi
Human-centric light sensing and estimation from RGBD images: The invisible light switch
Lighting design in indoor environments is of primary importance for at least
two reasons: 1) people should perceive adequate light; 2) an effective
lighting design means consistent energy saving. We present the Invisible Light
Switch (ILS) to address both aspects. ILS dynamically adjusts the room
illumination level to save energy while keeping the users' perceived light
level constant, so the energy saving is invisible to them. Our
proposed ILS leverages a radiosity model to estimate the light level which is
perceived by a person within an indoor environment, taking into account the
person's position and her/his viewing frustum (head pose). ILS may therefore dim
those luminaires which are not seen by the user, resulting in an effective
energy saving, especially in large open offices (where light may otherwise be
ON everywhere for a single person). To quantify the system performance, we have
collected a new dataset where people wear luxmeter devices while working in
office rooms. The luxmeters measure the amount of light (in lux) reaching
people's gaze, which we consider a proxy for their illumination level perception.
Our initial results are promising: in a room with 8 LED luminaires, the energy
consumption in a day may be reduced from 18585 to 6206 watts with ILS
(currently needing 1560 watts for operations). While doing so, the
perceived lighting drops by just 200 lux, a value considered negligible
when the original illumination level is above 1200 lux, as is normally the case
in offices.
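The reported figures can be turned into a relative saving with a short calculation. The abstract does not say whether the 1560 watts needed for operations are already included in the 6206 figure, so the sketch below computes both readings; the variable names are illustrative and the units are kept as reported.

```python
# Figures as reported in the abstract (units as given there).
baseline = 18585   # daily consumption without ILS
with_ils = 6206    # daily consumption with ILS
overhead = 1560    # consumption of the ILS system itself

# Reading 1 (assumption): the overhead is already included in the ILS figure.
saving_excl = (baseline - with_ils) / baseline

# Reading 2 (assumption): the overhead must be added on top of the ILS figure.
saving_incl = (baseline - (with_ils + overhead)) / baseline

print(f"saving, overhead included in ILS figure: {saving_excl:.1%}")
print(f"saving, overhead added on top:           {saving_incl:.1%}")
```

Under either reading the reduction is well above half of the baseline: roughly 66.6% in the first case and 58.2% in the second.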
MX-LSTM: mixing tracklets and vislets to jointly forecast trajectories and head poses
Recent approaches to trajectory forecasting use tracklets to predict the
future positions of pedestrians, exploiting Long Short-Term Memory (LSTM)
architectures. This paper shows that adding vislets, that is, short sequences
of head pose estimations, significantly improves trajectory
forecasting performance. We then propose to use vislets in a novel framework
called MX-LSTM, capturing the interplay between tracklets and vislets thanks to
a joint unconstrained optimization of full covariance matrices during the LSTM
backpropagation. At the same time, MX-LSTM predicts the future head poses,
increasing the standard capabilities of the long-term trajectory forecasting
approaches. With standard head pose estimators and an attention-based social
pooling, MX-LSTM sets a new trajectory forecasting state of the art on all
the considered datasets (Zara01, Zara02, UCY, and TownCentre), with a dramatic
margin when the pedestrians slow down, a case where most forecasting
approaches struggle to provide an accurate solution.
Comment: 10 pages, 3 figures to appear in CVPR 201
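The abstract mentions an unconstrained optimization of full covariance matrices but does not say how the matrices are parameterized. One standard way to achieve this, shown here purely as an assumption and not as the paper's method, is to let the network output the entries of a lower-triangular Cholesky factor and exponentiate its diagonal, so that any real-valued output maps to a valid covariance. The function name and the 4-dimensional toy case are hypothetical.

```python
import numpy as np

def covariance_from_unconstrained(theta, dim):
    """Map dim*(dim+1)/2 unconstrained reals to a symmetric positive-definite matrix."""
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = theta          # fill the lower triangle row by row
    diag = np.diag_indices(dim)
    L[diag] = np.exp(L[diag])                # strictly positive diagonal
    return L @ L.T                           # Sigma = L L^T

rng = np.random.default_rng(0)
dim = 4                                      # toy size, e.g. position plus head-pose terms
theta = rng.normal(size=dim * (dim + 1) // 2)
sigma = covariance_from_unconstrained(theta, dim)

print(np.allclose(sigma, sigma.T), np.linalg.eigvalsh(sigma).min() > 0)
```

Because every choice of theta gives a valid covariance, gradient descent can update theta freely during backpropagation without projection or clipping steps.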
Data-Efficient Training of CNNs and Transformers with Coresets: A Stability Perspective
Coreset selection is among the most effective ways to reduce the training
time of CNNs; however, little is known about how the resultant models will
behave under variations of the coreset size and the choice of datasets and models.
Moreover, given the recent paradigm shift towards transformer-based models, it
is still an open question how coreset selection would impact their performance.
There are several similar intriguing questions that need to be answered for a
wide acceptance of coreset selection methods, and this paper attempts to answer
some of these. We present a systematic benchmarking setup and perform a
rigorous comparison of different coreset selection methods on CNNs and
transformers. Our investigation reveals that under certain circumstances,
random selection of subsets is more robust and stable when compared with the
SOTA selection methods. We demonstrate that the conventional concept of uniform
subset sampling across the various classes of the data is not the appropriate
choice. Rather, samples should be adaptively chosen based on the complexity of
the data distribution for each class. Transformers are generally pretrained on
large datasets, and we show that for certain target datasets, this pretraining
helps keep their performance stable even at very small coreset sizes. We further show that
when no pretraining is done or when the pretrained transformer models are used
with non-natural images (e.g. medical data), CNNs tend to generalize better
than transformers at even very small coreset sizes. Lastly, we demonstrate that
in the absence of the right pretraining, CNNs are better at learning the
semantic coherence between spatially distant objects within an image, and these
tend to outperform transformers at almost all choices of the coreset size.
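The difference between uniform per-class sampling and a class-adaptive budget can be sketched as follows. This is only an illustration of the idea: the per-class "complexity" scores are hypothetical placeholders, the abstract does not state how complexity is measured, and the selection inside each class is plain random sampling rather than any of the benchmarked coreset methods.

```python
import numpy as np

def class_budgets(scores, total):
    """Split a total sample budget across classes in proportion to per-class scores."""
    scores = np.asarray(scores, dtype=float)
    raw = scores / scores.sum() * total
    budgets = np.floor(raw).astype(int)
    # Hand out the remaining samples to the largest fractional remainders.
    leftover = int(total - budgets.sum())
    order = np.argsort(-(raw - budgets))
    budgets[order[:leftover]] += 1
    return budgets

def random_subset(labels, budgets, rng):
    """Pick budgets[c] random sample indices from each class c."""
    labels = np.asarray(labels)
    picked = []
    for cls, n in enumerate(budgets):
        idx = np.flatnonzero(labels == cls)
        picked.append(rng.choice(idx, size=min(int(n), idx.size), replace=False))
    return np.concatenate(picked)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)            # toy dataset: 3 classes, 20 samples each

uniform = class_budgets([1, 1, 1], total=12)     # same budget for every class
adaptive = class_budgets([1, 2, 3], total=12)    # hypothetical complexity scores
subset = random_subset(labels, adaptive, rng)
print(uniform, adaptive, np.bincount(labels[subset]))
```

With equal scores the budget reduces to the conventional uniform split, so the same two functions cover both the baseline and the adaptive variant.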
Forecasting People Trajectories and Head Poses by Jointly Reasoning on Tracklets and Vislets
In this work, we explore the correlation between people's trajectories and
their head orientations. We argue that trajectory and head pose
forecasting can be modelled as a joint problem. Recent approaches to trajectory
forecasting leverage short-term trajectories (aka tracklets) of pedestrians to
predict their future paths. In addition, sociological cues, such as expected
destination or pedestrian interaction, are often combined with tracklets. In
this paper, we propose MiXing-LSTM (MX-LSTM) to capture the interplay between
positions and head orientations (vislets) thanks to a joint unconstrained
optimization of full covariance matrices during the LSTM backpropagation. We
additionally exploit the head orientations as a proxy for visual attention
when modeling social interactions. MX-LSTM predicts future pedestrian locations
and head poses, extending the standard capabilities of current approaches
to long-term trajectory forecasting. Compared to the state of the art, our
approach shows better performance on an extensive set of public benchmarks.
MX-LSTM is particularly effective when people move slowly, i.e. the most
challenging scenario for all other models. The proposed approach also allows
for accurate predictions on a longer time horizon.
Comment: Accepted at IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE
INTELLIGENCE 2019. arXiv admin note: text overlap with arXiv:1805.0065