Temporal Saliency Adaptation in Egocentric Videos

Panagiotis Linardos (1), Eva Mohedano (2), Monica Cherto (2), Cathal Gurrin (2), and Xavier Giro-i-Nieto (1)

(1) Universitat Politecnica de Catalunya, 08034 Barcelona, Catalonia/Spain
(2) Dublin City University, Glasnevin, Whitehall, Dublin 9, Ireland
linardos.akis@gmail.com, xavier.giro@upc.edu

arXiv:1808.09559v2 [cs.CV] 4 Sep 2018

Abstract. This work adapts a deep neural model for image saliency prediction to the temporal domain of egocentric video. We compute the saliency map for each video frame, first with an off-the-shelf model trained on static images, and second by adding convolutional or conv-LSTM layers trained on a dataset for video saliency prediction. We study each configuration on EgoMon, a new dataset made of seven egocentric videos recorded by three subjects in both free-viewing and task-driven setups. Our results indicate that the temporal adaptation is beneficial when the viewer is not moving and observes the scene from a narrow field of view. Encouraged by this observation, we compute and publish the saliency maps for the EPIC Kitchens dataset, in which viewers are cooking.

1 Motivation

Saliency prediction refers to the task of estimating which regions of an image have a higher probability of being observed by a viewer. The result of such predictions is expressed in the form of a saliency map (heat map), in which higher values are aligned with the pixel locations that have higher probabilities of attracting the viewer's attention. This information can be used for multiple applications, such as higher-quality coding of the salient regions [22], spatial-aware feature weighting [15], or image retargeting [19]. This task has been extensively explored in setups where the viewer is asked to observe an image [10,7,12,2] or video [20] depicting a scene.

Our work focuses on the case of egocentric vision, which presents the particularity of having the viewer immersed in the scene. In this case, the user is not only free to fixate the gaze on any region, but also to change the framing of the scene with head motion. When collecting datasets, this setup also differs from others in which the same image or video is shown to many viewers, as in this case each recording and scene is unique to each user. Egocentric saliency prediction has been studied in the past [5,18], a research line that we extend by assessing a state-of-the-art model for image saliency prediction in this egocentric video setup. We developed our study on a new egocentric video dataset, named EgoMon, and added a temporal adaptation on top of the SalGAN model [14] for image saliency prediction. We observe that the temporal saliency adaptation improves performance when the viewer is engaged in a task and has a narrow field of view, whereas performance drops when the viewer is simply free-viewing an open scene. Encouraged by these results, we have computed the saliency maps pertaining to the Epic Kitchens object detection challenge [3]. We believe that these data can be valuable for third-party research focusing on other tasks such as object detection [15] or video summarization [21]. The EgoMon dataset, the Epic Kitchens saliency maps and the trained models are publicly available at https://imatge-upc.github.io/saliency-2018-videosalgan/.

2 The EgoMon Gaze and Video Dataset

The recording of an egocentric video dataset requires a wearable camera, but also a wearable eye tracker. This specificity in the hardware, together with the privacy constraints, limits the availability of public datasets in this domain. The GTEA Gaze dataset was collected using Tobii eye-tracker glasses [5]. The more recent version of the dataset (EGTEA+) contains 28 hours of cooking activities from 86 unique sessions of 32 subjects. Similarly, the University of Texas at Austin Egocentric (UT Ego) Dataset [18] was collected using the Looxcie wearable (head-mounted) camera. It contains four videos, each 3-5 hours long and captured in a natural, uncontrolled setting. The videos depict a variety of activities such as eating, shopping, attending a lecture, driving, or cooking.

In this work we introduce EgoMon, a new egocentric gaze and video dataset. Data was recorded in Dublin (Ireland) by three different individuals wearing a pair of Tobii glasses equipped with a monocular eye tracker. The dataset is delivered as a collection of seven videos with an average length of 30 minutes. EgoMon includes both free-viewing activities (a walk in a park, walking to the office, a walk in the botanic gardens, a bus ride) and task-oriented activities (cooking an omelette, listening to an oral presentation and playing cards). In the case of the botanic gardens, an additional sequence of images captured every 30 seconds with a Narrative clip camera is also provided.

3 Deep Neural Models for Temporal Saliency Adaptation

Video saliency prediction with deep neural networks has largely adapted the architectures proposed for video action recognition. Two-stream networks [17] combining video frames and optical flow were applied in [1] for saliency prediction, while temporal sequences modeled with RNNs [4] were adopted in [11]. The authors of the largest dataset for video saliency prediction, the DHF1K (Dynamic Human Fixation 1K) dataset [20], also trained a deep neural model based on ConvLSTM layers to predict the saliency maps. Similarly, the authors of [6] propose a complex convolutional architecture with four branches fused with a temporal-aware ConvLSTM layer. Regarding egocentric saliency prediction with deep models, Huang et al. [9] propose to model the bottom-up and top-down attention mechanisms on the GTEA Gaze dataset.
Their approach combines a saliency prediction with a task-dependent attention module, which explicitly models the temporal shift of gaze fixations during different manipulation tasks.

Our proposed architecture starts by processing each video frame separately with SalGAN [14], an image-based saliency prediction model pre-trained on the SALICON dataset [8]. SalGAN outputs a sequence of static saliency maps, which are fed into one of two types of adaptation layers: 128 convolutional filters [13] of kernel size 3x3 and padding of 1, or their temporal-aware counterpart, a ConvLSTM [16] with the same convolutional parameters. Their parameters were estimated from 700 training videos from the DHF1K dataset [20]. An SGD optimizer with 0.9 momentum was used; the learning rate started at 0.00001 and decayed by a factor of 0.1. A weight decay of 0.0001 was also applied.

Fig. 1. Architecture of the dynamic model. The static model uses plain convolutions without the LSTM temporal recurrence.

4 Experimentation

The proposed model was assessed first on the same DHF1K dataset [20] on which the conv and convLSTM layers were trained. Afterwards, the model was assessed on the proposed EgoMon dataset to draw our conclusions in the egocentric domain.

Table 1 indicates that, surprisingly, the off-the-shelf (frame-based) SalGAN model [14] outperformed the state-of-the-art model on the DHF1K [20] dataset in the AUC-J, sAUC and NSS metrics, although not in CC and SIM. On the other hand, the quality of the prediction decreases when the conv or convLSTM layers are trained on top, which indicates that the domain adaptation is damaging the performance of the original SalGAN.

Table 1. Performance on the DHF1K dataset.

              AUC-J ↑  sAUC ↑  NSS ↑  CC ↑   SIM ↑
SoA [20]      0.885    0.553   2.259  0.415  0.311
SalGAN [14]   0.930    0.834   2.468  0.372  0.264
+ conv        0.743    0.723   2.208  0.303  0.261
+ convLSTM    0.744    0.722   2.246  0.302  0.260

Table 2. NSS metric across the DHF1K and EgoMon datasets.

              DHF1K  EgoMon
SalGAN [14]   2.468  2.079
+ conv        2.208  1.250
+ convLSTM    2.246  1.247

Table 3. Performance on different EgoMon tasks (NSS metric).

Free-viewing recordings (bottom-up saliency)
              bus ride  botanical gardens  dcu park  walking office  AVERAGE
SalGAN [14]   1.618     1.182              4.374     3.435           2.652
+ conv        0.947     0.846              0.683     0.745           0.805
+ convLSTM    0.827     0.576              1.172     1.040           0.904

Task-driven recordings (top-down saliency)
              playing cards  presentation  tortilla  AVERAGE
SalGAN [14]   0.967          1.360         1.618     1.315
+ conv        1.114          1.966         2.002     1.694
+ convLSTM    1.141          1.897         2.077     1.705

Table 2 indicates an even larger loss of performance when adding these adaptation layers on the EgoMon dataset. Nevertheless, the more detailed look provided in Table 3 shows that the adaptation layers are actually beneficial in those scenes where the user is engaged in an activity.

Qualitative analysis of the saliency maps showed that the convolutional layers (with and without temporal information) had the effect of reinforcing the higher-probability pixels at the expense of darkening the lower ones. This effect is beneficial in the case of task-driven activities, because the scene tends to be constant in time and the region of interest is localized in space. However, free-viewing tasks contain changing scenes with much sparser saliency maps.

5 Acknowledgements

Panagiotis Linardos and Monica Cherto were supported by the Erasmus+ Program from the European Union for student mobility. This research was partially supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund (ERDF) under contract TEC2016-75976-R. We acknowledge the support of NVIDIA Corporation for the donation of GPUs.

References

1. Bak, C., Kocak, A., Erdem, E., Erdem, A.: Spatio-temporal saliency networks for dynamic saliency prediction. IEEE Transactions on Multimedia (2017)
2. Borji, A.: Boosting bottom-up and top-down visual features for saliency estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
3. Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., Wray, M.: Scaling egocentric vision: The epic-kitchens dataset. In: European Conference on Computer Vision (ECCV) (2018)
4. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2625–2634 (2015)
5. Fathi, A., Li, Y., Rehg, J.M.: Learning to recognize daily actions using gaze. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7572 LNCS(PART 1), 314–327 (2012)
6. Gorji, S., Clark, J.J.: Going from image to video saliency: Augmenting image salience with dynamic attentional push. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7501–7511 (2018)
7. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Neural Information Processing Systems (NIPS) (2006)
8. Huang, X., Shen, C., Boix, X., Zhao, Q.: Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In: IEEE International Conference on Computer Vision (ICCV) (2015)
9. Huang, Y., Cai, M., Li, Z., Sato, Y.: Predicting gaze in egocentric video by learning task-dependent attention transition. arXiv preprint arXiv:1803.09125 (2018)
10. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (20), 1254–1259 (1998)
11. Jiang, L., Xu, M., Wang, Z.: Predicting video saliency with object-to-motion cnn and two-layer convolutional lstm. arXiv preprint arXiv:1709.06316 (2017)
12. Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: IEEE International Conference on Computer Vision (ICCV) (2009)
13. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
14. Pan, J., Ferrer, C.C., McGuinness, K., O'Connor, N.E., Torres, J., Sayrol, E., Giro-i-Nieto, X.: SalGAN: Visual Saliency Prediction with Generative Adversarial Networks (2017)
15. Reyes, C., Mohedano, E., McGuinness, K., O'Connor, N.E., Giro-i-Nieto, X.: Where is my phone?: Personal object retrieval from egocentric images. In: Proceedings of the First Workshop on Lifelogging Tools and Applications. pp. 55–62. ACM (2016)
16. Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., Woo, W.: Convolutional LSTM network: A machine learning approach for precipitation nowcasting. CoRR abs/1506.04214 (2015), http://arxiv.org/abs/1506.04214
17. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. pp. 568–576 (2014)
18. Su, Y.C., Grauman, K.: Detecting engagement in egocentric video. In: European Conference on Computer Vision. pp. 454–471. Springer (2016)
19. Theis, L., Korshunova, I., Tejani, A., Huszár, F.: Faster gaze prediction with dense networks and fisher pruning. arXiv preprint arXiv:1801.05787 (2018)
20. Wang, W., Shen, J., Guo, F., Cheng, M.M., Borji, A.: Revisiting video saliency: A large-scale benchmark and a new model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4894–4903 (2018)
21. Xu, J., Mukherjee, L., Li, Y., Warner, J., Rehg, J.M., Singh, V.: Gaze-enabled egocentric video summarization via constrained submodular maximization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2235–2244 (2015)
22. Zhu, S., Xu, Z.: Spatiotemporal visual saliency guided perceptual high efficiency video coding with neural network. Neurocomputing 275, 511–522 (2018)
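Note on the NSS metric. The experiments report NSS (Normalized Scanpath Saliency) without stating how it is computed. The sketch below shows the commonly used definition: the predicted map is normalized to zero mean and unit standard deviation, and the normalized values are averaged over the fixated pixels. It is an illustration under stated assumptions and not the evaluation code used for the tables; the function name `nss`, the nested-list map layout, the use of the population standard deviation, and the zero score for a constant map are all choices made here for the example.

```python
import math


def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency of one predicted map.

    saliency_map: 2-D list of floats, indexed as saliency_map[row][col].
    fixations: list of (row, col) gaze positions recorded for the frame.
    """
    if not fixations:
        raise ValueError("at least one fixation is required")
    values = [v for row in saliency_map for v in row]
    mean = sum(values) / len(values)
    # Population standard deviation; some toolboxes use the sample
    # estimate instead, which changes the score slightly (assumption).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return 0.0  # constant map: treated as uninformative (assumption)
    # Average the z-scored prediction over the fixated pixels.
    total = sum((saliency_map[r][c] - mean) / std for r, c in fixations)
    return total / len(fixations)


if __name__ == "__main__":
    toy_map = [[0.0, 0.0], [0.0, 1.0]]  # hypothetical 2x2 prediction
    print(nss(toy_map, [(1, 1)]))  # fixation on the predicted peak
    print(nss(toy_map, [(0, 0)]))  # fixation on a low-saliency pixel
```

Under this definition a positive score means that the recorded fixations fall on pixels predicted as more salient than the map average, and a score near zero means the prediction is no better than chance, which is how the per-recording values in Table 3 can be read.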
