Sociedad Argentina de Informática e Investigación Operativ

Aggio, Santiago L.

Crisol, Tomás

Ermantraut, Joel

Iparraguirre, Javier

Rostagno, Adrián

Servicio de Difusión de la Creación Intelectual

Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos

Tomás Crisol1,*, Joel Ermantraut1,**, Adrián Rostagno1, Santiago L. Aggio1,2, and Javier Iparraguirre1

1 Universidad Tecnológica Nacional, Facultad Regional Bahía Blanca, Argentina
{tomascrisol12,joelermantraut}@gmail.com
arostag@frbb.utn.edu.ar
j.iparraguirre@computer.org
2 CONICET, Bahía Blanca, Argentina
slaggio@criba.edu.ar

* These authors contributed equally
** These authors contributed equally

SAIV, Simposio Argentino de Imágenes y Visión
Memorias de las 51 JAIIO - SAIV - ISSN: 2451-7496 - Página 6

Abstract. In recent years, transformer architectures have been growing in popularity. Modulated Detection Transformer (MDETR) is an end-to-end multi-modal understanding model that performs tasks such as phrase grounding, referring expression comprehension, referring expression segmentation, and visual question answering. One remarkable aspect of the model is its capacity to infer over classes that it was not trained on. In this work we explore the use of MDETR in a new task, action detection, without any additional training. We obtain quantitative results using the Atomic Visual Actions dataset. Although the model does not achieve the best reported performance on the task, we believe this is an interesting finding. We show that it is possible to use a multi-modal model to tackle a task that it was not designed for. Finally, we believe that this line of research may lead to the generalization of MDETR to additional downstream tasks.

Keywords: Multi-modal transformers · Action detection · Model generalization.

1 Introduction

Transformer architectures have been increasing in popularity among the machine learning community [5]. Initially, this type of architecture emerged in the natural language processing space [9]. However, it has expanded rapidly into other modalities such as computer vision [5]. Recently, multi-modal transformers have made it possible to process images and text with a single model. Additionally, video understanding tasks have been tackled by transformer models such as the one proposed by Wang et al. [10].

Modulated Detection Transformer (MDETR) [4] is a multi-modal transformer. The architecture accepts an image and text as input, and it can be trained on multiple downstream tasks. One particular task is visual question answering. Although the initial design of the model does not target video understanding, we used MDETR as an action recognition model. Without any additional training, we evaluated the performance of the model on an action recognition dataset. Naturally, the results are not the best reported. However, we found it valuable to assess the use of a multi-modal transformer in tasks that it was not designed for. It is important to highlight that no additional training was performed before the evaluation.

The Atomic Visual Actions (AVA) [2] dataset consists of a collection of 430 videos annotated with 80 visual actions. It contains 1.58M action labels, each associated with a bounding box. Since MDETR returns box coordinates associated with its output, it is possible to ask a question and obtain an answer together with the related region of interest. We ran experiments on AVA and obtained quantitative results. Additionally, all reported findings were published in an open repository 3. The next section reviews related work. Section 3 presents quantitative results. Finally, conclusions are stated in Section 4.

3 https://github.com/BHI-Research/AVA MDETR

2 Related Work

2.1 MDETR

Modulated Detection Transformer (MDETR) [4] performs object detection in conjunction with language understanding. The concept enables end-to-end multi-modal understanding. The model relies only on text and the aligned boxes in an image. Unlike previous detection methods, MDETR detects concepts from free text and generalizes to unseen combinations of categories and attributes. The MDETR authors report strong quantitative results in four tasks: phrase grounding, referring expression comprehension, referring expression segmentation, and visual question answering. Given a collection of videos, we sampled the clips and extracted 1 frame per second. Using visual question answering, we asked for actions and measured the output of the system.

As is common for transformer models, MDETR was trained in two stages: pre-training and training on the downstream tasks. During pre-training, the model ingested a combination of the Flickr30k [8], MS COCO [7] and Visual Genome (VG) [6] datasets. In the case of the visual question answering task, GQA [3] was the selected dataset. During inference, the model reads a tensor of linear image features and a tensor of linear text features. Then, the two inputs are concatenated and fed into an encoder. Depending on the task, the decoder presents some variation. In the case of question answering, object queries and task-specific queries are fed into the decoder. As a result, the decoder provides new object positions and answers to the queries.

2.2 AVA

The AVA [2] dataset is a person-centered corpus, annotated at a 1 Hz sample rate. Every person is located using a bounding box, and the labels correspond to actions related to the pose, interactions with objects, and interactions with other persons. The temporal context of each annotation is centered ±1.5 seconds around the keyframe. This "brief" time span gives the dataset its name. Annotations in the dataset are precise and the number of labels reaches 1.58 million.

Multiple metrics are available in AVA. Intersection-over-union (IoU) is reported at frame level and at video level. In the case of frame level, the metric is built following the standard protocol used by the PASCAL VOC challenge [1]. Average precision (AP) is computed using an IoU threshold of 0.5.
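As a minimal sketch of the IoU test behind this threshold, the Python snippet below computes the overlap between a predicted box and an annotated box and checks it against 0.5. The corner format (x1, y1, x2, y2), the function names `iou` and `is_match`, the example coordinates, and the inclusive comparison are assumptions made for illustration; they are not taken from the AVA or PASCAL VOC evaluation code.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) corner format (assumed)."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_match(pred_box: Sequence[float], gt_box: Sequence[float],
             threshold: float = 0.5) -> bool:
    """True when the predicted box overlaps the annotated box enough to count as a hit.

    The inclusive comparison (>=) is an assumption of this sketch.
    """
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    # Hypothetical boxes in normalized image coordinates.
    predicted = (0.10, 0.20, 0.50, 0.90)
    annotated = (0.15, 0.25, 0.55, 0.95)
    print(f"IoU = {iou(predicted, annotated):.3f}, match = {is_match(predicted, annotated)}")
```

Under this criterion, a detection for a given action counts as correct only if its box matches an annotated person box carrying that action label; AP for the class is then computed over the detections ranked by confidence.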
Mean Average Precision (mAP) is the average of AP over all classes. In this work, AP is reported.

3 Results

The experiments reported in this work were obtained using the original MDETR model and the AVA actions dataset v2.2. Since the model takes images and text as input, we sampled the videos at 1 Hz to obtain the frames. Regarding actions, we created a collection of questions covering the action vocabulary annotated in AVA. For each frame extracted from the dataset, we asked all the available questions. The output of the model was saved in a CSV file. Afterwards, we obtained quantitative results.

Figure 1 shows a correct action detection (left) and an incorrect result (right). In this case, the frames belong to the AVA dataset and the model used is MDETR. A frame and a question about the action to detect are given to the model as input. As output, the model provides an answer, its confidence, and the location in the image where the answer was found.

Fig. 1: Output of MDETR using visual question answering to detect an action. On the left (a), a successful result can be observed. In this case, an image and the question "is someone sitting?" are given to the model. On the right (b), the image and the question "is someone dancing?" were given to the model. The example on the right shows a failure in the action detection.

State-of-the-art results show that the action detection task is far from solved. To the best of our knowledge, the best performing model achieved 38.8 mAP [11]. Since our experiments use the standard MDETR model, no new training was required. In our case, the overall performance of the model is orders of magnitude below the best reported results. This was an expected outcome.

Table 1 shows quantitative results for the categories where MDETR performed the best and for those where it did not detect actions. Depending on the point of view, the results can be interpreted as negative or positive. The negative aspect is that the model cannot match the results of models designed specifically for the task. The positive aspect is that MDETR is detecting actions without any additional training. This is a remarkable fact considering that the original design was targeting other tasks.

Table 1: PASCAL boxes categories results. Best and worst performing categories, AP at an IoU threshold of 0.5.

Best Performance             Worst Performance
Category      AP@0.5IOU      Category             AP@0.5IOU
sleep         0.0019         answer phone         0.0
sit           0.0016         kiss (a person)      0.0
stand         0.0011         throw                0.0
hand shake    0.0005         touch (an object)    0.0
dance         0.0003         write                0.0

4 Conclusions and Future Work

In this work we showed the use of an end-to-end text and image understanding transformer model in a task that it was not designed for. We obtained quantitative results using a challenging action recognition dataset and we tested the limits of the architecture. The remarkable characteristic that makes MDETR unique is that the model can infer over classes that it did not see before. For instance, it can detect a pink elephant (not present in the annotations). We wanted to push this aspect to its limit in the case of action detection. Although the model achieves poor quantitative results, it is possible to detect actions. This is a noteworthy result given the experimental setting.

As future work, we plan to train MDETR on the action detection task. We understand that there is potential in this line of research. Since the AVA dataset provides a high number of labels, the task seems feasible. We believe that there is room for generalization in the use of multi-modal transformer models.

References

1. Everingham, M., Eslami, S., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision 111(1), 98–136 (2015)
2. Gu, C., Sun, C., Ross, D.A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., et al.: AVA: A video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6047–6056 (2018)
3. Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6700–6709 (2019)
4. Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR-modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1780–1790 (2021)
5. Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers in vision: A survey. ACM Computing Surveys (CSUR) (2021)
6. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1), 32–73 (2017)
7. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
8. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2641–2649 (2015)
9. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
10. Wang, J., Bertasius, G., Tran, D., Torresani, L.: Long-short temporal contrastive learning of video transformers. arXiv preprint arXiv:2106.09212 (2021)
11. Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14668–14678 (2022)

