3,885 research outputs found

    Temporal Sentence Grounding in Videos: A Survey and Future Directions

    Temporal sentence grounding in videos (TSGV), also known as natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve the temporal moment in an untrimmed video that semantically corresponds to a language query. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey summarizes the fundamental concepts of TSGV, the current state of research, and future research directions. As background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from the raw video and language query to answer prediction of the target moment. We then review the techniques for multimodal understanding and interaction, which are the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate on the methods in each category, along with their strengths and weaknesses. Lastly, we discuss issues with current TSGV research and share our insights about promising research directions. Comment: 29 pages, 32 figures, 9 tables

    Understanding cities with machine eyes: A review of deep computer vision in urban analytics

    Modelling urban systems has interested planners and modellers for decades. Different models have been developed relying on mathematics, cellular automata, complexity, and scaling. While most of these models tend to simplify reality, today, amid the paradigm shifts that artificial intelligence is bringing to different fields of science, applications of computer vision show promising potential for understanding the realistic dynamics of cities. While cities are complex by nature, computer vision shows progress in tackling a variety of complex physical and non-physical visual tasks. In this article, we review the tasks and algorithms of computer vision and their applications in understanding cities. We subdivide computer vision algorithms into tasks, and cities into layers, to show where computer vision is intensively applied and where further research is needed. We focus on highlighting the potential role of computer vision in understanding urban systems related to the built environment, natural environment, human interaction, transportation, and infrastructure. After showing the diversity of computer vision algorithms and applications, we discuss the challenges that remain in understanding, with deep learning and computer vision, how these different layers of cities integrate and interact with one another. We also offer recommendations for practice and policy-making towards reaching AI-generated urban policies.

    SSHA: Video Violence Recognition and Localization Using a Semi-Supervised Hard Attention Model

    Current human-based surveillance systems are prone to inadequate availability and reliability. Artificial intelligence-based solutions are compelling, considering their reliability and precision in the face of the increasing adoption of surveillance systems. Highly efficient and precise machine learning models are required to make effective use of the extensive volume of high-definition surveillance imagery. This study focuses on improving the accuracy of the methods and models used in automated surveillance systems to recognize and localize human violence in video footage. The proposed model uses an I3D backbone pretrained on the Kinetics dataset and achieves state-of-the-art accuracy of 90.4% and 98.7% on the RWF and Hockey datasets, respectively. The semi-supervised hard attention mechanism enables the proposed method to fully capture the available information in a high-resolution video by processing the necessary video regions in great detail. Comment: 11 pages, 4 figures, 4 equations, 3 tables, 1 algorithm

    An Adaptive Threshold for the Canny Edge Detection with Actor-Critic Algorithm

    Visual surveillance aims to perform robust foreground object detection regardless of the time and place. Object detection shows good results using only spatial information, but foreground object detection in visual surveillance requires proper processing of both temporal and spatial information. Deep learning-based foreground object detection algorithms outperform classical background subtraction (BGS) algorithms in environments similar to the training environment. However, they perform worse than classical BGS algorithms in environments that differ from it. This paper proposes a spatio-temporal fusion network (STFN) that extracts temporal and spatial information using a temporal network and a spatial network. We suggest a method that uses a semi-foreground map for stable training of the proposed STFN. The proposed algorithm shows excellent performance in environments different from the training environment, which we demonstrate through experiments with various public datasets. STFN can also generate a compliant background image in a semi-supervised manner, and it can operate in real time on a desktop with a GPU. The proposed method shows 11.28% and 18.33% higher FM than the latest deep learning method on the LASIESTA and SBI datasets, respectively.