    Zero-shot keyword spotting for visual speech recognition in-the-wild

    Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which has so far received no attention from the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Unlike prior works on KWS, which try to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciations. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), while drastically improving over other recently proposed ASR-free KWS methods.
    Comment: Accepted at ECCV 2018
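    The three-stage design lends itself to a compact sketch. Below is a minimal PyTorch sketch of the pipeline; the module names, dimensions, and the choice of GRUs are illustrative assumptions rather than the paper's configuration, and the phoneme decoder that supervises the grapheme-to-phoneme model during training is omitted for brevity.

```python
import torch
import torch.nn as nn

class VisualKWS(nn.Module):
    def __init__(self, feat_dim=512, char_vocab=30, hidden=256):
        super().__init__()
        # (a) stand-in for the spatiotemporal ResNet front end: assumes the
        # video has already been reduced to a T x feat_dim feature sequence.
        self.visual_proj = nn.Linear(feat_dim, hidden)
        # (b) grapheme-to-phoneme encoder (the phoneme decoder used during
        # training is omitted); its last hidden state is the keyword embedding.
        self.g2p_enc = nn.GRU(char_vocab, hidden, batch_first=True)
        # (c) recurrent stack correlating video frames with the keyword.
        self.correlator = nn.GRU(2 * hidden, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)  # does the keyword occur?

    def forward(self, video_feats, keyword_chars):
        # video_feats: (B, T, feat_dim); keyword_chars: (B, L, char_vocab) one-hot
        v = self.visual_proj(video_feats)               # (B, T, hidden)
        _, kw = self.g2p_enc(keyword_chars)             # (1, B, hidden)
        kw = kw[-1].unsqueeze(1).expand(-1, v.size(1), -1)
        fused, _ = self.correlator(torch.cat([v, kw], dim=-1))
        return torch.sigmoid(self.classifier(fused.mean(dim=1)))

model = VisualKWS()
score = model(torch.randn(2, 75, 512), torch.randn(2, 8, 30))
print(score.shape)  # torch.Size([2, 1]) -- probability the keyword occurs
```

    The keyword embedding is broadcast along the time axis so the recurrent stack can score every video frame against it, matching the correlate-then-classify structure the abstract describes.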

    Dynamic Visual Attention: competitive versus motion priority scheme

    Defined as the attentive process in the presence of visual sequences, dynamic visual attention responds to both static and motion features. For a computer model, a straightforward way to integrate these features is to combine them all in a competitive scheme: the saliency map contains a contribution from each feature, static and motion. Another way is to combine the features in a motion priority scheme: in the presence of motion, the saliency map is computed as the motion map, and in its absence, as the static map. In this paper, four models are considered: two based on a competitive scheme and two based on a motion priority scheme. The models are evaluated experimentally by comparing them against the eye movement patterns of human subjects viewing a set of video sequences. Qualitative and quantitative evaluations, performed on simple synthetic video sequences, show that the motion priority scheme outperforms the competitive scheme.
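    The two integration schemes can be stated in a few lines. The NumPy sketch below contrasts them; the weights and the motion-presence test are arbitrary assumptions for illustration, not the parameters used in the paper.

```python
import numpy as np

def competitive(static_map, motion_map, w_static=0.5, w_motion=0.5):
    """Competitive scheme: every feature contributes to the saliency map."""
    return w_static * static_map + w_motion * motion_map

def motion_priority(static_map, motion_map, motion_threshold=0.1):
    """Motion priority scheme: use the motion map whenever motion is
    present, otherwise fall back to the static map."""
    if motion_map.max() > motion_threshold:  # motion detected in the frame
        return motion_map
    return static_map

static_map = np.random.rand(48, 64)  # e.g. intensity/color conspicuity
motion_map = np.zeros((48, 64))      # a still frame: no motion energy
print(motion_priority(static_map, motion_map) is static_map)  # True
```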

    Dynamic stereoscopic selective visual attention (DSSVA): integrating motion and shape with depth in video segmentation

    This article presents the inclusion of depth as an important parameter for dynamic selective visual attention. The model introduced here builds on two previously developed models, dynamic selective visual attention and visual stereoscopy, giving rise to the so-called dynamic stereoscopic selective visual attention method. All three models are based on the accumulative computation problem-solving method. The paper shows how software reusability enables enhancing results in vision research (video segmentation) by integrating earlier works. First results obtained on synthetic sequences are included to show the effectiveness of integrating motion and shape features with the depth parameter in video segmentation.
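    The shared accumulative-computation mechanism can be sketched briefly: a per-pixel permanency memory is charged where motion is detected and discharged elsewhere, and the depth map keeps the charge separated into layers so that objects moving at different depths are segmented apart. In the NumPy sketch below, the charge/discharge constants and the number of depth layers are illustrative assumptions, not the published parameters.

```python
import numpy as np

CHARGE, DISCHARGE, MAX_CHARGE = 32, 16, 255  # assumed constants

def accumulate(memory, motion_mask):
    """One accumulative-computation step: charge pixels where motion is
    detected, discharge the rest."""
    return np.where(motion_mask,
                    np.minimum(memory + CHARGE, MAX_CHARGE),
                    np.maximum(memory - DISCHARGE, 0))

def segment_by_depth(memory, depth, n_layers=4):
    """Split the charged pixels into depth layers."""
    edges = np.linspace(depth.min(), depth.max(), n_layers + 1)
    layer = np.digitize(depth, edges[1:-1])
    return [(memory > 0) & (layer == k) for k in range(n_layers)]

memory = np.zeros((48, 64))
motion = np.random.rand(48, 64) > 0.9   # stand-in motion detection
depth = np.random.rand(48, 64) * 10.0   # stand-in disparity/depth map
memory = accumulate(memory, motion)
masks = segment_by_depth(memory, depth)
print([int(m.sum()) for m in masks])    # charged pixels per depth layer
```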

    Multimodal Feature Learning for Video Captioning

    Video captioning refers to the task of generating a natural language sentence that explains the content of input video clips. This study proposes a deep neural network model for effective video captioning. Apart from visual features, the proposed model additionally learns semantic features that describe the video content effectively. In our model, visual features of the input video are extracted using convolutional neural networks such as C3D and ResNet, while semantic features are obtained using recurrent neural networks such as LSTMs. In addition, our model includes an attention-based caption generation network that generates correct natural language captions based on the multimodal video feature sequences. Various experiments conducted on two large benchmark datasets, Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT), demonstrate the performance of the proposed model.
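    A condensed PyTorch sketch of this kind of design follows: per-clip visual and semantic features are fused, and an attention-weighted context vector conditions an LSTM decoder at each word step. The feature dimensions, the dot-product attention, and the fusion layer are assumptions for illustration, with random tensors standing in for C3D/ResNet outputs and LSTM-derived semantic features.

```python
import torch
import torch.nn as nn

class AttnCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, sem_dim=300, hidden=512, vocab=10000):
        super().__init__()
        self.fuse = nn.Linear(feat_dim + sem_dim, hidden)  # multimodal fusion
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTMCell(2 * hidden, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, visual, semantic, captions):
        # visual: (B, T, feat_dim); semantic: (B, T, sem_dim); captions: (B, L)
        feats = torch.tanh(self.fuse(torch.cat([visual, semantic], -1)))
        h = feats.mean(1)          # initialize decoder state from the clip
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            # dot-product attention over the fused feature sequence
            attn = torch.softmax((feats @ h.unsqueeze(-1)).squeeze(-1), dim=1)
            context = (attn.unsqueeze(-1) * feats).sum(1)        # (B, hidden)
            step_in = torch.cat([self.embed(captions[:, t]), context], -1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                        # (B, L, vocab)

model = AttnCaptioner()
out = model(torch.randn(2, 20, 2048), torch.randn(2, 20, 300),
            torch.randint(0, 10000, (2, 12)))
print(out.shape)  # torch.Size([2, 12, 10000])
```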

    Integration and Evaluation of a Video Surveillance System

    Visual surveillance systems have received growing attention over the last few years, driven by a growing need for surveillance applications. In this thesis, we present a visual surveillance system that integrates modules for motion detection, tracking, and trajectory characterization to achieve robust monitoring of moving objects in scenes under surveillance. The system operates on video sequences acquired by stationary color and infrared surveillance cameras. Motion detection is implemented using an algorithm that combines thresholding of temporal variance with background modeling. The tracking algorithm combines motion and appearance information into an appearance model and uses a particle filter framework for object tracking. The trajectory analysis module builds a model of a given normal activity using a factorization approach, and uses this model to detect any abnormal motion pattern. The system was tested on a large ground-truthed data set containing hundreds of color and FLIR image sequences. Results of performance evaluation using these sequences are reported in this thesis.
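    The motion-detection stage, as described, combines thresholded temporal variance with a background model. Here is a minimal NumPy sketch of that combination; the update rate and thresholds are assumed values, and the particle-filter tracker and factorization-based trajectory model are omitted.

```python
import numpy as np

ALPHA, VAR_THRESH, DIFF_THRESH = 0.05, 50.0, 25.0  # assumed parameters

def detect_motion(frames, n_history=10):
    """Flag pixels whose recent temporal variance is high AND that differ
    from a running-average background model."""
    background = frames[0].astype(float)
    history = []
    for frame in frames[1:]:
        frame = frame.astype(float)
        history = (history + [frame])[-n_history:]
        temporal_var = np.var(np.stack(history), axis=0)
        foreground = (temporal_var > VAR_THRESH) & \
                     (np.abs(frame - background) > DIFF_THRESH)
        # update the background only where no motion was detected
        background = np.where(foreground, background,
                              (1 - ALPHA) * background + ALPHA * frame)
        yield foreground

frames = [np.random.randint(0, 256, (48, 64)) for _ in range(5)]
masks = list(detect_motion(frames))
print(int(masks[-1].sum()))  # number of foreground pixels in the last frame
```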
