
    Combining Residual Networks with LSTMs for Lipreading

    We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing.
    Comment: Submitted to Interspeech 201
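
    The architecture described in this abstract (a spatiotemporal convolutional frontend, a residual trunk, and a bidirectional LSTM backend) can be sketched roughly as follows. This is a minimal illustration, not the authors' exact configuration: the layer sizes, the torchvision ResNet-18 trunk, the 112×112 mouth crops, and averaging over time before the 500-way classifier are all assumptions.

```python
# Hedged sketch of a word-level lipreading pipeline in the spirit of the abstract:
# 3D (spatiotemporal) conv frontend -> 2D residual trunk applied per frame ->
# bidirectional LSTM -> word classifier. Sizes and the torchvision ResNet-18
# trunk are illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class LipreadingNet(nn.Module):
    def __init__(self, num_words: int = 500, hidden: int = 256):
        super().__init__()
        # Spatiotemporal frontend: keeps the time dimension, downsamples space.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # 2D residual trunk, applied frame by frame on the frontend features.
        trunk = resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
        trunk.fc = nn.Identity()          # keep the 512-d pooled feature
        self.trunk = trunk
        # Bidirectional LSTM over the per-frame features.
        self.blstm = nn.LSTM(512, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_words)

    def forward(self, x):                 # x: (batch, 1, time, H, W) grayscale mouth crops
        f = self.frontend(x)              # (batch, 64, time, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, -1)       # (batch, time, 512)
        seq, _ = self.blstm(f)                    # (batch, time, 2*hidden)
        return self.classifier(seq.mean(dim=1))   # average over time -> word logits


logits = LipreadingNet()(torch.randn(2, 1, 32, 112, 112))  # 32 frames ~= 1.28 s at 25 fps
print(logits.shape)  # torch.Size([2, 500])
```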

    LEARNet: Dynamic Imaging Network for Micro Expression Recognition

    Unlike prevalent facial expressions, micro expressions have subtle, involuntary muscle movements which are short-lived in nature. These minute muscle movements reflect the true emotions of a person. Due to their short duration and low intensity, these micro-expressions are very difficult to perceive and interpret correctly. In this paper, we propose a dynamic representation of micro-expressions to preserve the facial movement information of a video in a single frame. We also propose a Lateral Accretive Hybrid Network (LEARNet) to capture micro-level features of an expression in the facial region. LEARNet refines the salient expression features in an accretive manner by incorporating accretion layers (AL) in the network. The response of the AL holds the hybrid feature maps generated by prior laterally connected convolution layers. Moreover, the LEARNet architecture incorporates a cross-decoupled relationship between convolution layers, which helps in preserving the tiny but influential facial muscle change information. The visual responses of the proposed LEARNet depict the effectiveness of the system by preserving both high- and micro-level edge features of facial expression. The effectiveness of the proposed LEARNet is evaluated on four benchmark datasets: CASME-I, CASME-II, CAS(ME)^2 and SMIC. The experimental results show a significant improvement of 4.03%, 1.90%, 1.79% and 2.82% as compared with ResNet on the CASME-I, CASME-II, CAS(ME)^2 and SMIC datasets, respectively.
    Comment: Dynamic imaging, accretion, lateral, micro expression recognition
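
    The abstract only loosely specifies how an accretion layer works. One possible reading, sketched below, is a layer that fuses ("accretes") the hybrid feature maps of two laterally connected convolution branches and then refines the result; the branch kernel sizes, fusion by element-wise addition, and channel widths are assumptions for illustration, not the published LEARNet design.

```python
# Loose sketch of an "accretion layer" as described in the abstract: two laterally
# connected convolution branches whose hybrid feature maps are accreted (fused)
# into a refined expression feature. Branch design and fusion by addition are
# assumptions for illustration, not the published LEARNet configuration.
import torch
import torch.nn as nn


class AccretionLayer(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Two lateral branches with different receptive fields capture both
        # coarse and micro-level (small-kernel) edge responses.
        self.lateral_a = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.lateral_b = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.refine = nn.Sequential(
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Hybrid feature map: element-wise accretion of the lateral responses.
        hybrid = self.lateral_a(x) + self.lateral_b(x)
        return self.refine(hybrid)


# Example: refine features of a single dynamic image (one frame summarizing a clip).
feat = AccretionLayer(3, 32)(torch.randn(1, 3, 112, 112))
print(feat.shape)  # torch.Size([1, 32, 112, 112])
```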

    Auditory and cross-modal attention for the cognitive access to objects

    This article aims at studying how we track and identify objects on the basis of multimodal perception. It belongs to ‘procedural’ theories, according to which demonstrative identification depends on using procedures of perceptual attention (e.g., Campbell, 2002; Clark, 2000; Evans, 1982; Pylyshyn, 2003; Ullman, 1984). In contrast to prevalent views according to which demonstrative identification is primarily based on the orienting of visual attention to the target object itself (Campbell, 2002: 115-16), I shall investigate an alternative Crossmodal View. According to the Crossmodal View, demonstrative identification depends more fundamentally on crossmodal attention. I shall present an argument that perceivers routinely use the coordinating ability of crossmodal attention to retrieve the continuity and uniqueness of the spatiotemporal path of the target object of their identification acts. The analysis focuses on examples of crossmodal links between audition and vision.

    Aspects of spatiotemporal integration in bat sonar

    Bat sonar is an active sense that is based on the common mammalian auditory system. Bats emit echolocation calls in the high frequency range and extract information about their surroundings by listening to the returning echoes. These echoes carry information, such as spatial cues, about object location in three-dimensional space (azimuth, elevation, and distance). Distance information, for example, is obtained from temporal cues such as the interval between the emission of an echolocation call and the returning echo (echo delay). But echoes also carry information about spatial object properties like shape, orientation, or size (in terms of height, width, and depth). To achieve a reliable internal representation of the environment, bats need to integrate spatial and temporal echo information. In this cumulative thesis, different aspects of spatiotemporal integration in bat sonar were addressed, beginning with the perception and neural encoding of object size. Object width, as the size-relevant dimension, is encoded by the intensity of its echo. Additionally, the sonar aperture (the naturally co-varying spread of angles of incidence from which the echoes impinge on the ears) co-varies proportionally. In the first study, using a combined psychophysical and electrophysiological approach (including the presentation of virtual objects), it was investigated which of these two acoustic cues echolocating bats (Phyllostomus discolor) employ for the estimation of object width. Interestingly, the results showed that bats can discriminate object width using only sonar-aperture information. This was reflected in the responses of a population of units in the auditory midbrain and cortex that responded most strongly to echoes from objects with a specific sonar aperture, independent of variations in echo intensity. The study revealed that the sonar aperture is a behaviorally relevant and reliably encoded spatial perceptual cue for object size. It furthermore supported the theory that the mammalian central nervous system principally aims to find modality-independent representations of spatial object properties. We therefore suggested that the sonar aperture, as an echo-acoustic equivalent of the visual aperture (also referred to as the visual angle), could be one of these object properties.

    In the visual system, object size is encoded by the visual aperture as the extent of the image on the retina. It depends on object distance, which is not explicitly encoded. Thus, for reliable size perception at different distances, higher computational mechanisms are needed. This phenomenon is termed ‘size constancy’ or ‘size-distance invariance’ and is assumed to reflect an automatic re-scaling of visual aperture with perceived object distance. In echolocating bats, however, object width (sonar aperture) and object distance (echo delay) are accurately perceived and explicitly neurally encoded. In the second study, we investigated whether bats spontaneously combine these spatial and temporal cues to determine absolute width information, in terms of sonar size constancy (SSC). This was addressed using the same setup and species as in the psychophysical approach of the first study. As a result, SSC could not be verified as an important feature of sonar perception in bats. This lack of SSC could result from the bats relying on different modalities to extract size information at different distances. Alternatively, it is conceivable that familiarity with a behaviorally relevant, conspicuous object is required, as has been discussed for visual size constancy. But size constancy is found in many sensory modalities and, more importantly, SSC was recently found in a blind human echolocator and discussed as being based on the same spatial and temporal cues as presented in our study. Thus, this topic should be readdressed in bats in a more natural context, as size constancy could be a general mechanism for object normalization.

    As the spatiotemporal layout of the environment and of the objects within it changes with locomotion, the third study addressed spatiotemporal integration in bat biosonar in a natural and naturalistic context. Trawling bat species hunt above water and capture fish or insects directly from, or close to, the surface. Here, water acts as an acoustic mirror that can reduce clutter by reflecting sonar emissions away from the bat, whereas objects on the water lead to echo enhancement. In a combined laboratory and field study, we tested and quantified the effect of different surface types with different reflection properties (smooth and clutter surface) and of object height on object detection and discrimination in the trawling bat species Myotis daubentonii. The bats had to detect a mealworm presented above these different surfaces and discriminate it from an inedible PVC disk. At low heights above the clutter surface, the bats’ detection performance was worse than above a smooth surface. At a height of 50 cm, the surface structure had no influence on target detection. Above the clutter surface, object discrimination decreased with decreasing height. The study revealed different perceptual strategies that could allow efficient object detection and discrimination. When approaching objects above clutter, echolocation calls showed a significantly higher peak frequency, possibly suggesting a strategy for temporal separation of object echoes from clutter. Flight-path reconstruction showed that the bats attacked objects from below over water but from above over clutter. These results are consistent with the hypothesis that trawling bats exploit an echo-acoustic ground effect, in terms of a spatiotemporal integration of direct object reflections with indirect reflections from the water surface. This could lead to optimized prey detection and discrimination not only for prey on the water but also above it. Additionally, the bats could employ a precedence-like strategy to avoid misleading spatial cues that signal the wrong object elevation, by using only the first, and therefore direct, echo for object localization.
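
    For readers unfamiliar with the two acoustic cues discussed above, they relate to object geometry in a simple way: echo delay gives object distance (d = c·t/2 for sound speed c), and a sonar aperture θ at distance d corresponds to an absolute width of roughly w ≈ 2·d·tan(θ/2). The sketch below is only this back-of-the-envelope geometry with made-up example values; it is not the model used in the thesis.

```python
# Illustrative relation between the cues discussed above: echo delay -> object
# distance, and sonar aperture + distance -> absolute width. Example values are
# made up; w = 2*d*tan(theta/2) is the simple geometric reading of "aperture",
# not the bats' (or the thesis') actual perceptual model.
import math

SPEED_OF_SOUND = 343.0  # m/s in air at ~20 °C

def distance_from_echo_delay(delay_s: float) -> float:
    """The echo travels to the object and back, so one-way distance is c*t/2."""
    return SPEED_OF_SOUND * delay_s / 2.0

def width_from_aperture(distance_m: float, aperture_rad: float) -> float:
    """Width of an object subtending a given sonar aperture at a given distance."""
    return 2.0 * distance_m * math.tan(aperture_rad / 2.0)

delay = 0.006                                  # 6 ms echo delay
d = distance_from_echo_delay(delay)            # ~1.03 m
w = width_from_aperture(d, math.radians(20))   # ~0.36 m for a 20° aperture
print(f"distance ~ {d:.2f} m, width ~ {w:.2f} m")
```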

    Multi-Modal Perception for Selective Rendering

    A major challenge in generating high-fidelity virtual environments (VEs) is to provide realism at interactive rates. The high-fidelity simulation of light and sound is still unachievable in real time, as such physical accuracy is very computationally demanding. Only recently has visual perception been used in high-fidelity rendering to improve performance through a series of novel exploitations: parts of the scene that are not currently being attended to by the viewer are rendered at a much lower quality without the difference being perceived. This paper investigates the effect spatialised directional sound has on the visual attention of a user towards rendered images. These perceptual artefacts are utilised in selective rendering pipelines via the use of multi-modal maps. The multi-modal maps are tested through psychophysical experiments to examine their applicability to selective rendering algorithms with a series of fixed-cost rendering functions, and are found to perform significantly better than image saliency maps naively applied to multi-modal virtual environments.
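
    As a rough illustration of how a multi-modal map could steer selective rendering, the sketch below blends an image-saliency map with an attention map peaked at the screen-space position of a directional sound source and converts the result into a per-pixel sample budget. The Gaussian audio map, the blend weight, and the budget range are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch of a "multi-modal map" steering selective rendering:
# blend an image-saliency map with an audio-attention map peaked at the
# screen-space position of a spatialised sound source, then map the result to a
# per-pixel sample budget. Blend weight, Gaussian spread and budget range are
# assumptions for illustration only.
import numpy as np

H, W = 180, 320

def audio_attention_map(sound_px, sound_py, sigma=40.0):
    """Gaussian attention bump centred on the sound source's screen position."""
    ys, xs = np.mgrid[0:H, 0:W]
    return np.exp(-((xs - sound_px) ** 2 + (ys - sound_py) ** 2) / (2 * sigma ** 2))

def multimodal_map(image_saliency, audio_map, audio_weight=0.5):
    """Convex blend of visual saliency and audio-driven attention, rescaled to [0, 1]."""
    m = (1 - audio_weight) * image_saliency + audio_weight * audio_map
    return (m - m.min()) / (m.max() - m.min() + 1e-8)

def samples_per_pixel(mm_map, low=1, high=16):
    """Spend more samples per pixel where multi-modal attention is high."""
    return np.round(low + (high - low) * mm_map).astype(int)

saliency = np.random.rand(H, W)              # stand-in for a real image-saliency map
mm = multimodal_map(saliency, audio_attention_map(80, 60))
spp = samples_per_pixel(mm)
print(spp.min(), spp.max())                  # 1 16
```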

    STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

    While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually derive from visually perceptible source objects; e.g., the sound of footsteps comes from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotations of sound events. Sound scenes in STARSS23 are recorded with instructions that guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results from a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.
    Comment: 27 pages, 9 figures, accepted for publication in the NeurIPS 2023 Track on Datasets and Benchmarks
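
    As a concrete illustration of the audio side of a SELD task, a common baseline for estimating DOA from first-order Ambisonics (FOA) recordings is the pseudo-intensity vector formed from the omnidirectional and dipole channels. The sketch below assumes FOA channels ordered W, Y, Z, X (ACN); it is a generic baseline illustration, not the STARSS23 annotation pipeline, which the abstract says is based on motion-capture tracking.

```python
# Generic baseline illustration (not the STARSS23 annotation pipeline): estimating
# a sound event's DOA from first-order Ambisonics audio with a pseudo-intensity
# vector. The FOA channel ordering W, Y, Z, X (ACN) is an assumed input format.
import numpy as np

def doa_from_foa(frame):
    """Azimuth/elevation (degrees) of the dominant source in one FOA audio frame.

    frame: array of shape (4, n_samples) with channels ordered W, Y, Z, X.
    """
    w, y, z, x = frame
    # Pseudo-intensity vector: product of the pressure-like (W) and
    # velocity-like (X, Y, Z) channels, averaged over the frame.
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation

# Toy check: synthesize a plane wave arriving from azimuth 60°, elevation 0°.
t = np.linspace(0, 0.02, 480)
s = np.sin(2 * np.pi * 500 * t)
az, el = np.radians(60), np.radians(0)
foa = np.stack([s,                              # W
                s * np.cos(el) * np.sin(az),    # Y
                s * np.sin(el),                 # Z
                s * np.cos(el) * np.cos(az)])   # X
print(doa_from_foa(foa))  # ~ (60.0, 0.0)
```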