11 research outputs found

    CAR2 - Czech Database of Car Speech

    No full text
    This paper presents a new Czech-language two-channel (stereo) speech database recorded in a car environment. The database was designed for experiments with speech enhancement for communication purposes and for the study and design of robust speech recognition systems. Tools for automated phoneme labelling based on Baum-Welch re-estimation were realised, and a noise analysis of the car background environment was performed.
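    As an illustrative sketch only (not code from the paper), the snippet below shows how a basic car-noise analysis of such a recording could look: a Welch estimate of the long-term noise spectrum and a rough per-channel SNR figure. The file name and segment boundaries are hypothetical placeholders, not CAR2 annotations.

```python
# Minimal sketch of a car-noise analysis step, assuming a stereo WAV recording
# with a noise-only lead-in; the file name and segment bounds are illustrative.
import numpy as np
import soundfile as sf
from scipy.signal import welch

def long_term_spectrum(x, fs, nperseg=1024):
    """Welch estimate of the long-term average power spectrum (per channel)."""
    f, pxx = welch(x, fs=fs, nperseg=nperseg, axis=0)
    return f, pxx

def segment_snr_db(speech, noise):
    """Rough SNR estimate from a speech-plus-noise segment and a noise-only segment."""
    p_mix = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(max(p_mix - p_noise, 1e-12) / p_noise)

audio, fs = sf.read("car_session_01.wav")      # hypothetical file; shape (samples, 2)
noise = audio[: int(2.0 * fs)]                 # assumed noise-only lead-in
speech = audio[int(2.0 * fs): int(6.0 * fs)]   # assumed speech-plus-noise portion

f, pxx = long_term_spectrum(noise, fs)
print("SNR estimate per channel [dB]:",
      [round(segment_snr_db(speech[:, c], noise[:, c]), 1)
       for c in range(audio.shape[1])])
```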

    Scattering vs. Discrete Cosine Transform Features in Visual Speech Processing

    No full text
    Appearance-based feature extraction constitutes the dominant approach for visual speech representation in a variety of problems, such as automatic speechreading, visual speech detection, and others. To obtain the necessary visual features, typically a rectangular region-of-interest (ROI) containing the speaker’s mouth is first extracted, followed, most commonly, by a discrete cosine transform (DCT) of the ROI pixel values and a feature selection step. The approach, although algorithmically simple and computationally efficient, suffers from the lack of DCT invariance to typical ROI deformations, stemming primarily from the speaker’s head pose variability and small tracking inaccuracies. To address the problem, in this paper the recently introduced scattering transform is investigated as an alternative to the DCT within the appearance-based framework for ROI representation, suitable for visual speech applications. A number of such tasks are considered, namely visual-only speech activity detection, visual-only and audio-visual sub-phonetic classification, as well as audio-visual speech synchrony detection, all employing deep neural network classifiers with either DCT- or scattering-based visual features. Comparative experiments with the resulting systems are conducted on a large audio-visual corpus of frontal face videos, demonstrating in all cases the superiority of the scattering transform over the DCT. © 2015 Auditory-Visual Speech Processing 2015, AVSP 2015, held in conjunction with Facial Analysis and Animation, FAA 2015 - 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, FAAVSP 2015. All rights reserved.
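    For illustration, the following sketch outlines the DCT step of the appearance-based pipeline described above: a 2-D DCT of a cropped grayscale mouth ROI followed by retention of the low-frequency coefficients. The ROI size and the number of retained coefficients are assumptions made for the example; mouth detection and tracking are taken as already done.

```python
# Minimal sketch of appearance-based DCT features for a mouth ROI, assuming the
# ROI has already been located and cropped; sizes here are illustrative choices.
import numpy as np
from scipy.fft import dctn

def dct_roi_features(roi, keep=8):
    """2-D DCT of a grayscale ROI; keep the top-left (low-frequency) block as features."""
    roi = roi.astype(np.float64)
    coeffs = dctn(roi, norm="ortho")       # 2-D type-II DCT of the pixel values
    return coeffs[:keep, :keep].ravel()    # keep*keep lowest-frequency coefficients

# Usage with a dummy 64x64 ROI (a real system would crop this from a tracked mouth region).
roi = np.random.rand(64, 64)
features = dct_roi_features(roi, keep=8)   # 64-dimensional static feature vector
print(features.shape)
```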

    Detecting audio-visual synchrony using deep neural networks

    No full text
    In this paper, we address the problem of automatically detecting whether the audio and visual speech modalities in frontal-pose videos are synchronous or not. This is of interest in a wide range of applications, for example spoof detection in biometrics, lip-syncing, speaker detection and diarization in multi-subject videos, and video data quality assurance. In our adopted approach, we investigate the use of deep neural networks (DNNs) for this purpose. The proposed synchrony DNNs operate directly on audio and visual features over relatively wide contexts, or, alternatively, on appropriate hidden (bottleneck) or output layers of DNNs trained for single-modal or audio-visual automatic speech recognition. In all cases, the synchrony DNN classes consist of the "in-sync" target and a number of "out-of-sync" targets, the latter considered at multiples of ±30 msec of overall asynchrony between the two modalities. We apply the proposed approach to two multi-subject audio-visual databases, one of high-quality data recorded in studio-like conditions, and one of data recorded by smart cell-phone devices. On both sets, and under a speaker-independent experimental framework, we achieve very low equal-error rates in distinguishing "in-sync" from "out-of-sync" data. Copyright © 2015 ISCA.
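    To make the class construction concrete, the sketch below shows one way training examples for such a synchrony classifier could be assembled by shifting the audio feature stream against the visual one in multiples of 30 ms. The common frame rate, context handling, and offset range are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of building "in-sync" / "out-of-sync" examples by shifting the
# audio features relative to the visual features in 30 ms steps; frame rate and
# offset range are assumptions for this example.
import numpy as np

FPS = 100                             # assume both streams resampled to 10 ms frames
STEP_MS = 30                          # asynchrony step used for the "out-of-sync" classes
STEP_FRAMES = STEP_MS * FPS // 1000   # 3 frames per 30 ms step

def synchrony_examples(audio_feats, visual_feats, offset_steps):
    """Concatenate audio and visual frames with the audio stream shifted by
    offset_steps * 30 ms; the label is the offset itself (0 = in-sync)."""
    shift = offset_steps * STEP_FRAMES
    t = np.arange(len(visual_feats))
    a_idx = t + shift
    valid = (a_idx >= 0) & (a_idx < len(audio_feats))
    x = np.hstack([audio_feats[a_idx[valid]], visual_feats[valid]])
    y = np.full(valid.sum(), offset_steps)
    return x, y

# Usage with dummy features (a real system would use e.g. filterbank + ROI-based features).
audio = np.random.rand(1000, 40)
video = np.random.rand(1000, 64)
x_sync, y_sync = synchrony_examples(audio, video, 0)     # "in-sync" examples
x_async, y_async = synchrony_examples(audio, video, 2)   # audio shifted by +60 ms
print(x_sync.shape, x_async.shape)
```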