Heterogeneous Graph-based Trajectory Prediction using Local Map Context and Social Interactions
Precisely predicting the future trajectories of surrounding traffic
participants is a crucial but challenging problem in autonomous driving, due to
complex interactions between traffic agents, map context and traffic rules.
Vector-based approaches have recently been shown to achieve among the best performance on trajectory prediction benchmarks. These methods model simple interactions between traffic agents but do not distinguish the type of a relation or its attributes, such as the agents' distance along the road. Furthermore, they represent lanes only by sequences of vectors for the center lines and ignore context information such as lane dividers and other road elements. We present a novel approach for vector-based trajectory prediction that addresses these shortcomings by leveraging three crucial sources of information: First, we model interactions between traffic agents by a semantic scene graph that accounts for the nature and important features of their relations. Second, we extract agent-centric, image-based map features to model the local map context. Finally, we generate anchor paths to constrain the policy in multi-modal prediction to permitted trajectories only. Each of these three enhancements shows advantages over the baseline model HoliGraph.
Comment: Accepted at IEEE ITSC 202
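The semantic scene graph idea can be illustrated with a small sketch. The construction below uses networkx with purely hypothetical agent and relation names; it is not the paper's implementation, only an assumption of how typed, attributed agent-agent edges could be represented.

# Illustrative sketch (not the paper's code): a semantic scene graph whose
# edges between traffic agents carry a relation type and attributes such as
# the longitudinal distance along the road.
import networkx as nx

def build_scene_graph(agents, relations):
    """agents: dict of agent_id -> feature dict (position, velocity, ...).
    relations: list of (src_id, dst_id, relation_type, dist_along_road)."""
    graph = nx.MultiDiGraph()
    for agent_id, features in agents.items():
        graph.add_node(agent_id, **features)
    for src, dst, rel_type, dist in relations:
        # The edge type and attributes distinguish e.g. longitudinal from
        # lateral or intersecting relations between agents.
        graph.add_edge(src, dst, relation=rel_type, dist_along_road=dist)
    return graph

scene = build_scene_graph(
    {"ego": {"pos": (0.0, 0.0), "vel": 8.3},
     "car_1": {"pos": (25.0, 0.5), "vel": 7.9}},
    [("ego", "car_1", "longitudinal", 25.0)],
)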
Speaker Verification Experiments on the XM2VTS Database
This paper describes two speaker verification algorithms: a text-independent method based on a second-order statistical measure and a text-dependent method based on hidden Markov modelling. We investigate the effect of different features, sampling rates, and threshold-setting methods, and introduce an N-best word pruning method that aims to compensate for the effect of poorly trained client models. Experimental evaluation is performed on the publicly available XM2VTS database according to a published protocol, for three different operating points and a priori threshold setting.
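As an illustration of the text-independent part, one widely used second-order statistical measure compares the covariance matrices of the test utterance and the client model. Whether this exact measure is the one used in the paper is not stated here, so the sketch below is only an assumption.

# Sketch of a text-independent score from second-order statistics: the
# arithmetic-harmonic sphericity measure between the covariance matrix of the
# test features and that of a client model. Smaller values mean more similar
# second-order statistics; acceptance would compare the score to a threshold.
import numpy as np

def sphericity_distance(test_feats, client_feats):
    """Both inputs are (n_frames, n_dims) arrays of cepstral features."""
    cx = np.cov(test_feats, rowvar=False)
    cy = np.cov(client_feats, rowvar=False)
    d = cx.shape[0]
    return np.log(np.trace(cx @ np.linalg.inv(cy)) *
                  np.trace(cy @ np.linalg.inv(cx)) / d**2)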
Towards Speaker Independent Continuous Speechreading
This paper describes recent speechreading experiments for a speaker-independent continuous digit recognition task. Visual feature extraction is performed by a lip tracker which recovers information about the lip shape and about the grey-level intensity around the mouth. These features are used to train visual word models using continuous-density HMMs. Results show that the method generalises well to new speakers and that the recognition rate is highly variable across digits, as expected due to the high visual confusability of certain words. 1. INTRODUCTION Current speechreading (or lipreading) systems have mainly been evaluated for small-vocabulary, speaker-dependent, isolated speech recognition tasks [8]. One of the main difficulties in speechreading, however, is to cope with the large appearance variability across subjects and to extract visual speech features which generalise well to new speakers. Appearance variability might for example be due to differences of lips, t..
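A minimal sketch of the whole-word modelling step, assuming per-frame visual feature vectors from the lip tracker and using hmmlearn's GaussianHMM as a stand-in for the paper's continuous-density HMM training:

# One Gaussian HMM per digit word, trained on lip-shape + grey-level feature
# sequences; recognition picks the word model with the highest likelihood.
import numpy as np
from hmmlearn import hmm

def train_word_models(sequences_per_word, n_states=5):
    """sequences_per_word: dict word -> list of (n_frames, n_dims) arrays."""
    models = {}
    for word, seqs in sequences_per_word.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[word] = model
    return models

def recognise(models, seq):
    # Score the observation sequence against every word model.
    return max(models, key=lambda w: models[w].score(seq))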
Using the Multi-Stream Approach for Continuous Audio-Visual Speech Recognition: Experiments on the M2VTS Database
The Multi-Stream automatic speech recognition approach was investigated in this work as a framework for Audio-Visual data fusion and speech recognition. This method presents many potential advantages for such a task. In particular, it allows for synchronous decoding of continuous speech while still allowing for some asynchrony between the visual and acoustic information streams. First, the Multi-Stream formalism is briefly recalled. Then, building on the Multi-Stream motivations, experiments on the M2VTS multimodal database are presented and discussed. To our knowledge, these are the first experiments addressing multi-speaker continuous Audio-Visual Speech Recognition (AVSR). It is shown that the Multi-Stream approach can yield improved Audio-Visual speech recognition performance when the acoustic signal is corrupted by noise, as well as for clean speech.
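The recombination idea can be sketched in a few lines; the weighting scheme below is a simplified assumption, not the paper's exact formulation.

# Minimal sketch of stream combination: per-stream log-likelihoods are
# recombined with a reliability weight. In a full Multi-Stream HMM this
# recombination happens at defined anchor points (e.g. word boundaries),
# which is what allows some asynchrony between streams within a segment.
def combine_stream_scores(logp_audio, logp_video, audio_weight=0.7):
    """Weighted recombination of acoustic and visual stream scores.
    audio_weight would typically be lowered as the acoustic SNR drops."""
    return audio_weight * logp_audio + (1.0 - audio_weight) * logp_video

# Example: with noisy audio, a lower audio weight favours the visual stream.
score = combine_stream_scores(logp_audio=-1250.0, logp_video=-980.0,
                              audio_weight=0.4)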
Active Shape Models for Visual Speech Feature Extraction
Most approaches for lip modelling are based on heuristic constraints imposed by the user. We describe the use of Active Shape Models for extracting visual speech features for use by automatic speechreading systems, where the deformation of the lip model as well as the image search is based on a priori knowledge learned from a training set. We demonstrate the robustness and accuracy of the technique for locating and tracking lips on a database comprising a broad variety of talkers and lighting conditions. Keywords: lip locating, lip tracking, learned model, learned features. 1. Introduction While mainstream speech recognition research has concentrated almost exclusively on the acoustic speech signal, it is well known that humans use visual information from the talker's face (mainly lip movements) in addition to the acoustic signal for speech perception. Whereas several well-known methods exist for representing acoustic features of speech, it is still not fully understood (i) w..
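The core of an Active Shape Model, a point distribution model learned by PCA over aligned landmark shapes, can be sketched as follows; shape alignment and the iterative image search are omitted, and the variable names are illustrative.

# PCA on aligned lip-landmark shapes; new shapes are the mean plus a
# constrained linear combination of the main modes of variation.
import numpy as np

def fit_shape_model(shapes, n_modes=8):
    """shapes: (n_samples, 2 * n_landmarks) array of aligned lip shapes."""
    mean = shapes.mean(axis=0)
    cov = np.cov(shapes - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_modes]
    return mean, eigvecs[:, order], eigvals[order]

def generate_shape(mean, modes, eigvals, b):
    # Limit each shape parameter to +/- 3 standard deviations so that only
    # plausible (trained) lip shapes can be produced during image search.
    b = np.clip(b, -3 * np.sqrt(eigvals), 3 * np.sqrt(eigvals))
    return mean + modes @ b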
Speaker Identification By Lipreading
This paper describes a new approach for speaker identification based on lipreading. Visual features are extracted from image sequences of the talking face and consist of shape parameters, which describe the lip boundary, and intensity parameters, which describe the grey-level distribution of the mouth area. Intensity information is based on principal component analysis using eigenspaces which deform with the shape model. The extracted parameters account for both speech-dependent and speaker-dependent information. We built spatio-temporal speaker models based on these features, using HMMs with mixtures of Gaussians. Promising results were obtained for text-dependent and text-independent speaker identification tests performed on a small video database. 1. INTRODUCTION Whereas current state-of-the-art speaker recognition systems make use of the acoustic speech signal only, in the speech recognition community it is well known that visual information from lip movements provides additional s..
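A sketch of how such eigenspace intensity features might be computed and combined with the shape parameters; this is a simplified assumption about the feature pipeline (the shape-dependent warping of the mouth region is assumed to have been applied already), not the paper's code.

# Grey-level vectors sampled from the (shape-normalised) mouth region are
# projected onto an eigenspace learned by PCA; the coefficients are then
# concatenated with the shape parameters to form the per-frame feature vector.
import numpy as np

def build_eigenspace(grey_vectors, n_components=10):
    """grey_vectors: (n_samples, n_pixels) array of normalised mouth patches."""
    mean = grey_vectors.mean(axis=0)
    # SVD of the centred data gives the principal intensity modes.
    _, _, vt = np.linalg.svd(grey_vectors - mean, full_matrices=False)
    return mean, vt[:n_components]

def intensity_features(patch, mean, components):
    return components @ (patch - mean)

def frame_features(shape_params, patch, mean, components):
    return np.concatenate([shape_params,
                           intensity_features(patch, mean, components)])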
Audio-Visual Person Verification
In this paper we investigate the benefits of classifier combination (fusion) for a multimodal system for personal identity verification. The system uses frontal face images and speech. We show that a sophisticated fusion strategy enables the system to outperform its facial and vocal modules taken separately. We show that both trained linear weighted schemes and fusion by a Support Vector Machine classifier lead to a significant reduction of total error rates. The complete system is tested on data from a publicly available audio-visual database (XM2VTS, 295 subjects) according to a published protocol. 1 Introduction Recognition systems based on biometric features (face, voice, iris, etc.) have received a lot of attention in recent years. Most of the proposed approaches focus on mono-modal identification: the system uses a single modality to find the closest person to the user in a database. Relatively high recognition rates were obtained for different modalities..
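A minimal sketch of score-level fusion with a Support Vector Machine, using scikit-learn and made-up scores purely for illustration; the paper's actual features and training protocol are not reproduced here.

# Each access attempt yields a face-verification score and a speaker-
# verification score; an SVM trained on these 2-D score vectors separates
# genuine clients from impostors.
import numpy as np
from sklearn.svm import SVC

# Rows: [face_score, speech_score]; label 1 = genuine client, 0 = impostor.
train_scores = np.array([[0.9, 0.8], [0.7, 0.9], [0.2, 0.3], [0.4, 0.1]])
train_labels = np.array([1, 1, 0, 0])

fusion = SVC(kernel="linear")
fusion.fit(train_scores, train_labels)

# Accept or reject a new access attempt from its two modality scores.
decision = fusion.predict([[0.85, 0.75]])[0]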