69 research outputs found

    Multimodal interactions in virtual environments using eye tracking and gesture control.

    Multimodal interactions provide users with more natural ways to interact with virtual environments than traditional input methods. An emerging approach is gaze modulated pointing, which enables users to select and manipulate virtual content conveniently by combining gaze with other hand control techniques or pointing devices; in this thesis, mid-air gestures. To establish a synergy between the two modalities and evaluate the affordance of this novel multimodal interaction technique, it is important to understand their behavioural patterns and relationship, as well as any possible perceptual conflicts and interactive ambiguities. More specifically, evidence shows that eye movements lead hand movements, but the question remains whether the leading relationship is similar when interacting using a pointing device. Moreover, as gaze modulated pointing uses different sensors to track and detect user behaviours, its performance relies on users' perception of the exact spatial mapping between the virtual space and the physical space. This raises an underexplored issue: whether gaze can introduce misalignment of the spatial mapping and lead to user misperception and interactive errors. Furthermore, the accuracy of eye tracking and mid-air gesture control is not yet comparable with that of traditional pointing techniques (e.g., mouse). This may cause pointing ambiguity when fine-grained interactions are required, such as selecting in a dense virtual scene where proximity and occlusion are prone to occur. This thesis addresses these concerns through experimental studies and theoretical analysis that involve paradigm design, development of interactive prototypes, and user studies for verification of assumptions, comparisons and evaluations. Substantial data sets were obtained and analysed from each experiment.
The results conform to and extend previous empirical findings that gaze leads pointing device movements in most cases, both spatially and temporally. The results confirm that gaze does introduce spatial misperception; three methods (Scaling, Magnet and Dual-gaze) were proposed and shown to reduce the impact of this perceptual conflict, with Magnet and Dual-gaze delivering better performance than Scaling. In addition, a coarse-to-fine solution is proposed and evaluated to compensate for the degradation introduced by eye tracking inaccuracy, which uses a gaze cone to detect ambiguity followed by a gaze probe for decluttering. The results show that this solution can enhance the interaction accuracy but requires a compromise on efficiency. These findings can be used to inform a more robust multimodal interface design for interactions within virtual environments that are supported by both eye tracking and mid-air gesture control. This work also opens up a technical pathway for the design of future multimodal interaction techniques, which starts from a derivation from natural correlated behavioural patterns, and then considers whether the design of the interaction technique can maintain perceptual constancy and whether any ambiguity among the integrated modalities will be introduced.

    Gaze modulated disambiguation technique for gesture control in 3D virtual objects selection

    © 2017 IEEE. Inputs with multimodal information provide more natural ways to interact with virtual 3D environments. An emerging technique that integrates gaze modulated pointing with mid-air gesture control enables fast target acquisition and rich control expressions. The performance of this technique relies on eye tracking accuracy, which is not yet comparable with that of traditional pointing techniques (e.g., mouse). This causes problems when fine-grained interactions are required, such as selecting in a dense virtual scene where proximity and occlusion are prone to occur. This paper proposes a coarse-to-fine solution to compensate for the degradation introduced by eye tracking inaccuracy, using a gaze cone to detect ambiguity and then a gaze probe for decluttering. It is tested in a comparative experiment involving 12 participants and 3240 runs. The results show that the proposed technique enhanced selection accuracy and user experience, but there is still room to improve its efficiency. This study contributes to providing a robust multimodal interface design supported by both eye tracking and mid-air gesture control.
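The coarse stage described in this abstract (a gaze cone that detects ambiguity) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the 3-degree half-angle, and the rule "more than one object inside the cone means ambiguous" are all assumptions made for illustration, and the fine stage (the gaze probe) is not modelled.

```python
import math

def angle_from_gaze(origin, direction, point):
    """Angle in radians between the gaze ray and the vector origin -> point."""
    to_point = [p - o for p, o in zip(point, origin)]
    dot = sum(d * t for d, t in zip(direction, to_point))
    norm = math.sqrt(sum(d * d for d in direction)) * math.sqrt(sum(t * t for t in to_point))
    if norm == 0.0:
        return 0.0  # point coincides with the eye position (or zero direction)
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def objects_in_gaze_cone(origin, direction, objects, half_angle_deg=3.0):
    """Return names of objects inside the cone, nearest to the gaze axis first.

    `objects` maps a name to an (x, y, z) centre; `half_angle_deg` is an
    assumed tolerance standing in for eye tracker inaccuracy.
    """
    limit = math.radians(half_angle_deg)
    hits = [(angle_from_gaze(origin, direction, pos), name) for name, pos in objects.items()]
    return [name for angle, name in sorted(hits) if angle <= limit]

def is_ambiguous(candidates):
    """Assumed rule: selection is ambiguous if several objects share the cone."""
    return len(candidates) > 1

scene = {"a": (0.0, 0.0, 10.0), "b": (0.2, 0.0, 10.0), "c": (5.0, 0.0, 10.0)}
candidates = objects_in_gaze_cone((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene)
```

In this toy scene, objects `a` and `b` lie within the cone and `c` does not, so a second, finer step would be needed to pick between `a` and `b`.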

    Multimodality with Eye tracking and Haptics: A New Horizon for Serious Games?

    The goal of this review is to illustrate the emerging use of multimodal virtual reality that can benefit learning-based games. The review begins with an introduction to multimodal virtual reality in serious games, followed by a brief discussion of why cognitive processes involved in learning and training are enhanced in immersive virtual environments. We first outline studies that have used eye tracking and haptic feedback independently in serious games, and then review some innovative applications that have already combined eye tracking and haptic devices in order to provide applicable multimodal frameworks for learning-based games. Finally, some general conclusions are identified and clarified in order to advance current understanding of multimodal serious game production and to explore possible areas for new applications.

    Exploitation of Novel Multiplayer Gesture-based Interaction and Virtual Puppetry for Digital Storytelling to Develop Children’s Narrative Skills

    In recent years, digital storytelling has demonstrated powerful pedagogical functions by improving creativity, collaboration and intimacy among young children. Saturated with digital media technologies in their daily lives, the young generation demands natural interactive learning environments which offer multimodal feedback and meaningful immersive learning experiences. A virtual puppetry assisted storytelling system for young children, which utilises depth motion sensing technology and gesture control as the Human-Computer Interaction (HCI) method, has been shown to provide a natural interactive learning experience for a single player. In this paper, we designed and developed a novel system that allows multiple players to narrate and, most importantly, to interact with other characters and interactive virtual items in the virtual environment. We have conducted one user experiment with four young children for pedagogical evaluation and another user experiment with five postgraduate students for system evaluation. Our user study shows that this novel digital storytelling system has great potential to stimulate the learning abilities of young children through collaboration tasks.

    Semantic framework for interactive animation generation and its application in virtual shadow play performance.

    Designing and creating complex and interactive animation is still a challenge in the field of virtual reality, as it has to handle various functional requirements (e.g. graphics, physics, AI, multimodal inputs and outputs, and massive data assets management). In this paper, a semantic framework is proposed to model the construction of interactive animation and promote animation asset reuse in a systematic and standardized way. As its ontological implementation, two domain-specific ontologies, one for hand-gesture-based interaction and one for the animation data repository, have been developed in the context of traditional Chinese shadow play art. Finally, a prototype of an interactive Chinese shadow play performance system using a deep motion sensor device is presented as a usage example.

    Exploitation of multiplayer interaction and development of virtual puppetry storytelling using gesture control and stereoscopic devices

    With the rapid development of human-computer interaction technologies, the new media generation demands novel learning experiences with natural interaction and immersion. Considering that digital storytelling is a powerful pedagogical tool for young children, in this paper we design an immersive storytelling environment that allows multiple players to use naturally interactive hand gestures to manipulate virtual puppetry for assisting narration. A set of multimodal interaction techniques is presented for a hybrid user interface that integrates existing 3D visualization and interaction devices, including head-mounted displays and a depth motion sensor. In this system, the young players can intuitively use hand gestures to manipulate virtual puppets to perform a story and interact with props in a virtual stereoscopic environment. We have conducted a user experiment with four young children for pedagogical evaluation, as well as a system acceptability and interactivity evaluation by postgraduate students. The results show that our framework has great potential to stimulate the learning abilities of young children through collaboration tasks. The stereoscopic head-mounted display outperformed the traditional monoscopic display in a comparison between the two.

    Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

    Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. Separate latent features are derived to learn dysarthric speech characteristics and phoneme context representations. Self-supervised pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments conducted on the UASpeech corpus suggest the proposed adversarial data augmentation approach consistently outperformed the baseline speed perturbation and non-VAE GAN augmentation methods with trained hybrid TDNN and End-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27.78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset of speakers with "Very Low" intelligibility. Comment: Submitted to ICASSP 202

    Oriented Graphene Nanoribbons Embedded in Hexagonal Boron Nitride Trenches

    Graphene nanoribbons (GNRs) are ultra-narrow strips of graphene that have the potential to be used in high-performance graphene-based semiconductor electronics. However, controlled growth of GNRs on dielectric substrates remains a challenge. Here, we report the successful growth of GNRs directly on hexagonal boron nitride substrates with smooth edges and controllable widths using chemical vapour deposition. The approach is based on a type of template growth that allows for the in-plane epitaxy of mono-layered GNRs in nano-trenches on hexagonal boron nitride with edges following a zigzag direction. The embedded GNR channels show excellent electronic properties, even at room temperature. Such in-plane hetero-integration of GNRs, which is compatible with integrated circuit processing, creates a gapped channel with a width of a few benzene rings, enabling the development of digital integrated circuitry based on GNRs. Comment: 32 pages, 4 figures, Supplementary information

    Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

    Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end jointly fine-tuning using either the ASR cost function alone, or its interpolation with the speech enhancement loss. Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores. Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
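For readers unfamiliar with how absolute and relative WER reductions relate, here is a minimal sketch. It uses the standard definition of WER (word-level edit distance divided by the number of reference words). The function names are hypothetical, and the baseline figures it derives are implied by the quoted numbers, not stated in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def implied_baseline_wer(absolute_reduction, relative_reduction):
    """Since relative = absolute / baseline, baseline = absolute / relative."""
    return absolute_reduction / relative_reduction

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
baseline_a = implied_baseline_wer(9.1, 0.417)  # in WER percentage points
baseline_b = implied_baseline_wer(6.2, 0.360)
```

Under this reading, the quoted reductions would correspond to audio-only baselines of roughly 21.8% and 17.2% WER. These values are back-calculated from rounded figures and are only approximate.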

    Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

    Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty in collecting such data in large quantities. This paper explores a series of approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding in which the TDNN/Conformer system outputs are rescored using domain adapted wav2vec2.0 models. In addition, domain adapted wav2vec2.0 representations are utilized in acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric and elderly speech recognition systems. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that TDNN and Conformer ASR systems integrating domain adapted wav2vec2.0 models consistently outperform the standalone wav2vec2.0 models by statistically significant WER reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two tasks respectively. The lowest published WERs of 22.56% (52.53% on very low intelligibility, 39.09% on unseen words) and 18.17% are obtained on the UASpeech test set of 16 dysarthric speakers and the DementiaBank Pitt test set respectively. Comment: accepted by ICASSP 202