12,417 research outputs found

Multimodal person recognition for human-vehicle interaction

Author: Abut Hüseyin
Erçil Aytül
Erdoğan Hakan
Erzin Engin
Tekalp A. Murat
Yemez Yücel
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/04/2006

Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, as borne out in two case studies.

Sabanci University Research Database

Towards Augmentative Speech Communication

Author: Denis Beautemps
Hiroshi Ishiguro
Norihiro Hagita
Panikos Heracleous
Publication venue: 'IntechOpen'
Publication date: 21/06/2011

IntechOpen

LipLearner: Customizable Silent Speech Interactions on Mobile Devices

Author: Fang Shitao
Rekimoto Jun
Su Zixiong
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/02/2023

The silent speech interface is a promising technology that enables private communications in natural language. However, previous approaches only support a small and inflexible vocabulary, which leads to limited expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947 is achievable using only one shot, and its performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop a mobile silent speech interface empowered with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high reliability guaranteed by an online incremental learning scheme. Subjective feedback indicated that our system provides essential functionalities for customizable silent speech interactions with high usability and learnability. Comment: Conditionally accepted to the ACM CHI Conference on Human Factors in Computing Systems 2023 (CHI '23)

arXiv.org e-Print Archive

MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information

Author: Huo Yuchen
Li Qi
Li Sen
Liu Li
Wang Jianrong
Xu Tianyi
Publication date: 04/06/2023

Audio-visual speech recognition (AVSR) is gaining increasing attention from researchers as an important part of human-computer interaction. However, the existing Mandarin audio-visual datasets are limited and lack depth information. To address this issue, this work establishes the MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers. To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material has been developed to create well-balanced reading material. In particular, Microsoft's latest data acquisition device, Azure Kinect, is used to capture depth information in addition to the traditional audio signals and RGB images during data acquisition. We also provide a baseline experiment, which could be used to evaluate the effectiveness of the dataset. The dataset and code will be released at https://github.com/SpringHuo/MAVD

arXiv.org e-Print Archive

Audio-visual speech processing system for Polish applicable to human-computer interaction

Author: Jadczyk Tomasz
Publication venue: 'AGH University of Science and Technology Press'
Publication date: 19/02/2018

This paper describes an audio-visual speech recognition system for the Polish language and a set of performance tests under various acoustic conditions. We first present the overall structure of AVASR systems, which has three main areas: audio feature extraction, visual feature extraction, and audio-visual speech integration. We present MFCC features for the audio stream with a standard HMM modeling technique, then describe appearance-based and shape-based visual features. Subsequently, we present two feature integration techniques: feature concatenation and model fusion. We also discuss the results of a set of experiments conducted to select the best system setup for Polish under noisy audio conditions. The experiments simulate human-computer interaction in a computer-control scenario with voice commands in difficult audio environments. With an Active Appearance Model (AAM) and a multistream Hidden Markov Model (HMM), we can improve system accuracy, reducing the Word Error Rate by more than 30% compared to audio-only speech recognition, when the Signal-to-Noise Ratio goes down to 0 dB.

Computer Science Journal (AGH University of Science and Technology, Krakow)
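
The figure quoted in the abstract above is a reduction in Word Error Rate relative to an audio-only baseline. As a minimal sketch of how such a number is commonly computed (this is not the paper's evaluation code), the following uses the standard word-level edit distance. The function names and any sample strings passed to them are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which the improved system lowers the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer
```

Under this definition, a baseline WER of 0.50 and an audio-visual WER of 0.25 correspond to a relative reduction of 0.5, that is, 50%.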

Humanoid with Interaction Ability Using Vision and Speech Information

Author: Junichi Ido
Ryuichi Nisimura
Tsukasa Ogasawara
Yoshio Matsumoto
Publication venue: 'IntechOpen'
Publication date: 01/11/2008

IntechOpen

Crossref

Chicago Maya and MacNorsk II: A Tale of Two Software Development Projects (Original Construction vs. Prefab)

Author: Landahl Karen L.
Need Barbara
Ziolkowski Mike
Publication venue: 'The University of Kansas'
Publication date: 01/04/1999

The University of Kansas: Journals@KU

Biodiversity Informatics

Vowel priority lip matching scheme and similarity evaluation model based on humanoid robot Ren-Xin

Author: Kang Xin
Liu Zheng
Nishide Shun
Ren Fuji
Publication venue: 'Springer Nature'
Publication date: 21/07/2023

At present, the significance of humanoid robots has increased dramatically, yet robots of this kind rarely enter human life because of their immature development. The lip shape of humanoid robots is crucial in the speech process, since it makes humanoid robots look like real humans. Many studies show that vowels are the essential elements of pronunciation in all languages in the world. Building on traditional viseme research, we increased the priority of smooth lip transitions between vowels and propose a lip matching scheme based on vowel priority. Additionally, we designed a similarity evaluation model based on the Manhattan distance using computer-vision lip features, which quantifies lip shape similarity on a scale from 0 to 1 and provides an effective recommendation for an evaluation standard. Surprisingly, this model successfully compensates for the disadvantages of the lip shape similarity evaluation criteria in this field. We applied this lip matching scheme to the Ren-Xin humanoid robot and performed robot teaching experiments as well as a similarity comparison experiment of 20 sentences with two males, two females, and the robot. Notably, all the experiments achieved excellent results.

Tokushima University Institutional Repository
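
The abstract above mentions a similarity score between 0 and 1 built on the Manhattan distance, but does not give the formula. The sketch below shows one plausible construction, under the assumption that each lip feature has already been scaled to the range [0, 1]. The function name, the feature layout, and the normalization by feature count are all assumptions for illustration, not the authors' model.

```python
def manhattan_similarity(features_a, features_b):
    """Map the Manhattan (L1) distance between two feature vectors to [0, 1].

    Assumes every feature is pre-scaled to [0, 1], so the largest possible
    distance equals the number of features. Returns 1.0 for identical
    vectors and 0.0 for maximally different ones.
    """
    if len(features_a) != len(features_b) or not features_a:
        raise ValueError("feature vectors must be non-empty and equal in length")
    for value in (*features_a, *features_b):
        if not 0.0 <= value <= 1.0:
            raise ValueError("features must be scaled to [0, 1]")
    distance = sum(abs(a - b) for a, b in zip(features_a, features_b))
    return 1.0 - distance / len(features_a)
```

With two hypothetical features such as normalized mouth width and height, `manhattan_similarity([0.0, 1.0], [0.5, 1.0])` gives 0.75, since the vectors differ by 0.5 on one of the two features.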