    IMAGINE Final Report

    Spot the conversation: speaker diarisation in the wild

    The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset features overlapping speech, a large and diverse speaker pool, and challenging background conditions. Comment: The dataset will be available for download from http://www.robots.ox.ac.uk/~vgg/data/voxceleb/voxconverse.html . The development set will be released in July 2020, and the test set will be released in October 202

    Objective and Subjective Evaluation of Wideband Speech Quality

    Traditional landline and cellular communications use a bandwidth of 300-3400 Hz for transmitting speech. This narrow bandwidth impacts the quality, intelligibility and naturalness of transmitted speech. There is an impending change within the telecommunication industry towards using wider bandwidth speech, but the enlarged bandwidth also introduces a few challenges in speech processing. Echo and noise are two challenging issues in wideband telephony, due to increased perceptual sensitivity by users. Subjective and/or objective measurements of speech quality are important in benchmarking speech processing algorithms and evaluating the effect of parameters like noise, echo, and delay in wideband telephony. Subjective measures include ratings of speech quality by listeners, whereas objective measures compute a metric based on the reference and degraded speech samples. While subjective quality ratings are the "gold standard", they are also time- and resource-consuming. An objective metric that correlates highly with subjective data is attractive, as it can act as a substitute for subjective quality scores in gauging the performance of different algorithms and devices. This thesis reports results from a series of experiments on subjective and objective speech quality evaluation for wideband telephony applications. First, a custom wideband noise reduction database was created that contained speech samples corrupted by different background noises at different signal-to-noise ratios (SNRs) and processed by six different noise reduction algorithms. Comprehensive subjective evaluation of this database revealed an interaction between the algorithm performance, noise type and SNR. Several auditory-based objective metrics, such as the Loudness Pattern Distortion (LPD) measure based on the Moore-Glasberg auditory model, were evaluated in predicting the subjective scores. In addition, the performance of Bayesian Multivariate Regression Splines (BMLS) was evaluated in terms of mapping the scores calculated by the objective metrics to the true quality scores. The combination of LPD and BMLS resulted in high correlation with the subjective scores and was used in place of subjective scores for fine-tuning the noise reduction algorithms. Second, the effect of echo and delay on wideband speech was evaluated in both listening and conversational contexts, through both subjective and objective measures. A database containing speech samples corrupted by echo with different delay and frequency response characteristics was created, and was later used to collect subjective quality ratings. The LPD-BMLS objective metric was then validated using the subjective scores. Third, to evaluate the effect of echo and delay in a conversational context, a real-time simulator was developed. Pairs of subjects conversed over the simulated system and rated the quality of their conversations, which were degraded by different amounts of echo and delay. The quality scores were analysed, and the LPD-BMLS combination was found to be effective in predicting subjective impressions of quality for condition-averaged data.

    VOICE BASED FOR BANKING SYSTEM

    Traditional banking services suffer from inconvenience, latency and low quality of service; they are poorly suited to people with disabilities and require extra manpower for simple banking activities. The goal of this project is to build a voice-recognition-based system that focuses on banking activities and uses voice as the medium for carrying them out over a telephony network. The study addresses three objectives. The first is to develop a two-way interactive banking program that uses voice as the main mechanism for receiving instructions from the user and responding to them. The second, which supports the first, is to develop a user-friendly, high-security voice banking system that requires the user to log on by supplying the assigned customer identification number and personal identification number before proceeding to further actions. This in turn requires a strong database structure that maintains the integrity of the stored data and responds to authorized users only. The third is to determine the best programming approach for implementation on a telephony network; to that end, the study examines an architecture in which voice is accepted, manipulated and generated using a combination of two technologies, Cold Fusion and VoiceXML. The system's functions were proven and are in demand among users, as it offers convenient and easy service using only voice to convey instructions. This strategy is therefore expected to attract a large number of customers and generate substantial profit for banks that adopt the system. It is hoped that the system will serve as a platform that future developers can host and that a large number of customers can use simultaneously and efficiently.
Keyword: Voice based, telephony, combination of programming, architecture

    Multimodal person recognition for human-vehicle interaction

    Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, as borne out in two case studies.

    In Car Audio

    This chapter presents implementations of advanced in-car audio applications. The system is composed of three main applications concerning the in-car listening and communication experience. Starting from a high-level description of the algorithms, several implementations at different levels of hardware abstraction are presented, along with empirical results on both the design process and the performance achieved.