97 research outputs found

    Perceptually-Driven Video Coding with the Daala Video Codec

    The Daala project is a royalty-free video codec that attempts to compete with the best patent-encumbered codecs. Part of our strategy is to replace core tools of traditional video codecs with alternative approaches, many of them designed to take perceptual aspects into account, rather than optimizing for simple metrics like PSNR. This paper documents some of our experiences with these tools, which ones worked and which did not. We evaluate which tools are easy to integrate into a more traditional codec design, and show results in the context of the codec being developed by the Alliance for Open Media.
    Comment: 19 pages, Proceedings of SPIE Workshop on Applications of Digital Image Processing (ADIP), 201

    INTENT BASED LOAD-BALANCING FOR VOICE OVER INTERNET PROTOCOL (VOIP) ELEMENTS

    Presented herein is an intelligent call distribution and load-balancing solution that distributes calls based on call type. The solution determines the type of call from various factors and uses reinforcement learning algorithms to select the element best suited to that type, based on the call-success ratio for the particular call type (e.g., audio, video, fax). This eliminates call failures and call delays and improves customer satisfaction. The solution can be extended to other call details such as dual-tone multi-frequency (DTMF) signaling, codec, and payload type.
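The selection step described above can be sketched as a per-call-type epsilon-greedy bandit. This is only an illustration under stated assumptions: the abstract does not name the reinforcement learning algorithm, so epsilon-greedy is a stand-in, and the class name `CallTypeBalancer`, the element names, and the `epsilon` parameter are hypothetical.

```python
import random
from collections import defaultdict


class CallTypeBalancer:
    """Epsilon-greedy choice of a VoIP element per call type (illustrative only)."""

    def __init__(self, elements, epsilon=0.1, seed=None):
        self.elements = list(elements)
        self.epsilon = epsilon  # probability of exploring a random element
        self.rng = random.Random(seed)
        # stats[call_type][element] = [successes, attempts]
        self.stats = defaultdict(lambda: {e: [0, 0] for e in self.elements})

    def success_ratio(self, call_type, element):
        successes, attempts = self.stats[call_type][element]
        # Assumption: untried elements get an optimistic 1.0 so they are tried.
        return successes / attempts if attempts else 1.0

    def select(self, call_type):
        # Explore occasionally; otherwise exploit the best observed ratio.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.elements)
        return max(self.elements, key=lambda e: self.success_ratio(call_type, e))

    def record(self, call_type, element, succeeded):
        # Update the call-success statistics after the call completes.
        entry = self.stats[call_type][element]
        entry[0] += int(succeeded)
        entry[1] += 1


balancer = CallTypeBalancer(["element-a", "element-b"], epsilon=0.0, seed=1)
balancer.record("fax", "element-a", succeeded=False)
balancer.record("fax", "element-b", succeeded=True)
chosen_for_fax = balancer.select("fax")
```

Keeping separate statistics per call type is what lets an element that handles audio well but fax poorly still receive audio calls.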

    Experimental Evaluation Platform for Voice Transmission Over Internet of Things (VoIoTs)

    The Internet of Things (IoT) is an example of the latest advances in Information and Communication Technologies. In particular, with the rapid development of Wireless Sensor Network (WSN) technologies, researchers have largely focused on the benefits of integrating embedded low-cost, low-power WSN technology into various IoT applications. Real-time voice transmission over IoT is one interesting application that many researchers have begun to explore. This paper therefore presents a performance study of voice transmission over WSN (VoWSN) with and without the presence of the Internet. A framework using a Raspberry Pi 3 (RPi3) and the open source FFmpeg technology for processing, compressing, and streaming voice to a remote computer is proposed, implemented, and evaluated. The performance of the proposed framework is evaluated by studying its behavior with three audio encoding algorithms (AC3, MP3, and Opus) at different sampling rates, using a set of evaluation metrics such as one-way delay, jitter, bandwidth (B.W), CPU usage, and packet loss.
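Three of the metrics listed above can be computed from per-packet timestamps as sketched below. The abstract does not say how the paper measures them, so this is an assumption: jitter uses the running interarrival estimator from RFC 3550, one-way delay assumes synchronized sender and receiver clocks, and all function names are hypothetical.

```python
def mean_one_way_delay(send_times, recv_times):
    """Average of (receive - send) per packet; assumes synchronized clocks."""
    if not send_times or len(send_times) != len(recv_times):
        raise ValueError("need equal-length, non-empty timestamp lists")
    delays = [r - s for s, r in zip(send_times, recv_times)]
    return sum(delays) / len(delays)


def interarrival_jitter(send_times, recv_times):
    """RFC 3550 style running jitter estimate: J += (|D| - J) / 16."""
    if len(send_times) != len(recv_times):
        raise ValueError("timestamp lists must have equal length")
    jitter = 0.0
    for i in range(1, len(send_times)):
        # D is the change in transit time between consecutive packets.
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter


def packet_loss_ratio(sent_count, received_count):
    """Fraction of sent packets that never arrived."""
    if sent_count <= 0:
        raise ValueError("sent_count must be positive")
    return (sent_count - received_count) / sent_count


# Timestamps in milliseconds for three packets sent 20 ms apart.
send_ms = [0, 20, 40]
recv_ms = [50, 86, 90]
delay = mean_one_way_delay(send_ms, recv_ms)
jitter = interarrival_jitter(send_ms, recv_ms)
loss = packet_loss_ratio(100, 97)
```

The divisor of 16 is the smoothing gain specified by RFC 3550; it damps the effect of a single late packet on the estimate.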

    Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

    We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representations, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows speech to be synthesized in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under the following link: speechbot.github.io/resynthesis.
    Comment: In Proceedings of Interspeech 202

    Large-Scale Measurement of Real-Time Communication on the Web

    Web Real-Time Communication (WebRTC) is being widely adopted across browsers (Chrome, Firefox, Opera, etc.) and platforms (PC, Android, iOS). It enables application developers to add real-time communication features (text chat, audio/video calls) to web applications using W3C standard JavaScript APIs, so that end users can enjoy real-time multimedia communication from the browser without the complication of installing special applications or browser plug-ins. As WebRTC-based applications are being deployed on the Internet by thousands of companies across the globe, it is very important to understand the quality of the real-time communication services these applications provide. Important performance metrics include whether the communication session was properly set up, the network delays, the packet loss rate, and the throughput. At Callstats.io, we provide a solution that addresses these concerns. By integrating a JavaScript API into WebRTC applications, Callstats.io helps application providers measure Quality of Experience (QoE) related metrics on the end-user side. This thesis illustrates how this WebRTC performance measurement system is designed and built, and presents statistics derived from the collected data to give some insight into the performance of today’s WebRTC-based real-time communication services. According to our measurements, real-time communication over the Internet generally performs well in terms of latency and loss. The throughput is good for about 30% of the communication sessions.

    The Effect Of Acoustic Variability On Automatic Speaker Recognition Systems

    This thesis examines the influence of acoustic variability on automatic speaker recognition systems (ASRs) with three aims: (i) to measure ASR performance under five commonly encountered acoustic conditions; (ii) to contribute towards ASR system development with the provision of new research data; (iii) to assess ASR suitability for forensic speaker comparison (FSC) application and investigative/pre-forensic use. The thesis begins with a literature review and explanation of relevant technical terms. Five categories of research experiments then examine ASR performance, reflective of conditions influencing speech quantity (inhibitors) and speech quality (contaminants), acknowledging that quality often influences quantity. Experiments pertain to: net speech duration, signal-to-noise ratio (SNR), reverberation, frequency bandwidth and transcoding (codecs). The ASR system is placed under scrutiny with examination of settings and optimum conditions (e.g. matched/unmatched test audio and speaker models). Output is examined in relation to baseline performance, and metrics help inform whether ASRs should be applied to suboptimal audio recordings. Results indicate that modern ASRs are relatively resilient to low and moderate levels of the acoustic contaminants and inhibitors examined, whilst remaining sensitive to higher levels. The thesis provides discussion on issues such as the complexity and fragility of the speech signal path, speaker variability, difficulty in measuring conditions and mitigation (thresholds and settings). The application of ASRs to casework is discussed with recommendations, acknowledging the different modes of operation (e.g. investigative usage) and current UK limitations regarding presenting ASR output as evidence in criminal trials.
    In summary, and in the context of acoustic variability, the thesis recommends that ASRs could be applied to pre-forensic cases, accepting that extraneous issues endure which require governance, such as validation of method (ASR standardisation) and population data selection. However, ASRs remain unsuitable for broad forensic application, with many acoustic conditions causing irrecoverable speech data loss that contributes to high error rates.