Speaker Identification for Swiss German with Spectral and Rhythm Features
We present results of speech rhythm analysis for automatic speaker identification, expanding previous experiments that used similar methods for language identification. Features describing the rhythmic properties of salient changes in signal components are extracted and used in a speaker identification task to determine to what extent they are descriptive of speaker variability. We also test the performance of state-of-the-art but simple-to-extract frame-based features. The paper focuses on evaluation on a single corpus (Swiss German, TEVOID) using support vector machines. Results suggest that the general spectral features provide very good performance on this dataset, whereas the rhythm features are less successful in the task, indicating either their unsuitability for the task or the specificity of the dataset.
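As a rough illustration of the frame-based classification setup described above (the TEVOID corpus is not used here; the number of speakers, feature dimensionality, and data are synthetic stand-ins), a support-vector-machine frame classifier might be sketched as:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_speakers, frames_per_speaker, n_features = 8, 200, 13

# Simulate frame-based spectral features (e.g. MFCC-like vectors):
# each speaker gets a distinct mean feature vector plus frame noise.
X = np.vstack([
    rng.normal(loc=rng.normal(0.0, 2.0, n_features),
               scale=1.0,
               size=(frames_per_speaker, n_features))
    for _ in range(n_speakers)
])
y = np.repeat(np.arange(n_speakers), frames_per_speaker)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Standardize features, then classify frames with an RBF-kernel SVM
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_tr, y_tr)
print(f"frame-level accuracy: {clf.score(X_te, y_te):.2f}")
```

In a real system the features would be extracted from speech audio and accuracy would be reported per utterance rather than per frame; this sketch only shows the classifier plumbing.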
Music Structure Boundaries Estimation Using Multiple Self-Similarity Matrices as Input Depth of Convolutional Neural Networks
In this paper, we propose a new input representation for a Convolutional Neural Network with the goal of estimating music structure boundaries. For this task, previous work used a network performing late fusion of a Mel-scaled log-magnitude spectrogram and a self-similarity lag matrix. We propose instead to use the square sub-matrices centered on the main diagonals of several self-similarity matrices, each one computed from a different audio descriptor, and to combine them along the depth of the input layer. We show that this representation improves the results over the use of the self-similarity lag matrix, and that using the depth of the input layer provides a convenient way to perform early fusion of audio representations.
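A minimal sketch of this input construction (the descriptors, frame counts, and patch size below are illustrative placeholders, not the paper's actual configuration): each descriptor yields a self-similarity matrix, square sub-matrices are cut along the main diagonal, and the per-descriptor patches are stacked along the channel (depth) axis.

```python
import numpy as np

def ssm(features):
    # features: (n_frames, dim) -> cosine self-similarity matrix (n_frames, n_frames)
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    return f @ f.T

def diagonal_patches(S, half):
    # Square sub-matrices of size (2*half+1) centered on the main diagonal
    n = S.shape[0]
    return np.stack([S[i - half:i + half + 1, i - half:i + half + 1]
                     for i in range(half, n - half)])

rng = np.random.default_rng(0)
n_frames = 64
mel = rng.random((n_frames, 40))      # stand-in for a Mel spectrogram
chroma = rng.random((n_frames, 12))   # stand-in for a second descriptor

half = 8
# Stack the two descriptors' patches along the last (depth/channel) axis,
# giving one multi-channel input example per analysis frame.
patches = np.stack([diagonal_patches(ssm(mel), half),
                    diagonal_patches(ssm(chroma), half)], axis=-1)
print(patches.shape)  # (48, 17, 17, 2): frames x height x width x descriptors
```

The resulting 4-D array is directly consumable by a CNN expecting multi-channel 2-D inputs, which is what makes depth-stacking a convenient early-fusion mechanism.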
Gaussian Framework for Interference Reduction in Live Recordings
We consider typical full-length live music recordings. In this scenario, some instrumental voices are captured by microphones intended for other voices, leading to so-called "interferences". Reducing this phenomenon is desirable because it opens new possibilities for sound engineers, and it has been shown to increase the performance of music analysis and processing tools (e.g. pitch tracking). In this work we propose a fast NMF-based algorithm to solve this problem.
Gaussian framework for interference reduction in live recordings
In live multitrack recordings, each voice is usually captured by dedicated close microphones. In practice, however, it is also captured by microphones intended for other sources, leading to so-called "interferences". Reducing this interference is desirable because it opens new perspectives for the engineering of live recordings, and it has therefore been the topic of recent research in audio processing. In this paper, we show how a Gaussian probabilistic framework may be set up to obtain good isolation of the target sources. In doing so, we extend several state-of-the-art methods by fixing some heuristic parts of their algorithms. As we show in a perceptual evaluation on real-world multitrack live recordings, the resulting principled techniques yield improved quality.
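Under a Gaussian model of this kind, interference reduction in each time-frequency bin reduces to a Wiener-style gain. The toy sketch below uses oracle power spectral densities and random data purely to show the mechanics; a real system (such as those in these papers) estimates the PSDs from the multitrack itself, e.g. with NMF.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames = 257, 100

# Close-mic signal model: the wanted voice plus bleed from another source,
# both modeled as zero-mean Gaussian in each time-frequency bin.
target = 2.0 * rng.normal(size=(n_freq, n_frames))
bleed = 1.0 * rng.normal(size=(n_freq, n_frames))
mix = target + bleed

# Oracle power spectral densities (a real system estimates these jointly).
psd_target, psd_bleed = 4.0, 1.0

# Wiener gain = posterior mean of the target given the mixture
# under the Gaussian model.
gain = psd_target / (psd_target + psd_bleed)   # 0.8 here
estimate = gain * mix

err_before = np.mean((mix - target) ** 2)
err_after = np.mean((estimate - target) ** 2)
print(err_after < err_before)  # the filtered close mic is nearer the true voice
```

The "principled" aspect of the Gaussian framework is that this gain is derived from the model rather than hand-tuned, which is what replaces the heuristic parts of earlier methods.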
Audiovisual Database with 360 Video and Higher-Order Ambisonics Audio for Perception, Cognition, Behavior, and QoE Evaluation Research
Research into multi-modal perception, human cognition, behavior, and
attention can benefit from high-fidelity content that may recreate
real-life-like scenes when rendered on head-mounted displays. Moreover, aspects
of audiovisual perception, cognitive processes, and behavior may complement
questionnaire-based Quality of Experience (QoE) evaluation of interactive
virtual environments. Currently, there is a lack of high-quality open-source
audiovisual databases that can be used to evaluate such aspects or systems
capable of reproducing high-quality content. With this paper, we provide a
publicly available audiovisual database consisting of twelve scenes capturing
real-life nature and urban environments with a video resolution of 7680x3840 at
60 frames-per-second and with 4th-order Ambisonics audio. These 360 video
sequences, with an average duration of 60 seconds, represent real-life settings
for systematically evaluating various dimensions of uni-/multi-modal
perception, cognition, behavior, and QoE. The paper provides details of the
scene requirements, recording approach, and scene descriptions. The database
provides high-quality reference material with a balanced focus on auditory and
visual sensory information. The database will be continuously updated with
additional scenes and further metadata such as human ratings and saliency
information.
Comment: 6 pages, 2 figures, accepted and presented at the 2022 14th
International Conference on Quality of Multimedia Experience (QoMEX).
Database is publicly accessible at https://qoevave.github.io/database
An Intelligent audio workstation in the browser
Music production is a complex process requiring skill and time to undertake. Although the industry has undergone a digital revolution, the production process itself, unlike that of other industries, has changed little. Intelligent systems, using the semantic web and signal processing, can reduce this complexity by making certain decisions for the user with minimal interaction, saving both time and effort on the engineer's part. This paper outlines an intelligent Digital Audio Workstation (DAW) designed for use in the browser, describing the architecture of the DAW with its audio engine (built on the Web Audio API), AngularJS for the user interface, and a relational database.
Reviews on Technology and Standard of Spatial Audio Coding
Market demand for more immersive entertainment media has motivated the delivery of three-dimensional (3D) audio content to home consumers through Ultra High Definition TV (UHDTV), the next generation of TV broadcasting, in which spatial audio coding plays a fundamental role. This paper reviews fundamental concepts of spatial audio coding, including its technology, standards, and applications. The basic principle of object-based audio reproduction is also elaborated and compared with the traditional channel-based approach, to provide a good understanding of this popular interactive reproduction system, which gives end users the flexibility to render their own preferred audio composition.
Keywords: spatial audio, audio coding, multi-channel audio signals, MPEG standard, object-based audio
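To make the channel-based versus object-based contrast concrete: in a channel-based mix, panning gains are baked into fixed loudspeaker channels at production time, whereas an object-based renderer applies them at playback from the object's position metadata. A toy stereo renderer using a constant-power panning law (the law and values are illustrative, not from any MPEG standard):

```python
import numpy as np

def pan_gains(pan):
    # pan in [-1 (full left), +1 (full right)]; constant-power law
    # keeps gl**2 + gr**2 == 1 so loudness is position-independent.
    angle = (pan + 1.0) * np.pi / 4.0
    return np.cos(angle), np.sin(angle)

t = np.linspace(0.0, 1.0, 48000, endpoint=False)
obj = np.sin(2.0 * np.pi * 440.0 * t)   # one mono audio "object"

# The renderer, not the producer, chooses the position at playback time:
gl, gr = pan_gains(0.5)                  # listener-preferred placement
stereo = np.stack([gl * obj, gr * obj])  # rendered loudspeaker feeds

print(round(gl**2 + gr**2, 6))  # 1.0 — total power preserved at any position
```

Replacing the stereo gain pair with gains for an arbitrary loudspeaker layout is what lets the same object stream serve many reproduction setups, which is the flexibility the abstract refers to.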