107 research outputs found

    Stereo linear predictive coding of audio

    Get PDF

    Object-based Modeling of Audio for Coding and Source Separation

    Get PDF
    This thesis studies several data decomposition algorithms for obtaining an object-based representation of an audio signal. The estimation of the representation parameters are coupled with audio-specific criteria, such as the spectral redundancy, sparsity, perceptual relevance and spatial position of sounds. The objective is to obtain an audio signal representation that is composed of meaningful entities called audio objects that reflect the properties of real-world sound objects and events. The estimation of the object-based model is based on magnitude spectrogram redundancy using non-negative matrix factorization with extensions to multichannel and complex-valued data. The benefits of working with object-based audio representations over the conventional time-frequency bin-wise processing are studied. The two main applications of the object-based audio representations proposed in this thesis are spatial audio coding and sound source separation from multichannel microphone array recordings. In the proposed spatial audio coding algorithm, the audio objects are estimated from the multichannel magnitude spectrogram. The audio objects are used for recovering the content of each original channel from a single downmixed signal, using time-frequency filtering. The perceptual relevance of modeling the audio signal is considered in the estimation of the parameters of the object-based model, and the sparsity of the model is utilized in encoding its parameters. Additionally, a quantization of the model parameters is proposed that reflects the perceptual relevance of each quantized element. The proposed object-based spatial audio coding algorithm is evaluated via listening tests and comparing the overall perceptual quality to conventional time-frequency block-wise methods at the same bitrates. The proposed approach is found to produce comparable coding efficiency while providing additional functionality via the object-based coding domain representation, such as the blind separation of the mixture of sound sources in the encoded channels. For the sound source separation from multichannel audio recorded by a microphone array, a method combining an object-based magnitude model and spatial covariance matrix estimation is considered. A direction of arrival-based model for the spatial covariance matrices of the sound sources is proposed. Unlike the conventional approaches, the estimation of the parameters of the proposed spatial covariance matrix model ensures a spatially coherent solution for the spatial parameterization of the sound sources. The separation quality is measured with objective criteria and the proposed method is shown to improve over the state-of-the-art sound source separation methods, with recordings done using a small microphone array

    Étude de transformées temps-fréquence pour le codage audio faible retard en haute qualité

    Get PDF
    In recent years there has been a phenomenal increase in the number of products and applications which make use of audio coding formats. Amongthe most successful audio coding schemes, the MPEG-1 Layer III (mp3), the MPEG-2 Advanced Audio Coding (AAC) or its evolution MPEG-4High Efficiency-Advanced Audio Coding (HE-AAC) can be cited. More recently, perceptual audio coding has been adapted to achieve codingat low-delay such to become suitable for conversational applications. Traditionally, the use of filter bank such as the Modified Discrete CosineTransform (MDCT) is a central component of perceptual audio coding and its adaptation to low delay audio coding has become an important researchtopic. Low delay transforms have been developed in order to retain the performance of standard audio coding while reducing dramatically the associated algorithmic delay.This work presents some elements allowing to better accommodate the delay reduction constraint. Among the contributions, a low delay blockswitching tool which allows the direct transition between long transform and short transform without the insertion of transition window. The sameprinciple has been extended to define new perfect reconstruction conditions for the MDCT with relaxed constraints compared to the original definition.As a consequence, a seamless reconstruction method has been derived to increase the flexibility of transform coding schemes with the possibility toselect a transform for a frame independently from its neighbouring frames. Finally, based on this new approach, a new low delay window design procedure has been derived to obtain an analytic definition for a new family of transforms, permitting high quality with a substantial coding delay reduction. The performance of the proposed transforms has been thoroughly evaluated, an evaluation framework involving an objective measurement of the optimal transform sequence is proposed. It confirms the relevance of the proposed transforms used for audio coding. In addition, the new approaches have been successfully applied to the recent standardisation work items, such as the low delay audio coding developed at MPEG (LD-AAC and ELD-AAC) and they have been evaluated with numerous subjective testing, showing a significant improvement of the quality for transient signals. The new low delay window design has been adopted in G.718, a scalable speech and audio codec standardized in ITU-T and has demonstrated its benefit in terms of delay reduction while maintaining the audio quality of a traditional MDCT.Codage audio à faible retard à l'aide de la définition de nouvelles fenêtres pour la transformée MDCT et l'introduction d'un nouveau schéma de commutation de fenêtre

    SIGNAL TRANSFORMATIONS FOR IMPROVING INFORMATION REPRESENTATION, FEATURE EXTRACTION AND SOURCE SEPARATION

    Get PDF
    Questa tesi riguarda nuovi metodi di rappresentazione del segnale nel dominio tempo-frequenza, tali da mostrare le informazioni ricercate come dimensioni esplicite di un nuovo spazio. In particolare due trasformate sono introdotte: lo Spazio di Miscelazione Bivariato (Bivariate Mixture Space) e il Campo della Struttura Spettro-Temporale (Spectro-Temporal Structure-Field). La prima trasformata mira a evidenziare le componenti latenti di un segnale bivariato basandosi sul comportamento di ogni componente frequenziale (ad esempio a fini di separazione delle sorgenti); la seconda trasformata mira invece all'incapsulamento di informazioni relative al vicinato di un punto in R^2 in un vettore associato al punto stesso, tale da descrivere alcune propriet\ue0 topologiche della funzione di partenza. Nel dominio dell'elaborazione digitale del segnale audio, il Bivariate Mixture Space pu\uf2 essere interpretato come un modo di investigare lo spazio stereofonico per operazioni di separazione delle sorgenti o di estrazione di informazioni, mentre lo Spectro-Temporal Structure-Field pu\uf2 essere usato per ispezionare lo spazio spettro-temporale (segregare suoni percussivi da suoni intonati o tracciae modulazioni di frequenza). Queste trasformate sono studiate e testate anche in relazione allo stato del'arte in campi come la separazione delle sorgenti, l'estrazione di informazioni e la visualizzazione dei dati. Nel campo dell'informatica applicata al suono, queste tecniche mirano al miglioramento della rappresentazione del segnale nel dominio tempo-frequenza, in modo tale da rendere possibile l'esplorazione dello spettro anche in spazi alternativi, quali il panorama stereofonico o una dimensione virtuale che separa gli aspetti percussivi da quelli intonati.This thesis is about new methods of signal representation in time-frequency domain, so that required information is rendered as explicit dimensions in a new space. In particular two transformations are presented: Bivariate Mixture Space and Spectro-Temporal Structure-Field. The former transform aims at highlighting latent components of a bivariate signal based on the behaviour of each frequency base (e.g. for source separation purposes), whereas the latter aims at folding neighbourhood information of each point of a R^2 function into a vector, so as to describe some topological properties of the function. In the audio signal processing domain, the Bivariate Mixture Space can be interpreted as a way to investigate the stereophonic space for source separation and Music Information Retrieval tasks, whereas the Spectro-Temporal Structure-Field can be used to inspect spectro-temporal dimension (segregate pitched vs. percussive sounds or track pitch modulations). These transformations are investigated and tested against state-of-the-art techniques in fields such as source separation, information retrieval and data visualization. In the field of sound and music computing, these techniques aim at improving the frequency domain representation of signals such that the exploration of the spectrum can be achieved also in alternative spaces like the stereophonic panorama or a virtual percussive vs. pitched dimension

    The development of speech coding and the first standard coder for public mobile telephony

    Get PDF
    This thesis describes in its core chapter (Chapter 4) the original algorithmic and design features of the ??rst coder for public mobile telephony, the GSM full-rate speech coder, as standardized in 1988. It has never been described in so much detail as presented here. The coder is put in a historical perspective by two preceding chapters on the history of speech production models and the development of speech coding techniques until the mid 1980s, respectively. In the epilogue a brief review is given of later developments in speech coding. The introductory Chapter 1 starts with some preliminaries. It is de- ??ned what speech coding is and the reader is introduced to speech coding standards and the standardization institutes which set them. Then, the attributes of a speech coder playing a role in standardization are explained. Subsequently, several applications of speech coders - including mobile telephony - will be discussed and the state of the art in speech coding will be illustrated on the basis of some worldwide recognized standards. Chapter 2 starts with a summary of the features of speech signals and their source, the human speech organ. Then, historical models of speech production which form the basis of di??erent kinds of modern speech coders are discussed. Starting with a review of ancient mechanical models, we will arrive at the electrical source-??lter model of the 1930s. Subsequently, the acoustic-tube models as they arose in the 1950s and 1960s are discussed. Finally the 1970s are reviewed which brought the discrete-time ??lter model on the basis of linear prediction. In a unique way the logical sequencing of these models is exposed, and the links are discussed. Whereas the historical models are discussed in a narrative style, the acoustic tube models and the linear prediction tech nique as applied to speech, are subject to more mathematical analysis in order to create a sound basis for the treatise of Chapter 4. This trend continues in Chapter 3, whenever instrumental in completing that basis. In Chapter 3 the reader is taken by the hand on a guided tour through time during which successive speech coding methods pass in review. In an original way special attention is paid to the evolutionary aspect. Speci??cally, for each newly proposed method it is discussed what it added to the known techniques of the time. After presenting the relevant predecessors starting with Pulse Code Modulation (PCM) and the early vocoders of the 1930s, we will arrive at Residual-Excited Linear Predictive (RELP) coders, Analysis-by-Synthesis systems and Regular- Pulse Excitation in 1984. The latter forms the basis of the GSM full-rate coder. In Chapter 4, which constitutes the core of this thesis, explicit forms of Multi-Pulse Excited (MPE) and Regular-Pulse Excited (RPE) analysis-by-synthesis coding systems are developed. Starting from current pulse-amplitude computation methods in 1984, which included solving sets of equations (typically of order 10-16) two hundred times a second, several explicit-form designs are considered by which solving sets of equations in real time is avoided. Then, the design of a speci??c explicitform RPE coder and an associated eÆcient architecture are described. The explicit forms and the resulting architectural features have never been published in so much detail as presented here. Implementation of such a codec enabled real-time operation on a state-of-the-art singlechip digital signal processor of the time. This coder, at a bit rate of 13 kbit/s, has been selected as the Full-Rate GSM standard in 1988. Its performance is recapitulated. Chapter 5 is an epilogue brie y reviewing the major developments in speech coding technology after 1988. Many speech coding standards have been set, for mobile telephony as well as for other applications, since then. The chapter is concluded by an outlook

    Neural architecture for echo suppression during sound source localization based on spiking neural cell models

    Get PDF
    Zusammenfassung Diese Arbeit untersucht die biologischen Ursachen des psycho-akustischen Präzedenz Effektes, der Menschen in die Lage versetzt, akustische Echos während der Lokalisation von Schallquellen zu unterdrücken. Sie enthält ein Modell zur Echo-Unterdrückung während der Schallquellenlokalisation, welches in technischen Systemen zur Mensch-Maschine Interaktion eingesetzt werden kann. Die Grundlagen dieses Modells wurden aus eigenen elektrophysiologischen Experimenten an der Mongolischen Wüstenrennmaus gewonnen. Die dabei erstmalig an der Wüstenrennmaus erzielten Ergebnisse, zeigen ein besonderes Verhalten spezifischer Zellen im Dorsalen Kern des Lateral Lemniscus, einer dedizierten Region des auditorischen Hirnstammes. Die dort sichtbare Langzeithemmung scheint die Grundlage für die Echounterdrückung in höheren auditorischen Zentren zu sein. Das entwickelte Model war in der Lage dieses Verhalten nachzubilden, und legt die Vermutung nahe, dass eine starke und zeitlich präzise Hyperpolarisation der zugrundeliegende physiologische Mechanismus dieses Verhaltens ist. Die entwickelte Neuronale Modellarchitektur modelliert das Innenohr und fünf wesentliche Kerne des auditorischen Hirnstammes in ihrer Verbindungsstruktur und internen Dynamik. Sie stellt einen neuen Typus neuronaler Modellierung dar, der als Spike-Interaktionsmodell (SIM) bezeichnet wird. SIM nutzen die präzise räumlich-zeitliche Interaktion einzelner Aktionspotentiale (Spikes) für die Kodierung und Verarbeitung neuronaler Informationen. Die Basis dafür bilden Integrate-and-Fire Neuronenmodelle sowie Hebb'sche Synapsen, welche um speziell entwickelte dynamische Kernfunktionen erweitert wurden. Das Modell ist in der Lage, Zeitdifferenzen von 10 mykrosekunden zu detektieren und basiert auf den Prinzipien der zeitlichen und räumlichen Koinzidenz sowie der präzisen lokalen Inhibition. Es besteht ausschließlich aus Elementen einer eigens entwickelten Neuronalen Basisbibliothek (NBL) die speziell für die Modellierung verschiedenster Spike- Interaktionsmodelle entworfen wurde. Diese Bibliothek erweitert die kommerziell verfügbare dynamische Simulationsumgebung von MATLAB/SIMULINK um verschiedene Modelle von Neuronen und Synapsen, welche die intrinsischen dynamischen Eigenschaften von Nervenzellen nachbilden. Die Nutzung dieser Bibliothek versetzt sowohl den Ingenieur als auch den Biologen in die Lage, eigene, biologisch plausible, Modelle der neuronalen Informationsverarbeitung ohne detaillierte Programmierkenntnisse zu entwickeln. Die grafische Oberfläche ermöglicht strukturelle sowie parametrische Modifikationen und ist in der Lage, den Zeitverlauf mikroskopischer Zellpotentiale aber auch makroskopischer Spikemuster während und nach der Simulation darzustellen. Zwei grundlegende Elemente der Neuronalen Basisbibliothek wurden zur Implementierung als spezielle analog-digitale Schaltungen vorbereitet. Erste Silizium Implementierungen durch das Team des DFG Graduiertenkollegs GRK 164 konnten die Möglichkeit einer vollparallelen on line Verarbeitung von Schallsignalen nachweisen. Durch Zuhilfenahme des im GRK entwickelten automatisierten Layout Generators wird es möglich, spezielle Prozessoren zur Anwendung biologischer Verarbeitungsprinzipien in technischen Systemen zu entwickeln. Diese Prozessoren unterscheiden sich grundlegend von den klassischen von Neumann Prozessoren indem sie räumlich und zeitlich verteilte Spikemuster, anstatt sequentieller binärer Werte zur Informationsrepräsentation nutzen. Sie erweitern das digitale Kodierungsprinzip durch die Dimensionen des Raumes (2 dimensionale Nachbarschaft) der Zeit (Frequenz, Phase und Amplitude) sowie der zeitlichen Dynamik analoger Potentialverläufe. Diese Dissertation besteht aus sieben Kapiteln, welche den verschiedenen Bereichen der Computational Neuroscience gewidmet sind. Kapitel 1 beschreibt die Motivation dieser Arbeit welche aus der Absicht rühren, biologische Prinzipien der Schallverarbeitung zu erforschen und für technische Systeme während der Interaktion mit dem Menschen nutzbar zu machen. Zusätzlich werden fünf Gründe für die Nutzung von Spike-Interaktionsmodellen angeführt sowie deren neuartiger Charakter beschrieben. Kapitel 2 führt die biologischen Prinzipien der Schallquellenlokalisation und den psychoakustischen Präzedenz Effekt ein. Aktuelle Hypothesen zur Entstehung dieses Effektes werden anhand ausgewählter experimenteller Ergebnisse verschiedener Forschungsgruppen diskutiert. Kapitel 3 beschreibt die entwickelte Neuronale Basisbibliothek und führt die einzelnen neuronalen Simulationselemente ein. Es erklärt die zugrundeliegenden mathematischen Funktionen der dynamischen Komponenten und beschreibt deren generelle Einsetzbarkeit zur dynamischen Simulation spikebasierter Neuronaler Netzwerke. Kapitel 4 enthält ein speziell entworfenes Modell des auditorischen Hirnstammes beginnend mit den Filterkaskaden zur Simulation des Innenohres, sich fortsetzend über mehr als 200 Zellen und 400 Synapsen in 5 auditorischen Kernen bis zum Richtungssensor im Bereich des auditorischen Mittelhirns. Es stellt die verwendeten Strukturen und Parameter vor und enthält grundlegende Hinweise zur Nutzung der Simulationsumgebung. Kapitel 5 besteht aus drei Abschnitten, wobei der erste Abschnitt die Experimentalbedingungen und Ergebnisse der eigens durchgeführten Tierversuche beschreibt. Der zweite Abschnitt stellt die Ergebnisse von 104 Modellversuchen zur Simulationen psycho-akustischer Effekte dar, welche u.a. die Fähigkeit des Modells zur Nachbildung des Präzedenz Effektes testen. Schließlich beschreibt der letzte Abschnitt die Ergebnisse der 54 unter realen Umweltbedingungen durchgeführten Experimente. Dabei kamen Signale zur Anwendung, welche in normalen sowie besonders stark verhallten Räumen aufgezeichnet wurden. Kapitel 6 vergleicht diese Ergebnisse mit anderen biologisch motivierten und technischen Verfahren zur Echounterdrückung und Schallquellenlokalisation und führt den aktuellen Status der Hardwareimplementierung ein. Kapitel 7 enthält schließlich eine kurze Zusammenfassung und einen Ausblick auf weitere Forschungsobjekte und geplante Aktivitäten. Diese Arbeit möchte zur Entwicklung der Computational Neuroscience beitragen, indem sie versucht, in einem speziellen Anwendungsfeld die Lücke zwischen biologischen Erkenntnissen, rechentechnischen Modellen und Hardware Engineering zu schließen. Sie empfiehlt ein neues räumlich-zeitliches Paradigma der dynamischen Informationsverarbeitung zur Erschließung biologischer Prinzipien der Informationsverarbeitung für technische Anwendungen.This thesis investigates the biological background of the psycho-acoustical precedence effect, enabling humans to suppress echoes during the localization of sound sources. It provides a technically feasible and biologically plausible model for sound source localization under echoic conditions, ready to be used by technical systems during man-machine interactions. The model is based upon own electro-physiological experiments in the mongolian gerbil. The first time in gerbils obtained results reveal a special behavior of specific cells of the dorsal nucleus of the lateral lemniscus (DNLL) - a distinct region in the auditory brainstem. The explored persistent inhibition effect of these cells seems to account for the base of echo suppression at higher auditory centers. The developed model proved capable to duplicate this behavior and suggests, that a strong and timely precise hyperpolarization is the basic mechanism behind this cell behavior. The developed neural architecture models the inner ear as well as five major nuclei of the auditory brainstem in their connectivity and intrinsic dynamics. It represents a new type of neural modeling described as Spike Interaction Models (SIM). SIM use the precise spatio-temporal interaction of single spike events for coding and processing of neural information. Their basic elements are Integrate-and-Fire Neurons and Hebbian synapses, which have been extended by specially designed dynamic transfer functions. The model is capable to detect time differences as small as 10 mircrosecondes and employs the principles of coincidence detection and precise local inhibition for auditory processing. It consists exclusively of elements of a specifically designed Neural Base Library (NBL), which has been developed for multi purpose modeling of Spike Interaction Models. This library extends the commercially available dynamic simulation environment of MATLAB/SIMULINK by different models of neurons and synapses simulating the intrinsic dynamic properties of neural cells. The usage of this library enables engineers as well as biologists to design their own, biologically plausible models of neural information processing without the need for detailed programming skills. Its graphical interface provides access to structural as well as parametric changes and is capable to display the time course of microscopic cell parameters as well as macroscopic firing pattern during simulations and thereafter. Two basic elements of the Neural Base Library have been prepared for implementation by specialized mixed analog-digital circuitry. First silicon implementations were realized by the team of the DFG Graduiertenkolleg GRK 164 and proved the possibility of fully parallel on line processing of sounds. By using the automated layout processor under development in the Graduiertenkolleg, it will be possible to design specific processors in order to apply theprinciples of distributed biological information processing to technical systems. These processors differ from classical von Neumann processors by the use of spatio temporal spike pattern instead of sequential binary values. They will extend the digital coding principle by the dimensions of space (spatial neighborhood), time (frequency, phase and amplitude) as well as the dynamics of analog potentials and introduce a new type of information processing. This thesis consists of seven chapters, dedicated to the different areas of computational neuroscience. Chapter 1: provides the motivation of this study arising from the attempt to investigate the biological principles of sound processing and make them available to technical systems interacting with humans under real world conditions. Furthermore, five reasons to use spike interaction models are given and their novel characteristics are discussed. Chapter 2: introduces the biological principles of sound source localization and the precedence effect. Current hypothesis on echo suppression and the underlying principles of the precedence effect are discussed by reference to a small selection of physiological and psycho-acoustical experiments. Chapter 3: describes the developed neural base library and introduces each of the designed neural simulation elements. It also explains the developed mathematical functions of the dynamic compartments and describes their general usage for dynamic simulation of spiking neural networks. Chapter 4: introduces the developed specific model of the auditory brainstem, starting from the filtering cascade in the inner ear via more than 200 cells and 400 synapses in five auditory regions up to the directional sensor at the level of the auditory midbrain. It displays the employed parameter sets and contains basic hints for the set up and configuration of the simulation environment. Chapter 5: consists of three sections, whereas the first one describes the set up and results of the own electro-physiological experiments. The second describes the results of 104 model simulations, performed to test the models ability to duplicate psycho-acoustical effects like the precedence effect. Finally, the last section of this chapter contains the results of 54 real world experiments using natural sound signals, recorded under normal as well as highly reverberating conditions. Chapter 6: compares the achieved results to other biologically motivated and technical models for echo suppression and sound source localization and introduces the current status of silicon implementation. Chapter 7: finally provides a short summary and an outlook toward future research subjects and areas of investigation. This thesis aims to contribute to the field of computational neuroscience by bridging the gap between biological investigation, computational modeling and silicon engineering in a specific field of application. It suggests a new spatio-temporal paradigm of information processing in order to access the capabilities of biological systems for technical applications

    Distributed multimedia systems

    Get PDF
    A distributed multimedia system (DMS) is an integrated communication, computing, and information system that enables the processing, management, delivery, and presentation of synchronized multimedia information with quality-of-service guarantees. Multimedia information may include discrete media data, such as text, data, and images, and continuous media data, such as video and audio. Such a system enhances human communications by exploiting both visual and aural senses and provides the ultimate flexibility in work and entertainment, allowing one to collaborate with remote participants, view movies on demand, access on-line digital libraries from the desktop, and so forth. In this paper, we present a technical survey of a DMS. We give an overview of distributed multimedia systems, examine the fundamental concept of digital media, identify the applications, and survey the important enabling technologies.published_or_final_versio

    Computer Models for Musical Instrument Identification

    Get PDF
    PhDA particular aspect in the perception of sound is concerned with what is commonly termed as texture or timbre. From a perceptual perspective, timbre is what allows us to distinguish sounds that have similar pitch and loudness. Indeed most people are able to discern a piano tone from a violin tone or able to distinguish different voices or singers. This thesis deals with timbre modelling. Specifically, the formant theory of timbre is the main theme throughout. This theory states that acoustic musical instrument sounds can be characterised by their formant structures. Following this principle, the central point of our approach is to propose a computer implementation for building musical instrument identification and classification systems. Although the main thrust of this thesis is to propose a coherent and unified approach to the musical instrument identification problem, it is oriented towards the development of algorithms that can be used in Music Information Retrieval (MIR) frameworks. Drawing on research in speech processing, a complete supervised system taking into account both physical and perceptual aspects of timbre is described. The approach is composed of three distinct processing layers. Parametric models that allow us to represent signals through mid-level physical and perceptual representations are considered. Next, the use of the Line Spectrum Frequencies as spectral envelope and formant descriptors is emphasised. Finally, the use of generative and discriminative techniques for building instrument and database models is investigated. Our system is evaluated under realistic recording conditions using databases of isolated notes and melodic phrases

    Proceedings of the Second International Mobile Satellite Conference (IMSC 1990)

    Get PDF
    Presented here are the proceedings of the Second International Mobile Satellite Conference (IMSC), held June 17-20, 1990 in Ottawa, Canada. Topics covered include future mobile satellite communications concepts, aeronautical applications, modulation and coding, propagation and experimental systems, mobile terminal equipment, network architecture and control, regulatory and policy considerations, vehicle antennas, and speech compression
    corecore