198 research outputs found

    AXMEDIS 2008

    Get PDF
    The AXMEDIS International Conference series aims to explore all subjects and topics related to cross-media and digital-media content production, processing, management, standards, representation, sharing, protection and rights management, and to address the latest developments and future trends of these technologies and their applications, impacts and exploitation. The AXMEDIS events offer venues for exchanging concepts, requirements, prototypes, research ideas, and findings that can contribute to academic research and also benefit business and industrial communities. In the Internet and digital era, cross-media production and distribution represent key developments and innovations, fostered by emergent technologies to ensure better value for money while optimising productivity and market coverage.

    Digital neuromorphic auditory systems

    Get PDF
    This dissertation presents several digital neuromorphic auditory systems. Neuromorphic systems can run in real time at a lower computational cost and with lower power consumption than widely available general-purpose computers. The auditory systems presented here are considered neuromorphic because they are modelled after computational models of the mammalian auditory pathway and can run on digital hardware, more specifically on a field-programmable gate array (FPGA). The models introduced are categorised into three parts: a cochlear model, an auditory pitch model, and a functional primary auditory cortical (A1) model. The cochlear model is the primary interface for an input sound signal and transmits a 2D time-frequency representation of the sound to the pitch model as well as to the A1 model. In the pitch model, pitch information is extracted from the sound signal in the form of a fundamental frequency. From the A1 model, timbre information is extracted in the form of the time-frequency envelope of the sound signal. Since these computational auditory models must be implemented on FPGAs, which possess fewer computational resources than general-purpose computers, the algorithms are optimised so that each model fits on a single FPGA. The optimisation includes using simplified, hardware-implementable signal processing algorithms. Computational resource information for each model on the FPGA is reported to establish the minimum resources required to run it, including the number of logic modules, the number of registers utilised, and the power consumption. Similarity comparisons are also made between the output responses of the models in software and in hardware, using pure tones, chirp signals, frequency-modulated signals, moving ripple signals, and musical signals as input. The limitations of the models' responses to musical signals at multiple intensity levels are also presented, along with the use of an automatic gain control algorithm to alleviate them. With real-world musical signals as input, the responses of the models are further tested using classifiers: the response of the auditory pitch model is used for the classification of monophonic musical notes, and the response of the A1 model is used for the classification of musical instruments from their respective monophonic signals. Classification accuracy is reported for model output responses in both software and hardware. With the hardware-implementable auditory pitch model, classification accuracy is 100% for musical notes from the 4th and 5th octaves, covering 24 note classes. With the hardware-implementable auditory timbre (A1) model, classification accuracy is 92% for 12 classes of musical instruments. Also presented is the difference in memory requirements of the model output responses in software and hardware: the pitch and timbre responses used for the classification tasks require 24 times and 2 times less memory, respectively, in hardware than in software.
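
    The pitch model described above reduces, at its core, to estimating a fundamental frequency from short frames of the cochlear output. Purely as an illustration of that idea, and not of the thesis's hardware-optimised algorithm, the following Python sketch estimates f0 by a plain autocorrelation peak search; the sample rate, frame length, and search range are arbitrary assumptions.

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=60.0, fmax=1000.0):
    """Estimate a fundamental frequency (Hz) from one audio frame
    using a plain autocorrelation peak search. Illustrative only."""
    frame = frame - np.mean(frame)
    # Full autocorrelation, keep non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)                    # smallest lag (highest f0) to consider
    lag_max = min(int(sr / fmin), len(ac) - 1)  # largest lag (lowest f0) to consider
    if lag_max <= lag_min:
        return None
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / best_lag

# Example: a 440 Hz tone should yield an estimate near 440 Hz.
sr = 16000
t = np.arange(0, 0.05, 1.0 / sr)
tone = np.sin(2 * np.pi * 440.0 * t)
print(estimate_f0_autocorr(tone, sr))
```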

    Automatic Transcription of Bass Guitar Tracks applied for Music Genre Classification and Sound Synthesis

    Get PDF
    Music recordings most often consist of multiple instrument signals which overlap in time and frequency. In the field of Music Information Retrieval (MIR), existing algorithms for the automatic transcription and analysis of music recordings aim to extract semantic information from these mixed audio signals. In recent years it has frequently been observed that algorithm performance is limited by the signal interference and the resulting loss of information. One common approach to this problem is to first apply source separation algorithms to isolate the individual instrument signals before analyzing them; however, the performance of source separation algorithms depends strongly on the number of instruments and on the amount of spectral overlap. In this thesis, isolated instrumental tracks are therefore analyzed in order to circumvent the challenges of source separation. The focus is on the development of instrument-centered signal processing algorithms for music transcription, musical analysis, and sound synthesis. The electric bass guitar is chosen as the example instrument; its sound production principles are closely investigated and considered in the algorithmic design.
    In the first part of the thesis, an automatic music transcription algorithm for electric bass guitar recordings is presented. The audio signal is interpreted as a sequence of sound events, each described by several parameters. In addition to the conventional score-level parameters note onset, duration, loudness, and pitch, instrument-specific parameters such as the applied playing techniques and the geometric position on the fretboard (string and fret) are extracted. Evaluation experiments on two newly created audio data sets confirm that the proposed transcription algorithm outperforms three state-of-the-art bass transcription algorithms on realistic bass guitar recordings. The estimation of the instrument-level parameters works with high accuracy, in particular for isolated note samples.
    The second part of the thesis investigates whether analyzing only the bassline of a music piece allows its music genre to be classified automatically. Score-based audio features are proposed that quantify tonal, rhythmic, and structural properties of basslines. Based on a novel data set of 520 bassline transcriptions from 13 different music genres, three approaches for automatic genre classification were compared; a rule-based classification system achieved a mean class accuracy of 64.8 % using only features extracted from the bassline of a music piece.
    The re-synthesis of bass guitar recordings from the previously extracted note parameters is studied in the third part of the thesis. Based on the physical modeling of string instruments, a novel sound synthesis algorithm tailored to the electric bass guitar is presented. The algorithm mimics different aspects of the instrument's sound production mechanism such as string excitation, string damping, string-fret collision, and the influence of the electro-magnetic pickup. Furthermore, a parametric audio coding approach is discussed that allows bass guitar tracks to be encoded and transmitted at a significantly smaller bit rate than conventional audio coding algorithms require. The results of several listening tests confirm that a higher perceptual quality is achieved when the original bass guitar recordings are encoded and re-synthesized with the proposed parametric audio codec than when they are encoded with conventional audio codecs at very low bit rate settings.
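
    Physical-modeling synthesis of a plucked string, the family of techniques used in the third part, is commonly introduced through the Karplus-Strong algorithm: a noise-filled delay line filtered in a feedback loop. The Python sketch below shows only that textbook variant, with an arbitrary sample rate, decay factor, and pitch; it does not reproduce the thesis's bass guitar model with string-fret collision and pickup simulation.

```python
import numpy as np

def karplus_strong(f0, duration, sr=44100, decay=0.996):
    """Synthesize a plucked-string tone with the basic
    Karplus-Strong algorithm (illustrative only)."""
    n_samples = int(duration * sr)
    period = int(sr / f0)                       # delay-line length in samples
    buf = np.random.uniform(-1.0, 1.0, period)  # burst of noise acts as the "pluck"
    out = np.empty(n_samples)
    for i in range(n_samples):
        out[i] = buf[i % period]
        # Averaging two neighbouring samples acts as a low-pass filter,
        # so higher harmonics decay faster, as on a real string.
        buf[i % period] = decay * 0.5 * (buf[i % period] + buf[(i + 1) % period])
    return out

# Example: one second of a low E (~41.2 Hz), roughly an open E string on a bass guitar.
tone = karplus_strong(41.2, 1.0)
```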

    Creating music by listening

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2005. Includes bibliographical references (p. 127-139). Machines have the power and potential to make expressive music on their own. This thesis aims to computationally model the process of creating music using experience from listening to examples. Our unbiased signal-based solution models the life cycle of listening, composing, and performing, turning the machine into an active musician instead of simply an instrument. We accomplish this through an analysis-synthesis technique based on combined perceptual and structural modeling of the musical surface, which leads to a minimal data representation. We introduce a music cognition framework that results from the interaction of psychoacoustically grounded causal listening, a time-lag embedded feature representation, and perceptual similarity clustering. Our bottom-up analysis intends to be generic and uniform by recursively revealing metrical hierarchies and structures of pitch, rhythm, and timbre. Training is suggested for top-down unbiased supervision, and is demonstrated with the prediction of downbeat. This musical intelligence enables a range of original manipulations including song alignment, music restoration, cross-synthesis or song morphing, and ultimately the synthesis of original pieces. by Tristan Jehan. Ph.D.
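
    The time-lag embedded feature representation and perceptual similarity clustering mentioned above can be illustrated generically: stack each feature frame with its recent past, then cluster the stacked vectors so that similar moments share a label. The Python sketch below does this on random placeholder features; the feature dimensionality, lag depth, and cluster count are assumptions for illustration, not the representation used in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

def time_lag_embed(features, n_lags):
    """Stack each frame with its n_lags predecessors.
    features: (n_frames, n_dims) -> (n_frames - n_lags, n_dims * (n_lags + 1))."""
    frames = [features[i : len(features) - n_lags + i] for i in range(n_lags + 1)]
    return np.hstack(frames)

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 12))        # placeholder for per-frame audio features
embedded = time_lag_embed(feats, n_lags=4)

# Group similar (lag-embedded) frames; the cluster labels give a coarse segmentation.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embedded)
print(embedded.shape, labels[:20])
```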

    Affect-based indexing and retrieval of multimedia data

    Get PDF
    Digital multimedia systems are creating many new opportunities for rapid access to content archives. In order to explore these collections using search, the content must be annotated with significant features. An important and often overlooked aspect of human interpretation of multimedia data is the affective dimension. The hypothesis of this thesis is that affective labels of content can be extracted automatically from within multimedia data streams, and that these can then be used for content-based retrieval and browsing. A novel system is presented for extracting affective features from video content and mapping them onto a set of keywords with predetermined emotional interpretations. These labels are then used to demonstrate affect-based retrieval on a range of feature films. Because of the subjective nature of the words people use to describe emotions, an approach towards an open vocabulary query system utilizing the electronic lexical database WordNet is also presented. This gives flexibility for search queries to be extended to include keywords without predetermined emotional interpretations using a word-similarity measure. The thesis presents the framework and design for the affect-based indexing and retrieval system along with experiments, analysis, and conclusions.
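
    The open-vocabulary extension described above rests on a word-similarity measure over WordNet. As a minimal illustration, and not the measure used in the thesis, the NLTK interface to WordNet can score how close a free query term is to a predetermined emotion label; the label set below is a made-up example.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

EMOTION_LABELS = ["fear", "joy", "anger", "sadness"]  # illustrative label set

def closest_emotion(query_word):
    """Return the predetermined emotion label most similar to the query word,
    using Wu-Palmer similarity between noun synsets (illustrative only)."""
    query_synsets = wn.synsets(query_word, pos=wn.NOUN)
    best_label, best_score = None, 0.0
    for label in EMOTION_LABELS:
        for ls in wn.synsets(label, pos=wn.NOUN):
            for qs in query_synsets:
                score = qs.wup_similarity(ls) or 0.0
                if score > best_score:
                    best_label, best_score = label, score
    return best_label, best_score

print(closest_emotion("terror"))   # expected to land near "fear"
```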

    Multi-modal Video Content Understanding

    Get PDF
    Video is an important format of information. Humans use videos for a variety of purposes such as entertainment, education, communication, information sharing, and capturing memories. To date, humankind has accumulated a colossal amount of freely available video material online; manual processing at this scale is simply impossible. To this end, many research efforts have been dedicated to the automatic processing of video content. At the same time, human perception of the world is multi-modal: a human uses multiple senses to understand the environment, objects, and their interactions. When watching a video, we perceive the content via both the audio and visual modalities, and removing one of them results in a less immersive experience; similarly, if the information in the two modalities does not correspond, it may create a sense of dissonance. Therefore, joint modelling of multiple modalities (such as audio, visual, and text) within one model is an active research area. In the last decade, the fields of automatic video understanding and multi-modal modelling have seen exceptional progress due to the ubiquitous success of deep learning models and, more recently, transformer-based architectures in particular. Our work draws on these advances and pushes the state of the art of multi-modal video understanding forward.
    Applications of automatic multi-modal video processing are broad and exciting! For instance, a content-based textual description of a video (video captioning) may allow a visually- or auditory-impaired person to understand the content and thus engage in richer social interactions. However, prior work in video content description relies on the visual input alone, missing vital information that is only available in the audio stream. To this end, we proposed two novel multi-modal transformer models that encode audio and visual interactions simultaneously. First, we introduced a late-fusion multi-modal transformer that is highly modular and allows the processing of an arbitrary set of modalities. Second, an efficient bi-modal transformer was presented that encodes audio-visual cues starting from the lower network layers, allowing richer audio-visual features and, as a result, stronger performance.
    Another application is automatic visually guided sound generation, which could help professional sound (foley) designers who spend hours searching a database for audio relevant to a movie scene. Previous approaches to automatic conditional audio generation support only one class (e.g. “dog barking”), while real-life applications may require generation for hundreds of classes, and training one model per class can be infeasible. To bridge this gap, we introduced a novel two-stage model that first efficiently encodes audio as a set of codebook vectors (i.e. it learns a set of “building blocks”) and then learns to sample these audio vectors given visual inputs, producing an audio track relevant to that visual input. Moreover, we studied the automatic evaluation of conditional audio generation models and proposed metrics that measure both the quality and the relevance of the generated samples.
    Finally, as video editing is becoming more common among non-professionals due to the increased popularity of services such as YouTube, automatic assistance during video editing, e.g. detection of off-sync audio and visual tracks, grows in demand. Prior work in audio-visual synchronization was devoted to solving the task on lip-syncing datasets with “dense” signals, such as interviews and presentations. In such videos, synchronization cues occur “densely” across time, and processing just a few tenths of a second is enough to synchronize the tracks. In contrast, open-domain videos mostly have only “sparse” cues that occur just once in a seconds-long video clip (e.g. “chopping wood”). To address this, we a) proposed a novel dataset with “sparse” sounds, and b) designed a model that efficiently encodes seconds-long audio-visual tracks into a small set of “learnable selectors” that is then used for synchronization. In addition, we explored the temporal artefacts that common audio and video compression algorithms leave in data streams; to prevent a model from learning to rely on these artefacts, we introduced a list of recommendations on how to mitigate them.
    This thesis provides the details of the proposed methodologies as well as a comprehensive overview of advances in the relevant fields of multi-modal video understanding. In addition, we discuss potential research directions that can bring significant contributions to the field.
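
    The first stage of the two-stage sound generation model described above compresses audio into a sequence of codebook indices. A generic vector-quantisation lookup conveys the core idea; the NumPy sketch below, with a random codebook and made-up sizes, shows only that generic step and not the trained model from the thesis.

```python
import numpy as np

def quantize(features, codebook):
    """Replace each feature frame with its nearest codebook vector.
    features: (n_frames, d), codebook: (n_codes, d).
    Returns (indices, quantized_frames)."""
    # Squared Euclidean distance between every frame and every code vector.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)              # one codebook index per frame
    return indices, codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(128, 16))        # 128 "building blocks" (random here, learned in practice)
frames = rng.normal(size=(50, 16))           # placeholder audio feature frames
idx, approx = quantize(frames, codebook)
print(idx[:10], approx.shape)                # such a code sequence could then be
                                             # predicted from visual features
```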

    Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

    Get PDF

    Proceedings of the 7th Sound and Music Computing Conference

    Get PDF
    Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21st - July 24th 2010

    Contributions to multimedia adaptation within the MPEG-21 framework

    Full text link
    Unpublished doctoral thesis. Universidad Autónoma de Madrid, Escuela Politécnica Superior, October 201