Fusion of Multimodal Information in Music Content Analysis
Music is often processed through its acoustic realization alone. This is restrictive in the sense that music is clearly a highly multimodal concept, where various types of heterogeneous information can be associated with a given piece of music (a musical score, musicians' gestures, lyrics, user-generated metadata, etc.). This has recently led researchers to apprehend music through its various facets, giving rise to "multimodal music analysis" studies. This article gives a synthetic overview of methods that have been successfully employed in multimodal signal analysis. In particular, their use in music content processing is discussed in more detail through five case studies that highlight different multimodal integration techniques. The case studies include an example of cross-modal correlation for music video analysis, an audiovisual drum transcription system, a description of the concept of informed source separation, a discussion of multimodal dance-scene analysis, and an example of user-interactive music analysis. In the light of these case studies, some perspectives on multimodality in music processing are finally suggested.
A Panorama on Multiscale Geometric Representations, Intertwining Spatial, Directional and Frequency Selectivity
The richness of natural images makes the quest for optimal representations in
image processing and computer vision challenging. The latter observation has
not prevented the design of image representations, which trade off between
efficiency and complexity, while achieving accurate rendering of smooth regions
as well as reproducing faithful contours and textures. The most recent ones,
proposed in the past decade, share a hybrid heritage highlighting the
multiscale and oriented nature of edges and patterns in images. This paper
presents a panorama of the aforementioned literature on decompositions in
multiscale, multi-orientation bases or dictionaries. They typically exhibit
redundancy to improve sparsity in the transformed domain and sometimes its
invariance with respect to simple geometric deformations (translation,
rotation). Oriented multiscale dictionaries extend traditional wavelet
processing and may offer rotation invariance. Highly redundant dictionaries
require specific algorithms to simplify the search for an efficient (sparse)
representation. We also discuss the extension of multiscale geometric
decompositions to non-Euclidean domains such as the sphere or arbitrary meshed
surfaces. The etymology of panorama suggests an overview, based on a choice of
partially overlapping "pictures". We hope that this paper will contribute to
the appreciation and apprehension of a stream of current research directions in
image understanding.
Comment: 65 pages, 33 figures, 303 references
Large-Scale Video Event Detection
Because of the rapid growth of large-scale video recording and sharing, there is a growing need for robust and scalable solutions for analyzing video content. The ability to detect and recognize video events that capture real-world activities is a key and complex problem. This thesis aims at the development of robust and efficient solutions for large-scale video event detection systems. In particular, we investigate the problem in two areas: first, event detection with automatically discovered event-specific concepts and an organized ontology, and second, event detection with multi-modality representations and multi-source fusion.
Existing event detection works use various low-level features with statistical learning models and achieve promising performance. However, such approaches lack the capability of interpreting the abundant semantic content associated with complex video events. Therefore, mid-level semantic concept representation of complex events has emerged as a promising method for understanding video events. In this area, existing works can be categorized into two groups: those that manually define a specialized concept set for a specific event, and those that apply a general concept lexicon directly borrowed from existing object, scene, and action concept libraries. The first approach requires tremendous manual effort, whereas the second is often insufficient in capturing the rich semantics contained in video events. In this work, we propose an automatic event-driven concept discovery method and build a large-scale event and concept library with a well-organized ontology, called EventNet. Unlike past work, this method neither applies a generic concept library independent of the target events nor requires tedious manual annotations. Extensive experiments on the zero-shot event retrieval task, where no training samples are available, show that the proposed EventNet library consistently and significantly outperforms state-of-the-art methods.
Although concept-based event representation can interpret the semantic content of video events, achieving high accuracy in event detection also requires considering and combining various features of different modalities and/or across different levels. On the one hand, we observe that joint cross-modality patterns (e.g., audio-visual patterns) often exist in videos and provide strong multi-modal cues for detecting video events. We propose a joint audio-visual bi-modal codeword representation, called bi-modal words, to discover cross-modality correlations. On the other hand, combining features from multiple sources often produces performance gains, especially when the features complement each other. Existing multi-source late fusion methods usually combine confidence scores from different sources directly. This is limiting because heterogeneous results from various sources often produce incomparable confidence scores at different scales, making direct late fusion inappropriate. Based upon these considerations, we propose a robust late fusion method with rank minimization that not only achieves isotonicity among the scores from different sources, but also recovers a robust prediction score for each individual test sample. We experimentally show that the proposed multi-modality representation and multi-source fusion methods achieve promising results compared with benchmark baselines.
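The thesis does not detail the codeword construction here; one common way to realise joint audio-visual codewords is to quantize concatenated paired descriptors with k-means. The sketch below takes that approach (the feature dimensions, plain k-means, and bag-of-words pooling are illustrative assumptions, not the thesis's exact method):

```python
import numpy as np

def bimodal_codewords(audio_feats, visual_feats, k=8, iters=20, seed=0):
    """Jointly quantize paired audio/visual descriptors (one row per
    video segment) into k bi-modal codewords via plain k-means on the
    concatenated feature space."""
    rng = np.random.default_rng(seed)
    joint = np.hstack([audio_feats, visual_feats])
    centers = joint[rng.choice(len(joint), k, replace=False)]
    for _ in range(iters):
        # assign each joint descriptor to its nearest codeword
        d = np.linalg.norm(joint[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute codewords as cluster means (keep old center if empty)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = joint[labels == c].mean(axis=0)
    return centers, labels

def bow_histogram(labels, k=8):
    """Normalized bag-of-bimodal-words histogram for one video."""
    h = np.bincount(labels, minlength=k).astype(float)
    return h / h.sum()
```

A video is then represented by the histogram of its segments' codeword assignments, which a downstream event classifier can consume.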
The main contributions of the thesis include the following.
1. Large scale event and concept ontology: a) propose an automatic framework for discovering event-driven concepts; b) build the largest video event ontology, EventNet, which includes 500 complex events and 4,490 event-specific concepts; c) build the first interactive system that allows users to explore high-level events and associated concepts in videos with event browsing, search, and tagging functions.
2. Event detection with multi-modality representations and multi-source fusion: a) propose novel bi-modal codeword construction for discovering multi-modality correlations; b) propose novel robust late fusion with rank minimization method for combining information from multiple sources.
The two parts of the thesis are complementary. Concept-based event representation provides rich semantic information for video events, while cross-modality features provide complementary information from multiple sources. Combining the two parts in a unified framework offers great potential for advancing the state of the art in large-scale event detection.
Deliverable D1.2 Visual, text and audio information analysis for hypervideo, first release
Enriching videos by offering continuative and related information via, e.g., audio streams, web pages, and other videos is typically hampered by its demand for massive editorial work. While several automatic and semi-automatic methods exist for analyzing audio/video content, one needs to decide which method offers appropriate information for the intended use-case scenarios. We review the technology options for video analysis that we have access to and describe which training material we opted for to feed our algorithms. For all methods, we offer extensive qualitative and quantitative results, and give an outlook on the next steps within the project.
Directional edge and texture representations for image processing
An efficient representation for natural images is of fundamental importance in image processing and analysis. The commonly used separable transforms such as wavelets are not best suited for images due to their inability to exploit directional regularities such as edges and oriented textural patterns, while most of the recently proposed directional schemes cannot represent these two types of features in a unified transform. This thesis focuses on the development of directional representations for images which can capture both edges and textures in a multiresolution manner. The thesis first considers the problem of extracting linear features with the multiresolution Fourier transform (MFT). Building on a previous MFT-based linear feature model, the work extends the extraction method to the situation where the image is corrupted by noise. The problem is tackled by the combination of a "Signal+Noise" frequency model, a refinement stage and a robust classification scheme. As a result, the MFT is able to perform linear feature analysis on noisy images on which previous methods failed. A new set of transforms called the multiscale polar cosine transforms (MPCT) is also proposed in order to represent textures. The MPCT can be regarded as a real-valued MFT with similar basis functions of oriented sinusoids. It is shown that the transform can represent textural patches more efficiently than the conventional Fourier basis. With a directional best cosine basis, the MPCT packet (MPCPT) is shown to be an efficient representation for edges and textures, despite its high computational burden. The problem of representing edges and textures in a fixed transform with less complexity is then considered. This is achieved by applying a Gaussian frequency filter, which matches the dispersion of the magnitude spectrum, to the local MFT coefficients. This is particularly effective in denoising natural images, due to its ability to preserve both types of feature.
Further improvements can be made by employing the information given by the linear feature extraction process in the filter's configuration. The denoising results compare favourably against those of other state-of-the-art directional representations.
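The dispersion-matched Gaussian filter is specific to the thesis's MFT pipeline; as a loose stand-in, the general idea of denoising by weighting a block's spectrum with a Gaussian before inverting can be sketched as follows (the isotropic filter and the parameter values are illustrative assumptions, not the thesis's method):

```python
import numpy as np

def gaussian_freq_denoise(block, sigma=0.1):
    """Denoise a square image block by weighting its centred 2D FFT
    spectrum with an isotropic Gaussian, attenuating high frequencies
    where broadband noise dominates."""
    n = block.shape[0]
    F = np.fft.fftshift(np.fft.fft2(block))
    # normalised frequency grid in [-0.5, 0.5)
    f = np.fft.fftshift(np.fft.fftfreq(n))
    fx, fy = np.meshgrid(f, f)
    weight = np.exp(-(fx**2 + fy**2) / (2 * sigma**2))
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * weight)))
```

In the thesis the filter is instead anisotropic and matched to the local magnitude spectrum's dispersion, which is what lets it preserve oriented edges and textures rather than blurring them uniformly.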
Virtual Synaesthesia: Crossmodal Correspondences and Synesthetic Experiences
As technology develops to allow for the integration of additional senses into interactive experiences, there is a need to bridge the divide between the real and the virtual in a manner that stimulates the five senses consistently and in harmony with the sensory expectations of the user. Applying the philosophy of synaesthesia, a neurological condition, and of crossmodal correspondences, defined as the coupling of the senses, can provide numerous cognitive benefits and offers an insight into which senses are most likely to be ‘bound’ together.
This thesis aims to present a design paradigm called ‘virtual synaesthesia’. The goal of the paradigm is to make multisensory experiences more human-orientated by considering how the brain combines senses both in the general population (crossmodal correspondences) and within a select few individuals (natural synaesthesia). Towards this aim, a literature review is conducted covering the related areas of research umbrellaed by the concept of ‘virtual synaesthesia’: natural synaesthesia, crossmodal correspondences, multisensory experiences, and sensory substitution/augmentation. This thesis examines augmenting interactive and multisensory experiences with strong (natural synaesthesia) and weak (crossmodal correspondences) synaesthesia. It answers the following research questions: Is it possible to replicate the underlying cognitive benefits of odour-vision synaesthesia? Do people have consistent correspondences between olfaction and an aggregate of different sensory modalities? What is the nature and origin of these correspondences? And is it possible to predict the crossmodal correspondences attributed to odours? The benefits of augmenting a human-machine interface using an artificial form of odour-vision synaesthesia are explored to answer these questions. This concept is exemplified by transducing odours with a custom-made electronic nose and transforming each odour's ‘chemical footprint’ into a 2D abstract shape representing the current odour. Electronic noses sense odours in the vapour phase, generating a series of electrical signals that represent the current odour source. Weak synaesthesia (crossmodal correspondences) is then investigated to determine whether people have consistent correspondences between odours and the angularity of shapes, the smoothness of texture, perceived pleasantness, pitch, and musical and emotional dimensions.
Following on from this research, the nature and origin of these correspondences were explored using the underlying hedonic (values relating to pleasantness), semantic (knowledge of the identity of the odour) and physicochemical (the physical and chemical characteristics of the odour) dependencies. The final research chapter investigates the possibility of removing the bottleneck of extensive human trials by developing machine learning models that predict the crossmodal perception of odours from their underlying physicochemical features.
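The abstract does not specify the model family used for this prediction task; as a minimal illustration, a linear least-squares model mapping physicochemical descriptors to a crossmodal rating might look like the sketch below (the descriptor columns, the target rating, and the data are all hypothetical):

```python
import numpy as np

def fit_correspondence_model(X, y):
    """Fit a least-squares linear model mapping physicochemical
    descriptors (rows of X, e.g. hypothetical molecular-weight or
    hydrophobicity columns) to a crossmodal rating y, such as the
    perceived angularity of the shape matched to an odour."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append intercept column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_correspondence(w, X):
    """Predict ratings for new odours from their descriptors."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return Xb @ w
```

In practice the thesis's models would be trained against ratings collected in the human trials, which is precisely the data-collection bottleneck such models aim to shortcut.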
The work presented in this thesis provides insight into, and evidence of, the benefits of incorporating the concept of ‘virtual synaesthesia’ into human-machine interfaces, and advances research into the methodology it embodies, namely crossmodal correspondences. Overall, this work shows potential for augmenting multisensory experiences with more refined capabilities, leading to more enriched experiences, better designs, and a more intuitive way to convey information crossmodally.
Supervised and unsupervised segmentation of textured images by efficient multi-level pattern classification
This thesis proposes new, efficient methodologies for supervised and unsupervised image segmentation based on texture information. For the supervised case, a technique for pixel classification based on a multi-level strategy that iteratively refines the resulting segmentation is proposed. This strategy utilizes pattern recognition methods based on prototypes (determined by clustering algorithms) and support vector machines. In order to obtain the best performance, an algorithm for automatic parameter selection and methods to reduce the computational cost associated with the segmentation process are also included. For the unsupervised case, the previous methodology is adapted by means of an initial pattern discovery stage, which allows transforming the original unsupervised problem into a supervised one. Several sets of experiments considering a wide variety of images are carried out in order to validate the developed techniques.
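As an illustration of the prototype-based first stage, the sketch below clusters each class's training texture features into a few prototypes and classifies pixels by nearest prototype; the SVM stage, the multi-level iterative refinement, and the automatic parameter selection described in the abstract are omitted, and all data are synthetic:

```python
import numpy as np

def learn_prototypes(features, labels, per_class=2, iters=15, seed=0):
    """For each texture class, cluster its training feature vectors into
    a few prototypes with plain k-means, as a stand-in for the
    prototype-determination step of the supervised pipeline."""
    rng = np.random.default_rng(seed)
    protos, proto_labels = [], []
    for c in np.unique(labels):
        pts = features[labels == c]
        centers = pts[rng.choice(len(pts), per_class, replace=False)]
        for _ in range(iters):
            d = np.linalg.norm(pts[:, None] - centers[None], axis=2)
            assign = d.argmin(axis=1)
            for j in range(per_class):
                if np.any(assign == j):
                    centers[j] = pts[assign == j].mean(axis=0)
        protos.append(centers)
        proto_labels += [c] * per_class
    return np.vstack(protos), np.array(proto_labels)

def classify(features, protos, proto_labels):
    """Assign each pixel's feature vector the label of its nearest prototype."""
    d = np.linalg.norm(features[:, None] - protos[None], axis=2)
    return proto_labels[d.argmin(axis=1)]
```

In the full method this coarse labelling would be refined iteratively, with SVMs handling the ambiguous pixels that nearest-prototype assignment gets wrong near texture boundaries.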
Violent urban disturbance in England 1980-81
This study addresses violent urban disturbances which occurred in England in the early 1980s with particular reference to the Bristol ‘riots’ of April 1980 and the numerous disorders which followed in July 1981. Revisiting two concepts traditionally utilised to explain the spread of collective violence, namely ‘diffusion’ and ‘contagion’, it argues that the latter offers a more useful model for understanding the above-mentioned events. Diffusion used in this context implies that such disturbances are independent of each other and occur randomly. It is associated with the concept of ‘copycat riots’, which were commonly invoked by the national media as a way of explaining the spread of urban disturbances in July 1981. Contagion by contrast holds that urban disturbances are related to one another and involve a variety of communication processes and rational collective decision-making. This implies that such events can only be fully understood if they are studied in terms of their local dynamics.
Providing the first comprehensive macro-historical analysis of the disturbances of July 1981, this thesis utilises a range of quantitative techniques to argue that the temporal and spatial spread of the unrest exhibited patterns of contagion. These mini-waves of disorder located in several conurbations were precipitated by major disturbances in inner-city multi-ethnic areas. This contradicts more conventional explanations which credit the national media as the sole driver of riotous behaviour.
The thesis then proceeds to offer a micro analysis of disturbances in Bristol in April 1980, incorporating both qualitative and quantitative techniques. Exploiting previously unexplored primary sources and recently collected oral histories from participants, it establishes detailed narratives of three related disturbances in the city. The anatomy of the individual incidents and local contagious effects are examined using spatial mapping, social network and ethnographic analyses.
The results suggest that previously ignored educational, sub-cultural and ethnographic intra- and inter-community linkages were important factors in the spread of the disorders in Bristol.
The case studies of the Bristol disorders are then used to illuminate our understanding of the processes at work during the July 1981 disturbances. It is argued that the latter events were essentially characterised by anti-police and anti-racist collective violence, which marked a momentary recomposition of working-class youth across ethnic divides.