86 research outputs found

    Music classification by low-rank semantic mappings

    A challenging open question in music classification is which music representation (i.e., audio features) and which machine learning algorithm are appropriate for a specific music classification task. To address this challenge, given a number of audio feature vectors for each training music recording that capture different aspects of music (e.g., timbre and harmony), the goal is to find a set of linear mappings from several feature spaces to the semantic space spanned by the class indicator vectors. These mappings should reveal the common latent variables that characterize a given set of classes and simultaneously define a multi-class linear classifier that classifies the extracted latent common features. Such a set of mappings is obtained, building on the notion of maximum margin matrix factorization, by minimizing a weighted sum of nuclear norms. Since the nuclear norm imposes rank constraints on the learnt mappings, the proposed method is referred to as low-rank semantic mappings (LRSMs). The performance of the LRSMs in music genre, mood, and multi-label classification is assessed by conducting extensive experiments on seven manually annotated benchmark datasets. The reported experimental results demonstrate the superiority of the LRSMs over the classifiers they are compared against. Furthermore, the best reported classification results are comparable with or slightly superior to those obtained by the state-of-the-art task-specific music classification methods.
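As a rough illustration of why a nuclear-norm penalty favours low-rank mappings, the sketch below applies singular value soft-thresholding, the standard proximal operator of the nuclear norm, to a noisy matrix. It is not the LRSM algorithm from the abstract; the matrix sizes, the threshold `tau`, and the function name `singular_value_threshold` are assumptions chosen for the example.

```python
import numpy as np

def singular_value_threshold(W, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # small singular values become exactly zero
    return (U * s_shrunk) @ Vt

rng = np.random.default_rng(0)
# Hypothetical mapping: a rank-2 matrix plus small dense noise (illustrative only).
W_true = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
W_noisy = W_true + 0.01 * rng.standard_normal((8, 6))

W_low = singular_value_threshold(W_noisy, tau=0.5)
rank_before = np.linalg.matrix_rank(W_noisy)
rank_after = np.linalg.matrix_rank(W_low)
print(rank_before, rank_after)
```

The noise raises the numerical rank of `W_noisy`, while thresholding removes the small singular values the noise introduced, so the result has rank at most two.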

    Elastic net subspace clustering applied to pop/rock music structure analysis

    A novel homogeneity-based method for music structure analysis is proposed. The heart of the method is a similarity measure, derived from first principles, that is based on the matrix Elastic Net (EN) regularization and deals efficiently with highly correlated audio feature vectors. In particular, beat-synchronous mel-frequency cepstral coefficients, chroma features, and auditory temporal modulations model the audio signal. The EN induced similarity measure is employed to construct an affinity matrix, yielding a novel subspace clustering method referred to as Elastic Net subspace clustering (ENSC). The performance of the ENSC in structure analysis is assessed by conducting extensive experiments on the Beatles dataset. The experimental findings demonstrate the descriptive power of the EN-based affinity matrix over the affinity matrices employed in subspace clustering methods, attaining the state-of-the-art performance reported for the Beatles dataset

    Robust correlated and individual component analysis

    Recovering correlated and individual components of two, possibly temporally misaligned, sets of data is a fundamental task in disciplines such as image, vision, and behavior computing, with application to problems such as multi-modal fusion (via correlated components), predictive analysis, and clustering (via the individual ones). Here, we study the extraction of correlated and individual components under real-world conditions, namely i) the presence of gross non-Gaussian noise and ii) temporally misaligned data. In this light, we propose a method for the Robust Correlated and Individual Component Analysis (RCICA) of two sets of data in the presence of gross, sparse errors. We furthermore extend RCICA in order to handle temporal incongruities arising in the data. To this end, two suitable optimization problems are solved. The generality of the proposed methods is demonstrated by applying them to 4 applications, namely i) heterogeneous face recognition, ii) multi-modal feature fusion for human behavior analysis (i.e., audio-visual prediction of interest and conflict), iii) face clustering, and iv) the temporal alignment of facial expressions. Experimental results on 2 synthetic and 7 real-world datasets indicate the robustness and effectiveness of the proposed methods in these application domains, outperforming other state-of-the-art methods in the field.

    Automatic construction of robust spherical harmonic subspaces

    In this paper we propose a method to automatically recover a class specific low dimensional spherical harmonic basis from a set of in-the-wild facial images. We combine existing techniques for uncalibrated photometric stereo and low rank matrix decompositions in order to robustly recover a combined model of shape and identity. We build this basis without aid from a 3D model and show how it can be combined with recent efficient sparse facial feature localisation techniques to recover dense 3D facial shape. Unlike previous works in the area, our method is very efficient and is an order of magnitude faster to train, taking only a few minutes to build a model with over 2000 images. Furthermore, it can be used for real-time recovery of facial shape

    Behavior prediction in-the-wild

    In this paper, the problem of audio-visual behavior prediction in-the-wild is addressed. In this context, both audio-visual descriptors of behavioral cues (features) and continuous-time real-valued characterizations of behavior (annotations) are (possibly) corrupted by non-Gaussian noise of large magnitude. The modeling assumption behind the proposed framework is that naturalistic affect and behavior captured in audio-visual episodes are smoothly varying dynamic phenomena and thus the hidden temporal dynamics can be modeled as a generative auto-regressive process. Consequently, continuous-time real-valued characterizations of behavior (annotations) are postulated to be outputs of a low-complexity (i.e., low-order) time-invariant Linear Dynamical System (LDS) when descriptors of behavioral cues (features) act as inputs. To learn the parameters of the LDS, a recently proposed spectral method that relies on Hankel-rank minimization is adopted. Experimental evaluation on a challenging database recorded in the wild demonstrates the effectiveness of the proposed approach in behavior prediction.
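The link between a low-order LDS and Hankel rank can be checked numerically: the Hankel matrix built from the impulse response of a minimal order-n system has rank n. The sketch below is only an illustration of that fact, not the spectral learning method the abstract refers to; the system matrices `A`, `B`, `C` and the helper `hankel_matrix` are hypothetical.

```python
import numpy as np

def hankel_matrix(h, rows):
    """Build the Hankel matrix H[i, j] = h[i + j] from a 1-D sequence h."""
    cols = len(h) - rows + 1
    return np.array([h[i:i + cols] for i in range(rows)])

# Hypothetical stable second-order system: x_{t+1} = A x_t + B u_t, y_t = C x_t.
A = np.array([[0.9, 0.2], [-0.2, 0.7]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, -1.0]])

# Impulse response samples h_k = C A^k B for k = 0..19.
h = []
x = B.copy()
for _ in range(20):
    h.append((C @ x).item())
    x = A @ x
h = np.array(h)

H = hankel_matrix(h, rows=8)
hankel_rank = np.linalg.matrix_rank(H, tol=1e-8)
print(H.shape, hankel_rank)
```

Because this example system is both controllable and observable, the numerical rank of `H` matches the state dimension, which is why penalising Hankel rank acts as a proxy for model order.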

    Fusion and community detection in multi-layer graphs

    Relational data arising in many domains can be represented by networks (or graphs) with nodes capturing entities and edges representing relationships between these entities. Community detection in networks has become one of the most important problems having a broad range of applications. Until recently, the vast majority of papers have focused on discovering community structures in a single network. However, with the emergence of multi-view network data in many real-world applications and consequently with the advent of multilayer graph representation, community detection in multi-layer graphs has become a new challenge. Multi-layer graphs provide complementary views of connectivity patterns of the same set of vertices. Fusion of the network layers is expected to achieve better clustering performance. In this paper, we propose two novel methods, coined as WSSNMTF (Weighted Simultaneous Symmetric Non-Negative Matrix Tri-Factorization) and NG-WSSNMTF (Natural Gradient WSSNMTF), for fusion and clustering of multi-layer graphs. Both methods are robust with respect to missing edges and noise. We compare the performance of the proposed methods with two baseline methods, as well as with three state-of-the-art methods on synthetic and three real-world datasets. The experimental results indicate superior performance of the proposed methods

    Music genre classification via joint sparse low-rank representation of audio features

    A novel framework for music genre classification, namely the joint sparse low-rank representation (JSLRR), is proposed in order to: 1) smooth the noise in the test samples, and 2) identify the subspaces in which the test samples lie. An efficient algorithm is proposed for obtaining the JSLRR and a novel classifier is developed, which is referred to as the JSLRR-based classifier. Special cases of the JSLRR-based classifier are the joint sparse representation-based classifier and the low-rank representation-based one. The performance of the three aforementioned classifiers is compared against that of the sparse representation-based classifier, the nearest subspace classifier, the support vector machines, and the nearest neighbor classifier for music genre classification on six manually annotated benchmark datasets. The best classification results reported here are comparable with or slightly superior to those obtained by the state-of-the-art music genre classification methods.

    Robust Kronecker-decomposable component analysis for low-rank modeling

    Dictionary learning and component analysis are part of one of the most well-studied and active research fields, at the intersection of signal and image processing, computer vision, and statistical machine learning. In dictionary learning, the current methods of choice are arguably K-SVD and its variants, which learn a dictionary (i.e., a decomposition) for sparse coding via Singular Value Decomposition. In robust component analysis, leading methods derive from Principal Component Pursuit (PCP), which recovers a low-rank matrix from sparse corruptions of unknown magnitude and support. However, K-SVD is sensitive to the presence of noise and outliers in the training set. Additionally, PCP does not provide a dictionary that respects the structure of the data (e.g., images), and requires expensive SVD computations when solved by convex relaxation. In this paper, we introduce a new robust decomposition of images by combining ideas from sparse dictionary learning and PCP. We propose a novel Kronecker-decomposable component analysis which is robust to gross corruption, can be used for low-rank modeling, and leverages separability to solve significantly smaller problems. We design an efficient learning algorithm by drawing links with a restricted form of tensor factorization. The effectiveness of the proposed approach is demonstrated on real-world applications, namely background subtraction and image denoising, by performing a thorough comparison with the current state of the art
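The remark about leveraging separability to solve smaller problems can be illustrated with the standard identity vec(B X Aᵀ) = (A ⊗ B) vec(X), where vec stacks columns. The sketch below only demonstrates that identity; it is not the decomposition proposed in the abstract, and the factor names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # hypothetical factor acting on the columns
B = rng.standard_normal((5, 2))  # hypothetical factor acting on the rows
X = rng.standard_normal((2, 3))  # small coefficient matrix

def vec(M):
    """Column-major vectorisation, matching the usual vec(.) convention."""
    return M.flatten(order="F")

# Explicit Kronecker dictionary: a 20 x 6 matrix applied to a length-6 vector.
full = np.kron(A, B) @ vec(X)

# Separable form: two small matrix products, the large dictionary is never built.
separable = vec(B @ X @ A.T)

print(np.allclose(full, separable))
```

The separable form stores only the two small factors, so the Kronecker product never has to be materialised.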

    Dynamic behavior analysis via structured rank minimization

    Human behavior and affect are inherently dynamic phenomena involving the temporal evolution of patterns manifested through a multiplicity of non-verbal behavioral cues including facial expressions, body postures and gestures, and vocal outbursts. A natural assumption for human behavior modeling is that a continuous-time characterization of behavior is the output of a linear time-invariant system when behavioral cues act as the input (e.g., continuous rather than discrete annotations of dimensional affect). Here we study the learning of such a dynamical system under real-world conditions, namely in the presence of noisy behavioral cue descriptors and possibly unreliable annotations, by employing structured rank minimization. To this end, a novel structured rank minimization method and its scalable variant are proposed. The generalizability of the proposed framework is demonstrated by conducting experiments on 3 distinct dynamic behavior analysis tasks, namely (i) conflict intensity prediction, (ii) prediction of valence and arousal, and (iii) tracklet matching. The attained results outperform those achieved by other state-of-the-art methods for these tasks and, hence, evidence the robustness and effectiveness of the proposed approach.