Latent Semantic Learning with Structured Sparse Representation for Human Action Recognition
This paper proposes a novel latent semantic learning method for extracting
high-level features (i.e. latent semantics) from a large vocabulary of abundant
mid-level features (i.e. visual keywords) with structured sparse
representation, which can help to bridge the semantic gap in the challenging
task of human action recognition. To discover the manifold structure of
mid-level features, we develop a spectral embedding approach to latent semantic
learning based on L1-graph, without the need to tune any parameter for graph
construction as a key step of manifold learning. More importantly, we construct
the L1-graph with structured sparse representation, which can be obtained by
structured sparse coding with its structured sparsity ensured by novel L1-norm
hypergraph regularization over mid-level features. In the new embedding space,
we learn latent semantics automatically from abundant mid-level features
through spectral clustering. The learnt latent semantics can be readily used
for human action recognition with SVM by defining a histogram intersection
kernel. Different from the traditional latent semantic analysis based on topic
models, our latent semantic learning method can explore the manifold structure
of mid-level features in both L1-graph construction and spectral embedding,
which results in compact but discriminative high-level features. The
experimental results on the commonly used KTH action dataset and unconstrained
YouTube action dataset show the superior performance of our method.
Comment: The short version of this paper appears in ICCV 201
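The histogram intersection kernel used for the SVM above has a simple closed form: K(x, y) = Σ_k min(x_k, y_k). A minimal NumPy sketch (the function name and the callable-kernel usage with scikit-learn are illustrative assumptions, not the authors' code):

```python
import numpy as np

def histogram_intersection_kernel(X, Y):
    """Gram matrix K[i, j] = sum_k min(X[i, k], Y[j, k]) between two
    sets of non-negative histogram feature vectors."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

# Such a callable can be passed directly to a kernel SVM, e.g.
# sklearn.svm.SVC(kernel=histogram_intersection_kernel).
```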
Deep Shape Representations for 3D Object Recognition
Deep learning is a rapidly growing discipline that models high-level features in data as multilayered
neural networks. The recent trend toward deep neural networks has been driven, in large part, by
a combination of affordable computing hardware, open source software, and the availability of
pre-trained networks on large-scale datasets.
In this thesis, we propose deep learning approaches to 3D shape recognition using a multilevel
feature learning paradigm. We start by comprehensively reviewing recent shape descriptors,
including hand-crafted descriptors that are mostly developed in the spectral geometry setting and
also the ones obtained via learning-based methods. Then, we introduce novel multi-level feature
learning approaches using spectral graph wavelets, bag-of-features and deep learning. Low-level
features are first extracted from a 3D shape using spectral graph wavelets. Mid-level features are
then generated via the bag-of-features model by employing locality-constrained linear coding as a
feature coding method, in conjunction with the biharmonic distance and intrinsic spatial pyramid
matching in a bid to effectively measure the spatial relationship between each pair of the bag-of-feature descriptors.
For the task of 3D shape retrieval, high-level shape features are learned via a deep auto-encoder
on mid-level features. Then, we compare the deep learned descriptor of a query shape to the
descriptors of all shapes in the dataset using a dissimilarity measure for 3D shape retrieval. For the
task of 3D shape classification, mid-level features are represented as 2D images in order to be fed
into a pre-trained convolutional neural network to learn high-level features from the penultimate
fully-connected layer of the network. Finally, a multiclass support vector machine classifier is
trained on these deep learned descriptors, and the classification accuracy is subsequently computed.
The proposed 3D shape retrieval and classification approaches are evaluated on three standard 3D
shape benchmarks through extensive experiments, and the results show compelling superiority of
our approaches over state-of-the-art methods.
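The retrieval step described above reduces to ranking all dataset shapes by a dissimilarity between their deep-learned descriptors and the query's. A minimal sketch, assuming Euclidean distance as the dissimilarity measure (the measure is not fixed in this abstract, so this choice is illustrative):

```python
import numpy as np

def retrieve(query_desc, dataset_descs, k=3):
    """Return the indices of the k dataset shapes whose descriptors are
    closest (Euclidean dissimilarity) to the query descriptor."""
    d = np.linalg.norm(dataset_descs - query_desc, axis=1)
    return np.argsort(d)[:k]
```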
Feature fusion, feature selection and local n-ary patterns for object recognition and image classification
University of Technology Sydney. Faculty of Engineering and Information Technology.
Object recognition is one of the most fundamental topics in computer vision. Over the past years it has attracted interest from both academics working in computer science and professionals in the information technology (IT) industry. The popularity of object recognition is demonstrated by the sophisticated theories it has motivated in science and its widespread applications in industry. Nowadays, with more powerful machine learning tools (both hardware and software) and a huge amount of information (data) readily available, higher expectations are imposed on object recognition. At its early stage in the 1990s, the task of object recognition could be as simple as differentiating an object of interest from non-objects in a single still image. Currently, the task may also include the segmentation and labeling of different image regions (i.e., assigning each segmented region a meaningful label based on the objects that appear in it), followed by using computer programs to infer the scene of the overall image from those segmented regions. The original two-class classification problem has thus grown more complex, evolving toward a multi-class classification problem. In this thesis, contributions to object recognition are made in two aspects: improvements using feature fusion and improvements using feature selection. Three examples are given to illustrate three different feature fusion methods: descriptor concatenation (low-level fusion), confidence value escalation (mid-level fusion) and the coarse-to-fine framework (high-level fusion). Two examples are provided to demonstrate feature selection: optimal descriptor selection and improved classifier selection.
Feature extraction plays a key role in object recognition because it is the first and also the most important step. Considering the overall object recognition process, machine learning tools serve the purpose of finding distinctive features in the visual data. Given distinctive features, object recognition becomes straightforward (e.g., a simple threshold function can classify feature descriptors). The proposed Local N-ary Pattern (LNP) texture feature contributes to both feature extraction and texture classification: it generalizes the texture feature extraction process and improves texture classification. Concretely, the local binary pattern (LBP) is the special case of LNP with n = 2, and the texture spectrum is the special case with n = 3. The proposed LNP representation has been shown to outperform the popular LBP and one of the LBP's most successful extensions, the local ternary pattern (LTP), for texture classification.
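To make the relationship between LNP, LBP (n = 2) and the texture spectrum (n = 3) concrete, here is a hedged sketch of an n-ary code for a single 3x3 patch. The quantization thresholds and neighbour ordering below are illustrative assumptions; the thesis defines the exact scheme:

```python
import numpy as np

def local_nary_pattern(patch, n=2, t=0.0):
    """Encode a 3x3 patch as a base-n integer: each of the 8 neighbours'
    differences from the centre pixel is quantized into n levels.  With
    n = 2 (threshold at 0) this reduces to the classic LBP code; n = 3
    with a tolerance t gives a texture-spectrum-style ternary code."""
    center = patch[1, 1]
    # neighbours in clockwise order, starting at the top-left corner
    idx = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    diffs = np.array([patch[i] - center for i in idx], dtype=float)
    if n == 2:
        levels = (diffs >= 0).astype(int)            # LBP special case
    elif n == 3:
        levels = np.digitize(diffs, [-t, t + 1e-9])  # levels 0 / 1 / 2
    else:
        raise ValueError("sketch only covers n = 2 and n = 3")
    code = 0
    for level in levels:
        code = code * n + int(level)
    return code
```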
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modality of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
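The fusion mechanism mentioned in this abstract can be illustrated as a simple score-level (late) fusion of per-modality SVM decision scores. The equal weighting and the argmax over the four VA quadrants below are illustrative assumptions, not the authors' exact scheme:

```python
import numpy as np

def late_fusion(audio_scores, visual_scores, w=0.5):
    """Weighted average of per-class decision scores from two modality
    classifiers; returns the predicted affective class (VA quadrant)
    for each clip."""
    fused = w * audio_scores + (1.0 - w) * visual_scores
    return np.argmax(fused, axis=1)
```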