Search CORE

4,455 research outputs found

Log-Euclidean Bag of Words for Human Action Recognition

Author: Bhatia R.
Conrad Sanderson
Lazebnik S.
Masoud Faraki
Maziar Palhang
Wong Y.
Publication venue: 'Institution of Engineering and Technology (IET)'
Publication date: 01/01/2015
Field of study

Representing videos by densely extracted local space-time features has recently become a popular approach for analysing actions. In this paper, we tackle the problem of categorising human actions by devising Bag of Words (BoW) models based on covariance matrices of spatio-temporal features, with the features formed from histograms of optical flow. Since covariance matrices form a special type of Riemannian manifold, the space of Symmetric Positive Definite (SPD) matrices, non-Euclidean geometry should be taken into account while discriminating between covariance matrices. To this end, we propose to embed SPD manifolds to Euclidean spaces via a diffeomorphism and extend the BoW approach to its Riemannian version. The proposed BoW approach takes into account the manifold geometry of SPD matrices during the generation of the codebook and histograms. Experiments on challenging human action datasets show that the proposed method obtains notable improvements in discrimination accuracy, in comparison to several state-of-the-art methods

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

Queensland University of Technology ePrints Archive

University of Queensland eSpace

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

Author: Bisk Yonatan
Cheng Yong
Essa Irfan
Hauptmann Alexander G.
Huang Yanping
Jiang Lu
Kumar Vivek
Macherey Wolfgang
Murphy Kevin
Ross David A.
Wang Zhiruo
Yang Ming-Hsuan
Yu Lijun
Publication venue
Publication date: 28/10/2023
Field of study

In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.Comment: NeurIPS 2023 spotligh

arXiv.org e-Print Archive

e-Counterfeit: a mobile-server platform for document counterfeit detection

Author: Berenguel Albert
Cañero Cristina
Lladós Josep
Terrades Oriol Ramos
Publication venue
Publication date: 21/08/2017
Field of study

This paper presents a novel application to detect counterfeit identity documents forged by a scan-printing operation. Texture analysis approaches are proposed to extract validation features from security background that is usually printed in documents as IDs or banknotes. The main contribution of this work is the end-to-end mobile-server architecture, which provides a service for non-expert users and therefore can be used in several scenarios. The system also provides a crowdsourcing mode so labeled images can be gathered, generating databases for incremental training of the algorithms.Comment: 6 pages, 5 figure

arXiv.org e-Print Archive

Crossref

A Method of Protein Model Classification and Retrieval Using Bag-of-Visual-Features

Author: Baosheng Kang
Jinlin Ma
Ke Lu
Ziping Ma
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2014
Field of study

In this paper we propose a novel visual method for protein model classification and retrieval. Different from the conventional methods, the key idea of the proposed method is to extract image features of proteins and measure the visual similarity between proteins. Firstly, the multiview images are captured by vertices and planes of a given octahedron surrounding the protein. Secondly, the local features are extracted from each image of the different views by the SURF algorithm and are vector quantized into visual words using a visual codebook. Finally, KLD is employed to calculate the similarity distance between two feature vectors. Experimental results show that the proposed method has encouraging performances for protein retrieval and categorization as shown in the comparison with other methods

Crossref

Directory of Open Access Journals

PubMed Central