Designing Motion Representation in Videos
Motion representation plays a vital role in vision-based human action recognition in videos. Generally, the information in a video can be divided into spatial information and temporal information. While the spatial information is easily described by RGB images, designing the motion representation remains a challenging problem. To obtain a motion representation that is both efficient and effective, we design the feature according to two principles. First, to guarantee robustness, the temporal information should be closely related to informative modalities such as the optical flow. Second, only basic operations should be involved, so that the computational cost of extracting the temporal information stays affordable. Based on these principles, we introduce a novel compact motion representation for video action recognition, named Optical Flow guided Feature (OFF), which enables the network to distill temporal information through a fast and robust approach. The OFF is derived from the definition of optical flow and is orthogonal to the optical flow. The derivation also provides theoretical support for using the difference between two frames. By directly calculating pixel-wise spatiotemporal gradients of the deep feature maps, the OFF can be embedded in any existing CNN-based video action recognition framework with only a slight additional cost, enabling the CNN to extract spatiotemporal information. This simple but powerful idea is validated by experimental results. The network with OFF fed only by RGB inputs achieves a competitive accuracy of 93.3% on UCF-101, which is comparable with the result obtained by two streams (RGB and optical flow), but is 15 times faster. Experimental results also show that OFF is complementary to other motion modalities such as optical flow. When the proposed method is plugged into the state-of-the-art video action recognition framework, it achieves 96.0% and 74.2% accuracy on UCF-101 and HMDB-51, respectively.
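A brief sketch of the orthogonality argument referenced above (a paraphrase of the standard derivation, not quoted from the paper): the brightness-constancy assumption of optical flow,
\[ I(x + \Delta x,\ y + \Delta y,\ t + \Delta t) = I(x, y, t), \]
expanded to first order, yields the classical constraint
\[ \frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y + \frac{\partial I}{\partial t} = 0, \]
i.e. the spatiotemporal gradient \( \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}, \frac{\partial I}{\partial t}\right) \) is orthogonal to \( (v_x, v_y, 1) \), where \( (v_x, v_y) \) is the optical flow. OFF applies the same relation with the image \( I \) replaced by a deep feature map, which is why it is guided by, and orthogonal to, the optical flow, and why the frame difference appears as the temporal gradient term.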
Towards unified visual perception
This thesis explores the frontier of visual perception in computer vision by leveraging the capabilities of Vision Transformers (ViTs) to create a unified framework that addresses cross-task and cross-granularity challenges. Drawing inspiration from the human visual system's ability to process visual information at varying levels of detail and the success of Transformers in Natural Language Processing (NLP), we aim to bridge the gap between broad visual concepts and their fine-grained counterparts. Our investigation is structured into three parts.
First, we delve into a range of training methods and architectures for ViTs, with the goal of gathering valuable insights. These insights are intended to guide the optimization of ViTs in the subsequent phase of our research, ensuring we build a strong foundation for enhancing their performance in complex visual tasks.
Second, our focus shifts towards the recognition of fine-grained visual concepts, employing precise annotations to delve deeper into the intricate details of visual scenes. Here, we tackle the challenge of discerning and classifying objects and pixels with remarkable accuracy, leveraging the foundational insights gained from our initial explorations of ViTs.
In the final part of our thesis, we demonstrate how language can serve as a bridge, enabling vision-language models, which are trained only to recognize images, to handle countless visual concepts at the level of fine-grained entities such as objects and pixels without the need for fine-tuning.
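A minimal numpy sketch of the kind of language-bridged, fine-tuning-free recognition described above: region (or pixel) features from a vision-language model are matched against text embeddings of arbitrary class names by cosine similarity. The function and argument names are hypothetical, and the image and text encoders are assumed to be given.

import numpy as np

def zero_shot_labels(region_feats, text_embeds, class_names, temperature=0.01):
    """Label region (or pixel) features with arbitrary class names by cosine
    similarity to their text embeddings -- no task-specific fine-tuning."""
    # L2-normalise so that dot products become cosine similarities.
    r = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = r @ t.T / temperature                  # (num_regions, num_classes)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # softmax over the class names
    best = probs.argmax(axis=-1)
    return [class_names[i] for i in best], probs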
Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition
Motion representation plays a vital role in human action recognition in
videos. In this study, we introduce a novel compact motion representation for
video action recognition, named Optical Flow guided Feature (OFF), which
enables the network to distill temporal information through a fast and robust
approach. The OFF is derived from the definition of optical flow and is
orthogonal to the optical flow. The derivation also provides theoretical
support for using the difference between two frames. By directly calculating
pixel-wise spatiotemporal gradients of the deep feature maps, the OFF could be
embedded in any existing CNN based video action recognition framework with only
a slight additional cost. It enables the CNN to extract spatiotemporal
information, especially the temporal information between frames simultaneously.
This simple but powerful idea is validated by experimental results. The network
with OFF fed only by RGB inputs achieves a competitive accuracy of 93.3% on
UCF-101, which is comparable with the result obtained by two streams (RGB and
optical flow), but is 15 times faster in speed. Experimental results also show
that OFF is complementary to other motion modalities such as optical flow. When
the proposed method is plugged into the state-of-the-art video action
recognition framework, it has 96.0% and 74.2% accuracy on UCF-101 and HMDB-51
respectively. The code for this project is available at
https://github.com/kevin-ssy/Optical-Flow-Guided-Feature.
Comment: CVPR 2018.
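A minimal PyTorch-style sketch of the idea as described above: per-channel spatial gradients of a feature map plus a temporal difference between consecutive frames. The Sobel kernels and the channel-wise concatenation are illustrative assumptions about the layer, not the authors' exact implementation (see the repository above for that).

import torch
import torch.nn.functional as F

def optical_flow_guided_feature(feat_t, feat_t1):
    """Sketch of OFF: pixel-wise spatiotemporal gradients of deep feature maps.
    feat_t, feat_t1: (N, C, H, W) feature maps from two consecutive frames."""
    c = feat_t.shape[1]
    # Depth-wise Sobel filters approximate the spatial gradients dF/dx and dF/dy.
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    kx = sobel_x.reshape(1, 1, 3, 3).repeat(c, 1, 1, 1).to(feat_t)
    ky = sobel_x.t().reshape(1, 1, 3, 3).repeat(c, 1, 1, 1).to(feat_t)
    grad_x = F.conv2d(feat_t, kx, padding=1, groups=c)
    grad_y = F.conv2d(feat_t, ky, padding=1, groups=c)
    # Temporal gradient: element-wise difference between the two feature maps.
    grad_t = feat_t1 - feat_t
    # Concatenate along channels; subsequent layers consume this as the OFF.
    return torch.cat([grad_x, grad_y, grad_t], dim=1)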
Treatment Allocation under Uncertain Costs
We consider the problem of learning how to optimally allocate treatments
whose cost is uncertain and can vary with pre-treatment covariates. This
setting may arise in medicine if we need to prioritize access to a scarce
resource that different patients would use for different amounts of time, or in
marketing if we want to target discounts whose cost to the company depends on
how much the discounts are used. Here, we derive the form of the optimal
treatment allocation rule under budget constraints, and propose a practical
random forest based method for learning a treatment rule using data from a
randomized trial or, more broadly, unconfounded data. Our approach leverages a
statistical connection between our problem and that of learning heterogeneous
treatment effects under endogeneity using an instrumental variable. We find our
method to exhibit promising empirical performance both in simulations and in a
marketing application.
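As an illustration of the flavour of rule described above (not the paper's exact estimator or optimality result), a budget-constrained allocation that prioritises units by estimated benefit per unit of expected cost can be sketched as follows; tau_hat and cost_hat stand in for the outputs of any suitable estimator, e.g. the proposed random-forest method.

import numpy as np

def allocate_under_budget(tau_hat, cost_hat, budget):
    """Greedy illustration of a budget-constrained rule: prioritise units by
    estimated treatment effect per unit of expected cost.
    tau_hat:  estimated treatment effects, shape (n,)
    cost_hat: estimated (positive) expected costs, shape (n,)
    budget:   total budget available."""
    order = np.argsort(-(tau_hat / cost_hat))   # most cost-effective units first
    treat = np.zeros(len(tau_hat), dtype=bool)
    spent = 0.0
    for i in order:
        if tau_hat[i] <= 0:                     # stop once treatment stops helping
            break
        if spent + cost_hat[i] <= budget:
            treat[i] = True
            spent += cost_hat[i]
    return treat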
OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?
This paper presents OxfordTVG-HIC (Humorous Image Captions), a large-scale
dataset for humour generation and understanding. Humour is an abstract,
subjective, and context-dependent cognitive construct involving several
cognitive factors, making it a challenging task to generate and interpret.
Hence, humour generation and understanding can serve as a new task for
evaluating the ability of deep-learning methods to process abstract and
subjective information. Due to the scarcity of data, humour-related generation
tasks such as captioning remain under-explored. To address this gap,
OxfordTVG-HIC offers approximately 2.9M image-text pairs with humour scores to
train a generalizable humour captioning model. Contrary to existing captioning
datasets, OxfordTVG-HIC features a wide range of emotional and semantic
diversity resulting in out-of-context examples that are particularly conducive
to generating humour. Moreover, OxfordTVG-HIC is curated devoid of offensive
content. We also show how OxfordTVG-HIC can be leveraged for evaluating the
humour of a generated text. Through explainability analysis of the trained
models, we identify the visual and linguistic cues influential for evoking
humour prediction (and generation). We observe qualitatively that these cues
are aligned with the benign violation theory of humour in cognitive psychology.
Comment: Accepted by ICCV 2023.
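A small, purely illustrative sketch of the two uses mentioned above: selecting image-text pairs by their annotated humour score for training, and using a trained humour scorer to evaluate generated captions. The field and function names are hypothetical, not the dataset's actual schema.

def select_training_pairs(pairs, min_score=0.5):
    """Keep image-caption pairs whose annotated humour score clears a threshold."""
    return [p for p in pairs if p["humour_score"] >= min_score]

def mean_humour(scorer, captions):
    """Evaluate generated captions with a trained humour scorer (e.g. a model
    trained on OxfordTVG-HIC); returns the average predicted humour score."""
    scores = [scorer(c) for c in captions]
    return sum(scores) / len(scores)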
Evaluation Research of Traction Motor Performance for Mine Dump Truck Based on Rough Set Theory
This paper presents a traction motor evaluation method based on the energy transfer characteristics of the electric transmission and on different sources of supply, including motor manufacturers, diesel turbine manufacturers, wheel-side reducer manufacturers, and electric drive system integrators. Nine evaluation indices are proposed at three levels, covering motor body and control performance, an electric drive system coordination index, and driving conditions under a specific cycle. A motor performance evaluation system is established by means of electric transmission tests and a computer simulation platform, using rough set theory. Experimental results show that the model can accurately evaluate the state of the traction motor, and its accuracy is better than that of subjective weighting analysis, verifying the integrity and usefulness of the evaluation method. At the same time, the comprehensive evaluation index of permanent magnet synchronous motors is high, indicating important research value.
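The abstract does not spell out the rough-set computations, so the following is only a minimal sketch of the standard machinery such an evaluation system typically relies on: the dependency degree of the decision attribute on the condition attributes (here, the evaluation indices), and the induced attribute significance used as objective weights.

from collections import defaultdict

def dependency_degree(table, cond_attrs, dec_attr):
    """Rough-set dependency degree gamma_B(D): the fraction of objects whose
    condition-attribute equivalence class maps to a single decision value.
    table: list of dicts, one per object (evaluation record)."""
    blocks = defaultdict(list)
    for obj in table:
        key = tuple(obj[a] for a in cond_attrs)   # B-indiscernibility class
        blocks[key].append(obj[dec_attr])
    positive = sum(len(v) for v in blocks.values() if len(set(v)) == 1)
    return positive / len(table)

def attribute_significance(table, cond_attrs, dec_attr):
    """Significance of each condition attribute: the drop in dependency degree
    when that attribute is removed -- a common rough-set weighting heuristic."""
    base = dependency_degree(table, cond_attrs, dec_attr)
    return {a: base - dependency_degree(table,
                                        [b for b in cond_attrs if b != a],
                                        dec_attr)
            for a in cond_attrs}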
OxfordTVG-HIC: can machine make humorous captions from images?
This paper presents OxfordTVG-HIC (Humorous Image Captions), a large-scale dataset for humour generation and understanding. Humour is an abstract, subjective, and context-dependent cognitive construct involving several cognitive factors, making it a challenging task to generate and interpret. Hence, humour generation and understanding can serve as a new task for evaluating the ability of deep-learning methods to process abstract and subjective information. Due to the scarcity of data, humour-related generation tasks such as captioning remain under-explored. To address this gap, OxfordTVG-HIC offers approximately 2.9M image-text pairs with humour scores to train a generalizable humour captioning model. Contrary to existing captioning datasets, OxfordTVG-HIC features a wide range of emotional and semantic diversity resulting in out-of-context examples that are particularly conducive to generating humour. Moreover, OxfordTVG-HIC is curated devoid of offensive content. We also show how OxfordTVG-HIC can be leveraged for evaluating the humour of a generated text. Through explainability analysis of the trained models, we identify the visual and linguistic cues influential for evoking humour prediction (and generation). We observe qualitatively that these cues are aligned with the benign violation theory of humour in cognitive psychology.
Gamification of mobile wallet as an unconventional innovation for promoting Fintech:An fsQCA approach
Although digitalisation brings important possibilities to banking & finance services, implementing digital technologies in practice can be challenging. Indeed, the adoption of innovative new technology in the banking & finance sector lags behind other business sectors. Many valuable banking & finance-related technologies have not been adopted in relation to the strategic implications of decisions in domains such as the development of service innovation and personalisation, value co-creation, and marketing strategies. In particular, there is a paucity of research on using gamification to explore ways of customising banking & finance fintech offerings, improving customers' experience, and developing efficient banking & finance marketing tactics. Drawing on UTAUT2 and the Octalysis gamification framework, this study develops a research model investigating what configurations of motivations, expectations, and conditions can shape consumers' behavioural intention to adopt a gamified mobile wallet system. Findings suggest that combining effort expectancy, facilitating conditions, and perceived value leads to higher intention to use a gamified mobile wallet. Accordingly, firms need to consider these three core conditions when designing relevant gamification.
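As a minimal sketch of the fsQCA machinery the study draws on (not the authors' own analysis code), the consistency and coverage of a configuration, i.e. the fuzzy intersection of condition memberships such as effort expectancy, facilitating conditions, and perceived value, against the outcome (intention to use) can be computed as follows.

import numpy as np

def fsqca_consistency_coverage(conditions, outcome):
    """Consistency and coverage of one configuration in fuzzy-set QCA.
    conditions: (n_cases, n_conditions) fuzzy memberships in [0, 1]
    outcome:    (n_cases,) fuzzy membership in the outcome set."""
    config = conditions.min(axis=1)                       # fuzzy AND of the conditions
    overlap = np.minimum(config, outcome).sum()
    consistency = overlap / config.sum()                  # how reliably config implies outcome
    coverage = overlap / outcome.sum()                    # how much of the outcome it explains
    return consistency, coverage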