Multi-evidence and multi-modal fusion network for ground-based cloud recognition
In recent times, deep neural networks have drawn much attention in ground-based cloud recognition. Yet such approaches center on learning global features from visual information, which yields incomplete representations of ground-based clouds. In this paper, we propose a novel method named multi-evidence and multi-modal fusion network (MMFN) for ground-based cloud recognition, which learns extended cloud information by fusing heterogeneous features in a unified framework. Namely, MMFN exploits multiple pieces of evidence, i.e., global and local visual features, from ground-based cloud images using the main network and the attentive network. In the attentive network, local visual features are extracted from attentive maps which are obtained by refining salient patterns from convolutional activation maps. Meanwhile, the multi-modal network in MMFN learns multi-modal features for ground-based clouds. To fully fuse the multi-modal and multi-evidence visual features, we design two fusion layers in MMFN to incorporate multi-modal features with global and local visual features, respectively. Furthermore, we release the first multi-modal ground-based cloud dataset, named MGCD, which contains not only ground-based cloud images but also the multi-modal information corresponding to each cloud image. The MMFN is evaluated on MGCD and achieves a classification accuracy of 88.63%, which is competitive with state-of-the-art methods and validates its effectiveness for ground-based cloud recognition.
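The two fusion layers described above can be sketched as follows: each one concatenates multi-modal features with one kind of visual evidence (global or local) and projects the result into a joint space. Dimensions and layer choices here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Hedged sketch of MMFN's fusion idea: multi-modal features are incorporated
# with global and local visual features via two separate fusion layers.
class FusionLayer(nn.Module):
    def __init__(self, visual_dim, modal_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(visual_dim + modal_dim, out_dim)

    def forward(self, visual_feat, modal_feat):
        # Concatenate heterogeneous features, then project to a joint space.
        return torch.relu(self.fc(torch.cat([visual_feat, modal_feat], dim=1)))

global_fusion = FusionLayer(visual_dim=512, modal_dim=64, out_dim=256)
local_fusion = FusionLayer(visual_dim=512, modal_dim=64, out_dim=256)

g = torch.randn(8, 512)   # global visual features (main network)
l = torch.randn(8, 512)   # local visual features (attentive network)
m = torch.randn(8, 64)    # multi-modal features (e.g., weather measurements)

joint = torch.cat([global_fusion(g, m), local_fusion(l, m)], dim=1)
print(joint.shape)  # torch.Size([8, 512])
```

The final representation simply concatenates the two fused outputs; the actual MMFN may combine them differently before classification.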
Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective
Visual representation learning is the key to solving various vision problems.
Relying on the seminal grid structure priors, convolutional neural networks
(CNNs) have been the de facto standard architectures of most deep vision
models. For instance, classical semantic segmentation methods often adopt a
fully-convolutional network (FCN) with an encoder-decoder architecture. The
encoder progressively reduces the spatial resolution and learns more abstract
visual concepts with larger receptive fields. Since context modeling is
critical for segmentation, the latest efforts have been focused on increasing
the receptive field, through either dilated (i.e., atrous) convolutions or
inserting attention modules. However, the FCN-based architecture remains
unchanged. In this paper, we aim to provide an alternative perspective by
treating visual representation learning generally as a sequence-to-sequence
prediction task. Specifically, we deploy a pure Transformer to encode an image
as a sequence of patches, without local convolution and resolution reduction.
With the global context modeled in every layer of the Transformer, stronger
visual representation can be learned for better tackling vision tasks. In
particular, our segmentation model, termed as SEgmentation TRansformer (SETR),
excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on
the day of submission), Pascal Context (55.83% mIoU) and reaches competitive
results on Cityscapes. Further, we formulate a family of Hierarchical
Local-Global (HLG) Transformers characterized by local attention within windows
and global-attention across windows in a hierarchical and pyramidal
architecture. Extensive experiments show that our method achieves appealing
performance on a variety of visual recognition tasks (e.g., image
classification, object detection and instance segmentation and semantic
segmentation). Comment: Extended version of the CVPR 2021 paper arXiv:2012.1584
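The sequence-to-sequence view described above can be sketched in a few lines: the image is split into fixed-size patches, each patch is linearly embedded, and a pure Transformer encoder models global context at every layer, with no convolution and no resolution reduction. Patch size, embedding width, and depth are illustrative assumptions, not SETR's exact configuration.

```python
import torch
import torch.nn as nn

patch, dim = 16, 256
img = torch.randn(1, 3, 224, 224)

# Unfold the image into a sequence of flattened 16x16 patches: (1, 196, 768).
seq = img.unfold(2, patch, patch).unfold(3, patch, patch)
seq = seq.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)

# Pure Transformer encoder: every layer attends over the full patch sequence,
# so global context is modeled at every depth.
embed = nn.Linear(3 * patch * patch, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
tokens = encoder(embed(seq))
print(tokens.shape)  # torch.Size([1, 196, 256])
```

For segmentation, the output token sequence would be reshaped back to a 14x14 spatial grid and decoded; positional embeddings are omitted here for brevity.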
Connecting the Dots for People with Autism: A Data-driven Approach to Designing and Evaluating a Global Filter
“Social communication is the use of language in social contexts. It encompasses social interaction, social cognition, pragmatics, and language processing” [3]. One presumed prerequisite of social communication is visual attention–the focus of this work. “Visual attention is a process that directs a tiny fraction of the information arriving at primary visual cortex to high-level centers involved in visual working memory and pattern recognition” [7]. This process involves the integration of two streams: the global and local streams; the global stream rapidly processes the scene, and the local stream processes details. This integration is important to social communication in that attending to both the global and local features of a scene is necessary to grasp the overall meaning. For people with autism spectrum disorder (ASD), the integration of these two streams can be disrupted by the tendency to privilege details (local processing) over seeing the big picture (global processing) [66]. Consequently, people with ASD may have challenges integrating visual attention, which may disrupt their social communication. This doctoral work explores the hypothesis that visual attention can be redirected to the features of an image that contain holistic information about a scene, which when highlighted might enable people with ASD to see the forest as well as the trees (i.e., seeing a scene as a whole rather than in parts). The focus is on 1) designing a global filter that can shift visual attention from local details to global features, and 2) evaluating the performance of a global filter by leveraging eye-tracking technology. This doctoral work manipulates visual stimuli in an effort to shift the visual attention of people with ASD.
This doctoral work includes two development life cycles (i.e., design, develop, evaluate): 1) a low-fidelity filter, and 2) a high-fidelity filter. The low-fidelity life cycle includes the design of four low-fidelity filters for an initial experiment, tested with an adult participant with ASD. The performance of each filter was evaluated using verbal responses and eye-tracking data in terms of visual analysis, fixation analysis, and saccade analysis. The results from this cycle informed the design of a high-fidelity filter in the next development life cycle. In this second cycle, ten children with ASD participated in the experiment. The performance of the high-fidelity filter was evaluated using both verbal responses and eye-tracking data in terms of eye gaze behaviors. Results indicate that baseline conditions slightly outperform global filters in terms of verbal responses and eye gaze behaviors.
To unpack the results in more detail beyond group comparisons, three analyses of image characteristics (luminance, chroma, and spatial frequency) are performed to ascertain relevant aspects that contribute to filter performance. The results indicate that there are no significant correlations between the image characteristics and filter performance. However, among the three characteristics, spatial frequency emerges as the factor most correlated with filter performance. Additional analyses using neural networks, specifically a Multi-Layer Perceptron (MLP) and a Convolutional Neural Network (CNN), are also explored. The results show that the CNN is more predictive of the relationship between an image and visual attention than the MLP. This is a proof of concept that neural networks can be employed to identify images for future experiments, avoiding variance or bias from unbalanced image characteristics across the experimental image pool.
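The image-characteristic analyses described above can be sketched with simple statistics: mean luminance, and a spatial-frequency summary computed from the 2D FFT. The dissertation's exact metrics are not specified here, so this is an illustrative approximation.

```python
import numpy as np

def image_stats(gray):
    # gray: 2D array of pixel intensities in [0, 1].
    luminance = gray.mean()
    # Magnitude spectrum, with the DC component shifted to the center.
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray)))
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    # Energy-weighted mean radial frequency: higher values mean more
    # fine detail (high spatial frequency) in the image.
    mean_freq = (radius * spectrum).sum() / spectrum.sum()
    return luminance, mean_freq

rng = np.random.default_rng(0)
flat = np.full((64, 64), 0.5)    # uniform image: energy only in the DC term
noisy = rng.random((64, 64))     # noise: energy spread to high frequencies
assert image_stats(flat)[1] < image_stats(noisy)[1]
```

Per-image statistics like these can then be correlated with eye-tracking outcomes, or fed to an MLP/CNN as in the analyses above.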
Lightweight Vision Transformer with Cross Feature Attention
Recent advances in vision transformers (ViTs) have achieved great performance
in visual recognition tasks. Convolutional neural networks (CNNs) exploit
spatial inductive bias to learn visual representations, but these networks are
spatially local. ViTs can learn global representations with their
self-attention mechanism, but they are usually heavy-weight and unsuitable for
mobile devices. In this paper, we propose cross feature attention (XFA) to
bring down the computation cost of transformers, and combine it with efficient
mobile CNNs to form a novel lightweight CNN-ViT hybrid model, XFormer, which
can serve as a general-purpose backbone for learning both global and local
representations. Experimental results show that XFormer outperforms numerous CNN
and ViT-based models across different tasks and datasets. On ImageNet1K
dataset, XFormer achieves top-1 accuracy of 78.5% with 5.5 million parameters,
which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT
(ViT-based) with a similar number of parameters. Our model also performs well when
transferring to object detection and semantic segmentation tasks. On MS COCO
dataset, XFormer exceeds MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in YOLOv3
framework with only 6.3M parameters and 3.8G FLOPs. On Cityscapes dataset, with
only a simple all-MLP decoder, XFormer achieves mIoU of 78.5 and FPS of 15.3,
surpassing state-of-the-art lightweight segmentation networks. Comment: Technical Report
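One common route to reducing the quadratic cost of self-attention is to attend across feature channels rather than tokens (as in cross-covariance-style attention), making the cost linear in sequence length. The sketch below illustrates that general idea only; it is an assumption for illustration, not XFormer's exact XFA formulation.

```python
import torch
import torch.nn.functional as F

def channel_attention(q, k, v):
    # q, k, v: (batch, tokens, dim). The attention map is (dim, dim), so the
    # cost scales with dim**2 * tokens rather than tokens**2 * dim.
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (batch, dim, dim)
    return (attn @ v.transpose(1, 2)).transpose(1, 2)     # (batch, tokens, dim)

x = torch.randn(2, 1024, 64)          # long sequence, small feature dim
out = channel_attention(x, x, x)
print(out.shape)  # torch.Size([2, 1024, 64])
```

With 1024 tokens and 64 channels, the channel-wise map is 64x64 instead of 1024x1024, which is why this style of attention suits mobile-scale models.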
A Dual-Stream Neural Network Explains the Functional Segregation of Dorsal and Ventral Visual Pathways in Human Brains
The human visual system uses two parallel pathways for spatial processing and
object recognition. In contrast, computer vision systems tend to use a single
feedforward pathway, rendering them less robust, adaptive, or efficient than
human vision. To bridge this gap, we developed a dual-stream vision model
inspired by the human eyes and brain. At the input level, the model samples two
complementary visual patterns to mimic how the human eyes use magnocellular and
parvocellular retinal ganglion cells to separate retinal inputs to the brain.
At the backend, the model processes the separate input patterns through two
branches of convolutional neural networks (CNN) to mimic how the human brain
uses the dorsal and ventral cortical pathways for parallel visual processing.
The first branch (WhereCNN) samples a global view to learn spatial attention
and control eye movements. The second branch (WhatCNN) samples a local view to
represent the object around the fixation. Over time, the two branches interact
recurrently to build a scene representation from moving fixations. We compared
this model with human brains processing the same movie and evaluated their
functional alignment by linear transformation. The WhereCNN and WhatCNN
branches were found to differentially match the dorsal and ventral pathways of
the visual cortex, respectively, primarily due to their different learning
objectives. These model-based results lead us to speculate that the distinct
responses and representations of the ventral and dorsal streams are more
influenced by their distinct goals in visual attention and object recognition
than by their specific bias or selectivity in retinal inputs. This dual-stream
model takes a further step in brain-inspired computer vision, enabling parallel
neural networks to actively explore and understand the visual surroundings.
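The dual-stream idea described above can be sketched as two CNN branches: one sees a coarse global view and predicts where to look (WhereCNN), the other sees a high-resolution local crop around that fixation and encodes the object there (WhatCNN). Layer sizes, the cropping scheme, and the absence of recurrence are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_branch(out_dim):
    # A tiny CNN stand-in for either stream.
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, out_dim),
    )

where_cnn = make_branch(out_dim=2)    # predicts the next fixation (x, y)
what_cnn = make_branch(out_dim=128)   # represents the object at fixation

image = torch.randn(1, 3, 256, 256)
global_view = F.interpolate(image, size=64)      # coarse, wide view
fixation = torch.tanh(where_cnn(global_view))    # coordinates in [-1, 1]

# Take a 64x64 local crop centered on the predicted fixation.
cx = int((fixation[0, 0].item() * 0.5 + 0.5) * (256 - 64))
cy = int((fixation[0, 1].item() * 0.5 + 0.5) * (256 - 64))
local_view = image[:, :, cy:cy + 64, cx:cx + 64]
object_code = what_cnn(local_view)
print(object_code.shape)  # torch.Size([1, 128])
```

In the actual model, the two branches interact recurrently across fixations to build up a scene representation; this sketch shows a single where-then-what step.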