
    Multi-evidence and multi-modal fusion network for ground-based cloud recognition

    In recent years, deep neural networks have drawn much attention in ground-based cloud recognition. Yet such approaches center on learning global features from visual information, which yields incomplete representations of ground-based clouds. In this paper, we propose a novel method named multi-evidence and multi-modal fusion network (MMFN) for ground-based cloud recognition, which learns extended cloud information by fusing heterogeneous features in a unified framework. Specifically, MMFN exploits multiple pieces of evidence, i.e., global and local visual features, from ground-based cloud images using the main network and the attentive network. In the attentive network, local visual features are extracted from attentive maps which are obtained by refining salient patterns from convolutional activation maps. Meanwhile, the multi-modal network in MMFN learns multi-modal features for ground-based clouds. To fully fuse the multi-modal and multi-evidence visual features, we design two fusion layers in MMFN to incorporate multi-modal features with global and local visual features, respectively. Furthermore, we release the first multi-modal ground-based cloud dataset, named MGCD, which contains not only the ground-based cloud images but also the multi-modal information corresponding to each cloud image. MMFN is evaluated on MGCD and achieves a classification accuracy of 88.63%, competitive with state-of-the-art methods, which validates its effectiveness for ground-based cloud recognition.
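
    The fusion scheme described above pairs multi-modal features with global and local visual features through two separate fusion layers. Below is a minimal sketch of such a fusion head in PyTorch; the module names, feature dimensions, and layer shapes are illustrative assumptions, not the authors' exact MMFN architecture.

    # Minimal sketch of a multi-evidence / multi-modal fusion head.
    # All dimensions and module names are illustrative assumptions.
    import torch
    import torch.nn as nn

    class FusionHead(nn.Module):
        def __init__(self, visual_dim=2048, modal_dim=4, fused_dim=512, num_classes=7):
            super().__init__()
            # one fusion layer pairs multi-modal features with global visual features,
            # the other pairs them with local (attentive) visual features
            self.modal_net = nn.Sequential(nn.Linear(modal_dim, 64), nn.ReLU())
            self.fuse_global = nn.Linear(visual_dim + 64, fused_dim)
            self.fuse_local = nn.Linear(visual_dim + 64, fused_dim)
            self.classifier = nn.Linear(2 * fused_dim, num_classes)

        def forward(self, global_feat, local_feat, modal_input):
            m = self.modal_net(modal_input)
            g_fused = torch.relu(self.fuse_global(torch.cat([global_feat, m], dim=1)))
            l_fused = torch.relu(self.fuse_local(torch.cat([local_feat, m], dim=1)))
            return self.classifier(torch.cat([g_fused, l_fused], dim=1))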

    Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective

    Visual representation learning is the key to solving various vision problems. Relying on grid-structure priors, convolutional neural networks (CNNs) have been the de facto standard architectures of most deep vision models. For instance, classical semantic segmentation methods often adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, either through dilated (i.e., atrous) convolutions or by inserting attention modules. However, the FCN-based architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating visual representation learning generally as a sequence-to-sequence prediction task. Specifically, we deploy a pure Transformer to encode an image as a sequence of patches, without local convolution or resolution reduction. With the global context modeled in every layer of the Transformer, stronger visual representations can be learned for better tackling vision tasks. In particular, our segmentation model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on the day of submission) and Pascal Context (55.83% mIoU), and reaches competitive results on Cityscapes. Further, we formulate a family of Hierarchical Local-Global (HLG) Transformers characterized by local attention within windows and global attention across windows in a hierarchical and pyramidal architecture. Extensive experiments show that our method achieves appealing performance on a variety of visual recognition tasks (e.g., image classification, object detection, instance segmentation, and semantic segmentation). Comment: Extended version of CVPR 2021 paper arXiv:2012.1584
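
    As a rough illustration of the sequence-to-sequence view described above, the sketch below embeds an image as a sequence of patches, encodes it with a plain Transformer, and reshapes the tokens back into a per-pixel prediction. The dimensions, the naive 1x1-convolution decoder, and the class name are assumptions for illustration, not the released SETR code.

    # Sketch of sequence-to-sequence segmentation with a pure Transformer encoder.
    # Sizes and the decoder are illustrative assumptions.
    import torch
    import torch.nn as nn

    class PatchTransformerSeg(nn.Module):
        def __init__(self, img_size=256, patch=16, dim=768, depth=12, heads=12, num_classes=150):
            super().__init__()
            self.patch = patch
            self.grid = img_size // patch
            self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
            self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Conv2d(dim, num_classes, kernel_size=1)  # simple 1x1 decoder

        def forward(self, x):
            b = x.size(0)
            tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
            tokens = self.encoder(tokens)                                 # global context in every layer
            feat = tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
            return nn.functional.interpolate(self.head(feat), scale_factor=self.patch, mode='bilinear')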

    Connecting the Dots for People with Autism: A Data-driven Approach to Designing and Evaluating a Global Filter

    “Social communication is the use of language in social contexts. It encompasses social interaction, social cognition, pragmatics, and language processing” [3]. One presumed prerequisite of social communication is visual attention, the focus of this work. “Visual attention is a process that directs a tiny fraction of the information arriving at primary visual cortex to high-level centers involved in visual working memory and pattern recognition” [7]. This process involves the integration of two streams: the global stream rapidly processes the scene, and the local stream processes details. This integration is important to social communication in that attending to both the global and local features of a scene is necessary to grasp the overall meaning. For people with autism spectrum disorder (ASD), the integration of these two streams can be disrupted by the tendency to privilege details (local processing) over seeing the big picture (global processing) [66]. Consequently, people with ASD may have challenges integrating visual attention, which may disrupt their social communication. This doctoral work explores the hypothesis that visual attention can be redirected to the features of an image that contain holistic information about a scene, which when highlighted might enable people with ASD to see the forest as well as the trees (i.e., to see a scene as a whole rather than as parts). The focus is on 1) designing a global filter that can shift visual attention from local details to global features, and 2) evaluating the performance of the global filter by leveraging eye-tracking technology. This doctoral work manipulates visual stimuli in an effort to shift the visual attention of people with ASD, and includes two development life cycles (i.e., design, develop, evaluate): 1) a low-fidelity filter and 2) a high-fidelity filter. The low-fidelity cycle includes the design of four low-fidelity filters for an initial experiment, tested with an adult participant with ASD. The performance of each filter was evaluated using verbal responses and eye-tracking data in terms of visual analysis, fixation analysis, and saccade analysis. The results from this cycle informed the design of a high-fidelity filter in the next development life cycle, in which ten children with ASD participated in the experiment. The performance of the high-fidelity filter was evaluated using both verbal responses and eye-tracking data in terms of eye gaze behaviors. Results indicate that baseline conditions slightly outperform global filters in terms of verbal responses and eye gaze behaviors. To unpack the results beyond group comparisons, three analyses of image characteristics (luminance, chroma, and spatial frequency) are performed to ascertain relevant aspects that contribute to filter performance. The results indicate no significant correlations between the image characteristics and filter performance; however, among the three characteristics, spatial frequency is the factor most correlated with filter performance. Additional analyses using neural networks, specifically a Multi-Layer Perceptron (MLP) and a Convolutional Neural Network (CNN), are also explored. The results show that the CNN is more predictive of the relationship between an image and visual attention than the MLP. This is a proof of concept that neural networks can be employed to identify images for future experiments, avoiding variance or bias from unbalanced image characteristics across the experimental image pool.
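
    As a hedged illustration of the three image characteristics analyzed above (luminance, chroma, and spatial frequency), the sketch below computes simple per-image summaries with NumPy. The exact definitions used in the study are not given in the abstract; the formulas chosen here (relative luminance, HSV-style chroma, and an energy-weighted mean radial frequency) are assumptions for illustration only.

    # Illustrative per-image summaries of luminance, chroma, and spatial frequency.
    # Not the study's actual analysis code; formulas are assumptions.
    import numpy as np

    def image_characteristics(rgb):
        """rgb: HxWx3 array with values in [0, 1]."""
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        luminance = 0.2126 * r + 0.7152 * g + 0.0722 * b      # relative luminance
        chroma = rgb.max(axis=-1) - rgb.min(axis=-1)          # simple HSV-style chroma
        # mean spatial frequency: energy-weighted radial frequency of the luminance spectrum
        spectrum = np.abs(np.fft.fftshift(np.fft.fft2(luminance))) ** 2
        h, w = luminance.shape
        fy, fx = np.meshgrid(np.fft.fftshift(np.fft.fftfreq(h)),
                             np.fft.fftshift(np.fft.fftfreq(w)), indexing='ij')
        radius = np.sqrt(fx ** 2 + fy ** 2)
        mean_sf = (radius * spectrum).sum() / spectrum.sum()
        return luminance.mean(), chroma.mean(), mean_sf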

    Lightweight Vision Transformer with Cross Feature Attention

    Recent advances in vision transformers (ViTs) have achieved great performance in visual recognition tasks. Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations, but these networks are spatially local. ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices. In this paper, we propose cross feature attention (XFA) to bring down the computation cost of transformers, and combine it with efficient mobile CNNs to form a novel light-weight CNN-ViT hybrid model, XFormer, which can serve as a general-purpose backbone to learn both global and local representations. Experimental results show that XFormer outperforms numerous CNN- and ViT-based models across different tasks and datasets. On the ImageNet1K dataset, XFormer achieves a top-1 accuracy of 78.5% with 5.5 million parameters, which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT (ViT-based) for a similar number of parameters. Our model also performs well when transferred to object detection and semantic segmentation tasks. On the MS COCO dataset, XFormer exceeds MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in the YOLOv3 framework with only 6.3M parameters and 3.8G FLOPs. On the Cityscapes dataset, with only a simple all-MLP decoder, XFormer achieves an mIoU of 78.5 and an FPS of 15.3, surpassing state-of-the-art lightweight segmentation networks. Comment: Technical Report
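
    The abstract does not spell out the XFA formulation, so the sketch below is not the paper's cross feature attention; it only illustrates one common way to reduce self-attention cost, namely shrinking the key/value sequence before computing attention. All class names, dimensions, and the reduction operator are assumptions.

    # Hedged illustration of cheaper attention via key/value sequence reduction.
    # NOT the paper's XFA; names and dimensions are assumptions.
    import torch
    import torch.nn as nn

    class ReducedAttention(nn.Module):
        def __init__(self, dim=256, heads=4, reduction=4):
            super().__init__()
            self.heads, self.scale = heads, (dim // heads) ** -0.5
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, 2 * dim)
            self.reduce = nn.Conv1d(dim, dim, kernel_size=reduction, stride=reduction)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                                  # x: (B, N, dim)
            b, n, d = x.shape
            q = self.q(x).view(b, n, self.heads, d // self.heads).transpose(1, 2)
            xr = self.reduce(x.transpose(1, 2)).transpose(1, 2)  # shorter key/value sequence
            k, v = self.kv(xr).chunk(2, dim=-1)
            k = k.view(b, -1, self.heads, d // self.heads).transpose(1, 2)
            v = v.view(b, -1, self.heads, d // self.heads).transpose(1, 2)
            attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(b, n, d)
            return self.proj(out)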

    A Dual-Stream Neural Network Explains the Functional Segregation of Dorsal and Ventral Visual Pathways in Human Brains

    The human visual system uses two parallel pathways for spatial processing and object recognition. In contrast, computer vision systems tend to use a single feedforward pathway, rendering them less robust, adaptive, or efficient than human vision. To bridge this gap, we developed a dual-stream vision model inspired by the human eyes and brain. At the input level, the model samples two complementary visual patterns to mimic how the human eyes use magnocellular and parvocellular retinal ganglion cells to separate retinal inputs to the brain. At the backend, the model processes the separate input patterns through two branches of convolutional neural networks (CNNs) to mimic how the human brain uses the dorsal and ventral cortical pathways for parallel visual processing. The first branch (WhereCNN) samples a global view to learn spatial attention and control eye movements. The second branch (WhatCNN) samples a local view to represent the object around the fixation. Over time, the two branches interact recurrently to build a scene representation from moving fixations. We compared this model with human brains processing the same movie and evaluated their functional alignment by linear transformation. The WhereCNN and WhatCNN branches were found to differentially match the dorsal and ventral pathways of the visual cortex, respectively, primarily due to their different learning objectives. These model-based results lead us to speculate that the distinct responses and representations of the ventral and dorsal streams are influenced more by their distinct goals in visual attention and object recognition than by their specific bias or selectivity in retinal inputs. This dual-stream model takes a further step in brain-inspired computer vision, enabling parallel neural networks to actively explore and understand the visual surroundings.
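
    The sketch below is a simplified, hypothetical rendering of the dual-stream idea, not the authors' released model: a "where" branch looks at a coarse global view to predict a fixation, and a "what" branch classifies a high-resolution crop sampled around that fixation. Branch depths, crop size, and the recurrent interaction over time are assumptions or omitted.

    # Simplified dual-stream sketch: a global "where" branch picks a fixation,
    # a local "what" branch classifies the crop around it. Illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_branch(out_dim):
        return nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, out_dim))

    class DualStream(nn.Module):
        def __init__(self, num_classes=10, crop=64):
            super().__init__()
            self.crop = crop
            self.where = conv_branch(2)            # predicts a fixation (x, y) in [-1, 1]
            self.what = conv_branch(num_classes)   # classifies the local crop

        def forward(self, image):                  # image: (B, 3, H, W)
            global_view = F.interpolate(image, size=(64, 64), mode='bilinear')
            fix = torch.tanh(self.where(global_view))          # (B, 2) fixation coordinates
            # build an affine grid that samples a crop-sized window around the fixation
            b, _, h, w = image.shape
            scale = self.crop / min(h, w)
            theta = torch.zeros(b, 2, 3, device=image.device)
            theta[:, 0, 0], theta[:, 1, 1] = scale, scale
            theta[:, :, 2] = fix
            grid = F.affine_grid(theta, (b, 3, self.crop, self.crop), align_corners=False)
            local_view = F.grid_sample(image, grid, align_corners=False)
            return self.what(local_view), fix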