
    PDANet: Pyramid Density-aware Attention Net for Accurate Crowd Counting

    Full text link
    Crowd counting, i.e., estimating the number of people in a crowded area, has attracted much interest in the research community. Although many attempts have been reported, crowd counting remains an open real-world problem due to the vast scale variations in crowd density within the area of interest and severe occlusion among the crowd. In this paper, we propose a novel Pyramid Density-Aware Attention-based network, abbreviated as PDANet, that leverages attention, pyramid scale features, and two-branch decoder modules for density-aware crowd counting. The PDANet utilizes these modules to extract features at different scales, focus on the relevant information, and suppress the misleading ones. We also address the variation of crowdedness levels among different images with an exclusive Density-Aware Decoder (DAD). For this purpose, a classifier evaluates the density level of the input features and then passes them to the corresponding high- and low-crowded DAD modules. Finally, we generate an overall density map by considering the summation of the low- and high-crowded density maps as spatial attention. Meanwhile, we employ two losses to create a precise density map for the input scene. Extensive evaluations conducted on challenging benchmark datasets demonstrate the superior performance of the proposed PDANet, in terms of counting accuracy and the quality of the generated density maps, over well-known state-of-the-art methods.
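
    The routing described in this abstract can be pictured with a small sketch. The toy module below is our own illustration of the idea rather than the authors' implementation: a shared encoder feeds a crowdedness classifier, whose soft score weights a low-crowd and a high-crowd decoder branch before their maps are combined into the final density map. All layer sizes and the soft-routing form are assumptions made for illustration.

```python
# Minimal, illustrative sketch of density-aware routing (not the PDANet code).
import torch
import torch.nn as nn

class ToyDensityAwareCounter(nn.Module):
    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        # Shared encoder standing in for the pyramid feature extractor.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # Classifier producing a scalar "crowdedness" score per image.
        self.density_cls = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_ch, 1), nn.Sigmoid(),
        )
        # Two decoder branches specialised for low / high crowd levels.
        self.low_decoder = nn.Conv2d(feat_ch, 1, 1)
        self.high_decoder = nn.Conv2d(feat_ch, 1, 1)

    def forward(self, x):
        feats = self.encoder(x)
        p_high = self.density_cls(feats).view(-1, 1, 1, 1)  # in [0, 1]
        low_map = self.low_decoder(feats)
        high_map = self.high_decoder(feats)
        # Combine the branch outputs, weighted by the predicted density level.
        density_map = (1 - p_high) * low_map + p_high * high_map
        count = density_map.sum(dim=(1, 2, 3))
        return density_map, count

model = ToyDensityAwareCounter()
maps, counts = model(torch.randn(2, 3, 64, 64))
print(maps.shape, counts.shape)  # torch.Size([2, 1, 64, 64]) torch.Size([2])
```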

    Visual Superordinate Abstraction for Robust Concept Learning

    Full text link
    Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners are still vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g., {red, blue, ...} ∈ 'color' subspace while cube ∈ 'shape'. In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e., visual superordinates). With only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings, increasing the overall answering accuracy by a relative 7.5% on reasoning with perturbations and 15.6% on compositional generalization tests.
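
    As a concrete, purely illustrative reading of the quasi-center clustering idea, the sketch below pulls visual embeddings toward the center of their concept within one superordinate (e.g. 'color') and pushes distinct concept centers apart; the loss form and the margin value are our assumptions, not the paper's exact formulation.

```python
# Rough sketch of clustering concepts inside one visual superordinate.
import torch
import torch.nn.functional as F

def superordinate_cluster_loss(embeddings, concept_ids, margin=1.0):
    """embeddings: (N, D) features projected into one superordinate subspace.
    concept_ids: (N,) integer concept labels, e.g. red=0, blue=1, green=2."""
    concepts = concept_ids.unique()                       # sorted concept labels
    centers = torch.stack([embeddings[concept_ids == c].mean(dim=0) for c in concepts])
    idx = torch.searchsorted(concepts, concept_ids)       # map each sample to its center
    pull = F.mse_loss(embeddings, centers[idx])           # pull samples to their center
    dists = torch.cdist(centers, centers)
    off_diag = dists[~torch.eye(len(concepts), dtype=torch.bool)]
    # Push concept centers apart until they are at least `margin` from each other.
    push = F.relu(margin - off_diag).mean() if off_diag.numel() else dists.new_zeros(())
    return pull + push

emb = torch.randn(12, 16)
labels = torch.randint(0, 3, (12,))
print(superordinate_cluster_loss(emb, labels))
```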

    Cross-Modal Contrastive Learning for Robust Reasoning in VQA

    Full text link
    Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently. However, most reasoning models heavily rely on shortcuts learned from training data, which prevents their usage in challenging real-world scenarios. In this paper, we propose a simple but effective cross-modal contrastive learning strategy to get rid of the shortcut reasoning caused by imbalanced annotations and improve the overall performance. Different from existing contrastive learning, which relies on complex negative categories at the coarse (Image, Question, Answer) triplet level, we leverage the correspondences between the language and image modalities to perform finer-grained cross-modal contrastive learning. We treat each Question-Answer (QA) pair as a whole and differentiate between images that conform with it and those against it. To alleviate the issue of sampling bias, we further build connected graphs among images. For each positive pair, we regard the images from different graphs as negative samples and derive a multi-positive version of contrastive learning. To the best of our knowledge, this is the first work to show that a general contrastive learning strategy without delicate hand-crafted rules can contribute to robust VQA reasoning. Experiments on several mainstream VQA datasets demonstrate our superiority over state-of-the-art methods. Code is available at https://github.com/qizhust/cmcl_vqa_pl
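
    The multi-positive formulation can be sketched as follows. This is a generic illustration rather than the released cmcl_vqa_pl code: a Question-Answer embedding is contrasted against a pool of image embeddings, every conforming image counts as a positive, and images from other graphs act as negatives.

```python
# Generic multi-positive contrastive loss for one QA pair (illustrative only).
import torch
import torch.nn.functional as F

def multi_positive_nce(qa_emb, img_embs, positive_mask, temperature=0.1):
    """qa_emb: (D,) embedding of one Question-Answer pair.
    img_embs: (N, D) image embeddings.
    positive_mask: (N,) bool, True for images that conform with the QA pair."""
    qa = F.normalize(qa_emb, dim=0)
    imgs = F.normalize(img_embs, dim=1)
    logits = imgs @ qa / temperature                    # cosine similarities / T
    log_prob = logits - torch.logsumexp(logits, dim=0)  # log-softmax over all images
    # Average the log-probability over every positive image (multi-positive form).
    return -log_prob[positive_mask].mean()

qa = torch.randn(64)
images = torch.randn(8, 64)
mask = torch.tensor([1, 1, 0, 0, 1, 0, 0, 0], dtype=torch.bool)
print(multi_positive_nce(qa, images, mask))
```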

    BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

    Full text link
    Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image. However, this assumption overlooks that off-screen sounds and background noise often contaminate audio recordings in real-world scenarios. They pose significant challenges to building a consistent semantic mapping between audio and visual signals for AVS models and thus impede precise sound localization. In this work, we propose a two-stage bootstrapping audio-visual segmentation framework that incorporates multi-modal foundation knowledge. In a nutshell, our BAVS is designed to eliminate the interference of background noise or off-screen sounds in segmentation by establishing audio-visual correspondences in an explicit manner. In the first stage, we employ a segmentation model to localize potential sounding objects from visual data without being affected by contaminated audio signals. Meanwhile, we also utilize a foundation audio classification model to discern audio semantics. Considering that the audio tags provided by the audio foundation model are noisy, associating object masks with audio tags is not trivial. Thus, in the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic sounding objects. Here, we construct an audio-visual tree based on the hierarchical correspondence between sounds and object categories. We then examine the label concurrency between the localized objects and the classified audio tags by tracing the audio-visual tree. With AVIS, we can effectively segment the real sounding objects. Extensive experiments demonstrate the superiority of our method on AVS datasets, particularly in scenarios involving background noise. Our project website is https://yenanliu.github.io/AVSS.github.io/
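
    The second-stage matching can be illustrated with a toy sketch, which is our own simplification rather than the BAVS code: a small hand-made mapping stands in for the audio-visual tree, and an object mask is kept only when its category co-occurs with one of the classified audio tags. All category and tag names here are hypothetical.

```python
# Toy audio-visual consistency check (illustrative stand-in for the real tree).
AUDIO_VISUAL_TREE = {
    "speech": {"person"},
    "barking": {"dog"},
    "engine": {"car", "motorcycle"},
}

def filter_sounding_objects(object_masks, audio_tags):
    """object_masks: dict {category: mask}; audio_tags: iterable of audio labels.
    Returns only the masks whose category is supported by some audio tag."""
    allowed = set()
    for tag in audio_tags:
        allowed |= AUDIO_VISUAL_TREE.get(tag, set())
    return {cat: mask for cat, mask in object_masks.items() if cat in allowed}

masks = {"person": "mask_A", "dog": "mask_B", "tree": "mask_C"}
print(filter_sounding_objects(masks, ["speech", "wind"]))  # keeps only 'person'
```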

    Unleashing the Potential of Regularization Strategies in Learning with Noisy Labels

    Full text link
    In recent years, research on learning with noisy labels has focused on devising novel algorithms that can achieve robustness to noisy training labels while generalizing to clean data. These algorithms often incorporate sophisticated techniques, such as noise modeling, label correction, and co-training. In this study, we demonstrate that a simple baseline using cross-entropy loss, combined with widely used regularization strategies such as learning rate decay, model weight averaging, and data augmentation, can outperform state-of-the-art methods. Our findings suggest that employing a combination of regularization strategies can be more effective than intricate algorithms in tackling the challenges of learning with noisy labels. While some of these regularization strategies have been utilized in previous noisy-label learning research, their full potential has not been thoroughly explored. Our results encourage a reevaluation of benchmarks for learning with noisy labels and prompt reconsideration of the role of specialized learning algorithms designed for training with noisy labels.
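
    The kind of baseline the abstract refers to can be sketched roughly as below, with stand-in data and assumed hyperparameters rather than the authors' exact recipe: plain cross-entropy combined with cosine learning-rate decay, an exponential moving average of the model weights, and a simple augmentation.

```python
# Sketch of a regularized cross-entropy baseline (assumed setup, toy data).
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

def random_flip(images):
    # Minimal data augmentation: flip the whole batch horizontally half the time.
    return torch.flip(images, dims=[-1]) if torch.rand(()) < 0.5 else images

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
ema_model = AveragedModel(model, avg_fn=lambda avg, new, n: 0.999 * avg + 0.001 * new)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = nn.CrossEntropyLoss()

for step in range(2):                          # tiny loop with random stand-in data
    images = random_flip(torch.rand(8, 3, 32, 32))
    noisy_labels = torch.randint(0, 10, (8,))  # stands in for noisy annotations
    loss = criterion(model(images), noisy_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_model.update_parameters(model)         # model weight averaging (EMA)
    scheduler.step()                           # learning-rate decay
print(loss.item())
```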

    Machine intelligence for nerve conduit design and production

    Get PDF
    Nerve guidance conduits (NGCs) have emerged from recent advances within tissue engineering as a promising alternative to autografts for peripheral nerve repair. NGCs are tubular structures made of engineered biomaterials, which guide axonal regeneration from the injured proximal nerve to the distal stump. NGC design can synergistically combine multiple properties to enhance the proliferation of stem and neuronal cells, improve nerve migration, attenuate inflammation, and reduce scar tissue formation. The aim of most laboratories fabricating NGCs is the development of an automated process that incorporates patient-specific features and complex tissue blueprints (e.g., a neurovascular conduit) that serve as the basis for more complicated muscular and skin grafts. One of the major limitations for tissue engineering is the lack of guidance for generating tissue blueprints and the absence of streamlined manufacturing processes. With the rapid expansion of machine intelligence, high-dimensional image analysis, and computational scaffold design, optimized tissue templates for 3D bioprinting (3DBP) are feasible. In this review, we examine the translational challenges to peripheral nerve regeneration and how machine intelligence can address bottlenecks in neural tissue engineering.

    Accuracy of Eyes of AI™ Artificial Intelligence Driven Platform for Lateral Cephalometric Analysis

    Get PDF
    Aim: The objective of this prospective study was to evaluate the accuracy of cephalometric analyses acquired through manual tracing and the Eyes of AI™ AI-driven web-based program. Materials and Methods: This prospective study employed randomization conducted via computer software, with a determined sample size of 150 cases. Inclusion criteria encompassed good-quality lateral cephalograms available in both digital and print formats, absence of artifacts that might hinder anatomical point location, and presence of a clear calibration ruler for magnification determination. Exclusion criteria included lateral cephalograms with identifiable motion artifacts, resolution disparity, or insufficient contrast, as well as those exhibiting positional errors indicated by ear rod markers. Each lateral cephalogram underwent tracing and analysis using the manual method as well as the Eyes of AI™ software. Following landmark plotting, linear and angular measurements of the Steiner, Downs, McNamara, and Jefferson analyses were calculated. Results: A comparison of thirty-six cephalometric measurements of the Steiner, Downs, McNamara, and Jefferson analyses obtained from manual tracing and the AI-driven Eyes of AI™ revealed a Concordance Correlation Coefficient (CCC) value above 0.76 for all parameters, indicating strong agreement between manual and AI-driven cephalometric measurements. Furthermore, a CCC value exceeding 0.9 was observed for twenty-eight parameters, indicative of very strong agreement. Conclusion: Automated lateral cephalometric measurements obtained from Eyes of AI™ are accurate when compared to manual measurements.
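
    For reference, the Concordance Correlation Coefficient reported above measures agreement between two sets of paired measurements. The short sketch below computes Lin's CCC on made-up values; the numbers are illustrative only.

```python
# Lin's Concordance Correlation Coefficient on illustrative paired measurements.
import numpy as np

def concordance_ccc(x, y):
    """CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

manual = [82.1, 79.5, 80.3, 84.0, 77.8]  # hypothetical hand-traced angles (degrees)
ai = [81.7, 79.9, 80.0, 83.6, 78.4]      # hypothetical automated measurements
print(round(concordance_ccc(manual, ai), 3))
```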