PDANet: Pyramid Density-aware Attention Net for Accurate Crowd Counting
Crowd counting, i.e., estimating the number of people in a crowded area, has
attracted much interest in the research community. Although many attempts have
been reported, crowd counting remains an open real-world problem due to the
vast scale variations in crowd density within the area of interest and severe
occlusion among the crowd. In this paper, we propose a novel Pyramid
Density-Aware Attention-based network, abbreviated as PDANet, that leverages
attention, pyramid-scale features, and two-branch decoder modules for
density-aware crowd counting. PDANet uses these modules to extract features at
different scales, focus on the relevant information, and suppress the
misleading ones. We also address the variation of crowdedness levels among
different images with an exclusive Density-Aware Decoder (DAD). For this
purpose, a classifier evaluates the density level of the input features and
then passes them to the corresponding high- or low-crowded DAD module.
Finally, we generate an overall density map by considering the summation of low
and high crowded density maps as spatial attention. Meanwhile, we employ two
losses to create a precise density map for the input scene. Extensive
evaluations on challenging benchmark datasets demonstrate the superior
performance of the proposed PDANet over well-known state-of-the-art methods in
terms of counting accuracy and the quality of the generated density maps.
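As a rough illustration of the density-aware routing idea described above, the PyTorch sketch below classifies a feature map's crowdedness and blends two decoder branches; the module names, the soft-routing choice, and all hyperparameters are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DensityAwareDecoder(nn.Module):
    """Illustrative sketch: route encoder features to a high- or
    low-crowdedness decoder branch based on a density classifier."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Binary density classifier over pooled features (hypothetical design).
        self.density_cls = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 2),
        )
        # Separate decoder heads for low- and high-crowded scenes.
        self.low_dad = nn.Conv2d(channels, 1, kernel_size=1)
        self.high_dad = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Soft routing weights for the two branches, one pair per image.
        w = torch.softmax(self.density_cls(feats), dim=1)  # (B, 2)
        low_map = self.low_dad(feats)
        high_map = self.high_dad(feats)
        # Weighted sum of branch outputs yields the overall density map.
        return (w[:, 0, None, None, None] * low_map
                + w[:, 1, None, None, None] * high_map)

# Usage: feats = encoder(images); density_map = DensityAwareDecoder()(feats)
```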
Visual Superordinate Abstraction for Robust Concept Learning
Concept learning constructs visual representations that are connected to
linguistic semantics, which is fundamental to vision-language tasks. Although
promising progress has been made, existing concept learners are still
vulnerable to attribute perturbations and out-of-distribution compositions
during inference. We ascribe this bottleneck to a failure to explore the
intrinsic semantic hierarchy of visual concepts, e.g. {red, blue, ...} belong
to the 'color' subspace whereas 'cube' belongs to the 'shape' subspace. In this
paper, we propose a visual
superordinate abstraction framework for explicitly modeling semantic-aware
visual subspaces (i.e. visual superordinates). With only natural visual
question answering data, our model first acquires the semantic hierarchy from a
linguistic view, and then explores mutually exclusive visual superordinates
under the guidance of the linguistic hierarchy. In addition, a quasi-center
visual concept clustering scheme and a superordinate shortcut learning scheme
are proposed
to enhance the discrimination and independence of concepts within each visual
superordinate. Experiments demonstrate the superiority of the proposed
framework under diverse settings, improving the overall answering accuracy by
a relative 7.5% on reasoning with perturbations and 15.6% on
compositional generalization tests.
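As a loose illustration of what semantic-aware visual subspaces (superordinates) could look like in code, the sketch below projects a visual feature into per-superordinate subspaces and assigns the nearest concept center; the projection layers, centers, and dimensions are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def assign_concepts(feat, projections, centers):
    """For each superordinate (e.g. 'color', 'shape'), project the visual
    feature into its subspace and pick the nearest concept center.
    `projections` and `centers` are hypothetical learned parameters."""
    labels = {}
    for name, proj in projections.items():
        sub = F.normalize(proj(feat), dim=-1)          # feature in the subspace
        cen = F.normalize(centers[name], dim=-1)       # (num_concepts, d) centers
        sims = sub @ cen.t()                           # cosine similarities
        labels[name] = sims.argmax(dim=-1)             # nearest concept per sample
    return labels

# Toy usage: two superordinates with random projections and centers.
projections = {"color": torch.nn.Linear(512, 64), "shape": torch.nn.Linear(512, 64)}
centers = {"color": torch.randn(8, 64), "shape": torch.randn(5, 64)}
feat = torch.randn(4, 512)
print(assign_concepts(feat, projections, centers))
```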
Cross-Modal Contrastive Learning for Robust Reasoning in VQA
Multi-modal reasoning in visual question answering (VQA) has witnessed rapid
progress recently. However, most reasoning models heavily rely on shortcuts
learned from training data, which prevents their usage in challenging
real-world scenarios. In this paper, we propose a simple but effective
cross-modal contrastive learning strategy to get rid of the shortcut reasoning
caused by imbalanced annotations and improve overall performance. Unlike
existing contrastive learning methods that rely on complex negative categories
at the coarse (Image, Question, Answer) triplet level, we leverage the
correspondences
between the language and image modalities to perform finer-grained cross-modal
contrastive learning. We treat each Question-Answer (QA) pair as a whole, and
differentiate between images that conform with it and those against it. To
alleviate the issue of sampling bias, we further build connected graphs among
images. For each positive pair, we regard the images from different graphs as
negative samples and derive a multi-positive version of contrastive learning.
To the best of our knowledge, this is the first work to show that a general
contrastive learning strategy without delicate hand-crafted rules can
contribute to robust VQA reasoning. Experiments on several mainstream VQA
datasets demonstrate the superiority of our method over the state of the art.
Code is available at https://github.com/qizhust/cmcl_vqa_pl
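The general shape of such a multi-positive contrastive objective can be sketched as follows; the embedding shapes, the temperature value, and the mask construction are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(qa_emb, img_emb, pos_mask, temperature=0.07):
    """Illustrative multi-positive InfoNCE-style loss (not the paper's exact objective).
    qa_emb: (B, d) embeddings of Question-Answer pairs used as anchors.
    img_emb: (N, d) embeddings of candidate images.
    pos_mask: (B, N) boolean, True where an image conforms with the QA pair;
    unmarked images serve as negatives."""
    qa = F.normalize(qa_emb, dim=-1)
    im = F.normalize(img_emb, dim=-1)
    logits = qa @ im.t() / temperature                     # (B, N) similarities
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-probability over all positives of each anchor.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss.mean()

# Toy usage: 4 QA anchors, 16 candidate images, random positive assignments.
qa, im = torch.randn(4, 256), torch.randn(16, 256)
mask = torch.rand(4, 16) > 0.8
print(multi_positive_contrastive_loss(qa, im, mask))
```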
BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge
Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate
sounding sources by predicting pixel-wise maps. Previous methods assume that
each sound component in an audio signal always has a visual counterpart in the
image. However, this assumption overlooks the fact that off-screen sounds and
background noise often contaminate audio recordings in real-world scenarios.
Such contamination poses significant challenges to building a consistent
semantic mapping between audio and visual signals for AVS models and thus
impedes precise sound
localization. In this work, we propose a two-stage bootstrapping audio-visual
segmentation framework by incorporating multi-modal foundation knowledge. In a
nutshell, our BAVS is designed to eliminate the interference of background
noise or off-screen sounds in segmentation by establishing the audio-visual
correspondences in an explicit manner. In the first stage, we employ a
segmentation model to localize potential sounding objects from visual data
without being affected by contaminated audio signals. Meanwhile, we also
utilize a foundation audio classification model to discern audio semantics.
Since the audio tags provided by the audio foundation model are noisy,
associating object masks with audio tags is not trivial. Thus, in the second
stage, we develop an audio-visual semantic integration strategy (AVIS) to
localize the authentic-sounding objects. Here, we construct an audio-visual
tree based on the hierarchical correspondence between sounds and object
categories. We then examine the label concurrency between the localized objects
and classified audio tags by tracing the audio-visual tree. With AVIS, we can
effectively segment real-sounding objects. Extensive experiments demonstrate
the superiority of our method on AVS datasets, particularly in scenarios
involving background noise. Our project website is
https://yenanliu.github.io/AVSS.github.io/
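To make the tree-tracing step concrete, here is a toy sketch of checking label concurrency between segmented object categories and classified audio tags over a category hierarchy; the parent table, category names, and matching rule are invented for illustration and are not the paper's AVIS procedure.

```python
# Hypothetical illustration: keep an object mask only if its category and
# some audio tag meet somewhere in the category hierarchy.
PARENT = {"acoustic_guitar": "guitar", "guitar": "music",
          "speech": "human_sounds", "dog_bark": "animal"}

def ancestors(label):
    """Yield the label and all of its ancestors in the tree."""
    while label is not None:
        yield label
        label = PARENT.get(label)

def matched_objects(object_labels, audio_tags):
    """Return object categories whose hierarchy overlaps a classified audio tag."""
    tag_closure = {a for t in audio_tags for a in ancestors(t)}
    return [o for o in object_labels if any(a in tag_closure for a in ancestors(o))]

# Usage: the dog mask is dropped because no audio tag traces to "animal".
print(matched_objects(["acoustic_guitar", "dog_bark"], ["guitar"]))  # ['acoustic_guitar']
```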
Unleashing the Potential of Regularization Strategies in Learning with Noisy Labels
In recent years, research on learning with noisy labels has focused on
devising novel algorithms that can achieve robustness to noisy training labels
while generalizing to clean data. These algorithms often incorporate
sophisticated techniques, such as noise modeling, label correction, and
co-training. In this study, we demonstrate that a simple baseline using
cross-entropy loss, combined with widely used regularization strategies such
as learning rate decay, model weight averaging, and data augmentation, can
outperform state-of-the-art methods. Our findings suggest that employing a
combination of regularization strategies can be more effective than intricate
algorithms in tackling the challenges of learning with noisy labels. While some
of these regularization strategies have been utilized in previous noisy label
learning research, their full potential has not been thoroughly explored. Our
results encourage a reevaluation of benchmarks for learning with noisy labels
and prompt reconsideration of the role of specialized learning algorithms
designed for training with noisy labels.
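A minimal PyTorch sketch of such a regularized cross-entropy baseline is given below; the optimizer settings, EMA decay, and loop structure are placeholder assumptions rather than the study's exact recipe, and data augmentation is assumed to live inside the data pipeline.

```python
import torch
from torch import nn, optim
from torch.optim.swa_utils import AveragedModel

def train_regularized_baseline(model, train_loader, epochs=100, lr=0.1):
    """Illustrative baseline: cross-entropy training with cosine learning rate
    decay and EMA weight averaging; hyperparameters are placeholders.
    Data augmentation is assumed to be applied inside `train_loader`'s dataset."""
    criterion = nn.CrossEntropyLoss()
    opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    sched = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    ema_model = AveragedModel(model, avg_fn=lambda avg, p, n: 0.999 * avg + 0.001 * p)
    for _ in range(epochs):
        for x, y in train_loader:               # y may contain noisy labels
            opt.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            opt.step()
            ema_model.update_parameters(model)  # model weight averaging
        sched.step()                            # learning rate decay
    return ema_model                            # evaluate the averaged weights
```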
Machine intelligence for nerve conduit design and production
Nerve guidance conduits (NGCs) have emerged from recent advances within tissue engineering as a promising alternative to autografts for peripheral nerve repair. NGCs are tubular structures with engineered biomaterials, which guide axonal regeneration from the injured proximal nerve to the distal stump. NGC design can synergistically combine multiple properties to enhance proliferation of stem and neuronal cells, improve nerve migration, attenuate inflammation and reduce scar tissue formation. The aim of most laboratories fabricating NGCs is the development of an automated process that incorporates patient-specific features and complex tissue blueprints (e.g. neurovascular conduit) that serve as the basis for more complicated muscular and skin grafts. One of the major limitations for tissue engineering is the lack of guidance for generating tissue blueprints and the absence of streamlined manufacturing processes. With the rapid expansion of machine intelligence, high-dimensional image analysis, and computational scaffold design, optimized tissue templates for 3D bioprinting (3DBP) are feasible. In this review, we examine the translational challenges to peripheral nerve regeneration and where machine intelligence can address bottlenecks in neural tissue engineering.
Accuracy of Eyes of AI™ Artificial Intelligence Driven Platform for Lateral Cephalometric Analysis
Aim: The objective of this prospective study was to evaluate the accuracy of cephalometric analyses acquired through manual tracing and the Eyes of AI™ AI-driven web-based program. Materials and Methods: This prospective study employed randomization conducted via computer software, with a determined sample size of 150 cases. Inclusion criteria encompassed good quality lateral cephalograms available in both digital and print formats, absence of artifacts that might hinder anatomical point location, and presence of a clear calibration ruler for magnification determination. Exclusion criteria included lateral cephalograms with identifiable motion artifacts, resolution disparity, or insufficient contrast, as well as those exhibiting positional errors indicated by ear rod markers. Each lateral cephalogram underwent tracing and analysis using the manual method, as well as Eyes of AI™ software. Following landmark plotting, linear and angular measurements of Steiner, Downs, McNamara, and Jefferson analyses were calculated. Results: A comparison of thirty-six cephalometric measurements of Steiner, Downs, McNamara, and Jefferson analyses obtained from manual tracing and AI-driven Eyes of AI™ revealed a Concordance Correlation Coefficient (CCC) value above 0.76 for all parameters, indicating strong agreement between manual and AI-driven cephalometric measurements. Furthermore, a CCC value exceeding 0.9 was observed for twenty-eight parameters, indicative of very strong agreement. Conclusion: Automated lateral cephalometric measurements obtained from Eyes of AI™ are accurate when compared to manual measurements.
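For context on the agreement statistic reported above, a small sketch of Lin's concordance correlation coefficient for one parameter's paired manual and AI measurements is shown below; the values and variable names are made up for illustration and are not the study's analysis code.

```python
import numpy as np

def concordance_correlation_coefficient(x, y):
    """Lin's CCC between two sets of paired measurements,
    e.g. manual vs. AI-derived values of one cephalometric parameter."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                 # population variances
    cov = ((x - mx) * (y - my)).mean()        # population covariance
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Toy usage with made-up angle measurements (degrees):
manual = [82.1, 79.5, 83.0, 80.2]
ai = [81.8, 79.9, 82.6, 80.5]
print(round(concordance_correlation_coefficient(manual, ai), 3))
```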