Size Generalization of Graph Neural Networks on Biological Data: Insights and Practices from the Spectral Perspective
We investigate size-induced distribution shifts in graphs and assess their
impact on the ability of graph neural networks (GNNs) to generalize to larger
graphs relative to the training data. Existing literature presents conflicting
conclusions on GNNs' size generalizability, primarily due to disparities in
application domains and underlying assumptions concerning size-induced
distribution shifts. Motivated by this, we take a data-driven approach: we
focus on real biological datasets and seek to characterize the types of
size-induced distribution shifts. Diverging from prior approaches, we adopt a
spectral perspective and identify that spectrum differences induced by size are
related to differences in subgraph patterns (e.g., average cycle lengths).
While previous studies have identified that the inability of GNNs to capture
subgraph information negatively impacts their in-distribution generalization,
our findings further show that this decline is more pronounced when evaluating
on larger test graphs not encountered during training. Based on these spectral
insights, we introduce a simple yet effective model-agnostic strategy, which
makes GNNs aware of these important subgraph patterns to enhance their size
generalizability. Our empirical results reveal that our proposed
size-insensitive attention strategy substantially enhances graph classification
performance on large test graphs, which are 2-10 times larger than the training
graphs, improving F1 scores by up to 8%. Comment: 21 pages, including appendix.
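The abstract above gives no implementation details; as a purely illustrative
sketch (all names, shapes, and the PyTorch framing are assumptions, not the
authors' code), a size-insensitive attention readout can normalize its
attention weights per graph so the pooled vector does not grow with node
count, while subgraph-pattern features are injected into the node embeddings
before pooling:

```python
import torch
import torch.nn as nn

class SizeInsensitivePool(nn.Module):
    """Attention readout whose pooled output does not scale with graph size."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, node_emb: torch.Tensor) -> torch.Tensor:
        # node_emb: (num_nodes, hidden_dim) for a single graph
        attn = torch.softmax(self.score(node_emb), dim=0)  # weights sum to 1 for any graph size
        return (attn * node_emb).sum(dim=0)                # (hidden_dim,) graph-level readout

# Hypothetical usage: add encoded subgraph-pattern features (e.g., cycle-length
# statistics) to the node embeddings produced by any GNN backbone, then pool.
node_emb = torch.randn(50, 64)      # placeholder GNN node embeddings
cycle_feats = torch.randn(50, 64)   # placeholder encoded cycle features
graph_vec = SizeInsensitivePool(64)(node_emb + cycle_feats)
```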
Streaming Audio Transformers for Online Audio Tagging
Transformers have emerged as a prominent model framework for audio tagging
(AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset
dataset. However, their impressive performance often comes at the cost of high
memory usage, slow inference speed, and considerable model delay, rendering
them impractical for real-world AT applications. In this study, we introduce
streaming audio transformers (SAT) that combine the vision transformer (ViT)
architecture with Transformer-XL-like chunk processing, enabling efficient
processing of long-range audio signals. Our proposed SAT is benchmarked against
other transformer-based SOTA methods, achieving significant improvements in
mean average precision (mAP) at delays of 2s and 1s, while also
exhibiting significantly lower memory usage and computational overhead.
Checkpoints are publicly available at https://github.com/RicherMans/SAT
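As a rough, assumption-laden sketch (not the released SAT code), the
Transformer-XL-style chunk processing mentioned above can be pictured as each
incoming chunk attending to a bounded cache of previous hidden states, which
keeps memory use and delay constant while streaming:

```python
from typing import Optional

import torch
import torch.nn as nn

class ChunkedAttention(nn.Module):
    """One streaming attention layer with a bounded memory of past states."""
    def __init__(self, dim: int = 256, num_heads: int = 4, mem_len: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mem_len = mem_len

    def forward(self, chunk: torch.Tensor, memory: Optional[torch.Tensor]):
        # chunk: (batch, chunk_len, dim); memory: cached states from earlier chunks
        context = chunk if memory is None else torch.cat([memory, chunk], dim=1)
        out, _ = self.attn(chunk, context, context)       # queries come only from the new chunk
        new_memory = context[:, -self.mem_len:].detach()  # bounded, gradient-free cache
        return out, new_memory

# Streaming usage: tag audio chunk by chunk with a constant memory footprint.
layer, mem = ChunkedAttention(), None
for chunk in torch.randn(10, 1, 25, 256).unbind(0):      # ten 25-frame feature chunks
    out, mem = layer(chunk, mem)
```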
An empirical study of weakly supervised audio tagging embeddings for general audio representations
We study the usability of pre-trained weakly supervised audio tagging (AT)
models as feature extractors for general audio representations. We mainly
analyze the feasibility of transferring those embeddings to other tasks within
the speech and sound domains. Specifically, we benchmark weakly supervised
pre-trained models (MobileNetV2 and EfficientNet-B0) against modern
self-supervised learning methods (BYOL-A) as feature extractors. Fourteen
downstream tasks are used for evaluation ranging from music instrument
classification to language classification. Our results indicate that AT
pre-trained models are an excellent transfer learning choice for music, event,
and emotion recognition tasks. Further, finetuning AT models can also benefit
speech-related tasks such as keyword spotting and intent classification. Comment: Odyssey 2022
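The frozen-embedding transfer setup described above can be sketched roughly as
follows (hedged illustration: the torchvision MobileNetV2 merely stands in for
an AT-pretrained checkpoint, and the shapes and class count are placeholders):

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

# Stand-in backbone; a real setup would load AT-pretrained weights instead.
backbone = mobilenet_v2(weights=None).features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                 # frozen feature extractor

probe = nn.Linear(1280, 10)                 # downstream head, e.g. 10 instrument classes

# AT models usually take single-channel log-mel spectrograms; 3 channels are
# used here only to satisfy the torchvision stand-in.
spec = torch.randn(8, 3, 96, 64)
with torch.no_grad():
    feats = backbone(spec).mean(dim=(2, 3)) # global-average-pooled clip embeddings
logits = probe(feats)                       # only the probe is trained downstream
```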
UniKW-AT: Unified Keyword Spotting and Audio Tagging
Within the audio research community and the industry, keyword spotting (KWS)
and audio tagging (AT) are seen as two distinct tasks and research fields.
However, from a technical point of view, both of these tasks are identical:
they predict a label (keyword in KWS, sound event in AT) for some fixed-sized
input audio segment. This work proposes UniKW-AT: An initial approach for
jointly training both KWS and AT. UniKW-AT enhances the noise-robustness for
KWS, while also being able to predict specific sound events and enabling
conditional wake-ups on sound events. Our approach extends the AT pipeline with
additional labels describing the presence of a keyword. Experiments are
conducted on the Google Speech Commands V1 (GSCV1) and the balanced Audioset
(AS) datasets. The proposed MobileNetV2 model achieves an accuracy of 97.53% on
the GSCV1 dataset and an mAP of 33.4 on the AS evaluation set. Further, we show
that significant noise-robustness gains can be observed on a real-world KWS
dataset, greatly outperforming standard KWS approaches. Our study shows that
KWS and AT can be merged into a single framework without significant
performance degradation. Comment: Accepted at Interspeech 2022
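A minimal sketch of the joint label space implied by this description (the
keyword list and helper function are illustrative assumptions, not the paper's
code): Audioset's 527 sound-event labels are extended with one extra slot per
keyword, so a single multi-hot target covers both AT and KWS.

```python
import torch

NUM_AS_EVENTS = 527
KEYWORDS = ["yes", "no", "up", "down"]       # illustrative subset of GSCV1 keywords
NUM_CLASSES = NUM_AS_EVENTS + len(KEYWORDS)

def make_target(event_ids, keyword=None):
    """Build one joint AT+KWS multi-hot target vector."""
    target = torch.zeros(NUM_CLASSES)
    target[torch.as_tensor(event_ids, dtype=torch.long)] = 1.0  # active sound events
    if keyword is not None:
        target[NUM_AS_EVENTS + KEYWORDS.index(keyword)] = 1.0   # active keyword slot
    return target

# e.g. a clip tagged with Audioset event id 0 that also contains the keyword "yes"
y = make_target([0], keyword="yes")
```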
Hierarchical Large Language Models in Cloud Edge End Architecture for Heterogeneous Robot Cluster Control
Despite their powerful semantic understanding and code generation
capabilities, Large Language Models (LLMs) still face challenges when dealing
with complex tasks. Multi-agent strategy generation and motion control are
highly complex domains that inherently require experts from multiple fields to
collaborate. To enhance multi-agent strategy generation and motion control, we
propose an innovative architecture based on a cloud-edge-end hierarchical
structure. By leveraging multiple large language models with distinct areas of
expertise, we can efficiently generate strategies and perform task
decomposition. By introducing a cosine similarity approach that aligns task
decomposition instructions with robot task sequences at the vector level, we
can identify subtasks with incomplete task decomposition and iterate on them
multiple times until executable machine task sequences are generated. The
robot is guided through these task sequences to complete tasks of higher
complexity. With this architecture, we implement natural language control of
robots for complex tasks, and successfully address both the challenge of
multi-agent execution of open tasks in open scenarios and the problem of task
decomposition.
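The cosine-similarity alignment step can be illustrated with a small sketch
(the embedding vectors and the 0.8 threshold are placeholder assumptions; the
paper's pipeline operates on LLM-generated instructions and robot task
sequences):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def incomplete_instructions(instr_vecs, subtask_vecs, threshold=0.8):
    """Return indices of decomposition instructions not matched by any subtask."""
    flagged = []
    for i, v in enumerate(instr_vecs):
        best = max(cosine(v, s) for s in subtask_vecs)
        if best < threshold:
            flagged.append(i)                # decompose this instruction again
    return flagged

# Toy usage with random vectors standing in for sentence embeddings.
rng = np.random.default_rng(0)
instructions = [rng.normal(size=384) for _ in range(5)]
subtasks = [rng.normal(size=384) for _ in range(8)]
print(incomplete_instructions(instructions, subtasks))
```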
CED: Consistent ensemble distillation for audio tagging
Augmentation and knowledge distillation (KD) are well-established techniques
employed in audio classification tasks, aimed at enhancing performance and
reducing model sizes on the widely recognized Audioset (AS) benchmark. Although
both techniques are effective individually, their combined use, called
consistent teaching, hasn't been explored before. This paper proposes CED, a
simple training framework that distils student models from large teacher
ensembles with consistent teaching. To achieve this, CED efficiently stores
logits as well as the augmentation methods on disk, making it scalable to
large-scale datasets. Central to CED's efficacy is its label-free nature: only
the stored logits are used to optimize a student model, requiring just 0.3%
additional disk space for AS. The study trains
various transformer-based models, including a 10M parameter model achieving a
49.0 mean average precision (mAP) on AS. Pretrained models and code are
available at https://github.com/RicherMans/CED
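A hedged sketch of the consistent-teaching distillation step described above
(the stored record, the replayed augmentation, and the tiny student are
placeholders, not the released CED code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 527))  # tiny stand-in student

# One stored training record: augmentation parameters plus teacher-ensemble logits.
stored = {
    "spec": torch.randn(1, 64, 100),        # log-mel spectrogram of one clip
    "time_shift": 7,                        # augmentation replayed identically for the student
    "teacher_logits": torch.randn(1, 527),  # averaged ensemble logits loaded from disk
}

# Re-apply the exact stored augmentation, then optimize against teacher logits only.
spec = torch.roll(stored["spec"], shifts=stored["time_shift"], dims=-1)
loss = F.binary_cross_entropy_with_logits(
    student(spec), torch.sigmoid(stored["teacher_logits"])      # label-free objective
)
loss.backward()
```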