Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Pre-training speech models on large volumes of data has achieved remarkable
success. OpenAI Whisper is a multilingual multitask model trained on 680k hours
of supervised speech data. It generalizes well to various speech recognition
and translation benchmarks even in a zero-shot setup. However, the full
pipeline for developing such models (from data collection to training) is not
publicly accessible, which makes it difficult for researchers to further
improve its performance and address training-related issues such as efficiency,
robustness, fairness, and bias. This work presents an Open Whisper-style Speech
Model (OWSM), which reproduces Whisper-style training using an open-source
toolkit and publicly available data. OWSM even supports more translation
directions and can be more efficient to train. We will publicly release all
scripts used for data preparation, training, inference, and scoring as well as
pre-trained models and training logs to promote open science.
Comment: Accepted at ASRU 2023
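The "multitask" part of Whisper-style training comes from encoding the task itself in the target sequence: control tokens for the source language, the task (transcription vs. translation), and, for translation, the target language are prepended to the text, so one encoder-decoder handles every direction. The sketch below is a minimal illustration of that target format; the token names and the build_target_tokens helper are assumptions for exposition, not the OWSM/ESPnet vocabulary or code.

```python
def build_target_tokens(text_tokens, src_lang, task, tgt_lang=None):
    """Prepend Whisper-style control tokens to the target token list.

    Illustrative sketch only: token names are hypothetical, not the
    toolkit's actual special tokens.
    """
    prefix = ["<sos>", f"<{src_lang}>", f"<{task}>"]
    if task == "translate":
        # An extra target-language token is what lets a single decoder
        # cover many translation directions.
        prefix.append(f"<{tgt_lang}>")
    return prefix + list(text_tokens) + ["<eos>"]

# English ASR and English-to-German speech translation share one format:
asr_target = build_target_tokens(["hello", "world"], "en", "transcribe")
st_target = build_target_tokens(["hallo", "welt"], "en", "translate", tgt_lang="de")
print(asr_target)  # ['<sos>', '<en>', '<transcribe>', 'hello', 'world', '<eos>']
print(st_target)   # ['<sos>', '<en>', '<translate>', '<de>', 'hallo', 'welt', '<eos>']
```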
BPKD: Boundary Privileged Knowledge Distillation For Semantic Segmentation
Current knowledge distillation approaches in semantic segmentation tend to
treat all spatial locations equally. However, for dense prediction, students'
predictions in edge regions are highly uncertain due to contextual information
leakage, and therefore require knowledge with higher spatial sensitivity than
the body regions. To address this challenge, this
paper proposes a novel approach called boundary-privileged knowledge
distillation (BPKD). BPKD distills the knowledge of the teacher model's body
and edges separately to the compact student model. Specifically, we employ two
distinct loss functions: (i) edge loss, which aims to distinguish between
ambiguous classes at the pixel level in edge regions; (ii) body loss, which
utilizes shape constraints and selectively attends to the inner-semantic
regions. Our experiments demonstrate that the proposed BPKD method provides
extensive refinements and aggregation for edge and body regions. Additionally,
the method achieves state-of-the-art distillation performance for semantic
segmentation on three popular benchmark datasets, highlighting its
effectiveness and generalization ability. BPKD shows consistent improvements
across a diverse array of lightweight segmentation structures, including both
CNNs and transformers, underscoring its architecture-agnostic adaptability. The
code is available at https://github.com/AkideLiu/BPKD.
Comment: 17 pages, 9 figures, 9 tables
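The core mechanism described above, distilling edge and body regions with separate losses, can be sketched compactly in PyTorch. The following is a minimal illustration under stated assumptions (a neighbourhood-based boundary mask, per-pixel KL distillation, and free loss weights and temperature); it is not the authors' released implementation, which is linked above.

```python
import torch
import torch.nn.functional as F

def boundary_mask(labels, num_classes, kernel=3):
    """Mark pixels whose local neighbourhood contains more than one class.

    Assumes labels are int64 with values in [0, num_classes).
    """
    onehot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()
    pad = kernel // 2
    dilated = F.max_pool2d(onehot, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-onehot, kernel, stride=1, padding=pad)
    return ((dilated - eroded).sum(dim=1) > 0).float()  # (B, H, W), 1 = edge

def bpkd_style_loss(student_logits, teacher_logits, labels, num_classes, T=4.0):
    """Distill edge and body regions with separate per-pixel KL terms."""
    edge = boundary_mask(labels, num_classes)          # (B, H, W)
    body = 1.0 - edge
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)  # per-pixel KL
    edge_loss = (kl * edge).sum() / edge.sum().clamp(min=1.0)
    body_loss = (kl * body).sum() / body.sum().clamp(min=1.0)
    return edge_loss, body_loss

# Toy usage with random tensors (19 classes, as in Cityscapes-style setups):
s = torch.randn(2, 19, 64, 64)          # student logits
t = torch.randn(2, 19, 64, 64)          # teacher logits
y = torch.randint(0, 19, (2, 64, 64))   # ground-truth labels
e_loss, b_loss = bpkd_style_loss(s, t, y, num_classes=19)
loss = 2.0 * e_loss + 1.0 * b_loss      # relative weights are a free choice here
```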
Multi-Task Dynamical Systems
Time series datasets are often composed of a variety of sequences from the
same domain, but from different entities, such as individuals, products, or
organizations. We are interested in how time series models can be specialized
to individual sequences (capturing the specific characteristics) while still
retaining statistical power by sharing commonalities across the sequences. This
paper describes the multi-task dynamical system (MTDS), a general methodology
for extending multi-task learning (MTL) to time series models. Our approach
endows dynamical systems with a set of hierarchical latent variables which can
modulate all model parameters. To our knowledge, this is a novel development of
MTL, and applies to time series both with and without control inputs. We apply
the MTDS to motion-capture data of people walking in various styles using a
multi-task recurrent neural network (RNN), and to patient drug-response data
using a multi-task pharmacodynamic model.
Comment: 52 pages, 17 figures
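The central idea, a per-sequence latent variable that modulates all parameters of a shared dynamical system, can be sketched with a small example. Below, a hypernetwork-style linear map generates the parameters of a simple linear state-space model from the latent z; the dimensions, the z-to-parameters map, and the rollout are illustrative assumptions rather than the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class MTLinearDS(nn.Module):
    """Toy multi-task dynamical system: z generates all system parameters."""

    def __init__(self, state_dim, obs_dim, latent_dim):
        super().__init__()
        self.state_dim, self.obs_dim = state_dim, obs_dim
        n_params = state_dim * state_dim + obs_dim * state_dim
        # Shared map from the per-sequence latent z to the system parameters.
        self.hyper = nn.Linear(latent_dim, n_params)

    def forward(self, z, x0, T):
        """Roll out T steps of x_{t+1} = A(z) x_t, y_t = C(z) x_t."""
        params = self.hyper(z)
        A = params[: self.state_dim ** 2].view(self.state_dim, self.state_dim)
        C = params[self.state_dim ** 2:].view(self.obs_dim, self.state_dim)
        x, ys = x0, []
        for _ in range(T):
            x = A @ x          # state transition modulated by z
            ys.append(C @ x)   # observation model modulated by z
        return torch.stack(ys)

model = MTLinearDS(state_dim=4, obs_dim=2, latent_dim=3)
z = torch.randn(3)                       # per-entity latent (e.g. one walker or patient)
y = model(z, x0=torch.zeros(4), T=10)    # (10, 2) simulated observations
```

Sharing the z-to-parameters map across sequences is what preserves statistical power, while each entity's z specializes the dynamics to that sequence.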