Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Pre-training speech models on large volumes of data has achieved remarkable
success. OpenAI Whisper is a multilingual multitask model trained on 680k hours
of supervised speech data. It generalizes well to various speech recognition
and translation benchmarks even in a zero-shot setup. However, the full
pipeline for developing such models (from data collection to training) is not
publicly accessible, which makes it difficult for researchers to further
improve its performance and address training-related issues such as efficiency,
robustness, fairness, and bias. This work presents an Open Whisper-style Speech
Model (OWSM), which reproduces Whisper-style training using an open-source
toolkit and publicly available data. OWSM even supports more translation
directions and can be more efficient to train. We will publicly release all
scripts used for data preparation, training, inference, and scoring as well as
pre-trained models and training logs to promote open science.
Comment: Accepted at ASRU 2023
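The "multitask" part of Whisper-style training comes from encoding the task itself in the target sequence: control tokens for the source language, the task (transcription vs. translation), and, for translation, the target language are prepended to the text, so one encoder-decoder handles every direction. The sketch below is a minimal illustration of that target format; the token names and the build_target_tokens helper are assumptions for exposition, not the OWSM/ESPnet vocabulary or code.

```python
def build_target_tokens(text_tokens, src_lang, task, tgt_lang=None):
    """Prepend Whisper-style control tokens to the target token list.

    Illustrative sketch only: token names are hypothetical, not the
    toolkit's actual special tokens.
    """
    prefix = ["<sos>", f"<{src_lang}>", f"<{task}>"]
    if task == "translate":
        # An extra target-language token is what lets a single decoder
        # cover many translation directions.
        prefix.append(f"<{tgt_lang}>")
    return prefix + list(text_tokens) + ["<eos>"]

# English ASR and English-to-German speech translation share one format:
asr_target = build_target_tokens(["hello", "world"], "en", "transcribe")
st_target = build_target_tokens(["hallo", "welt"], "en", "translate", tgt_lang="de")
print(asr_target)  # ['<sos>', '<en>', '<transcribe>', 'hello', 'world', '<eos>']
print(st_target)   # ['<sos>', '<en>', '<translate>', '<de>', 'hallo', 'welt', '<eos>']
```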
BPKD: Boundary Privileged Knowledge Distillation For Semantic Segmentation
Current knowledge distillation approaches in semantic segmentation tend to
treat all spatial locations equally. However, for dense prediction, students'
predictions in edge regions are highly uncertain due to contextual information
leakage, and therefore require knowledge with higher spatial sensitivity than
the body regions. To address this challenge, this
paper proposes a novel approach called boundary-privileged knowledge
distillation (BPKD). BPKD distills the knowledge of the teacher model's body
and edges separately to the compact student model. Specifically, we employ two
distinct loss functions: (i) edge loss, which aims to distinguish between
ambiguous classes at the pixel level in edge regions; (ii) body loss, which
utilizes shape constraints and selectively attends to the inner-semantic
regions. Our experiments demonstrate that the proposed BPKD method provides
extensive refinements and aggregation for edge and body regions. Additionally,
the method achieves state-of-the-art distillation performance for semantic
segmentation on three popular benchmark datasets, highlighting its
effectiveness and generalization ability. BPKD shows consistent improvements
across a diverse array of lightweight segmentation structures, including both
CNNs and transformers, underscoring its architecture-agnostic adaptability. The
code is available at https://github.com/AkideLiu/BPKD.
Comment: 17 pages, 9 figures, 9 tables
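The core mechanism described above, distilling edge and body regions with separate losses, can be sketched compactly in PyTorch. The following is a minimal illustration under stated assumptions (a neighbourhood-based boundary mask, per-pixel KL distillation, and free loss weights and temperature); it is not the authors' released implementation, which is linked above.

```python
import torch
import torch.nn.functional as F

def boundary_mask(labels, num_classes, kernel=3):
    """Mark pixels whose local neighbourhood contains more than one class.

    Assumes labels are int64 with values in [0, num_classes).
    """
    onehot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()
    pad = kernel // 2
    dilated = F.max_pool2d(onehot, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-onehot, kernel, stride=1, padding=pad)
    return ((dilated - eroded).sum(dim=1) > 0).float()  # (B, H, W), 1 = edge

def bpkd_style_loss(student_logits, teacher_logits, labels, num_classes, T=4.0):
    """Distill edge and body regions with separate per-pixel KL terms."""
    edge = boundary_mask(labels, num_classes)          # (B, H, W)
    body = 1.0 - edge
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)  # per-pixel KL
    edge_loss = (kl * edge).sum() / edge.sum().clamp(min=1.0)
    body_loss = (kl * body).sum() / body.sum().clamp(min=1.0)
    return edge_loss, body_loss

# Toy usage with random tensors (19 classes, as in Cityscapes-style setups):
s = torch.randn(2, 19, 64, 64)          # student logits
t = torch.randn(2, 19, 64, 64)          # teacher logits
y = torch.randint(0, 19, (2, 64, 64))   # ground-truth labels
e_loss, b_loss = bpkd_style_loss(s, t, y, num_classes=19)
loss = 2.0 * e_loss + 1.0 * b_loss      # relative weights are a free choice here
```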
Multi-Task Dynamical Systems
Time series datasets are often composed of a variety of sequences from the
same domain, but from different entities, such as individuals, products, or
organizations. We are interested in how time series models can be specialized
to individual sequences (capturing the specific characteristics) while still
retaining statistical power by sharing commonalities across the sequences. This
paper describes the multi-task dynamical system (MTDS), a general methodology
for extending multi-task learning (MTL) to time series models. Our approach
endows dynamical systems with a set of hierarchical latent variables which can
modulate all model parameters. To our knowledge, this is a novel development of
MTL, and applies to time series both with and without control inputs. We apply
the MTDS to motion-capture data of people walking in various styles using a
multi-task recurrent neural network (RNN), and to patient drug-response data
using a multi-task pharmacodynamic model.
Comment: 52 pages, 17 figures
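The central idea, a per-sequence latent variable that modulates all parameters of a shared dynamical system, can be sketched with a small example. Below, a hypernetwork-style linear map generates the parameters of a simple linear state-space model from the latent z; the dimensions, the z-to-parameters map, and the rollout are illustrative assumptions rather than the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class MTLinearDS(nn.Module):
    """Toy multi-task dynamical system: z generates all system parameters."""

    def __init__(self, state_dim, obs_dim, latent_dim):
        super().__init__()
        self.state_dim, self.obs_dim = state_dim, obs_dim
        n_params = state_dim * state_dim + obs_dim * state_dim
        # Shared map from the per-sequence latent z to the system parameters.
        self.hyper = nn.Linear(latent_dim, n_params)

    def forward(self, z, x0, T):
        """Roll out T steps of x_{t+1} = A(z) x_t, y_t = C(z) x_t."""
        params = self.hyper(z)
        A = params[: self.state_dim ** 2].view(self.state_dim, self.state_dim)
        C = params[self.state_dim ** 2:].view(self.obs_dim, self.state_dim)
        x, ys = x0, []
        for _ in range(T):
            x = A @ x          # state transition modulated by z
            ys.append(C @ x)   # observation model modulated by z
        return torch.stack(ys)

model = MTLinearDS(state_dim=4, obs_dim=2, latent_dim=3)
z = torch.randn(3)                       # per-entity latent (e.g. one walker or patient)
y = model(z, x0=torch.zeros(4), T=10)    # (10, 2) simulated observations
```

Sharing the z-to-parameters map across sequences is what preserves statistical power, while each entity's z specializes the dynamics to that sequence.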