Experiments on the DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording
In this paper we present our work on Task 1 Acoustic Scene Classification
and Task 3 Sound Event Detection in Real Life Recordings. Among our experiments
we have low-level and high-level features, classifier optimization and other
heuristics specific to each task. Our performance for both tasks improved the
baseline from DCASE: for Task 1 we achieved an overall accuracy of 78.9%
compared to the baseline of 72.6% and for Task 3 we achieved a Segment-Based
Error Rate of 0.76 compared to the baseline of 0.91.
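The Segment-Based Error Rate quoted above is not defined in the abstract. The following is a minimal sketch, assuming the common segment-based definition used for sound event detection (substitutions, deletions, and insertions counted per fixed-length segment and divided by the number of active reference events). The function name, the set-per-segment input format, and the example labels are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Set


def segment_based_error_rate(reference: List[Set[str]],
                             estimated: List[Set[str]]) -> float:
    """Return ER = (S + D + I) / N over aligned fixed-length segments.

    Each list item is the set of event labels active in one segment
    (hypothetical input format chosen for this sketch).
    """
    if len(reference) != len(estimated):
        raise ValueError("reference and estimated must cover the same segments")

    subs = dels = ins = n_ref = 0
    for ref, est in zip(reference, estimated):
        false_neg = len(ref - est)   # reference events the system missed
        false_pos = len(est - ref)   # system events absent from the reference
        subs += min(false_neg, false_pos)       # a miss paired with a false alarm
        dels += max(0, false_neg - false_pos)   # misses left unpaired
        ins += max(0, false_pos - false_neg)    # false alarms left unpaired
        n_ref += len(ref)

    if n_ref == 0:
        raise ValueError("reference contains no active events")
    return (subs + dels + ins) / n_ref


# Made-up example with four segments: one deletion, one insertion,
# one substitution, and four active reference events in total.
reference = [{"car"}, {"car", "speech"}, set(), {"bird"}]
estimated = [{"car"}, {"speech"}, {"bird"}, {"car"}]
error_rate = segment_based_error_rate(reference, estimated)
```

Under this definition lower is better, and the value can exceed 1.0 when a system produces many insertions.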
Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages
In this paper, we describe the TTS models developed by NVIDIA for the
MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024
Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by
training additionally on 5 minutes of target speaker data. In Track 3, we
utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as
well as external datasets. We use HiFi-GAN vocoders for all submissions.
RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on
Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS)
of 3.62.
Comment: Presentation accepted at ICASSP 202
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Augmenting large language models (LLMs) to understand audio -- including
non-speech sounds and non-verbal speech -- is critically important for diverse
real-world applications of LLMs. In this paper, we propose Audio Flamingo, a
novel audio language model with 1) strong audio understanding abilities, 2) the
ability to quickly adapt to unseen tasks via in-context learning and retrieval,
and 3) strong multi-turn dialogue abilities. We introduce a series of training
techniques, architecture design, and data strategies to enhance our model with
these abilities. Extensive evaluations across various audio understanding tasks
confirm the efficacy of our method, setting new state-of-the-art benchmarks.
Our demo website is https://audioflamingo.github.io/ and the code is
open-sourced at https://github.com/NVIDIA/audio-flamingo.
Comment: ICML 202
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
Large Language Model (LLM) based text-to-speech (TTS) systems have
demonstrated remarkable capabilities in handling large speech datasets and
generating natural speech for new speakers. However, LLM-based TTS models are
not robust: the generated output can contain repeated words, missing words,
and misaligned speech (referred to as hallucinations or attention errors),
especially when the text contains multiple occurrences of the same token. We
examine these challenges in an encoder-decoder transformer model and find that
certain cross-attention heads in such models implicitly learn the text and
speech alignment when trained for predicting speech tokens for a given text. To
make the alignment more robust, we propose techniques utilizing CTC loss and
attention priors that encourage monotonic cross-attention over the text tokens.
Our guided attention training technique does not introduce any new learnable
parameters and significantly improves robustness of LLM-based TTS models.
Comment: Published as a conference paper at INTERSPEECH 202
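The abstract above does not spell out the form of the attention prior. As an illustration of the general idea only, the sketch below assumes a simple Gaussian band around the diagonal of the speech-by-text attention matrix and adds its logarithm to the cross-attention scores before the softmax. The function names, the `width` parameter, and the Gaussian shape are assumptions made for this example, not the paper's formulation, and the CTC loss term is not shown.

```python
import math
from typing import List

Matrix = List[List[float]]


def diagonal_prior(n_speech: int, n_text: int, width: float = 0.1) -> Matrix:
    """Row-normalized prior that favors near-diagonal (monotonic) alignments."""
    prior = []
    for i in range(n_speech):
        t = (i + 0.5) / n_speech              # relative position of speech step i
        row = [
            math.exp(-0.5 * (((j + 0.5) / n_text - t) / width) ** 2)
            for j in range(n_text)            # relative position of text token j
        ]
        total = sum(row)
        prior.append([p / total for p in row])
    return prior


def attend_with_prior(scores: Matrix, prior: Matrix, eps: float = 1e-8) -> Matrix:
    """Softmax over text tokens after adding the log-prior to raw scores."""
    weights = []
    for score_row, prior_row in zip(scores, prior):
        logits = [s + math.log(p + eps) for s, p in zip(score_row, prior_row)]
        peak = max(logits)                    # subtract the max for stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With uninformative (all-zero) scores the prior alone decides the alignment.
prior = diagonal_prior(n_speech=6, n_text=3)
uniform_scores = [[0.0] * 3 for _ in range(6)]
attention = attend_with_prior(uniform_scores, prior)
most_attended = [row.index(max(row)) for row in attention]
```

Because the prior only rescales existing attention scores, a scheme of this kind adds no learnable parameters, which matches the property the abstract claims for guided attention training.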
Experiments on the DCASE Challenge 2016: Acoustic scene classification and sound event detection in real life recording
In this paper we present our work on Task 1 Acoustic Scene Classification and Task 3 Sound Event Detection in Real Life Recordings. Among our experiments we have low-level and high-level features, classifier optimization and other heuristics specific to each task. Our performance for both tasks improved the baseline from DCASE: for Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6% and for Task 3 we achieved a Segment-Based Error Rate of 0.48 compared to the baseline of 0.91.
PERT era, race‐based healthcare disparities in a large urban safety net hospital
Pulmonary embolism (PE) is the third leading cause of cardiovascular death in the United States. Black Americans have higher incidence, greater clot severity, and worse outcomes than White Americans. This disparity is not fully understood, especially in the context of the advent of PE response teams (PERT), which aim to standardize PE‐related care. This retrospective single‐center cohort study compared 294 Black and 131 White patients from our institution's PERT database. Primary objectives included severity and in‐hospital management. Secondary outcomes included length of stay, 30‐day readmission, 30‐day mortality, and outpatient follow‐up. Clot (p = 0.42), acute treatment (p = 0.28), 30‐day mortality (p = 0.77), 30‐day readmission (p = 0.50), and outpatient follow‐up (p = 0.98) were similar between races. Black patients had a lower mean household income (63,396, SD 32,987) (p < 0.0001). More Black patients (78.8%) had exclusively government insurance (Medicare/Medicaid) compared to White patients (61.8%) (p = 0.006). Interestingly, government insurance patients had less follow‐up (58.3%) than private insurance patients (79.7%) (p = 0.001). Notably, patients with follow‐up had fewer 30‐day readmissions. Specifically, 12.2% of patients with follow‐up were readmitted compared to 22.2% of patients without follow‐up (p = 0.008). There were no significant differences in PE severity, in‐hospital treatment, mortality, or readmissions between Black and White patients. However, patients with government insurance had less follow‐up and more readmissions, indicating a socioeconomic disparity. Access barriers such as health literacy, treatment cost, and transportation may contribute to this inequity. Improving access to follow‐up care may reduce the disparity in PE outcomes.