14 research outputs found
Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study
Medical coding is the task of assigning medical codes to clinical free-text
documentation. Healthcare professionals manually assign such codes to track
patient diagnoses and treatments. Automated medical coding can considerably
alleviate this administrative burden. In this paper, we reproduce, compare, and
analyze state-of-the-art automated medical coding machine learning models. We
show that several models underperform due to weak configurations, poorly
sampled train-test splits, and insufficient evaluation. In previous work, the
macro F1 score has been calculated sub-optimally, and our correction doubles
it. We contribute a revised model comparison using stratified sampling and
identical experimental setups, including hyperparameters and decision boundary
tuning. We analyze prediction errors to validate and falsify assumptions of
previous works. The analysis confirms that all models struggle with rare codes,
while long documents only have a negligible impact. Finally, we present the
first comprehensive results on the newly released MIMIC-IV dataset using the
reproduced models. We release our code, model parameters, and new MIMIC-III and
MIMIC-IV training and evaluation pipelines to accommodate fair future
comparisons.
Comment: 11 pages, 6 figures, to be published in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), July 23-27, 2023, Taipei, Taiwan
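A minimal sketch of the macro-F1 point above, assuming the sub-optimal variant averages per-code F1 over the full code vocabulary (so codes that never occur in the test split contribute zeros) while the corrected variant averages only over codes present in the test data; the codes and predictions below are illustrative, not from MIMIC.

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative multi-label predictions over a vocabulary of 6 codes,
# only 3 of which actually occur in this (hypothetical) test split.
y_true = np.array([[1, 0, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0, 0],
                   [1, 1, 0, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1, 0, 0],
                   [0, 1, 0, 1, 0, 0],
                   [1, 0, 0, 1, 0, 0]])

# Sub-optimal: average F1 over all 6 codes; absent codes contribute 0.
macro_all = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Corrected: average only over codes that appear in the test split.
present = y_true.sum(axis=0) > 0
macro_present = f1_score(y_true[:, present], y_pred[:, present],
                         average="macro", zero_division=0)

print(f"macro F1 over all codes:     {macro_all:.3f}")
print(f"macro F1 over present codes: {macro_present:.3f}")
```

On this toy example the corrected average is exactly double the naive one, mirroring the effect described in the abstract.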
Tsunami and the Construction of the Disabled Southern Body
We investigate (i) whether human annotators can infer ratings from IMDb movie reviews, (ii) how human performance compares to a regression model, and (iii) whether model performance is affected by the rating source (i.e. author vs. annotator ratings). We collect a data set of IMDb movie reviews with author-provided ratings, and have it re-annotated by crowdsourced and expert annotators. Annotators reproduce the original ratings better than a linear regression model, but are off by a large margin in more than 5% of the cases. Models trained on annotator-labeled data outperform those trained on author-labeled data, questioning the usefulness of author-rated reviews as labeled data for sentiment analysis.
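As a rough illustration of the regression baseline mentioned above, here is a minimal bag-of-words ridge regression predicting a numeric rating from review text; the features, hyperparameters, and toy data are assumptions for illustration, not the authors' exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy review/rating pairs standing in for the IMDb data (ratings on a 1-10 scale).
train_reviews = ["A stunning, heartfelt film.",
                 "Dull plot and wooden acting.",
                 "Enjoyable, though the ending drags."]
train_ratings = [9, 3, 7]

# TF-IDF features fed into a ridge (regularized linear) regression.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(train_reviews, train_ratings)

print(model.predict(["Heartfelt acting but a dull ending."]))
```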
Benchmarking Generative Latent Variable Models for Speech
Stochastic latent variable models (LVMs) achieve state-of-the-art performance
on natural image generation but are still inferior to deterministic models on
speech. In this paper, we develop a speech benchmark of popular temporal LVMs
and compare them against state-of-the-art deterministic models. We report the likelihood, a metric widely used in the image domain but rarely, or not comparably, reported for speech models. To assess the quality of the learned
representations, we also compare their usefulness for phoneme recognition.
Finally, we adapt the Clockwork VAE, a state-of-the-art temporal LVM for video
generation, to the speech domain. Despite being autoregressive only in latent
space, we find that the Clockwork VAE can outperform previous LVMs and reduce
the gap to deterministic models by using a hierarchy of latent variables.
Comment: Accepted at the 2022 ICLR workshop on Deep Generative Models for Highly Structured Data (https://deep-gen-struct.github.io)
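The reported likelihood for such temporal LVMs is typically a variational lower bound; as general background (not necessarily the exact objective used in the paper), the standard ELBO for a sequence x_{1:T} with latents z_{1:T} is:

```latex
\log p_\theta(x_{1:T}) \;\geq\;
\mathbb{E}_{q_\phi(z_{1:T}\mid x_{1:T})}\!\left[\log p_\theta(x_{1:T}\mid z_{1:T})\right]
\;-\; D_{\mathrm{KL}}\!\left(q_\phi(z_{1:T}\mid x_{1:T})\,\middle\|\,p_\theta(z_{1:T})\right)
```

To make models comparable, this bound is often normalized per frame or per dimension.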
A Brief Overview of Unsupervised Neural Speech Representation Learning
Unsupervised representation learning for speech processing has matured
greatly in the last few years. Work in computer vision and natural language
processing has paved the way, but speech data offers unique challenges. As a
result, methods from other domains rarely translate directly. We review the
development of unsupervised representation learning for speech over the last
decade. We identify two primary model categories: self-supervised methods and
probabilistic latent variable models. We describe the models and develop a
comprehensive taxonomy. Finally, we discuss and compare models from the two
categories.
Comment: The 2nd Workshop on Self-supervised Learning for Audio and Speech Processing (SAS) at AAAI
On scaling contrastive representations for low-resource speech recognition
Recent advances in self-supervised learning through contrastive training have
shown that it is possible to learn a competitive speech recognition system with
as little as 10 minutes of labeled data. However, these systems are
computationally expensive since they require pre-training followed by
fine-tuning in a large parameter space. We explore the performance of such
systems without fine-tuning by training a state-of-the-art speech recognizer on
the fixed representations from the computationally demanding wav2vec 2.0
framework. We find performance to decrease without fine-tuning and, in the
extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In
addition, we find that wav2vec 2.0 representations live in a low-dimensional
subspace and that decorrelating the features of the representations can
stabilize training of the automatic speech recognizer. Finally, we propose a
bidirectional extension to the original wav2vec framework that consistently
improves performance.
Comment: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
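The decorrelation idea above can be illustrated with PCA whitening applied to the frozen wav2vec 2.0 features before the speech recognizer consumes them; this is a sketch under that assumption, not the exact procedure from the paper, and the feature array is a random placeholder.

```python
import numpy as np

def whiten(features, eps=1e-5):
    """PCA-whiten frame-level features so their dimensions are decorrelated
    and have approximately unit variance."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    transform = eigvecs / np.sqrt(eigvals + eps)  # scale each eigenvector
    return centered @ transform

# Placeholder for frozen wav2vec 2.0 representations: (num_frames, feature_dim).
frozen_features = np.random.randn(1000, 768)
whitened = whiten(frozen_features)

# The whitened features have (near-)identity covariance.
print(np.allclose(np.cov(whitened, rowvar=False), np.eye(768), atol=1e-2))
```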
Do end-to-end speech recognition models care about context?
The two most common paradigms for end-to-end speech recognition are
connectionist temporal classification (CTC) and attention-based encoder-decoder
(AED) models. It has been argued that the latter is better suited for learning
an implicit language model. We test this hypothesis by measuring temporal
context sensitivity and evaluate how the models perform when we constrain the
amount of contextual information in the audio input. We find that the AED model
is indeed more context sensitive, but that the gap can be closed by adding
self-attention to the CTC model. Furthermore, the two models perform similarly
when contextual information is constrained. Finally, in contrast to previous
research, our results show that the CTC model is highly competitive on WSJ and
LibriSpeech without the help of an external language model.
Comment: Published in the proceedings of INTERSPEECH 2020, pp. 4352-435
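One way to picture the constrained-context experiment is to decode fixed-length chunks of audio independently, so that neither model can use information outside the chunk; the sketch below assumes that reading and uses a placeholder recognizer rather than an actual CTC or AED model.

```python
import numpy as np

def transcribe_with_limited_context(waveform, recognizer, sample_rate=16000,
                                    context_seconds=1.0):
    """Split the input into fixed-length chunks and decode each one
    independently, so the model never sees audio outside the chunk."""
    chunk = int(sample_rate * context_seconds)
    pieces = [waveform[i:i + chunk] for i in range(0, len(waveform), chunk)]
    return " ".join(recognizer(p) for p in pieces if len(p) > 0)

# Placeholder recognizer standing in for a CTC or AED model.
dummy_recognizer = lambda audio: f"<{len(audio)} samples>"
audio = np.zeros(16000 * 3 + 4000)  # 3.25 s of silence
print(transcribe_with_limited_context(audio, dummy_recognizer))
```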