An IDL-based analysis package for COBE and other skycube-formatted astronomical data
UIMAGE is a data analysis package written in IDL for the Cosmic Background Explorer (COBE) project. COBE has extraordinarily stringent accuracy requirements: 1 percent mid-infrared absolute photometry, 0.01 percent submillimeter absolute spectrometry, and 0.0001 percent submillimeter relative photometry. Thus, many of the transformations and image enhancements common to the analysis of large data sets must be done with special care. UIMAGE is unusual in that it performs as many of its operations as possible on the data in its native format and projection, which in the case of COBE is the quadrilateralized spherical cube ('skycube'). That is, after reprojecting the data, e.g., onto an Aitoff map, the user who performs an operation such as taking a crosscut or extracting data from a pixel is transparently acting upon the skycube data from which the projection was made, thereby preserving the accuracy of the result. Current plans call for formatting external databases such as CO maps into the skycube format with a high-accuracy transformation, thereby allowing Guest Investigators to use UIMAGE for direct comparison of the COBE maps with those at other wavelengths from other instruments. UIMAGE is completely menu-driven, so its use requires no knowledge of IDL. Its functionality includes I/O from the COBE archives, FITS files, and IDL save sets, as well as standard analysis operations such as smoothing, reprojection, zooming, statistics of areas, spectral analysis, etc. One of UIMAGE's more advanced and attractive features is its terminal independence. Most of the operations (e.g., menu-item selection or pixel selection) that are driven by the mouse on an X-windows terminal are also available using arrow keys and keyboard entry (e.g., pixel coordinates) on VT200 and Tektronix-class terminals. Even limited grey scales of images are available this way. Obviously, image processing is very limited on this type of terminal, but it is nonetheless surprising how much analysis can be done on that medium. Such flexibility has the virtue of expanding the user community to those who must work remotely on non-image terminals, e.g., via modem.
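The native-format principle the abstract describes can be sketched in a few lines. This is an illustrative Python toy, not the UIMAGE code (which is IDL): the map data stay in their native pixelization, and a projection is just a lookup back into those pixels, so reads through the projection lose no accuracy to resampling. The class and function names are invented for illustration.

```python
# Hedged sketch: operations act on the native-format pixels; a display
# projection is only a lookup table back into them (names are hypothetical).

class SkyMap:
    """Holds data in its native pixelization (e.g. a skycube)."""
    def __init__(self, pixels):
        self.pixels = dict(pixels)          # native pixel id -> value

class Projection:
    """A display projection: maps display coordinates to native pixel ids."""
    def __init__(self, skymap, coord_to_pixel):
        self.skymap = skymap
        self.coord_to_pixel = coord_to_pixel

    def value_at(self, coord):
        # Transparently read the *native* data, so no accuracy is lost
        # to the projection's resampling.
        return self.skymap.pixels[self.coord_to_pixel(coord)]

    def crosscut(self, coords):
        return [self.value_at(c) for c in coords]

# Toy usage: an "Aitoff-like" view whose coords wrap onto 3 native pixels.
sky = SkyMap({0: 2.7, 1: 2.9, 2: 3.1})
proj = Projection(sky, coord_to_pixel=lambda c: c % 3)
print(proj.crosscut([0, 1, 2, 3]))   # [2.7, 2.9, 3.1, 2.7]
```

The design point is that a crosscut taken "on the Aitoff map" never touches resampled display values, mirroring the accuracy-preserving behaviour the abstract claims.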
Language Model Combination and Adaptation Using Weighted Finite State Transducers
In speech recognition systems, language models (LMs) are often constructed by training and combining multiple n-gram models. They can be used either to represent different genres or tasks found in diverse text sources, or to capture the stochastic properties of different linguistic symbol sequences, for example syllables and words. Unsupervised LM adaptation may also be used to further improve robustness to varying styles or tasks. When using these techniques, extensive software changes are often required. In this paper an alternative and more general approach based on weighted finite state transducers (WFSTs) is investigated for LM combination and adaptation. As it is entirely based on well-defined WFST operations, minimal changes to decoding tools are needed. A wide range of LM combination configurations can be flexibly supported. An efficient on-the-fly WFST decoding algorithm is also proposed. Significant error rate gains of 7.3% relative were obtained on a state-of-the-art broadcast audio recognition task using a history-dependently adapted multi-level LM modelling both syllable and word sequences.
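The combination the paper realises with WFST operations can be illustrated, in probability space, by linear interpolation of component models. This is a hedged toy, not the paper's WFST machinery: the two unigram tables and the weights below are invented, and a weighted union of component acceptors in the log semiring would effect the same mixture.

```python
# Toy illustration of LM combination (invented data, not the paper's setup):
# interpolate component models with mixture weights, as a weighted WFST
# union of the component LM acceptors would effect.

lm_news  = {"the": 0.5, "market": 0.3, "cat": 0.2}
lm_tales = {"the": 0.4, "market": 0.1, "cat": 0.5}

def interpolate(lms, weights):
    """Return the weighted mixture of several unigram distributions."""
    vocab = set().union(*lms)
    return {w: sum(lam * lm.get(w, 0.0) for lam, lm in zip(weights, lms))
            for w in vocab}

mixed = interpolate([lm_news, lm_tales], [0.7, 0.3])
assert abs(sum(mixed.values()) - 1.0) < 1e-9   # still a distribution
print(round(mixed["cat"], 2))                  # 0.29
```

Adaptation then amounts to re-estimating the weights per show or genre; in the WFST view this only changes arc weights, which is why no decoder changes are needed.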
Adapting an ASR Foundation Model for Spoken Language Assessment
A crucial part of an accurate and reliable spoken language assessment system
is the underlying ASR model. Recently, large-scale pre-trained ASR foundation
models such as Whisper have been made available. As the output of these models
is designed to be human readable, punctuation is added, numbers are presented
in Arabic numeric form and abbreviations are included. Additionally, these
models have a tendency to skip disfluencies and hesitations in the output.
Though useful for readability, these attributes are not helpful for assessing
the ability of a candidate and providing feedback. Here a precise transcription
of what a candidate said is needed. In this paper, we give a detailed analysis
of Whisper outputs and propose two solutions: fine-tuning and soft prompt
tuning. Experiments are conducted on both public speech corpora and an English
learner dataset. Results show that we can effectively alter the decoding
behaviour of Whisper to generate the exact words spoken in the response. Comment: Proceedings of SLaT
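The readable-vs-verbatim mismatch the abstract motivates can be made concrete with a small sketch. This is a hedged illustration with invented example strings, not the paper's method: a crude text normalizer can strip punctuation, but it cannot restore hesitations the model dropped or undo digit formatting, which is why the authors alter the decoding behaviour itself.

```python
import re

# Hedged sketch: Whisper-style readable output vs. a verbatim reference
# (both strings invented for illustration).
whisper_style = "I have 2 cats."
verbatim      = "i have um two cats"

def normalize(text):
    """Lowercase and strip punctuation; a typical scoring normalizer."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

print(normalize(whisper_style))  # ['i', 'have', '2', 'cats']
print(normalize(verbatim))       # ['i', 'have', 'um', 'two', 'cats']
# The dropped 'um' and the '2' vs 'two' mismatch survive normalization,
# so post-hoc cleanup alone cannot align the two forms.
```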
Adapting an Unadaptable ASR System
As speech recognition model sizes and training data requirements grow, it is
increasingly common for systems to only be available via APIs from online
service providers rather than having direct access to models themselves. In
this scenario it is challenging to adapt systems to a specific target domain.
To address this problem we consider the recently released OpenAI Whisper ASR as
an example of a large-scale ASR system to assess adaptation methods. An error
correction based approach is adopted, as this does not require access to the
model, but can be trained from either 1-best or N-best outputs that are
normally available via the ASR API. LibriSpeech is used as the primary target
domain for adaptation. The system's generalization ability is then evaluated in two distinct dimensions: first, whether the form of the correction model is portable to other speech recognition domains; and second, whether it can be used with ASR models having a different architecture. Comment: submitted to INTERSPEEC
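The black-box setting the abstract describes, where only API text outputs are available, can be sketched with a deliberately simple corrector. This is a hedged toy, not the paper's model (which would be a trained sequence-to-sequence corrector): it learns word substitutions from paired (hypothesis, reference) examples and applies them to new hypotheses, never touching the ASR model itself. All example sentences are invented.

```python
import difflib

# Hedged sketch: learn word-level substitutions from paired API outputs
# and references, with no access to the underlying ASR model.

def learn_substitutions(pairs):
    subs = {}
    for hyp, ref in pairs:
        h, r = hyp.split(), ref.split()
        sm = difflib.SequenceMatcher(a=h, b=r)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            # Keep only one-for-one word replacements for this toy.
            if tag == "replace" and i2 - i1 == j2 - j1:
                for hw, rw in zip(h[i1:i2], r[j1:j2]):
                    subs[hw] = rw
    return subs

def correct(hyp, subs):
    return " ".join(subs.get(w, w) for w in hyp.split())

subs = learn_substitutions([
    ("the whether is nice", "the weather is nice"),
    ("i red the book", "i read the book"),
])
print(correct("nice whether today", subs))  # nice weather today
```

A real corrector would condition on context (and, as the paper notes, on N-best alternatives) rather than applying context-free substitutions, but the training signal is the same: paired API output and reference text.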
N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space
Error correction models form an important part of Automatic Speech
Recognition (ASR) post-processing to improve the readability and quality of
transcriptions. Most prior works use the 1-best ASR hypothesis as input and
therefore can only perform correction by leveraging the context within one
sentence. In this work, we propose a novel N-best T5 model for this task, which
is fine-tuned from a T5 model and utilizes ASR N-best lists as model input. By
transferring knowledge from the pre-trained language model and obtaining richer
information from the ASR decoding space, the proposed approach outperforms a
strong Conformer-Transducer baseline. Another issue with standard error
correction is that the generation process is not well guided. To address this, a constrained decoding process, based on either the N-best list or an ASR lattice, is used, which allows additional information to be propagated. Comment: submitted to INTERSPEEC
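One way to see why an N-best list carries richer information than the 1-best hypothesis is a word-level vote across hypotheses, which also constrains the output to words the recognizer actually proposed. This is a hedged toy with invented hypotheses, far simpler than the paper's T5 model and its lattice-constrained decoding, and it only handles equal-length hypotheses.

```python
from collections import Counter

# Hedged sketch: majority vote per word position over an N-best list,
# restricted (for this toy) to hypotheses of equal length.

def nbest_vote(nbest):
    words = [h.split() for h in nbest]
    assert len({len(w) for w in words}) == 1, "toy version: equal lengths only"
    return " ".join(Counter(col).most_common(1)[0][0]
                    for col in zip(*words))

nbest = ["the cat sat on the mat",
         "the cat sad on the mat",
         "a cat sat on the mat"]
print(nbest_vote(nbest))   # the cat sat on the mat
```

Errors that appear in only one hypothesis ("sad", "a") are outvoted, illustrating how the decoding-space context beyond the 1-best sentence helps correction.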
Can small additions of an organic polymer or surfactant to sodium borohydride show the way to high hydrogen storage systems for portable applications?
Blind Normalization of Speech From Different Channels
We show how to construct a channel-independent representation of speech that
has propagated through a noisy reverberant channel. This is done by blindly
rescaling the cepstral time series by a non-linear function, with the form of
this scale function being determined by previously encountered cepstra from
that channel. The rescaled form of the time series is an invariant property of
it in the following sense: it is unaffected if the time series is transformed
by any time-independent invertible distortion. Because a linear channel with
stationary noise and impulse response transforms cepstra in this way, the new
technique can be used to remove the channel dependence of a cepstral time
series. In experiments, the method achieved greater channel-independence than
cepstral mean normalization, and it was comparable to the combination of
cepstral mean normalization and spectral subtraction, despite the fact that no
measurements of channel noise or reverberations were required (unlike spectral
subtraction). Comment: 25 pages, 7 figures
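The invariance property the abstract claims can be demonstrated with a rank-based rescaling, one simple instance of a nonlinear scale function built from previously seen values. This is a hedged sketch with invented numbers, not the paper's exact construction: ranks are unchanged by any monotone (hence, for scalars, continuous invertible) distortion, so the rescaled series is channel-independent in the stated sense.

```python
# Hedged sketch: rescale each value by its empirical rank among the
# cepstra seen so far. Ranks survive any monotone invertible distortion,
# so the rescaled series is an invariant of the original one.

def rank_normalize(series):
    order = sorted(series)
    n = len(series)
    return [order.index(x) / (n - 1) for x in series]

cepstra   = [0.1, 0.5, 0.3, 0.9, 0.7]            # invented toy values
distorted = [2 * x ** 3 + 1 for x in cepstra]     # monotone invertible map

print(rank_normalize(cepstra))                        # [0.0, 0.5, 0.25, 1.0, 0.75]
print(rank_normalize(distorted) == rank_normalize(cepstra))  # True
```

The paper's method applies such a per-coefficient nonlinear rescaling to the cepstral time series, which is why a linear channel with stationary noise, acting as an invertible distortion of the cepstra, drops out of the representation.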