TESSP: Text-Enhanced Self-Supervised Speech Pre-training
Self-supervised speech pre-training equips a model with the contextual
structure inherent in the speech signal, while self-supervised text pre-training
supplies linguistic information. Both are beneficial for
downstream speech tasks such as ASR. However, the distinct pre-training
objectives make it challenging to jointly optimize speech and text
representations in the same model. To solve this problem, we propose
Text-Enhanced Self-Supervised Speech Pre-training (TESSP), aiming to
incorporate the linguistic information into speech pre-training. Our model
consists of three parts, i.e., a speech encoder, a text encoder and a shared
encoder. The model takes unsupervised speech and text data as the input and
leverages the common HuBERT and MLM losses respectively. We also propose
phoneme up-sampling and representation swapping to enable joint modeling of the
speech and text information. Specifically, to address the length mismatch
between speech and text data, we phonemize the text sequence and
up-sample the phonemes with the alignment information extracted from a small
set of supervised data. Moreover, to close the gap between the learned speech
and text representations, we swap the text representation with the speech
representation extracted by the respective private encoders according to the
alignment information. Experiments on the LibriSpeech dataset show that the
proposed TESSP model achieves more than 10% improvement over WavLM on
the test-clean and test-other sets. We also evaluate our model on the SUPERB
benchmark, showing that our model outperforms WavLM on Phoneme Recognition,
Automatic Speech Recognition and Speech Translation.
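The alignment-based phoneme up-sampling described in this abstract can be sketched as follows; the function and variable names are illustrative, not from the paper's code, and the frame-level durations are assumed to come from the small supervised alignment set.

```python
# Sketch of phoneme up-sampling, assuming a duration alignment is given:
# durations[i] is the number of speech frames aligned to phoneme i.

def upsample_phonemes(phonemes, durations):
    """Repeat each phoneme label by its aligned frame count so the
    text sequence matches the length of the speech representation."""
    if len(phonemes) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames = []
    for ph, dur in zip(phonemes, durations):
        frames.extend([ph] * dur)
    return frames

# e.g. "cat" as phonemes, with a hypothetical frame-level alignment
print(upsample_phonemes(["K", "AE", "T"], [3, 5, 2]))
```

After up-sampling, the phoneme sequence and the speech frames have the same length, which is what makes the representation swapping between the two private encoders possible.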
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
How to boost speech pre-training with textual data is an unsolved problem,
because speech and text are very different modalities with distinct
characteristics. In this paper, we propose a cross-modal Speech and Language
Model (SpeechLM) to explicitly align speech and text pre-training with a
pre-defined unified discrete representation. Specifically, we introduce two
alternative discrete tokenizers to bridge the speech and text modalities,
including phoneme-unit and hidden-unit tokenizers, which can be trained using a
small amount of paired speech-text data. Based on the trained tokenizers, we
convert the unlabeled speech and text data into tokens of phoneme units or
hidden units. The pre-training objective is designed to unify the speech and
the text into the same discrete semantic space with a unified Transformer
network. Leveraging only 10K text sentences, our SpeechLM achieves a 16% relative
WER reduction over the best base model performance (from 6.8 to 5.7) on the
public LibriSpeech ASR benchmark. Moreover, SpeechLM with fewer parameters even
outperforms previous SOTA models on CoVoST-2 speech translation tasks. We also
evaluate our SpeechLM on various spoken language processing tasks under the
universal representation evaluation framework SUPERB, demonstrating significant
improvements on content-related tasks. Our code and models are available at
https://aka.ms/SpeechLM.
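The unified discrete space in this abstract amounts to mapping both modalities into one shared token vocabulary. A minimal sketch, assuming a phoneme-unit tokenizer has already been trained (the vocabulary and names below are illustrative placeholders, not the paper's learned tokenizers):

```python
# Both the phonemized text and the speech tokenizer's output are mapped
# into the same id space, so one Transformer can consume either modality.

SHARED_VOCAB = {"<pad>": 0, "K": 1, "AE": 2, "T": 3}

def tokens_to_ids(units, vocab=SHARED_VOCAB):
    """Map phoneme units (from either text phonemization or a
    speech hidden-unit tokenizer) into the shared id space."""
    return [vocab[u] for u in units]

# Text side and speech side end up in the same semantic space:
print(tokens_to_ids(["K", "AE", "T"]))  # -> [1, 2, 3]
```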
Comparison of curative effect between OBS assisted by 3D printing and PFNA in the treatment of AO/OTA type 31-A3 femoral intertrochanteric fractures in elderly patients
Objective: To compare and analyze the clinical efficacy of the Ortho-Bridge System (OBS) assisted by 3D printing and of proximal femoral nail anti-rotation (PFNA) in treating AO/OTA type 31-A3 femoral intertrochanteric fractures in elderly patients.
Methods: A retrospective analysis of 25 elderly patients diagnosed with AO/OTA type 31-A3 femoral intertrochanteric fracture was conducted from January 2020 to August 2022 at Yan’an Hospital, affiliated to Kunming Medical University. According to the surgical method, the patients were divided into an OBS group (10 patients) and a PFNA group (15 patients). In the OBS group, bone models were reconstructed and a guide plate was designed by computer before the operation; the guide-plate and bone-model data were imported into a stereolithography apparatus (SLA) 3D printer and printed in photosensitive resin to obtain physical objects, on which the operation was simulated before the guide plate was applied to assist the OBS procedure. The PFNA group was treated with proximal femoral nail anti-rotation. Operation time, intraoperative blood loss, Harris Hip Score (HHS), Oxford Hip Score (OHS), and complications were compared between the two groups.
Results: Operation time and intraoperative blood loss were lower in the PFNA group than in the OBS group, with a significant difference between the two groups (P < 0.05). The HHS at 6 months was statistically higher in the OBS group than in the PFNA group (P < 0.05); however, there was no significant difference in OHS at 6 months between the two groups (P > 0.05). Both HHS and OHS at 12 months were statistically better in the OBS group than in the PFNA group (P < 0.05).
Conclusion: Both OBS assisted by 3D printing and PFNA are effective measures for treating intertrochanteric fractures. Before deciding on internal fixation, the individual circumstances of each patient should be evaluated thoroughly.
Spatial locality-aware sparse coding and dictionary learning
Nonlinear encoding of SIFT features has recently shown great promise in image classification. This scheme reduces the training complexity of traditional bag-of-features approaches while achieving better performance, making it suitable for large-scale image classification applications. However, existing nonlinear encoding methods do not explicitly consider spatial relationships when encoding local features; merely deferring the spatial information to a later stage, e.g. through spatial pyramid matching, is largely inadequate. In this paper, we propose a joint sparse coding and dictionary learning scheme that takes spatial information into consideration during encoding. Our experiments on synthetic data and benchmark data demonstrate that the proposed scheme can learn a better dictionary and achieve higher classification accuracy.
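One way to read "taking spatial information into consideration in encoding" is as an extra penalty that pulls the codes of spatially neighboring features together. The objective below is a generic sketch of that idea, not the paper's exact formulation; the weights and names are assumptions.

```python
import numpy as np

def objective(X, D, A, neighbors, lam=0.1, gamma=0.05):
    """Reconstruction + L1 sparsity + spatial locality penalty.
    X: (d, n) local features, D: (d, k) dictionary, A: (k, n) sparse codes,
    neighbors: list of (i, j) index pairs of spatially close features."""
    recon = 0.5 * np.sum((X - D @ A) ** 2)          # fit the features
    sparsity = lam * np.sum(np.abs(A))               # keep codes sparse
    locality = gamma * sum(np.sum((A[:, i] - A[:, j]) ** 2)
                           for i, j in neighbors)    # smooth over neighbors
    return recon + sparsity + locality
```

Minimizing such an objective alternately over the codes A and the dictionary D is the usual pattern for joint sparse coding and dictionary learning; the locality term is what distinguishes the spatially aware variant from plain sparse coding.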
Hot deformation behavior and 3D processing map of super austenitic stainless steel containing 7Mo–0.46N–0.02Ce: Effect of the solidification direction orientation of columnar crystal to loading direction
In the present paper, hot compression tests on a super austenitic stainless steel (SASS) containing 7Mo–0.46N–0.02Ce were performed at temperatures of 900 °C–1200 °C and strain rates of 0.01–10 s−1. The effect of the angle between the solidification direction of the columnar crystals and the loading direction (0°, 30°, 60°, and 90°) on the flow behavior and microstructure evolution of the SASS was studied. Results showed that samples with different columnar crystal characteristics and test conditions displayed various flow behaviors owing to differences in the dynamic restoration softening mechanism, flow localization, and shear band formation. With decreasing angle between the solidification direction of the columnar crystals and the loading direction, the deformation activation energy (Q) and the ln Z value of the SASS decreased at a true strain of 0.7, indicating that dynamic recovery (DRV) and dynamic recrystallization (DRX) are more likely to occur in samples with low angles. The location and extent of the high power-dissipation efficiency (η) region and of the instability region differed considerably between samples with different columnar crystal characteristics, implying markedly different hot workability and microstructure evolution behaviors. At a true strain of 0.7, no negative-η region was observed in the 0° sample, and the instability region was relatively small. Overall, the SASS presented the optimum hot workability when the angle between the solidification direction of the columnar crystals and the loading direction was 0°.
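The ln Z values discussed above refer to the standard Zener–Hollomon parameter, Z = ε̇·exp(Q/RT), which combines strain rate and temperature into a single deformation-condition variable. A minimal sketch (the numeric inputs below are placeholders, not values from the paper):

```python
import math

R = 8.314  # universal gas constant, J/(mol*K)

def ln_Z(strain_rate, Q, T):
    """Natural log of the Zener-Hollomon parameter.
    strain_rate in s^-1, Q (deformation activation energy) in J/mol, T in K."""
    return math.log(strain_rate) + Q / (R * T)
```

A lower Q (as reported for the low-angle samples) directly lowers ln Z at a given strain rate and temperature, consistent with the abstract's link between decreasing ln Z and easier DRV/DRX.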
Multi-scale blind motion deblurring using local minimum
Blind deconvolution, a long-standing inverse problem, is the recovery of the latent sharp image from a blurred one when the blur kernel is unknown. Recent algorithms based on the MAP approach fail because the global minimum of the negative MAP score actually favors the blurry image. The goal of this paper is to demonstrate that the sharp image can instead be obtained from a local minimum of the MAP objective. We first propose a cross-scale constraint that makes the sharp image correspond to a good local minimum. Then cross-scale initialization, iterative likelihood update, and iterative residual deconvolution are adopted to trap the MAP approach in the desired local minimum. These techniques yield our cross-scale blind deconvolution approach, which constrains the solution from coarse to fine. We test our approach on the standard dataset and many other challenging images. The experimental results suggest that our approach outperforms existing alternatives.
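The coarse-to-fine structure common to multi-scale blind deblurring can be sketched as the skeleton below. The callables `estimate_kernel`, `deconvolve`, and `upsample` stand in for the paper's MAP kernel update, iterative residual deconvolution, and cross-scale initialization; they are placeholders, not the authors' implementation.

```python
def multiscale_deblur(blurred_pyramid, estimate_kernel, deconvolve, upsample):
    """blurred_pyramid: blurred images ordered coarse -> fine.
    At each scale, the latent estimate from the previous (coarser) scale
    initializes the current one, constraining the solution across scales."""
    kernel, latent = None, None
    for blurred in blurred_pyramid:
        if latent is not None:
            latent = upsample(latent)          # cross-scale initialization
        kernel = estimate_kernel(blurred, latent)  # e.g. a MAP kernel update
        latent = deconvolve(blurred, kernel)       # recover latent at this scale
    return kernel, latent
```

The point of the loop is that each scale's optimization starts near the coarser scale's solution, which is how the cross-scale constraint keeps the MAP iterations in the desired local minimum rather than drifting to the trivial blurry-image solution.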