
    TESSP: Text-Enhanced Self-Supervised Speech Pre-training

    Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal, while self-supervised text pre-training empowers the model with linguistic information. Both are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize speech and text representations in the same model. To solve this problem, we propose Text-Enhanced Self-Supervised Speech Pre-training (TESSP), which aims to incorporate linguistic information into speech pre-training. Our model consists of three parts: a speech encoder, a text encoder and a shared encoder. The model takes unsupervised speech and text data as input and leverages the common HuBERT and MLM losses, respectively. We also propose phoneme up-sampling and representation swapping to enable joint modeling of speech and text information. Specifically, to fix the length mismatch between speech and text data, we phonemize the text sequence and up-sample the phonemes using alignment information extracted from a small set of supervised data. Moreover, to close the gap between the learned speech and text representations, we swap the text representation with the speech representation extracted by the respective private encoders according to the alignment information. Experiments on the Librispeech dataset show that the proposed TESSP model achieves more than a 10% improvement over WavLM on the test-clean and test-other sets. We also evaluate our model on the SUPERB benchmark, showing that it outperforms WavLM on Phoneme Recognition, Automatic Speech Recognition and Speech Translation.
    Comment: 9 pages, 4 figures
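    The phoneme up-sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `durations` stands in for the frame-level alignments extracted from the small supervised set, and all names are hypothetical.

```python
def upsample_phonemes(phonemes, durations):
    """Repeat each phoneme according to its frame-level duration so the
    phoneme sequence matches the length of the speech-feature sequence."""
    assert len(phonemes) == len(durations), "one duration per phoneme"
    frames = []
    for phoneme, num_frames in zip(phonemes, durations):
        frames.extend([phoneme] * num_frames)
    return frames
```

    With the text and speech streams length-matched this way, the representation swap can be performed frame by frame.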

    SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

    How to boost speech pre-training with textual data is an unsolved problem, because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, a phoneme-unit tokenizer and a hidden-unit tokenizer, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify speech and text into the same discrete semantic space with a unified Transformer network. Leveraging only 10K text sentences, our SpeechLM achieves a 16% relative WER reduction over the best base model (from 6.8 to 5.7) on the public LibriSpeech ASR benchmark. Moreover, SpeechLM with fewer parameters even outperforms previous SOTA models on CoVoST-2 speech translation tasks. We also evaluate SpeechLM on various spoken language processing tasks under the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Our code and models are available at https://aka.ms/SpeechLM.
    Comment: 14 pages
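    A phoneme-unit tokenizer of the kind described can be sketched as a pronunciation-lexicon lookup; in the paper's setting the lexicon would itself be learned from the small amount of paired speech-text data. The function and lexicon entries below are illustrative assumptions, not the released tokenizer.

```python
def phonemize(text, lexicon, unk="<unk>"):
    """Map each word to its phoneme-unit tokens via a pronunciation
    lexicon; out-of-vocabulary words fall back to a single unk token."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(lexicon.get(word, [unk]))
    return tokens
```

    Once both unlabeled speech (via the tokenizer) and unlabeled text (via phonemization) are expressed in the same discrete units, a single Transformer can be pre-trained over both streams.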

    Comparison of curative effect between OBS assisted by 3D printing and PFNA in the treatment of AO/OTA type 31-A3 femoral intertrochanteric fractures in elderly patients

    Objective: To compare the clinical efficacy of the Ortho-Bridge System (OBS) assisted by 3D printing with that of proximal femoral nail anti-rotation (PFNA) in the treatment of AO/OTA type 31-A3 femoral intertrochanteric fractures in elderly patients.
    Methods: A retrospective analysis was conducted of 25 elderly patients diagnosed with AO/OTA type 31-A3 femoral intertrochanteric fracture between January 2020 and August 2022 at Yan’an Hospital, affiliated to Kunming Medical University. According to the surgical method used, the patients were divided into an OBS group (10 patients) and a PFNA group (15 patients). In the OBS group, bone models were reconstructed and a guide plate was designed by computer before the operation; the guide-plate and bone-model data were imported into a stereolithography apparatus (SLA) 3D printer and printed in photosensitive resin to obtain physical objects, the operation was simulated, and the guide plate was then used to assist the OBS procedure. The PFNA group was treated with proximal femoral nail anti-rotation. Operation time, intraoperative blood loss, Harris Hip Score (HHS), Oxford Hip Score (OHS) and complications were compared between the two groups.
    Results: Operation time and intraoperative blood loss in the PFNA group were lower than those in the OBS group, with a significant difference between the two groups (P < 0.05). The HHS at 6 months was statistically higher in the OBS group than in the PFNA group (P < 0.05); however, there was no significant difference in OHS at 6 months between the two groups (P > 0.05). The HHS and OHS at 12 months were statistically better in the OBS group than in the PFNA group (P < 0.05).
    Conclusion: Both OBS assisted by 3D printing and PFNA are effective measures for treating intertrochanteric fractures. Before deciding on internal fixation, the particular circumstances of each patient should be evaluated thoroughly.

    Spatial locality-aware sparse coding and dictionary learning

    Nonlinear encoding of SIFT features has recently shown good promise in image classification. This scheme reduces the training complexity of traditional bag-of-features approaches while achieving better performance, making it suitable for large-scale image classification. However, existing nonlinear encoding methods do not explicitly consider spatial relationships when encoding local features; merely deferring spatial information to a later stage, e.g. through spatial pyramid matching, is largely inadequate. In this paper, we propose a joint sparse coding and dictionary learning scheme that takes spatial information into consideration during encoding. Our experiments on synthetic and benchmark data demonstrate that the proposed scheme learns a better dictionary and achieves higher classification accuracy.
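    The idea of making the encoding spatially aware can be illustrated with a toy iterative sketch: each feature's code is obtained by soft-thresholding its correlations with the dictionary atoms, with an extra term pulling the code toward those of its spatial neighbors. The penalty form, parameters and names below are assumptions for illustration, not the paper's formulation, and the dictionary is held fixed.

```python
def soft_threshold(x, lam):
    """Shrinkage operator that produces sparsity: small values become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def spatially_aware_codes(correlations, neighbors, lam=0.1, gamma=0.5, iters=10):
    """correlations[i][k]: dot product of local feature i with atom k.
    neighbors[i]: indices of spatially adjacent features.
    Each iteration re-encodes every feature, biased toward the average
    code of its spatial neighbors (weight gamma)."""
    n, k_atoms = len(correlations), len(correlations[0])
    codes = [[0.0] * k_atoms for _ in range(n)]
    for _ in range(iters):
        updated = []
        for i in range(n):
            nb = neighbors[i]
            row = []
            for k in range(k_atoms):
                avg = sum(codes[j][k] for j in nb) / len(nb) if nb else 0.0
                row.append(soft_threshold(correlations[i][k] + gamma * avg, lam))
            updated.append(row)
        codes = updated
    return codes
```

    In a full system the dictionary would be updated jointly with the codes; the sketch only shows how a spatial term can enter the encoding step itself rather than being deferred to pooling.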

    Hot deformation behavior and 3D processing map of super austenitic stainless steel containing 7Mo–0.46N–0.02Ce: Effect of the solidification direction orientation of columnar crystal to loading direction

    In the present paper, hot compression tests on a super austenitic stainless steel (SASS) containing 7Mo–0.46N–0.02Ce were performed at temperatures of 900–1200 °C and strain rates of 0.01–10 s⁻¹. The effect of the angle between the solidification direction of the columnar crystals and the loading direction (0°, 30°, 60° and 90°) on the flow behavior and microstructure evolution of the SASS was studied. Results showed that samples with different columnar-crystal characteristics and test conditions displayed various flow behaviors, owing to differences in the dynamic restoration softening mechanism, flow localization and shear-band formation. As the angle between the solidification direction and the loading direction decreased, the deformation activation energy (Q) and the ln Z value of the SASS decreased at a true strain of 0.7, indicating that dynamic recovery (DRV) and dynamic recrystallization (DRX) are more likely to occur in samples with low angles. The location and extent of the high power-dissipation-efficiency (η) regions and the instability regions differed considerably among samples with different columnar-crystal characteristics, implying markedly different hot workability and microstructure evolution behaviors. At a true strain of 0.7, no negative-η region was observed in the 0° sample, and the instability region was relatively small. Overall, the SASS presented the optimum hot workability when the angle between the solidification direction of the columnar crystals and the loading direction was 0°.
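    The ln Z values discussed above come from the Zener–Hollomon parameter, Z = ε̇·exp(Q/RT), which combines strain rate and temperature into a single deformation severity measure. A small sketch (the activation energy used in the example is an arbitrary illustrative figure, not a value from the paper):

```python
import math

GAS_CONSTANT = 8.314  # J/(mol*K)

def ln_zener_hollomon(strain_rate, activation_energy, temp_celsius):
    """ln Z = ln(strain rate) + Q / (R * T), with T in kelvin."""
    temp_kelvin = temp_celsius + 273.15
    return math.log(strain_rate) + activation_energy / (GAS_CONSTANT * temp_kelvin)
```

    A lower Q, as reported for the low-angle samples, lowers ln Z at a given temperature and strain rate, consistent with DRV and DRX being more likely in those samples.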

    Multi-scale blind motion deblurring using local minimum

    Blind deconvolution, a long-standing inverse problem, is the recovery of the latent sharp image from a blurred one when the blur kernel is unknown. Recent algorithms based on the MAP approach fail because the global minimum of the negative MAP score actually favors the blurry image. The goal of this paper is to demonstrate that the sharp image can be obtained from a local minimum using the MAP approach. We first propose a cross-scale constraint that makes the sharp image correspond to a good local minimum. Then cross-scale initialization, iterative likelihood updates and iterative residual deconvolution are adopted to trap the MAP approach in the desired local minimum. These techniques yield our cross-scale blind deconvolution approach, which constrains the solution from coarse to fine. We test our approach on a standard dataset and many other challenging images. The experimental results suggest that our approach outperforms all existing alternatives.
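    The coarse-to-fine strategy can be sketched by the kernel-size schedule it implies: the kernel is first estimated at a coarse scale with a small support, then upsampled to initialize the next, finer scale. The helper below is a hypothetical illustration of such a schedule, not the paper's code.

```python
def scale_schedule(final_kernel_size, num_scales, ratio=0.5):
    """Return coarse-to-fine kernel sizes (all odd, >= 3) so that each
    scale's kernel estimate can initialize the next, finer scale."""
    sizes = []
    size = final_kernel_size
    for _ in range(num_scales):
        sizes.append(max(3, size | 1))  # force an odd support size
        size = int(size * ratio)
    return sizes[::-1]  # coarsest scale first
```

    At each scale in such a loop, the likelihood update and residual deconvolution refine the estimate before it is passed up to the next resolution.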