27 research outputs found

    Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

    Automatic speech recognition (ASR) based on transducers is widely used. In training, a transducer maximizes the summed posteriors of all paths. The path with the highest posterior is commonly defined as the predicted alignment between the speech and the transcription. While the vanilla transducer has no prior preference for any of the valid paths, this work aims to enforce preferred paths and achieve controllable alignment prediction. Specifically, this work proposes the Bayes Risk Transducer (BRT), which uses a Bayes risk function to assign lower risk values to the preferred paths so that the predicted alignment is more likely to satisfy specific desired properties. We further demonstrate that these predicted alignments with intentionally designed properties can provide practical advantages over the vanilla transducer. Experimentally, the proposed BRT saves inference cost by up to 46% for non-streaming ASR and reduces overall system latency by 41% for streaming ASR.
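
    The "summed posteriors of all paths" objective of a vanilla transducer can be illustrated with the standard forward recursion over the frame/label lattice. The following is a minimal sketch only, not the BRT method from the abstract: the function name, the nested-list inputs `log_blank` and `log_label`, and their layout are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def transducer_log_likelihood(log_blank, log_label):
    """Log of the summed probability of all valid alignment paths.

    Assumed (hypothetical) input layout:
      log_blank[t][u]: log-prob of emitting blank at frame t after u labels,
                       for t in 0..T-1 and u in 0..U.
      log_label[t][u]: log-prob of emitting label u+1 at frame t after u labels,
                       for t in 0..T-1 and u in 0..U-1.
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by consuming a frame (blank) or by emitting a label.
            via_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else NEG_INF
            via_label = alpha[t][u - 1] + log_label[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = log_add(via_blank, via_label)
    # Every path ends with a final blank from the last lattice node.
    return alpha[T - 1][U] + log_blank[T - 1][U]
```

    In this sketch every path contributes equally to the sum; a risk-weighted variant such as the one the abstract describes would additionally weight paths by a risk function, which is not shown here.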

    AutoPrep: An Automatic Preprocessing Framework for In-the-Wild Speech Data

    Recently, the use of extensive open-sourced text data has significantly advanced the performance of text-based large language models (LLMs). However, the use of in-the-wild large-scale speech data in the speech technology community remains constrained. One reason for this limitation is that a considerable amount of the publicly available speech data is compromised by background noise, overlapping speech, lack of speech segmentation information, missing speaker labels, and incomplete transcriptions, which can largely hinder its usefulness. On the other hand, human annotation of speech data is both time-consuming and costly. To address this issue, we introduce an automatic in-the-wild speech data preprocessing framework (AutoPrep) in this paper, which is designed to enhance speech quality, generate speaker labels, and produce transcriptions automatically. The proposed AutoPrep framework comprises six components: speech enhancement, speech segmentation, speaker clustering, target speech extraction, quality filtering, and automatic speech recognition. Experiments conducted on the open-sourced WenetSpeech and our self-collected AutoPrepWild corpora demonstrate that the proposed AutoPrep framework can generate preprocessed data with DNSMOS and PDNSMOS scores similar to those of several open-sourced TTS datasets. The corresponding TTS system can achieve up to 0.68 in-domain speaker similarity.

    Lamellar structure change of waxy corn starch during gelatinization by time-resolved synchrotron SAXS

    An in situ synchrotron small- and wide-angle X-ray scattering (SAXS/WAXS) experiment was used to study the lamellar structure change of starch during gelatinization. Waxy corn starch was used as a model material to exclude the effect of amylose. The thicknesses of crystalline (d), amorphous (d) regions of the lamella and the long period distance (d) were obtained based on a 1D linear correlation function. The SAXS and WAXS results reveal the multi-stage nature of gelatinization. First, a preferable increase in the thickness of crystalline lamellae occurs because of water penetration into the crystalline region. Then, the thickness of amorphous lamellae increases significantly while that of crystalline lamellae decreases. Next, the thickness of amorphous lamellae starts to decrease, probably due to the out-phasing of starch molecules from the lamellae. Finally, the thickness of amorphous lamellae decreases rapidly, with the formation of fractal gel on a larger scale (than that of the lamellae), which gradually decreases as the temperature further increases and is related to the concentration of starch molecular chains. This work systematically reveals the gelatinization mechanism of waxy corn starch and would be useful in starch amorphous materials processing.

    Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

    Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science. Comment: Accepted at ASRU 202

    Analysis and design of a hydraulic gripper based on compliant mechanisms

    This thesis deals with the design of a gripper based on a compliant mechanism and driven by a hydrostatic actuator. The actuator is a rolling-diaphragm hydraulic cylinder used in robotics and is designed to solve the problems of hydrostatic friction and leakage found in conventional fluid power systems. In addition, the use of compliant mechanisms instead of conventional hinge structures reduces assembly and maintenance complexity and cost, and increases transmission efficiency. The first part of the report describes the background and motivation for the project and reviews the relevant literature on mechanical grippers. It also presents the actuator used in the mechanical gripper and the compliant mechanism. The second part analyzes the compliant mechanism, providing a theoretical basis for the design and experiments that follow. The third part introduces the kinematic design of the gripper modeled as a rigid linkage system. The fourth part focuses on the design of a compliant-mechanism gripper based on the previously studied kinematics, followed by the realization of a prototype of the gripper. The fifth part presents a set of experiments that demonstrate and characterize the prototype. In the sixth part, conclusions are given, and possible improvements and future developments of the gripper are discussed.

    A Hybrid Traffic Scheduling Strategy for Time-Sensitive Networking

    The traffic scheduling mechanism in Time-Sensitive Networking (TSN) is the key to guaranteeing the deterministic transmission of traffic. However, when time-sensitive traffic and non-time-sensitive traffic are transmitted together, traffic scheduling conflicts occur easily in TSN. As a result, the deterministic transmission of time-sensitive traffic will be disrupted, and non-time-sensitive traffic may be preempted for a long time. To optimize the performance of multi-type hybrid traffic scheduling in TSN, we first establish a collaborative scheduling framework that incorporates Time Aware Shaping (TAS) and Cyclic Queuing and Forwarding (CQF) mechanisms. We then design a traffic shaping method in this framework based on Least Laxity First (LLF), which considers traffic characteristics to dynamically arrange the time slot injection sequence for different types of traffic. Finally, the traffic schedulability is evaluated based on the scheduling constraints of different types of traffic. Compared with existing scheduling strategies, the proposed hybrid traffic scheduling strategy can schedule more non-time-sensitive traffic and achieve better delay performance for rate-constrained traffic in different hybrid traffic scenarios. When the number of flows is 100, the time slot injection ratio is increased by 24.3% compared with the LLF_TAS method.
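
    The Least Laxity First idea mentioned above can be sketched generically: at each time slot, the flow with the smallest laxity (deadline minus current time minus remaining transmission time) is injected first. This is a textbook LLF illustration under simplifying assumptions, not the paper's TAS/CQF framework; the `Flow` fields, the slot-based time model, and the tie-break by name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    deadline: int   # absolute deadline, in time slots (assumed unit)
    remaining: int  # slots of transmission still needed


def laxity(flow, now):
    """Slack left before the flow can no longer meet its deadline."""
    return flow.deadline - now - flow.remaining


def llf_schedule(flows, horizon):
    """Return the per-slot injection sequence chosen by Least Laxity First."""
    # Copy the flows so the caller's objects are not modified.
    pending = [Flow(f.name, f.deadline, f.remaining) for f in flows]
    sequence = []
    for now in range(horizon):
        ready = [f for f in pending if f.remaining > 0]
        if not ready:
            break
        # Smallest laxity wins; ties are broken by name for determinism.
        chosen = min(ready, key=lambda f: (laxity(f, now), f.name))
        chosen.remaining -= 1
        sequence.append(chosen.name)
    return sequence
```

    For example, with a flow "A" (deadline 6, 2 slots needed) and a flow "B" (deadline 3, 2 slots needed), "B" has the smaller laxity in the first two slots and is injected before "A".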

    Thermal-independent properties of PIN-PMN-PT single-crystal linear-array ultrasonic transducers

    In this paper, low-frequency 32-element linear-array ultrasonic transducers were designed and fabricated using both ternary Pb(In1/2Nb1/2)-Pb(Mg1/3Nb2/3)-PbTiO3 (PIN-PMN-PT) and binary Pb(Mg1/3Nb2/3)-PbTiO3 (PMN-PT) single crystals. Performance of the array transducers was characterized as a function of temperature ranging from room temperature to 160°C. It was found that the array transducers fabricated using the PIN-PMN-PT single crystal were capable of satisfactory performance at 160°C, having a -6-dB bandwidth of 66% and an insertion loss of 37 dB. The results suggest that the potential of PIN-PMN-PT linear-array ultrasonic transducers for high-temperature ultrasonic transducer applications is promising.