
    Implementing contextual biasing in GPU decoder for online ASR

    GPU decoding significantly accelerates the generation of ASR predictions. While GPUs are already used for online ASR decoding, post-processing and rescoring on GPUs have not yet been properly investigated. Rescoring with available contextual information can considerably improve ASR predictions. Previous studies have proven the viability of lattice rescoring and of biasing language model (LM) weights in offline and online CPU scenarios. In real-time GPU decoding, partial recognition hypotheses are produced without lattice generation, which makes the implementation of biasing more complex. This paper proposes and describes an approach for integrating contextual biasing into real-time GPU decoding that builds on the standard Kaldi GPU decoder. Besides biasing partial ASR predictions, our approach also permits dynamic context switching, allowing flexible rescoring for each speech segment directly on the GPU. The code is publicly released and tested with open-sourced test sets. Comment: Accepted to Interspeech 202
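    The rescoring idea can be illustrated with a small, framework-agnostic sketch: each partial hypothesis emitted by the decoder receives a score bonus whenever it contains a phrase from the currently active context, and the context set can be swapped per segment. All names below are hypothetical; this is a conceptual sketch, not the Kaldi GPU decoder API.

```python
# Minimal sketch of on-the-fly contextual biasing of partial hypotheses.
# All names here are illustrative; this is not the Kaldi GPU decoder API.

class ContextBiaser:
    def __init__(self, bonus: float = 2.0):
        self.bonus = bonus
        self.context = set()

    def switch_context(self, phrases):
        """Dynamic context switching: replace the active phrase set per segment."""
        self.context = {p.lower() for p in phrases}

    def rescore(self, hypotheses):
        """Add a bias bonus to partial hypotheses containing a context phrase.

        `hypotheses` is a list of (text, log_score) pairs as emitted by the
        decoder for the current frame; no lattice is required.
        """
        rescored = []
        for text, score in hypotheses:
            hits = sum(1 for p in self.context if p in text.lower())
            rescored.append((text, score + self.bonus * hits))
        return sorted(rescored, key=lambda h: h[1], reverse=True)

biaser = ContextBiaser(bonus=2.0)
biaser.switch_context(["playa de ondarreta", "donostia"])
print(biaser.rescore([("go to playa de ondarreta", -12.3),
                      ("go to play other", -11.9)]))
```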

    Finstreder: simple and fast spoken language understanding with finite state transducers using modern speech-to-text models

    In Spoken Language Understanding (SLU), the task is to extract important information from spoken commands, such as the intent of what the user wants the system to do and entities like locations or numbers. This paper presents a simple method for embedding intents and entities into finite state transducers which, in combination with a pretrained general-purpose speech-to-text model, allows building SLU models without any additional training. Building these models is very fast, taking only a few seconds, and the method is completely language independent. A comparison on different benchmarks shows that this method can outperform several other, more resource-demanding SLU approaches.
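    The core mechanism can be sketched in a few lines: a grammar of phrases is compiled into a trie-shaped transducer, and the token stream from the speech-to-text model is matched against it to emit intent and entity labels, with no training involved. The grammar and all names below are invented for illustration; the paper's actual implementation uses proper finite state transducers.

```python
# Toy illustration of the FST idea: a trie-shaped transducer maps token
# sequences from a speech-to-text model onto intent/entity labels without
# any training. The grammar and names are invented for this sketch.

def build_fst(grammar):
    """grammar: {phrase: label}; returns a nested-dict trie transducer."""
    root = {}
    for phrase, label in grammar.items():
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node["<out>"] = label  # emit label on an accepting path
    return root

def transduce(fst, tokens):
    """Greedily match token spans against the transducer, emitting labels."""
    out, i = [], 0
    while i < len(tokens):
        node, j, match = fst, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "<out>" in node:
                match = (node["<out>"], j)
        if match:
            out.append(match[0])
            i = match[1]
        else:
            i += 1
    return out

fst = build_fst({"turn on the light": "intent:light_on",
                 "in the kitchen": "entity:room=kitchen"})
print(transduce(fst, "please turn on the light in the kitchen".split()))
# ['intent:light_on', 'entity:room=kitchen']
```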

    Stride: a flexible software platform for high-performance ultrasound computed tomography

    BACKGROUND AND OBJECTIVE: Advanced ultrasound computed tomography techniques like full-waveform inversion are mathematically complex and orders of magnitude more computationally expensive than conventional ultrasound imaging methods. This computational and algorithmic complexity, together with a lack of open-source libraries in the field, represents a barrier to the generalised adoption of these techniques, slowing the pace of research and hindering reproducibility. Consequently, we have developed Stride, an open-source Python library for the solution of large-scale ultrasound tomography problems. METHODS: On the one hand, Stride provides high-level interfaces and tools for expressing the types of optimisation problems encountered in medical ultrasound tomography. On the other, these high-level abstractions integrate seamlessly with high-performance wave-equation solvers and with scalable parallelisation routines. The wave-equation solvers are generated automatically using Devito, a domain-specific language, and the parallelisation routines are provided through the custom actor-based library Mosaic. RESULTS: We demonstrate the modelling accuracy achieved by our wave-equation solvers through a comparison (1) with analytical solutions for a homogeneous medium, and (2) with state-of-the-art modelling software applied to a high-contrast, complex skull section. Additionally, we show through a series of examples how Stride can handle realistic numerical and experimental tomographic problems, in 2D and 3D, and how it can scale robustly from a local multi-processing environment to a multi-node high-performance cluster. CONCLUSIONS: Stride enables researchers to rapidly and intuitively develop new imaging algorithms and to explore novel physics without sacrificing performance or scalability. This will lead to faster scientific progress in this field and will significantly ease clinical translation.
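    The class of optimisation problem Stride expresses can be illustrated with a toy gradient-descent inversion loop, using a stand-in forward operator in place of a real wave-equation solver. This is a minimal sketch of the problem structure only; none of the names below are Stride's actual API.

```python
# Minimal sketch of the optimisation loop behind full-waveform inversion,
# with a stand-in forward operator instead of a real wave-equation solver.
# This illustrates the problem class Stride targets; it is not Stride's API.
import numpy as np

def forward(m):
    """Hypothetical forward operator mapping a model to synthetic data.
    A real implementation would solve the acoustic wave equation (e.g. via
    Devito-generated solvers, as Stride does)."""
    return np.cumsum(m)  # toy "travel time" accumulation along a ray

def misfit(m, d_obs):
    r = forward(m) - d_obs
    return 0.5 * float(r @ r)

def gradient(m, d_obs, eps=1e-6):
    """Finite-difference gradient; adjoint-state methods replace this at scale."""
    g = np.zeros_like(m)
    for i in range(m.size):
        dm = np.zeros_like(m)
        dm[i] = eps
        g[i] = (misfit(m + dm, d_obs) - misfit(m - dm, d_obs)) / (2 * eps)
    return g

m_true = np.array([1.0, 1.2, 0.9, 1.1])
d_obs = forward(m_true)            # "observed" data from the true model
m = np.ones(4)                     # initial homogeneous model
for it in range(200):              # plain gradient descent
    m -= 0.1 * gradient(m, d_obs)
print(np.round(m, 3))              # converges toward m_true
```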

    Data parallel string manipulating programs

    String-manipulating programs are an important class of programs with applications in malware detection, graphics, input sanitization for Web security, and large-scale HTML processing. This paper extends prior work on BEK, an expressive domain-specific language for writing string-manipulating programs, with algorithmic insights that make BEK both analyzable and data-parallel. By analyzable we mean that, unlike in most general-purpose programming languages, many algebraic properties of a BEK program are decidable (e.g., one can check whether two programs commute, or compute the inverse of a program). By data-parallel we mean that a BEK program can compute on arbitrary subsections of its input in parallel, thus exploiting parallel hardware. This latter requirement is particularly important for programs which operate on large data: without data parallelism, a programmer cannot hide the latency of reading data from various storage media (e.g., reading a terabyte of data from a modern hard drive takes about 3 hours). With a data-parallel approach, the system can split data across multiple disks and thus hide the latency of reading them. A BEK program is expressive: a programmer can use conditionals, switch statements, and registers (local variables) to implement common string-manipulating programs. Unfortunately, this expressivity induces data dependencies, which are an obstacle to parallelism. The key contribution of this paper is an algorithm which automatically removes these data dependencies by mapping a BEK program into an intermediate format consisting of symbolic transducers, which extend classical transducers with symbolic predicates and symbolic assignments. We present a novel algorithm, which we call exploration, that performs symbolic loop unrolling of these transducers to obtain simplified versions of the original program. We show how these simplified versions can then be lifted to a stateless form, and from there compiled to data-parallel hardware. To evaluate the efficacy of our approach, we demonstrate up to 8x speedups for a number of real-world BEK programs (e.g., an HTML encoder and decoder) on data-parallel hardware. To the best of our knowledge, these are the first data-parallel implementations of these programs. To validate that our approach is correct, we use an automatic testing technique to compare our generated code to the original implementations and find no semantic deviations.
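    The payoff of the lifted, stateless form can be shown directly: when a transformation has no cross-character data dependencies, arbitrary chunks of the input can be processed on separate workers and simply concatenated. The sketch below is illustrative, not the paper's compiled output; it uses Python's html.escape as the stateless core.

```python
# Sketch of why a stateless (dependency-free) string transformation is
# data-parallel: once a BEK-style program is lifted to a per-character map,
# arbitrary chunks of the input can be processed independently and
# concatenated. Shown here with Python's html.escape as the stateless core.
import html
from concurrent.futures import ProcessPoolExecutor

def encode_chunk(chunk: str) -> str:
    # Stateless: the output for each character is independent of its
    # neighbours, so chunk boundaries cannot split a data dependency.
    return html.escape(chunk)

def parallel_encode(text: str, workers: int = 4) -> str:
    size = max(1, len(text) // workers)
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Results arrive in input order, so plain concatenation is correct.
        return "".join(pool.map(encode_chunk, chunks))

if __name__ == "__main__":
    print(parallel_encode('<a href="x">Tom & Jerry</a>' * 2))
```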

    Ultra low-power, high-performance accelerator for speech recognition

    Automatic Speech Recognition (ASR) is undoubtedly one of the most important and interesting applications in the current era of deep learning deployment, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost, requiring huge memory storage and computational power that are not affordable within the tiny power budget of mobile devices. Hardware acceleration can reduce the power consumption and memory pressure of ASR systems while delivering high performance. In this thesis, we present a customized accelerator for large-vocabulary, speaker-independent, continuous speech recognition. A state-of-the-art ASR system consists of two major components: acoustic scoring using a DNN and speech-graph decoding using Viterbi search. As the first step, we focus on the Viterbi search algorithm, which represents the main bottleneck in the ASR system. The accelerator includes several innovative techniques to improve the memory subsystem, the main bottleneck for performance and power, such as a prefetching scheme and a novel bandwidth-saving technique tailored to the needs of ASR. Furthermore, as the speech graph is vast, occupying more than 1 GB of memory, we propose to change its representation by partitioning it into several sub-graphs that are composed on the fly at Viterbi run time. This approach, together with some simple yet efficient compression techniques, results in a 31x memory footprint reduction, a 155x real-time speedup, and orders-of-magnitude power and energy savings compared to CPUs and GPUs. In the next step, we propose a novel hardware-based ASR system that effectively integrates a DNN accelerator for pruned/quantized models with the Viterbi accelerator. We show that pruning or quantizing the DNN model used for acoustic scoring maintains ASR accuracy but increases the execution time of the ASR system by 33%. Although pruning and quantization improve the efficiency of the DNN, they result in a huge increase in Viterbi search activity, since the output scores of the pruned model are less reliable. In order to avoid this increase in Viterbi search workload, our system loosely selects the N-best hypotheses at every time step, exploring only the N most likely paths. Our final solution efficiently combines the DNN and Viterbi accelerators with all their optimizations, delivering 222x real-time ASR with a small power budget of 1.26 W, a memory footprint of 41 MB, and a peak memory bandwidth of 381 MB/s, making it amenable to low-power mobile platforms.
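    The N-best pruning idea can be sketched as follows: at each frame of the Viterbi search, only the N highest-scoring active hypotheses are kept, which bounds the search workload even when the acoustic scores from a pruned/quantized DNN are noisy. The graph and scores below are invented; this is a conceptual sketch, not the accelerator's implementation.

```python
# Sketch of N-best pruning in Viterbi search: keep only the N most likely
# active states per frame. Graph and scores are invented for illustration.
import heapq

def viterbi_nbest(arcs, acoustic_frames, start_state, n_best=3):
    """arcs: {state: [(next_state, label, graph_cost), ...]}
    acoustic_frames: per-frame dicts mapping label -> log-likelihood."""
    active = {start_state: 0.0}  # state -> best accumulated log-score
    for frame in acoustic_frames:
        expanded = {}
        for state, score in active.items():
            for nxt, label, cost in arcs.get(state, []):
                s = score + frame.get(label, -1e9) - cost
                if s > expanded.get(nxt, -1e9):
                    expanded[nxt] = s
        # N-best pruning: retain only the n_best most likely states,
        # bounding the workload regardless of how flat the scores are.
        active = dict(heapq.nlargest(n_best, expanded.items(),
                                     key=lambda kv: kv[1]))
    return active

arcs = {0: [(1, "a", 0.5), (2, "b", 0.2)],
        1: [(1, "a", 0.1), (2, "b", 0.3)],
        2: [(2, "b", 0.1), (1, "a", 0.4)]}
frames = [{"a": -1.0, "b": -2.0}, {"a": -0.5, "b": -0.4}]
print(viterbi_nbest(arcs, frames, start_state=0, n_best=2))
```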

    Parallelization and improvement of beamforming process in synthetic aperture systems for real-time ultrasonic image generation

    Unpublished doctoral thesis, Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, defended 9 February 2016. Ultrasound is nowadays one of the most popular visualization methods for examining the interior of opaque objects. Its application is particularly significant in the field of medical diagnosis, as well as in non-destructive evaluation in industry, where the integrity of a component or structure is assessed. The development of high-performance ultrasound imaging systems is based on the use of multisensor systems known as arrays, which may be composed of dozens of elements. The development of these devices entails high complexity, due both to the number of sensors and the electronics needed for parallel signal acquisition, and to the processing stage for the acquired data, which must operate in real time. This signal processing stage handles a high parallel data flow and performs, besides image composition, other sophisticated measurement techniques (elasticity, flow, etc.). In this sense, the development of new imaging systems with higher performance (resolution, dynamic range, 3D imaging, etc.) is strongly limited by the number of channels in the array's aperture. While some studies have focused on reducing the number of active sensors (e.g., sparse arrays), others have centered on analysing different acquisition strategies which, operating with a reduced number of parallel electronic channels, are able to emulate the behaviour of a full aperture by multiplexing. These latter techniques are grouped under the term Synthetic Aperture Techniques (SAFT). Their interest lies in the fact that they not only reduce the hardware requirements of the system (low power, portability, cost, etc.) but also, within certain trade-offs, allow improving image quality over conventional systems...
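    The baseline computation that such systems parallelize, delay-and-sum synthetic aperture focusing, can be sketched in a few lines: each element fires in turn, all elements receive, and each image pixel is formed by summing the echoes at the round-trip delays it implies. All geometry and sampling parameters below are invented for illustration; this is not the thesis's implementation.

```python
# Minimal delay-and-sum sketch of synthetic aperture focusing (SAFT).
# Geometry and sampling parameters are invented for illustration.
import numpy as np

def saft_pixel(rf, elem_x, px, pz, c=1540.0, fs=40e6):
    """rf[tx, rx, t]: A-scans for each transmit/receive element pair.
    elem_x: element x-positions (m); (px, pz): pixel position (m);
    c: sound speed (m/s); fs: sampling rate (Hz)."""
    n_tx, n_rx, n_t = rf.shape
    value = 0.0
    for tx in range(n_tx):
        d_tx = np.hypot(px - elem_x[tx], pz)       # transmit path length
        for rx in range(n_rx):
            d_rx = np.hypot(px - elem_x[rx], pz)   # receive path length
            sample = int(round((d_tx + d_rx) / c * fs))
            if 0 <= sample < n_t:
                value += rf[tx, rx, sample]        # coherent summation
    return value

# Tiny synthetic example: 8 elements, random echo data.
rng = np.random.default_rng(0)
elem_x = np.linspace(-3.5e-3, 3.5e-3, 8)
rf = rng.standard_normal((8, 8, 2048))
print(saft_pixel(rf, elem_x, px=0.0, pz=20e-3))
```

    The doubly nested loop over transmit/receive pairs is what the thesis targets for parallelization: every pixel and every pair is independent, so the work maps naturally onto parallel hardware.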