    JALAD: Joint Accuracy- and Latency-Aware Deep Structure Decoupling for Edge-Cloud Execution

    Recent years have witnessed rapid growth in deep-network-based services and applications. A practical and critical problem has thus emerged: how to deploy deep neural network models so that they execute efficiently. Conventional cloud-based approaches usually run the deep models on data center servers, which causes large latency because a significant amount of data has to be transferred from the network edge to the data center. In this paper, we propose JALAD, a joint accuracy- and latency-aware execution framework, which decouples a deep neural network so that one part runs on edge devices and the other part in the conventional cloud, while only a minimal amount of data has to be transferred between them. Though the idea seems straightforward, we face challenges including i) how to find the best partition of a deep structure; ii) how to deploy the component on an edge device that has only limited computation power; and iii) how to minimize the overall execution latency. Our answers to these questions are a set of strategies in JALAD, including 1) a normalization-based in-layer data compression strategy that jointly considers compression rate and model accuracy; 2) a latency-aware deep decoupling strategy to minimize the overall execution latency; and 3) an edge-cloud structure adaptation strategy that dynamically changes the decoupling for different network conditions. Experiments demonstrate that our solution can significantly reduce the execution latency: it speeds up the overall inference execution with a guaranteed model accuracy loss. Comment: conference, copyright transferred to IEE
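
The latency-aware decoupling idea can be illustrated with a minimal sketch. It is not the JALAD algorithm itself: the function name `best_partition`, the per-layer timings, the tensor sizes, and the bandwidth figures are all hypothetical, and the in-layer compression step is not modeled. The sketch only picks the cut point that minimizes edge compute time plus transfer time plus cloud compute time.

```python
def best_partition(edge_ms, cloud_ms, out_bytes, input_bytes, bandwidth_bytes_per_ms):
    """Pick the cut that minimizes total latency (hypothetical cost model).

    Layers [0, cut) run on the edge device and layers [cut, n) run in the cloud.
    edge_ms[i] / cloud_ms[i]: time of layer i on the edge / in the cloud.
    out_bytes[i]: size of the output of layer i; input_bytes: size of the raw input.
    Returns (cut, latency_ms).
    """
    n = len(edge_ms)
    best_cut, best_latency = None, float("inf")
    for cut in range(n + 1):
        # With cut == 0 the raw input crosses the link; otherwise the
        # output of the last edge layer does.
        sent = input_bytes if cut == 0 else out_bytes[cut - 1]
        latency = (sum(edge_ms[:cut])
                   + sent / bandwidth_bytes_per_ms
                   + sum(cloud_ms[cut:]))
        if latency < best_latency:  # strict '<' keeps the earliest cut on ties
            best_cut, best_latency = cut, latency
    return best_cut, best_latency


# Illustrative numbers only (not measurements from the paper).
edge_ms = [4.0, 6.0, 10.0, 80.0]
cloud_ms = [1.0, 1.0, 2.0, 5.0]
out_bytes = [800_000, 200_000, 50_000, 4_000]

slow_link = best_partition(edge_ms, cloud_ms, out_bytes, 600_000, 1_000.0)
fast_link = best_partition(edge_ms, cloud_ms, out_bytes, 600_000, 100_000.0)
```

With these made-up numbers the slow link favors running three layers on the edge, while the fast link favors sending the raw input straight to the cloud. Re-evaluating the cut as conditions change is the kind of adaptation the abstract describes.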

    Conveying expressivity and vocal effort transformation in synthetic speech with Harmonic plus Noise Models

    This thesis was conducted in the Grup en Tecnologies Mèdia (GTM) of the Escola d'Enginyeria i Arquitectura la Salle. The group has a long track record in the speech synthesis field and has developed its own Unit-Selection Text-To-Speech (US-TTS) system, which conveys multiple expressive styles using multiple expressive corpora, one for each style. Thus, to convey aggressive speech, the US-TTS system uses an aggressive corpus, whereas for a sensual speech style it uses a sensual corpus. Unlike that approach, this dissertation presents a new schema that enhances the flexibility of the US-TTS system so that it can produce multiple expressive styles from a single neutral corpus. The approach is based on applying Digital Signal Processing (DSP) techniques to modify the synthesized speech so that it conveys the desired expressive style. The Harmonics plus Noise Model (HNM) was chosen for these modifications because of its flexibility in carrying out signal modifications. Voice Quality (VoQ) has been shown to play an important role in different expressive styles. Thus, low-level VoQ acoustic parameters were explored for conveying multiple emotions. This study revealed several limitations that set new objectives for the rest of the thesis, among them finding a single parameter with a strong impact on the expressive style conveyed. Vocal Effort (VE) was selected for expressive speech style modifications because of its salient role in expressive speech. The first approach transferred VE between two parallel utterances, that is, two realizations of the same word with different VE, using the Adaptive Pre-emphasis Linear Prediction (APLP) technique. This approach transferred VE correctly, but it was limited in its ability to generate new intermediate VE levels. To improve the flexibility and control of the conveyed VE, a new VE model based on linear polynomials was proposed. This model not only allowed transferring VE between two different utterances, but also allowed generating VE levels other than those present in the speech corpus. This is aligned with the general goal of this thesis, allowing US-TTS systems to convey multiple expressive styles with a single neutral corpus. Moreover, the proposed methodology introduces a parameter for easily controlling the degree of VE in the synthesized speech signal. This opens new possibilities for controlling the synthesis process through simple and intuitive graphical interfaces, as was done in the CreaVeu project, also conducted within the GTM group. The dissertation concludes with a review of the conducted work and a proposal for modifying the schema of a US-TTS system to include the VE modification blocks designed in this dissertation, so that the system can synthesize multiple VE levels from a single neutral-style corpus.

    A Large Sky Simulation of the Gravitational Lensing of the Cosmic Microwave Background

    Large-scale structure deflects cosmic microwave background (CMB) photons. Since large angular scales in the large-scale structure contribute significantly to the gravitational lensing effect, a realistic simulation of CMB lensing requires a sufficiently large sky area. We describe simulations that include these effects, and present both effective and multiple plane ray-tracing versions of the algorithm, which employs spherical harmonic space and does not use the flat sky approximation. We simulate lensed CMB maps with an angular resolution of ~0.9 arcmin. The angular power spectrum of the simulated sky agrees well with analytical predictions. Maps generated in this manner are a useful tool for the analysis and interpretation of upcoming CMB experiments such as PLANCK and ACT. Comment: 14 pages, 12 figures, replaced with version accepted for publication by the AP

    Prognostic-based Life Extension Methodology with Application to Power Generation Systems

    Practicable life extension of engineering systems would be a remarkable application of prognostics. This research proposes a framework for prognostic-based life extension and investigates the use of prognostic data to mobilize the potential residual life. The obstacles to performing life extension include lack of knowledge, lack of tools, lack of data, and lack of time. This research primarily considers acoustic emission (AE) technology for quick-response diagnostics. Specifically, an important feature of AE data was statistically modeled to provide quick, robust and intuitive diagnostic capability. The proposed model successfully detected the out-of-control condition when data from a faulty bearing were applied. This research also highlights the importance of self-healing materials. One main component of the proposed life extension framework is the trend analysis module, which analyzes the pattern of the time-ordered degradation measures. Trend analysis is helpful not only for early fault detection but also for tracking improvement in the degradation rate. This research considered trend analysis methods for prognostic parameters, degradation waveforms and multivariate data. In this respect, graphical methods were found appropriate for trend detection in signal features. The Hilbert-Huang transform was applied to analyze trends in waveforms. For multivariate data, PCA was found able to indicate trends in the data if accompanied by proper data processing. In addition, two algorithms are introduced to address non-monotonic trends; both appear to have the potential to handle non-monotonicity in degradation data. Although considerable research has been devoted to developing prognostics algorithms, rather less attention has been paid to post-prognostic issues such as maintenance decision making. A multi-objective optimization model is presented for a power generation unit. This model demonstrates the ability of prognostic models to balance power generation and life extension. In this research, the conflicting objective functions were defined as maximizing profit and maximizing service life. The decision variables include the shaft speed and the duration of maintenance actions. The results of the optimization models showed clearly that maximizing the service life requires lower shaft speed and longer maintenance time.
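
The trade-off between the two objectives can be illustrated with a small Pareto-filter sketch. This is not the optimization model from the work. The helper `pareto_front` and every number below are hypothetical, chosen only to show how operating points that are worse on both profit and service life get discarded.

```python
def pareto_front(candidates):
    """Return the candidates not dominated in (profit, life); both are maximized."""
    front = []
    for c in candidates:
        dominated = any(
            o["profit"] >= c["profit"] and o["life"] >= c["life"]
            and (o["profit"] > c["profit"] or o["life"] > c["life"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


# Hypothetical operating points: shaft speed (rpm), maintenance duration (h),
# profit and service life in arbitrary units. Illustrative values only.
candidates = [
    {"speed": 3600, "maint_h": 8,  "profit": 100, "life": 10},
    {"speed": 3300, "maint_h": 16, "profit": 90,  "life": 14},
    {"speed": 3000, "maint_h": 24, "profit": 75,  "life": 18},
    {"speed": 3300, "maint_h": 8,  "profit": 85,  "life": 12},  # worse than the 16 h variant on both
]

front = pareto_front(candidates)
```

In this toy data the surviving points run from high speed and short maintenance (highest profit) to low speed and long maintenance (longest life), which mirrors the qualitative conclusion stated in the abstract.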

    Acoustic Event Detection System

    The main goal of this thesis is to create a system capable of gunshot detection, classification, and localization. The system consists of a dedicated board and a server application. First, the detection algorithm filters out acoustic events that are not gunshots. Next, features are extracted from the recording as Mel-Frequency Cepstral Coefficients. The feature vector is then passed to a Support Vector Machine, which classifies the caliber of the weapon used. Localization is performed on the server using acoustic events that carry a very precise timestamp and the position of the measuring device (node). The data aggregated from the individual nodes are used to solve the Time of Arrival Localization Problem. Two different methods are described, depending on the number of nodes that detected the event. The server application not only solves the localization task described above but also offers visualization and administration of users, nodes, and node messages. The proposed board is able to acquire its position together with a precise timestamp of the event and send all required information to the server over a LoRaWAN network. The board implements both the detection and classification algorithms and also offers a command-line interface for setting application parameters, such as the coefficients of the detection algorithm.
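
One common way to pose a Time of Arrival problem can be sketched as follows. This is not necessarily either of the two methods described in the thesis. The sketch assumes a flat 2D scene, exactly four nodes, noise-free timestamps, and a fixed speed of sound. The names `locate_toa` and `solve_linear` are hypothetical. Subtracting the range equation of the first node from the others gives a linear system in the source position and the unknown emission time.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed constant


def solve_linear(a, b):
    """Solve a small square system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate node geometry")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def locate_toa(nodes, times, c=SPEED_OF_SOUND):
    """Estimate (x, y, emission_time) from four node positions and arrival times.

    Each node i satisfies |s - p_i|^2 = c^2 (t_i - t0)^2. Subtracting the
    equation of node 0 removes the quadratic terms in s and t0.
    """
    if len(nodes) != 4 or len(times) != 4:
        raise ValueError("this sketch expects exactly four nodes")
    (x0, y0), t_ref = nodes[0], times[0]
    rows, rhs = [], []
    for (xi, yi), ti in zip(nodes[1:], times[1:]):
        rows.append([2 * (xi - x0), 2 * (yi - y0), -2 * c * c * (ti - t_ref)])
        rhs.append((xi * xi + yi * yi) - (x0 * x0 + y0 * y0)
                   - c * c * (ti * ti - t_ref * t_ref))
    return tuple(solve_linear(rows, rhs))


# Synthetic check: nodes on a 100 m square, source at (30, 70), emitted at t = 5 s.
nodes = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
source, t_emit = (30.0, 70.0), 5.0
times = [t_emit + math.hypot(source[0] - x, source[1] - y) / SPEED_OF_SOUND
         for x, y in nodes]
estimate = locate_toa(nodes, times)
```

With noisy timestamps or more nodes, a least-squares or iterative solver would be the more usual choice. The closed-form version is shown only because it makes the structure of the problem visible.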

    Orbital Angular Momentum Waves: Generation, Detection and Emerging Applications

    Orbital angular momentum (OAM) has attracted widespread interest in many fields, especially in telecommunications, due to its potential for unleashing new capacity in the severely congested spectrum of commercial communication systems. Beams carrying OAM have a helical phase front and a field strength with a singularity along the axial center, which can be used for information transmission, imaging and particle manipulation. The number of orthogonal OAM modes in a single beam is theoretically infinite, and each mode is an element of a complete orthogonal basis that can be employed for multiplexing different signals, thus greatly improving the spectrum efficiency. In this paper, we comprehensively summarize and compare the methods for generation and detection of optical OAM, radio OAM and acoustic OAM. Then, we present the applications and technical challenges of OAM in communications, including free-space optical communications, optical fiber communications, radio communications and acoustic communications. To complete our survey, we also discuss the state of the art of particle manipulation and target imaging with OAM beams.
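
The orthogonality that makes OAM multiplexing possible can be checked numerically with a short sketch. It models only the azimuthal phase factor exp(i·l·φ) sampled on a ring, ignoring radial profile, propagation, and noise. The function names `oam_field` and `project` and the amplitudes are hypothetical.

```python
import cmath
import math


def oam_field(amplitudes, n_samples=64):
    """Sample sum_l a_l * exp(i*l*phi) at n_samples equally spaced azimuth angles."""
    phis = [2 * math.pi * k / n_samples for k in range(n_samples)]
    field = [sum(a * cmath.exp(1j * l * phi) for l, a in amplitudes.items())
             for phi in phis]
    return phis, field


def project(phis, field, l):
    """Discrete inner product with mode l: (1/N) * sum field * exp(-i*l*phi)."""
    return sum(f * cmath.exp(-1j * l * phi) for phi, f in zip(phis, field)) / len(phis)


# Three signals multiplexed on modes l = 1, -2 and 3 (illustrative amplitudes).
amplitudes = {1: 1 + 0j, -2: 0.5j, 3: -0.25 + 0j}
phis, field = oam_field(amplitudes)
recovered = {l: project(phis, field, l) for l in (1, -2, 3, 0)}
```

Projecting onto each mode recovers its own amplitude and gives approximately zero for the unused mode l = 0, so the modes act as independent channels in this idealized setting.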

    Plain-to-clear speech video conversion for enhanced intelligibility

    Clearly articulated speech, relative to plain-style speech, has been shown to improve intelligibility. We examine whether visible speech cues in video alone can be systematically modified to enhance clear-speech visual features and improve intelligibility. We extract clear-speech visual features of English words varying in vowels produced by multiple male and female talkers. Via a frame-by-frame, image-warping-based video generation method with a controllable parameter (displacement factor), we apply the extracted clear-speech visual features to videos of plain speech to synthesize clear-speech videos. We evaluate the generated videos using a robust, state-of-the-art AI lip reader as well as human intelligibility testing. The contributions of this study are: (1) we successfully extract relevant visual cues for video modifications across speech styles, and have achieved enhanced intelligibility for AI; (2) this work suggests that universal talker-independent clear-speech features may be utilized to modify any talker's visual speech style; (3) we introduce the "displacement factor" as a way of systematically scaling the magnitude of displacement modifications between speech styles; and (4) the high-definition generated videos are ideal candidates for human-centric intelligibility and perceptual training studies.
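
How a displacement factor might scale a modification can be sketched on facial landmarks. This is an assumption for illustration, not the method of the study. The function `warp_landmarks`, the landmark representation as (x, y) points, and the coordinates are all hypothetical, and the image warping itself is not shown.

```python
def warp_landmarks(plain, clear, displacement_factor):
    """Move each plain-speech landmark toward its clear-speech counterpart.

    A factor of 0 leaves the plain landmarks unchanged, 1 reaches the clear
    landmarks, and values above 1 exaggerate the displacement.
    """
    return [
        (px + displacement_factor * (cx - px), py + displacement_factor * (cy - py))
        for (px, py), (cx, cy) in zip(plain, clear)
    ]


# Hypothetical lip landmarks for one frame (pixel coordinates).
plain_frame = [(100.0, 200.0), (140.0, 200.0), (120.0, 210.0)]
clear_frame = [(96.0, 200.0), (144.0, 200.0), (120.0, 218.0)]

halfway = warp_landmarks(plain_frame, clear_frame, 0.5)
exaggerated = warp_landmarks(plain_frame, clear_frame, 1.5)
```

In this sketch a single scalar controls how strongly the clear-speech displacement is applied to each frame.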

    Improving Dysarthric Speech Recognition by Enriching Training Datasets

    Dysarthria is a motor speech disorder that results from disruptions in the neuro-motor interface. It is characterised by poor articulation of phonemes and hyper-nasality, and it is characteristically different from normal speech. Many modern automatic speech recognition (ASR) systems focus on a narrow range of speech diversity; as a consequence, when training datasets are built, they exclude groups of speakers who deviate in aspects of gender, race, age and speech impairment. This study attempts to develop an ASR system that handles dysarthric speech using only limited dysarthric speech data. Speech utterances collected from the TORGO database are used to conduct experiments on a wav2vec2.0 model trained only on the Librispeech 960h dataset, to obtain a baseline word error rate (WER) when recognising dysarthric speech. A version of the Librispeech model fine-tuned on multi-language datasets was tested to see if it would improve accuracy, and it achieved a top reduction of 24.15% in the WER for one of the male dysarthric speakers in the dataset. Transfer learning with speech recognition models, and preprocessing dysarthric speech with generative adversarial networks to improve its intelligibility, were limited in their potential by the lack of a dysarthric speech dataset of adequate size for these technologies. The main conclusion drawn from this study is that a large, diverse dysarthric speech dataset, comparable in size to datasets used to train machine learning ASR systems such as Librispeech and covering different types of speech, scripted and unscripted, is required to improve performance.
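
For readers unfamiliar with the metric, the word error rate can be computed as the word-level edit distance divided by the number of reference words. The sketch below is a generic implementation, not the evaluation code of the study. The function name `wer` and the example sentences are hypothetical.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Here `score` is 2/6. Because insertions count as errors, the WER can exceed 1.0 when the hypothesis is much longer than the reference.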