
    A System for Simultaneous Translation of Lectures and Speeches

    This thesis realizes the first automatic system for simultaneous speech-to-speech translation. The focus of this system is the automatic translation of (technically oriented) lectures and speeches from English to Spanish, but the different aspects described in this thesis will also be helpful for developing simultaneous translation systems for other domains or languages.

    Learning strategies for improving neural networks for image segmentation under class imbalance

    This thesis aims to improve convolutional neural networks (CNNs) for image segmentation under class imbalance, i.e., when the class distributions of the training dataset are highly unequal. We focus in particular on medical image segmentation because of its imbalanced nature and clinical importance. Based on our observations of model behaviour, we argue that CNNs cannot generalize well on imbalanced segmentation tasks, mainly for two counterintuitive reasons. First, CNNs are prone to overfit the under-represented foreground classes, memorizing the regions of interest (ROIs) in the training data precisely because they are so rare. Second, CNNs can underfit the heterogeneous background class, as it is difficult to learn from samples with such diverse and complex characteristics. These behaviours are not limited to specific loss functions. To address these limitations, we first propose novel asymmetric variants of popular loss functions and regularization techniques, explicitly designed to increase the variance of foreground samples and thereby counter overfitting under class imbalance. Secondly, we propose context label learning (CoLab) to tackle background underfitting by automatically decomposing the background class into several subclasses; this is achieved by optimizing an auxiliary task generator to produce context labels such that the main network achieves good ROI segmentation performance. We then propose a meta-learning-based automatic data augmentation framework that balances foreground and background samples to alleviate class imbalance. Specifically, we learn class-specific training-time data augmentation (TRA) and jointly optimize TRA with test-time data augmentation (TEA), effectively aligning the training and test data distributions for better generalization. Finally, we explore how to estimate model performance under domain shift when training on imbalanced datasets. We propose class-specific variants of existing confidence-based model evaluation methods that adapt separate parameters per class, enabling class-wise calibration to reduce model bias towards the minority classes.
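
    As a hedged illustration of what an asymmetric loss of this kind can look like, the sketch below applies a focal-style focusing term only to background pixels, so that the rare foreground (ROI) pixels keep their full gradient. The binary setting, the focal formulation and all names are assumptions made for the example, not the exact variants proposed in the thesis.

```python
import torch
import torch.nn.functional as F

def asymmetric_focal_loss(logits, targets, gamma=2.0, fg_class=1):
    """Illustrative asymmetric focal-style loss for binary segmentation.

    The focusing term (1 - p_t)**gamma down-weights easy pixels, but here it
    is applied only where the true label is background, so well-classified
    foreground (ROI) pixels keep their full gradient.

    logits:  (N, 2, H, W) raw network outputs
    targets: (N, H, W) integer class labels in {0, 1}
    """
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()
    # probability and log-probability assigned to the true class of each pixel
    idx = targets.unsqueeze(1)
    pt = probs.gather(1, idx).squeeze(1)
    log_pt = log_probs.gather(1, idx).squeeze(1)
    # focusing factor applied to background pixels only (the asymmetry)
    is_bg = (targets != fg_class).float()
    weight = is_bg * (1.0 - pt) ** gamma + (1.0 - is_bg)
    return -(weight * log_pt).mean()
```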

    "Can you hear me now?":Automatic assessment of background noise intrusiveness and speech intelligibility in telecommunications

    This thesis deals with signal-based methods that predict how listeners perceive speech quality in telecommunications. Such tools, called objective quality measures, are of great interest in the telecommunications industry to evaluate how new or deployed systems affect the end-user quality of experience. Two widely used measures, ITU-T Recommendations P.862 "PESQ" and P.863 "POLQA", predict the overall listening quality of a speech signal as it would be rated by an average listener, but do not provide further insight into the composition of that score. This is in contrast to modern telecommunication systems, in which components such as noise reduction or speech coding process speech and non-speech signal parts differently. Therefore, there has been a growing interest in objective measures that assess different quality features of speech signals, allowing for a more nuanced analysis of how these components affect quality. In this context, the present thesis addresses the objective assessment of two quality features: background noise intrusiveness and speech intelligibility. The perception of background noise is investigated with newly collected datasets, including signals that go beyond the traditional telephone bandwidth, as well as Lombard (effortful) speech. We analyze listener scores for noise intrusiveness, and their relation to scores for perceived speech distortion and overall quality. We then propose a novel objective measure of noise intrusiveness that uses a sparse representation of noise as a model of high-level auditory coding. The proposed approach is shown to yield results that correlate highly with listener scores, without requiring training data. With respect to speech intelligibility, we focus on the case where the signal is degraded by strong background noise or very low bit-rate coding. Considering that listeners use prior linguistic knowledge in assessing intelligibility, we propose an objective measure that works at the phoneme level and compares phoneme class-conditional probability estimates. The proposed approach is evaluated on a large corpus of recordings from public safety communication systems that use low bit-rate coding, and is further extended to the assessment of synthetic speech, showing its applicability to a wide range of distortion types. The effectiveness of both measures is evaluated with standardized performance metrics, using corpora that follow established recommendations for subjective listening tests.
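
    To make the phoneme-level idea concrete, the following minimal sketch compares per-frame phoneme class-conditional probability estimates of a reference and a degraded signal using a symmetrised KL divergence. The specific divergence, the frame-alignment assumption and the function name are illustrative choices, not the measure actually proposed in the thesis.

```python
import numpy as np

def posterior_divergence(post_ref, post_deg, eps=1e-12):
    """Compare per-frame phoneme posteriors of a reference and a degraded
    signal (illustrative sketch, not the thesis's exact measure).

    post_ref, post_deg: (T, K) arrays of phoneme class posteriors for T
    time-aligned frames and K phoneme classes.
    Returns the symmetrised KL divergence averaged over frames; lower values
    suggest the degraded signal preserves more of the phonetic content.
    """
    p = np.clip(post_ref, eps, None)
    q = np.clip(post_deg, eps, None)
    p = p / p.sum(axis=1, keepdims=True)   # renormalise after clipping
    q = q / q.sum(axis=1, keepdims=True)
    kl_pq = np.sum(p * np.log(p / q), axis=1)
    kl_qp = np.sum(q * np.log(q / p), axis=1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))
```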

    Contributions to speech processing and ambient sound analysis

    We are constantly surrounded by sounds that we continuously exploit to adapt our actions to the situations we are facing. Some sounds, like speech, have a particular structure from which we can infer information, explicit or not. This is one reason why speech is possibly the most intuitive way for humans to communicate. Within the last decade, there has been significant progress in the domain of speech and audio processing, and in particular in machine learning applied to speech and audio processing. Thanks to this progress, speech has become a central element in many tools for distant human-to-human communication as well as in human-machine communication systems. These solutions work quite well on clean speech or under controlled conditions. However, in scenarios that involve acoustic perturbations such as noise or reverberation, system performance tends to degrade severely. In this thesis we focus on processing speech and its environment from an audio perspective. The algorithms proposed here rely on a variety of solutions, from signal-processing-based approaches to data-driven solutions based on supervised matrix factorization or deep neural networks. We propose solutions to problems ranging from speech recognition to speech enhancement and ambient sound analysis. The goal is to offer a panorama of the different aspects that could improve a speech processing algorithm operating in real environments. We start by describing automatic speech recognition as a potential end application and progressively unravel the limitations and the proposed solutions, ending with the more general problem of ambient sound analysis.
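
    Of the data-driven solutions mentioned above, supervised matrix factorization is compact enough to sketch. The snippet below shows a minimal NMF-based enhancement scheme in which speech and noise dictionaries are assumed to have been learned offline and only the activations are estimated at test time; the KL-divergence updates, the Wiener-style mask and all names are illustrative assumptions rather than the document's specific algorithms.

```python
import numpy as np

def supervised_nmf_mask(V, W_speech, W_noise, n_iter=100, eps=1e-12):
    """Sketch of speech enhancement via supervised NMF.

    V: (F, T) magnitude spectrogram of the noisy mixture.
    W_speech, W_noise: (F, Ks) and (F, Kn) dictionaries assumed to have been
    learned offline on clean speech and on noise, respectively.
    Returns an (F, T) soft mask estimating the speech component.
    """
    W = np.concatenate([W_speech, W_noise], axis=1)
    H = np.random.rand(W.shape[1], V.shape[1]) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        # KL-divergence multiplicative update with the dictionaries held fixed
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)
    Ks = W_speech.shape[1]
    S = W_speech @ H[:Ks] + eps    # speech magnitude estimate
    N = W_noise @ H[Ks:] + eps     # noise magnitude estimate
    return S / (S + N)             # Wiener-style soft mask in [0, 1]
```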

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The proceedings of the MAVEBA Workshop, held every two years, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and the classification of vocal pathologies.

    Audio source separation for music in low-latency and high-latency scenarios

    This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, crucial steps in many separation methods. We then use the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore the use of temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
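
    Tikhonov regularization is attractive in the low-latency context because the decomposition has a closed-form solution. The sketch below shows that closed form for a single spectrum frame; the basis construction and the variable names are illustrative assumptions rather than the thesis's exact formulation.

```python
import numpy as np

def tikhonov_decompose(spectrum, basis, lam=0.1):
    """Decompose one spectrum frame over a set of spectral templates using
    Tikhonov (ridge) regularisation:

        g = argmin_g ||spectrum - basis @ g||^2 + lam * ||g||^2
          = (basis^T basis + lam * I)^{-1} basis^T spectrum

    spectrum: (F,) magnitude spectrum of the current frame.
    basis:    (F, K) matrix whose columns are, e.g., pitch/component templates.
    Returns the (K,) activation vector g.
    """
    K = basis.shape[1]
    gram = basis.T @ basis + lam * np.eye(K)
    return np.linalg.solve(gram, basis.T @ spectrum)
```

    Because the regularised Gram matrix depends only on the (fixed) template basis, it can be factorised once and reused for every incoming frame, which keeps the per-frame cost low.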

    Pattern Recognition

    Pattern recognition is a very wide research field. It involves factors as diverse as sensors, feature extraction, pattern classification, decision fusion and applications, among others. The signals processed are commonly one-, two- or three-dimensional; the processing is done in real time or takes hours and days; some systems look for one narrow object class, while others search huge databases for entries with at least a small amount of similarity. No single person can claim expertise across the whole field, which develops rapidly, updates its paradigms and encompasses several philosophical approaches. This book reflects this diversity by presenting a selection of recent developments within the area of pattern recognition and related fields. It covers theoretical advances in classification and feature extraction as well as application-oriented work. The authors of these 25 works present and advocate recent achievements of their research related to the field of pattern recognition.

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    Affect-based information retrieval

    One of the main challenges Information Retrieval (IR) systems face nowadays originates from the semantic gap problem: the semantic difference between a user’s query representation and the internal representation of an information item in a collection. The gap is further widened when the user is driven by an ill-defined information need, often the result of an anomaly in his/her current state of knowledge. The formulated search queries, which are submitted to the retrieval system to locate relevant items, then produce poor results that do not address the user’s information needs. To deal with information-need uncertainty, IR systems have in the past employed a range of feedback techniques, varying from explicit to implicit. The first category necessitates the communication of explicit relevance judgments, in return for better query reformulations and recommendations of relevant results; however, this comes at the expense of users’ cognitive resources and introduces an additional layer of complexity to the search process. Implicit feedback techniques, on the other hand, infer what is relevant from observations of user search behaviour, disengaging users from the cognitive burden of document rating and relevance assessment. However, both categories of relevance feedback techniques determine topical relevance with respect to the cognitive and situational levels of interaction, failing to acknowledge the importance of emotions in cognition and decision making. In this thesis I investigate the role of emotions in the information seeking process and develop affective feedback techniques for interactive IR. This novel feedback framework aims to aid the search process and facilitate a more natural and meaningful interaction. I develop affective models that determine topical relevance based on information gathered from various sensory channels, and enhance their performance using personalisation techniques. Furthermore, I present an operational video retrieval system that employs affective feedback to enrich user profiles and offers meaningful recommendations of unseen videos. The use of affective feedback as a surrogate for the information need is formalised as the Affective Model of Browsing, a cognitive model that motivates the use of evidence extracted from the psycho-somatic mobilisation that occurs during cognitive appraisal. Finally, I address some of the ethical and privacy issues that arise from the social-emotional interaction between users and computer systems. The study involves questionnaire data gathered over three user studies, from 74 participants of different educational backgrounds, ethnicities and search experience. The results show that affective feedback is a promising area of research that can improve many aspects of the information seeking process, such as indexing, ranking and recommendation. Eventually, relevance inferences obtained from affective models may provide a more robust and personalised form of feedback, allowing us to deal more effectively with issues such as the semantic gap.
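
    For context on the explicit feedback techniques discussed above, the classic Rocchio reformulation is the textbook example of turning explicit relevance judgments into a better query. The sketch below illustrates that standard technique only; it is not one of the affective models developed in the thesis, and all names and weights are illustrative.

```python
import numpy as np

def rocchio_update(query_vec, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio query reformulation from explicit relevance judgments.

    query_vec:    (D,) term-weight vector of the original query
    relevant:     (R, D) vectors of documents judged relevant
    non_relevant: (S, D) vectors of documents judged non-relevant
    Returns the reformulated query vector, with negative weights clipped to 0.
    """
    new_q = alpha * query_vec
    if len(relevant):
        new_q = new_q + beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        new_q = new_q - gamma * np.mean(non_relevant, axis=0)
    return np.maximum(new_q, 0.0)
```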