660 research outputs found

    An Automatic Real-time Synchronization of Live speech with Its Transcription Approach

    Get PDF
    Most studies in automatic synchronization of speech and transcription focus on the synchronization at the sentence level or the phrase level. Nevertheless, in some languages, like Thai, boundaries of such levels are difficult to linguistically define, especially in case of the synchronization of speech and its transcription. Consequently, the synchronization at a finer level like the syllabic level is promising. In this article, an approach to synchronize live speech with its corresponding transcription in real time at the syllabic level is proposed. Our approach employs the modified real-time syllable detection procedure from our previous work and the transcription verification procedure then adopts to verify correctness and to recover errors caused by the real-time syllable detection procedure. In experiments, the acoustic features and the parameters are customized empirically. Results are compared with two baselines which have been applied to the Thai scenario. Experimental results indicate that, our approach outperforms two baselines with error rate reduction of 75.9% and 41.9% respectively and also can provide results in the real-time situation. Besides, our approach is applied to the practical application, namely ChulaDAISY. Practical experiments show that ChulaDAISY applied with our approach could reduce time consumption for producing audio books

    Multiple Media Correlation: Theory and Applications

    Get PDF
    This thesis introduces multiple media correlation, a new technology for the automatic alignment of multiple media objects such as text, audio, and video. This research began with the question: what can be learned when multiple multimedia components are analyzed simultaneously? Most ongoing research in computational multimedia has focused on queries, indexing, and retrieval within a single media type. Video is compressed and searched independently of audio, text is indexed without regard to temporal relationships it may have to other media data. Multiple media correlation provides a framework for locating and exploiting correlations between multiple, potentially heterogeneous, media streams. The goal is computed synchronization, the determination of temporal and spatial alignments that optimize a correlation function and indicate commonality and synchronization between media objects. The model also provides a basis for comparison of media in unrelated domains. There are many real-world applications for this technology, including speaker localization, musical score alignment, and degraded media realignment. Two applications, text-to-speech alignment and parallel text alignment, are described in detail with experimental validation. Text-to-speech alignment computes the alignment between a textual transcript and speech-based audio. The presented solutions are effective for a wide variety of content and are useful not only for retrieval of content, but in support of automatic captioning of movies and video. Parallel text alignment provides a tool for the comparison of alternative translations of the same document that is particularly useful to the classics scholar interested in comparing translation techniques or styles. The results presented in this thesis include (a) new media models more useful in analysis applications, (b) a theoretical model for multiple media correlation, (c) two practical application solutions that have wide-spread applicability, and (d) Xtrieve, a multimedia database retrieval system that demonstrates this new technology and demonstrates application of multiple media correlation to information retrieval. This thesis demonstrates that computed alignment of media objects is practical and can provide immediate solutions to many information retrieval and content presentation problems. It also introduces a new area for research in media data analysis

    The rhythm of therapy: psychophysiological synchronization in clinical dyads

    Get PDF
    Rhythmicity and synchronization are fundamental mechanisms employed by countless natural phenomena to communicate. Previous research has found evidence for synchronization in patients and therapists during clinical activity, for instance in their body movements (Ramseyer & Tschacher, 2011) and physiological activations (e.g. Marci et al, 2007; Kleinbub et al., 2012; Messina et al., 2013). While this phenomenon has been found associated with different important aspects of clinical relationship, such as empathy, rapport, and outcome, and many authors suggested that it may describe crucial dimensions of the therapeutic dyad interaction and change, a clear explanation of its meaning is still lacking. The goals of the present work were to: 1) Provide a solid theoretical and epistemological background, in which to inscribe the phenomenon. This was pursued by crossing neurophenomenology’s sophisticated ideas on mind-body integration (Varela, 1996) and Infant Research’s detailed observations on development of infants’ Self through their relationships. The common ground for this connection was the complex systems theory (von Bertalanffy, 1968; Haken, 2006). 2) Contribute to literature through two replications of existing studies (Kleinbub et al., 2012; Messina et al., 2013) on skin conductance (SC) synchronization. In addition to the original designs, secure attachment priming (Mikulincer & Shaver, 2007) was introduced to explore if observed SC linkage was susceptible to manipulation, accordingly to the developmental premises defined in the theoretical chapters. Study 1 focused on synchrony between students and psychotherapists in simulated clinical sessions; Study 2 reprised the same methodology with two principal changes: first the clinician’s role was played by psychologists without further clinical trainings, and second, each psychologist was involved in two distinct interviews, in order to assess the impact of individual characteristics on SC synchrony. 3) Provide an ideographical exploration of the psychotherapy processes linked to matched SC activity. In study 3 the highest and lowest synchrony sequences of 6 sessions of psychodynamic psychotherapy were subject of a detailed phenomenological content analysis. These micro-categories were synthetized in more abstract ones, in order to attempt the recognizing of regularities that could shed light on the phenomenon. 4) To explore the pertinence of employing mathematical properties derived from the application of system theory in psychological contexts. In study 4, Shannon’s entropy and order equations (1948) were applied on the transcribed verbal content of 12 depression psychotherapies, to assess both intra-personal and inter-personal (dyad) order in verbal categories. Results from these studies provided further evidence for the existence of a synchronization mechanism in the clinical dyads. Furthermore the various findings were generally supporting the dyad system theoretical model, and its description of regulatory dynamics as a good explanation of the synchronization phenomena. Discrepancies with previous literature highlighted the need for further studies to embrace more methodological sophistication (such as employing lag analysis), and cautiousness in the interpretation of results

    Complexity Science in Human Change

    Get PDF
    This reprint encompasses fourteen contributions that offer avenues towards a better understanding of complex systems in human behavior. The phenomena studied here are generally pattern formation processes that originate in social interaction and psychotherapy. Several accounts are also given of the coordination in body movements and in physiological, neuronal and linguistic processes. A common denominator of such pattern formation is that complexity and entropy of the respective systems become reduced spontaneously, which is the hallmark of self-organization. The various methodological approaches of how to model such processes are presented in some detail. Results from the various methods are systematically compared and discussed. Among these approaches are algorithms for the quantification of synchrony by cross-correlational statistics, surrogate control procedures, recurrence mapping and network models.This volume offers an informative and sophisticated resource for scholars of human change, and as well for students at advanced levels, from graduate to post-doctoral. The reprint is multidisciplinary in nature, binding together the fields of medicine, psychology, physics, and neuroscience

    MediaSync: Handbook on Multimedia Synchronization

    Get PDF
    This book provides an approachable overview of the most recent advances in the fascinating field of media synchronization (mediasync), gathering contributions from the most representative and influential experts. Understanding the challenges of this field in the current multi-sensory, multi-device, and multi-protocol world is not an easy task. The book revisits the foundations of mediasync, including theoretical frameworks and models, highlights ongoing research efforts, like hybrid broadband broadcast (HBB) delivery and users' perception modeling (i.e., Quality of Experience or QoE), and paves the way for the future (e.g., towards the deployment of multi-sensory and ultra-realistic experiences). Although many advances around mediasync have been devised and deployed, this area of research is getting renewed attention to overcome remaining challenges in the next-generation (heterogeneous and ubiquitous) media ecosystem. Given the significant advances in this research area, its current relevance and the multiple disciplines it involves, the availability of a reference book on mediasync becomes necessary. This book fills the gap in this context. In particular, it addresses key aspects and reviews the most relevant contributions within the mediasync research space, from different perspectives. Mediasync: Handbook on Multimedia Synchronization is the perfect companion for scholars and practitioners that want to acquire strong knowledge about this research area, and also approach the challenges behind ensuring the best mediated experiences, by providing the adequate synchronization between the media elements that constitute these experiences

    Multi-modal surrogates for retrieving and making sense of videos: is synchronization between the multiple modalities optimal?

    Get PDF
    Video surrogates can help people quickly make sense of the content of a video before downloading or seeking more detailed information. Visual and audio features of a video are primary information carriers and might become important components of video retrieval and video sense-making. In the past decades, most research and development efforts on video surrogates have focused on visual features of the video, and comparatively little work has been done on audio surrogates and examining their pros and cons in aiding users' retrieval and sense-making of digital videos. Even less work has been done on multi-modal surrogates, where more than one modality are employed for consuming the surrogates, for example, the audio and visual modalities. This research examined the effectiveness of a number of multi-modal surrogates, and investigated whether synchronization between the audio and visual channels is optimal. A user study was conducted to evaluate six different surrogates on a set of six recognition and inference tasks to answer two main research questions: (1) How do automatically-generated multi-modal surrogates compare to manually-generated ones in video retrieval and video sense-making? and (2) Does synchronization between multiple surrogate channels enhance or inhibit video retrieval and video sense-making? Forty-eight participants participated in the study, in which the surrogates were measured on the the time participants spent on experiencing the surrogates, the time participants spent on doing the tasks, participants' performance accuracy on the tasks, participants' confidence in their task responses, and participants' subjective ratings on the surrogates. On average, the uncoordinated surrogates were more helpful than the coordinated ones, but the manually-generated surrogates were only more helpful than the automatically-generated ones in terms of task completion time. Participants' subjective ratings were more favorable for the coordinated surrogate C2 (Magic A + V) and the uncoordinated surrogate U1 (Magic A + Storyboard V) with respect to usefulness, usability, enjoyment, and engagement. The post-session questionnaire comments demonstrated participants' preference for the coordinated surrogates, but the comments also revealed the value of having uncoordinated sensory channels

    Bio-motivated features and deep learning for robust speech recognition

    Get PDF
    Mención Internacional en el título de doctorIn spite of the enormous leap forward that the Automatic Speech Recognition (ASR) technologies has experienced over the last five years their performance under hard environmental condition is still far from that of humans preventing their adoption in several real applications. In this thesis the challenge of robustness of modern automatic speech recognition systems is addressed following two main research lines. The first one focuses on modeling the human auditory system to improve the robustness of the feature extraction stage yielding to novel auditory motivated features. Two main contributions are produced. On the one hand, a model of the masking behaviour of the Human Auditory System (HAS) is introduced, based on the non-linear filtering of a speech spectro-temporal representation applied simultaneously to both frequency and time domains. This filtering is accomplished by using image processing techniques, in particular mathematical morphology operations with an specifically designed Structuring Element (SE) that closely resembles the masking phenomena that take place in the cochlea. On the other hand, the temporal patterns of auditory-nerve firings are modeled. Most conventional acoustic features are based on short-time energy per frequency band discarding the information contained in the temporal patterns. Our contribution is the design of several types of feature extraction schemes based on the synchrony effect of auditory-nerve activity, showing that the modeling of this effect can indeed improve speech recognition accuracy in the presence of additive noise. Both models are further integrated into the well known Power Normalized Cepstral Coefficients (PNCC). The second research line addresses the problem of robustness in noisy environments by means of the use of Deep Neural Networks (DNNs)-based acoustic modeling and, in particular, of Convolutional Neural Networks (CNNs) architectures. A deep residual network scheme is proposed and adapted for our purposes, allowing Residual Networks (ResNets), originally intended for image processing tasks, to be used in speech recognition where the network input is small in comparison with usual image dimensions. We have observed that ResNets on their own already enhance the robustness of the whole system against noisy conditions. Moreover, our experiments demonstrate that their combination with the auditory motivated features devised in this thesis provide significant improvements in recognition accuracy in comparison to other state-of-the-art CNN-based ASR systems under mismatched conditions, while maintaining the performance in matched scenarios. The proposed methods have been thoroughly tested and compared with other state-of-the-art proposals for a variety of datasets and conditions. The obtained results prove that our methods outperform other state-of-the-art approaches and reveal that they are suitable for practical applications, specially where the operating conditions are unknown.El objetivo de esta tesis se centra en proponer soluciones al problema del reconocimiento de habla robusto; por ello, se han llevado a cabo dos líneas de investigación. En la primera líınea se han propuesto esquemas de extracción de características novedosos, basados en el modelado del comportamiento del sistema auditivo humano, modelando especialmente los fenómenos de enmascaramiento y sincronía. En la segunda, se propone mejorar las tasas de reconocimiento mediante el uso de técnicas de aprendizaje profundo, en conjunto con las características propuestas. Los métodos propuestos tienen como principal objetivo, mejorar la precisión del sistema de reconocimiento cuando las condiciones de operación no son conocidas, aunque el caso contrario también ha sido abordado. En concreto, nuestras principales propuestas son los siguientes: Simular el sistema auditivo humano con el objetivo de mejorar la tasa de reconocimiento en condiciones difíciles, principalmente en situaciones de alto ruido, proponiendo esquemas de extracción de características novedosos. Siguiendo esta dirección, nuestras principales propuestas se detallan a continuación: • Modelar el comportamiento de enmascaramiento del sistema auditivo humano, usando técnicas del procesado de imagen sobre el espectro, en concreto, llevando a cabo el diseño de un filtro morfológico que captura este efecto. • Modelar el efecto de la sincroní que tiene lugar en el nervio auditivo. • La integración de ambos modelos en los conocidos Power Normalized Cepstral Coefficients (PNCC). La aplicación de técnicas de aprendizaje profundo con el objetivo de hacer el sistema más robusto frente al ruido, en particular con el uso de redes neuronales convolucionales profundas, como pueden ser las redes residuales. Por último, la aplicación de las características propuestas en combinación con las redes neuronales profundas, con el objetivo principal de obtener mejoras significativas, cuando las condiciones de entrenamiento y test no coinciden.Programa Oficial de Doctorado en Multimedia y ComunicacionesPresidente: Javier Ferreiros López.- Secretario: Fernando Díaz de María.- Vocal: Rubén Solera Ureñ

    Models and Analysis of Vocal Emissions for Biomedical Applications

    Get PDF
    The MAVEBA Workshop proceedings, held on a biannual basis, collect the scientific papers presented both as oral and poster contributions, during the conference. The main subjects are: development of theoretical and mechanical models as an aid to the study of main phonatory dysfunctions, as well as the biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and classification of vocal pathologies

    Understanding interactive behaviour : a quantitative approach.

    Get PDF
    SIGLEAvailable from British Library Document Supply Centre-DSC:DXN029243 / BLDSC - British Library Document Supply CentreGBUnited Kingdo
    • …
    corecore