262 research outputs found

    An Analysis of Speech Enhancement and Recognition Losses in Limited Resources Multi-talker Single Channel Audio-Visual ASR

    In this paper, we analyze how audio-visual speech enhancement can help perform the ASR task in a cocktail party scenario. To this end, we consider two simple end-to-end LSTM-based models that perform single-channel audio-visual speech enhancement and phone recognition, respectively. We then study how the two models interact and how training them jointly affects the final result. We analyze different training strategies, which reveal some interesting and unexpected behaviors. The experiments show that while the ASR task is being optimized, the speech enhancement capability of the model decreases significantly, and vice versa. Nevertheless, jointly optimizing the two tasks yields a remarkable drop in Phone Error Rate (PER) compared to the audio-visual baseline models trained only to perform phone recognition. We analyze the behavior of the proposed models on two limited-size datasets, namely the mixed-speech versions of GRID and TCD-TIMIT.
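
The abstract above reports results as Phone Error Rate (PER). As a rough illustration only, PER is commonly computed as the Levenshtein (edit) distance between the reference and hypothesized phone sequences, divided by the reference length. The sketch below assumes that standard definition; the function names and the example phone labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    # prev[j] holds the distance between the first i-1 reference phones
    # and the first j hypothesis phones.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1      # reference phone missing in hyp
            insertion = curr[j - 1] + 1  # extra phone in hyp
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = edit distance / number of reference phones (assumed definition)."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical phone labels, for illustration only.
    reference = ["b", "ih", "n", "b", "l", "uw"]
    hypothesis = ["b", "ih", "n", "l", "uw"]
    print(phone_error_rate(reference, hypothesis))
```

Under this definition, PER can exceed 1.0 when the hypothesis contains many insertions, which is why it is usually reported as a rate rather than an accuracy.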

    Između istorije i sistema: pojam kulture Hajnriha Rikerta

    The paper reconstructs the concept of culture that emerges from Heinrich Rickert’s neo-Kantianism, uncovering its major historical-problematic, methodological, and philosophical implications. The central theme of the first section is the idea that modern culture is uniquely characterized by “fragmentation”. It also unpacks the programme of Rickert’s philosophy of culture, which pursues the task of reconstructing the lost unity of culture. The second section explains the methodological implications of the problematic relationship between value and reality established in cultural goods and evaluations. Finally, the third section reconstructs the Rickertian system of values, with its peculiar effort to reconcile historicity and value absoluteness. The last part develops a critical discussion of the Rickertian project.

    La filosofia della cultura. Genesi e prospettive

    This volume collects the contributions that scholars from a wide range of backgrounds, Italian and foreign, have devoted to a theme of fundamental importance for our times. The object “culture”, also a central theme of Cassirer's philosophy, is read, analyzed, and presented as a problematic node that is nonetheless rich in fruitful and timely ideas, from multiple theoretical perspectives and diverse disciplinary fields.

    Audio-Visual Speech Inpainting with Deep Learning

    In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, whose structure differs markedly from that of speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different durations. We also experiment with a multi-task learning approach in which a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly as gaps get larger, while the proposed audio-visual approach is able to plausibly restore the missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.

    Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras

    We propose a method to address audio-visual target speaker enhancement in multi-talker environments using event-driven cameras. State-of-the-art audio-visual speech separation methods show that the crucial information is the movement of the facial landmarks related to speech production. However, all approaches proposed so far work offline, using frame-based video input, which makes it difficult to process an audio-visual signal with low latency for online applications. To overcome this limitation, we propose the use of event-driven cameras and exploit their compression, high temporal resolution, and low latency for low-cost, low-latency motion feature extraction, moving towards online embedded audio-visual speech processing. We use the event-driven optical flow estimation of the facial landmarks as input to a stacked Bidirectional LSTM trained to predict an Ideal Amplitude Mask, which is then used to filter the noisy audio and obtain the audio signal of the target speaker. The presented approach performs almost on par with the frame-based approach, with very low latency and computational cost. Comment: Accepted at ISCAS 202
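
The abstract above mentions an Ideal Amplitude Mask that is used to filter the noisy audio. As a minimal sketch, assuming the common definition of the mask as the ratio between the clean and mixture magnitude spectrograms, the training target and its application might look as follows. The function names, the `eps` stabilizer, and the clipping value of 10 are assumptions made for illustration and are not taken from the paper; in the described system the mask would be predicted by the network rather than computed from the clean signal.

```python
import numpy as np


def ideal_amplitude_mask(clean_mag: np.ndarray,
                         mix_mag: np.ndarray,
                         eps: float = 1e-8,
                         max_value: float = 10.0) -> np.ndarray:
    """Ratio of clean to mixture magnitude per time-frequency bin.

    eps avoids division by zero; max_value bounds the mask
    (both are illustrative choices, not values from the paper).
    """
    mask = clean_mag / (mix_mag + eps)
    return np.clip(mask, 0.0, max_value)


def apply_mask(mask: np.ndarray, mix_stft: np.ndarray) -> np.ndarray:
    """Scale the mixture magnitude by the mask, reusing the mixture phase."""
    return mask * mix_stft


if __name__ == "__main__":
    # Toy complex spectrograms (2 frequency bins x 2 frames), illustration only.
    target = np.array([[1 + 0j, 0 + 2j], [3 + 0j, 0.5 + 0j]])
    interferer = np.array([[0.5 + 0j, 1 + 0j], [0 + 1j, 0.5 + 0j]])
    mixture = target + interferer

    mask = ideal_amplitude_mask(np.abs(target), np.abs(mixture))
    estimate = apply_mask(mask, mixture)
    # The estimate matches the target in magnitude but keeps the mixture phase.
    print(np.abs(estimate))
```

Because the mask only rescales magnitudes, the recovered signal inherits the phase of the noisy mixture.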

    Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras

    In this work, we propose a new method to address audio-visual target speaker extraction in multi-talker environments using event-driven cameras. All audio-visual speech separation approaches use frame-based video to extract visual features. However, frame-based cameras usually work at 30 frames per second, which makes it difficult to process an audio-visual signal with low latency. To overcome this limitation, we propose using event-driven cameras because of their high temporal resolution and low latency. Recent work showed that landmark motion features are very important for obtaining good results in audio-visual speech separation. Thus, we use event-driven vision sensors, from which motion can be extracted at lower latency and computational cost. A stacked Bidirectional LSTM is trained to predict an Ideal Amplitude Mask, which is followed by post-processing to obtain a clean audio signal. The performance of our model is close to that of frame-based approaches.

    La filosofia della cultura: genesi e prospettive

    This volume collects the contributions that scholars from a wide range of backgrounds, Italian and foreign, have devoted to a theme of fundamental importance for our times. The object “culture”, also a central theme of Cassirer's philosophy, is read, analyzed, and presented as a problematic node that is nonetheless rich in fruitful and timely ideas, from multiple theoretical perspectives and diverse disciplinary fields.

    Cortical thickness of primary visual cortex correlates with motion deficits in periventricular leukomalacia

    Impairments of visual motion perception and, in particular, of flow motion have been consistently observed in premature and very low birth weight subjects during infancy. Flow motion information is analyzed at various cortical levels along the dorsal pathways, with information mainly provided by primary and early visual cortex (V1, V2 and V3). We investigated the cortical stage of visual processing that underlies these motion impairments by measuring Grey Matter Volume and Cortical Thickness in 13 children with Periventricular Leukomalacia (PVL). The cortical thickness of area V1, but not its grey matter volume, correlates negatively with motion coherence sensitivity, indicating that the thinner the cortex, the better the performance among the patients. However, we did not find any such association with either the thickness or the volume of areas MT, MST, and areas of the IPS, suggesting damage at the level of primary visual cortex or along the optic radiation.

    An Experimental Review of Speaker Diarization methods with application to Two-Speaker Conversational Telephone Speech recordings

    We performed an experimental review of current diarization systems for the conversational telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms belonging to the clustering-based, end-to-end neural diarization (EEND), and speech separation guided diarization (SSGD) paradigms. We studied the inference-time computational requirements and diarization accuracy on four CTS datasets with different characteristics and languages. We found that, among all methods considered, EEND-vector clustering (EEND-VC) offers the best trade-off between computing requirements and performance. More generally, EEND models were found to be lighter and faster at inference than clustering-based methods. However, they also require a large amount of diarization-oriented annotated data. In particular, EEND-VC performance in our experiments degraded when the dataset size was reduced, whereas self-attentive EEND (SA-EEND) was less affected. We also found that SA-EEND gives less consistent results across the datasets than EEND-VC, with its performance degrading on long conversations with high speech sparsity. Clustering-based diarization systems, and in particular VBx, instead perform more consistently than SA-EEND but are outperformed by EEND-VC. The gap with respect to the latter is reduced when overlap-aware clustering methods are considered. SSGD is the most computationally demanding method, but it could be convenient if speech recognition has to be performed. Its performance is close to that of SA-EEND but degrades significantly when the training and inference data characteristics are less matched. Comment: 52 pages, 10 figure