105 research outputs found

    Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling

    The study of speech disorders can benefit greatly from time-aligned data. However, audio-text mismatches in disfluent speech cause rapid performance degradation for modern speech aligners, hindering the use of automatic approaches. In this work, we propose a simple and effective modification of the alignment graph construction of CTC-based models using Weighted Finite State Transducers. The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment. During graph construction, we allow the modeling of common speech disfluencies, i.e., repetitions and omissions. Further, we show that by assessing the degree of audio-text mismatch through the Oracle Error Rate, our method can be used effectively in the wild. Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements, particularly for recall, achieving a 23-25% relative improvement over our baselines.
    Comment: Interspeech 2023
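
    A minimal sketch of the idea behind such a disfluency-tolerant graph (illustrative only; the state layout, arc costs, and names below are assumptions, not the authors' implementation): each transcript token can be skipped at a cost (omission) or re-entered via a backward epsilon arc (repetition), on top of the usual left-to-right topology.

    # Illustrative sketch of a disfluency-tolerant alignment graph; names and
    # costs are assumptions, not the paper's code. Each transcript token can
    # be skipped (omission) or re-entered (repetition) at a penalty.

    EPS = "<eps>"  # epsilon label: consumes no audio token

    def disfluency_graph(tokens, skip_cost=2.0, repeat_cost=1.5):
        """Return WFST arcs (src, dst, label, weight) over states 0..len(tokens)."""
        arcs = []
        for i, tok in enumerate(tokens):
            arcs.append((i, i + 1, tok, 0.0))          # canonical transition
            arcs.append((i, i + 1, EPS, skip_cost))    # omission: skip this token
            arcs.append((i + 1, i, EPS, repeat_cost))  # repetition: re-enter token
        return arcs

    # "Please" as phonemes; a stutter (P P L IY Z) can be absorbed by the
    # backward epsilon arc instead of distorting the canonical path.
    for arc in disfluency_graph(["P", "L", "IY", "Z"]):
        print(arc)

    Composing a graph like this with CTC emission scores would let an aligner absorb disfluent regions rather than forcing them onto the canonical transcript's timing.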

    Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos

    The recent state of the art on monocular 3D face reconstruction from image data has made some impressive advancements, thanks to the advent of Deep Learning. However, it has mostly focused on input coming from a single RGB image, overlooking the following important factors: a) Nowadays, the vast majority of facial image data of interest do not originate from single images but rather from videos, which contain rich dynamic information. b) Furthermore, these videos typically capture individuals in some form of verbal communication (public talks, teleconferences, audiovisual human-computer interactions, interviews, monologues/dialogues in movies, etc.). When existing 3D face reconstruction methods are applied to such videos, the artifacts in the reconstruction of the shape and motion of the mouth area are often severe, since they do not match well with the speech audio. To overcome the aforementioned limitations, we present the first method for visual speech-aware perceptual reconstruction of 3D mouth expressions. We do this by proposing a "lipread" loss, which guides the fitting process so that the elicited perception from the 3D reconstructed talking head resembles that of the original video footage. We demonstrate that, interestingly, the lipread loss is better suited for 3D reconstruction of mouth movements than traditional landmark losses, and even direct 3D supervision. Furthermore, the devised method does not rely on any text transcriptions or corresponding audio, rendering it ideal for training on unlabeled datasets. We verify the efficacy of our method through exhaustive objective evaluations on three large-scale datasets, as well as subjective evaluation with two web-based user studies.
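
    A rough sketch of what a lipread-style perceptual loss could look like in PyTorch; the lipreader network, the mouth-crop tensor shapes, and the cosine-distance choice are all assumptions standing in for the paper's specifics.

    import torch
    import torch.nn.functional as F

    def lipread_loss(lipreader, rendered_mouth, original_mouth):
        """Perceptual loss: make the rendered 3D mouth "read" like the real one.

        Both inputs are mouth-region clips of shape (batch, frames, C, H, W);
        `lipreader` is any pretrained visual speech model mapping clips to
        per-frame feature sequences of shape (batch, frames, D).
        """
        feat_rendered = lipreader(rendered_mouth)
        with torch.no_grad():              # the real video is a fixed target
            feat_original = lipreader(original_mouth)
        # Penalize perceptual mismatch via cosine distance in feature space.
        return (1.0 - F.cosine_similarity(feat_rendered, feat_original, dim=-1)).mean()

    Gradients flow only through the rendered branch, so the fitting process moves the 3D mouth toward whatever the lipreading network perceives in the original footage.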

    Face Active Appearance Modeling and Speech Acoustic Information to Recover Articulation

    Comparing time-division multiplexing and best-effort networks-on-chip

    Best-effort (BE) networks-on-chip (NOCs) are usually preferred over time-division multiplexed (TDM) NOCs in multi-core platforms because they are work-conserving and have lower (zero-load) latency. On the other hand, BE NOCs are significantly more expensive to implement than TDM NOCs because of their virtual channel buffers, allocators/arbiters, and (credit-based) flow control; functionality that a TDM NOC avoids altogether. The objective of this paper is to compare the performance of BE and TDM NOCs, taking hardware cost into consideration. The networks are compared using graphs showing average latency as a function of offered load. For the BE NOCs, we use the BookSim simulator, and for the TDM NOCs, we derive a queuing theory model and an associated TDM NOC simulator. Through experiments with both router architectures, packet length, link width, and different traffic patterns, we show that for the same hardware cost, a TDM NOC can provide higher bandwidth and comparable latency. We also show that packet length is the most important factor affecting the TDM period, which in turn is the primary factor affecting latency. The best TDM NOC design for BE traffic uses single-flit packets, wide links/flits, and a router with two pipeline stages: link and router traversal.
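
    As a back-of-the-envelope illustration of why the TDM period dominates latency (a simple stand-in, not the paper's queueing derivation), one can model a TDM channel as an M/D/1 queue whose service time equals the period, plus an average half-period wait for the channel's slot:

    def tdm_latency(load, period, hops, router_stages=2):
        """Rough average flit latency (cycles) for one TDM channel.

        load: offered load in flits/cycle on the channel; period: TDM period
        in cycles (one slot per period); hops: path length in routers.
        """
        rho = load * period                  # utilization of the channel's slot
        if rho >= 1.0:
            return float("inf")              # offered load exceeds slot capacity
        queueing = rho * period / (2.0 * (1.0 - rho))  # M/D/1 waiting time
        slot_wait = period / 2.0             # average wait for the next slot
        return queueing + slot_wait + hops * router_stages

    for load in (0.01, 0.05, 0.10):
        print(load, round(tdm_latency(load, period=8, hops=4), 1))

    In this toy model both the slot wait and the queueing term scale with the period, which matches the abstract's observation that packet length, by driving the TDM period, is the primary factor behind latency.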

    Smart subtitles for vocabulary learning

    Language learners often use subtitled videos to help them learn. However, standard subtitles are geared more towards comprehension than vocabulary learning, as translations are nonliteral and are provided only for phrases, not vocabulary. This paper presents Smart Subtitles, which are interactive subtitles tailored towards vocabulary learning. Smart Subtitles can be automatically generated from common video sources such as subtitled DVDs. They provide features such as vocabulary definitions on hover, and dialog-based video navigation. In our pilot study with intermediate learners studying Chinese, participants correctly defined over twice as many new words in a post-viewing vocabulary test when they used Smart Subtitles, compared to dual Chinese-English subtitles. Learners spent the same amount of time watching clips with each tool, and enjoyed viewing videos with Smart Subtitles as much as with dual subtitles. Learners understood videos equally well using either tool, as indicated by self-assessments and independent evaluations of their summaries.
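
    A toy sketch of the kind of data structure such a system might build from a subtitle track; the gloss dictionary and cue format below are illustrative stand-ins, not the paper's implementation.

    GLOSSES = {"你好": "hello", "世界": "world"}  # hypothetical lookup table

    def annotate_cue(start, end, words):
        """Return a subtitle cue whose words carry hover definitions."""
        return {
            "start": start,   # cue timings also support dialog-based
            "end": end,       # navigation: click a line, seek the player
            "words": [{"text": w, "gloss": GLOSSES.get(w)} for w in words],
        }

    print(annotate_cue("00:01:02.000", "00:01:04.500", ["你好", "世界"]))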

    Inversion from Audiovisual Speech to Articulatory Information by Exploiting Multimodal Data

    We present an inversion framework to identify speech production properties from audiovisual information. Our system is built on a multimodal articulatory dataset comprising ultrasound, X-ray, and magnetic resonance images, as well as audio and stereovisual recordings of the speaker. Visual information is captured via stereovision, while the vocal tract state is represented by a properly trained articulatory model. Inversion is based on an adaptive piecewise linear approximation of the audiovisual-to-articulation mapping. The presented system can recover the hidden vocal tract shapes and may serve as a basis for a more widely applicable inversion setup.
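
    One plausible reading of an adaptive piecewise linear mapping, sketched with scikit-learn (an illustration, not the authors' code): partition the audiovisual feature space with k-means, then fit one linear regressor from audiovisual features to articulatory parameters per region.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    class PiecewiseLinearInversion:
        """Fit one linear map per k-means region of the audiovisual space."""

        def __init__(self, n_pieces=16):
            self.kmeans = KMeans(n_clusters=n_pieces, n_init=10)
            self.models = {}

        def fit(self, av_features, articulation):
            self.n_targets = articulation.shape[1]
            labels = self.kmeans.fit_predict(av_features)
            for k in np.unique(labels):
                mask = labels == k
                self.models[k] = LinearRegression().fit(av_features[mask],
                                                        articulation[mask])
            return self

        def predict(self, av_features):
            labels = self.kmeans.predict(av_features)
            preds = np.zeros((len(av_features), self.n_targets))
            for k, model in self.models.items():
                mask = labels == k
                if mask.any():
                    preds[mask] = model.predict(av_features[mask])
            return preds

    Locally linear pieces keep each regional model simple while the clustering step adapts the partition to where the audiovisual data actually lie.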