105 research outputs found
Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling
The study of speech disorders can benefit greatly from time-aligned data.
However, audio-text mismatches in disfluent speech cause rapid performance
degradation for modern speech aligners, hindering the use of automatic
approaches. In this work, we propose a simple and effective modification of
alignment graph construction of CTC-based models using Weighted Finite State
Transducers. The proposed weakly-supervised approach alleviates the need for
verbatim transcription of speech disfluencies for forced alignment. During the
graph construction, we allow the modeling of common speech disfluencies, i.e.
repetitions and omissions. Further, we show that by assessing the degree of
audio-text mismatch through the use of Oracle Error Rate, our method can be
effectively used in the wild. Our evaluation on a corrupted version of the
TIMIT test set and the UCLASS dataset shows significant improvements,
particularly for recall, achieving a 23-25% relative improvement over our
baselines.Comment: Interspeech 202
Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos
The recent state of the art on monocular 3D face reconstruction from image
data has made some impressive advancements, thanks to the advent of Deep
Learning. However, it has mostly focused on input coming from a single RGB
image, overlooking the following important factors: a) Nowadays, the vast
majority of facial image data of interest do not originate from single images
but rather from videos, which contain rich dynamic information. b) Furthermore,
these videos typically capture individuals in some form of verbal communication
(public talks, teleconferences, audiovisual human-computer interactions,
interviews, monologues/dialogues in movies, etc). When existing 3D face
reconstruction methods are applied in such videos, the artifacts in the
reconstruction of the shape and motion of the mouth area are often severe,
since they do not match well with the speech audio.
To overcome the aforementioned limitations, we present the first method for
visual speech-aware perceptual reconstruction of 3D mouth expressions. We do
this by proposing a "lipread" loss, which guides the fitting process so that
the elicited perception from the 3D reconstructed talking head resembles that
of the original video footage. We demonstrate that, interestingly, the lipread
loss is better suited for 3D reconstruction of mouth movements compared to
traditional landmark losses, and even direct 3D supervision. Furthermore, the
devised method does not rely on any text transcriptions or corresponding audio,
rendering it ideal for training in unlabeled datasets. We verify the efficiency
of our method through exhaustive objective evaluations on three large-scale
datasets, as well as subjective evaluation with two web-based user studies
Comparing timed-division multiplexing and best-effort networks-on-chip
Best-effort (BE) networks-on-chips (NOCs) are usually preferred over time-division multiplexed (TDM) NOCs in multi-core platforms because they are work-conserving and have lower (zero-load) latency. On the other hand, BE NOCs are significantly more expensive to implement than TDM NOCs because of their virtual channel buffers, allocators/arbiters, and (credit-based) flow control; functionality that a TDM NOC avoids altogether. The objective of this paper is to compare the performance of BE and TDM NOCs, taking hardware cost into consideration. The networks are compared using graphs showing average latency as a function of offered load. For the BE NOCs, we use the BookSim simulator, and for the TDM NOCs, we derive a queuing theory model and an associated TDM NOC simulator. Through experiments with both router architectures, packet length, link width, and different traffic patterns, we show that for the same hardware cost, a TDM NOC can provide higher bandwidth and comparable latency. We also show that the packet length is the most important factor affecting the TDM period, which again is the primary factor affecting latency. The best TDM NOC design for BE traffic uses single flit packets, wide links/flits, and a router with two pipeline stages: link and router traversal.publishedVersionPeer reviewe
Smart subtitles for vocabulary learning
Language learners often use subtitled videos to help them learn. However, standard subtitles are geared more towards comprehension than vocabulary learning, as translations are nonliteral and are provided only for phrases, not vocabulary. This paper presents Smart Subtitles, which are interactive subtitles tailored towards vocabulary learning. Smart Subtitles can be automatically generated from common video sources such as subtitled DVDs. They provide features such as vocabulary definitions on hover, and dialog-based video navigation. In our pilot study with intermediate learners studying Chinese, participants correctly defined over twice as many new words in a post-viewing vocabulary test when they used Smart Subtitles, compared to dual Chinese-English subtitles. Learners spent the same amount of time watching clips with each tool, and enjoyed viewing videos with Smart Subtitles as much as with dual subtitles. Learners understood videos equally well using either tool, as indicated by self-assessments and independent evaluations of their summaries
Inversion from Audiovisual Speech to Articulatory Information by Exploiting Multimodal Data
International audienceWe present an inversion framework to identify speech production properties from audiovisual information. Our system is built on a multimodal articulatory dataset comprising ultrasound, X-ray, magnetic resonance images as well as audio and stereovisual recordings of the speaker. Visual information is captured via stereovision while the vocal tract state is represented by a properly trained articulatory model. Inversion is based on an adaptive piecewise linear approximation of the audiovisualto- articulation mapping. The presented system can recover the hidden vocal tract shapes and may serve as a basis for a more widely applicable inversion setup
- …