Processing and Linking Audio Events in Large Multimedia Archives: The EU inEvent Project
In the inEvent EU project [1], we aim at structuring, retrieving, and sharing large archives of networked, dynamically changing multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings and labels them in terms of interconnected “hyper-events” (a notion inspired by hyper-texts). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal. We present initial results for feature extraction from lecture recordings using the TED talks. Index Terms: Networked multimedia events; audio processing: speech recognition; speaker diarization and linking; multimedia indexing and searching; hyper-events.
Language model adaptation for video lectures transcription
Video lectures are currently being digitised all over the world for their enormous value as a reference resource. Many of these lectures are accompanied by slides, which offer a great opportunity for improving ASR system performance. We propose a simple yet powerful extension to the linear interpolation of language models for adapting language models with slide information. Two types of slides are considered: correct slides, and slides automatically extracted from the videos with OCR. Furthermore, we compare time-aligned and unaligned slides. Results show an improvement of up to 3.8 absolute WER points when using correct slides. Surprisingly, when using automatic slides obtained with poor OCR quality, the ASR system still improves by up to 2.2 absolute WER points. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287755 (transLectures). Also supported by the Spanish Government (Plan E, iTrans2 TIN2009-14511). Martínez-Villaronga, A.; Del Agua Teba, MA.; Andrés Ferrer, J.; Juan Císcar, A. (2013). Language model adaptation for video lectures transcription. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. Institute of Electrical and Electronics Engineers (IEEE). 8450-8454. https://doi.org/10.1109/ICASSP.2013.6639314
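As a rough illustration of the baseline that this abstract builds on, the sketch below shows plain linear interpolation of two language models, P(w) = lam * P_slides(w) + (1 - lam) * P_background(w). It is not the extension proposed by the authors, which the abstract does not specify. The unigram models, the add-one smoothing, the toy sentences and the weight `lam=0.3` are all assumptions chosen only to keep the example small.

```python
from collections import Counter


def unigram_model(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_background, p_slides, lam):
    """Linear interpolation: lam * P_slides(w) + (1 - lam) * P_background(w)."""
    return {w: lam * p_slides[w] + (1 - lam) * p_background[w]
            for w in p_background}


# Hypothetical toy data: a general-domain corpus and the text of one slide.
background = "the model is trained on the general data".split()
slides = "language model adaptation".split()
vocab = sorted(set(background) | set(slides))

p_bg = unigram_model(background, vocab)
p_sl = unigram_model(slides, vocab)
p_mix = interpolate(p_bg, p_sl, lam=0.3)

# Words that appear on the slide gain probability mass in the mixture.
print(p_bg["adaptation"], p_mix["adaptation"])
```

Because both component models are proper distributions over the same vocabulary, the mixture is one as well. In practice the weight would be tuned on held-out data, and n-gram models would replace the unigrams.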
Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling
Recording university lectures through lecture capture systems is increasingly common.
However, a single continuous audio recording is often unhelpful for users, who may wish
to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set
of recordings.
A transcript of the recording can enable faster navigation and searching. Automatic speech
recognition (ASR) technologies may be used to create automated transcripts, to avoid the
significant time and cost involved in manual transcription.
Low accuracy of ASR-generated transcripts may however limit their usefulness. In
particular, ASR systems optimized for general speech recognition may not recognize the
many technical or discipline-specific words occurring in university lectures. To improve
the usefulness of ASR transcripts for the purposes of information retrieval (search) and
navigating within recordings, the lexicon and language model used by the ASR engine may
be dynamically adapted for the topic of each lecture.
A prototype is presented which uses the English Wikipedia as a semantically dense, large
language corpus to generate a custom lexicon and language model for each lecture from a
small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia
articles are investigated: a naïve crawler which follows all article links from a set of seed
articles produced by a Wikipedia search from the initial keywords, and a refinement which
follows only links to articles sufficiently similar to the parent article. Pair-wise article
similarity is computed from a pre-computed vector space model of Wikipedia article term
scores generated using latent semantic indexing.
The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded
lectures from Open Yale Courses, using the English HUB4 language model as a reference
and the two topic-specific language models generated for each lecture from Wikipedia
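
The two crawling strategies described in this abstract can be sketched as follows, under several assumptions. The link graph, the three-dimensional article vectors, the threshold and the depth limit are hypothetical stand-ins; in the abstract the vectors come from a latent semantic indexing model of Wikipedia, and the similarity measure is not named, so cosine similarity is used here as a common choice.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def crawl(seeds, links, vectors, threshold, max_depth):
    """Breadth-first crawl that follows a link only if the linked article
    is at least `threshold`-similar to the parent article."""
    selected = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        article, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for child in links.get(article, []):
            if child in selected:
                continue
            if cosine(vectors[article], vectors[child]) >= threshold:
                selected.add(child)
                queue.append((child, depth + 1))
    return selected


# Hypothetical link graph and article vectors (illustration only).
links = {
    "Thermodynamics": ["Entropy", "Steam engine", "France"],
    "Entropy": ["Information theory"],
}
vectors = {
    "Thermodynamics": [0.9, 0.1, 0.0],
    "Entropy": [0.8, 0.3, 0.0],
    "Steam engine": [0.6, 0.0, 0.4],
    "France": [0.0, 0.1, 0.9],
    "Information theory": [0.3, 0.9, 0.0],
}

# Refined crawler: follow only sufficiently similar links.
topic = crawl(["Thermodynamics"], links, vectors, threshold=0.7, max_depth=2)
# Naive crawler: a threshold of -1.0 accepts every link.
naive = crawl(["Thermodynamics"], links, vectors, threshold=-1.0, max_depth=2)
print(sorted(topic))
print(sorted(naive))
```

With these toy values the refined crawl keeps the closely related articles and drops the off-topic ones, while the naive crawl collects every reachable article.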
Evaluating intelligent interfaces for post-editing automatic transcriptions of online video lectures
Video lectures are fast becoming an everyday educational resource in higher education. They are being incorporated into existing university curricula around the world, while also emerging as a key component of the open education movement. In 2007, the Universitat Politècnica de València (UPV) implemented its poliMedia lecture capture system for the creation and publication of quality educational video content and now has a collection of over 10,000 video objects. In 2011, it embarked on the EU-subsidised transLectures project to add automatic subtitles to these videos in both Spanish and other languages. Doing so opens up this educational content to non-native speakers and to the deaf and hard-of-hearing, and also enables advanced repository management functions. In this paper, following a short introduction to poliMedia, transLectures and Docència en Xarxa (Teaching Online), the UPV's action plan to boost the use of digital resources at the university, we discuss the three-stage evaluation process carried out in collaboration with UPV lecturers to find the best interaction protocol for the task of post-editing automatic subtitles. Valor Miró, JD.; Spencer, RN.; Pérez González De Martos, AM.; Garcés Díaz-Munío, GV.; Turró Ribalta, C.; Civera Saiz, J.; Juan Císcar, A. (2014). Evaluating intelligent interfaces for post-editing automatic transcriptions of online video lectures. Open Learning: The Journal of Open and Distance Learning. 29(1):72-85. doi:10.1080/02680513.2014.909722
AutoLV: Automatic Lecture Video Generator
We propose an end-to-end lecture video generation system that can generate
realistic and complete lecture videos directly from annotated slides,
instructor's reference voice and instructor's reference portrait video. Our
system is primarily composed of a speech synthesis module with few-shot speaker
adaptation and an adversarial learning-based talking-head generation module. It
is capable of not only reducing instructors' workload but also changing the
language and accent which can help the students follow the lecture more easily
and enable a wider dissemination of lecture contents. Our experimental results
show that the proposed model outperforms other current approaches in terms of
authenticity, naturalness and accuracy. Here is a video demonstration of how
our system works, and the outcomes of the evaluation and comparison:
https://youtu.be/cY6TYkI0cog
Comment: 4 pages, 4 figures, ICIP 202
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR
Online Incremental Machine Translation
In this thesis we investigate the automatic improvement of statistical machine translation systems at runtime based on user feedback. We also propose a framework for using the proposed algorithms in large-scale translation settings.
Semi-Supervised Acoustic Model Training by Discriminative Data Selection from Multiple ASR Systems' Hypotheses
While the performance of ASR systems depends on the size of the training data, it is very costly to prepare accurate and faithful transcripts. In this paper, we investigate a semi-supervised training scheme that takes advantage of huge quantities of unlabeled video lecture archives, particularly for the deep neural network (DNN) acoustic model. In the proposed method, we obtain ASR hypotheses from complementary GMM- and DNN-based ASR systems. Then, a set of CRF-based classifiers is trained to select the correct hypotheses and verify the selected data. The proposed hypothesis combination shows higher quality than the conventional system combination method (ROVER). Moreover, compared with conventional data selection based on confidence measure scores, our method is shown to be more effective at filtering usable data. A significant improvement in ASR accuracy is achieved over the baseline system and over models trained with the conventional system combination and data selection methods.