    Processing and Linking Audio Events in Large Multimedia Archives: The EU inEvent Project

    In the inEvent EU project [1], we aim at structuring, retrieving, and sharing large archives of networked, and dynamically changing, multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings, and labels them in terms of interconnected “hyper-events” (a notion inspired by hypertexts). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal. We present initial results for feature extraction from lecture recordings using the TED talks. Index Terms: Networked multimedia events; audio processing: speech recognition; speaker diarization and linking; multimedia indexing and searching; hyper-events.

    Language model adaptation for video lectures transcription

    Video lectures are currently being digitised all over the world for their enormous value as a reference resource. Many of these lectures are accompanied by slides. The slides offer a great opportunity for improving ASR system performance. We propose a simple yet powerful extension to the linear interpolation of language models for adapting language models with slide information. Two types of slides are considered: correct slides, and slides automatically extracted from the videos with OCR. Furthermore, we compare both time-aligned and unaligned slides. Results report an improvement of up to 3.8% absolute WER points when using correct slides. Surprisingly, when using automatic slides obtained with poor OCR quality, the ASR system still improves up to 2.2 absolute WER points. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287755 (transLectures). Also supported by the Spanish Government (Plan E, iTrans2 TIN2009-14511). Martínez-Villaronga, A.; Del Agua Teba, MA.; Andrés Ferrer, J.; Juan Císcar, A. (2013). Language model adaptation for video lectures transcription. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. Institute of Electrical and Electronics Engineers (IEEE). 8450-8454. https://doi.org/10.1109/ICASSP.2013.6639314
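
    As a rough illustration of the baseline that the abstract above builds on, plain linear interpolation mixes a general model with a slide-derived model using a single weight. The sketch below uses toy unigram models; the token lists, the add-alpha smoothing, and the weight 0.7 are assumptions made for illustration, and the paper's proposed extension is not reproduced here.

```python
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_general, p_slides, lam):
    """Linear interpolation: lam * p_general(w) + (1 - lam) * p_slides(w)."""
    return {w: lam * p_general[w] + (1.0 - lam) * p_slides[w] for w in p_general}


# Hypothetical toy corpora standing in for out-of-domain text and slide text.
general_tokens = "the model is trained on the data".split()
slide_tokens = "language model adaptation with slides".split()
vocab = sorted(set(general_tokens) | set(slide_tokens))

p_general = unigram_model(general_tokens, vocab)
p_slides = unigram_model(slide_tokens, vocab)
p_mix = interpolate(p_general, p_slides, lam=0.7)
```

    Because both component models are normalised over the same vocabulary, the mixture is also a valid distribution, and words that appear on the slides gain probability mass relative to the general model alone.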

    Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling

    Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings. A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription. Low accuracy of ASR-generated transcripts may, however, limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture. A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing. The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia.
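
    The two crawling strategies described above can be sketched as one breadth-first traversal with a similarity threshold: a threshold of zero behaves like the naïve crawler on these non-negative vectors, while a higher threshold follows only links to articles close to their parent. The article names, link graph, three-dimensional vectors, threshold values, and depth limit below are all hypothetical; a real system would use the pre-computed latent semantic indexing vectors mentioned in the abstract.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def crawl(seeds, links, vectors, threshold, max_depth):
    """Breadth-first crawl that follows a link only when the child article
    is at least `threshold`-similar to the parent article."""
    selected = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        article, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for child in links.get(article, []):
            if child in selected:
                continue
            if cosine(vectors[article], vectors[child]) >= threshold:
                selected.add(child)
                queue.append((child, depth + 1))
    return selected


# Hypothetical toy vectors and link graph (not real Wikipedia data).
vectors = {
    "Entropy": [0.9, 0.1, 0.0],
    "Thermodynamics": [0.8, 0.2, 0.1],
    "Heat": [0.7, 0.3, 0.0],
    "James Joule": [0.1, 0.1, 0.9],
}
links = {
    "Entropy": ["Thermodynamics", "James Joule"],
    "Thermodynamics": ["Heat", "James Joule"],
}

naive = crawl(["Entropy"], links, vectors, threshold=0.0, max_depth=2)
refined = crawl(["Entropy"], links, vectors, threshold=0.8, max_depth=2)
```

    In this toy graph the naïve setting collects every linked article, whereas the refined setting skips the biographical article because its vector points in a different direction from the physics articles.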

    Evaluating intelligent interfaces for post-editing automatic transcriptions of online video lectures

    Video lectures are fast becoming an everyday educational resource in higher education. They are being incorporated into existing university curricula around the world, while also emerging as a key component of the open education movement. In 2007, the Universitat Politècnica de València (UPV) implemented its poliMedia lecture capture system for the creation and publication of quality educational video content and now has a collection of over 10,000 video objects. In 2011, it embarked on the EU-subsidised transLectures project to add automatic subtitles to these videos in both Spanish and other languages. By doing so, it makes this educational content accessible to non-native speakers and the deaf and hard-of-hearing, and enables advanced repository management functions. In this paper, following a short introduction to poliMedia, transLectures and Docència en Xarxa (Teaching Online), the UPV's action plan to boost the use of digital resources at the university, we will discuss the three-stage evaluation process carried out with the collaboration of UPV lecturers to find the best interaction protocol for the task of post-editing automatic subtitles. Valor Miró, JD.; Spencer, RN.; Pérez González De Martos, AM.; Garcés Díaz-Munío, GV.; Turró Ribalta, C.; Civera Saiz, J.; Juan Císcar, A. (2014). Evaluating intelligent interfaces for post-editing automatic transcriptions of online video lectures. Open Learning: The Journal of Open and Distance Learning. 29(1):72-85. doi:10.1080/02680513.2014.909722

    AutoLV: Automatic Lecture Video Generator

    We propose an end-to-end lecture video generation system that can generate realistic and complete lecture videos directly from annotated slides, instructor's reference voice and instructor's reference portrait video. Our system is primarily composed of a speech synthesis module with few-shot speaker adaptation and an adversarial learning-based talking-head generation module. It is capable of not only reducing instructors' workload but also changing the language and accent which can help the students follow the lecture more easily and enable a wider dissemination of lecture contents. Our experimental results show that the proposed model outperforms other current approaches in terms of authenticity, naturalness and accuracy. Here is a video demonstration of how our system works, and the outcomes of the evaluation and comparison: https://youtu.be/cY6TYkI0cog. Comment: 4 pages, 4 figures, ICIP 202

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

    Online Incremental Machine Translation

    In this thesis we investigate the automatic improvement of statistical machine translation systems at runtime based on user feedback. We also propose a framework for using the proposed algorithms in large-scale translation settings.

    Semi-Supervised Acoustic Model Training by Discriminative Data Selection from Multiple ASR Systems' Hypotheses

    While the performance of ASR systems depends on the size of the training data, it is very costly to prepare accurate and faithful transcripts. In this paper, we investigate a semi-supervised training scheme that takes advantage of huge quantities of unlabeled video lecture archives, particularly for the deep neural network (DNN) acoustic model. In the proposed method, we obtain ASR hypotheses from complementary GMM- and DNN-based ASR systems. Then, a set of CRF-based classifiers is trained to select the correct hypotheses and verify the selected data. The proposed hypothesis combination shows higher quality compared with the conventional system combination method (ROVER). Moreover, compared with the conventional data selection based on confidence measure score, our method is shown to be more effective for filtering usable data. Significant improvement in the ASR accuracy is achieved over the baseline system and in comparison with the models trained with the conventional system combination and data selection methods.
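
    For orientation, the conventional baseline that the abstract compares against, data selection by confidence measure score, can be sketched as a simple threshold filter over automatically transcribed utterances. The `Hypothesis` record, the per-word confidence values, the use of the mean as the utterance score, and the 0.85 threshold are all assumptions for illustration; the CRF-based selection and verification proposed in the paper is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One ASR hypothesis for an utterance (hypothetical record layout)."""
    utt_id: str
    words: List[str]
    confidences: List[float]  # one score in [0, 1] per word


def select_by_confidence(hyps, threshold):
    """Keep hypotheses whose mean word confidence reaches the threshold."""
    selected = []
    for hyp in hyps:
        if not hyp.words:
            continue  # an empty hypothesis carries no training signal
        mean_conf = sum(hyp.confidences) / len(hyp.confidences)
        if mean_conf >= threshold:
            selected.append(hyp)
    return selected


# Hypothetical ASR output for three unlabeled utterances.
hyps = [
    Hypothesis("utt1", ["deep", "neural", "network"], [0.95, 0.90, 0.92]),
    Hypothesis("utt2", ["acoustic", "muddle"], [0.80, 0.30]),
    Hypothesis("utt3", [], []),
]
kept = select_by_confidence(hyps, threshold=0.85)
```

    The selected hypotheses would then serve as automatic transcripts for acoustic model training; the threshold trades the amount of retained data against the risk of training on recognition errors.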