
    Spott: on-the-spot e-commerce for television using deep learning-based video analysis techniques

    Spott is an innovative second-screen mobile multimedia application that offers viewers relevant information on objects (e.g., clothing, furniture, food) they see and like on their television screens. The application enables interaction between TV audiences and brands, so producers and advertisers can offer potential consumers tailored promotions, e-shop items, and/or free samples. In line with current views on innovation management, the technological excellence of the Spott application is coupled with iterative user involvement throughout the entire development process. This article discusses both of these aspects and how they impact each other. First, we focus on the technological building blocks that facilitate the (semi-)automatic interactive tagging of objects in video streams. The majority of these building blocks make extensive use of novel, state-of-the-art deep learning concepts and methodologies. We show how these deep learning-based video analysis techniques facilitate video summarization, semantic keyframe clustering, and (similar) object retrieval. Second, we provide insights into the user tests that were performed to evaluate and optimize the application's user experience. The lessons learned from these open field tests have already been essential input for the technology development and will further shape future modifications to the Spott application.
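
    As a rough illustration of the semantic keyframe clustering step mentioned above, the sketch below embeds decoded frames with a pretrained CNN and keeps the frame nearest each cluster centroid. The ResNet-18 backbone, cluster count, and frame preprocessing are assumptions made for this example, not details of the Spott pipeline.

    import numpy as np
    import torch
    import torchvision
    from sklearn.cluster import KMeans

    # Pretrained CNN as a frame-level feature extractor (illustrative choice).
    backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    backbone.fc = torch.nn.Identity()  # keep the 512-d pooled embedding
    backbone.eval()

    def keyframes(frames: torch.Tensor, n_clusters: int = 5) -> list[int]:
        """frames: (N, 3, 224, 224) normalized video frames.
        Returns the index of one representative frame per semantic cluster."""
        with torch.no_grad():
            feats = backbone(frames).numpy()            # (N, 512) embeddings
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
        picks = []
        for c in range(n_clusters):
            members = np.where(km.labels_ == c)[0]
            d = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
            picks.append(int(members[d.argmin()]))      # frame closest to centroid
        return sorted(picks)

    # Example: 60 random frames stand in for a decoded video clip.
    print(keyframes(torch.randn(60, 3, 224, 224)))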

    Gaming control using a wearable and wireless EEG-based brain-computer interface device with novel dry foam-based sensors

    A brain-computer interface (BCI) is a communication system that helps users interact with the outside environment by translating brain signals into machine commands. The use of electroencephalographic (EEG) signals has become the most common approach for a BCI because of their usability and strong reliability. Many EEG-based BCI devices have been developed with traditional wet or micro-electro-mechanical-system (MEMS) EEG sensors. However, those traditional sensors have the disadvantage of being uncomfortable, and they require conductive gel and skin preparation on the part of the user. Therefore, acquiring EEG signals in a comfortable and convenient manner is an important factor to incorporate into a novel BCI device. In the present study, a wearable, wireless, and portable EEG-based BCI device with dry foam-based EEG sensors was developed and demonstrated with a gaming-control application. The dry EEG sensors operate without conductive gel, yet provide good conductivity and acquire EEG signals effectively by adapting to irregular skin surfaces and maintaining proper skin-sensor impedance at the forehead site. We also demonstrate a real-time cognitive-state detection application for gaming control using the proposed portable device. The results of the present study indicate that using this portable EEG-based BCI device to conveniently and effectively control the outside world provides a promising approach for rehabilitation-engineering research.
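
    As a hypothetical illustration of how forehead EEG from such a device could be mapped to a game command in real time, the sketch below estimates relative alpha-band (8-12 Hz) power over a short window and thresholds it. The sampling rate, band choice, and threshold are assumptions for this example, not parameters reported in the study.

    import numpy as np
    from scipy.signal import welch

    FS = 256  # sampling rate in Hz (assumed; the device's actual rate may differ)

    def alpha_power(window: np.ndarray) -> float:
        """Relative alpha-band (8-12 Hz) power of a 1-D EEG window."""
        freqs, psd = welch(window, fs=FS, nperseg=min(len(window), FS))
        band = (freqs >= 8) & (freqs <= 12)
        return psd[band].sum() / psd.sum()

    def game_command(window: np.ndarray, threshold: float = 0.3) -> str:
        # High frontal alpha is a common relaxation correlate; map it to a
        # binary in-game action. The threshold here is illustrative only.
        return "relax_action" if alpha_power(window) > threshold else "idle"

    # Example with two seconds of synthetic forehead EEG.
    print(game_command(np.random.randn(2 * FS)))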

    Survey and Systematization of Secure Device Pairing

    Secure Device Pairing (SDP) schemes have been developed to facilitate secure communications among smart devices, both personal mobile devices and Internet of Things (IoT) devices. Comparison and assessment of SDP schemes is troublesome, because each scheme makes different assumptions about out-of-band channels and adversary models, and is driven by its particular use cases. A conceptual model that facilitates meaningful comparison among SDP schemes is missing; we provide such a model. In this article, we survey and analyze a wide range of SDP schemes described in the literature, including a number that have been adopted as standards. On the foundation of this survey, we build a system model and consistent terminology for SDP schemes, which we then use to classify existing SDP schemes into a taxonomy that, for the first time, enables their meaningful comparison and analysis. Analyzing the existing schemes with this model reveals common systemic security weaknesses that should become priority areas for future SDP research, such as improving the integration of privacy requirements into the design of SDP schemes. Our results allow SDP scheme designers to create schemes that are more easily comparable with one another, and help prevent the weaknesses common to the current generation of SDP schemes from persisting.
    Comment: 34 pages, 5 figures, 3 tables; accepted at IEEE Communications Surveys & Tutorials 2017 (Volume: PP, Issue: 99).
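
    To make the surveyed class of schemes concrete, the sketch below simulates, in one process, a deliberately simplified commit-then-reveal pairing that ends in a short authenticated string (SAS) the user compares out of band, in the spirit of numeric-comparison schemes. It is a didactic toy under stated assumptions, not a scheme from the article: a real protocol binds the SAS to a key-agreement transcript (e.g., ECDH public keys).

    import hashlib, hmac, secrets

    def commit(value: bytes) -> tuple[bytes, bytes]:
        key = secrets.token_bytes(32)                  # commitment opening key
        return hmac.new(key, value, hashlib.sha256).digest(), key

    # 1. Each device contributes a random value and first exchanges only a
    #    commitment, so neither side can pick its value after seeing the other's.
    a_val, b_val = secrets.token_bytes(16), secrets.token_bytes(16)
    a_com, a_key = commit(a_val)
    b_com, b_key = commit(b_val)

    # 2. Values and openings are revealed; each side verifies the commitment.
    assert hmac.compare_digest(a_com, hmac.new(a_key, a_val, hashlib.sha256).digest())
    assert hmac.compare_digest(b_com, hmac.new(b_key, b_val, hashlib.sha256).digest())

    # 3. Both sides derive a short code the user compares over the out-of-band
    #    channel (e.g., both screens); 6 decimal digits, as in numeric comparison.
    sas = int.from_bytes(hashlib.sha256(a_val + b_val).digest()[:4], "big") % 10**6
    print(f"Compare on both devices: {sas:06d}")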

    Disentangling Prosody Representations with Unsupervised Speech Reconstruction

    Human speech can be characterized by different components, including semantic content, speaker identity, and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks, respectively. However, extracting prosodic information remains an open research challenge because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust, large-scale, speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement, and integrate three crucial components in our proposed speech reconstruction model, Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracies) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial to the performance of widely used speech pretraining models, and that combining Prosody2Vec with HuBERT representations surpasses state-of-the-art methods.
    Comment: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing.
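
    A minimal sketch of the three-component factorization described above: discrete content units, a frozen speaker embedding, and a trainable prosody encoder feed a decoder trained to reconstruct the input mel-spectrogram. All layer types and sizes are placeholders chosen for this example, not the authors' architecture.

    import torch
    import torch.nn as nn

    class Prosody2VecSketch(nn.Module):
        def __init__(self, n_units=100, d=256, n_mels=80):
            super().__init__()
            self.unit_emb = nn.Embedding(n_units, d)                 # (1) content units
            self.prosody_enc = nn.GRU(n_mels, d, batch_first=True)   # (3) trainable
            self.decoder = nn.GRU(3 * d, d, batch_first=True)
            self.to_mel = nn.Linear(d, n_mels)

        def forward(self, units, spk_emb, mel):
            # units: (B, T) discrete content ids; spk_emb: (B, d) from a frozen
            # pretrained speaker-verification model (2); mel: (B, T, n_mels).
            prosody, _ = self.prosody_enc(mel)
            spk = spk_emb.unsqueeze(1).expand(-1, units.size(1), -1)
            x = torch.cat([self.unit_emb(units), prosody, spk], dim=-1)
            h, _ = self.decoder(x)
            return self.to_mel(h)                                    # reconstructed mel

    m = Prosody2VecSketch()
    mel = torch.randn(2, 50, 80)
    out = m(torch.randint(0, 100, (2, 50)), torch.randn(2, 256), mel)
    loss = nn.functional.l1_loss(out, mel)  # unsupervised reconstruction objective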

    A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer

    While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and by the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for speech recognition and speaker verification. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
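
    A minimal sketch of the modality dropout idea, assuming frame-aligned audio and visual feature streams that are concatenated before a shared encoder; the drop probabilities are illustrative, not the paper's schedule.

    import torch

    def modality_dropout(audio_feats, video_feats, p_audio=0.25, p_video=0.25):
        """Randomly zero out one input stream before fusion so the shared
        encoder sees audio-only, visual-only, and audio-visual examples alike."""
        r = torch.rand(())
        if r < p_audio:
            audio_feats = torch.zeros_like(audio_feats)       # visual-only sample
        elif r < p_audio + p_video:
            video_feats = torch.zeros_like(video_feats)       # audio-only sample
        return torch.cat([audio_feats, video_feats], dim=-1)  # fused input

    fused = modality_dropout(torch.randn(4, 100, 512), torch.randn(4, 100, 512))
    print(fused.shape)  # torch.Size([4, 100, 1024])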

    Rushes summarization by IRIM consortium: redundancy removal and multi-feature fusion

    In this paper, we present the first participation of IRIM, a consortium of French laboratories, in the TRECVID 2008 BBC Rushes Summarization task. Our approach resorts to video skimming. We propose two methods to reduce redundancy, as rushes include several takes of the same scenes. We also take low- and mid-level semantic features into account in an ad hoc fusion method in order to retain only significant content.
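
    As a hedged sketch of the redundancy-removal idea, the code below summarizes each take with a color-histogram signature and keeps only takes that are not near-duplicates of one already kept. The signature and similarity threshold are assumptions for this example, not the consortium's actual methods.

    import numpy as np

    def dedupe_takes(takes, threshold=0.9):
        """Keep one representative per group of near-identical takes.
        Cosine similarity of histogram signatures above `threshold`
        marks a repeated take."""
        def signature(frames):                   # frames: (N, H, W, 3) uint8
            hist, _ = np.histogram(frames, bins=64, range=(0, 256))
            return hist / np.linalg.norm(hist)
        kept, sigs = [], []
        for i, take in enumerate(takes):
            s = signature(take)
            if all(float(s @ t) < threshold for t in sigs):
                kept.append(i)
                sigs.append(s)
        return kept

    # Three visually distinct synthetic takes, plus a repeat of the first.
    takes = [np.full((30, 48, 64, 3), v, dtype=np.uint8) for v in (10, 90, 200)]
    takes.append(takes[0].copy())
    print(dedupe_takes(takes))                   # -> [0, 1, 2]; the repeat is dropped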

    Skylab SO71/SO72 circadian periodicity experiment

    The circadian rhythm hardware activities from 1965 through 1973 are considered. A brief history is given of the programs leading to the development of the combined Skylab SO71/SO72 Circadian Periodicity Experiment (CPE). SO71 is the Skylab experiment number designating the pocket mouse circadian experiment, and SO72 designates the vinegar gnat circadian experiment. Final design modifications and checkout of the CPE, integration testing with the Apollo service module CSM 117, and the launch preparation and support tasks at Kennedy Space Center are reported.

    Artificial Intelligence for Multimedia Signal Processing

    Artificial intelligence technologies are being actively applied to broadcasting and multimedia processing. Research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years these efforts have improved compression efficiency for image, video, speech, and other data in areas related to MPEG media processing technology. Additionally, technologies for media creation, processing, editing, and scenario creation are very important areas of research in multimedia processing and engineering. This book collects topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing, such as computer vision, speech/sound/text processing, and content analysis/information mining.

    Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units

    We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour, and timbre of a recording to a target speaker in a textless manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on timbre and ignore people's unique speaking style (prosody). The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and fast to train. All conversion modules are trained only on reconstruction-like tasks, making the method suitable for any-to-many VC with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that DISSC significantly outperforms the evaluated baselines. Code and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/dissc/.
    Comment: Accepted at EMNLP 2023.
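
    The sketch below illustrates the textless recipe in miniature: frame-level features are quantized into discrete units, repeated units are collapsed into (unit, duration) segments, and rhythm is changed by rescaling durations before re-expansion for a unit vocoder. The random features, k-means quantizer size, and fixed rate factor are stand-ins for DISSC's pretrained encoder and learned predictors.

    import numpy as np
    from sklearn.cluster import KMeans

    feats = np.random.randn(500, 64)                 # (frames, dim) encoder output
    units = KMeans(n_clusters=100, n_init=10).fit_predict(feats)

    def run_length(units):
        """Collapse repeated frame-level units into (unit, duration) pairs."""
        out, prev = [], None
        for u in units:
            if u == prev:
                out[-1][1] += 1
            else:
                out.append([int(u), 1])
                prev = u
        return out

    # Rhythm conversion: rescale each unit's duration toward a target speaking
    # rate, then re-expand to a frame-level unit sequence for a unit vocoder.
    segments = run_length(units)
    rate = 0.75                                      # hypothetical target rate
    converted = [u for u, d in segments for _ in range(max(1, round(d * rate)))]
    print(len(units), "->", len(converted), "frames")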