840 research outputs found

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

    Multimedia Context Awareness for Smart Mobile Environments

    Get PDF
    openNowadays the development of the IoT framework and the resulting huge number of smart connected devices opens the door to exploit the presence of multiple smart nodes to accomplish a variety of tasks. Multimedia context awareness, together with the concept of ambient intelligence, is tightly related to the IoT framework, and it can be applied to a large number of smart scenarios. In this thesis, the aim is to study and analyze the role of context awareness in different applications related to smart mobile environments, such as future smart spaces and connected cities. Indeed, this research work focuses on different aspects of ambient intelligence, such as audio-awareness and wireless-awareness. In particular, this thesis tackles two main research topics: the first one, related to the framework of audio-awareness, concerns a multiple observations approach for smart speaker recognition in mobile environments; the second one, tied to the concept of wireless-awareness, regards Unmanned Aerial Vehicle (UAV) detection based on WiFi statistical fingerprint analysis.openXXXI CICLO - SC. E TECN. ING. ELETTR. E DELLE TEL. - Ambienti cognitivi interattiviGaribotto, Chiar

    Personalized Interaction with High-Resolution Wall Displays

    Get PDF
    Fallende Hardwarepreise sowie eine zunehmende Offenheit gegenüber neuartigen Interaktionsmodalitäten haben in den vergangen Jahren den Einsatz von wandgroßen interaktiven Displays möglich gemacht, und in der Folge ist ihre Anwendung, unter anderem in den Bereichen Visualisierung, Bildung, und der Unterstützung von Meetings, erfolgreich demonstriert worden. Aufgrund ihrer Größe sind Wanddisplays für die Interaktion mit mehreren Benutzern prädestiniert. Gleichzeitig kann angenommen werden, dass Zugang zu persönlichen Daten und Einstellungen — mithin personalisierte Interaktion — weiterhin essentieller Bestandteil der meisten Anwendungsfälle sein wird. Aktuelle Benutzerschnittstellen im Desktop- und Mobilbereich steuern Zugriffe über ein initiales Login. Die Annahme, dass es nur einen Benutzer pro Bildschirm gibt, zieht sich durch das gesamte System, und ermöglicht unter anderem den Zugriff auf persönliche Daten und Kommunikation sowie persönliche Einstellungen. Gibt es hingegen mehrere Benutzer an einem großen Bildschirm, müssen hierfür Alternativen gefunden werden. Die daraus folgende Forschungsfrage dieser Dissertation lautet: Wie können wir im Kontext von Mehrbenutzerinteraktion mit wandgroßen Displays personalisierte Schnittstellen zur Verfügung stellen? Die Dissertation befasst sich sowohl mit personalisierter Interaktion in der Nähe (mit Touch als Eingabemodalität) als auch in etwas weiterer Entfernung (unter Nutzung zusätzlicher mobiler Geräte). Grundlage für personalisierte Mehrbenutzerinteraktion sind technische Lösungen für die Zuordnung von Benutzern zu einzelnen Interaktionen. Hierzu werden zwei Alternativen untersucht: In der ersten werden Nutzer via Kamera verfolgt, und in der zweiten werden Mobilgeräte anhand von Ultraschallsignalen geortet. Darauf aufbauend werden Interaktionstechniken vorgestellt, die personalisierte Interaktion unterstützen. Diese nutzen zusätzliche Mobilgeräte, die den Zugriff auf persönliche Daten sowie Interaktion in einigem Abstand von der Displaywand ermöglichen. Einen weiteren Teil der Arbeit bildet die Untersuchung der praktischen Auswirkungen der Ausgabe- und Interaktionsmodalitäten für personalisierte Interaktion. Hierzu wird eine qualitative Studie vorgestellt, die Nutzerverhalten anhand des kooperativen Mehrbenutzerspiels Miners analysiert. Der abschließende Beitrag beschäftigt sich mit dem Analyseprozess selber: Es wird das Analysetoolkit für Wandinteraktionen GIAnT vorgestellt, das Nutzerbewegungen, Interaktionen, und Blickrichtungen visualisiert und dadurch die Untersuchung der Interaktionen stark vereinfacht.An increasing openness for more diverse interaction modalities as well as falling hardware prices have made very large interactive vertical displays more feasible, and consequently, applications in settings such as visualization, education, and meeting support have been demonstrated successfully. Their size makes wall displays inherently usable for multi-user interaction. At the same time, we can assume that access to personal data and settings, and thus personalized interaction, will still be essential in most use-cases. In most current desktop and mobile user interfaces, access is regulated via an initial login and the complete user interface is then personalized to this user: Access to personal data, configurations and communications all assume a single user per screen. In the case of multiple people using one screen, this is not a feasible solution and we must find alternatives. Therefore, this thesis addresses the research question: How can we provide personalized interfaces in the context of multi-user interaction with wall displays? The scope spans personalized interaction both close to the wall (using touch as input modality) and further away (using mobile devices). Technical solutions that identify users at each interaction can replace logins and enable personalized interaction for multiple users at once. This thesis explores two alternative means of user identification: Tracking using RGB+depth-based cameras and leveraging ultrasound positioning of the users' mobile devices. Building on this, techniques that support personalized interaction using personal mobile devices are proposed. In the first contribution on interaction, HyDAP, we examine pointing from the perspective of moving users, and in the second, SleeD, we propose using an arm-worn device to facilitate access to private data and personalized interface elements. Additionally, the work contributes insights on practical implications of personalized interaction at wall displays: We present a qualitative study that analyses interaction using a multi-user cooperative game as application case, finding awareness and occlusion issues. The final contribution is a corresponding analysis toolkit that visualizes users' movements, touch interactions and gaze points when interacting with wall displays and thus allows fine-grained investigation of the interactions

    Hashing for Multimedia Similarity Modeling and Large-Scale Retrieval

    Get PDF
    In recent years, the amount of multimedia data such as images, texts, and videos have been growing rapidly on the Internet. Motivated by such trends, this thesis is dedicated to exploiting hashing-based solutions to reveal multimedia data correlations and support intra-media and inter-media similarity search among huge volumes of multimedia data. We start by investigating a hashing-based solution for audio-visual similarity modeling and apply it to the audio-visual sound source localization problem. We show that synchronized signals in audio and visual modalities demonstrate similar temporal changing patterns in certain feature spaces. We propose to use a permutation-based random hashing technique to capture the temporal order dynamics of audio and visual features by hashing them along the temporal axis into a common Hamming space. In this way, the audio-visual correlation problem is transformed into a similarity search problem in the Hamming space. Our hashing-based audio-visual similarity modeling has shown superior performances in the localization and segmentation of sounding objects in videos. The success of the permutation-based hashing method motivates us to generalize and formally define the supervised ranking-based hashing problem, and study its application to large-scale image retrieval. Specifically, we propose an effective supervised learning procedure to learn optimized ranking-based hash functions that can be used for large-scale similarity search. Compared with the randomized version, the optimized ranking-based hash codes are much more compact and discriminative. Moreover, it can be easily extended to kernel space to discover more complex ranking structures that cannot be revealed in linear subspaces. Experiments on large image datasets demonstrate the effectiveness of the proposed method for image retrieval. We further studied the ranking-based hashing method for the cross-media similarity search problem. Specifically, we propose two optimization methods to jointly learn two groups of linear subspaces, one for each media type, so that features\u27 ranking orders in different linear subspaces maximally preserve the cross-media similarities. Additionally, we develop this ranking-based hashing method in the cross-media context into a flexible hashing framework with a more general solution. We have demonstrated through extensive experiments on several real-world datasets that the proposed cross-media hashing method can achieve superior cross-media retrieval performances against several state-of-the-art algorithms. Lastly, to make better use of the supervisory label information, as well as to further improve the efficiency and accuracy of supervised hashing, we propose a novel multimedia discrete hashing framework that optimizes an instance-wise loss objective, as compared to the pairwise losses, using an efficient discrete optimization method. In addition, the proposed method decouples the binary codes learning and hash function learning into two separate stages, thus making the proposed method equally applicable for both single-media and cross-media search. Extensive experiments on both single-media and cross-media retrieval tasks demonstrate the effectiveness of the proposed method
    • …
    corecore