Digital Watermarking for Verification of Perception-based Integrity of Audio Data
In certain application fields, digital audio recordings contain sensitive content. Examples are historical archival material in public archives that preserve our cultural heritage, or digital evidence in the context of law enforcement and civil proceedings. Because of the powerful capabilities of modern multimedia editing tools, such material is vulnerable to malicious doctoring of its content and forgery of its origin. Human error can also cause inadvertent data modification and mistaken attribution of origin. Hence, the credibility and provenance of such audio content, in terms of an unadulterated and genuine state, and the confidence in its origin are critical factors.
To address this issue, this PhD thesis proposes a mechanism for verifying the integrity and authenticity of digital sound recordings. It is designed and implemented to be insensitive to common post-processing operations on the audio data that influence the subjective acoustic perception only marginally (if at all). Examples of such operations include lossy compression that maintains a high sound quality of the audio media, or lossless format conversions. The objective is to avoid the de facto false alarms that standard cryptographic authentication protocols would be expected to raise in the presence of such legitimate post-processing. To achieve this, a feasible combination of digital watermarking and audio-specific hashing is investigated.
First, a suitable secret-key-dependent audio hashing algorithm is developed. It incorporates and enhances so-called audio fingerprinting technology from the state of the art in content-based audio identification. The presented algorithm (denoted as the "rMAC" message authentication code) allows "perception-based" verification of integrity. That is, integrity breaches are classified as such only once they become audible. As another objective, this rMAC is embedded and stored silently inside the audio media by means of audio watermarking technology. This approach allows the authentication code to be maintained across the above-mentioned admissible post-processing operations and made available for integrity verification at a later date. For this, an existing secret-key-dependent audio watermarking algorithm is used and enhanced in this thesis work.
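The idea of perception-based verification described above can be illustrated with a minimal sketch. It is not the thesis's rMAC algorithm: the block-energy comparison, the `random.Random` key-seeded selection (not cryptographically secure), the 0.1 bit-error threshold, and all function names are assumptions chosen only to show how a keyed, robust hash can tolerate mild processing while still flagging replaced content.

```python
import math
import random


def block_energies(samples, n_blocks):
    """Split the signal into equal blocks and return the energy of each."""
    size = len(samples) // n_blocks
    return [sum(x * x for x in samples[i * size:(i + 1) * size])
            for i in range(n_blocks)]


def keyed_robust_hash(samples, key, n_blocks=32, n_bits=64):
    """Toy keyed hash: each bit compares the energies of two key-chosen blocks."""
    energies = block_energies(samples, n_blocks)
    rng = random.Random(key)  # illustration only, not cryptographically secure
    bits = []
    for _ in range(n_bits):
        a, b = rng.sample(range(n_blocks), 2)  # two distinct block indices
        bits.append(1 if energies[a] > energies[b] else 0)
    return bits


def bit_error_rate(h1, h2):
    """Fraction of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(h1, h2)) / len(h1)


def verify(samples, key, reference_hash, threshold=0.1):
    """Accept when the recomputed hash is close enough to the reference."""
    return bit_error_rate(keyed_robust_hash(samples, key), reference_hash) <= threshold


def toy_audio(levels, block_len=256):
    """Synthetic signal: one sine burst per block, loudness set by `levels`."""
    return [(0.1 + 0.02 * lv) * math.sin(2 * math.pi * 8 * n / block_len)
            for lv in levels for n in range(block_len)]


levels = [(7 * i) % 32 for i in range(32)]  # 32 distinct loudness levels
original = toy_audio(levels)
key = b"demo-key"                           # hypothetical secret key
reference = keyed_robust_hash(original, key)

# Admissible processing: requantize to a 16-bit-like amplitude grid.
requantized = [round(x * 32767) / 32767 for x in original]
# Tampering: different content with the loudness pattern inverted.
replaced = toy_audio([31 - lv for lv in levels])

print(verify(requantized, key, reference))  # True
print(verify(replaced, key, reference))     # False
```

Comparing hashes by bit error rate rather than exact equality is what separates this style of check from a conventional cryptographic MAC, where any single changed sample would cause rejection.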
To some extent, the dependency of the rMAC and of the watermarking process on a secret key also allows authenticating the origin of a protected audio file. To elaborate on this security aspect, this work also estimates the brute-force effort required of an adversary attacking this combined rMAC-watermarking approach. The experimental results show that the proposed method distinguishes and classifies authentic versus doctored audio content well. It also allows the temporal localization of audible data modification within a protected audio file. Finally, the experimental evaluation provides recommendations about technical configuration settings of the combined watermarking-hashing approach.
Beyond the main topic of perception-based data integrity and data authenticity for audio, this PhD work provides new general findings in the fields of audio fingerprinting and digital watermarking. The main contributions of this PhD were published and presented mainly at conferences on multimedia security. These publications were cited by a number of other authors and hence had some impact on their work.
Deep Neural Network Architectures for Large-scale, Robust and Small-Footprint Speaker and Language Recognition
Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Date of defense: 27-04-2017
Artificial neural networks are powerful learners of the information embedded in speech signals.
They can provide compact, multi-level, nonlinear representations of temporal sequences
and holistic optimization algorithms capable of surpassing former leading paradigms. Artificial
neural networks are, therefore, a promising technology that can be used to enhance our
ability to recognize speakers and languages, an ability increasingly in demand in the context
of new, voice-enabled interfaces used today by millions of users. The aim of this thesis is to
advance the state of the art in language and speaker recognition through the formulation,
implementation and empirical analysis of novel approaches for large-scale and portable
speech interfaces. Its major contributions are: (1) novel, compact network architectures
for language and speaker recognition, including a variety of network topologies based on
fully-connected, recurrent, convolutional, and locally connected layers; (2) a bottleneck combination
strategy for classical and neural network approaches for long speech sequences; (3)
the architectural design of the first public, multilingual, large-vocabulary continuous speech
recognition system; and (4) a novel, end-to-end optimization algorithm for text-dependent
speaker recognition that is applicable to a range of verification tasks. Experimental results
have demonstrated that artificial neural networks can substantially reduce the number of
model parameters and surpass the performance of previous approaches to language and
speaker recognition, particularly in the cases of long short-term memory recurrent networks
(used to model the input speech signal), end-to-end optimization algorithms (used to predict
languages or speakers), short testing utterances, and large training data collections.
Artificial neural networks are learning systems capable of extracting the information
embedded in speech signals. They can efficiently model complex temporal sequences,
with nonlinear information distributed across different semantic levels,
through holistic optimization algorithms that have the potential to improve on
existing machine learning systems. Artificial neural networks are therefore
a promising technology for improving automatic speaker and language
recognition, tasks in growing demand in the new voice-control systems
already used by millions of people. This
thesis aims to advance the state of the art in speaker and language recognition
technologies through the formulation, implementation, and empirical analysis of
new neural-network-based approaches applicable to portable devices and to
large-scale use. The main contributions of this thesis include the original proposal of:
(1) efficient architectures that use dense, locally dense,
recurrent, and convolutional neural layers; (2) a new strategy for combining classical
approaches with approaches based on so-called bottleneck networks; (3) the design of the
first public speech recognition system with open, continuous vocabulary that is
also multilingual; and (4) a new end-to-end optimization algorithm for
speaker recognition tasks that is also applicable to other verification tasks. The
experimental results of this thesis have shown that artificial neural networks
can reduce the number of parameters used by traditional recognition
algorithms and substantially improve the performance of those systems.
This relative improvement can be accentuated by modeling speech
with long short-term memory recurrent networks, by using end-to-end optimization
algorithms, by using short evaluation utterances, and by optimizing the
system with large amounts of training data.
Recent Advances in Signal Processing
Signal processing is a critical element of most new technological inventions and poses challenges across a variety of applications in both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian. They have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily at students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand. These five categories address, in order, image processing, speech processing, communication systems, time-series analysis, and educational packages. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.
Exploiting Spatio-Temporal Coherence for Video Object Detection in Robotics
This paper proposes a method to enhance video object detection for indoor environments in robotics. Concretely, it exploits knowledge about the camera motion between frames to propagate previously detected objects to successive frames. The proposal is rooted in the concepts of planar homography, to propose regions of interest in which to find objects, and recursive Bayesian filtering, to integrate observations over time. The proposal is evaluated on six virtual indoor environments, accounting for the detection of nine object classes over a total of approximately 7k frames. Results show that our proposal improves the recall and the F1-score by factors of 1.41 and 1.27, respectively, and achieves a significant reduction of the object categorization entropy (58.8%) compared with a two-stage video object detection method used as baseline, at the cost of a small time overhead (120 ms) and precision loss (0.92).
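The two building blocks named in this abstract can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: it assumes a known 3x3 homography `H` between frames, treats per-frame detector scores as class likelihoods, and assumes a static object class (no transition model). All function names and numeric values are hypothetical.

```python
import math


def apply_homography(H, x, y):
    """Map a pixel (x, y) through a 3x3 planar homography H."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def propagate_box(H, box):
    """Warp the four box corners and return their axis-aligned bounding box."""
    x0, y0, x1, y1 = box
    corners = [apply_homography(H, x, y)
               for x, y in ((x0, y0), (x1, y0), (x1, y1), (x0, y1))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def bayes_update(prior, likelihood):
    """One recursive Bayes step: posterior is proportional to prior * likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def entropy(dist):
    """Shannon entropy in bits of a categorical distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


# Hypothetical camera motion: pure image translation by (+10, -5) pixels.
H = [[1.0, 0.0, 10.0],
     [0.0, 1.0, -5.0],
     [0.0, 0.0, 1.0]]
box = (0.0, 0.0, 20.0, 10.0)      # detection in the previous frame
moved = propagate_box(H, box)     # region of interest in the next frame

# Class belief for one tracked object over three hypothetical classes.
belief = [1 / 3, 1 / 3, 1 / 3]
entropies = [entropy(belief)]
for scores in ([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]):  # per-frame detector scores
    belief = bayes_update(belief, scores)
    entropies.append(entropy(belief))

print(moved)
print([round(e, 3) for e in entropies])
```

Consistent observations concentrate the belief on one class, so the categorization entropy falls from frame to frame, which is the quantity the abstract reports as reduced.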