679 research outputs found

    On the Perceptual Organization of Speech

    A general account of auditory perceptual organization has developed in the past 2 decades. It relies on primitive devices akin to the Gestalt principles of organization to assign sensory elements to probable groupings and invokes secondary schematic processes to confirm or to repair the possible organization. Although this conceptualization is intended to apply universally, the variety and arrangement of the acoustic constituents of speech violate Gestalt principles at numerous junctures, yet speech coheres perceptually nonetheless. The authors report 3 experiments on organization in phonetic perception, using sine wave synthesis to evade the Gestalt rules and the schematic processes alike. These findings falsify a general auditory account, showing that phonetic perceptual organization is achieved by a specific sensitivity to the acoustic modulations characteristic of speech signals.
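
    Sine wave synthesis replaces a natural utterance with a handful of time-varying sinusoids that follow the center frequencies and amplitudes of its formants, preserving the gross spectro-temporal modulation while discarding the usual acoustic attributes of speech. Below is a minimal sketch of such a replica, assuming formant tracks have already been estimated; the toy tracks and function name are illustrative, not the authors' procedure.
```python
# Minimal sketch of sine-wave replica synthesis, assuming formant
# frequency/amplitude tracks are already available (toy values below).
import numpy as np

def sine_wave_speech(formant_freqs, formant_amps, frame_rate=100, sr=16000):
    """Sum a few time-varying sinusoids that follow formant tracks.

    formant_freqs, formant_amps: arrays of shape (n_frames, n_formants),
    sampled at `frame_rate` frames per second.
    """
    n_frames, n_formants = formant_freqs.shape
    n_samples = int(n_frames * sr / frame_rate)
    t_frames = np.arange(n_frames) / frame_rate
    t = np.arange(n_samples) / sr
    out = np.zeros(n_samples)
    for k in range(n_formants):
        # Interpolate the coarse tracks up to the audio sampling rate.
        f = np.interp(t, t_frames, formant_freqs[:, k])
        a = np.interp(t, t_frames, formant_amps[:, k])
        # Integrate instantaneous frequency to obtain the phase.
        phase = 2 * np.pi * np.cumsum(f) / sr
        out += a * np.sin(phase)
    return out / max(np.max(np.abs(out)), 1e-9)

# Toy, static "formant" tracks just to exercise the function.
freqs = np.tile([500.0, 1500.0, 2500.0], (200, 1))
amps = np.tile([1.0, 0.6, 0.3], (200, 1))
replica = sine_wave_speech(freqs, amps)
```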

    Presentation attack detection in voice biometrics

    Recent years have shown an increase in both the accuracy of biometric systems and their practical use. The application of biometrics is becoming widespread, with fingerprint sensors in smartphones, automatic face recognition in social networks and video-based applications, and speaker recognition in phone banking and other phone-based services. The popularization of biometric systems, however, has exposed their major flaw: high vulnerability to spoofing attacks. A fingerprint sensor can be easily tricked with a simple glue-made mold, a face recognition system can be accessed using a printed photo, and a speaker recognition system can be spoofed with a replay of pre-recorded voice. The ease with which a biometric system can be spoofed demonstrates the importance of developing efficient anti-spoofing systems that can detect both known (conceivable now) and unknown (possible in the future) spoofing attacks. It is therefore important to develop mechanisms that can detect such attacks, and equally important for these mechanisms to be seamlessly integrated into existing biometric systems for practical and attack-resistant solutions. To be practical, an attack detection system should (i) be highly accurate, (ii) generalize well to different attacks, and (iii) be simple and efficient. One reason for the increasing demand for effective presentation attack detection (PAD) systems is the ease of access to people's biometric data: a potential attacker can often obtain the necessary biometric samples almost effortlessly from social networks, including facial images, audio and video recordings, and even fingerprints extracted from high-resolution images. Therefore, various privacy protection solutions, such as legal privacy requirements, algorithms for obfuscating personal information (e.g., visual privacy filters), and social awareness of threats to privacy, can also increase the security of personal information and potentially reduce the vulnerability of biometric systems. In this chapter, however, we focus on presentation attack detection in voice biometrics, i.e., automatic speaker verification (ASV) systems. We discuss the vulnerabilities of these systems to presentation attacks (PAs), present different state-of-the-art PAD systems, give insights into their performance, and discuss the integration of PAD and ASV systems.
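
    One practical way to combine the two subsystems discussed above is a simple cascade in which the PAD module screens each utterance before it reaches the speaker verifier. The sketch below is a hypothetical illustration of that integration; the scoring functions, thresholds, and class name are placeholders, not any particular toolkit's API.
```python
# Hypothetical sketch of cascading a presentation-attack detector (PAD)
# with an automatic speaker verification (ASV) system.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadedVerifier:
    pad_score: Callable[[object], float]          # higher = more likely bona fide
    asv_score: Callable[[object, object], float]  # higher = more likely target speaker
    pad_threshold: float
    asv_threshold: float

    def accept(self, utterance, enrollment) -> bool:
        # Step 1: reject anything that looks like a replay or synthetic attack.
        if self.pad_score(utterance) < self.pad_threshold:
            return False
        # Step 2: only bona fide speech reaches the speaker verifier.
        return self.asv_score(utterance, enrollment) >= self.asv_threshold
```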

    The Use of Regression Models for Detecting Digital Fingerprints in Synthetic Audio

    Modern advancements in text-to-speech and voice conversion techniques make it increasingly difficult to distinguish an authentic voice from a synthetically generated voice. These techniques, though complex, are relatively easy to use, even for non-technical users. It is important to develop mechanisms for detecting false content that scale easily to the size of the monitoring requirement. Current approaches for detecting spoofed audio are difficult to scale because of their processing requirements: individually analyzing spectrograms for aberrations at higher frequencies relies too much on independent verification and is more resource intensive. Our method addresses the resource consideration by looking only at the residual differences between an audio file's smoothed signal and its actual signal. We conjecture that natural audio has greater variance than spoofed audio because spoofed audio's generation is conditioned on trying to mimic an existing pattern. To test this, we develop a classifier that distinguishes between spoofed and real audio by analyzing the differences in residual patterns between audio files. Outstanding Thesis. Major, United States Army. Approved for public release; distribution is unlimited.
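
    The idea can be illustrated with a short sketch: smooth the waveform, subtract the smoothed copy from the original, and summarize the residual by its variance. The moving-average smoother and window length below are illustrative assumptions, not the thesis' exact configuration.
```python
# Illustrative sketch: residual between an audio signal and a smoothed copy,
# summarized by its variance.
import numpy as np

def residual_variance(signal: np.ndarray, window: int = 15) -> float:
    # Simple moving-average smoother (window length is an arbitrary choice).
    kernel = np.ones(window) / window
    smoothed = np.convolve(signal, kernel, mode="same")
    residual = signal - smoothed
    return float(np.var(residual))

# Per the abstract's conjecture, natural speech tends to leave a
# higher-variance residual than spoofed speech, so this scalar (or
# frame-wise statistics of it) can feed a simple classifier.
```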

    Long Term Spectral Statistics for Voice Presentation Attack Detection

    Automatic speaker verification systems can be spoofed through recorded, synthetic, or voice-converted speech of target speakers. To make these systems practically viable, the detection of such attacks, referred to as presentation attacks, is of paramount interest. In that direction, this paper investigates two aspects: (a) a novel approach to detect presentation attacks in which, unlike conventional approaches, no assumptions are made about the speech signal; rather, attacks are detected by computing first-order and second-order spectral statistics and feeding them to a classifier, and (b) the generalization of presentation attack detection systems across databases. Our investigations on the Interspeech 2015 ASVspoof challenge dataset and the AVspoof dataset show that, when compared to approaches based on conventional short-term spectral processing, the proposed approach with a linear discriminative classifier yields a better system, irrespective of whether the spoofed signal is replayed to the microphone or is directly injected into the system software process. Cross-database investigations show that neither the short-term spectral processing based approaches nor the proposed approach yield systems that are able to generalize across databases or methods of attack, revealing the difficulty of the problem and the need for further resources and research.
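
    As a rough illustration of the proposed front end, the first- and second-order statistics can be taken as the mean and standard deviation of the log-magnitude spectrum accumulated over the whole utterance and then passed to a linear classifier. The sketch below assumes an STFT front end and scikit-learn's linear discriminant analysis; these toolkit choices are illustrative rather than the paper's exact setup.
```python
# Sketch of long-term spectral statistics as features for a linear classifier.
import numpy as np
from scipy.signal import stft
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def long_term_spectral_stats(signal, sr=16000, n_fft=512):
    _, _, spec = stft(signal, fs=sr, nperseg=n_fft)
    log_mag = np.log(np.abs(spec) + 1e-10)   # shape: (freq_bins, frames)
    mean = log_mag.mean(axis=1)              # first-order statistics
    std = log_mag.std(axis=1)                # second-order statistics
    return np.concatenate([mean, std])

def train_detector(train_signals, labels):
    # labels: 1 = bona fide, 0 = attack
    feats = np.stack([long_term_spectral_stats(x) for x in train_signals])
    clf = LinearDiscriminantAnalysis()
    clf.fit(feats, labels)
    return clf
```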

    The effects of background noise and test subject on the perceived amount of bass in phase-modified harmonic complex tones

    The perception of timbre is closely related to the relative levels produced by a sound in each frequency band, called a 'critical band', in the cochlea. The magnitude spectrum defines the relative levels and the phase spectrum the relative phases of the frequency components in a complex sound. Thus, the timbre of a sound often depends only on the magnitude spectrum. However, several studies have shown that the timbre of certain complex sounds can be affected by modifying only the phase spectrum. Moreover, a recent study has shown that with certain modifications of only the phase spectrum of a 'phase-sensitive' harmonic complex tone, the perceived level of bass changes. That experiment was conducted using two synthetic harmonic complex tones in which adjacent frequency components have a phase shift of -90° and 90°, respectively, and the fundamental component is in cosine phase. The greatest difference in the perceived level of bass was found at the fundamental frequency of 50 Hz, corresponding to a 2-4 dB amplification of the magnitude spectrum at low frequencies. However, this effect was reported to vary substantially between individuals. Moreover, the differences were found to be easier to detect in the presence of background noise.
The aim of this thesis was to investigate further the roles of background noise and the individual in the perceived level of bass in the phase-sensitive tones. Two formal listening tests were conducted accordingly using headphones. First, the effect of background noise on the discrimination of the phase-sensitive tones based on the perceived level of bass was studied. The effect of increasing the background noise level on the perceived loudness difference was found not to be statistically significant, though close to the significance threshold, and a trend towards an increasing loudness difference could be seen. Additionally, the results indicate that the overall perceived loudness of the test tones decreases with increasing level of background noise. Second, an experiment was conducted to find the value of the constant phase shift between adjacent components that produces the tone with the perceptually loudest bass for different individuals. The results show that individuals differ, statistically significantly, in which phase spectrum they hear as producing the loudest bass.
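
    The stimuli described above can be sketched as a harmonic complex tone whose fundamental is in cosine phase and whose adjacent harmonics differ by a constant phase shift of +90° or -90°. The harmonic count, duration, and sampling rate below are illustrative values, not those used in the listening tests.
```python
# Sketch of a phase-sensitive harmonic complex tone: the fundamental is in
# cosine phase and each higher harmonic is offset from the previous one by a
# constant phase shift (e.g. +90 or -90 degrees).
import numpy as np

def phase_shifted_complex(f0=50.0, n_harmonics=40, phase_step_deg=90.0,
                          duration=1.0, sr=48000):
    t = np.arange(int(duration * sr)) / sr
    step = np.deg2rad(phase_step_deg)
    tone = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        phase = (k - 1) * step   # fundamental (k = 1) stays in cosine phase
        tone += np.cos(2 * np.pi * k * f0 * t + phase)
    return tone / np.max(np.abs(tone))

tone_plus = phase_shifted_complex(phase_step_deg=90.0)
tone_minus = phase_shifted_complex(phase_step_deg=-90.0)
```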

    Use of the harmonic phase in synthetic speech detection

    Speaker verification (SV) systems face the possibility of being attacked through spoofing techniques. Nowadays, voice conversion and speaker-adaptive speech synthesis technologies have advanced enough to create voices capable of deceiving an SV system. This thesis proposes a synthetic speech detection (SSD) module that can be used as a complement to an SV system but is also capable of operating independently. It consists of a GMM-based classifier with models of human and synthetic speech. Each input is compared against both models and, if the likelihood difference exceeds a given threshold, it is accepted as human; otherwise it is rejected. The developed system is speaker-independent. RPS parameters are used to generate the models. A technique is proposed to reduce the complexity of the training process, avoiding the need to build adapted TTS systems or a voice converter for each speaker. Since most modern adaptation or synthesis systems make use of vocoders, human signals are transcoded with vocoders to obtain their synthetic versions, which are then used to build the synthetic models of the classifier. It is shown that synthetic signals can be detected by detecting that they were created with a vocoder. The performance of the system is tested under different conditions: with the transcoded signals themselves or with TTS attacks. Finally, strategies for training the models of SSD systems are discussed.
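
    The decision rule described above can be sketched with two Gaussian mixture models and a log-likelihood ratio threshold. The example below uses scikit-learn's GaussianMixture on precomputed feature frames; the RPS feature extraction itself, the mixture size, and the threshold are assumptions for illustration.
```python
# Sketch of the GMM likelihood-ratio decision: one model for human speech,
# one for (vocoder-transcoded) synthetic speech.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(human_feats, synthetic_feats, n_components=64):
    # human_feats / synthetic_feats: (n_frames, n_dims) feature arrays.
    gmm_human = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm_synth = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm_human.fit(human_feats)
    gmm_synth.fit(synthetic_feats)
    return gmm_human, gmm_synth

def is_human(utterance_feats, gmm_human, gmm_synth, threshold=0.0):
    # Average per-frame log-likelihood under each model; accept as human
    # when the difference exceeds the threshold.
    llr = gmm_human.score(utterance_feats) - gmm_synth.score(utterance_feats)
    return llr > threshold
```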

    Image and Video Forensics

    Nowadays, images and videos have become the main modalities of information exchanged in everyday life, and their pervasiveness has led the image forensics community to question their reliability, integrity, confidentiality, and security. Multimedia content is generated in many different ways through the use of consumer electronics and high-quality digital imaging devices, such as smartphones, digital cameras, tablets, and wearable and IoT devices. The ever-increasing convenience of image acquisition has facilitated instant distribution and sharing of digital images on digital social platforms, generating a great amount of exchanged data. Moreover, the pervasiveness of powerful image editing tools has allowed the manipulation of digital images for malicious or criminal ends, up to the creation of synthesized images and videos with the use of deep learning techniques. In response to these threats, the multimedia forensics community has produced major research efforts regarding the identification of the source and the detection of manipulation. In all cases where images and videos serve as critical evidence (e.g., forensic investigations, fake news debunking, information warfare, and cyberattacks), forensic technologies that help to determine the origin, authenticity, and integrity of multimedia content can become essential tools. This book aims to collect a diverse and complementary set of articles that demonstrate new developments and applications in image and video forensics to tackle new and serious challenges and to ensure media authenticity.