
    A Detailed Investigation into Low-Level Feature Detection in Spectrogram Images

    Being the first stage of analysis within an image, low-level feature detection is a crucial step in the image analysis process and, as such, deserves suitable attention. This paper presents a systematic investigation into low-level feature detection in spectrogram images, the result of which is the identification of frequency tracks. Analysis of the literature identifies different strategies for accomplishing low-level feature detection; however, the advantages and disadvantages of each are not explicitly investigated. Three model-based detection strategies are outlined, each extracting an increasing amount of information from the spectrogram, and ROC analysis shows that detection rates increase with the level of extraction. Further investigation, however, suggests that model-based detection has a limitation: it is not computationally feasible to fully evaluate the model of even a simple sinusoidal track. Therefore, alternative approaches, such as dimensionality reduction, are investigated to reduce the complex search space. It is shown that, if carefully selected, these techniques can approach the detection rates of model-based strategies that perform the same level of information extraction. The implementations used to derive the results presented in this paper are available online at http://stdetect.googlecode.com
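
The ROC comparison described above can be pictured with a small, self-contained sketch. The detector scores below are synthetic stand-ins (Gaussian-distributed, an assumption of this illustration), not results from the paper:

```python
# Sketch of ROC analysis for a low-level detector: score two populations
# (frequency-track pixels vs. background), sweep a threshold, and trace the
# curve. Score distributions here are illustrative assumptions.
import random

random.seed(0)
track_scores = [random.gauss(2.0, 1.0) for _ in range(500)]   # response on track pixels
noise_scores = [random.gauss(0.0, 1.0) for _ in range(500)]   # response on background

def roc_curve(pos, neg, n_thresholds=200):
    lo, hi = min(pos + neg), max(pos + neg)
    points = []
    for i in range(n_thresholds + 1):
        t = lo + (hi - lo) * i / n_thresholds
        tpr = sum(s >= t for s in pos) / len(pos)   # true-positive rate
        fpr = sum(s >= t for s in neg) / len(neg)   # false-positive rate
        points.append((fpr, tpr))
    return sorted(points)

def auc(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoidal rule
    return area

points = roc_curve(track_scores, noise_scores)
print(f"AUC = {auc(points):.3f}")   # a better-than-chance detector gives AUC > 0.5
```

Comparing detection strategies then reduces to comparing their curves (or AUC values) on the same spectrogram data.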

    Detailed versus gross spectro-temporal cues for the perception of stop consonants


    Audio Processing and Loudness Estimation Algorithms with iOS Simulations

    The processing power and storage capacity of portable devices have improved considerably over the past decade. This has motivated the implementation of sophisticated audio and other signal processing algorithms on such mobile devices. Of particular interest in this thesis is audio/speech processing based on perceptual criteria. Specifically, estimation of parameters from human auditory models, such as auditory patterns and loudness, involves computationally intensive operations which can strain device resources. Hence, strategies for implementing computationally efficient human auditory models for loudness estimation have been studied in this thesis. Existing algorithms for reducing computations in auditory pattern and loudness estimation have been examined, and improved algorithms have been proposed to overcome their limitations. In addition, real-time applications such as perceptual loudness estimation and loudness equalization using auditory models have been implemented. A software implementation of loudness estimation on iOS devices is also reported. Beyond the loudness estimation algorithms and software, this thesis project also created new illustrations of speech and audio processing concepts for research and education. As a result, a new suite of speech/audio DSP functions was developed and integrated into the award-winning educational iOS app 'iJDSP'. These functions are described in detail in this thesis. Several enhancements to the application's architecture have also been introduced to provide a supporting framework for speech/audio processing. Frame-by-frame processing and visualization functionalities have been developed to facilitate speech/audio processing, and facilities for easy sound recording, processing, and audio rendering provide students, practitioners, and researchers with an enriched DSP simulation tool.
Simulations and assessments have also been developed for use in classes and in the training of practitioners and students. Dissertation/Thesis, M.S. Electrical Engineering, 201
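
The frame-by-frame structure of such real-time processing can be sketched minimally. True loudness models (auditory patterns, specific loudness) are far more involved than this; RMS level here is only a stand-in to show the pipeline shape, and the frame sizes and target level are illustrative assumptions:

```python
# Minimal frame-by-frame level meter plus a per-frame equalization gain,
# illustrating the processing structure (not the thesis's auditory model).
import math

def frame_levels_db(samples, frame_len=1024, hop=512, eps=1e-12):
    """Per-frame RMS level in dBFS for a mono signal scaled to [-1, 1]."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        levels.append(20.0 * math.log10(rms + eps))
    return levels

def equalization_gains(levels_db, target_db=-20.0):
    """Linear gain that would pull each frame toward target_db."""
    return [10.0 ** ((target_db - l) / 20.0) for l in levels_db]

# Example: one second of a 440 Hz tone at half amplitude, 8 kHz sample rate.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
levels = frame_levels_db(tone)
gains = equalization_gains(levels)
```

A real loudness equalizer would replace the RMS stage with a perceptual loudness estimate and smooth the gains across frames.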

    Novel Signal Reconstruction Techniques in Cyclotron Radiation Emission Spectroscopy for Neutrino Mass Measurement

    The Project 8 experiment is developing Cyclotron Radiation Emission Spectroscopy (CRES) on the beta-decay spectrum of tritium for the measurement of the absolute neutrino mass scale. CRES is a frequency-based technique which aims to probe the endpoint of the tritium energy spectrum with a final target sensitivity of 0.04 eV, pushing the limits beyond the inverted mass hierarchy. A phased-approach experiment, Project 8 uses a combination of 83mKr and molecular tritium T_2 as source gases in both its Phase I and Phase II efforts. The technique relies on an accurate, precise, and well-understood reconstructed beta spectrum whose endpoint and spectral shape near the endpoint may be constrained by a kinematical model which uses the neutrino mass m_beta as a free parameter. Since the decays in the last eV of the tritium spectrum encompass O(10^(-13)) of all decays, and the precise variation of the spectrum, distorted by the presence of a massive neutrino, is fundamental to the measurement, reconstruction techniques are needed which yield accurate measurements of the frequency (and therefore energy) of the signal and correctly separate signal from background. In this work, we discuss the open problem of the absolute neutrino mass scale, the fundamentals of measurements tailored to resolve it, the underpinnings and details of the CRES technology, and the measurement of the first-ever CRES tritium β-spectrum. Finally, we focus on novel reconstruction techniques at both the signal and event levels using machine learning algorithms that allow us to adapt our technique to the complex dynamics of the electron inside our detector. We show that such methods can separate true events from backgrounds at > 94% accuracy and improve the efficiency of reconstruction by > 23% compared to traditional reconstruction methods.
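
The frequency-to-energy mapping at the heart of CRES follows from the relativistic cyclotron relation f = eB / (2π γ m_e): a measured start frequency fixes γ and hence the electron's kinetic energy. The field and frequency values below are illustrative, not Project 8 calibration numbers:

```python
# Kinetic energy implied by a measured cyclotron frequency in a field B.
import math

E_CHARGE = 1.602176634e-19     # elementary charge, C
M_E = 9.1093837015e-31         # electron mass, kg
C = 299792458.0                # speed of light, m/s
M_E_KEV = M_E * C**2 / E_CHARGE / 1e3   # electron rest energy, ~511 keV

def kinetic_energy_kev(freq_hz, b_tesla):
    """Invert f = eB / (2*pi*gamma*m_e) and return (gamma - 1) * m_e c^2."""
    gamma = E_CHARGE * b_tesla / (2.0 * math.pi * freq_hz * M_E)
    return (gamma - 1.0) * M_E_KEV

# Near the tritium endpoint (~18.6 keV) in a ~1 T trap, the signal sits
# around 26-27 GHz:
print(round(kinetic_energy_kev(27.0e9, 1.0), 2))
```

Note the inverse relationship: higher-energy electrons radiate at *lower* frequency, which is why precise frequency reconstruction translates directly into energy resolution at the endpoint.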

    TIME AND LOCATION FORENSICS FOR MULTIMEDIA

    In the modern era, vast quantities of digital information are available in the form of audio, image, video, and other sensor recordings. These recordings may contain metadata describing important information such as the time and the location of recording. As the stored information can be easily modified using readily available digital editing software, determining the authenticity of a recording is of utmost importance, especially for critical applications such as law enforcement, journalism, and national and business intelligence. In this dissertation, we study novel environmental signatures induced by power networks, known as Electrical Network Frequency (ENF) signals, which become embedded in multimedia data at the time of recording. The ENF fluctuates slightly over time around its nominal value of 50 Hz/60 Hz. The major trend of these fluctuations remains consistent across an entire power grid, even when measured at physically distant geographical locations. We investigate the use of ENF signals for applications such as estimation and verification of the time and location of a recording's creation, and develop a theoretical foundation to support ENF-based forensic analysis. In the first part of the dissertation, the presence of ENF signals in visual recordings captured in electrically powered lighting environments is demonstrated. The source of ENF signals in visual recordings is shown to be the invisible flickering of indoor lighting sources such as fluorescent and incandescent lamps. Techniques to extract ENF signals from such recordings are developed, and a high correlation is demonstrated between the ENF fluctuations obtained from indoor lighting and those from the power mains supply recorded at the same time. Applications of ENF signal analysis to tampering detection of surveillance video recordings, and to forensically binding the audio and visual tracks of a video, are also discussed.
In the second part, an analytical model is developed to gain an understanding of the behavior of ENF signals. It is demonstrated that ENF signals can be modeled using a time-varying autoregressive process. The performance of the proposed model is evaluated for a timestamp verification application. Based on this model, an improved algorithm for ENF matching between a reference signal and a query signal is provided; it is shown to improve matching performance compared with matching performed directly on the ENF signals. Another application of the proposed model, learning power grid characteristics, is also explicated: the modeling parameters are used as features to train a classifier that determines the creation location of a recording among candidate grid-regions. The last part of the dissertation demonstrates that differences exist between ENF signals recorded in the same grid-region at the same time. These differences can be extracted using a suitable filtering mechanism and follow a relationship with the distance between locations. Based on this observation, two localization protocols are developed to identify the location of a recording within a grid-region, using ENF signals captured at anchor locations. The localization accuracies of the proposed protocols are then compared. Challenges in using the proposed technique to estimate the creation location of multimedia recordings within the same grid, along with efficient and resilient trilateration strategies in the presence of outliers and malicious anchors, are also discussed.
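
The basic ENF extraction step can be sketched compactly: estimate the dominant frequency near the nominal value in each analysis frame, producing a frequency track that can then be matched against a reference. Real systems work on long recordings with overlapping frames and refined estimators; the synthetic drift values below merely stand in for grid fluctuations:

```python
# Per-frame dominant-frequency estimation near 60 Hz via a scanned
# single-bin DFT (an illustrative estimator, not the dissertation's).
import cmath
import math

SR = 400        # Hz; enough to observe a 60 Hz component
FRAME = SR      # 1-second analysis frames

def enf_track(samples, f_lo=59.5, f_hi=60.5, step=0.02):
    """Return the peak frequency near nominal 60 Hz, one value per frame."""
    track = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[start:start + FRAME]
        best_f, best_mag = f_lo, -1.0
        f = f_lo
        while f <= f_hi:
            # magnitude of a single-bin DFT evaluated at frequency f
            acc = sum(x * cmath.exp(-2j * math.pi * f * n / SR)
                      for n, x in enumerate(frame))
            if abs(acc) > best_mag:
                best_f, best_mag = f, abs(acc)
            f += step
        track.append(best_f)
    return track

# Synthetic recording whose ENF drifts through known per-second values:
drift = [59.98, 60.02, 60.05, 59.96, 60.01]
samples = []
for f in drift:
    samples += [math.sin(2 * math.pi * f * n / SR) for n in range(FRAME)]

track = enf_track(samples)
print(track)
```

Matching a query against a reference then reduces to correlating two such tracks and locating the lag of maximum correlation.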

    Astronomy with integral field spectroscopy: observation, data analysis and results

    With a new generation of facility instruments being commissioned for 8 metre telescopes, integral field spectroscopy will soon be a standard tool in astronomy, opening a range of exciting new research opportunities. It is clear, however, that reducing and analyzing integral field data is a complex problem, which will need considerable attention before the full potential of the hardware can be realized. The purpose of this thesis is therefore to explore some of the scientific capabilities of integral field spectroscopy, developing the techniques needed to produce astrophysical results from the data. Two chapters are dedicated to the problem of analyzing observations from the densely-packed optical fibre instruments pioneered at Durham. It is shown that, in the limit where each spectrum is sampled by only one detector row, data may be treated in a similar way to those from an image slicer. The properties of raw fibre data are considered in the context of the Sampling Theorem, and methods for three-dimensional image reconstruction are discussed. These ideas are implemented in an IRAF data reduction package for the Thousand Element Integral Field Unit (TEIFU), with source code provided on the accompanying compact disc. Two observational studies are also presented. In the first case, the 3D infrared image slicer has been used to test for the presence of a super-massive black hole in the giant early-type galaxy NGC 1316. Measurements of the stellar kinematics do not reveal a black hole of mass 5 x 10^9 M_☉, as predicted from bulge luminosity using the relationship of Kormendy & Richstone (1995). The second study is an investigation into the origin of [FeII] line emission in the Seyfert galaxy NGC 4151, using Durham University's SMIRFS-IFU.
By mapping [FeII] line strength and velocity at the galaxy centre, it is shown that the emission is associated with the optical narrow-line region, rather than the radio jet, indicating that the excitation is primarily due to photoionizing X-rays. Finally, a report is given on the performance of TEIFU, which was commissioned at the William Herschel Telescope in 1999. Measurements of throughput and fibre response variation are given, and a reconstructed test observation of the radio galaxy 3C 327 is shown, demonstrating the functionality of the instrument and software.
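
The core reconstruction task for fibre instruments is turning irregularly placed fibre spectra into a regular (x, y, wavelength) datacube. The sketch below uses the simplest possible scheme, nearest-fibre assignment, purely to illustrate the problem shape; it is not the interpolation used in the TEIFU package:

```python
# Build a regular spatial grid of spectra from irregular fibre positions
# by assigning each output pixel the spectrum of its nearest fibre.
def nearest_fibre_cube(fibres, nx, ny, extent=1.0):
    """fibres: list of (x, y, spectrum). Returns cube[iy][ix] = spectrum."""
    cube = []
    for iy in range(ny):
        row = []
        for ix in range(nx):
            # centre of output pixel (ix, iy) in field coordinates
            px = (ix + 0.5) / nx * extent
            py = (iy + 0.5) / ny * extent
            nearest = min(fibres,
                          key=lambda f: (f[0] - px) ** 2 + (f[1] - py) ** 2)
            row.append(nearest[2])
        cube.append(row)
    return cube

# Four fibres near the corners of a unit field, 2x2 spatial output grid.
fibres = [(0.1, 0.1, [1.0]), (0.9, 0.1, [2.0]),
          (0.1, 0.9, [3.0]), (0.9, 0.9, [4.0])]
cube = nearest_fibre_cube(fibres, 2, 2)
print(cube[0][0], cube[1][1])
```

Proper reconstruction must additionally respect the Sampling Theorem constraints discussed in the thesis, interpolating rather than copying when the fibre spacing undersamples the point-spread function.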

    ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION

    Current Automatic Speech Recognition (ASR) systems perform far below human speech recognition because they lack robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them, and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as 'beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and this variation arises from a combination of factors including speaking style and speaking rate; a phenomenon commonly known as 'coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task. At present, no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs.
Two natural speech databases, X-ray microbeam and Aurora-2, were annotated; the former was used to train a TV estimator and the latter to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs estimated from the acoustic speech signal. In this setup the articulatory gestures were modeled as hidden random variables, eliminating the necessity for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only help account for coarticulatory variations but also significantly improve the noise robustness of the ASR system.
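
The DBN above conditions on two frame-synchronous observation streams, MFCCs and estimated TVs. A common practical baseline, used here only to illustrate the data layout, is per-frame concatenation of the two streams; the feature dimensions are illustrative assumptions, and the hidden-gesture modeling itself is not shown:

```python
# Fuse two frame-synchronous feature streams into one observation vector
# per frame, the input layout for a multi-stream acoustic model.
def fuse_streams(mfcc_frames, tv_frames):
    """Concatenate per-frame MFCC and TV vectors into one observation."""
    if len(mfcc_frames) != len(tv_frames):
        raise ValueError("streams must be frame-synchronous")
    return [m + t for m, t in zip(mfcc_frames, tv_frames)]

# 3 frames of 13-dim MFCCs and 8-dim TVs -> 3 frames of 21-dim observations.
mfcc = [[0.0] * 13 for _ in range(3)]
tvs = [[0.0] * 8 for _ in range(3)]
obs = fuse_streams(mfcc, tvs)
print(len(obs), len(obs[0]))
```

In the dissertation's architecture the two streams remain distinct observation sets of the DBN rather than a single concatenated vector, which lets the network weight them separately under noise.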

    Analysis of nonmodal glottal event patterns with application to automatic speaker recognition

    Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2008. Includes bibliographical references (p. 211-215). Regions of phonation exhibiting nonmodal characteristics are likely to contain information about speaker identity, language, dialect, and vocal-fold health. As a basis for testing such dependencies, we develop a representation of patterns in the relative timing and height of nonmodal glottal pulses. To extract the timing and height of candidate pulses, we investigate a variety of inverse-filtering schemes, including maximum-entropy deconvolution, which minimizes the predictability of a signal, and minimum-entropy deconvolution, which maximizes pulse-likeness. Hybrid formulations of these methods are also considered. We then derive a theoretical framework for understanding frequency- and time-domain properties of a pulse sequence, a process that sheds light on the transformation of nonmodal pulse trains into useful parameters. In the frequency domain, we introduce the first comprehensive mathematical derivation of the effect of deterministic and stochastic source perturbation on the short-time spectrum. We also propose a pitch representation of nonmodality that provides an alternative viewpoint on the frequency content without relying on Fourier bases. In developing time-domain properties, we use projected low-dimensional histograms of feature vectors derived from pulse timing and height parameters. For these features, we have found clusters of distinct pulse patterns, reflecting a wide variety of glottal-pulse phenomena including near-modal phonation, shimmer and jitter, diplophonia and triplophonia, and aperiodicity. Using temporal relationships between successive feature vectors, an algorithm has also been developed to separate these different classes of glottal-pulse characteristics.
We have used our glottal-pulse-pattern representation to automatically test for one signal dependency: speaker dependence of glottal-pulse sequences. This choice is motivated by differences observed between talkers in our separated feature space. Using an automatic speaker verification experiment, we investigate tradeoffs in speaker dependency for short-time pulse patterns, reflecting local irregularity, as well as long-time patterns related to higher-level cyclic variations. Results, using speakers with a broad array of modal and nonmodal behaviors, indicate high accuracy in speaker recognition performance, complementary to the use of conventional mel-cepstral features. These results suggest that there is rich structure in the source excitation that provides information about a particular speaker's identity. By Nicolas Malyska, Ph.D.
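
The pulse-timing and pulse-height features above generalize two classic perturbation measures, jitter (cycle-to-cycle period variation) and shimmer (cycle-to-cycle amplitude variation). The sketch below computes their simplest relative forms from a pulse sequence; the pulse values are synthetic, and the thesis's feature vectors are richer than these two scalars:

```python
# Relative jitter and shimmer from glottal-pulse timings and heights.
def jitter(pulse_times):
    """Mean absolute period difference divided by the mean period."""
    periods = [t1 - t0 for t0, t1 in zip(pulse_times, pulse_times[1:])]
    diffs = [abs(p1 - p0) for p0, p1 in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def shimmer(pulse_heights):
    """Mean absolute height difference divided by the mean height."""
    diffs = [abs(a1 - a0)
             for a0, a1 in zip(pulse_heights, pulse_heights[1:])]
    return (sum(diffs) / len(diffs)) / (sum(pulse_heights) / len(pulse_heights))

# Near-modal pulse train: ~10 ms periods with slight perturbation (seconds).
times = [0.0, 0.010, 0.0201, 0.0299, 0.040, 0.0502]
heights = [1.00, 0.98, 1.01, 0.99, 1.00, 0.97]
print(f"jitter={jitter(times):.4f} shimmer={shimmer(heights):.4f}")
```

Strongly nonmodal phonation such as diplophonia shows up not in these averages but in the *pattern* of alternating periods and heights, which is why the thesis works with histograms of successive feature vectors instead.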

    Harmonic Sinusoid Modeling of Tonal Music Events

    PhD. This thesis presents the theory, implementation and applications of harmonic sinusoid modeling of pitched audio events. Harmonic sinusoid modeling is a parametric model that expresses an audio signal, or part of an audio signal, as a linear combination of concurrent slowly varying sinusoids, grouped together under harmonic frequency constraints. It extends standard sinusoid modeling with additional frequency constraints so that it is capable of directly modeling tonal sounds. This enables applications such as object-oriented audio manipulation, polyphonic transcription, and instrument/singer recognition with background music. The modeling system consists of an analyzer and a synthesizer. The analyzer extracts harmonic sinusoidal parameters from an audio waveform, while the synthesizer rebuilds an audio waveform from these parameters. Parameter estimation is based on a detecting-grouping-tracking framework: the detecting stage finds and estimates sinusoid atoms; the grouping stage collects concurrent atoms into harmonic groups; the tracking stage links the atom groups at different times to form continuous harmonic sinusoid tracks. Compared to the standard sinusoid model, the harmonic model focuses on harmonic groups of atoms rather than on isolated atoms, and therefore naturally represents tonal sounds. The synthesizer rebuilds the audio signal by interpolating the measured parameters along the found tracks. We propose the first application of the harmonic sinusoid model in digital audio editors. For audio editing, with tonal events directly represented by a parametric model, we can implement standard audio editing functionalities on tonal events embedded in an audio signal, or invent new sound effects based on the model parameters themselves.
Possibilities for other applications are suggested at the end of this thesis. Financial support: European Commission, the Higher Education Funding Council for England, and Queen Mary, University of London.
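
The grouping stage of the detecting-grouping-tracking framework can be sketched in a few lines: given sinusoid atoms already detected in one frame, collect those whose frequencies sit near integer multiples of a candidate fundamental. The tolerance and atom values below are illustrative, not the thesis's estimator:

```python
# Collect detected sinusoid atoms into one harmonic group for a candidate f0.
def harmonic_group(atoms, f0, max_harmonic=10, tol=0.03):
    """atoms: list of (freq_hz, amplitude). Returns (k, atom) matches."""
    group = []
    for k in range(1, max_harmonic + 1):
        target = k * f0
        # closest atom to the k-th harmonic, kept if within relative tolerance
        best = min(atoms, key=lambda a: abs(a[0] - target))
        if abs(best[0] - target) <= tol * target:
            group.append((k, best))
    return group

# One frame with a 220 Hz harmonic series plus two spurious atoms.
atoms = [(220.4, 1.0), (441.1, 0.6), (659.8, 0.4), (883.0, 0.2),
         (513.0, 0.3), (1307.0, 0.1)]
group = harmonic_group(atoms, 220.0, max_harmonic=4)
print([k for k, _ in group])   # harmonics found: [1, 2, 3, 4]
```

The tracking stage then links such per-frame groups across time by continuity of f0 and amplitude, yielding the harmonic sinusoid tracks that the synthesizer interpolates.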

    Neural Basis and Computational Strategies for Auditory Processing

    Our senses are our window to the world, and hearing is the window through which we perceive the world of sound. While seemingly effortless, the process of hearing involves complex transformations by which the auditory system consolidates acoustic information from the environment into perceptual and cognitive experiences. Studies of auditory processing try to elucidate the mechanisms underlying the function of the auditory system, and infer computational strategies that are valuable both clinically and intellectually, hence contributing to our understanding of the function of the brain. In this thesis, we adopt both an experimental and computational approach in tackling various aspects of auditory processing. We first investigate the neural basis underlying the function of the auditory cortex, and explore the dynamics and computational mechanisms of cortical processing. Our findings offer physiological evidence for a role of primary cortical neurons in the integration of sound features at different time constants, and possibly in the formation of auditory objects. Based on physiological principles of sound processing, we explore computational implementations in tackling specific perceptual questions. We exploit our knowledge of the neural mechanisms of cortical auditory processing to formulate models addressing the problems of speech intelligibility and auditory scene analysis. The intelligibility model focuses on a computational approach for evaluating loss of intelligibility, inspired from mammalian physiology and human perception. It is based on a multi-resolution filter-bank implementation of cortical response patterns, which extends into a robust metric for assessing loss of intelligibility in communication channels and speech recordings. This same cortical representation is extended further to develop a computational scheme for auditory scene analysis. 
The model maps perceptual principles of auditory grouping and stream formation into a computational system that combines aspects of bottom-up, primitive sound processing with an internal representation of the world. It is based on a framework of unsupervised adaptive learning with Kalman estimation. The model is extremely valuable in exploring various aspects of sound organization in the brain, allowing us to gain interesting insight into the neural basis of auditory scene analysis, as well as practical implementations for sound separation in "cocktail-party" situations.
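
The idea of cortical neurons integrating sound features at different time constants can be pictured with a bank of leaky integrators (one-pole lowpass filters) driven by the same input. The time constants below are illustrative choices, not values fit to the physiological data:

```python
# Bank of leaky integrators: the same impulse seen at several time constants.
import math

def leaky_integrate(x, tau_s, sr):
    """One-pole lowpass: y[n] = a*y[n-1] + (1-a)*x[n], a = exp(-1/(tau*sr))."""
    a = math.exp(-1.0 / (tau_s * sr))
    y, out = 0.0, []
    for v in x:
        y = a * y + (1.0 - a) * v
        out.append(y)
    return out

sr = 1000                          # 1 kHz envelope sampling rate
impulse = [1.0] + [0.0] * 499      # a single brief event
bank = {tau: leaky_integrate(impulse, tau, sr) for tau in (0.005, 0.05, 0.2)}

# 100 ms after the event, slower integrators still retain more of it:
print([round(bank[tau][100], 4) for tau in (0.005, 0.05, 0.2)])
```

A unit with a long time constant keeps a trace of past events and so can bind features across time, which is one way to read the thesis's finding that multi-scale temporal integration supports the formation of auditory objects.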