Recommended from our members
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The rapid pace of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from their voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice to produce a robust authentication system.
A novel approach to speaker identification is developed using wavelet analysis and multiple neural networks, including a Probabilistic Neural Network (PNN), a General Regressive Neural Network (GRNN) and a Radial Basis Function Neural Network (RBF NN), combined with the AND voting scheme. This approach is tested on the GRID and VidTIMIT corpora, and comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% compared to the classical Mel-frequency Cepstral Coefficients (MFCC), and reduced the recognition time by 40% compared to Back Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA).
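One way to read the AND voting scheme is that an identity is accepted only when every network agrees on it. The following is a minimal sketch under that reading; the function name `and_vote` and the choice to return `None` (reject) on disagreement are illustrative assumptions, not details taken from the thesis.

```python
from typing import Optional, Sequence


def and_vote(predictions: Sequence[str]) -> Optional[str]:
    """Return the shared label if all classifiers agree, else None (reject).

    `predictions` holds one predicted speaker label per classifier,
    e.g. the outputs of the PNN, GRNN and RBF NN (assumed interface).
    """
    if not predictions:
        return None
    first = predictions[0]
    return first if all(p == first for p in predictions) else None
```

For example, `and_vote(["s07", "s07", "s07"])` yields `"s07"`, while a single dissenting network causes the sample to be rejected.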
Another novel approach, based on vowel formant analysis, is implemented using Linear Discriminant Analysis (LDA). Vowel formant based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage and time efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme does not require any training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, but the proposed score-based methodology stays almost linear.
Finally, a novel audio-visual fusion based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform the feature-level fusion in terms of accuracy and error resilience. This result is in line with the distinct nature of the two modalities, which lose their individual characteristics when combined at the feature level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice due to its low computational time and high recognition accuracy.
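Score-level and decision-level fusion can be illustrated with a short sketch. It assumes each modality returns a per-speaker score dictionary, uses min-max normalisation and a weighted sum for the score level, and a plain logical OR for the decision level. The normalisation, the weight `w_voice` and all function names are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one modality's scores to [0, 1] so modalities are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse_scores(voice: Dict[str, float], face: Dict[str, float],
                w_voice: float = 0.5) -> str:
    """Score-level fusion: weighted sum of normalised scores, then argmax."""
    v, f = min_max(voice), min_max(face)
    fused = {k: w_voice * v[k] + (1.0 - w_voice) * f[k]
             for k in v.keys() & f.keys()}
    return max(fused, key=fused.get)


def or_vote(voice_accepts: bool, face_accepts: bool) -> bool:
    """Decision-level fusion with OR voting: either modality may accept."""
    return voice_accepts or face_accepts
```

Fusing after each modality has produced its own scores keeps the voice and face representations separate, which matches the observation that the two modalities are too different to combine well at the feature level.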
The role of HG in the analysis of temporal iteration and interaural correlation
Unmasking Clever Hans Predictors and Assessing What Machines Really Learn
Current learning machines have successfully solved hard application problems,
reaching high accuracy and displaying seemingly "intelligent" behavior. Here we
apply recent techniques for explaining decisions of state-of-the-art learning
machines and analyze various tasks from computer vision and arcade games. This
showcases a spectrum of problem-solving behaviors ranging from naive and
short-sighted, to well-informed and strategic. We observe that standard
performance evaluation metrics can be oblivious to distinguishing these diverse
problem solving behaviors. Furthermore, we propose our semi-automated Spectral
Relevance Analysis that provides a practically effective way of characterizing
and validating the behavior of nonlinear learning machines. This helps to
assess whether a learned model indeed delivers reliably for the problem that it
was conceived for. Furthermore, our work intends to add a voice of caution to
the ongoing excitement about machine intelligence and pledges to evaluate and
judge some of these recent successes in a more nuanced manner. Comment: Accepted for publication in Nature Communications
Multi-Sensory Interaction for Blind and Visually Impaired People
This book conveys the visual elements of artwork to the visually impaired through various sensory elements, opening a new perspective for appreciating visual artwork. In addition, it explores the technique of expressing a color code by integrating patterns, temperatures, scents, music, and vibrations, and presents future research topics. A holistic experience based on multi-sensory interaction is offered to people with visual impairment to convey the meaning and contents of the work through rich multi-sensory appreciation. A method that allows people with visual impairments to engage with artwork using a variety of senses, including touch, temperature, tactile pattern, and sound, helps them to appreciate artwork at a deeper level than can be achieved with hearing or touch alone. The development of such art appreciation aids for the visually impaired will ultimately improve their cultural enjoyment and strengthen their access to culture and the arts. The development of these new-concept aids ultimately expands opportunities for the non-visually impaired as well as the visually impaired to enjoy works of art, and breaks down the boundaries between the disabled and the non-disabled in the field of culture and arts through continuous efforts to enhance accessibility. In addition, the developed multi-sensory expression and delivery tool can be used as an educational tool to increase product and artwork accessibility and usability through multi-modal interaction. Training in the multi-sensory experiences introduced in this book may lead to more vivid visual imagery, or seeing with the mind’s eye.
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications that are able to operate in real-world environments, such as mobile communication services and smart homes.
Audio Event Classification Using Deep Learning Methods
Whether crossing the road or enjoying a concert, sound carries important information about the world around us. Audio event classification refers to recognition tasks involving the assignment of one or several labels, such as ‘dog bark’ or ‘doorbell’, to a particular audio signal. Teaching machines to carry out this classification task can therefore help humans in many fields. Since deep learning has shown great potential and usefulness in many AI applications, this thesis focuses on studying deep learning methods and building suitable neural networks for audio event classification. In order to evaluate the performance of different neural networks, we tested them on both Google AudioSet and the dataset for DCASE 2018 Task 2. Instead of providing original audio files, AudioSet offers compact 128-dimensional embeddings output by a modified VGG model for audio with a frame length of 960 ms. For DCASE 2018 Task 2, we first preprocessed the soundtracks and then fine-tuned the VGG model that AudioSet used as a feature extractor. Thus, each soundtrack from both tasks is represented as a series of 128-dimensional features. We then compared the DNN, LSTM, and multi-level attention models with different hyperparameters. The results show that fine-tuning the feature generation model for the DCASE task greatly improved the evaluation score. In addition, the attention models were found to perform the best in our settings for both tasks. The results indicate that utilizing a CNN-like model as a feature extractor for the log-mel spectrograms and modeling the dynamics information using an attention model can achieve state-of-the-art results in the task of audio event classification. For future research, the thesis suggests training a better CNN model for feature extraction, utilizing multi-scale and multi-level features for better classification, and combining the audio features with other multimodal information for audiovisual data analysis.
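The idea of attention over a series of frame-level embeddings can be sketched as follows: each frame gets a class probability and an attention weight, and the clip-level score is the attention-weighted average of the frame probabilities. This is a single-level, single-class simplification written in plain Python; the weight vectors `w_att` and `w_cls` stand in for learned parameters and are assumptions, as is the exact pooling form, which the abstract does not spell out.

```python
import math
from typing import Sequence


def attention_pool(frames: Sequence[Sequence[float]],
                   w_att: Sequence[float],
                   w_cls: Sequence[float]) -> float:
    """Pool per-frame class probabilities with softmax attention over time."""
    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    # Attention weights: softmax over frames (shifted by the max for stability).
    att_logits = [dot(f, w_att) for f in frames]
    peak = max(att_logits)
    exps = [math.exp(a - peak) for a in att_logits]
    total = sum(exps)
    att = [e / total for e in exps]

    # Per-frame class probability via a sigmoid on a linear score.
    probs = [1.0 / (1.0 + math.exp(-dot(f, w_cls))) for f in frames]

    # Clip-level probability: attention-weighted average of frame probabilities.
    return sum(a * p for a, p in zip(att, probs))
```

In the setting described above, `frames` would hold the 128-dimensional embeddings of one soundtrack; a multi-level variant would apply such pooling to the outputs of several hidden layers and combine the results.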
Automated camera ranking and selection using video content and scene context
PhD thesis. When observing a scene with multiple cameras, an important problem to solve is to automatically
identify “what camera feed should be shown and when?” The answer to this question is of interest
for a number of applications and scenarios ranging from sports to surveillance. In this thesis we
present a framework for the ranking of each video frame and camera across time and the camera
network, respectively. This ranking is then used for automated video production. In the first stage,
information from each camera view and from the objects in it is extracted and represented in a way
that allows for object- and frame-ranking. First, objects are detected and ranked within and across
camera views. This ranking takes into account both visible and contextual information related to
the object. Then content ranking is performed based on the objects in the view and camera-network
level information. We propose two novel techniques for content ranking, namely Routing Based
Ranking (RBR) and Multivariate Gaussian based Ranking (MVG). In RBR we use a rule-based
framework where weighted fusion of object- and frame-level information takes place, while in MVG
the rank is estimated as a multivariate Gaussian distribution. Through experimental and subjective
validation we demonstrate that the proposed content ranking strategies allow the identification of
the best camera at each time.
The second part of the thesis focuses on the automatic generation of N-to-1 videos based on the
ranked content. We demonstrate that in such production settings it is undesirable to have frequent
inter-camera switching. This motivates the need for a compromise between selecting the best
camera most of the time and minimising frequent inter-camera switching. We demonstrate that
state-of-the-art techniques for this task are inadequate and fail in dynamic scenes. We propose three
novel methods for automated camera selection. The first method (¡go f ) performs a joint optimization
of a cost function that depends on both the view quality and inter-camera switching so that a
pleasing best-view video sequence can be composed. The other two methods (¡dbn and ¡util) include
the selection decision into the ranking strategy. In ¡dbn we model the best-camera selection
as a state sequence via Directed Acyclic Graphs (DAG) designed as a Dynamic Bayesian Network
(DBN), which encodes the contextual knowledge about the camera network and employs the past
information to minimize the inter-camera switches. In comparison, ¡util utilizes the past as well
as the future information in a Partially Observable Markov Decision Process (POMDP) where the
camera selection at a certain time is influenced by the past information and its repercussions in
the future. The performance of the proposed approaches is demonstrated on multiple real and synthetic
multi-camera setups. We compare the proposed architectures with various baseline methods
with encouraging results. The performance of the proposed approaches is also validated through
extensive subjective testing.
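The trade-off between view quality and inter-camera switching can be illustrated with a small dynamic-programming sketch. It is not the thesis's DBN or POMDP formulation: it simply maximises the sum of per-frame quality scores minus a fixed penalty for every switch. The `quality` matrix, the `switch_cost` parameter and the function name are assumptions made for illustration.

```python
from typing import List, Sequence


def select_cameras(quality: Sequence[Sequence[float]],
                   switch_cost: float) -> List[int]:
    """Pick one camera per frame, maximising quality minus switching penalties.

    quality[t][c] is the (assumed) rank score of camera c at frame t;
    switch_cost is a non-negative penalty paid whenever the camera changes.
    """
    n_cams = len(quality[0])
    best = list(quality[0])        # best total score of a path ending on each camera
    back: List[List[int]] = []     # back-pointers, one list per transition

    for row in quality[1:]:
        top = max(range(n_cams), key=lambda c: best[c])
        new_best, pointers = [], []
        for c in range(n_cams):
            stay = best[c]
            switch = best[top] - switch_cost
            if stay >= switch:
                new_best.append(stay + row[c])
                pointers.append(c)
            else:
                new_best.append(switch + row[c])
                pointers.append(top)
        best = new_best
        back.append(pointers)

    # Trace the best path backwards from the best final camera.
    cam = max(range(n_cams), key=lambda c: best[c])
    path = [cam]
    for pointers in reversed(back):
        cam = pointers[cam]
        path.append(cam)
    return path[::-1]
```

With a zero penalty the sketch reduces to picking the best camera at every frame; raising `switch_cost` suppresses short-lived switches, which is the compromise the abstract argues for.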
Image and Video Forensics
Nowadays, images and videos have become the main modalities of information exchanged in everyday life, and their pervasiveness has led the image forensics community to question their reliability, integrity, confidentiality, and security. Multimedia content is generated in many different ways through the use of consumer electronics and high-quality digital imaging devices, such as smartphones, digital cameras, tablets, and wearable and IoT devices. The ever-increasing convenience of image acquisition has facilitated instant distribution and sharing of digital images on digital social platforms, generating a large volume of exchanged data. Moreover, the pervasiveness of powerful image editing tools has allowed the manipulation of digital images for malicious or criminal ends, up to the creation of synthesized images and videos with the use of deep learning techniques. In response to these threats, the multimedia forensics community has produced major research efforts regarding the identification of the source and the detection of manipulation. In all cases (e.g., forensic investigations, fake news debunking, information warfare, and cyberattacks) where images and videos serve as critical evidence, forensic technologies that help to determine the origin, authenticity, and integrity of multimedia content can become essential tools. This book aims to collect a diverse and complementary set of articles that demonstrate new developments and applications in image and video forensics to tackle new and serious challenges to ensure media authenticity.
Adaptive detection and tracking using multimodal information
This thesis describes work on fusing data from multiple sources of information, and focuses on two main areas: adaptive detection and adaptive object tracking in automated vision scenarios. The work on adaptive object detection explores a new paradigm in dynamic parameter selection, by selecting thresholds for object detection to maximise agreement between pairs of sources. Object tracking, a complementary technique to object detection, is also explored in a multi-source context, and an efficient framework for robust tracking, termed the Spatiogram Bank tracker, is proposed as a means to overcome the difficulties of traditional histogram tracking. As well as performing theoretical analysis of the proposed methods, specific example applications are given for both the detection and the tracking aspects, using thermal infrared and visible spectrum video data, as well as other multi-modal information sources.
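Selecting detection thresholds to maximise agreement between two sources can be sketched as a small grid search. The abstract does not say how agreement is measured, so this sketch assumes Cohen's kappa, which scores zero for the degenerate cases where a threshold makes a source detect everything or nothing. The function names and the candidate threshold lists are likewise assumptions.

```python
from itertools import product
from typing import Sequence, Tuple


def kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa between two binary decision sequences of equal length."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_exp = pa * pb + (1 - pa) * (1 - pb)
    return 0.0 if p_exp == 1.0 else (p_obs - p_exp) / (1 - p_exp)


def best_thresholds(scores_a: Sequence[float], scores_b: Sequence[float],
                    cands_a: Sequence[float],
                    cands_b: Sequence[float]) -> Tuple[float, float]:
    """Grid-search the threshold pair whose detections agree most (by kappa).

    scores_a[i] and scores_b[i] are the two sources' detection scores for
    the same pixel or region i (assumed to be spatially registered).
    """
    def agreement(pair: Tuple[float, float]) -> float:
        ta, tb = pair
        return kappa([s >= ta for s in scores_a],
                     [s >= tb for s in scores_b])

    return max(product(cands_a, cands_b), key=agreement)
```

In the thermal and visible-spectrum setting mentioned above, `scores_a` and `scores_b` might be per-pixel foreground scores from the two registered video streams, with the search repeated as scene conditions change.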