52 research outputs found
An intelligent multimodal interface for in-car communication systems
In-car communication systems (ICCS) are increasingly used by drivers. ICCS are intended to minimise the driving distraction caused by using a mobile phone while driving. Several usability studies of ICCS utilising speech user interfaces (SUIs) have identified usability issues that can affect the workload, performance, satisfaction and user experience of the driver. This is because current speech technologies can be a source of errors that may frustrate the driver and negatively affect the user experience. The aim of this research was to design a new multimodal interface to manage the interaction between an ICCS and the driver. Unlike current ICCS, it should make more voice input available, so as to support tasks (e.g. sending text messages, browsing the phone book) which still impose a cognitive workload on the driver. An adaptive multimodal interface was proposed in order to address current ICCS issues. The multimodal interface used both speech and manual input; however, only the speech channel is used as output, in order to minimise the visual distraction that the graphical user interfaces or haptic devices of current ICCS can cause. The adaptive interface was designed to minimise the cognitive distraction of the driver: whenever the driver's distraction level is high, any information communication is postponed. After the design and implementation of the first version of the prototype interface, called MIMI, a usability evaluation was conducted to identify any possible usability issues. Although voice dialling was found to be problematic, the results were encouraging in terms of performance, workload and user satisfaction. The participants' suggestions for improving the system's usability were incorporated into the next implementation of MIMI. The adaptive module was then implemented to reduce driver distraction based on the driver's current context.
The proposed architecture showed encouraging results in terms of usability and safety. The adaptive behaviour of MIMI significantly contributed to the reduction of cognitive distraction, because drivers received less information during difficult driving situations
A comparison of features for large population speaker identification
Speech recognition systems all have one criterion in common: they perform better in a controlled environment using clean speech. Though performance can be excellent, even exceeding human capabilities for clean speech, systems fail when presented with speech data from more realistic environments such as telephone channels. The difference between using a recognizer in clean and in noisy environments is extreme, and this is one of the major obstacles to producing commercial recognition systems for use in normal environments. It is this lack of performance of speaker recognition systems over telephone channels that this work addresses. The human auditory system is a speech recognizer with excellent performance, especially in noisy environments. Since humans are better than any machine at ignoring noise, auditory-based methods are promising approaches, as they attempt to model the working of the human auditory system. These methods have been shown to outperform more conventional signal processing schemes for speech recognition, speech coding, word recognition and phone classification tasks. Since speaker identification has received much attention in speech processing because of its many real-world applications, it is attractive to evaluate its performance using auditory models as features. Firstly, this study aims at improving the results for speaker identification. The improvements were made through the use of parameterized feature sets together with the application of cepstral mean removal for channel equalization. The study is further extended to compare an auditory-based model, the Ensemble Interval Histogram (EIH), with mel-scale features, which were shown to perform almost error-free on clean speech. The previous studies showing the EIH to be more robust to noise were conducted on speaker-dependent, small-population, isolated-word tasks, and are now extended to speaker-independent, larger-population, continuous speech.
This study investigates whether the Ensemble Interval Histogram (EIH) representation is more resistant to telephone noise than the mel-cepstrum, as was shown in the previous studies, now that, for the first time, it is applied to the speaker identification task using a state-of-the-art Gaussian mixture model system
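Cepstral mean removal, used above for channel equalization, subtracts the long-term average of each cepstral coefficient so that a stationary telephone-channel offset cancels out. A minimal NumPy sketch of the idea (illustrative only; the function and variable names are not taken from the thesis):

```python
import numpy as np

def cepstral_mean_removal(cepstra):
    """Subtract each coefficient's mean over the utterance.

    A stationary convolutional channel (e.g. a telephone line) appears
    as a near-constant additive offset in the cepstral domain, so
    removing the per-utterance mean suppresses the channel while
    preserving speaker-dependent variation.
    cepstra: (num_frames, num_coeffs) feature matrix.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A fixed "channel" offset added to every frame is cancelled exactly.
clean = np.random.default_rng(0).normal(size=(100, 13))
channel = np.full(13, 2.5)                  # constant cepstral bias
equalized = cepstral_mean_removal(clean + channel)
print(np.allclose(equalized, clean - clean.mean(axis=0)))   # → True
```

Note that the channel term vanishes because subtracting the mean of `clean + channel` removes both the constant offset and the mean of the clean features.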
Audio processing on constrained devices
This thesis discusses the future of smart business applications on mobile phones
and the integration of voice interface across several business applications. It proposes
a framework that provides speech processing support for business applications
on mobile phones. The framework uses Gaussian Mixture Models (GMM)
for low-enrollment speaker recognition and limited vocabulary speech recognition.
Algorithms are presented for pre-processing of audio signals into different categories
and for start and end point detection. A method is proposed for speech processing
that uses Mel Frequency Cepstral Coefficients (MFCC) as the primary feature for extraction.
In addition, optimization schemes are developed to improve performance,
and overcome constraints of a mobile phone. Experimental results are presented
for some prototype applications that evaluate the performance of computationally
expensive algorithms on constrained hardware. The thesis concludes by discussing
the scope for improvement for the work done in this thesis and future directions in
which this work could possibly be extended
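The start- and end-point detection step mentioned above is commonly implemented as a short-time energy threshold. The sketch below is a generic baseline of that idea, not the thesis's specific algorithm; all names and parameter values are illustrative:

```python
import numpy as np

def detect_endpoints(signal, frame_len=256, threshold_ratio=0.1):
    """Locate the start/end sample of speech via short-time energy.

    Frame the signal, compute per-frame energy, and take the first and
    last frame whose energy exceeds a fraction of the peak energy.
    """
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    active = np.flatnonzero(energy > threshold_ratio * energy.max())
    return active[0] * frame_len, (active[-1] + 1) * frame_len

# 0.25 s of silence, a 440 Hz "utterance", then silence again (fs = 8 kHz).
fs = 8000
burst = np.sin(2 * np.pi * 440 * np.arange(3000) / fs)
sig = np.concatenate([np.zeros(2000), burst, np.zeros(2000)])
start, end = detect_endpoints(sig)
print(start, end)   # → 1792 5120 (frame-aligned bounds around samples 2000-4999)
```

The returned bounds are frame-aligned, so they bracket the true endpoints to within one frame; a constrained device would typically also smooth the energy contour before thresholding.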
Comparing speech recognition and touch tone as input modalities for technologically unsophisticated users
Using an automated service to access information via the telephone has become an important productivity enhancer in the developed world. However, such automated services are generally quite inaccessible to users who have had little technological exposure. There has been a widespread belief that speech-recognition technology can be used to bridge this gap, but little objective evidence for this belief has been produced. To address this situation, two interfaces, touch-tone and speech-based, were designed and implemented as input modalities to a system that provides technologically unsophisticated users with access to an informational/transactional service. These interfaces were optimised and compared using transaction completion rates, time taken to complete tasks, error rates and user satisfaction. The speech-based interface was found to outperform the touch-tone interface in terms of completion rate, error rate and user satisfaction. The data obtained on time taken to complete tasks could not be compared, as the DTMF interface data were strongly influenced by participants who were not, in fact, technologically unsophisticated. These results serve as confirmation that speech-based interfaces are more effective and more satisfying, and can therefore enhance information dissemination to people who have had little exposure to the technology. Dissertation (MSc), University of Pretoria, 2006.
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other speech processing systems able to operate in real-world environments, such as mobile communication services and smart homes
Robust text independent closed set speaker identification systems and their evaluation
PhD thesis. This thesis focuses upon text-independent closed-set speaker
identification. The contributions relate to evaluation studies in the
presence of various types of noise and handset effects. Extensive
evaluations are performed on four databases.
The first contribution is in the context of the use of the Gaussian
Mixture Model-Universal Background Model (GMM-UBM) with
original speech recordings from only the TIMIT database. Four main
simulations for Speaker Identification Accuracy (SIA) are presented,
including different fusion strategies: late fusion (score based), early
fusion (feature based), early-late fusion (a combination of feature and
score based), late fusion using concatenated static and dynamic
features (features with temporal derivatives such as first-order
derivative delta and second-order derivative delta-delta features,
namely acceleration features), and finally fusion of statistically
independent normalized scores.
The second contribution is again based on the GMM-UBM
approach. Comprehensive evaluations of the effect of Additive White
Gaussian Noise (AWGN) and Non-Stationary Noise (NSN) (with and
without a G.712 type handset) upon identification performance are
undertaken. In particular, three NSN types with varying Signal to
Noise Ratios (SNRs) were tested, corresponding to street traffic, a bus
interior and a crowded talking environment. The performance
evaluation also considered the effect of late fusion techniques based on
score fusion, namely mean, maximum, and linear weighted sum fusion.
The databases employed were TIMIT, SITW, and NIST 2008; 120
speakers were selected from each database to yield 3,600 speech
utterances.
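The late score-fusion rules compared above (mean, maximum, and linear weighted sum) can be sketched for a closed-set identification decision as follows. This is a generic illustration with made-up scores, not the thesis's implementation; names and values are mine:

```python
import numpy as np

def fuse_scores(scores_a, scores_b, method="mean", w=0.5):
    """Late (score-level) fusion of two systems' per-speaker scores.

    Scores are first z-normalized so the two systems are comparable;
    the fused score vector is then argmax'd to identify the speaker.
    Methods: "mean", "max", and "weighted" (linear weighted sum with
    weight w on system A).
    """
    a = (scores_a - scores_a.mean()) / scores_a.std()
    b = (scores_b - scores_b.mean()) / scores_b.std()
    if method == "mean":
        fused = (a + b) / 2
    elif method == "max":
        fused = np.maximum(a, b)
    elif method == "weighted":
        fused = w * a + (1 - w) * b
    else:
        raise ValueError(method)
    return int(np.argmax(fused))

# Two systems scoring four enrolled speakers for one test utterance:
sys_a = np.array([0.2, 0.9, 0.1, 0.3])   # system A favors speaker 1
sys_b = np.array([0.3, 0.7, 0.2, 0.1])   # system B agrees
print(fuse_scores(sys_a, sys_b, "mean"))             # → 1
print(fuse_scores(sys_a, sys_b, "weighted", w=0.7))  # → 1
```

When the systems disagree, the three rules can produce different identities, which is why the thesis evaluates them separately under each noise condition.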
The third contribution is based on the use of the I-vector; four
combinations of I-vectors with 100 and 200 dimensions were employed.
Then, various fusion techniques using maximum, mean, weighted sum
and cumulative fusion with the same I-vector dimension were used to
improve the SIA. Similarly, both interleaving and concatenated I-vector
fusion were exploited to produce 200 and 400 I-vector dimensions. The
system was evaluated with four different databases using 120 speakers
from each database. The TIMIT, SITW and NIST 2008 databases were
evaluated for various types of NSN, namely street-traffic NSN,
bus-interior NSN and crowd-talking NSN; and the G.712 type handset
at 16 kHz was also applied.
As recommendations from the study in terms of the GMM-UBM
approach, mean fusion is found to yield the overall best performance in
terms of the SIA with noisy speech, whereas linear weighted sum fusion
is overall best for original database recordings. However, in the I-vector
approach the best SIA was obtained from the weighted sum and the
concatenated fusion. The author thanks the Ministry of Higher
Education and Scientific Research (MoHESR), the Iraqi Cultural
Attaché, Al-Mustansiriya University, and Al-Mustansiriya University
College of Engineering in Iraq for supporting his PhD scholarship
Non-Intrusive Subscriber Authentication for Next Generation Mobile Communication Systems
The last decade has witnessed massive growth in both the technological development and
the consumer adoption of mobile devices such as mobile handsets and PDAs. The recent
introduction of wideband mobile networks has enabled the deployment of new services
with access to traditionally well protected personal data, such as banking details or
medical records. Secure user access to this data has however remained a function of the
mobile device's authentication system, which is only protected from masquerade abuse by
the traditional PIN, originally designed to protect against telephony abuse.
This thesis presents novel research in relation to advanced subscriber authentication for
mobile devices. The research began by assessing the threat of masquerade attacks on
such devices by way of a survey of end users. This revealed that the current methods of
mobile authentication remain extensively unused, leaving terminals highly vulnerable to
masquerade attack. Further investigation revealed that, in the context of the more
advanced wideband enabled services, users are receptive to many advanced
authentication techniques and principles, including the discipline of biometrics which
naturally lends itself to the area of advanced subscriber based authentication.
To address the requirement for a more personal authentication capable of being applied
in a continuous context, a novel non-intrusive biometric authentication technique was
conceived, drawn from the discrete disciplines of biometrics and Auditory Evoked
Responses. The technique forms a hybrid multi-modal biometric where variations in the
behavioural stimulus of the human voice (due to the propagation effects of acoustic
waves within the human head) are used to verify the identity of a user. The resulting
approach is known as the Head Authentication Technique (HAT).
Evaluation of the HAT authentication process is realised in two stages. Firstly, the
generic authentication procedures of registration and verification are automated within a
prototype implementation. Secondly, a HAT demonstrator is used to evaluate the
authentication process through a series of experimental trials involving a representative
user community. The results from the trials confirm that multiple HAT samples from
the same user exhibit a high degree of correlation, yet samples between users exhibit a
high degree of discrepancy. Statistical analysis of the prototype's performance realised
early system error rates of FNMR = 6% and FMR = 0.025%. The results clearly
demonstrate the authentication capabilities of this novel biometric approach and the
contribution this new work can make to the protection of subscriber data in next
generation mobile networks.
Orange Personal Communication Services Lt
New authentication applications in the protection of caller ID and banknote
In the era of computers and the Internet, where almost everything is interconnected, authentication plays a crucial role in safeguarding online and offline data. As authentication systems are continually tested by advanced attack techniques and tools, the need for evolving authentication technology becomes imperative. In this thesis, we study attacks on authentication systems and propose countermeasures. Considering the various candidate techniques, the thesis is divided into two parts.
The first part introduces the caller ID verification (CIV) protocol to address caller ID spoofing in telecommunication systems. This kind of attack is usually followed by fraud, which not only inflicts financial losses on victims but also reduces public trust in the telephone system. We propose CIV to authenticate the caller ID based on a challenge-response process. We show that spoofing can be leveraged, in conjunction with dual-tone multi-frequency (DTMF) signalling, to efficiently implement the challenge-response process, i.e., using spoofing to fight against spoofing. We conduct extensive experiments showing that our solution works reliably across legacy and new telephony systems, including landline, cellular and Internet protocol (IP) networks, without the cooperation of telecom providers.
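The DTMF signalling that the challenge-response process relies on can be illustrated with standard tone generation and Goertzel detection. This is a generic sketch of DTMF encoding/decoding only, not the CIV protocol itself; the digit subset and all names are illustrative:

```python
import numpy as np

# Standard DTMF row/column frequencies (ITU-T Q.23); only a few digits
# are mapped here to keep the sketch short.
DTMF = {"1": (697, 1209), "2": (697, 1336), "5": (770, 1336), "9": (852, 1477)}
ROWS, COLS = (697, 770, 852, 941), (1209, 1336, 1477, 1633)
FS, N = 8000, 205          # classic sample rate and Goertzel block size

def tone(digit):
    """Synthesize one DTMF digit as the sum of its row and column tones."""
    t = np.arange(N) / FS
    lo, hi = DTMF[digit]
    return np.sin(2 * np.pi * lo * t) + np.sin(2 * np.pi * hi * t)

def goertzel(x, f):
    """Signal power near frequency f via the Goertzel recurrence."""
    coeff = 2 * np.cos(2 * np.pi * round(len(x) * f / FS) / len(x))
    s1 = s2 = 0.0
    for v in x:
        s1, s2 = v + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def decode(x):
    """Pick the strongest row and column frequency, then map to a digit."""
    pair = (max(ROWS, key=lambda f: goertzel(x, f)),
            max(COLS, key=lambda f: goertzel(x, f)))
    return next(d for d, p in DTMF.items() if p == pair)

challenge = "529"          # hypothetical challenge digits
print(all(decode(tone(d)) == d for d in challenge))   # → True
```

In a challenge-response setting, one party would transmit such digits in-band and verify that the peer echoes the expected sequence back.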
In the second part, we present polymer substrate fingerprinting (PSF) as a method to combat counterfeiting of banknotes in the financial area. Our technique is built on the observation that the opacity coating leaves uneven thickness in the polymer substrate, resulting in random translucent patterns when a polymer banknote is back-lit by a light source. With extensive experiments, we show that our method can reliably authenticate banknotes and is robust against rough daily handling of banknotes. Furthermore, we show that the extracted fingerprints are scalable enough to identify every polymer note in global circulation. Our method ensures that even when counterfeiters have procured the same printing equipment and ink as used by a legitimate government, counterfeiting banknotes remains infeasible
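Matching such physical fingerprints typically reduces to comparing binary codes with a fractional Hamming distance: repeated captures of the same note disagree on few bits, while different notes disagree on about half. The sketch below uses a toy random-projection stand-in for the actual PSF feature extraction, which the abstract does not specify; every name and threshold here is illustrative:

```python
import numpy as np

def extract_fingerprint(translucency, bits=256, seed=0):
    """Toy stand-in for fingerprint extraction: project the back-lit
    translucency map onto fixed random filters and binarize the
    responses. The real method's features differ; this only
    illustrates binary-fingerprint matching."""
    rng = np.random.default_rng(seed)          # same filters for every note
    filters = rng.normal(size=(bits, translucency.size))
    return (filters @ translucency.ravel() > 0).astype(np.uint8)

def fractional_hamming(f1, f2):
    """Fraction of disagreeing bits: near 0 for the same note, ~0.5 otherwise."""
    return float(np.mean(f1 != f2))

rng = np.random.default_rng(1)
note = rng.normal(size=(32, 32))                    # one note's pattern
worn = note + 0.05 * rng.normal(size=(32, 32))      # rough daily handling
other = rng.normal(size=(32, 32))                   # a different note

fp = extract_fingerprint(note)
print(fractional_hamming(fp, extract_fingerprint(worn)) < 0.2)              # → True
print(abs(fractional_hamming(fp, extract_fingerprint(other)) - 0.5) < 0.2)  # → True
```

The wide gap between the two distance distributions is what makes a single accept/reject threshold workable at global scale.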
Robust visual speech recognition using optical flow analysis and rotation invariant features
The focus of this thesis is to develop computer vision algorithms for a visual speech recognition system that identifies visemes. The majority of existing speech recognition systems are based on audio-visual signals, have been developed for speech enhancement, and are prone to acoustic noise. Considering this problem, the aim of this research is to investigate and develop a visual-only speech recognition system suitable for noisy environments. Potential applications of such a system include lip-reading mobile phones, human-computer interfaces (HCI) for mobility-impaired users, robotics, surveillance, improvement of speech-based computer control in noisy environments, and the rehabilitation of persons who have undergone laryngectomy surgery. In the literature, several models and algorithms are available for visual feature extraction. These features are extracted from static mouth images and characterized as appearance- and shape-based features. However, these methods rarely incorporate the time-dependent information of mouth dynamics. This dissertation presents two optical flow based approaches to visual feature extraction, which capture the mouth motions in an image sequence. The motivation for using motion features is that the human perception of lip-reading is concerned with the temporal dynamics of mouth motion. The first approach extracts features from the vertical component of the optical flow. The vertical component is decomposed into multiple non-overlapping fixed-scale blocks, and statistical features of each block are computed for successive video frames of an utterance. To overcome the issue of large variation in the speed of speech, each utterance is normalized using a simple linear interpolation method.
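The block decomposition of the optical-flow vertical component described above can be sketched as follows. This is a minimal illustration of per-block statistics (mean and variance are assumed as the statistical features, since the abstract does not name them); the function and variable names are mine:

```python
import numpy as np

def block_flow_features(v_flow, block=16):
    """Per-block statistics of the optical-flow vertical component.

    v_flow: (H, W) vertical flow field for one frame pair. The field is
    tiled into non-overlapping block x block regions, and each region
    contributes its mean and variance to the frame's feature vector.
    H and W are assumed divisible by block.
    """
    h, w = v_flow.shape
    tiles = v_flow.reshape(h // block, block, w // block, block)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(-1, block * block)
    feats = np.stack([tiles.mean(axis=1), tiles.var(axis=1)], axis=1)
    return feats.ravel()   # (num_blocks * 2,) features per frame

# A 64x64 flow field yields (64/16)^2 = 16 blocks -> 32 features.
v = np.random.default_rng(2).normal(size=(64, 64))
print(block_flow_features(v).shape)   # → (32,)
```

Stacking these per-frame vectors over the length-normalized utterance gives the sequence representation that is then fed to the classifier.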
In the second approach, four directional motion history images (DMHIs) based on optical flow are developed, each representing the consolidated motion information of an utterance in one of four directions (up, down, left and right). This approach is an evolution of a view-based approach known as the motion history image (MHI). One of the main issues with the MHI method is its motion overwriting problem caused by self-occlusion; DMHIs resolve this overwriting issue. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each image of the DMHIs. A support vector machine (SVM) classifier was used to classify the features obtained from the optical flow vertical component, and from the Zernike and Hu moments, separately. For identification of visemes, a multiclass SVM approach was employed. A video speech corpus of seven subjects was used to evaluate the efficiency of the proposed methods for lip-reading. The experimental results demonstrate the promising performance of the optical flow based mouth movement representations. A performance comparison between DMHI and MHI based on Zernike moments shows that the DMHI technique outperforms the MHI technique. A video-based ad hoc temporal segmentation method is also proposed in the thesis for isolated utterances. It is used to detect the start and end frames of an utterance in an image sequence. The technique is based on a pair-wise pixel comparison method. The efficiency of the proposed technique was tested on the available data set with short pauses between each utterance