52 research outputs found

    An intelligent multimodal interface for in-car communication systems

    In-car communication systems (ICCS) are increasingly used by drivers. ICCS are intended to minimise the driving distraction caused by using a mobile phone while driving. Several usability studies of ICCS with speech user interfaces (SUIs) have identified usability issues that can affect the workload, performance, satisfaction and user experience of the driver, because current speech technologies can be a source of errors that frustrate the driver and degrade the user experience. The aim of this research was to design a new multimodal interface to manage the interaction between an ICCS and the driver. Unlike current ICCS, it should make more voice input available, so as to support tasks (e.g. sending text messages or browsing the phone book) that still impose a cognitive workload on the driver. An adaptive multimodal interface was proposed to address current ICCS issues. The multimodal interface used both speech and manual input; however, only the speech channel is used as output, in order to minimise the visual distraction that graphical user interfaces or haptic devices can cause with current ICCS. The adaptive interface was designed to minimise the cognitive distraction of the driver: whenever the driver's distraction level is high, any information communication is postponed. After the design and implementation of the first version of the prototype interface, called MIMI, a usability evaluation was conducted to identify any possible usability issues. Although voice dialling was found to be problematic, the results were encouraging in terms of performance, workload and user satisfaction. The suggestions received from the participants to improve the system's usability were incorporated in the next implementation of MIMI. The adaptive module was then implemented to reduce driver distraction based on the driver's current context. The proposed architecture showed encouraging results in terms of usability and safety. The adaptive behaviour of MIMI significantly contributed to the reduction of cognitive distraction, because drivers received less information during difficult driving situations.
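
    The postponement rule described above is simple to state. A minimal sketch of how such a policy might look in code (the class name, the 0-to-1 distraction scale and the threshold value are illustrative assumptions, not details taken from the thesis):

```python
from collections import deque

class AdaptiveMessageQueue:
    """Toy sketch of a MIMI-style adaptive output policy: defer
    spoken messages while estimated driver distraction is high."""

    def __init__(self, distraction_threshold=0.7):
        self.threshold = distraction_threshold  # assumed 0..1 scale
        self.pending = deque()

    def submit(self, message, distraction_level):
        # Postpone delivery whenever the driver's estimated
        # distraction exceeds the threshold.
        if distraction_level >= self.threshold:
            self.pending.append(message)
            return None          # nothing spoken right now
        return message           # deliver immediately via TTS

    def flush(self, distraction_level):
        # Once the driving situation eases, release queued messages.
        delivered = []
        while self.pending and distraction_level < self.threshold:
            delivered.append(self.pending.popleft())
        return delivered

# Example: a message arriving during a difficult manoeuvre is held back.
q = AdaptiveMessageQueue()
assert q.submit("New text from Alice", distraction_level=0.9) is None
assert q.flush(distraction_level=0.2) == ["New text from Alice"]
```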

    A comparison of features for large population speaker identification

    Speech recognition systems all have one criterion in common: they perform better in a controlled environment using clean speech. Though performance can be excellent, even exceeding human capabilities for clean speech, systems fail when presented with speech data from more realistic environments such as telephone channels. The differences between using a recognizer in clean and noisy environments are extreme, and this is one of the major obstacles to producing commercial recognition systems for use in normal environments. It is this lack of performance of speaker recognition systems over telephone channels that this work addresses. The human auditory system is a speech recognizer with excellent performance, especially in noisy environments. Since humans are better at ignoring noise than any machine, auditory-based methods are promising approaches, as they attempt to model the workings of the human auditory system. These methods have been shown to outperform more conventional signal processing schemes for speech recognition, speech coding, word recognition and phone classification tasks. Since speaker identification has received a lot of attention in speech processing because of the real-world applications awaiting it, it is attractive to evaluate its performance using auditory models as features. Firstly, this study aims at improving the results for speaker identification. The improvements were made through the use of parameterized feature sets together with the application of cepstral mean removal for channel equalization. The study is further extended to compare an auditory-based model, the Ensemble Interval Histogram (EIH), with mel-scale features, which were shown to perform almost error-free on clean speech. The previous studies showing the EIH to be more robust to noise were conducted on speaker-dependent, small-population, isolated-word tasks, and are now extended to speaker-independent, larger-population, continuous speech. This study investigates whether the EIH representation is more resistant to telephone noise than the mel-cepstrum, as was shown in the previous studies, when, for the first time, it is applied to the speaker identification task using a state-of-the-art Gaussian mixture model system.
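
    Cepstral mean removal, used here for channel equalization, amounts to subtracting the per-utterance mean in the cepstral domain, where a stationary convolutional channel appears as a roughly constant offset. A minimal sketch (array shapes and names are assumptions):

```python
import numpy as np

def cepstral_mean_removal(cepstra):
    """Channel equalization by cepstral mean subtraction.

    cepstra: (num_frames, num_coeffs) array of cepstral features.
    A stationary convolutional channel (e.g. a telephone line) adds a
    roughly constant offset in the cepstral domain, so subtracting the
    per-utterance mean removes much of the channel effect.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Example with random features standing in for MFCCs of one utterance.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 13)) + 5.0   # constant "channel" offset
equalized = cepstral_mean_removal(feats)
print(np.allclose(equalized.mean(axis=0), 0.0))  # True
```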

    Audio processing on constrained devices

    This thesis discusses the future of smart business applications on mobile phones and the integration of a voice interface across several business applications. It proposes a framework that provides speech processing support for business applications on mobile phones. The framework uses Gaussian Mixture Models (GMM) for low-enrollment speaker recognition and limited-vocabulary speech recognition. Algorithms are presented for pre-processing of audio signals into different categories and for start and end point detection. A method is proposed for speech processing that uses Mel Frequency Cepstral Coefficients (MFCC) as the primary features. In addition, optimization schemes are developed to improve performance and overcome the constraints of a mobile phone. Experimental results are presented for some prototype applications that evaluate the performance of computationally expensive algorithms on constrained hardware. The thesis concludes by discussing the scope for improvement of the work done in this thesis and future directions in which this work could be extended.
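
    Start and end point detection of the kind mentioned above is commonly done with frame energies. A minimal energy-based sketch, not the thesis's actual algorithm (frame sizes and the threshold are illustrative):

```python
import numpy as np

def detect_endpoints(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Naive energy-based start/end point detection.

    Splits the signal into frames, computes log energy per frame, and
    returns the first and last sample of the region whose frame energy
    clears a threshold relative to the utterance peak. Real systems add
    smoothing and zero-crossing checks; this is only the core idea.
    """
    n_frames = max(0, (len(signal) - frame_len) // hop + 1)
    energy = np.array([
        np.sum(signal[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    log_e = 10.0 * np.log10(energy + 1e-12)
    active = np.flatnonzero(log_e > log_e.max() + threshold_db)
    if active.size == 0:
        return None  # no speech found
    return active[0] * hop, active[-1] * hop + frame_len

# Example: silence around a burst of "speech".
sig = np.concatenate([np.zeros(8000),
                      np.random.default_rng(1).normal(size=4000),
                      np.zeros(8000)])
print(detect_endpoints(sig))
```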

    Progress in Speech Recognition for Romanian Language


    Comparing speech recognition and touch tone as input modalities for technologically unsophisticated users

    Using an automated service to access information via the telephone has become an important productivity enhancer in the developed world. However, such automated services are generally quite inaccessible to users who have had little technological exposure. There has been a widespread belief that speech-recognition technology can be used to bridge this gap, but little objective evidence for this belief has been produced. To address this situation, two interfaces, touchtone and speech-based, were designed and implemented as input modalities to a system that provides technologically unsophisticated users with access to an informational/transactional service. These interfaces were optimised and compared using transaction completion rates, time taken to complete tasks, error rates and user satisfaction. The speech-based interface was found to outperform the touchtone interface in terms of completion rate, error rate and user satisfaction. The data obtained on time taken to complete tasks could not be compared, as the DTMF interface data were heavily influenced by users who were not technologically unsophisticated. These results serve as confirmation that speech-based interfaces are more effective and more satisfying, and can therefore enhance information dissemination to people who are not well exposed to the technology. Dissertation (MSc), University of Pretoria, 2006.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.
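
    The search over the hypothesis space that such chapters cover is, at its core, dynamic programming. As a generic illustration rather than any chapter's specific algorithm, a minimal Viterbi decoder over an HMM in the log domain:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Best state path through an HMM: the core of hypothesis search.

    log_A:  (S, S) log transition probabilities
    log_B:  (T, S) log emission scores per frame and state
    log_pi: (S,)   log initial state probabilities
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A   # (S, S): previous -> next state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):         # trace best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta.max())

# Tiny 2-state example with 3 frames of emission scores.
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]])
print(viterbi(log_A, log_B, log_pi))
```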

    Robust text independent closed set speaker identification systems and their evaluation

    This thesis focuses upon text-independent closed-set speaker identification. The contributions relate to evaluation studies in the presence of various types of noise and handset effects. Extensive evaluations are performed on four databases. The first contribution is in the context of the use of the Gaussian Mixture Model-Universal Background Model (GMM-UBM) with original speech recordings from only the TIMIT database. Four main simulations for Speaker Identification Accuracy (SIA) are presented, including different fusion strategies: late fusion (score based), early fusion (feature based), early-late fusion (combination of feature and score based), late fusion using concatenated static and dynamic features (features with temporal derivatives such as first-order derivative delta and second-order derivative delta-delta features, namely acceleration features), and finally fusion of statistically independent normalized scores. The second contribution is again based on the GMM-UBM approach. Comprehensive evaluations of the effect of Additive White Gaussian Noise (AWGN) and Non-Stationary Noise (NSN) (with and without a G.712 type handset) upon identification performance are undertaken. In particular, three NSN types with varying Signal to Noise Ratios (SNRs) were tested, corresponding to street traffic, a bus interior and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008; 120 speakers were selected from each database to yield 3,600 speech utterances. The third contribution is based on the use of the I-vector; four combinations of I-vectors with 100 and 200 dimensions were employed. Then, various fusion techniques using maximum, mean, weighted sum and cumulative fusion with the same I-vector dimension were used to improve the SIA. Similarly, both interleaving and concatenated I-vector fusion were exploited to produce 200 and 400 I-vector dimensions. The system was evaluated with four different databases using 120 speakers from each database. The TIMIT, SITW and NIST 2008 databases were evaluated for various types of NSN, namely street-traffic NSN, bus-interior NSN and crowd-talking NSN; the G.712 type handset at 16 kHz was also applied. As recommendations from the study in terms of the GMM-UBM approach, mean fusion is found to yield the overall best performance in terms of the SIA with noisy speech, whereas linear weighted sum fusion is overall best for original database recordings. However, in the I-vector approach the best SIA was obtained from the weighted sum and the concatenated fusion. The author acknowledges the Ministry of Higher Education and Scientific Research (MoHESR), the Iraqi Cultural Attaché, and Al-Mustansiriya University College of Engineering in Iraq for supporting the PhD scholarship.
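
    The late fusion rules compared in the study (mean, maximum and linear weighted sum of subsystem scores) are straightforward to state. A minimal sketch, assuming scores have already been normalized to a comparable range across subsystems:

```python
import numpy as np

def fuse_scores(score_matrix, method="mean", weights=None):
    """Late (score-level) fusion across subsystems.

    score_matrix: (num_systems, num_speakers) matched scores per
    enrolled speaker, assumed normalized across systems.
    """
    if method == "mean":
        return score_matrix.mean(axis=0)
    if method == "max":
        return score_matrix.max(axis=0)
    if method == "weighted":
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                    # normalize the weights
        return w @ score_matrix
    raise ValueError(f"unknown fusion method: {method}")

# Two subsystems scoring three enrolled speakers for one test utterance.
scores = np.array([[1.2, 0.4, 0.9],
                   [0.8, 0.6, 1.1]])
fused = fuse_scores(scores, "weighted", weights=[0.7, 0.3])
identified = int(fused.argmax())           # index of the chosen speaker
```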

    Non-Intrusive Subscriber Authentication for Next Generation Mobile Communication Systems

    The last decade has witnessed massive growth in both the technological development and the consumer adoption of mobile devices such as mobile handsets and PDAs. The recent introduction of wideband mobile networks has enabled the deployment of new services with access to traditionally well protected personal data, such as banking details or medical records. Secure user access to this data has, however, remained a function of the mobile device's authentication system, which is only protected from masquerade abuse by the traditional PIN, originally designed to protect against telephony abuse. This thesis presents novel research in relation to advanced subscriber authentication for mobile devices. The research began by assessing the threat of masquerade attacks on such devices by way of a survey of end users. This revealed that the current methods of mobile authentication remain extensively unused, leaving terminals highly vulnerable to masquerade attack. Further investigation revealed that, in the context of the more advanced wideband-enabled services, users are receptive to many advanced authentication techniques and principles, including the discipline of biometrics, which naturally lends itself to the area of advanced subscriber-based authentication. To address the requirement for a more personal authentication capable of being applied in a continuous context, a novel non-intrusive biometric authentication technique was conceived, drawn from the discrete disciplines of biometrics and Auditory Evoked Responses. The technique forms a hybrid multi-modal biometric where variations in the behavioural stimulus of the human voice (due to the propagation effects of acoustic waves within the human head) are used to verify the identity of a user. The resulting approach is known as the Head Authentication Technique (HAT). Evaluation of the HAT authentication process is realised in two stages. Firstly, the generic authentication procedures of registration and verification are automated within a prototype implementation. Secondly, a HAT demonstrator is used to evaluate the authentication process through a series of experimental trials involving a representative user community. The results from the trials confirm that multiple HAT samples from the same user exhibit a high degree of correlation, yet samples between users exhibit a high degree of discrepancy. Statistical analysis of the prototype's performance realised early system error rates of FNMR = 6% and FMR = 0.025%. The results clearly demonstrate the authentication capabilities of this novel biometric approach and the contribution this new work can make to the protection of subscriber data in next generation mobile networks. This work was supported by Orange Personal Communication Services Ltd.
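
    The quoted FNMR and FMR figures are standard threshold-based error rates computed from genuine (same-user) and impostor (different-user) comparison scores. A minimal sketch of the computation, with illustrative numbers and an assumed higher-score-is-better convention:

```python
import numpy as np

def error_rates(genuine_scores, impostor_scores, threshold):
    """FNMR and FMR at a fixed decision threshold.

    genuine_scores:  similarity scores from same-user comparisons
    impostor_scores: similarity scores from different-user comparisons
    Higher score = stronger match (an assumption of this sketch).
    """
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    fnmr = np.mean(genuine < threshold)    # genuine users rejected
    fmr = np.mean(impostor >= threshold)   # impostors accepted
    return fnmr, fmr

# Toy example: well-separated score distributions.
fnmr, fmr = error_rates([0.9, 0.85, 0.7, 0.4], [0.1, 0.2, 0.35], 0.5)
print(f"FNMR={fnmr:.0%}, FMR={fmr:.0%}")   # FNMR=25%, FMR=0%
```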

    New authentication applications in the protection of caller ID and banknote

    In the era of computers and the Internet, where almost everything is interconnected, authentication plays a crucial role in safeguarding online and offline data. As authentication systems face continuous testing from advanced attacking techniques and tools, the need for evolving authentication technology becomes imperative. In this thesis, we study attacks on authentication systems and propose countermeasures. Considering the various nominated techniques, the thesis is divided into two parts. The first part introduces a caller ID verification (CIV) protocol to address caller ID spoofing in telecommunication systems. This kind of attack is usually followed by fraud, which not only inflicts financial losses on victims but also erodes public trust in the telephone system. We propose CIV to authenticate the caller ID based on a challenge-response process. We show that spoofing can be leveraged, in conjunction with dual-tone multi-frequency (DTMF) signalling, to efficiently implement the challenge-response process, i.e., using spoofing to fight against spoofing. We conduct extensive experiments showing that our solution can work reliably across legacy and new telephony systems, including landline, cellular and Internet protocol (IP) networks, without the cooperation of telecom providers. In the second part, we present polymer substrate fingerprinting (PSF) as a method to combat counterfeiting of banknotes in the financial area. Our technique is built on the observation that the opacity coating leaves uneven thickness in the polymer substrate, resulting in random translucent patterns when a polymer banknote is back-lit by a light source. With extensive experiments, we show that our method can reliably authenticate banknotes and is robust against rough daily handling of banknotes. Furthermore, we show that the extracted fingerprints scale to identifying every polymer note circulated globally. Our method ensures that even when counterfeiters have procured the same printing equipment and ink as used by a legitimate government, counterfeiting banknotes remains infeasible.
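
    The abstract does not spell out the challenge-response exchange, so the following is only a hypothetical sketch of how a CIV-style exchange over DTMF digits might be structured; the key setup, digit lengths and HMAC construction are all assumptions, not the thesis's actual protocol:

```python
import hmac
import hashlib
import secrets

def make_challenge(num_digits=6):
    """Callee generates a random challenge, sendable as DTMF tones."""
    return "".join(secrets.choice("0123456789") for _ in range(num_digits))

def dtmf_response(shared_key: bytes, challenge: str, num_digits=6):
    """Claimed caller proves knowledge of the key by returning a keyed
    digest of the challenge, truncated to DTMF-encodable digits."""
    digest = hmac.new(shared_key, challenge.encode(), hashlib.sha256)
    return str(int(digest.hexdigest(), 16))[-num_digits:]

# Hypothetical key provisioned between the two endpoints in advance.
key = b"provisioned-between-endpoints"
challenge = make_challenge()
response = dtmf_response(key, challenge)
# Callee verifies by recomputing the expected response.
assert hmac.compare_digest(response, dtmf_response(key, challenge))
```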

    Robust visual speech recognition using optical flow analysis and rotation invariant features

    The focus of this thesis is to develop computer vision algorithms for a visual speech recognition system to identify visemes. The majority of existing speech recognition systems are based on audio-visual signals, have been developed for speech enhancement, and remain prone to acoustic noise. Considering this problem, the aim of this research is to investigate and develop a visual-only speech recognition system suitable for noisy environments. Potential applications of such a system include lip-reading mobile phones, human-computer interfaces (HCI) for mobility-impaired users, robotics, surveillance, improvement of speech-based computer control in noisy environments, and rehabilitation of people who have undergone laryngectomy surgery. In the literature, there are several models and algorithms available for visual feature extraction. These features are extracted from static mouth images and characterized as appearance- and shape-based features. However, these methods rarely incorporate the time-dependent information of mouth dynamics. This dissertation presents two optical flow based approaches to visual feature extraction, which capture the mouth motions in an image sequence. The motivation for using motion features is that human perception of lip-reading is concerned with the temporal dynamics of mouth motion. The first approach is based on the extraction of features from the vertical component of the optical flow. The vertical component is decomposed into multiple non-overlapping fixed-scale blocks, and statistical features of each block are computed for successive video frames of an utterance. To overcome the issue of large variation in the speed of speech, each utterance is normalized using a simple linear interpolation method. In the second approach, four directional motion templates based on optical flow are developed, each representing the consolidated motion information of an utterance in one of four directions (i.e., up, down, left and right). This approach is an evolution of a view-based approach known as the motion history image (MHI). One of the main issues with the MHI method is its motion overwriting problem caused by self-occlusion; the directional motion history images (DMHIs) appear to solve this overwriting issue. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each image of the DMHIs. A support vector machine (SVM) classifier was used to classify the features obtained from the optical flow vertical component, and the Zernike and Hu moments, separately. For identification of visemes, a multiclass SVM approach was employed. A video speech corpus of seven subjects was used for evaluating the efficiency of the proposed methods for lip-reading. The experimental results demonstrate the promising performance of the optical flow based mouth movement representations. A performance comparison between DMHI and MHI based on Zernike moments shows that the DMHI technique outperforms the MHI technique. A video-based ad hoc temporal segmentation method is also proposed in the thesis for isolated utterances; it is used to detect the start and end frames of an utterance from an image sequence, based on a pair-wise pixel comparison method. The efficiency of the proposed technique was tested on the available data set with short pauses between each utterance.
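
    The first approach's block-wise features and the speed normalization are easy to sketch. A minimal illustration (the block grid, the choice of mean and variance as the statistics, and the interpolation target length are assumptions, since the abstract only says "statistical features of each block"):

```python
import numpy as np

def block_flow_features(flow_v, blocks=(4, 4)):
    """Statistical features from the vertical optical-flow component
    of one frame, split into non-overlapping fixed-scale blocks.

    flow_v: (H, W) vertical flow field for a mouth region of interest.
    Returns mean and variance per block, flattened to one vector.
    """
    H, W = flow_v.shape
    bh, bw = H // blocks[0], W // blocks[1]
    feats = []
    for r in range(blocks[0]):
        for c in range(blocks[1]):
            blk = flow_v[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            feats.extend([blk.mean(), blk.var()])
    return np.array(feats)

def normalize_length(sequence, target_len):
    """Linear interpolation to a fixed number of frames, compensating
    for variation in speaking rate across utterances."""
    seq = np.asarray(sequence)            # (num_frames, feat_dim)
    src = np.linspace(0.0, 1.0, len(seq))
    dst = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(dst, src, seq[:, d])
                     for d in range(seq.shape[1])], axis=1)

# One frame's vertical flow -> 4x4 blocks -> 32 features per frame.
flow = np.random.default_rng(2).normal(size=(64, 96))
vec = block_flow_features(flow)
seq = np.stack([vec] * 30)                # a 30-frame utterance
fixed = normalize_length(seq, target_len=25)
```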