    Generative Adversarial Network with Convolutional Wavelet Packet Transforms for Automated Speaker Recognition and Classification

    Speech is an effective mode of communication that always conveys abundant and pertinent information, such as the gender, accent, and other distinguishing characteristics of the speaker. These distinctive characteristics allow researchers to identify human voices using artificial intelligence (AI) techniques, which are useful for forensic voice verification, security and surveillance, electronic voice eavesdropping, mobile banking, and mobile purchasing. Deep learning (DL) and other advances in hardware have piqued the interest of researchers studying automatic speaker identification (SI). In recent years, Generative Adversarial Networks (GANs) have demonstrated exceptional ability in producing synthetic data and improving the performance of several machine learning tasks. The capabilities of the Convolutional Wavelet Packet Transform (CWPT) and Generative Adversarial Networks are combined in this paper to propose a novel way of enhancing the accuracy and robustness of Speaker Recognition and Classification systems. Audio signals are dissected using the Convolutional Wavelet Packet Transform into a multi-resolution, time-frequency representation that faithfully preserves local and global characteristics. The improved audio features more precisely describe speech traits and handle the pitch, tone, and pronunciation variations that are frequent in speaker recognition tasks. Using GANs to create synthetic speech samples, our suggested method, GAN-CWPT, enriches the training data and broadens the dataset's diversity. The generator and discriminator components of the GAN architecture have been tweaked to produce realistic speech samples with attributes quite similar to genuine speaker utterances. The new dataset enhances the Speaker Recognition and Classification system's robustness and generalization, even in environments with little training data. We conduct extensive tests on standard speaker recognition datasets to determine how well our method works. The findings demonstrate that, compared to conventional methods, the GAN-CWPT combination significantly improves speaker recognition, classification accuracy, and efficiency. Additionally, the suggested model GAN-CWPT exhibits stronger generalization on unknown speakers and excels even with noisy and low-quality audio inputs.
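
    The abstract does not specify the CWPT implementation, so what follows is a minimal sketch of the wavelet-packet feature-extraction step using PyWavelets; the function name wavelet_packet_features, the db4 wavelet, and the decomposition level are all illustrative assumptions, and the paper's GAN component is not reproduced here.

    # Hypothetical sketch: wavelet-packet features for speaker recognition.
    # Requires PyWavelets (pip install PyWavelets) and NumPy; the paper's
    # exact CWPT and GAN configuration are not given in the abstract.
    import numpy as np
    import pywt

    def wavelet_packet_features(signal, wavelet="db4", level=3):
        """Split a 1-D audio frame into 2**level frequency sub-bands and
        return the log-energy of each sub-band as a feature vector."""
        wp = pywt.WaveletPacket(data=signal, wavelet=wavelet,
                                mode="symmetric", maxlevel=level)
        nodes = wp.get_level(level, order="freq")  # sub-bands, low to high
        energies = np.array([np.sum(np.square(n.data)) for n in nodes])
        return np.log(energies + 1e-12)            # log scale is more robust

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        frame = rng.standard_normal(16000)   # stand-in for 1 s of 16 kHz audio
        print(wavelet_packet_features(frame))  # 8 features for level=3

    Features of this kind could then augment or condition GAN training data; the time-frequency localization is what lets the representation keep both local (pitch-period) and global (prosodic) structure.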

    Effective Detection of Parkinson’s Disease at Different Stages using Measurements of Dysphonia

    This paper addresses the problem of multiclass classification of Parkinson's disease (PD) from the characteristic features of a person's voice. We computed 22 dysphonia measures from 375 voice samples of healthy subjects and people suffering from PD. We used the particle swarm optimization (PSO) feature selection method with random forest and linear discriminant analysis (LDA) classifiers, along with 4-fold cross-validation, to classify the subjects into four classes according to the severity of symptoms, achieving a classification accuracy of 95.2%. Promisingly, the proposed diagnosis system might serve as a powerful tool for diagnosing PD, and could also be extended to other voice pathologies.
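
    A minimal sketch of the pipeline the abstract describes, in scikit-learn; PSO-based feature selection is swapped for a simple univariate SelectKBest here for brevity, and the synthetic data merely mimics the stated shape (375 samples, 22 measures, 4 classes), so the printed accuracies are not the paper's.

    # Hypothetical sketch: feature selection -> classifier -> 4-fold CV.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Stand-in data shaped like the study: 375 samples, 22 dysphonia
    # measures, 4 severity classes (not the real dataset).
    X, y = make_classification(n_samples=375, n_features=22,
                               n_informative=10, n_classes=4, random_state=0)

    for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                      ("LDA", LinearDiscriminantAnalysis())]:
        pipe = make_pipeline(SelectKBest(f_classif, k=10), clf)
        scores = cross_val_score(pipe, X, y, cv=4)  # 4-fold CV, as in the paper
        print(f"{name}: mean accuracy = {scores.mean():.3f}")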

    The individual and the system: Assessing the stability of the output of a semi-automatic forensic voice comparison system

    Semi-automatic systems based on traditional linguistic-phonetic features are increasingly being used for forensic voice comparison (FVC) casework. In this paper, we examine the stability of the output of a semi-automatic system, based on the long-term formant distributions (LTFDs) of F1, F2, and F3, as the channel quality of the input recordings decreases. Cross-validated, calibrated GMM-UBM log likelihood-ratios (LLRs) were computed for 97 Standard Southern British English speakers under four conditions. In each condition the same speech material was used, but the technical properties of the recordings changed (high-quality studio recording, landline telephone recording, high bit-rate GSM mobile telephone recording, and low bit-rate GSM mobile telephone recording). Equal error rate (EER) and the log-LR cost function (Cllr) were compared across conditions. System validity was found to decrease with poorer technical quality, with the largest differences in EER (21.66%) and Cllr (0.46) found between the studio and the low bit-rate GSM conditions. However, importantly, performance for individual speakers was affected differently by channel quality. Speakers that produced stronger evidence overall were found to be more variable. Mean F3 was also found to be a predictor of LLR variability; however, no effects were found based on speakers' voice quality profiles.
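
    The two validity metrics named in the abstract, EER and Cllr, are standard and straightforward to compute from same-speaker and different-speaker score sets. A minimal sketch follows, assuming scores are natural-log likelihood ratios; the toy score distributions are illustrative, not the paper's data.

    # Hypothetical sketch: EER and Cllr from same-/different-speaker LLRs.
    import numpy as np

    def cllr(llr_ss, llr_ds):
        """Log-LR cost: 0 is perfect, lower is better.  log2(1 + e**x) is
        computed as logaddexp(0, x) / ln 2 for numerical stability."""
        ss = np.logaddexp(0, -llr_ss).mean() / np.log(2)  # same-speaker penalty
        ds = np.logaddexp(0, llr_ds).mean() / np.log(2)   # diff-speaker penalty
        return 0.5 * (ss + ds)

    def eer(llr_ss, llr_ds):
        """Equal error rate: point where miss and false-alarm rates cross."""
        thresholds = np.sort(np.concatenate([llr_ss, llr_ds]))
        miss = np.array([(llr_ss < t).mean() for t in thresholds])
        fa = np.array([(llr_ds >= t).mean() for t in thresholds])
        i = np.argmin(np.abs(miss - fa))
        return (miss[i] + fa[i]) / 2

    rng = np.random.default_rng(0)
    ss = rng.normal(2.0, 1.5, 500)    # toy same-speaker LLRs
    ds = rng.normal(-2.0, 1.5, 5000)  # toy different-speaker LLRs
    print(f"EER = {eer(ss, ds):.2%}, Cllr = {cllr(ss, ds):.3f}")

    Scoring each speaker's trials separately with the same functions would expose the kind of per-speaker variability the paper reports.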

    Toward an affect-sensitive multimodal human-computer interaction

    This paper argues that next-generation human-computer interaction (HCI) designs need to include the essence of emotional intelligence -- the ability to recognize a user's affective states -- in order to become more human-like, more effective, and more efficient. Affective arousal modulates all nonverbal communicative cues (facial expressions, body movements, and vocal and physiological reactions). In a face-to-face interaction, humans detect and interpret those interactive signals of their communicator with little or no effort, yet the design and development of an automated system that accomplishes these tasks is rather difficult. This paper surveys past work on solving these problems by computer and provides a set of recommendations for developing the first part of an intelligent multimodal HCI -- an automatic personalized analyzer of a user's nonverbal affective feedback.

    Machine Learning Methods for Automatic Silent Speech Recognition Using a Wearable Graphene Strain Gauge Sensor.

    Silent speech recognition is the ability to recognise intended speech without audio information. Useful applications can be found in situations where sound waves are not produced or cannot be heard. Examples include speakers with physical voice impairments or environments in which audio transference is not reliable or secure. A device which can detect non-auditory signals and map them to intended phonation could assist in such situations. In this work, we propose a graphene-based strain gauge sensor which can be worn on the throat and detect small muscle movements and vibrations. Machine learning algorithms then decode the non-audio signals and produce a prediction of the intended speech. The proposed strain gauge sensor is highly wearable, utilising graphene's unique and beneficial properties, including strength, flexibility, and high conductivity. A highly flexible and wearable sensor able to pick up small throat movements is fabricated by screen printing graphene onto lycra fabric. A framework for interpreting this information is proposed which explores the use of several machine learning techniques to predict intended words from the signals. A dataset of 15 unique words and four movements, each with 20 repetitions, was developed and used for the training of the machine learning algorithms. The results demonstrate the ability of such sensors to predict spoken words: we produced a word accuracy rate of 55% on the word dataset and 85% on the movements dataset. This work demonstrates a proof of concept for the viability of combining a highly wearable graphene strain gauge and machine learning methods to automate silent speech recognition.
    EP/S023046/
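
    The abstract names neither the features nor the classifiers it compares, so the following is a deliberately generic stand-in: simple summary statistics per strain-gauge recording fed to a random forest, with synthetic signals shaped like the 15-word, 20-repetition dataset. Every name and number in it is illustrative.

    # Hypothetical sketch: strain-gauge time series -> word label.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def featurize(sig):
        """Collapse one recording into a small feature vector."""
        return [sig.mean(), sig.std(), sig.min(), sig.max(),
                np.abs(np.diff(sig)).mean()]  # mean absolute slope

    # Toy stand-in for the dataset: 15 words, 20 repetitions each.
    rng = np.random.default_rng(0)
    n_words, reps, length = 15, 20, 200
    X = np.array([featurize(rng.standard_normal(length) + 0.3 * w)
                  for w in range(n_words) for _ in range(reps)])
    y = np.repeat(np.arange(n_words), reps)

    clf = RandomForestClassifier(random_state=0)
    print(cross_val_score(clf, X, y, cv=5).mean())  # word accuracy estimate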