
    Outlier Edge Detection Using Random Graph Generation Models and Applications

    Outliers are samples that are generated by different mechanisms from normal data samples. Graphs, in particular social network graphs, may contain nodes and edges created by scammers, malicious programs, or mistakenly by normal users. Detecting outlier nodes and edges is important for data mining and graph analytics. However, previous research in the field has mainly focused on detecting outlier nodes. In this article, we study the properties of edges and propose outlier edge detection algorithms using two random graph generation models. We found that the edge-ego-network, defined as the induced graph that contains the two end nodes of an edge, their neighboring nodes, and the edges that link these nodes, contains critical information for detecting outlier edges. We evaluated the proposed algorithms by injecting outlier edges into real-world graph data. Experimental results show that the proposed algorithms can effectively detect outlier edges. In particular, the algorithm based on the Preferential Attachment random graph generation model consistently gives good performance regardless of the test graph data. Furthermore, the proposed algorithms are not limited to outlier edge detection. We demonstrate three different applications that benefit from them: 1) a preprocessing tool that improves the performance of graph clustering algorithms; 2) an outlier node detection algorithm; and 3) a novel noisy data clustering algorithm. These applications show the great potential of the proposed outlier edge detection techniques. (Comment: 14 pages, 5 figures, journal paper)
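
    A minimal sketch of the idea, assuming NetworkX is available: the `edge_ego_network` helper builds the induced subgraph around an edge, and `pa_edge_score` is an illustrative outlier score that compares observed common neighbors against a count loosely expected under a preferential-attachment-style model; the function names and the scoring formula are placeholders, not the paper's algorithm.

```python
import networkx as nx

def edge_ego_network(G, u, v):
    """Induced subgraph on u, v, and all of their neighbors (the edge-ego-network)."""
    nodes = {u, v} | set(G.neighbors(u)) | set(G.neighbors(v))
    return G.subgraph(nodes)

def pa_edge_score(G, u, v):
    """Illustrative outlier score: common neighbors observed around the edge
    versus the count loosely expected if edges formed by preferential
    attachment (probability proportional to the degree product)."""
    common = len(set(G.neighbors(u)) & set(G.neighbors(v)))
    expected = G.degree(u) * G.degree(v) / (2.0 * G.number_of_edges())
    return expected - common          # higher = more outlier-like

G = nx.karate_club_graph()
scores = {(u, v): pa_edge_score(G, u, v) for u, v in G.edges()}
u, v = max(scores, key=scores.get)    # most suspicious edge under this score
print((u, v), edge_ego_network(G, u, v).number_of_nodes())
```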

    Speech Processing in Computer Vision Applications

    Deep learning has recently proven to be a viable asset in determining features in the field of speech analysis. Deep learning methods such as Convolutional Neural Networks facilitate the extraction of specific feature information from waveforms, allowing networks to create more feature-dense representations of data. Our work attempts to address the problem of re-creating a face from a speaker's voice and of speaker identification using deep learning methods. In this work, we first review the fundamental background in speech processing and its related applications. Then we introduce novel deep learning-based methods for speech feature analysis. Finally, we present our deep learning approaches to speaker identification and speech-to-face synthesis. The presented method can convert a speaker's audio sample into an image of their predicted face. This framework is composed of several networks chained together, each performing an essential step in the conversion process. These include audio embedding, encoding, and face generation networks, respectively. Our experiments show that certain vocal features map to the face, that DNNs can create a speaker's face from their voice, and that a GUI can be used in conjunction with these networks to display a speaker recognition network's data.
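
    A minimal sketch of a chained speech-to-face pipeline in PyTorch, assuming a raw waveform input and a small generated image; the `AudioEmbedder` and `FaceGenerator` modules, their layer sizes, and the 32x32 output resolution are illustrative placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn

class AudioEmbedder(nn.Module):
    # Illustrative stand-in for the audio-embedding network:
    # a 1-D CNN that maps a waveform to a fixed-size vector.
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, 9, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, 9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, wav):                 # wav: (batch, 1, samples)
        return self.fc(self.conv(wav).squeeze(-1))

class FaceGenerator(nn.Module):
    # Illustrative decoder that upsamples an embedding to a small face image.
    def __init__(self, emb_dim=128):
        super().__init__()
        self.fc = nn.Linear(emb_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, emb):
        x = self.fc(emb).view(-1, 256, 4, 4)
        return self.deconv(x)               # (batch, 3, 32, 32)

wav = torch.randn(2, 1, 16000)              # two one-second dummy clips
face = FaceGenerator()(AudioEmbedder()(wav))
print(face.shape)                           # torch.Size([2, 3, 32, 32])
```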

    Continuous Authentication for Voice Assistants

    Voice has become an increasingly popular User Interaction (UI) channel, mainly contributing to the ongoing trend of wearables, smart vehicles, and home automation systems. Voice assistants such as Siri, Google Now, and Cortana have become our everyday fixtures, especially in scenarios where touch interfaces are inconvenient or even dangerous to use, such as driving or exercising. Nevertheless, the open nature of the voice channel makes voice assistants difficult to secure and exposed to various attacks, as demonstrated by security researchers. In this paper, we present VAuth, the first system that provides continuous and usable authentication for voice assistants. We design VAuth to fit in various widely adopted wearable devices, such as eyeglasses, earphones/buds, and necklaces, where it collects the body-surface vibrations of the user and matches them with the speech signal received by the voice assistant's microphone. VAuth guarantees that the voice assistant executes only the commands that originate from the voice of the owner. We evaluated VAuth with 18 users and 30 voice commands and found that it achieves almost perfect matching accuracy with less than a 0.1% false positive rate, regardless of VAuth's position on the body and the user's language, accent, or mobility. VAuth successfully thwarts different practical attacks, such as replay attacks, mangled voice attacks, and impersonation attacks. It also has low energy and latency overheads and is compatible with most existing voice assistants.
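
    A simplified sketch of the matching step, assuming the wearable supplies a body-surface vibration signal and the assistant supplies the microphone signal: a command is accepted only if the normalized cross-correlation of the two exceeds a threshold. The abstract does not spell out VAuth's matching algorithm, so `normalized_xcorr_peak`, `accept_command`, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

def normalized_xcorr_peak(a, b):
    """Peak of the normalized cross-correlation between two signals."""
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    xc = np.correlate(a, b, mode="full") / len(a)
    return xc.max()

def accept_command(vibration, mic, threshold=0.5):
    # Execute the voice command only if the body-surface vibration and the
    # microphone signal are sufficiently correlated (i.e. the same person spoke).
    return normalized_xcorr_peak(vibration, mic) >= threshold

fs = 8000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 200 * t) * np.hanning(fs)    # toy "voiced" signal
vibration = 0.3 * speech + 0.05 * np.random.randn(fs)     # correlated sensor reading
attack = np.random.randn(fs)                              # replayed/injected audio

print(accept_command(vibration, speech))   # True: owner's voice matches the sensor
print(accept_command(vibration, attack))   # False: injected audio does not
```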

    Reconstruction of Phonated Speech from Whispers Using Formant-Derived Plausible Pitch Modulation

    Whispering is a natural, unphonated, secondary aspect of speech communication for most people. However, it is the primary mechanism of communication for some speakers who have impaired voice production mechanisms, such as partial laryngectomees, as well as for those prescribed voice rest, which often follows surgery or damage to the larynx. Unlike most people, who choose when to whisper and when not to, these speakers may have little choice but to rely on whispers for much of their daily vocal interaction. Even though most speakers will whisper at times, and some speakers can only whisper, the majority of today's computational speech technology systems assume or require phonated speech. This article considers the conversion of whispers into natural-sounding phonated speech as a noninvasive prosthetic aid for people with voice impairments who can only whisper. As a by-product, the technique is also useful for unimpaired speakers who choose to whisper. Speech reconstruction systems can be classified into those requiring training and those that do not. Among the latter, a recent parametric reconstruction framework is explored and then enhanced through a refined estimation of plausible pitch from weighted formant differences. The improved reconstruction framework, with the proposed formant-derived artificial pitch modulation, is validated through subjective and objective comparison tests alongside state-of-the-art alternatives.
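
    A toy sketch of deriving a plausible pitch contour from formant movement, in the spirit of the weighted formant differences mentioned above; the `plausible_pitch` mapping, its weights, and the 120 Hz base value are placeholders rather than the published estimator.

```python
import numpy as np

def plausible_pitch(f1, f2, base=120.0, w1=0.06, w2=0.02):
    """Toy pitch track derived from formant movement.

    The framework derives a plausible pitch contour from weighted formant
    differences; the weights and mapping here are illustrative, not the
    published ones.
    """
    df1 = np.diff(f1, prepend=f1[0])
    df2 = np.diff(f2, prepend=f2[0])
    return base + w1 * df1 + w2 * df2          # Hz, one value per frame

# Dummy formant tracks for 100 frames (e.g. 10 ms each) of a whispered vowel.
frames = 100
f1 = 500 + 80 * np.sin(np.linspace(0, np.pi, frames))
f2 = 1500 + 200 * np.sin(np.linspace(0, 2 * np.pi, frames))

f0 = plausible_pitch(f1, f2)
print(f0.min(), f0.max())   # a smooth contour around the 120 Hz base
```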

    TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection

    Thanks to recent advancements in end-to-end speech modeling technology, it has become increasingly feasible to imitate and clone a user's voice. This leads to a significant challenge in differentiating between authentic and fabricated audio segments. To address the issue of user voice abuse and misuse, the second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and analyze deepfake speech utterances. Specifically, Track 2, named Manipulation Region Location (RL), aims to pinpoint the location of manipulated regions in audio, which can be present in both real and generated audio segments. We propose our novel TranssionADD system as a solution to the challenging problems of model robustness and audio segment outliers in the trace competition. Our system provides three unique contributions: 1) we adapt the sequence tagging task for audio deepfake detection; 2) we improve model generalization through various data augmentation techniques; 3) we incorporate a multi-frame detection (MFD) module to overcome the limited representation provided by a single frame, and use an isolated-frame penalty (IFP) loss to handle outliers in segments. Our best submission achieved 2nd place in Track 2, demonstrating the effectiveness and robustness of our proposed system.
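
    A rough sketch of frame-level sequence tagging with an isolated-frame penalty, assuming per-frame real/fake logits come from some backbone model; the smoothness term in `tagging_loss` only illustrates the idea of penalizing isolated single-frame decisions and is not the exact IFP loss or MFD module used by TranssionADD.

```python
import torch
import torch.nn.functional as F

def tagging_loss(logits, labels, ifp_weight=0.5):
    """Frame-level real/fake tagging loss with an isolated-frame penalty.

    logits: (batch, frames, 2) per-frame scores; labels: (batch, frames)
    with 0 = real, 1 = manipulated.  The penalty below is an illustration
    of the IFP idea, not the published loss.
    """
    ce = F.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
    probs = logits.softmax(-1)[..., 1]               # P(manipulated) per frame
    # Penalize frames whose prediction disagrees with both neighbors,
    # i.e. isolated single-frame decisions inside an otherwise uniform segment.
    ifp = ((probs[:, 1:-1] - probs[:, :-2]).abs()
           * (probs[:, 1:-1] - probs[:, 2:]).abs()).mean()
    return ce + ifp_weight * ifp

logits = torch.randn(4, 200, 2, requires_grad=True)   # 4 clips, 200 frames each
labels = torch.randint(0, 2, (4, 200))
loss = tagging_loss(logits, labels)
loss.backward()
print(loss.item())
```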

    SPEECH EMOTION DETECTION USING MACHINE LEARNING TECHNIQUES

    Communication is the key to expressing one's thoughts and ideas clearly. Amongst all forms of communication, speech is the most preferred and powerful form in humans. The era of the Internet of Things (IoT) is rapidly advancing in bringing more intelligent systems into everyday use. These applications range from simple wearables and widgets to complex self-driving vehicles and automated systems employed in various fields. Intelligent applications are interactive, require minimal user effort to function, and mostly operate on voice-based input. This creates the necessity for these computer applications to comprehend human speech completely. A speech percept can reveal information about the speaker, including gender, age, language, and emotion. Several existing speech recognition systems used in IoT applications are integrated with an emotion detection system in order to analyze the emotional state of the speaker. The performance of the emotion detection system can greatly influence the overall performance of the IoT application in many ways and can provide many advantages to the functionalities of these applications. This research presents a speech emotion detection system with improvements over an existing system in terms of data, feature selection, and methodology, aiming to classify speech percepts by emotion more accurately.
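
    A minimal sketch of a speech emotion pipeline in the spirit described here, assuming librosa and scikit-learn are available: means and standard deviations of 13 MFCCs as features, with an SVM classifier on top. The feature set, the toy random data, and the four emotion labels are placeholders, not the data or features used in this research.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def emotion_features(wav, sr=16000):
    """Mean and std of 13 MFCCs: a common, minimal feature set for
    speech-emotion classification (the paper's exact features differ)."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Toy stand-in data: random "utterances" with made-up emotion labels.
rng = np.random.default_rng(0)
X = np.stack([emotion_features(rng.standard_normal(16000).astype(np.float32))
              for _ in range(40)])
y = rng.integers(0, 4, size=40)            # e.g. 0=neutral, 1=happy, 2=sad, 3=angry

clf = SVC(kernel="rbf").fit(X[:30], y[:30])
print(clf.score(X[30:], y[30:]))           # accuracy on held-out toy samples
```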

    Using transcranial direct-current stimulation (tDCS) to understand cognitive processing

    Noninvasive brain stimulation methods are becoming increasingly common tools in the kit of the cognitive scientist. In particular, transcranial direct-current stimulation (tDCS) is showing great promise as a tool to causally manipulate the brain and understand how information is processed. The popularity of this method of brain stimulation rests on the fact that it is safe, inexpensive, its effects are long lasting, and it can increase the likelihood that neurons will fire near one electrode and decrease the likelihood that neurons will fire near another. However, this method of manipulating the brain to draw causal inferences is not without complication. Because tDCS methods continue to be refined and are not yet standardized, there are reports in the literature that show some striking inconsistencies. Primary among the complications of the technique is that tDCS uses two or more electrodes to pass current, and all of these electrodes will have effects on the tissue underneath them. In this tutorial, we share what we have learned about using tDCS to manipulate how the brain perceives, attends, remembers, and responds to information from our environment. Our goal is to provide a starting point for new users of tDCS and to spur discussion of the standardization of methods to enhance replicability. The authors declare that they had no conflicts of interest with respect to their authorship or the publication of this article. This work was supported by grants from the National Institutes of Health (R01-EY019882, R01-EY025272, P30-EY08126, F31-MH102042, and T32-EY007135). Accepted manuscript.

    Opinion mining with the SentiWordNet lexical resource

    Sentiment classification concerns the application of automatic methods for predicting the orientation of sentiment present in text documents. It is an important subject in opinion mining research, with applications in a number of areas including recommender and advertising systems, customer intelligence, and information retrieval. SentiWordNet is a lexical resource of sentiment information for terms in the English language designed to assist in opinion mining tasks, where each term is associated with numerical scores for positive and negative sentiment. A resource that makes term-level sentiment information readily available could be of use in building more effective sentiment classification methods. This research presents the results of an experiment that applied the SentiWordNet lexical resource to the problem of automatic sentiment classification of film reviews. First, a data set of relevant features extracted from text documents using SentiWordNet was designed and implemented. The resulting feature set was then used as input for training a support vector machine classifier to predict the sentiment orientation of the underlying film review. Several scenarios were executed, exploring variations in the parameters that generate the data set, outlier removal, and feature selection. The results obtained were compared to other methods documented in the literature and are in line with other experiments that propose similar approaches and use the same data set of film reviews, indicating that SentiWordNet could become an important resource for the task of sentiment classification. Considerations on future improvements are also presented, based on a detailed analysis of the classification results.
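
    A minimal sketch of the general approach, assuming NLTK's SentiWordNet corpus and scikit-learn: token-level positive/negative scores are aggregated into a small feature vector and fed to a linear SVM. The `swn_features` function, the first-sense scoring, and the two toy reviews are illustrative simplifications, not the feature set built in the paper.

```python
import numpy as np
from nltk.corpus import sentiwordnet as swn
from sklearn.svm import LinearSVC

# Requires: nltk.download("sentiwordnet"); nltk.download("wordnet")

def swn_features(text):
    """Aggregate positive/negative SentiWordNet scores over the tokens of a
    review; a much simpler feature set than the one built in the paper."""
    pos = neg = 0.0
    for tok in text.lower().split():
        synsets = list(swn.senti_synsets(tok))
        if synsets:                           # score with the first sense only
            pos += synsets[0].pos_score()
            neg += synsets[0].neg_score()
    return np.array([pos, neg, pos - neg])

reviews = ["a wonderful moving and beautiful film",
           "a dull boring and painful mess"]
labels = [1, 0]                               # 1 = positive, 0 = negative

X = np.stack([swn_features(r) for r in reviews])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))                         # typically [1 0] on these examples
```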

    Role of N-methyl-D-aspartate receptors in action-based predictive coding deficits in schizophrenia

    Published in final edited form as: Biol Psychiatry. 2017 March 15; 81(6): 514–524. doi:10.1016/j.biopsych.2016.06.019. BACKGROUND: Recent theoretical models of schizophrenia posit that dysfunction of the neural mechanisms subserving predictive coding contributes to symptoms and cognitive deficits, and this dysfunction is further posited to result from N-methyl-D-aspartate glutamate receptor (NMDAR) hypofunction. Previously, by examining auditory cortical responses to self-generated speech sounds, we demonstrated that predictive coding during vocalization is disrupted in schizophrenia. To test the hypothesized contribution of NMDAR hypofunction to this disruption, we examined the effects of the NMDAR antagonist ketamine on predictive coding during vocalization in healthy volunteers and compared them with the effects of schizophrenia. METHODS: In two separate studies, the N1 components of the event-related potentials elicited by speech sounds during vocalization (talk) and passive playback (listen) were compared to assess the degree of N1 suppression during vocalization, a putative measure of auditory predictive coding. In the crossover study, 31 healthy volunteers completed two randomly ordered test days, a saline day and a ketamine day. Event-related potentials during the talk/listen task were obtained before infusion and during infusion on both days, and N1 amplitudes were compared across days. In the case-control study, N1 amplitudes from 34 schizophrenia patients and 33 healthy control volunteers were compared. RESULTS: N1 suppression to self-produced vocalizations was significantly and similarly diminished by ketamine (Cohen's d = 1.14) and schizophrenia (Cohen's d = .85). CONCLUSIONS: Disruption of NMDARs causes dysfunction in predictive coding during vocalization in a manner similar to the dysfunction observed in schizophrenia patients, consistent with the theorized contribution of NMDAR hypofunction to predictive coding deficits in schizophrenia. This work was supported by AstraZeneca for an investigator-initiated study (DHM) and by National Institute of Mental Health Grant Nos. R01 MH-58262 (to JMF) and T32 MH089920 (to NSK). JHK was supported by Yale Center for Clinical Investigation Grant No. UL1RR024139 and US National Institute on Alcohol Abuse and Alcoholism Grant No. P50AA012879. Accepted manuscript.
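
    A small illustration of the effect-size comparison reported above, assuming N1 suppression is quantified per subject as the listen-minus-talk N1 amplitude; the `cohens_d` helper implements the standard pooled-variance formula, and the group values below are synthetic, not the study data.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two groups, using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

# N1 suppression = listen N1 amplitude minus talk N1 amplitude (microvolts);
# larger values mean stronger predictive-coding suppression during vocalization.
rng = np.random.default_rng(1)
controls = rng.normal(2.0, 1.0, 33)     # synthetic values, not the study data
patients = rng.normal(1.2, 1.0, 34)

print(cohens_d(controls, patients))     # effect size of the group difference
```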

    Text-to-Speech translation using Support Vector Machine, an approach to find a potential path for human-computer speech synthesizer

    Text-to-Speech (TTS), a striking capability for giving computers intelligence and producing sound, is a challenging task because it involves the propagation of uncertainty from the input. This is because TTS evaluates the input based on probabilities rather than certainties. TTS is accomplished by generating the sound structure (phonemes) and then classifying these phonemes against a phonetic dictionary. Ward's algorithm, BIRCH, and Support Vector Machines (SVM) are used to determine the appropriate sound representation for the given context. To distinguish correct elocution, the SVM procedures are equipped with pruning. The output was analyzed at different levels of uncertainty. To study the quality of the output, 10 listeners were recruited to determine the signal-to-noise ratio (SNR). The SNR results show that errors of both types, phase and uncertainty, were approximately 6%, resulting in 94% accuracy. These results indicate that the SVM strategy can be used to obtain better results for a TTS synthesizer.
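
    A minimal sketch of the SVM classification step, assuming each phoneme is summarized by a small acoustic feature vector; the toy clustered data and the three phoneme labels are placeholders, and the Ward's/BIRCH clustering and pruning stages of the paper's pipeline are omitted.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Toy stand-in: each "phoneme" is summarized by a 4-dimensional acoustic
# feature vector drawn from a well-separated cluster; the paper clusters
# sounds (Ward's/BIRCH) before the SVM step, which this sketch skips.
rng = np.random.default_rng(0)
phonemes = ["a", "e", "s"]
X = np.vstack([rng.normal(loc=i, scale=0.3, size=(50, 4))
               for i in range(len(phonemes))])
y = np.repeat(np.arange(len(phonemes)), 50)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf").fit(Xtr, ytr)
print("phoneme classification accuracy:", clf.score(Xte, yte))
```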