
    Speech Detection for an Assessment Proctoring System (iProctor) Using a Deep Learning Method

    Assessment is the activity of collecting information on students' attainment of competencies. It is an integral part of the learning process, including learning based on Outdoor Learning (ODL) and Massive Open Online Courses (MOOC). One study reports that the percentage of students who cheat in academic activities continues to rise, and that it is easier for them to cheat on assessments conducted online. This poses a challenge for the development of iProctor, a platform for conducting assessments online. To reduce the risk of cheating, a valid system for administering and proctoring exams is essential. This study evaluates an automatic audio-based proctoring system. The audio data are obtained from a microphone placed in the room where the assessment takes place. Proctoring is performed automatically through speech detection using a deep learning approach with a CNN model. Features are extracted from the audio data using the log-mel spectrogram, and the extracted features serve as input to a MobileNetV3 CNN. The MobileNetV3 predictions are then smoothed with a Majority Vote method. The results show that speech detection performs best with the MobileNetV3-Large CNN model on the LibriSpeech dataset, with a speech F1 score of 0.8652, a non-speech F1 score of 0.7332, and a weighted average of 0.8242. Feature extraction uses a log-mel spectrogram with an FFT size of 512, 40 mel bins, a hop size of 8, a lower frequency of 300 Hz, and an upper frequency of 8000 Hz. The log-mel spectrogram output is divided into frames of 25 ms with a step of 12.5 ms, i.e., 50% overlap.
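
    The following is a minimal Python sketch of the feature extraction and smoothing steps described above, assuming 16 kHz LibriSpeech audio and interpreting the reported hop size of 8 as 8 ms; the function names and the smoothing window length are illustrative, not taken from the paper. The MobileNetV3 classifier itself would consume the resulting log-mel features and is not shown.

```python
# Sketch of log-mel feature extraction and majority-vote smoothing (assumptions noted inline).
import numpy as np
import librosa

SR = 16000                      # assumed sample rate (LibriSpeech audio is 16 kHz)
N_FFT, N_MELS = 512, 40         # FFT size and mel bins from the abstract
HOP = int(0.008 * SR)           # hop size of 8, assumed to mean 8 ms
FMIN, FMAX = 300, 8000          # lower / upper frequency bounds from the abstract

def log_mel(wav: np.ndarray) -> np.ndarray:
    """Compute the log-mel spectrogram used as input to the CNN."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP,
        n_mels=N_MELS, fmin=FMIN, fmax=FMAX)
    return librosa.power_to_db(mel)

def majority_vote(labels: np.ndarray, win: int = 5) -> np.ndarray:
    """Smooth frame-wise speech/non-speech predictions with a sliding majority vote.

    labels: integer array of per-frame predictions (0 = non-speech, 1 = speech).
    win: smoothing window length (illustrative choice, not from the paper).
    """
    smoothed = labels.copy()
    half = win // 2
    for i in range(len(labels)):
        window = labels[max(0, i - half): i + half + 1]
        smoothed[i] = np.bincount(window).argmax()
    return smoothed
```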

    Analysis of audio data to measure social interaction in the treatment of autism spectrum disorder using speaker diarization and identification

    Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects communication and behavior in social environments. Some common characteristics of a person with ASD include difficulty with communication or interaction with others, restricted interests paired with repetitive behaviors, and other symptoms that may affect the person's overall social life. People with ASD endure a lower quality of life due to their inability to navigate their daily social interactions. Autism is referred to as a spectrum disorder due to the variation in type and severity of symptoms. Therefore, measurement of the social interaction of a person with ASD in a clinical setting is inaccurate because the tests are subjective, time consuming, and not naturalistic. The goal of this study is to lay the foundation for passively collecting continuous audio data from people with ASD through a voice recorder application that runs in the background of their mobile device, and to propose a methodology for understanding and analyzing the collected audio data while maintaining minimal human intervention. Speaker Diarization and Speaker Identification are two methods that are explored to answer essential questions when processing unlabeled audio data, such as who spoke when and to whom a certain speaker label belongs. Speaker Diarization is the process of partitioning an audio signal that involves multiple people into homogeneous segments associated with each person. It provides an answer to the question of "who spoke when?". The implemented Speaker Diarization algorithm utilizes state-of-the-art d-vector embeddings that take advantage of neural networks trained on large datasets, so that variation in speech, accent, and acoustic conditions of the audio signal can be better accounted for. Furthermore, the algorithm uses a non-parametric, connectivity-based clustering algorithm commonly known as spectral clustering. The spectral clustering algorithm is applied to these previously extracted d-vector embeddings to determine the number of unique speakers and assign each portion of the audio file to a specific cluster. Through various experiments and trials, we chose Microsoft Azure Cognitive Services due to the robust algorithms and models that are available to identify speakers in unlabeled audio data. The Speaker Identification API from Microsoft Azure Cognitive Services provides a state-of-the-art service to identify human voices through RESTful API calls. A simple web interface was implemented to send audio data to the Speaker Identification API, which returned data in JSON format. This returned data provides an answer to the question of "to whom does a certain speaker label belong?". The proposed methods were tested extensively on numerous audio files that contain various numbers of speakers emulating a realistic conversational exchange. The results support our goal of digitally measuring the social interaction of people with ASD through the analysis of audio data while maintaining minimal human intervention. We were able to identify our target speaker and differentiate them from others given an audio signal, which could ultimately unlock valuable insights such as creating a biomarker to measure response to treatment.
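
    As a rough illustration of the clustering stage described above, the sketch below applies spectral clustering to segment-level d-vector embeddings. The d-vector extractor is assumed to be a separate pretrained model (not shown), the cosine-similarity affinity is a simple choice rather than necessarily the study's exact construction, and the number of speakers is taken as given here even though the study's algorithm estimates it from the data.

```python
# Sketch of speaker diarization by spectral clustering of d-vector embeddings.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

def diarize(embeddings: np.ndarray, n_speakers: int) -> np.ndarray:
    """Cluster segment-level d-vectors into speaker labels ("who spoke when").

    embeddings: (n_segments, d) array of d-vector embeddings, one per audio segment.
    n_speakers: assumed to be known for this sketch.
    Returns one cluster label per segment.
    """
    affinity = cosine_similarity(embeddings)   # pairwise similarity graph between segments
    affinity = np.clip(affinity, 0.0, 1.0)     # keep edge weights non-negative
    clustering = SpectralClustering(
        n_clusters=n_speakers, affinity="precomputed", random_state=0)
    return clustering.fit_predict(affinity)
```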

    Analysis and Annotation of Emotional Traits on Audio Conversations in Real-time

    It is a challenging task for computers to recognize humans’ emotions through their conversations. Therefore, this research aims to analyze conversational audio data, label the emotions expressed, and annotate and visualize the identified emotional traits of audio conversations in real time. To enable the computer to process speech emotion features, the raw audio is converted from the time domain to the frequency domain and speech emotion features are extracted using Mel-Frequency Cepstral Coefficients. For speech emotion recognition, a deep neural network and an extreme learning machine are used to predict emotional traits. Each emotional trait is captured by its recognition precision. There are four emotional traits in the dataset: sadness, happiness, neutral, and anger. The total precision value of the four emotional traits is normalized to 1. In this study, the normalized precision is used as the relative intensity of each emotional trait, which is labeled and displayed along with the conversation. For better visualization, a Graphical User Interface displays the waveform graph, spectrogram graph, and speech emotion prediction graph of a given speech audio. In addition, the effect of the voice activity detection algorithm is analyzed; the timestamps for emotion annotation can be obtained from the result of voice activity detection.
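
    Below is a minimal sketch of the feature and labeling steps described above: MFCC extraction of the raw audio and normalization of the per-emotion scores so that the four traits sum to 1. The classifier (deep neural network or extreme learning machine) and the voice activity detector are assumed to be provided elsewhere, and choices such as the number of MFCC coefficients are assumptions for the sketch.

```python
# Sketch of MFCC extraction and relative-intensity normalization of emotion scores.
import numpy as np
import librosa

EMOTIONS = ["sadness", "happiness", "neutral", "anger"]  # traits listed in the abstract

def extract_mfcc(wav: np.ndarray, sr: int, n_mfcc: int = 13) -> np.ndarray:
    """Convert time-domain audio to MFCC features (n_mfcc x frames); n_mfcc is an assumed value."""
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc)

def relative_intensity(scores: np.ndarray) -> dict:
    """Normalize the four per-emotion scores so they sum to 1 (relative intensity)."""
    weights = scores / scores.sum()
    return dict(zip(EMOTIONS, weights))
```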

    Feasibility of Deep Learning-Based Analysis of Auscultation for Screening Significant Stenosis of Native Arteriovenous Fistula for Hemodialysis Requiring Angioplasty

    Objective: To investigate the feasibility of using a deep learning-based analysis of auscultation data to predict significant stenosis of arteriovenous fistulas (AVF) in patients undergoing hemodialysis requiring percutaneous transluminal angioplasty (PTA). Materials and methods: Forty patients (24 male and 16 female; median age, 62.5 years) with dysfunctional native AVF were prospectively recruited. Digital sounds from the AVF shunt were recorded using a wireless electronic stethoscope before (pre-PTA) and after PTA (post-PTA), and the audio files were subsequently converted to mel spectrograms, which were used to construct various deep convolutional neural network (DCNN) models (DenseNet201, EfficientNetB5, and ResNet50). The performance of these models for diagnosing ≥ 50% AVF stenosis was assessed and compared. The ground truth for the presence of ≥ 50% AVF stenosis was obtained using digital subtraction angiography. Gradient-weighted class activation mapping (Grad-CAM) was used to produce visual explanations for DCNN model decisions. Results: Eighty audio files were obtained from the 40 recruited patients and pooled for the study. Mel spectrograms of "pre-PTA" shunt sounds showed patterns corresponding to abnormal high-pitched bruits with systolic accentuation observed in patients with stenotic AVF. The ResNet50 and EfficientNetB5 models yielded an area under the receiver operating characteristic curve of 0.99 and 0.98, respectively, at optimized epochs for predicting ≥ 50% AVF stenosis. However, Grad-CAM heatmaps revealed that only ResNet50 highlighted areas relevant to AVF stenosis in the mel spectrogram. Conclusion: Mel spectrogram-based DCNN models, particularly ResNet50, successfully predicted the presence of significant AVF stenosis requiring PTA in this feasibility study and may potentially be used in AVF surveillance.
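
    For illustration, the sketch below builds a ResNet50 classifier over mel-spectrogram images for the binary task of predicting ≥ 50% AVF stenosis, in the spirit of the approach described above. The input size, ImageNet pretraining, and training configuration are assumptions made for the sketch and are not details reported in the abstract.

```python
# Sketch of a ResNet50-based binary classifier for mel-spectrogram images.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_stenosis_classifier(input_shape=(224, 224, 3)) -> tf.keras.Model:
    """ResNet50 backbone with a binary head (stenosis >= 50% vs. < 50%); sizes are assumptions."""
    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(base.output)   # pool spatial features from the backbone
    output = layers.Dense(1, activation="sigmoid")(x)  # probability of significant stenosis
    model = models.Model(base.input, output)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```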