
    Denoising Deep Neural Networks Based Voice Activity Detection

    Deep-belief-network (DBN) based voice activity detection (VAD) has recently been proposed. It is powerful in fusing the advantages of multiple features and achieves state-of-the-art performance. However, the deep layers of the DBN-based VAD do not show an apparent superiority over the shallower layers. In this paper, we propose a denoising-deep-neural-network (DDNN) based VAD to address this problem. Specifically, we pre-train a deep neural network in a special unsupervised, denoising, greedy layer-wise mode, and then fine-tune the whole network in a supervised way with the common back-propagation algorithm. In the pre-training phase, we take the noisy speech signals as the visible layer and try to extract a new feature that minimizes the reconstruction cross-entropy loss between the noisy speech signals and their corresponding clean speech signals. Experimental results show that the proposed DDNN-based VAD not only outperforms the DBN-based VAD but also shows an apparent performance improvement of the deep layers over the shallower layers. Comment: This paper has been accepted by IEEE ICASSP-2013, and will be published online after May, 201
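
The denoising pre-training objective described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tied-weight single layer, the sigmoid units, the assumption that features are scaled to [0, 1], and every name below (`reconstruct`, `reconstruction_cross_entropy`, `W`, `b_hid`, `b_vis`) are hypothetical choices made only for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def reconstruct(noisy, W, b_hid, b_vis):
    """One tied-weight layer (assumption): encode the noisy input, then decode."""
    n_vis, n_hid = len(noisy), len(W)
    hidden = [sigmoid(sum(W[j][i] * noisy[i] for i in range(n_vis)) + b_hid[j])
              for j in range(n_hid)]
    recon = [sigmoid(sum(W[j][i] * hidden[j] for j in range(n_hid)) + b_vis[i])
             for i in range(n_vis)]
    return hidden, recon

def reconstruction_cross_entropy(clean, recon, eps=1e-12):
    """Cross-entropy between the clean target and the reconstruction
    obtained from the noisy input (both assumed to lie in [0, 1])."""
    return -sum(c * math.log(r + eps) + (1.0 - c) * math.log(1.0 - r + eps)
                for c, r in zip(clean, recon))

# Toy example with made-up numbers: 3 visible units, 2 hidden units
clean = [1.0, 0.0, 1.0]
noisy = [0.8, 0.3, 0.6]
W = [[0.5, -0.4, 0.3], [-0.2, 0.1, 0.6]]
hidden, recon = reconstruct(noisy, W, b_hid=[0.0, 0.0], b_vis=[0.0, 0.0, 0.0])
loss = reconstruction_cross_entropy(clean, recon)
```

In a greedy layer-wise scheme, `hidden` would serve as the input feature for the next layer once this layer's loss has been minimized; the training loop itself is omitted here.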

    Perancangan Simulasi dan Implementasi Voice Activity Detection (VAD) pada TMS320C6455 / Design Simulation and Implementation Voice Activity Detection (VAD) on TMS320C6455

    ABSTRACT: Discontinuous Transmission (DTX) is a transmission method that makes data transmission more efficient in terms of power and bit rate. In audio transmission, DTX sends audio data only when an active voice period is detected. When no active period is detected, the transmitter sends no audio data, only a small packet notifying the receiver that the passive period is not being transmitted, so that the receiver generates comfort noise to fill the unsent passive period. Voice Activity Detection (VAD) plays an important role in DTX: it detects voice activity in every frame, which is what makes DTX possible. VAD is performed per frame, meaning that each frame is checked for the presence of voice activity. The feature used for this decision is Short Term Energy (STE). If a frame has an STE above the defined threshold, it is classified as an active frame; otherwise, it is classified as a passive frame. In this final project, the VAD system is built in simulation (using MatLab) and implemented on the TMS320C6455 DSK (using Code Composer Studio). Testing and analysis show that the optimal STE threshold depends on the noise conditions: the stronger the noise, the higher the optimal STE threshold. At an SNR of 5 dB, the optimal threshold is 15 dB, which yields an SDER of 5.2674%, an NDER of 0.1810%, and a DAPR of 99.38%. At an SNR of 0 dB, the optimal threshold is 18 dB, which yields an SDER of 4.2823%, an NDER of 1.9633%, and a DAPR of 97.87%. In the test with direct acquisition from the AIC23 codec on the TMS320C6455 DSK, the system achieved a DAPR of 100%. Keywords: DTX, VAD, STE, Noise, SDER, NDER, DAPR, TMS320C6455 DSK
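
The frame-wise STE rule described in this abstract can be sketched as follows. This is only an illustration, not the project's MatLab or DSP code: the frame length, the dB reference (full scale = 1.0), the threshold value, and the function names `short_term_energy_db` and `ste_vad` are all assumptions, so the threshold used here is not comparable to the 15 dB and 18 dB values reported above.

```python
import math

def short_term_energy_db(frame, eps=1e-12):
    """Mean-square energy of one frame in dB (reference: amplitude 1.0)."""
    energy = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(energy + eps)

def ste_vad(samples, frame_len, threshold_db):
    """Label each non-overlapping frame: True = active, False = passive."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        decisions.append(short_term_energy_db(frame) > threshold_db)
    return decisions

# Toy signal: one near-silent frame followed by one sinusoidal frame
quiet = [0.001] * 160
loud = [0.5 * math.sin(2.0 * math.pi * 0.05 * n) for n in range(160)]
decisions = ste_vad(quiet + loud, frame_len=160, threshold_db=-30.0)
```

A DTX transmitter built on such a detector would send audio only for frames marked active and a small notification packet otherwise; as the abstract notes, the threshold would need to be raised as the noise level increases.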

    Voicing classification of visual speech using convolutional neural networks

    The application of neural network and convolutional neural network (CNN) architectures is explored for the tasks of voicing classification (classifying frames as being either non-speech, unvoiced, or voiced) and voice activity detection (VAD) of visual speech. Experiments are conducted for both speaker dependent and speaker independent scenarios. A Gaussian mixture model (GMM) baseline system is developed using standard image-based two-dimensional discrete cosine transform (2D-DCT) visual speech features, achieving speaker dependent accuracies of 79% and 94% for voicing classification and VAD respectively. Additionally, a single-layer neural network system trained using the same visual features achieves accuracies of 86% and 97%. A novel technique using convolutional neural networks for visual speech feature extraction and classification is presented. The voicing classification and VAD results using the system are further improved to 88% and 98% respectively. The speaker independent results show the neural network system to outperform both the GMM and CNN systems, achieving accuracies of 63% for voicing classification and 79% for voice activity detection.

    The DKU-DukeECE Diarization System for the VoxCeleb Speaker Recognition Challenge 2022

    This paper describes the DKU-DukeECE submission to the 4th track of the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22). Our system contains a fused voice activity detection model, a clustering-based diarization model, and a target-speaker voice activity detection-based overlap detection model. Overall, the submitted system is similar to our previous year's system in VoxSRC-21. The difference is that we use a much better speaker embedding and a fused voice activity detection, which significantly improves the performance. Finally, we fuse 4 different systems using DOVER-lap and achieve a diarization error rate of 4.75, which ranks first in track 4. Comment: arXiv admin note: substantial text overlap with arXiv:2109.0200

    Automotive three-microphone voice activity detector and noise-canceller

    This paper addresses issues in improving hands-free speech recognition performance in car environments. A three-microphone array is used to form a beamformer with least mean squares (LMS) adaptation to improve the signal-to-noise ratio (SNR). The array operates in parallel with a voice activity detector (VAD). The VAD uses time-delay estimation together with magnitude-squared coherence (MSC)
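
For reference, the magnitude-squared coherence mentioned in this abstract is conventionally defined as below. The abstract does not give the formula, so the notation is an assumption: $x$ and $y$ stand for two microphone signals, $S_{xy}(f)$ for their cross-spectral density, and $S_{xx}(f)$, $S_{yy}(f)$ for their auto-spectral densities.

```latex
C_{xy}(f) = \frac{\left| S_{xy}(f) \right|^{2}}{S_{xx}(f)\, S_{yy}(f)},
\qquad 0 \le C_{xy}(f) \le 1 .
```

Values near 1 indicate that the two channels are strongly linearly related at frequency $f$, as expected for a single coherent source such as a talker, while diffuse noise tends to give lower values.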