
    Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering

    Sound event detection (SED) methods typically rely on either strongly labelled data or weakly labelled data. As an alternative, sequentially labelled data (SLD) has been proposed: in SLD, the events in an audio clip and their order are known, but not their occurrence times. This paper proposes a connectionist temporal classification (CTC) based SED system that uses SLD instead of strongly labelled data, combined with a novel unsupervised clustering stage. Experiments on 41 classes of sound events show that the proposed two-stage method trained on SLD achieves performance comparable to the previous state-of-the-art SED system trained on strongly labelled data, and far exceeds another state-of-the-art SED system trained on weakly labelled data, demonstrating the effectiveness of the two-stage method trained on SLD without any onset/offset times of sound events.
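    The abstract gives no implementation; below is a minimal PyTorch sketch of the core idea of training a frame-level sound event classifier with a CTC criterion from sequential labels alone (event order, no onset/offset times). The network sizes, feature dimensions, and label layout are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Sketch: frame-level event classifier trained with CTC from
# sequential labels only (no onset/offset times). Sizes are illustrative.
num_classes = 41            # sound event classes; index 0 reserved for CTC blank
frames, batch = 240, 8      # illustrative clip length and batch size

model = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
head = nn.Linear(128, num_classes + 1)   # +1 for the blank symbol
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(batch, frames, 64)   # e.g. log-mel features
out, _ = model(feats)
logits = head(out).log_softmax(-1)       # (batch, frames, classes + 1)

# Sequential labels: the ordered event indices per clip.
targets = torch.randint(1, num_classes + 1, (batch, 5))
input_lens = torch.full((batch,), frames, dtype=torch.long)
target_lens = torch.full((batch,), 5, dtype=torch.long)

# CTCLoss expects (time, batch, classes) log-probabilities.
loss = ctc(logits.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()
```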

    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
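    As a concrete illustration of the dominant front-end the review discusses, here is a short log-mel extraction sketch using librosa; the sample rate, FFT size, and mel-band count are common defaults, not values prescribed by the article.

```python
import numpy as np
import librosa

# Log-mel spectrogram: one of the dominant feature representations
# in deep learning for audio. Parameter values are typical choices.
y, sr = librosa.load("clip.wav", sr=16000)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)   # (n_mels, frames)
```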

    Rule-embedded network for audio-visual voice activity detection in live musical video streams

    Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network that fuses the audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and to use visual representations as a mask that filters out the information of non-target sounds. Experiments show that: 1) with the help of cross-modal fusion by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; 2) the performance of the bi-modal model far outperforms that of audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.
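    A minimal sketch of the masking rule as the abstract describes it, with visual representations gating the audio representation; the module sizes and per-frame visual features are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Sketch of the fusion rule: visual representations act as a mask
# that suppresses non-target sound in the audio representation.
class RuleFusion(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.audio_enc = nn.Linear(64, dim)    # stands in for an audio CNN
        self.visual_enc = nn.Linear(512, dim)  # stands in for an image CNN
        self.head = nn.Linear(dim, 2)          # voice / no-voice per frame

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, frames, 64); visual_feats: (batch, frames, 512)
        a = torch.relu(self.audio_enc(audio_feats))
        mask = torch.sigmoid(self.visual_enc(visual_feats))  # values in (0, 1)
        return self.head(a * mask)  # visual mask filters non-target audio
```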

    Connectionist systems for image processing and anomaly detection

    Master's dissertation in Informatics Engineering (Dissertação de mestrado integrado em Engenharia Informática). Artificial Intelligence (AI) and Data Science (DS) have become increasingly present in our daily lives, and the benefits they have brought to society in recent years are remarkable. The success of AI was driven by the adaptive capacity that machines gained, and it is closely related to their ability to learn. Connectionist systems, presented in the form of Artificial Neural Networks (ANNs), which are inspired by the human nervous system, are one of the principal models that allow learning. These models are used in several areas, such as forecasting and classification problems, with increasingly satisfactory results. One area in which this technology has excelled is Computer Vision (CV), allowing, for example, the localisation of objects in images and their correct identification. Anomaly Detection (AD) is another field where ANNs have been emerging as a technology for problem solving. In each area, different architectures are used according to the type of data and the problem to be solved. Combining image processing and the detection of anomalies in this type of data, there is a convergence of methodologies using convolutional modules in architectures dedicated to AD. The main objective of this dissertation is to study the existing techniques in these domains, developing different model architectures and applying them to practical case studies in order to compare the results obtained with each approach. The major practical use case consists of monitoring road pavements using images to automatically identify degraded areas. For that, two software prototypes are proposed to gather and visualise the acquired data. The study of ANN architectures to diagnose the asphalt condition through images is the central focus of this work. The experimented methods for AD in images include a binary classifier network as a baseline, Autoencoders (AEs) and Variational Autoencoders (VAEs). Supervised and unsupervised practices are also compared, proving their utility in scenarios where no labelled data is available. In a supervised setting, the VAE model presents an excellent distinction between good and degraded pavement areas. When labelled data is not available, the best option is to use the AE and the distribution of similarities of good-pavement reconstructions to calculate the separation threshold, achieving accuracy and precision above 94%. The full development process shows it is possible to build an alternative solution that decreases operation costs relative to expensive commercial systems and improves usability compared with traditional solutions. Additionally, two case studies demonstrate the versatility of connectionist systems in solving problems, namely in mechanical structural design, enabling the modelling of displacement and pressure fields in reinforced plates, and in using AD to identify crowded places through crowd-sensing techniques.
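    A sketch of the unsupervised decision rule outlined above: fit a threshold from the distribution of reconstruction errors on good-pavement images, then flag patches above it. The `autoencoder` object and its `predict` method are placeholders for any trained AE (e.g. a Keras model), not the dissertation's code.

```python
import numpy as np

# Fit a threshold on reconstruction error over known-good images,
# assuming good-pavement inputs reconstruct with low error.
def fit_threshold(autoencoder, good_images, k=3.0):
    recon = autoencoder.predict(good_images)
    errors = np.mean((good_images - recon) ** 2, axis=(1, 2, 3))
    return errors.mean() + k * errors.std()   # k standard deviations above mean

# Flag an image as degraded when its error exceeds the fitted threshold.
def is_degraded(autoencoder, image, threshold):
    recon = autoencoder.predict(image[None])
    err = np.mean((image - recon) ** 2)
    return err > threshold
```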

    Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

    Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audio-visual information. To integrate information from the audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, and the representations are fused by attention based on semantic similarity, shaping the acoustic representations through the probability of anchor vocalization. Experiments show the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating that cross-modal information fusion based on semantic correlation is sensible and successful.
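    A minimal sketch of similarity-based attention in the spirit of the abstract: the audio representation is re-weighted by its semantic similarity to the image representation, approximating the probability of anchor vocalization. The shapes and the sigmoid weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Re-weight per-frame audio embeddings by their similarity to the
# anchor's image embedding, shaping the acoustic representation.
def similarity_fusion(audio_emb, image_emb):
    # audio_emb: (batch, frames, dim); image_emb: (batch, dim)
    sim = F.cosine_similarity(audio_emb, image_emb.unsqueeze(1), dim=-1)
    weights = torch.sigmoid(sim).unsqueeze(-1)   # proxy vocalization probability
    return audio_emb * weights                   # shaped acoustic features
```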

    Unsupervised categorisation and cross-classification in humans and rats

    This thesis examines how stimulus similarity structure and the statistical properties of the environment influence human and nonhuman animal categorisation. Two aspects of categorisation behaviour are explored: unsupervised (spontaneous) categorisation and stimulus cross-classification. In my General Introduction, I raise the issue of the respective roles of similarity and the classifier in determining categorisation behaviour. In Chapter 1, I review previous laboratory-based unsupervised categorisation research, which shows an overwhelming bias for unsupervised classification based on a single feature. Given the prominent role of overall similarity (family resemblance) in theories of human conceptual structure, I argue that this bias for unidimensional classification is likely an artefact. One factor producing this artefact, I suggest, is the bias that exists within the similarity structure of laboratory stimuli. Consequently, Chapter 2 examines whether it is possible to predict unidimensional versus multidimensional classification based solely on abstract similarity structure. Results show that abstract similarity structure commands a strong influence over participants' unsupervised classification behaviour (although not always in the manner predicted), and a bias for multidimensional unsupervised classification is reported. In Chapter 3, I examine unsupervised categorisation more broadly, investigating how stimulus similarity structure influences spontaneous classification in both humans and rats. In this way, evidence is sought for human-like spontaneous classification behaviour in rats. Results show that humans and rats exhibit qualitatively different patterns of behaviour following incidental stimulus exposure that should encourage spontaneous classification. In Chapter 4, I investigate whether rats exhibit another important aspect of human categorisation, namely, stimulus cross-classification. Results show that the statistical properties of the environment can engender such cognitively flexible behaviour in rats. Overall, the results of this thesis document the important influence of stimulus similarity structure and the statistical properties of the environment on human and nonhuman animal behaviour.

    Early word learning through communicative inference

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 109-122). How do children learn their first words? Do they do it by gradually accumulating information about the co-occurrence of words and their referents over time, or are words learned via quick social inferences linking what speakers are looking at, pointing to, and talking about? Both of these conceptions of early word learning are supported by empirical data. This thesis presents a computational and theoretical framework for unifying these two different ideas by suggesting that early word learning can best be described as a process of joint inferences about speakers' referential intentions and the meanings of words. Chapter 1 describes previous empirical and computational research on "statistical learning"--the ability of learners to use distributional patterns in their language input to learn about the elements and structure of language--and argues that capturing this ability requires models of learning that describe inferences over structured representations, not just simple statistics. Chapter 2 argues that social signals of speakers' intentions, even eye-gaze and pointing, are at best noisy markers of reference, and that in order to take full advantage of these signals, learners must integrate information across time. Chapter 3 describes the kinds of inferences that learners can make by assuming that speakers are informative with respect to their intended meaning, introducing and testing a formalization of how Grice's pragmatic maxims can be used for word learning. Chapter 4 presents a model of cross-situational intentional word learning that both learns words and infers speakers' referential intentions from labeled corpus data. By Michael C. Frank. Ph.D.
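    Chapter 4's model jointly infers speakers' referential intentions; as a toy illustration of the cross-situational idea alone, the sketch below accumulates word-referent co-occurrence counts across situations and reads off each word's most probable referent. It is a deliberately simplified stand-in, not the thesis model.

```python
from collections import defaultdict

# Toy cross-situational learner: count how often each word co-occurs
# with each candidate referent, then pick the most frequent pairing.
counts = defaultdict(lambda: defaultdict(int))

situations = [
    ({"ball", "dog"}, ["ball"]),   # (objects present, words uttered)
    ({"ball", "cup"}, ["ball"]),
    ({"dog", "cup"}, ["dog"]),
]
for objects, words in situations:
    for w in words:
        for o in objects:
            counts[w][o] += 1

meaning = {w: max(refs, key=refs.get) for w, refs in counts.items()}
print(meaning)   # {'ball': 'ball', 'dog': 'dog'}
```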

    Deep audio-visual speech recognition

    Decades of research in acoustic speech recognition have led to systems that we use in our everyday life. However, even the most advanced speech recognition systems fail in the presence of noise. The degraded performance can be compensated for by introducing visual speech information. However, Visual Speech Recognition (VSR) in naturalistic conditions is very challenging, in part due to the lack of architectures and annotations. This thesis contributes to the problem of Audio-Visual Speech Recognition (AVSR) from several directions. Firstly, we develop AVSR models for isolated words. In contrast to previous state-of-the-art methods, which consist of a two-step approach of feature extraction and recognition, we present an End-to-End (E2E) approach inside a deep neural network, which has led to significant improvements in audio-only, visual-only and audio-visual experiments. We further replace the Bi-directional Gated Recurrent Unit (BGRU) with Temporal Convolutional Networks (TCN) to greatly simplify the training procedure. Secondly, we extend our AVSR model to continuous speech by presenting a hybrid Connectionist Temporal Classification (CTC)/attention model that can be trained in an end-to-end manner. We then propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentation. Next, we present a self-supervised framework, Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech, and find that this pre-trained model can be leveraged for word-level and sentence-level lip-reading. We also investigate the influence of the Lombard effect in an end-to-end AVSR system; this is the first work to use end-to-end deep architectures and to present results on unseen speakers. We show that even if a relatively small amount of Lombard speech is added to the training set, performance in a real scenario, where noisy Lombard speech is present, can be significantly improved. Lastly, we propose a detection method against adversarial examples in an AVSR system, in which the strong correlation between the audio and visual streams is leveraged. The synchronisation confidence score is used as a proxy for audio-visual correlation, and based on it we can detect adversarial attacks. We apply recent adversarial attacks to two AVSR models, and the experimental results demonstrate that the proposed approach is an effective way of detecting such attacks.
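    A sketch of the detection idea in the final contribution: treat audio-visual synchronisation confidence as a proxy for stream correlation and flag inputs whose score falls below a threshold fit on clean data. The embedding inputs and threshold value are assumptions, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

# Synchronisation confidence as per-frame cosine similarity between
# time-aligned audio and visual embeddings, averaged over the clip.
def sync_confidence(audio_emb, visual_emb):
    # audio_emb, visual_emb: (batch, frames, dim) -> (batch,)
    return F.cosine_similarity(audio_emb, visual_emb, dim=-1).mean(dim=1)

# Adversarial perturbations tend to break the audio-visual correlation,
# lowering the synchronisation confidence below the clean-data threshold.
def is_adversarial(audio_emb, visual_emb, threshold=0.5):
    return sync_confidence(audio_emb, visual_emb) < threshold
```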