Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering
Sound event detection (SED) methods typically rely on either strongly
labelled data or weakly labelled data. As an alternative, sequentially labelled
data (SLD) was proposed. In SLD, the events and the order of events in audio
clips are known, without knowing the occurrence time of events. This paper
proposes a connectionist temporal classification (CTC) based SED system that
uses SLD instead of strongly labelled data, with a novel unsupervised
clustering stage. Experiments on 41 classes of sound events show that the
proposed two-stage method trained on SLD achieves performance comparable to a
previous state-of-the-art SED system trained on strongly labelled data, and far
exceeds a state-of-the-art SED system trained on weakly labelled data. This
indicates that the proposed two-stage method is effective when trained on SLD,
without any onset/offset times of sound events.
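CTC scores a label sequence by summing over all frame-level alignments that collapse to it, which is what lets sequential labels (order without timing) supervise the model. A minimal pure-Python sketch of the CTC forward (alpha) recursion on a toy two-frame example; the probabilities are illustrative, not from the paper:

```python
def ctc_prob(y, labels, blank=0):
    """Probability of `labels` under per-frame distributions y[t][k],
    summed over all CTC alignments that collapse to `labels`."""
    # Extend the label sequence with blanks: b, l1, b, l2, ..., b
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(y)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][blank]          # start with a blank...
    if S > 1:
        alpha[0][1] = y[0][ext[1]]     # ...or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                          # stay on symbol
            if s > 0:
                a += alpha[t - 1][s - 1]                 # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]                 # skip a blank
            alpha[t][s] = a * y[t][ext[s]]
    # A valid alignment ends on the last label or the trailing blank.
    return alpha[-1][-1] + (alpha[-1][-2] if S > 1 else 0.0)

# Toy example: vocabulary {0: blank, 1: event}, two frames.
y = [[0.4, 0.6],
     [0.5, 0.5]]
# Alignments for [1]: (1,1), (blank,1), (1,blank):
# 0.6*0.5 + 0.4*0.5 + 0.6*0.5 = 0.8 (up to float rounding)
print(ctc_prob(y, [1]))
```

The same recursion, run over a network's per-frame class posteriors, is what allows training from the event order alone.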
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
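The log-mel representation reviewed above warps the frequency axis with the mel scale before taking logs. A small sketch of the standard HTK-style conversion and filter-edge placement (the constants 2595 and 700 are the usual convention; filter shapes are omitted):

```python
import math

def hz_to_mel(f):
    """HTK-style mel scale: compresses high frequencies, as the ear does."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min, f_max):
    """Edge frequencies for triangular mel filters: uniformly spaced
    in mel, then mapped back to Hz."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + i * (hi - lo) / (n_bands + 1))
            for i in range(n_bands + 2)]

edges = mel_band_edges(40, 0.0, 8000.0)
# Edges are denser at low frequencies: the first gap is much
# smaller than the last one.
print(edges[1] - edges[0] < edges[-1] - edges[-2])  # True
```

A log-mel spectrogram then applies these triangular filters to an STFT magnitude and takes the logarithm of each band's energy.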
Rule-embedded network for audio-visual voice activity detection in live musical video streams
Detecting an anchor's voice in live musical streams is an important
preprocessing step for music and speech signal processing. Existing approaches
to voice activity detection (VAD) rely primarily on audio; however, audio-based
VAD struggles to focus on the target voice in noisy environments. With the help
of visual information, this paper proposes a rule-embedded network that fuses
the audio-visual (A-V) inputs to help the model better detect the target voice.
The core role of the rule in the model is to coordinate the relation between
the bi-modal information and to use visual representations as a mask that
filters out information from non-target sounds.
Experiments show that: 1) with the help of cross-modal fusion by the proposed
rule, the detection result of the A-V branch outperforms that of the audio
branch; 2) the performance of the bi-modal model far exceeds that of audio-only
models, indicating that incorporating both audio and visual signals is highly
beneficial for VAD. To attract more attention to cross-modal music and audio
signal processing, a new live musical video corpus with frame-level labels is
introduced.
Comment: Submitted to ICASSP 202
Connectionist systems for image processing and anomaly detection
Integrated master's dissertation (Mestrado Integrado) in Engenharia Informática.
Artificial Intelligence (AI) and Data Science (DS) have become increasingly present in our daily lives, and the
benefits they have brought to society in recent years are remarkable. The success of AI was driven by the adaptive
capacity that machines gained, and it is closely related to their ability to learn. Connectionist systems, presented
in the form of Artificial Neural Networks (ANNs), which are inspired by the human nervous system, are one of the
principal models that allow learning. These models are used in several areas, such as forecasting or classification
problems, presenting increasingly satisfactory results. One area in which this technology has excelled is Computer Vision (CV), allowing, for example, the location of objects in images and their correct identification. Anomaly
Detection (AD) is another field where ANNs have been emerging as a technology for problem solving. In each
area, different architectures are used according to the type of data and the problem to be solved. Combining image processing and anomaly detection in this type of data, there is a convergence of methodologies using
convolutional modules in architectures dedicated to AD. The main objective of this dissertation is to study the
existing techniques in these domains, developing different model architectures and applying them to practical
case studies in order to compare the results obtained with each approach. The major practical use case consists
of monitoring road pavements using images to automatically identify degraded areas. For that, two software
prototypes are proposed to gather and visualise the acquired data. Moreover, the study of ANN architectures
to diagnose the asphalt condition through images is the central focus of this work. The methods experimented
with for AD in images include a binary classifier network as a baseline, Autoencoders (AEs) and Variational
Autoencoders (VAEs). Supervised and unsupervised practices are also compared, proving their utility even in
scenarios where there is no labelled data available. Using the VAE model in a supervised setting, it presents an excellent
distinction between good and bad pavement areas. When labelled data is not available, the best option is to use
the AE and the distribution of similarities of good-pavement reconstructions to calculate the threshold, achieving
both accuracy and precision above 94%. The full development process shows that it is possible to build an
alternative solution that decreases operating costs relative to expensive commercial systems and improves
usability compared with traditional solutions. Additionally, two case studies demonstrate the versatility of
connectionist systems in solving problems: in mechanical structural design, enabling the modelling of displacement
and pressure fields in reinforced plates; and in using AD to identify crowded places through crowd-sensing techniques.
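The unsupervised AE thresholding described above can be sketched as: fit the autoencoder on good pavement only, then set the anomaly threshold from the distribution of its reconstruction errors. The mean + k·std rule and the constant k below are assumptions for illustration, not the dissertation's exact recipe:

```python
import statistics

def fit_threshold(errors_on_good, k=3.0):
    """Threshold from the reconstruction-error distribution of
    known-good samples: mean plus k standard deviations."""
    mu = statistics.mean(errors_on_good)
    sigma = statistics.stdev(errors_on_good)
    return mu + k * sigma

def is_anomalous(error, threshold):
    """A sample whose reconstruction error exceeds the threshold is
    flagged as degraded pavement."""
    return error > threshold

# Illustrative reconstruction errors (e.g. MSE) on good pavement images.
good_errors = [0.010, 0.012, 0.011, 0.009, 0.013, 0.010, 0.012, 0.011]
thr = fit_threshold(good_errors)
print(is_anomalous(0.011, thr))  # False: typical good pavement
print(is_anomalous(0.050, thr))  # True: degraded area reconstructs poorly
```

Because the AE only ever sees good pavement during training, degraded areas reconstruct poorly and land in the tail of this distribution, which is what makes the method usable without labels.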
Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
Many previous audio-visual voice-related works focus on speech, ignoring the
singing voice in the growing number of musical video streams on the Internet.
For processing diverse musical video data, voice activity detection is a
necessary step. This paper attempts to detect the speech and singing voices of
target performers in musical video streams using audio-visual information. To
integrate information from the audio and visual modalities, a multi-branch
network is proposed to learn audio and image representations, and these
representations are fused by attention based on semantic similarity, shaping
the acoustic representations through the probability of anchor vocalization.
Experiments show that the proposed audio-visual multi-branch network far
outperforms the audio-only model in challenging acoustic environments,
indicating that cross-modal information fusion based on semantic correlation is
sensible and successful.
Comment: Accepted by INTERSPEECH 202
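Attention based on semantic similarity can be sketched as dot-product weights between an audio query and visual keys, used to reweight the visual evidence that modulates the acoustic representation. This is a generic attention sketch under that reading, not the paper's exact network:

```python
import math

def softmax(xs):
    m = max(xs)                         # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_fusion(audio_query, visual_keys):
    """Weights visual frames by their dot-product similarity to the audio
    query; the weighted sum is the fused visual context."""
    scores = [sum(q * k for q, k in zip(audio_query, key))
              for key in visual_keys]
    weights = softmax(scores)
    fused = [sum(w * key[d] for w, key in zip(weights, visual_keys))
             for d in range(len(audio_query))]
    return weights, fused

audio_q = [1.0, 0.0]                   # toy 2-d audio embedding
visual = [[1.0, 0.0], [0.0, 1.0]]      # frame 0 matches the audio
weights, fused = attention_fusion(audio_q, visual)
print(weights[0] > weights[1])  # True: the semantically similar frame wins
```

The key property is that frames semantically correlated with the audio receive higher weight, so the fusion tracks the target performer rather than background sources.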
Unsupervised categorisation and cross-classification in humans and rats
This thesis examines how stimulus similarity structure and the statistical properties of the environment influence human and nonhuman animal categorisation. Two aspects of categorisation behaviour are explored: unsupervised (spontaneous) categorisation and stimulus cross-classification. In my General Introduction, I raise the issue of the respective roles of similarity and the classifier in determining categorisation behaviour. In Chapter 1, I review previous laboratory-based unsupervised categorisation research, which shows an overwhelming bias for unsupervised classification based on a single feature. Given the prominent role of overall similarity (family resemblance) in theories of human conceptual structure, I argue that this bias for unidimensional classification is likely an artefact. One factor in producing this artefact, I suggest, is the set of biases that exist within the similarity structure of laboratory stimuli. Consequently, Chapter 2 examines whether it is possible to predict unidimensional versus multidimensional classification based solely on abstract similarity structure. Results show that abstract similarity structure commands a strong influence over participants' unsupervised classification behaviour (although not always in the manner predicted), and a bias for multidimensional unsupervised classification is reported. In Chapter 3, I examine unsupervised categorisation more broadly, by investigating how stimulus similarity structure influences spontaneous classification in both humans and rats. In this way, evidence is sought for human-like spontaneous classification behaviour in rats. Results show that humans and rats exhibit qualitatively different patterns of behaviour following incidental stimulus exposure that should encourage spontaneous classification. In Chapter 4, I investigate whether rats exhibit another important aspect of human categorisation, namely stimulus cross-classification. Results show that the statistical properties of the environment can engender such cognitively flexible behaviour in rats. Overall, the results of this thesis document the important influence of stimulus similarity structure and the statistical properties of the environment on human and nonhuman animal behaviour.
Early word learning through communicative inference
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2010. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 109-122). How do children learn their first words? Do they do it by gradually accumulating information about the co-occurrence of words and their referents over time, or are words learned via quick social inferences linking what speakers are looking at, pointing to, and talking about? Both of these conceptions of early word learning are supported by empirical data. This thesis presents a computational and theoretical framework for unifying these two different ideas by suggesting that early word learning can best be described as a process of joint inference about speakers' referential intentions and the meanings of words. Chapter 1 describes previous empirical and computational research on "statistical learning" (the ability of learners to use distributional patterns in their language input to learn about the elements and structure of language) and argues that capturing this ability requires models of learning that describe inferences over structured representations, not just simple statistics. Chapter 2 argues that social signals of speakers' intentions, even eye-gaze and pointing, are at best noisy markers of reference, and that in order to take full advantage of these signals, learners must integrate information across time. Chapter 3 describes the kinds of inferences that learners can make by assuming that speakers are informative with respect to their intended meaning, introducing and testing a formalization of how Grice's pragmatic maxims can be used for word learning. Chapter 4 presents a model of cross-situational intentional word learning that both learns words and infers speakers' referential intentions from labeled corpus data. By Michael C. Frank. Ph.D.
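The cross-situational side of the contrast above can be illustrated with the simplest possible learner: accumulate word-referent co-occurrence counts across individually ambiguous scenes and take the argmax. This is a toy sketch of the statistical baseline, not the thesis's intentional-inference model:

```python
from collections import defaultdict

def cross_situational_learn(situations):
    """Each situation pairs the words heard with the objects present.
    Counts co-occurrences; the meaning guess for a word is its most
    frequent co-occurring object."""
    counts = defaultdict(lambda: defaultdict(int))
    for words, objects in situations:
        for w in words:
            for o in objects:
                counts[w][o] += 1
    return {w: max(objs, key=objs.get) for w, objs in counts.items()}

# Each scene is ambiguous on its own; across scenes, 'dog' co-occurs
# reliably only with DOG.
scenes = [
    (["dog", "ball"], ["DOG", "BALL"]),
    (["dog", "cup"],  ["DOG", "CUP"]),
    (["ball"],        ["BALL"]),
]
lexicon = cross_situational_learn(scenes)
print(lexicon["dog"])   # DOG
print(lexicon["ball"])  # BALL
```

The thesis's point is that such raw co-occurrence counting is not enough on its own: joint inference over the speaker's referential intention is what disambiguates scenes that counting alone cannot.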
Deep audio-visual speech recognition
Decades of research in acoustic speech recognition have led to systems that we use in our everyday life. However, even the most advanced speech recognition systems fail in the presence of noise. The degraded performance can be compensated by introducing visual speech information. However, Visual Speech Recognition (VSR) in naturalistic conditions is very challenging, in part due to the lack of architectures and annotations.
This thesis contributes towards the problem of Audio-Visual Speech Recognition (AVSR) from different aspects. Firstly, we develop AVSR models for isolated words. In contrast to previous state-of-the-art methods that consist of a two-step approach, feature extraction followed by recognition, we present an End-to-End (E2E) approach inside a deep neural network, which has led to significant improvements in audio-only, visual-only and audio-visual experiments. We further replace Bi-directional Gated Recurrent Units (BGRUs) with Temporal Convolutional Networks (TCNs) to greatly simplify the training procedure.
Secondly, we extend our AVSR model to continuous speech by presenting a hybrid Connectionist Temporal Classification (CTC)/attention model that can be trained in an end-to-end manner. We then propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentation.
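Hybrid CTC/attention training of this kind is commonly formulated as a convex combination of the two losses, with a weight balancing CTC's monotonic alignment constraint against the attention decoder's flexibility. The value of the weight below is illustrative, not the thesis's setting:

```python
def hybrid_loss(ctc_loss, attention_loss, lam=0.3):
    """Standard hybrid objective: L = lam * L_ctc + (1 - lam) * L_att."""
    assert 0.0 <= lam <= 1.0
    return lam * ctc_loss + (1.0 - lam) * attention_loss

# With lam = 0.3: 0.3 * 2.0 + 0.7 * 1.0 = 1.3
print(hybrid_loss(2.0, 1.0, lam=0.3))
```

In practice the two losses are computed from a shared encoder, so the CTC branch regularises the attention branch's alignments during training.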
Next, we present a self-supervised framework, Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech, and find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading.
We also investigate the influence of the Lombard effect in an end-to-end AVSR system; this is the first work to use end-to-end deep architectures for this problem and to present results on unseen speakers. We show that if even a relatively small amount of Lombard speech is added to the training set, performance in a real scenario, where noisy Lombard speech is present, can be significantly improved.
Lastly, we propose a method for detecting adversarial examples in an AVSR system that leverages the strong correlation between the audio and visual streams. The synchronisation confidence score serves as a proxy for audio-visual correlation, and based on it we can detect adversarial attacks. We apply recent adversarial attacks to two AVSR models, and the experimental results demonstrate that the proposed approach is an effective way to detect such attacks.
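The detection idea can be sketched with any audio-visual agreement score: measure how well the two feature streams co-vary and flag the clip when agreement collapses. Pearson correlation below is a stand-in for the synchronisation confidence score, and all feature values are illustrative:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length streams."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def looks_adversarial(audio_stream, visual_stream, threshold=0.5):
    """Flag the clip when audio and visual activity disagree: an attack
    perturbing one modality breaks their natural correlation."""
    return pearson(audio_stream, visual_stream) < threshold

clean_a  = [0.1, 0.9, 0.8, 0.2, 0.7]   # audio energy per frame
clean_v  = [0.2, 0.8, 0.9, 0.1, 0.6]   # mouth-opening per frame
attack_a = [0.9, 0.1, 0.2, 0.8, 0.1]   # perturbed audio, same video
print(looks_adversarial(clean_a, clean_v))   # False: streams agree
print(looks_adversarial(attack_a, clean_v))  # True: correlation collapses
```

The appeal of this defence is that the attacker must fool both modalities consistently, which is much harder than perturbing either stream alone.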