
    Polyphonic Sound Event Detection by using Capsule Neural Networks

    Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Deep Learning now offers valuable techniques for this goal, such as Convolutional Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture was recently introduced in the image processing field to overcome some known limitations of CNNs, specifically their scarce robustness to affine transformations (i.e., perspective, size, orientation) and to the detection of overlapped images. This motivated the authors to employ CapsNets for the polyphonic-SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit the capsule units to represent a set of distinctive properties for each individual sound event. Capsule units are connected through a so-called "dynamic routing" procedure that encourages learning part-whole relationships and improves detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing that the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to state-of-the-art algorithms.
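    As a sketch of the "dynamic routing" the abstract refers to, the following NumPy code implements the standard routing-by-agreement iteration between two capsule layers. The shapes, iteration count, and random inputs are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def squash(s, eps=1e-8):
    # Non-linearity that preserves direction and maps vector norms into [0, 1).
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: prediction vectors from the lower capsule layer,
    # shape (n_in, n_out, dim_out). Illustrative shapes only.
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                    # routing logits
    for _ in range(n_iters):
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)          # coupling coefficients (softmax)
        s = np.einsum('ij,ijd->jd', c, u_hat)      # weighted sum per output capsule
        v = squash(s)                              # output capsule vectors
        b = b + np.einsum('ijd,jd->ij', u_hat, v)  # raise logits where predictions agree
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.standard_normal((8, 4, 16)))
print(v.shape)  # (4, 16)
```

    The agreement update is what encourages part-whole relationships: a lower capsule routes more of its output to the higher capsule whose current output its prediction already matches.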

    MLPerf Inference Benchmark

    Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.
    Comment: ISCA 202
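    To illustrate the kind of measurement such a benchmark standardizes, here is a minimal latency-percentile harness in Python. It mimics the spirit of a single-stream inference scenario but is only a sketch, not MLPerf's official LoadGen; the dummy model, query, and query count are assumptions.

```python
import statistics
import time

def benchmark_latency(model_fn, query, n_queries=1000):
    # Time repeated single-query inferences and report tail latencies.
    # Simplified sketch of a single-stream scenario, not the MLPerf LoadGen.
    latencies = []
    for _ in range(n_queries):
        t0 = time.perf_counter()
        model_fn(query)
        latencies.append(time.perf_counter() - t0)
    latencies.sort()
    return {
        "p50_ms": 1e3 * statistics.median(latencies),
        "p90_ms": 1e3 * latencies[int(0.90 * n_queries)],
        "p99_ms": 1e3 * latencies[int(0.99 * n_queries)],
    }

# Dummy "model" standing in for a real inference call.
stats = benchmark_latency(lambda q: sum(x * x for x in q), list(range(256)))
print(sorted(stats))  # ['p50_ms', 'p90_ms', 'p99_ms']
```

    Reporting tail percentiles rather than a mean is the design choice that makes results comparable across wildly different systems: a data-center accelerator and an embedded device can both be characterized by their p99 latency under the same query stream.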

    Towards holistic scene understanding: Semantic segmentation and beyond

    This dissertation addresses visual scene understanding and enhances segmentation performance and generalization, the training efficiency of networks, and holistic understanding. First, we investigate semantic segmentation in the context of street scenes and train semantic segmentation networks on combinations of various datasets. In Chapter 2 we design a framework of hierarchical classifiers over a single convolutional backbone, and train it end-to-end on a combination of pixel-labeled datasets, improving generalizability and increasing the number of recognizable semantic concepts. Chapter 3 focuses on enriching semantic segmentation with weak supervision and proposes a weakly-supervised algorithm for training with bounding-box-level and image-level supervision instead of only per-pixel supervision. The memory and computational challenges that arise from simultaneous training on multiple datasets are addressed in Chapter 4. We propose two methodologies for selecting informative and diverse samples from datasets with weak supervision to reduce our networks' ecological footprint without sacrificing performance. Motivated by memory and computation efficiency requirements, in Chapter 5 we rethink simultaneous training on heterogeneous datasets and propose a universal semantic segmentation framework. This framework achieves consistent increases in performance metrics and semantic knowledgeability by exploiting various scene understanding datasets. Chapter 6 introduces the novel task of part-aware panoptic segmentation, which extends our reasoning towards holistic scene understanding. This task combines scene-level and part-level semantics with instance-level object detection. In conclusion, our contributions span convolutional network architectures, weakly-supervised learning, and part and panoptic segmentation, paving the way towards a holistic, rich, and sustainable visual scene understanding.
    Comment: PhD Thesis, Eindhoven University of Technology, October 202
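    The panoptic side of the final chapter combines per-pixel semantic labels with instance ids into one label map. A minimal sketch of that merge step follows, using the common class_id * 1000 + instance_id panoptic encoding; the function name, class ids, and inputs are hypothetical.

```python
import numpy as np

def merge_panoptic(semantic, instance, thing_classes):
    # semantic: per-pixel class ids; instance: per-pixel instance ids.
    # Encode (class, instance) jointly as class_id * 1000 + instance_id,
    # the encoding commonly used for panoptic label maps. Stuff classes
    # (those not in thing_classes) keep instance id 0.
    panoptic = semantic.astype(np.int64) * 1000
    is_thing = np.isin(semantic, list(thing_classes))
    panoptic[is_thing] += instance[is_thing]
    return panoptic

semantic = np.array([[1, 1], [2, 2]])   # 1 = road (stuff), 2 = car (thing)
instance = np.array([[0, 0], [5, 5]])   # instance ids for thing pixels
panoptic = merge_panoptic(semantic, instance, {2})
print(panoptic)
# [[1000 1000]
#  [2005 2005]]
```

    Part-aware panoptic segmentation extends exactly this joint labelling with a third, part-level layer of semantics per pixel.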

    Deep learning techniques for computer audition

    Automatically recognising audio signals plays a crucial role in the development of intelligent computer audition systems. In particular, audio signal classification, which aims to predict a label for an audio wave, has enabled many real-life applications. Considerable effort has been made to develop effective audio signal classification systems in the real world. However, several challenges in deep learning techniques for audio signal classification remain to be addressed. For instance, training a deep neural network (DNN) from scratch to extract high-level deep representations is time-consuming. Furthermore, DNNs have not been well explained, which hinders building trust between humans and machines and developing realistic intelligent systems. Moreover, most DNNs are vulnerable to adversarial attacks, resulting in many misclassifications. To deal with these challenges, this thesis proposes and presents a set of deep-learning-based approaches for audio signal classification. In particular, to tackle the challenge of extracting high-level deep representations, transfer learning frameworks, benefiting from models pre-trained on large-scale image datasets, are introduced to produce effective deep spectrum representations. Furthermore, attention mechanisms at both the frame level and the time-frequency level are proposed to explain the DNNs by estimating the contributions of each frame and each time-frequency bin, respectively, to the predictions. Likewise, the convolutional neural networks (CNNs) with an attention mechanism at the time-frequency level are extended to atrous CNNs with attention, aiming to explain the CNNs by visualising high-resolution attention tensors. Additionally, to interpret the CNNs evaluated on multi-device datasets, the atrous CNNs with attention are trained in conditional training frameworks. Moreover, to improve the robustness of the DNNs against adversarial attacks, models are trained in adversarial training frameworks. Besides, the transferability of adversarial attacks is enhanced by a lifelong learning framework. Finally, experiments conducted on various datasets demonstrate that the presented approaches are effective in addressing these challenges.
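    The time-frequency-level attention described above amounts to softmax-normalised contribution scores over spectrogram bins, which can then be visualised as an explanation. A minimal NumPy sketch, with a random weight matrix standing in for the learned scoring layer (all names and shapes are illustrative assumptions):

```python
import numpy as np

def tf_attention_pool(spec, w):
    # spec: log-mel spectrogram, shape (time, freq).
    # w: per-bin scoring weights of the same shape, standing in for the
    # learned attention layer. The softmax-normalised scores estimate how
    # much each time-frequency bin contributes to the clip-level prediction,
    # which is the quantity one would visualise to explain the network.
    scores = spec * w
    scores = np.exp(scores - scores.max())
    scores /= scores.sum()                  # softmax over all T-F bins
    pooled = float((scores * spec).sum())   # attention-weighted summary (scalar here)
    return pooled, scores

rng = np.random.default_rng(1)
spec = rng.standard_normal((64, 40))
pooled, attn = tf_attention_pool(spec, rng.standard_normal((64, 40)))
print(attn.shape, round(float(attn.sum()), 6))  # (64, 40) 1.0
```

    Because the scores sum to one over the whole spectrogram, plotting `attn` as a heatmap directly shows which time-frequency regions drove the prediction.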