1,472 research outputs found

    Selective deep convolutional neural network for low cost distorted image classification

    Get PDF
    Neural networks trained using images with a certain type of distortion should be better at classifying test images with the same type of distortion than generally-trained neural networks, given other factors being equal. Based on this observation, an ensemble of convolutional neural networks (CNNs) trained with different types and degrees of distortions is used. However, instead of simply classifying test images of unknown distortion types with the entire ensemble of CNNs, an extra tiny CNN is specifically trained to distinguish between the different types and degrees of distortions. Then, only the dedicated CNN for that specific type and degree of distortion, as determined by the tiny CNN, is activated and used to classify a possibly distorted test image. This proposed architecture, referred to as a \textit{selective deep convolutional neural network (DCNN)}, is implemented and found to result in high accuracy with low hardware costs. Detailed simulations with realistic image distortion scenarios using three popular datasets show that memory, MAC operations, and energy savings of up to 93.68%, 93.61%, and 91.92%, respectively, can be achieved with almost no reduction in image classification accuracy. The proposed selective DCNN scores up to 2.18x higher than the state-of-the-art DCNN model when evaluated using NetScore, a comprehensive metric that considers both CNN performance and hardware cost. In addition, it is shown that even higher hardware cost reduction can be achieved when selective DCNN is combined with previously proposed model compression techniques. Finally, experiments conducted with extended types and degrees of image distortion show that selective DCNN is highly scalable.11Ysciescopu

    On Using Backpropagation for Speech Texture Generation and Voice Conversion

    Full text link
    Inspired by recent work on neural network image generation which rely on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and on matching statistics of neuron activations between different source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or reconstruct an utterance in a different voice.Comment: Accepted to ICASSP 201

    An Empirical Evaluation of Deep Learning on Highway Driving

    Full text link
    Numerous groups have applied a variety of deep learning techniques to computer vision problems in highway perception scenarios. In this paper, we presented a number of empirical evaluations of recent deep learning advances. Computer vision, combined with deep learning, has the potential to bring about a relatively inexpensive, robust solution to autonomous driving. To prepare deep learning for industry uptake and practical applications, neural networks will require large data sets that represent all possible driving environments and scenarios. We collect a large data set of highway data and apply deep learning and computer vision algorithms to problems such as car and lane detection. We show how existing convolutional neural networks (CNNs) can be used to perform lane and vehicle detection while running at frame rates required for a real-time system. Our results lend credence to the hypothesis that deep learning holds promise for autonomous driving.Comment: Added a video for lane detectio

    Tree-Based Deep Mixture of Experts with Applications to Visual Saliency Prediction and Quality Robust Visual Recognition

    Get PDF
    abstract: Mixture of experts is a machine learning ensemble approach that consists of individual models that are trained to be ``experts'' on subsets of the data, and a gating network that provides weights to output a combination of the expert predictions. Mixture of experts models do not currently see wide use due to difficulty in training diverse experts and high computational requirements. This work presents modifications of the mixture of experts formulation that use domain knowledge to improve training, and incorporate parameter sharing among experts to reduce computational requirements. First, this work presents an application of mixture of experts models for quality robust visual recognition. First it is shown that human subjects outperform deep neural networks on classification of distorted images, and then propose a model, MixQualNet, that is more robust to distortions. The proposed model consists of ``experts'' that are trained on a particular type of image distortion. The final output of the model is a weighted sum of the expert models, where the weights are determined by a separate gating network. The proposed model also incorporates weight sharing to reduce the number of parameters, as well as increase performance. Second, an application of mixture of experts to predict visual saliency is presented. A computational saliency model attempts to predict where humans will look in an image. In the proposed model, each expert network is trained to predict saliency for a set of closely related images. The final saliency map is computed as a weighted mixture of the expert networks' outputs, with weights determined by a separate gating network. The proposed model achieves better performance than several other visual saliency models and a baseline non-mixture model. Finally, this work introduces a saliency model that is a weighted mixture of models trained for different levels of saliency. Levels of saliency include high saliency, which corresponds to regions where almost all subjects look, and low saliency, which corresponds to regions where some, but not all subjects look. The weighted mixture shows improved performance compared with baseline models because of the diversity of the individual model predictions.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201

    PIXOR: Real-time 3D Object Detection from Point Clouds

    Full text link
    We address the problem of real-time 3D object detection from point clouds in the context of autonomous driving. Computation speed is critical as detection is a necessary component for safety. Existing approaches are, however, expensive in computation due to high dimensionality of point clouds. We utilize the 3D data more efficiently by representing the scene from the Bird's Eye View (BEV), and propose PIXOR, a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixel-wise neural network predictions. The input representation, network architecture, and model optimization are especially designed to balance high accuracy and real-time efficiency. We validate PIXOR on two datasets: the KITTI BEV object detection benchmark, and a large-scale 3D vehicle detection benchmark. In both datasets we show that the proposed detector surpasses other state-of-the-art methods notably in terms of Average Precision (AP), while still runs at >28 FPS.Comment: Update of CVPR2018 paper: correct timing, fix typos, add acknowledgemen
    corecore