
    Approximating Memory-bound Applications on Mobile GPUs

    Accepted for the 2019 International Conference on High Performance Computing & Simulation (HPCS).

    Approximate computing techniques are often used to improve the performance of applications that can tolerate some amount of inaccuracy in their calculations or data. In the context of embedded and mobile systems, a broad range of applications have exploited approximation techniques to improve performance and overcome the limited capabilities of the hardware. On such systems, even small performance improvements can be enough to meet scheduling requirements such as hard real-time deadlines. We study the approximation of memory-bound applications on mobile GPUs using kernel perforation, an approximation technique that exploits the availability of fast GPU local memory to provide high performance with more accurate results. Using this technique, we approximated six applications and evaluated them on two mobile GPU architectures with very different memory layouts: a Qualcomm Adreno 506 and an ARM Mali T860 MP2. Results show that, even when local memory is not mapped to dedicated fast memory in hardware, kernel perforation still achieves a 1.25x speedup thanks to improved memory layout and caching effects. Mobile GPUs with dedicated local memory show a speedup of up to 1.38x.
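    To illustrate the technique, the sketch below perforates a 1-D stencil: it fetches only every other input element (standing in for a reduced global-memory transfer into local memory) and reconstructs the skipped elements by linear interpolation before computing. This is a minimal NumPy sketch, not the paper's GPU implementation; the box-filter workload, the stride of 2, and the interpolation scheme are illustrative assumptions.

        import numpy as np

        def stencil_exact(x, radius=2):
            # Reference result: a 1-D box filter, a typical memory-bound
            # kernel (many loads per output, little arithmetic).
            kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
            return np.convolve(x, kernel, mode="same")

        def stencil_perforated(x, radius=2, stride=2):
            # Kernel perforation: fetch only every `stride`-th input element,
            # reconstruct the skipped ones by linear interpolation, then run
            # the same stencil on the approximated input.
            fetched = np.arange(0, len(x), stride)
            approx = np.interp(np.arange(len(x)), fetched, x[fetched])
            kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
            return np.convolve(approx, kernel, mode="same")

        rng = np.random.default_rng(0)
        signal = np.cumsum(rng.standard_normal(10_000))  # smooth input tolerates perforation
        exact = stencil_exact(signal)
        approx = stencil_perforated(signal)
        print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))

    On a real device the reduced input fetch is what buys the speedup; the accuracy cost depends on how well the reconstruction matches the skipped data.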

    Parallel Multistage Wide Neural Network

    Deep learning networks have achieved great success in many areas, such as large-scale image processing. They usually need large amounts of computing resources and time, and they process easy and hard samples in the same way, which is inefficient. Another problem is that a network generally needs to be retrained to learn new incoming data. Efforts have been made to reduce computing requirements and realize incremental learning by adjusting architectures, such as scalable effort classifiers, multi-grained cascade forest (gc forest), conditional deep learning (CDL), tree CNN, decision tree structure with knowledge transfer (ERDK), and forest of decision trees with RBF networks and knowledge transfer (FDRK). In this paper, a parallel multistage wide neural network (PMWNN) is presented. It is composed of multiple stages that classify different parts of the data. First, a wide radial basis function (WRBF) network is designed to learn features efficiently in the wide direction. It can work on both vector and image instances, and it can be trained quickly in one epoch using subsampling and least squares (LS). Second, successive stages of WRBF networks are combined to make up the PMWNN. Each stage focuses on the misclassified samples of the previous stage. The network can stop growing at an early stage, and a stage can be added incrementally when new training data is acquired. Finally, the stages of the PMWNN can be tested in parallel, which speeds up the testing process. To sum up, the proposed PMWNN has the advantages of (1) fast training, (2) optimized computing resources, (3) incremental learning, and (4) parallel testing with stages. Experimental results on MNIST, several large hyperspectral remote sensing datasets, the CVL single-digit dataset, SVHN, and audio signal datasets show that the WRBF and PMWNN achieve competitive accuracy compared with learning models such as stacked autoencoders, deep belief nets, SVM, MLP, LeNet-5, RBF networks, the recently proposed CDL, broad learning, and gc forest. In fact, the PMWNN often has the best classification performance.
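    The training recipe (subsampled RBF centers, a closed-form least-squares output layer, and stages grown on the previous stage's errors) can be sketched compactly. The NumPy sketch below is an illustration under assumptions: the Gaussian kernel width, the number of centers, and the stage-stopping rule are hypothetical, and the paper's test-time combination of stage outputs is omitted.

        import numpy as np

        def train_wrbf(X, y, n_classes, n_centers=200, gamma=0.5, seed=0):
            # One-epoch WRBF training: subsample training points as RBF
            # centers, then solve the output weights in closed form (LS).
            rng = np.random.default_rng(seed)
            n_centers = min(n_centers, len(X))
            centers = X[rng.choice(len(X), n_centers, replace=False)]
            H = np.exp(-gamma * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
            T = np.eye(n_classes)[y]                   # one-hot targets
            W, *_ = np.linalg.lstsq(H, T, rcond=None)  # least-squares solve
            return centers, W

        def predict_wrbf(X, centers, W, gamma=0.5):
            H = np.exp(-gamma * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
            return (H @ W).argmax(1)

        def train_pmwnn(X, y, n_classes, n_stages=3, **kw):
            # Cascade of WRBF stages: each stage trains only on the samples
            # the previous stage misclassified; a stage can also be appended
            # later when new data arrives (incremental learning).
            stages = []
            for _ in range(n_stages):
                if len(X) == 0:
                    break
                centers, W = train_wrbf(X, y, n_classes, **kw)
                stages.append((centers, W))
                wrong = predict_wrbf(X, centers, W) != y
                X, y = X[wrong], y[wrong]
            return stages

    Because the trained stages are independent at test time, their activations can be computed in parallel and then combined, which is the source of the testing speedup the abstract claims.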