7 research outputs found

    The Multi-Lane Capsule Network (MLCN)

    Full text link
    We introduce Multi-Lane Capsule Networks (MLCN), a separable and resource-efficient organization of Capsule Networks (CapsNet) that allows parallel processing while achieving high accuracy at reduced cost. An MLCN is composed of a number of distinct parallel lanes, each contributing to a dimension of the result, trained using the routing-by-agreement organization of CapsNet. Our results indicate similar accuracy with a much reduced number of parameters on the Fashion-MNIST and CIFAR-10 datasets. They also indicate that MLCN outperforms the original CapsNet when using a proposed novel configuration for the lanes. MLCN also has faster training and inference times, being more than two-fold faster than the original CapsNet on the same accelerator.
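The lane structure described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: each hypothetical "lane" is an independent sub-network that produces one slice of the final capsule vector, so lanes share no parameters and can run in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_lane(in_dim, lane_dim):
    """A hypothetical lane: one random linear map followed by the CapsNet squash."""
    W = rng.standard_normal((in_dim, lane_dim)) * 0.1
    def lane(x):
        v = x @ W
        norm = np.linalg.norm(v)
        # Standard CapsNet squash nonlinearity: keeps direction, bounds length to < 1.
        return (norm**2 / (1 + norm**2)) * v / (norm + 1e-9)
    return lane

in_dim, lane_dim, n_lanes = 8, 4, 4
lanes = [make_lane(in_dim, lane_dim) for _ in range(n_lanes)]

x = rng.standard_normal(in_dim)
# Each lane contributes one group of dimensions; concatenating the lane
# outputs yields the full 16-dimensional capsule result.
output = np.concatenate([lane(x) for lane in lanes])
print(output.shape)  # (16,)
```

Because the lanes are disjoint, each `lane(x)` call could be dispatched to a separate device or stream, which is the source of the parallelism the abstract describes.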

    ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks

    Full text link
    The deployment of AI models on low-power, real-time edge devices requires accelerators for which energy, latency, and area are all first-order concerns. There are many approaches to enabling deep neural networks (DNNs) in this domain, including pruning, quantization, compression, and binary neural networks (BNNs), but with the emergence of the "extreme edge", there is now a demand for even more efficient models. In order to meet the constraints of ultra-low-energy devices, we propose ULEEN, a model architecture based on weightless neural networks. Weightless neural networks (WNNs) are a class of neural model which use table lookups, not arithmetic, to perform computation. The elimination of energy-intensive arithmetic operations makes WNNs theoretically well suited for edge inference; however, they have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by BNNs to make significant strides in improving accuracy and reducing model size. We compare FPGA and ASIC implementations of an inference accelerator for ULEEN against edge-optimized DNN and BNN devices. On a Xilinx Zynq Z-7045 FPGA, we demonstrate classification on the MNIST dataset at 14.3 million inferences per second (13 million inferences/Joule) with 0.21 μs latency and 96.2% accuracy, while Xilinx FINN achieves 12.3 million inferences per second (1.69 million inferences/Joule) with 0.31 μs latency and 95.83% accuracy. In a 45nm ASIC, we achieve 5.1 million inferences/Joule and 38.5 million inferences/second at 98.46% accuracy, while a quantized Bit Fusion model achieves 9230 inferences/Joule and 19,100 inferences/second at 99.35% accuracy. In our search for ever more efficient edge devices, ULEEN shows that WNNs are deserving of consideration. Comment: 14 pages, 14 figures. Portions of this article draw heavily from arXiv:2203.01479, most notably sections 5E and 5F.
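The "table lookups, not arithmetic" idea behind WNNs can be illustrated with a minimal WiSARD-style discriminator. This is a generic sketch of the weightless-network family, not ULEEN itself (the class names and tuple sizes here are assumptions): each "RAM" node memorizes the bit-tuples seen during training, and inference is a pure set-membership lookup with no multiplications.

```python
# Minimal WiSARD-style weightless classifier sketch (illustrative, not ULEEN).

def make_tuples(bits, tuple_size):
    """Split a binary input into fixed-size address tuples, one per RAM node."""
    return [tuple(bits[i:i + tuple_size]) for i in range(0, len(bits), tuple_size)]

class Discriminator:
    def __init__(self, n_rams):
        # Each RAM is a lookup table of addresses seen during training.
        self.rams = [set() for _ in range(n_rams)]

    def train(self, bits, tuple_size):
        for ram, t in zip(self.rams, make_tuples(bits, tuple_size)):
            ram.add(t)

    def score(self, bits, tuple_size):
        # Inference: count how many RAM lookups hit a memorized address.
        return sum(t in ram for ram, t in zip(self.rams, make_tuples(bits, tuple_size)))

TUPLE = 2
disc = Discriminator(n_rams=4)
disc.train([1, 0, 1, 1, 0, 0, 1, 0], TUPLE)          # memorize one pattern
print(disc.score([1, 0, 1, 1, 0, 0, 1, 0], TUPLE))   # 4 (all RAMs hit)
print(disc.score([0, 1, 1, 1, 0, 0, 1, 0], TUPLE))   # 3 (one tuple differs)
```

A full classifier would train one discriminator per class and predict the class whose discriminator scores highest; in hardware, each set becomes a small RAM addressed by the tuple bits, which is why the energy cost is dominated by memory reads rather than MACs.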

    Architecture Synthesis of High Performance Application-Specific Processors

    No full text
    Abstract: A new method to design Application-Specific Processors (ASPs) for computation-intensive scientific and/or embedded applications is presented. Target application areas include scientific and engineering programs and mission-oriented signal-processing systems requiring very high numerical computation and memory bandwidths. The application code, in a conventional HLL such as FORTRAN or C, is the input to the synthesis process. The latest powerful VLSI chips are used as the primitive building blocks for design implementation. The eventual performance of the application-specific processor in executing the application code is the primary goal of the synthesis task. Advanced code-scheduling techniques that go beyond basic-block boundaries are employed to achieve high performance via exploitation of fine-grain parallelism. The Application-Specific Processor Design (ASPD) method divides the task of designing a special-purpose processor architecture into Specification Optimization (behavioral) and Implementation Optimization (structural) phases. An architectural template resembling a scalable Very Long Instruction Word (VLIW) processor and a suite of compilation tools are used to generate an optimized processor specification. The designer quickly explores various cost-versus-performance tradeoff points by performing repeated compilation for scaled architectures. The powerful microcode compilation techniques of Percolation Scheduling and Enhanced Pipeline Scheduling extract and enhance parallelism in the application object code to generate highly parallelized code, which serves as the optimized specification for the architecture. Further performance/efficiency enhancement is obtained in Implementation Optimization by tailoring the implementation template to the execution requirements of the optimized processor specification. A scalable implementation template constrains the implementation style. Graph-coloring algorithms that exploit special graph characteristics are used to minimize the amount of hardware needed to support execution of the optimized application microcode without impairing code performance. Compilation techniques to allocate data over multiple memory banks are used to enhance concurrent access. The entire architecture synthesis procedure has been implemented and applied to numerous examples. Speedups in the range of 2.6 to 7.7 over contemporary RISC processors have been obtained. The computation times needed for the synthesis of these examples are on the order of a few seconds.
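The graph-coloring step mentioned above can be illustrated with a minimal greedy coloring of an interference graph. This is a generic sketch under assumed inputs, not the paper's specialized algorithm: values whose live ranges interfere receive different colors, and the number of colors used bounds the hardware resources (registers or functional units) needed.

```python
# Greedy graph coloring of an interference graph (illustrative sketch only).

def greedy_color(graph):
    """graph: dict node -> set of interfering nodes. Returns node -> color."""
    colors = {}
    # Color highest-degree nodes first, a common greedy heuristic.
    for node in sorted(graph, key=lambda n: -len(graph[n])):
        used = {colors[n] for n in graph[node] if n in colors}
        # Assign the smallest color not used by an already-colored neighbor.
        colors[node] = next(c for c in range(len(graph)) if c not in used)
    return colors

# Hypothetical interference graph: a, b, c mutually interfere; d only with c.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
coloring = greedy_color(g)
print(max(coloring.values()) + 1)  # 3 colors suffice for this graph
```

In the synthesis setting, each color corresponds to one physical resource, so minimizing colors minimizes hardware without changing the schedule, which matches the abstract's claim of reducing hardware "without impairing code performance".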

    PY-PITS : a scalable python runtime system for the computation of partially idempotent tasks

    No full text
    The popularization of multi-core architectures and cloud services has given users access to high-performance computing infrastructures. However, programming for these systems can be cumbersome due to challenges involving system failures, load balancing, and task scheduling. Aiming to solve these problems, we previously introduced SPITS, a programming model and reference architecture for executing bag-of-tasks applications. In this work, we discuss how this programming model allowed us to design and implement PY-PITS, a simple and effective open-source runtime system that is scalable, tolerates faults, and allows dynamic provisioning of resources during the computation of tasks. We also discuss how PY-PITS can be used to improve the utilization of multi-user computational clusters equipped with job-submission queues, and propose a performance model to help users understand when the performance of PY-PITS scales with the number of workers. Presented at the International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), Oct 26-28, 2016, Los Angeles. Supported by CAPES, FAPESP, and CNPq.
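The bag-of-tasks model with idempotent tasks can be sketched as follows. This is an illustrative toy, not the PY-PITS API (the function names and failure model are assumptions): because every task is idempotent and independent, a worker failure is handled by simply resubmitting the task to the pool.

```python
# Bag-of-tasks sketch with resubmission on failure (illustrative, not PY-PITS).

from concurrent.futures import ThreadPoolExecutor, as_completed
import random

def task(n):
    # Idempotent task: re-running it yields the same result,
    # so a failed attempt can safely be retried.
    if random.random() < 0.2:
        raise RuntimeError("simulated worker failure")
    return n * n

def run_bag(tasks, workers=4):
    results = {}
    pending = list(tasks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            futures = {pool.submit(task, n): n for n in pending}
            pending = []
            for fut in as_completed(futures):
                n = futures[fut]
                try:
                    results[n] = fut.result()
                except RuntimeError:
                    pending.append(n)  # resubmit the idempotent task
    return results

random.seed(1)
print(sorted(run_bag(range(5)).items()))  # [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
```

A real runtime like PY-PITS additionally handles distributed workers, dynamic resource provisioning, and result committing, but the retry-on-failure loop above is the essence of why idempotence makes fault tolerance cheap in this model.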

    A Unified Model for Accelerating Unsupervised Iterative Re-Ranking Algorithms

    No full text
    Despite the continuous advances in image retrieval technologies, performing effective and efficient content-based searches remains a challenging task. Unsupervised iterative re-ranking algorithms have emerged as a promising solution and have been widely used to improve the effectiveness of multimedia retrieval systems. Although substantially more efficient than related approaches based on diffusion processes, these re-ranking algorithms can still be computationally costly, demanding the specification and implementation of efficient big multimedia analysis approaches. Such demand, associated with the significant potential for parallelization and the highly effective results achieved by recently proposed re-ranking algorithms, creates the need for exploiting efficiency vs. effectiveness trade-offs. In this article, we introduce a class of unsupervised iterative re-ranking algorithms and present a model that can be used to guide their implementation and optimization for parallel architectures. We also analyze the impact of the parallelization on the performance of four algorithms that belong to the proposed class: Contextual Spaces, RL-Sim, Contextual Re-ranking, and Cartesian Product of Ranking References. The experiments show speedups that reach up to 6.0×, 16.1×, 3.3×, and 7.1× for each algorithm, respectively. These results demonstrate that the proposed parallel programming model can be successfully applied to various algorithms and used to improve the performance of multimedia retrieval systems.
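The shared structure that makes these algorithms parallelizable can be sketched with a toy contextual re-ranking loop. This is illustrative only; the update rule below (top-k Jaccard overlap of ranked lists) is an assumption standing in for the four algorithms' actual rules. The key property is that each item's update depends only on the previous iteration's ranked lists, so the per-item updates are independent and map cleanly onto parallel hardware.

```python
# Toy iterative re-ranking sketch (illustrative; not the paper's algorithms).

from concurrent.futures import ThreadPoolExecutor

def jaccard(a, b):
    return len(a & b) / len(a | b)

def update_item(i, ranked_lists, k):
    # Contextual rule (an assumption): items whose top-k neighborhoods
    # overlap more are considered closer.
    top_i = set(ranked_lists[i][:k])
    dists = {j: 1.0 - jaccard(top_i, set(ranked_lists[j][:k]))
             for j in range(len(ranked_lists))}
    return sorted(range(len(ranked_lists)), key=lambda j: dists[j])

def rerank(ranked_lists, k=3, iters=2):
    for _ in range(iters):
        with ThreadPoolExecutor() as pool:
            # All items read the same frozen lists, so updates run in parallel.
            ranked_lists = list(pool.map(
                lambda i: update_item(i, ranked_lists, k),
                range(len(ranked_lists))))
    return ranked_lists

initial = [[0, 1, 2, 3], [1, 0, 3, 2], [2, 3, 0, 1], [3, 2, 1, 0]]
print(rerank(initial)[0])
```

The double-buffered iteration (compute all new lists from the old ones, then swap) is what the unified model exploits: on a GPU, the per-item updates become one kernel launch per iteration.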

    A unified model for accelerating unsupervised iterative re‐ranking algorithms

    No full text
    Despite the continuous advances in image retrieval technologies, performing effective and efficient content‐based searches remains a challenging task. Unsupervised iterative re‐ranking algorithms have emerged as a promising solution and have been widely used to improve the effectiveness of multimedia retrieval systems. Although substantially more efficient than related approaches based on diffusion processes, these re‐ranking algorithms can still be computationally costly, demanding the specification and implementation of efficient big multimedia analysis approaches. Such demand associated with the significant potential for parallelization and highly effective results achieved by recently proposed re‐ranking algorithms creates the need for exploiting efficiency vs effectiveness trade‐offs. In this article, we introduce a class of unsupervised iterative re‐ranking algorithms and present a model that can be used to guide their implementation and optimization for parallel architectures. We also analyze the impact of the parallelization on the performance of four algorithms that belong to the proposed class: Contextual Spaces, RL‐Sim, Contextual Re‐ranking, and Cartesian Product of Ranking References. The experiments show speedups that reach up to 6.0×, 16.1×, 3.3×, and 7.1× for each algorithm, respectively. 
    These results demonstrate that the proposed parallel programming model can be successfully applied to various algorithms and used to improve the performance of multimedia retrieval systems. The authors thank AMD, FAEPEX, CAPES (grant #88881.145912/2017-01), FAPESP (grants #2018/15597-6, #2017/25908-6, #2014/12236-1, #2015/24494-8, #2016/50250-1, #2017/20945-0, and #2019/19312-9), the FAPESP-Microsoft Virtual Institute (grants #2013/50155-0, #2013/50169-1, and #2014/50715-9), and CNPq (grants #307560/2016-3, #484254/2012-0, #308194/2017-9, and #140653/2017-1) for the financial support.