81 research outputs found

    Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded Systems

    Deep Learning is increasingly being adopted by industry for computer vision applications running on embedded devices. While the accuracy of Convolutional Neural Networks (CNNs) has reached a mature and remarkable state, inference latency and throughput remain a major concern, especially when targeting low-cost and low-power embedded platforms. CNN inference latency may become a bottleneck for Deep Learning adoption by industry, as it is a crucial specification for many real-time processes. Furthermore, deployment of CNNs across heterogeneous platforms presents major compatibility issues due to vendor-specific technology and acceleration libraries. In this work, we present QS-DNN, a fully automatic search based on Reinforcement Learning which, combined with an inference engine optimizer, efficiently explores the design space and empirically finds the optimal combinations of libraries and primitives to speed up the inference of CNNs on heterogeneous embedded devices. We show that an optimized combination can achieve a 45x speedup in inference latency on CPU compared to a dependency-free baseline and 2x on average on GPGPU compared to the best vendor library. Further, we demonstrate that the quality of results and the time-to-solution are much better than with Random Search, with up to 15x better results for a short-time search.
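
    The search idea in the abstract above (pick one primitive per layer so that end-to-end latency is minimal) can be illustrated with a toy sketch. The abstract does not say which Reinforcement Learning algorithm is used, so tabular Q-learning is an assumption here. The library names, the latency table, and the switching penalty are all made up for illustration and do not come from the paper.

```python
import itertools
import random

# Hypothetical primitive libraries and per-layer latencies in ms (invented numbers).
LIBS = ("libA", "libB", "libC")
LATENCY = [
    {"libA": 4.0, "libB": 2.5, "libC": 3.0},
    {"libA": 1.0, "libB": 2.0, "libC": 1.5},
    {"libA": 3.5, "libB": 3.0, "libC": 1.0},
    {"libA": 2.0, "libB": 0.5, "libC": 2.5},
]
# Hypothetical penalty (ms) when consecutive layers use different libraries,
# e.g. for a data-layout conversion between them.
SWITCH_COST = 1.0


def total_latency(combo):
    """End-to-end latency of one library choice per layer."""
    cost = sum(LATENCY[i][lib] for i, lib in enumerate(combo))
    cost += SWITCH_COST * sum(a != b for a, b in zip(combo, combo[1:]))
    return cost


def q_search(episodes=500, alpha=0.5, epsilon=0.3, seed=0):
    """Epsilon-greedy tabular Q-learning; returns the best combination seen."""
    rng = random.Random(seed)
    q = {}  # (layer, previous_lib, lib) -> estimated negative remaining latency
    n = len(LATENCY)
    best_combo, best_cost = None, float("inf")
    for _ in range(episodes):
        # Roll out one episode: choose a library for each layer in order.
        combo, prev = [], None
        for layer in range(n):
            if rng.random() < epsilon:
                lib = rng.choice(LIBS)  # explore
            else:
                lib = max(LIBS, key=lambda l: q.get((layer, prev, l), 0.0))  # exploit
            combo.append(lib)
            prev = lib
        # Q-learning update for every transition of the episode.
        for layer, lib in enumerate(combo):
            prev = combo[layer - 1] if layer > 0 else None
            switched = prev is not None and prev != lib
            reward = -(LATENCY[layer][lib] + (SWITCH_COST if switched else 0.0))
            if layer + 1 < n:
                future = max(q.get((layer + 1, lib, l), 0.0) for l in LIBS)
            else:
                future = 0.0
            key = (layer, prev, lib)
            old = q.get(key, 0.0)
            q[key] = old + alpha * (reward + future - old)
        cost = total_latency(combo)
        if cost < best_cost:
            best_combo, best_cost = tuple(combo), cost
    return best_combo, best_cost


if __name__ == "__main__":
    combo, cost = q_search()
    print("best combination found:", combo, "latency:", cost)
```

    In a real setting the reward would come from measuring each candidate on the target device rather than from a lookup table; the table only keeps the sketch self-contained.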

    Shortest Path Distance in Manhattan Poisson Line Cox Process

    While the Euclidean distance characteristics of the Poisson line Cox process (PLCP) have been investigated in the literature, the analytical characterization of the path distances is still an open problem. In this paper, we solve this problem for the stationary Manhattan Poisson line Cox process (MPLCP), which is a variant of the PLCP. Specifically, we derive the exact cumulative distribution function (CDF) for the length of the shortest path, measured as path distance, to the nearest point of the MPLCP from two reference points: (i) the typical intersection of the Manhattan Poisson line process (MPLP), and (ii) the typical point of the MPLCP. We also discuss the application of these results in infrastructure planning, wireless communication, and transportation networks.

    Wafer-Scale Fast Fourier Transforms

    We have implemented fast Fourier transforms for one-, two-, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an n^3 problem with up to n^2 PEs. At this point a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs an FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. This high efficiency on fine-grain communication allows wsFFT to achieve unprecedented levels of parallelism and performance. We analyse in detail computation and communication time, as well as the weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for a 3D FFT of a 512^3 complex input array using a 512x512 subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.
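
    The pencil decomposition described in the abstract above can be sketched serially in plain Python. This is not the wsFFT implementation: there are no PEs or mesh communication here, the radix-2 recursion is simply a convenient 1D FFT chosen for the sketch, and all function names are hypothetical. It only shows that three passes of independent 1D FFTs over pencils, with a redistribution after each pass, reproduce the full 3D transform.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_last_axis(a):
    """One superstep: transform every pencil a[i][j][:] independently."""
    return [[fft1d(pencil) for pencil in plane] for plane in a]


def rotate_axes(a):
    """Redistribution: b[k][i][j] = a[i][j][k], so a new axis becomes the pencil axis."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for j in range(n1)] for i in range(n0)] for k in range(n2)]


def fft3d_pencils(a):
    """3D FFT as three supersteps of 1D pencil FFTs with a rotation after each.

    The third rotation only restores the original [i][j][k] index order.
    """
    for _ in range(3):
        a = rotate_axes(fft_last_axis(a))
    return a


def dft3d_direct(a):
    """Reference O(n^6) 3D DFT straight from the definition, for checking only."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    out = [[[0j] * n2 for _ in range(n1)] for _ in range(n0)]
    for u in range(n0):
        for v in range(n1):
            for w in range(n2):
                s = 0j
                for i in range(n0):
                    for j in range(n1):
                        for k in range(n2):
                            phase = u * i / n0 + v * j / n1 + w * k / n2
                            s += a[i][j][k] * cmath.exp(-2j * cmath.pi * phase)
                out[u][v][w] = s
    return out


if __name__ == "__main__":
    n = 4
    cube = [[[complex(i + 2 * j, k - i) for k in range(n)] for j in range(n)] for i in range(n)]
    fast = fft3d_pencils(cube)
    slow = dft3d_direct(cube)
    err = max(abs(fast[i][j][k] - slow[i][j][k])
              for i in range(n) for j in range(n) for k in range(n))
    print("max abs difference vs direct DFT:", err)
```

    In the parallel setting each call to `fft1d` inside a superstep would run on its own PE, and `rotate_axes` stands in for the all-to-all redistribution between supersteps.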

    System-Level Performance Analysis in 3D Drone Mobile Networks

    We present a system-level analysis for drone mobile networks in a finite three-dimensional (3D) space. A performance boundary derived from a deterministic random (Brownian) motion model over Nakagami-m fading interfering channels is developed. This method allows us to circumvent the extremely complex reality model and obtain the upper and lower performance bounds of actual drone mobile networks. The validity and advantages of the proposed framework are confirmed via extensive Monte-Carlo (MC) simulations. The results reveal several important trends and design guidelines for the practical deployment of drone mobile networks.