
    Performance evaluation over HW/SW co-design SoC memory transfers for a CNN accelerator

    Get PDF
    Many FPGA vendors have recently included embedded processors in their devices, as Xilinx has done with ARM Cortex-A cores, alongside the programmable logic cells. These devices are known as Programmable Systems on Chip (PSoC). The ARM cores, embedded in the processing system (PS), communicate with the programmable logic (PL) over ARM-standard AXI buses. In this paper we analyse the performance of exhaustive data transfers between the PS and the PL on a Xilinx Zynq FPGA in a real co-design scenario: a Convolutional Neural Network (CNN) accelerator that processes, in dedicated hardware, a stream of visual information from a neuromorphic visual sensor for classification. On the PS side, a Linux operating system collects visual events from the neuromorphic sensor into a normalized frame, transfers these frames to the multi-layer CNN accelerator, and reads back the results, using an AXI-DMA bus in a per-layer fashion. Because these accelerators try to process information as quickly as possible, data bandwidth becomes critical, and maintaining a well-balanced data throughput rate requires some care. We present and evaluate several data-partitioning techniques to improve the balance between RX and TX transfers, and two different ways of managing transfers: through a polling routine at the user level of the OS, and through a dedicated interrupt-based kernel-level driver. We demonstrate that, for sufficiently long packets, the kernel-level driver achieves better timing when computing a CNN classification example. The main advantages of the kernel-level driver are a safer solution and OS task scheduling that can manage other processes important to our application, such as frame collection from the sensors and their normalization.
    Ministerio de Economía y Competitividad TEC2016-77785-
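
    To make the two transfer-management strategies concrete, here is a minimal C sketch of both, assuming a memory-mapped AXI-DMA status register for the polling case and a character device backed by an interrupt-driven kernel driver for the other; all addresses, register offsets, bit positions, and device names (DMA_BASE, S2MM_DMASR, /dev/cnn_dma) are illustrative assumptions, not the paper's actual implementation.

        /* Hypothetical sketch: user-level polling on an AXI-DMA status register
         * versus a blocking read on an interrupt-driven kernel driver.
         * All addresses, offsets and device names are illustrative assumptions. */
        #include <fcntl.h>
        #include <stdint.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define DMA_BASE   0x40400000u  /* assumed AXI-DMA base address      */
        #define S2MM_DMASR 0x34         /* assumed stream-to-memory status   */
        #define DMA_IDLE   (1u << 1)    /* assumed "idle" (done) bit         */

        /* Option 1: user-level polling through /dev/mem. */
        static void wait_polling(void) {
            int fd = open("/dev/mem", O_RDWR | O_SYNC);
            volatile uint32_t *regs = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, DMA_BASE);
            while (!(regs[S2MM_DMASR / 4] & DMA_IDLE))
                ;                       /* burn CPU until the transfer ends  */
            munmap((void *)regs, 0x1000);
            close(fd);
        }

        /* Option 2: blocking read on a hypothetical interrupt-based driver;
         * the process sleeps until the driver's IRQ handler wakes it up. */
        static void wait_interrupt(uint8_t *buf, size_t len) {
            int fd = open("/dev/cnn_dma", O_RDONLY);  /* assumed device node */
            (void)read(fd, buf, len);   /* returns when the DMA IRQ fires    */
            close(fd);
        }

    The polling version keeps a core busy but avoids wake-up latency, while the driver version frees the CPU for other scheduled work such as frame collection and normalization, which matches the paper's observation that the kernel-level driver wins for long packets.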

    Run Time Approximation of Non-blocking Service Rates for Streaming Systems

    Full text link
    Stream processing is a compute paradigm that promises safe and efficient parallelism. Modern big-data problems are often well suited to stream processing's throughput-oriented nature. Realizing efficient stream processing requires monitoring and optimizing multiple communication links. Most techniques to optimize these links use queueing network models or network flow models, which require some idea of the actual execution rate of each independent compute kernel within the system. What we want to know is how fast each kernel can process data independent of the other kernels with which it communicates; within the queueing literature this is known as the kernel's "service rate". Current approaches to determining service rates are static. Modern workloads, however, are often dynamic, and shared cloud systems present applications with highly dynamic execution environments (multiple users, hardware migration, etc.). It is therefore desirable to continuously re-tune an application during run time (online) in response to changing conditions. Our approach enables online service rate monitoring under most conditions, obviating the need to rely on steady-state predictions for what are probably non-steady-state phenomena. First, we examine some of the difficulties associated with online service rate determination. Second, we describe an algorithm to approximate the online non-blocking service rate. Lastly, we implement the algorithm within the open-source RaftLib framework and validate it using a simple microbenchmark as well as two full streaming applications.
    Comment: technical report
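
    The key idea behind a non-blocking service rate is to measure only the time a kernel spends actually computing, excluding intervals where it is blocked on an empty input or a full output queue. The following C sketch illustrates that accounting under the assumption that the caller invokes the kernel step only when it can run without blocking; it is a generic illustration, not RaftLib's actual implementation.

        /* Minimal sketch of online non-blocking service-rate estimation:
         * accumulate only non-blocked compute time, then divide the number
         * of items served by that busy time. */
        #include <stdint.h>
        #include <time.h>

        typedef struct {
            uint64_t items_served;  /* completed work units                 */
            uint64_t busy_ns;       /* accumulated non-blocked compute time */
        } rate_monitor_t;

        static uint64_t now_ns(void) {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
        }

        /* Wrap one invocation of the kernel's compute step. The caller only
         * invokes this when input is available and output space exists, so
         * the measured interval reflects pure service time. */
        static void timed_service(rate_monitor_t *m, void (*kernel_step)(void)) {
            uint64_t start = now_ns();
            kernel_step();
            m->busy_ns += now_ns() - start;
            m->items_served++;
        }

        /* Items per second while non-blocked: the approximated service rate. */
        static double service_rate(const rate_monitor_t *m) {
            return m->busy_ns ? m->items_served * 1e9 / (double)m->busy_ns : 0.0;
        }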

    Cold Matter Assembled Atom-by-Atom

    Get PDF
    The realization of large-scale fully controllable quantum systems is an exciting frontier in modern physical science. We use atom-by-atom assembly to implement a novel platform for the deterministic preparation of regular arrays of individually controlled cold atoms. In our approach, a measurement-and-feedback procedure eliminates the entropy associated with probabilistic trap occupation and results in defect-free arrays of over 50 atoms in less than 400 ms. The technique is based on fast, real-time control of 100 optical tweezers, which we use to arrange atoms in desired geometric patterns and to maintain these configurations by replacing lost atoms with surplus atoms from a reservoir. This bottom-up approach enables controlled engineering of scalable many-body systems for quantum information processing, quantum simulations, and precision measurements.
    Comment: 12 pages, 9 figures, 3 movies as ancillary files
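
    The core of the measurement-and-feedback loop is a rearrangement step: an image reveals which tweezers were stochastically loaded, and surplus atoms are then moved to fill defects in the target pattern. The C sketch below shows that assignment logic under a simple greedy nearest-neighbour matching assumption; all names and the matching strategy are illustrative, not the authors' control code.

        /* Illustrative sketch of one feedback iteration: given which of the
         * N tweezers are loaded (measured from a fluorescence image), fill
         * each empty site in the target pattern with the nearest surplus
         * atom. */
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define N_TWEEZERS 100

        /* Placeholder for commanding the tweezers to drag one atom. */
        static void move_atom(int from, int to) {
            printf("move atom: trap %d -> trap %d\n", from, to);
        }

        static void rearrange(bool loaded[N_TWEEZERS],
                              const bool target[N_TWEEZERS]) {
            for (int t = 0; t < N_TWEEZERS; t++) {
                if (!target[t] || loaded[t])
                    continue;               /* site unused or already filled */
                int best = -1;
                for (int s = 0; s < N_TWEEZERS; s++)  /* nearest surplus atom */
                    if (loaded[s] && !target[s] &&
                        (best < 0 || abs(s - t) < abs(best - t)))
                        best = s;
                if (best >= 0) {
                    move_atom(best, t);
                    loaded[best] = false;
                    loaded[t] = true;
                }
            }
        }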

    Undermining User Privacy on Mobile Devices Using AI

    Full text link
    Over the past years, the literature has shown that attacks exploiting the microarchitecture of modern processors pose a serious threat to the privacy of mobile phone users. This is because applications leave distinct footprints in the processor, which malware can use to infer user activities. In this work, we show that these inference attacks become considerably more practical when combined with advanced AI techniques. In particular, we focus on profiling the activity in the last-level cache (LLC) of ARM processors. We employ a simple Prime+Probe based monitoring technique to obtain cache traces, which we classify with Deep Learning methods including Convolutional Neural Networks. We demonstrate our approach on an off-the-shelf Android phone by launching a successful attack from an unprivileged, zero-permission app in well under a minute. The app thereby detects running applications with an accuracy of 98% and reveals opened websites and streaming videos by monitoring the LLC for at most 6 seconds. This is possible because Deep Learning compensates for measurement disturbances stemming from the inherently noisy LLC monitoring and from unfavorable cache characteristics such as random line-replacement policies. In summary, our results show that, thanks to advanced AI techniques, inference attacks are becoming alarmingly easy to implement and execute in practice. This once more calls for countermeasures that confine microarchitectural leakage and protect mobile phone applications, especially those valuing the privacy of their users.
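
    For readers unfamiliar with Prime+Probe, the following C sketch shows the basic measurement primitive: fill (prime) one cache set by walking an eviction set, wait, then time (probe) the same walk; slow probes indicate the victim touched that set in between. Eviction-set construction, the associativity value, and the use of a portable timer in place of a cycle counter are assumptions, not the paper's exact attack code.

        /* Minimal Prime+Probe sketch for one monitored cache set. */
        #include <stdint.h>
        #include <time.h>

        #define WAYS 16                 /* assumed LLC associativity */

        static uint64_t now_ns(void) {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
        }

        /* ev[] holds WAYS addresses that all map to the monitored set. */
        static void prime(volatile uint8_t *ev[WAYS]) {
            for (int i = 0; i < WAYS; i++)
                (void)*ev[i];           /* load each line into the set */
        }

        static uint64_t probe(volatile uint8_t *ev[WAYS]) {
            uint64_t start = now_ns();
            for (int i = 0; i < WAYS; i++)
                (void)*ev[i];           /* misses reveal victim activity */
            return now_ns() - start;    /* one sample of the cache trace */
        }

    A sequence of such probe timings, sampled continuously, forms the cache trace that the paper's Deep Learning classifiers consume.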

    Memory and information processing in neuromorphic systems

    Full text link
    A striking difference between brain-inspired neuromorphic processors and current von Neumann processor architectures is the way in which memory and processing are organized. As Information and Communication Technologies continue to address the need for increased computational power by increasing the number of cores within a digital processor, neuromorphic engineers and scientists can complement this approach by building processor architectures in which memory is distributed with the processing. In this paper we present a survey of brain-inspired processor architectures that support models of cortical networks and deep neural networks. These architectures range from serial clocked implementations of multi-neuron systems to massively parallel asynchronous ones, and from purely digital systems to mixed analog/digital systems that implement more biologically realistic models of neurons and synapses, together with a suite of adaptation and learning mechanisms analogous to those found in biological nervous systems. We describe the advantages of the different approaches being pursued and present the challenges that need to be addressed for building artificial neural processing systems that can display the richness of behaviors seen in biological systems.
    Comment: Submitted to Proceedings of the IEEE; a review of recently proposed neuromorphic computing platforms and systems
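
    As one concrete example of the neuron models such architectures implement, here is a minimal discrete-time leaky integrate-and-fire update in C; the parameters are illustrative, and real neuromorphic chips realize this dynamics in circuits rather than software.

        /* Discrete-time leaky integrate-and-fire neuron: the membrane
         * potential decays toward rest, integrates input current, and
         * emits a spike (then resets) on crossing a threshold. */
        #include <stdbool.h>

        typedef struct {
            double v;         /* membrane potential         */
            double leak;      /* decay factor per time step */
            double v_thresh;  /* spike threshold            */
            double v_reset;   /* potential after a spike    */
        } lif_neuron_t;

        static bool lif_step(lif_neuron_t *n, double input_current) {
            n->v = n->leak * n->v + input_current;  /* leak + integrate */
            if (n->v >= n->v_thresh) {              /* fire             */
                n->v = n->v_reset;
                return true;
            }
            return false;
        }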

    Audiovisual preservation strategies, data models and value-chains

    No full text
    This is a report on preservation strategies, models and value-chains for digital file-based audiovisual content. The report includes: (a) current and emerging value-chains and business models for audiovisual preservation; (b) a comparison of preservation strategies for audiovisual content, including their strengths and weaknesses; and (c) a review of current preservation metadata models, and the requirements for extending them to support audiovisual files.

    An Empirical Implementation of an I/O Separation Scheme for Burst Buffers in High-Performance Computing Systems

    Get PDF
    Thesis (Master's), Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2019. Hyeonsang Eom. To meet the exascale I/O requirements of High-Performance Computing (HPC), a new I/O subsystem named Burst Buffer, based on non-volatile memory, has been developed. However, the diverse HPC workloads and their bursty I/O patterns cause severe data fragmentation on SSDs, which creates the need for expensive garbage collection (GC) and also increases the number of bytes actually written to the SSD. The new multi-stream feature in SSDs offers an option to reduce the cost of garbage collection. In this paper, we leverage this multi-stream feature to group I/O streams based on user IDs and implement this strategy in a burst buffer we call BIOS, short for Burst Buffer with an I/O Separation scheme. Furthermore, to optimize the I/O separation scheme in burst buffer environments, we propose a stream-aware scheduling policy based on burst buffer pools in the workload manager, and we implement a real burst buffer system, the BIOS framework, by integrating BIOS with the workload manager. We evaluate BIOS and the framework with burst buffer I/O traces from the Cori supercomputer covering a diverse set of applications. We also disclose and analyze the benefits and limitations of using an I/O separation scheme in HPC systems. Experimental results show that BIOS improves performance by 1.44× on average and reduces the Write Amplification Factor (WAF) by up to 1.20×, and that the framework retains the benefits of the I/O separation scheme in the HPC environment.
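
    The stream-grouping idea can be approximated on stock Linux, where per-file write-lifetime hints (fcntl F_SET_RW_HINT) are the kernel's interface to multi-stream SSDs. The C sketch below maps user IDs onto hint buckets as an illustrative stand-in for BIOS's actual stream-allocation code, which the thesis implements inside the burst buffer rather than this way.

        /* Sketch: separate I/O streams by user ID using Linux write-lifetime
         * hints, so that data written by different users lands in different
         * SSD streams. The UID-to-bucket mapping is an illustrative
         * assumption, not the thesis's implementation. */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdint.h>
        #include <sys/types.h>

        #ifndef F_SET_RW_HINT             /* for older libc headers */
        #define F_SET_RW_HINT 1036        /* F_LINUX_SPECIFIC_BASE + 12 */
        #define RWH_WRITE_LIFE_SHORT   2
        #define RWH_WRITE_LIFE_MEDIUM  3
        #define RWH_WRITE_LIFE_LONG    4
        #define RWH_WRITE_LIFE_EXTREME 5
        #endif

        /* Hash each user onto one of four lifetime buckets. */
        static int set_stream_by_uid(int fd, uid_t uid) {
            static const uint64_t hints[] = {
                RWH_WRITE_LIFE_SHORT, RWH_WRITE_LIFE_MEDIUM,
                RWH_WRITE_LIFE_LONG,  RWH_WRITE_LIFE_EXTREME,
            };
            uint64_t hint = hints[uid % 4];
            return fcntl(fd, F_SET_RW_HINT, &hint);  /* tags future writes */
        }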