3 research outputs found

    Evaluating a Cluster of Low-Power ARM64 Single-Board Computers with MapReduce

    With the meteoric rise of enormous data collection in science, industry, and the cloud, methods for processing massive datasets have become more crucial than ever. MapReduce is a restricted programming model for expressing parallel computations as simple serial functions, and an execution framework for distributing those computations over large datasets residing on clusters of commodity hardware. MapReduce abstracts away the challenging low-level synchronization and scalability details that parallel and distributed computing often necessitate, reducing the conceptual burden on programmers and scientists who require data processing at scale. Typically, MapReduce clusters are implemented using inexpensive commodity hardware, emphasizing quantity over quality because the MapReduce execution framework is fault-tolerant. The nascent explosion of inexpensive single-board computers designed around multi-core 64-bit ARM processors, such as the Raspberry Pi 3, Pine64, and Odroid C2, has opened new avenues for inexpensive and low-power cluster computing. In this thesis, we implement a novel cluster around low-power ARM64 single-board computers and the Disco Python MapReduce execution framework. We use MapReduce to empirically evaluate our cluster by solving the Word Count and Inverted Link Index problems for the Wikipedia article dataset. We benchmark our MapReduce solutions against local solutions of the same algorithms on a conventional low-power x86 platform. We show that our cluster outperforms the conventional platform for larger benchmarks, thus demonstrating low-power single-board computers as a viable avenue for data-intensive cluster computing.
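The Word Count benchmark named in this abstract can be illustrated with a minimal Python sketch. It does not use the Disco API; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the in-memory `shuffle` step stands in for the grouping that a MapReduce framework would perform between the two phases across cluster nodes.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key (assumed stand-in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce phase: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce serially over an iterable of lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

Because `map_words` and `reduce_counts` are plain serial functions with no shared state, a framework is free to run many copies of each on different nodes, which is the property the abstract attributes to the MapReduce model.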

    Towards Intelligent Data Acquisition Systems with Embedded Deep Learning on MPSoC

    Large-scale scientific experiments rely on dedicated high-performance data acquisition systems to sample, read out, analyse, and store experimental data. However, with the rapid development of detector technology in various fields, the number of channels and the data rate are increasing. For trigger and control tasks, data acquisition systems need to satisfy real-time constraints, achieve low latency, and provide the possibility to integrate intelligent data processing. In recent years, machine learning approaches have been used successfully in many applications. This dissertation studies how machine learning techniques can be integrated directly into the data acquisition of large-scale experiments. A universal data acquisition platform for multiple data channels has been developed. Different machine learning implementation methods and applications have been realized using this system. On the hardware side, recent FPGAs provide not only high-performance parallel logic but also a growing number of additional features, such as ultra-fast transceivers and embedded ARM processors. TSMC's 16nm FinFET Plus (16FF+) 3D transistor technology enables Xilinx to increase the performance/watt ratio of the Zynq UltraScale+ FPGA devices by 2 to 5 times compared to the previous generation. The selected main processor, the ZU11EG, has 32 GTH transceivers, each of which can operate at up to 16.3 Gb/s, and 16 GTY transceivers, each of which can operate at up to 32.75 Gb/s. These transceivers are routed to a x16-lane Gen 3/4 PCIe interface, a 12-lane full-duplex FireFly electrical/optical data link, and a VITA 57.4 FMC+ connector. The new Zynq UltraScale+ device provides at least three major advantages for advanced data acquisition systems: First, the 16nm FinFET+ programmable logic (PL) provides high-speed readout capabilities through high-speed transceivers; second, the built-in quad-core 64-bit ARM Cortex-A53 processor can host an embedded Linux system.
Thus, web servers, slow control, and monitoring applications can be realized in an embedded processor environment; third, the Zynq Multiprocessor System-on-Chip technology connects programmable logic and microprocessors. In this thesis, the benefits of such architectures for the integration of machine learning algorithms in data acquisition systems and control applications are demonstrated. On the algorithm side, there have been many achievements in the field of machine learning over the last decades. Existing machine learning algorithms fall into several categories depending on how the learning phase is organized: Supervised Learning, Unsupervised Learning, Semi-Supervised Learning and Reinforcement Learning. Supervised learning and reinforcement learning are the most commonly used in scientific applications. Supervised learning learns from labelled input-output pairs and generates a function that maps new inputs to the appropriate outputs. A common application is classification. The two approaches differ widely in their mathematical foundations, training, inference, and implementation. One natural solution is Application Specific Integrated Circuit (ASIC) Artificial Intelligence (AI) chips. A typical example is the Google Tensor Processing Unit (TPU), which can cover training and inference for both supervised learning and reinforcement learning. One major issue is that such chips provide high compute power but not high data-transfer bandwidth. By comparison, the Xilinx UltraScale+ FPGA can also provide raw compute power and efficiency for all data types, down to a single bit. From a deployment point of view, the training part of supervised learning is typically performed by CPU/GPU/TPU on a fixed dataset. For reinforcement learning, the training phase is more complex. The algorithm needs to periodically interact with the controlled system and execute a Markov Decision Process (MDP).
There is no static training dataset; instead, the training data is obtained in real time. The time slot between each step depends on the dynamics of the controlled system. Inference is also bound to this sampling time, because the algorithm needs to interact with the environment and decide on the appropriate action in response, which imposes stricter timing requirements. This thesis gives solutions for both training and inference of reinforcement learning. First, the requirements are analyzed; then the algorithm is derived from scratch and training is implemented on the PS part of the Zynq device, while inference is proposed on the FPGA side, a solution similar to the one used for supervised learning. The results for Policy Gradient show a considerable improvement over a CPU/GPU-based machine learning framework. The Deep Deterministic Policy Gradient also shows improvement in both training latency and stability. This implementation method provides a low-latency approach for in-field training of reinforcement learning.
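The interaction loop described in this abstract, in which training data arrives step by step from the controlled system rather than from a fixed dataset, can be sketched with a basic policy-gradient (REINFORCE-style) update. This is not the thesis implementation: the environment is a hypothetical two-action, single-state problem, and `train_bandit`, `reward_probs`, the learning rate, and the running-average baseline are all assumptions chosen for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(reward_probs, steps=2000, lr=0.1, seed=0):
    """Policy-gradient training on a single-state problem (assumed toy setup).

    reward_probs[a] is the probability that action a yields reward 1.
    Returns the learned action probabilities.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(reward_probs)  # policy parameters, one per action
    baseline = 0.0                     # running average reward
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        # Interact with the environment: sample an action, observe a reward.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Gradient of log pi(action) w.r.t. prefs[a] is 1[a == action] - probs[a].
        advantage = reward - baseline
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            prefs[a] += lr * advantage * (indicator - probs[a])
        # Update the baseline after it has been used for this step.
        baseline += (reward - baseline) / t
    return softmax(prefs)


if __name__ == "__main__":
    print(train_bandit([0.2, 0.8]))
```

Each loop iteration corresponds to one sampling interval of the controlled system: the policy must produce an action and absorb the resulting reward before the next step, which is why the abstract ties both training and inference latency to the system dynamics.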