72 research outputs found
Harnessing single board computers for military data analytics
Executive summary: This chapter covers the use of Single Board Computers (SBCs) to expedite onsite data analytics for a variety of military applications. Onsite data summarization and analytics is increasingly critical for command, control, and intelligence (C2I) operations, as excessive power consumption and communication latency can restrict the efficacy of down-range operations. SBCs offer power-efficient, inexpensive data-processing capabilities while maintaining a small form factor. We discuss the use of SBCs in a variety of domains, including wireless sensor networks, unmanned vehicles, and cluster computing. We conclude with a discussion of existing challenges and opportunities for future use.
Software Defined Multi-Spectral Imaging for Arctic Sensor Networks
Availability of off-the-shelf infrared sensors combined with high definition visible cameras has made possible the construction of a Software Defined Multi-Spectral Imager (SDMSI) combining long-wave, near-infrared and visible imaging. The SDMSI requires a real-time embedded processor to fuse images and to create real-time depth maps for opportunistic uplink in sensor networks. Researchers at Embry Riddle Aeronautical University working with University of Alaska Anchorage at the Arctic Domain Awareness Center and the University of Colorado Boulder have built several versions of a low-cost drop-in-place SDMSI to test alternatives for power efficient image fusion. The SDMSI is intended for use in field applications including marine security, search and rescue operations and environmental surveys in the Arctic region. Based on Arctic marine sensor network mission goals, the team has designed the SDMSI to include features to rank images based on saliency and to provide on camera fusion and depth mapping. A major challenge has been the design of the camera computing system to operate within a 10 to 20 Watt power budget. This paper presents a power analysis of three options: 1) multi-core, 2) field programmable gate array with multi-core, and 3) graphics processing units with multi-core. For each test, power consumed for common fusion workloads has been measured at a range of frame rates and resolutions. Detailed analyses from our power efficiency comparison for workloads specific to stereo depth mapping and sensor fusion are summarized. Preliminary mission feasibility results from testing with off-the-shelf long-wave infrared and visible cameras in Alaska and Arizona are also summarized to demonstrate the value of the SDMSI for applications such as ice tracking, ocean color, soil moisture, animal and marine vessel detection and tracking. 
The goal is to select the most power-efficient solution for the SDMSI for use on UAVs (Unoccupied Aerial Vehicles) and other drop-in-place installations in the Arctic. The prototype selected will be field tested in Alaska in the summer of 2016.
Energy reconstruction on the LHC ATLAS TileCal upgraded front end: feasibility study for a sROD co-processing unit
Dissertation presented in fulfilment of the requirements for the degree of:
Master of Science in Physics
2016
The Phase-II upgrade of the Large Hadron Collider at CERN in the early 2020s
will enable an order of magnitude increase in the data produced, unlocking the
potential for new physics discoveries. In the ATLAS detector, the upgraded Hadronic
Tile Calorimeter (TileCal) Phase-II front end read out system is currently being
prototyped to handle a total data throughput of 5.1 TB/s, from the current 20.4 GB/s.
The FPGA based Super Read Out Driver (sROD) prototype must perform an energy
reconstruction algorithm on 2.88 GB/s raw data, or 275 million events per second.
Due to the very high level of proficiency required and time consuming nature of
FPGA firmware development, it may be more effective to implement certain complex
energy reconstruction and monitoring algorithms on a general purpose, CPU based
sROD co-processor. Hence, the feasibility of a general purpose ARM System on Chip
based co-processing unit (PU) for the sROD is determined in this work.
A PCI-Express test platform was designed and constructed to link two ARM
Cortex-A9 SoCs via their PCI-Express Gen-2 x1 interfaces. Test results indicate that
the latency of the PCI-Express interface is sufficiently low and the data throughput is
superior to that of alternative interfaces such as Ethernet, for use as an interconnect
for the SoCs to the sROD. CPU performance benchmarks were performed on five ARM
development platforms to determine the CPU integer, floating point and memory
system performance as well as energy efficiency. To complement the benchmarks,
Fast Fourier Transform and Optimal Filtering (OF) applications were also tested.
Based on the test results, in order for the PU to process 275 million events per
second with OF, within the 6 µs timing budget of the ATLAS triggering system, a
cluster of three Tegra-K1, Cortex-A15 SoCs connected to the sROD via a Gen-2 x8
PCI-Express interface would be suitable. A high level design for the PU is proposed
which surpasses the requirements for the sROD co-processor and can also be used
in a general purpose, high data throughput system, with 80 Gb/s Ethernet and
15 GB/s PCI-Express throughput, using four X-Gene SoCs.
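As a rough check on the figures quoted in this abstract, the following minimal sketch works out the implied throughput increase and the average data volume per event. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), which the abstract does not state, and the variable names are illustrative only.

```python
# Figures quoted in the abstract. Decimal units (1 GB = 1e9 bytes) are an
# assumption; the abstract does not say which convention it uses.
GB = 1e9
TB = 1e12

current_throughput = 20.4 * GB    # present TileCal read-out, bytes/s
upgraded_throughput = 5.1 * TB    # Phase-II target, bytes/s
srod_input = 2.88 * GB            # raw data the sROD prototype must process, bytes/s
event_rate = 275e6                # events per second

# Ratio of the upgraded to the current total throughput.
throughput_ratio = upgraded_throughput / current_throughput

# Average raw-data size of one event at the sROD input.
bytes_per_event = srod_input / event_rate

# Mean interval between successive events, in nanoseconds.
ns_per_event = 1e9 / event_rate

print(f"throughput increase: {throughput_ratio:.0f}x")
print(f"bytes per event:     {bytes_per_event:.2f}")
print(f"ns between events:   {ns_per_event:.2f}")
```

Under these assumptions the upgrade is a roughly 250-fold increase in total throughput, and each event carries about 10.5 bytes of raw data on average.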
Sensor data collection and performance evaluation using a TK1 board
Monitoring applications are abundant in today's world. Our goal is to monitor an individual and his neighborhood using wearable sensors. The system is smart in the sense that it can process the captured data in near real time and communicate opportunistically with other such systems as well as smartphones and computers. We develop the hardware platform using existing components to support such functionality. The Nvidia Jetson Tegra Kepler (TK) board is used as the processor, as it is one of the most powerful processors for embedded applications and has the flexibility to connect to a plethora of sensors. Data transfer for communication is facilitated via Bluetooth and Wireless Fidelity (Wi-Fi). Results on the performance of this setup are reported from experiments with different sensors, such as cameras, a microphone, a gas sensor, a temperature/pressure/humidity sensor, and a Garmin smart health watch that determines heart rate, distance, speed, altitude, and position (latitude and longitude), using metrics such as read/write speed, heat generated by the Central Processing Unit (CPU) and the TK board, and transmission delay.
Evaluation of clusters based on systems on a chip for high-performance computing: a review (Avaliação de clusters baseados em sistemas em um chip para a computação de alto desempenho: uma revisão)
High-performance computing systems are the maximum expression in the field of processing for large amounts of data. However, their energy consumption is an aspect of great importance that was not considered in past decades. Hence, software developers and hardware providers are obliged to take on new challenges to address energy consumption and costs. Constructing a computational cluster with a large number of systems on a chip can result in a powerful, ecological platform with the capacity to offer sufficient performance for different applications, as long as low costs and minimum energy consumption can be maintained. As a result, energy-efficient hardware has an opportunity to make an impact on the area of high-performance computing. This article presents a systematic review of the evaluations conducted on clusters of Systems on a Chip for high-performance computing in the research setting.
On the use of heterogenous computing in high-energy particle physics at the ATLAS detector
A dissertation submitted in fulfillment of the requirements
for the degree of Master of Physics
in the
School of Physics
November 1, 2017.
The ATLAS detector at the Large Hadron Collider (LHC) at CERN is
undergoing upgrades to its instrumentation, as well as the hardware and
software that comprise its Trigger and Data Acquisition (TDAQ) system.
The increased energy will yield larger cross sections for interesting physics
processes, but will also lead to increased artifacts in on-line reconstruction
in the trigger, as well as increased trigger rates, beyond the current system’s
capabilities. To meet these demands it is likely that the massive parallelism
of General-Purpose Programming with Graphic Processing Units (GPGPU)
will be utilised. This dissertation addresses the problem of integrating GPGPU
into the existing Trigger and TDAQ platforms; detailing and analysing
GPGPU performance in the context of performing in a high-throughput,
on-line environment like ATLAS. Preliminary tests show low to moderate
speed-up with GPU relative to CPU, indicating that to achieve a more significant
performance increase it may be necessary to alter the current platform
beyond pairing suitable GPUs to CPUs in an optimum ratio. Possible
solutions are proposed and recommendations for future work are given.
Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II
The High Performance Computing (HPC) community recognizes energy
consumption as a major problem. Extensive research is underway to
identify means to increase energy efficiency of HPC systems
including consideration of alternative
building blocks for future systems. This thesis considers one
such system, the Texas Instruments Keystone II, a heterogeneous
Low-Power System-on-Chip (LPSoC) processor that combines a quad
core ARM CPU with an octa-core Digital Signal Processor (DSP). It
was first released in 2012.
Four issues are considered: i) maximizing the Keystone II ARM CPU
performance; ii) implementation and extension of the OpenMP
programming model for the Keystone II; iii) simultaneous use of
ARM and DSP cores across multiple Keystone SoCs; and iv) an
energy model for applications running on LPSoCs like the Keystone
II and heterogeneous systems in general.
Maximizing the performance of the ARM CPU on the Keystone II
system is fundamental to adoption of this system by the HPC
community and of the ARM architecture more broadly. Key to
achieving good performance is exploitation of the ARM vector
instructions. This thesis presents the first detailed comparison
of the use of ARM compiler intrinsic functions with automatic
compiler vectorization across four generations of ARM processors.
Comparisons are also made with x86 based platforms and the use of
equivalent Intel vector instructions.
Implementation of the OpenMP programming model on the Keystone II
system presents both challenges and opportunities. Challenges in
that the OpenMP model was originally developed for a homogeneous
programming environment with a common instruction set
architecture, and in 2012 work had only just begun to consider
how OpenMP might work with accelerators. Opportunities in that
shared memory is accessible to all processing elements on the
LPSoC, offering performance advantages over what typically exists
with attached accelerators. This thesis presents an analysis of a
prototype version of OpenMP implemented as a bare-metal runtime
on the DSP of a Keystone I system. An implementation for the
Keystone II that maps OpenMP 4.0 accelerator directives to OpenCL
runtime library operations is presented and evaluated.
Exploitation of some of the underlying hardware features of the
Keystone II is also discussed.
Simultaneous use of the ARM and DSP cores across multiple
Keystone II boards is fundamental to the creation of commercially
viable HPC offerings based on Keystone technology. The nCore
BrownDwarf and HPE Moonshot systems represent two such systems.
This thesis presents a proof-of-concept implementation of matrix
multiplication (GEMM) for the BrownDwarf system. The BrownDwarf
utilizes both Keystone II and Keystone I SoCs through a
point-to-point interconnect called Hyperlink. Details of how a
novel message passing communication framework across Hyperlink
was implemented to support this complex environment are
provided.
An energy model that can be used to predict energy usage as a
function of what fraction of a particular computation is
performed on each of the available compute devices offers the
opportunity for making runtime decisions on how best to minimize
energy usage. This thesis presents a basic energy usage model
that considers rates of executions on each device and their
active and idle power usages. Using this model, it is shown that
only under certain conditions does there exist an energy-optimal
work partition that uses multiple compute devices. To validate
the model a high resolution energy measurement environment is
developed and used to gather energy measurements for a matrix
multiplication benchmark running on a variety of systems. Results
presented support the model.
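The kind of model described in the preceding paragraph can be illustrated with a minimal sketch for two devices. The exact formulation used in the thesis is not given in this abstract, so the form below is an assumption: each device draws its active power while it is busy and its idle power while it waits for the other device to finish. The function names and all numeric values are hypothetical.

```python
def energy(f, work, rate, p_active, p_idle):
    """Energy in joules when fraction f of `work` runs on device 0
    and the remaining 1 - f runs on device 1 (assumed model)."""
    shares = (f, 1.0 - f)
    # Time each device spends computing its share.
    busy = [share * work / r for share, r in zip(shares, rate)]
    # The run ends when the slower device finishes.
    total = max(busy)
    # Active power while busy, idle power while waiting for the other device.
    return sum(pa * b + pi * (total - b)
               for pa, pi, b in zip(p_active, p_idle, busy))


def best_fraction(work, rate, p_active, p_idle, steps=1000):
    """Grid search for the fraction on device 0 that minimizes energy."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda f: energy(f, work, rate, p_active, p_idle))


# Hypothetical case A: identical devices with high idle power.
# Sharing the work shortens the run, so a split partition is best.
split = best_fraction(100, (1.0, 1.0), (10.0, 10.0), (8.0, 8.0))

# Hypothetical case B: no idle power, device 0 is more power efficient.
# Running everything on device 0 is best; a second device does not help.
single = best_fraction(100, (1.0, 1.0), (10.0, 20.0), (0.0, 0.0))

print(split, single)
```

In this sketch the energy-optimal partition uses both devices only in case A, where idle power is significant. That is consistent with the abstract's statement that a multi-device optimum exists only under certain conditions, though the conditions derived in the thesis itself may differ.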
Drawing on the four issues noted above and other developments
that have occurred since the Keystone II system was first
announced, the thesis concludes by making comments regarding the
future of LPSoCs as building blocks for HPC systems.
- …