Search CORE

27,206 research outputs found

Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture

Author: Chen Langshi
Jiang Lei
Qiu Judy
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 24/02/2019
Field of study

Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of terascale integration. Among emerging killer applications, parallel graph processing has been a critical technique to analyze connected data. In this paper, we empirically evaluate various computing platforms including an Intel Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor codenamed Knights Landing (KNL) in the domain of parallel graph processing. We show that the KNL gains encouraging performance when processing graphs, so that it can become a promising solution to accelerating multi-threaded graph applications. We further characterize the impact of KNL architectural enhancements on the performance of a state-of-the art graph framework.We have four key observations: 1 Different graph applications require distinctive numbers of threads to reach the peak performance. For the same application, various datasets need even different numbers of threads to achieve the best performance. 2 Only a few graph applications benefit from the high bandwidth MCDRAM, while others favor the low latency DDR4 DRAM. 3 Vector processing units executing AVX512 SIMD instructions on KNLs are underutilized when running the state-of-the-art graph framework. 4 The sub-NUMA cache clustering mode offering the lowest local memory access latency hurts the performance of graph benchmarks that are lack of NUMA awareness. At last, We suggest future works including system auto-tuning tools and graph framework optimizations to fully exploit the potential of KNL for parallel graph processing.Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance Characterization of Multi-threaded Graph Processing Applications on Many-Integrated-Core Architecture," 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Belfast, United Kingdom, 2018, pp. 199-20

arXiv.org e-Print Archive

IUScholarWorks Open

A New Paradigm in Split Manufacturing: Lock the FEOL, Unlock at the BEOL

Author: Knechtel Johann
Nabeel Mohammed
Sengupta Abhrajit
Sinanoglu Ozgur
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/03/2019
Field of study

Split manufacturing was introduced as an effective countermeasure against hardware-level threats such as IP piracy, overbuilding, and insertion of hardware Trojans. Nevertheless, the security promise of split manufacturing has been challenged by various attacks, which exploit the well-known working principles of physical design tools to infer the missing BEOL interconnects. In this work, we advocate a new paradigm to enhance the security for split manufacturing. Based on Kerckhoff's principle, we protect the FEOL layout in a formal and secure manner, by embedding keys. These keys are purposefully implemented and routed through the BEOL in such a way that they become indecipherable to the state-of-the-art FEOL-centric attacks. We provide our secure physical design flow to the community. We also define the security of split manufacturing formally and provide the associated proofs. At the same time, our technique is competitive with current schemes in terms of layout overhead, especially for practical, large-scale designs (ITC'99 benchmarks).Comment: DATE 2019 (https://www.date-conference.com/conference/session/4.5

arXiv.org e-Print Archive

Crossref

SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

Author: Abadi M.
Coates A.
Collobert R.
Dean J.
Krizhevsky A.
LeCun Y.
Szegedy C.
Vanhoucke V.
Wang L.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 12/01/2018
Field of study

Going deeper and wider in neural architectures improves the accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need change to less desired network architectures, or nontrivially dissect a network across multiGPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling runtime to enable the network training far beyond the GPU DRAM capacity. SuperNeurons features 3 memory optimizations, \textit{Liveness Analysis}, \textit{Unified Tensor Pool}, and \textit{Cost-Aware Recomputation}, all together they effectively reduce the network-wide peak memory usage down to the maximal memory usage among layers. We also address the performance issues in those memory saving techniques. Given the limited GPU DRAM, SuperNeurons not only provisions the necessary memory for the training, but also dynamically allocates the memory for convolution workspaces to achieve the high performance. Evaluations against Caffe, Torch, MXNet and TensorFlow have demonstrated that SuperNeurons trains at least 3.2432 deeper network than current ones with the leading performance. Particularly, SuperNeurons can train ResNet2500 that has

10^4

basic network layers on a 12GB K40c.Comment: PPoPP '2018: 23nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programmin

arXiv.org e-Print Archive

Crossref

Amorphous Placement and Retrieval of Sensory Data in Sparse Mobile Ad-Hoc Networks

Author: Bestavros Azer
Matta Ibrahim
Morcos Hany
Publication venue: Boston University Computer Science Department
Publication date: 14/04/2006
Field of study

Abstract—Personal communication devices are increasingly being equipped with sensors that are able to passively collect information from their surroundings – information that could be stored in fairly small local caches. We envision a system in which users of such devices use their collective sensing, storage, and communication resources to query the state of (possibly remote) neighborhoods. The goal of such a system is to achieve the highest query success ratio using the least communication overhead (power). We show that the use of Data Centric Storage (DCS), or directed placement, is a viable approach for achieving this goal, but only when the underlying network is well connected. Alternatively, we propose, amorphous placement, in which sensory samples are cached locally and informed exchanges of cached samples is used to diffuse the sensory data throughout the whole network. In handling queries, the local cache is searched first for potential answers. If unsuccessful, the query is forwarded to one or more direct neighbors for answers. This technique leverages node mobility and caching capabilities to avoid the multi-hop communication overhead of directed placement. Using a simplified mobility model, we provide analytical lower and upper bounds on the ability of amorphous placement to achieve uniform field coverage in one and two dimensions. We show that combining informed shuffling of cached samples upon an encounter between two nodes, with the querying of direct neighbors could lead to significant performance improvements. For instance, under realistic mobility models, our simulation experiments show that amorphous placement achieves 10% to 40% better query answering ratio at a 25% to 35% savings in consumed power over directed placement.National Science Foundation (CNS Cybertrust 0524477, CNS NeTS 0520166, CNS ITR 0205294, EIA RI 0202067

Boston University Institutional Repository (OpenBU)