TaskPoint: sampled simulation of task-based programs
Sampled simulation is a mature technique for reducing the simulation time of single-threaded programs, but it is not directly applicable to the simulation of multi-threaded architectures. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Task-based programming models allow the programmer to specify program segments as tasks, which are instantiated many times and scheduled dynamically to available threads. Due to system noise and variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals, we employ a novel fast-forward mechanism for dynamically scheduled programs. We evaluate the proposed technique on a set of 19 task-based parallel benchmarks and two different architectures. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 64 simulated threads by an average factor of 19.1, at an average error of 1.8% and a maximum error of 15.0%.

This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence, and the Mont-Blanc project (EU-FP7-610402 and EU-H2020-671697). M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas is supported by the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the EU FP7 (contract 2013BP B 00243). T. Grass has been partially supported by the AGAUR of the Generalitat de Catalunya (grant 2013FI B 0058).
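To make the mechanism concrete, here is a minimal sketch of the sampling idea described above: detail-simulate only a fraction of each task type's instances and fast-forward the rest using the mean cycle count observed so far. Task, detailed_cycles, and the workload are hypothetical stand-ins, not the paper's simulator interface.

```python
from collections import defaultdict, namedtuple
import random

Task = namedtuple("Task", ["type", "size"])

def detailed_cycles(task):
    # Stand-in for slow, cycle-accurate simulation of one task instance.
    return task.size * 3 + random.randint(-5, 5)

def sampled_cycles(task_stream, detail_every=20):
    seen = defaultdict(int)   # instances observed per task type
    stats = {}                # (mean detailed cycles, sample count) per type
    total = 0.0
    for task in task_stream:
        n = seen[task.type]
        seen[task.type] += 1
        if task.type not in stats or n % detail_every == 0:
            c = detailed_cycles(task)                  # detailed interval
            mean, k = stats.get(task.type, (0.0, 0))
            stats[task.type] = ((mean * k + c) / (k + 1), k + 1)
            total += c
        else:
            total += stats[task.type][0]               # fast-forward estimate
    return total

workload = [Task("fft", 100) for _ in range(1000)]
print(sampled_cycles(workload))
```

In a real simulator the fast-forward path would still advance functional state; only the timing is estimated rather than simulated cycle by cycle.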
BarrierPoint: sampled simulation of multi-threaded applications
Sampling is a well-known technique to speed up architectural simulation of long-running workloads while maintaining accurate performance predictions. A number of sampling techniques have recently been developed that extend well-known single-threaded techniques to allow sampled simulation of multi-threaded applications. Unfortunately, prior work is limited to non-synchronizing applications (e.g., server throughput workloads); requires the functional simulation of the entire application using a detailed cache hierarchy, which limits the overall simulation speedup potential; leads to different units of work across different processor architectures, which complicates performance analysis; or requires massive machine resources to achieve reasonable simulation speedups. In this work, we propose BarrierPoint, a sampling methodology to accelerate simulation by leveraging globally synchronizing barriers in multi-threaded applications. BarrierPoint collects microarchitecture-independent code and data signatures to determine the most representative inter-barrier regions, called barrierpoints. BarrierPoint estimates total application execution time (and other performance metrics of interest) through detailed simulation of these barrierpoints only, leading to substantial simulation speedups. Barrierpoints can be simulated in parallel, use fewer simulation resources, and define fixed units of work to be used in performance comparisons across processor architectures. Our evaluation of BarrierPoint using NPB and PARSEC benchmarks reports average simulation speedups of 24.7x (and up to 866.6x) with an average simulation error of 0.9% and 2.9% at most. On average, BarrierPoint reduces the number of simulation machine resources needed by 78x.
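The selection of representative inter-barrier regions can be pictured as a clustering problem over microarchitecture-independent signatures. The sketch below is a loose illustration under that reading, not the paper's actual pipeline: per-region feature vectors (random stand-ins here) are clustered with k-means, and one representative per cluster is kept, weighted by cluster population.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
signatures = rng.random((200, 64))   # one signature per inter-barrier region

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(signatures)

barrierpoints, weights = [], []
for c in range(km.n_clusters):
    members = np.flatnonzero(km.labels_ == c)
    center = km.cluster_centers_[c]
    # Representative = region closest to the cluster centroid.
    rep = members[np.argmin(np.linalg.norm(signatures[members] - center, axis=1))]
    barrierpoints.append(int(rep))   # simulate only this region in detail
    weights.append(len(members))     # scale its metrics by cluster size

print(list(zip(barrierpoints, weights)))
# Estimated total time = sum(weight_i * detailed_time(barrierpoint_i)).
```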
FASTCUDA: Open Source FPGA Accelerator & Hardware-Software Codesign Toolset for CUDA Kernels
Using FPGAs as hardware accelerators that communicate with a central CPU is becoming common practice in the embedded design world, but there is not yet a standard methodology and toolset to facilitate this path. On the other hand, languages such as CUDA and OpenCL provide standard development environments for Graphics Processing Unit (GPU) programming. FASTCUDA is a platform that provides the necessary software toolset, hardware architecture, and design methodology to efficiently adapt the CUDA approach to a new FPGA design flow. With FASTCUDA, the kernels of a CUDA-based application are partitioned into two groups with minimal user intervention: those that are compiled and executed in parallel software, and those that are synthesized and implemented in hardware. A modern low-power FPGA can provide the processing power (via numerous embedded micro-CPUs) and the logic capacity for both the software and hardware implementations of the CUDA kernels. This paper describes the system requirements and the architectural decisions behind the FASTCUDA approach.
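The abstract does not spell out the partitioning algorithm, but the decision it automates can be sketched as a budgeted selection problem: move the kernels that gain the most from hardware per unit of logic into the FPGA until the area budget is exhausted. Everything below (the Kernel fields, the area model, the greedy rule) is an invented illustration, not FASTCUDA's actual toolset.

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    sw_time_ms: float   # runtime on the embedded micro-CPUs
    hw_time_ms: float   # runtime if synthesized to FPGA logic
    area_luts: int      # logic consumed by the hardware version

def partition(kernels, area_budget):
    # Rank kernels by speedup gained per LUT spent; keep the rest in software.
    ranked = sorted(kernels,
                    key=lambda k: (k.sw_time_ms - k.hw_time_ms) / k.area_luts,
                    reverse=True)
    hw, sw, used = [], [], 0
    for k in ranked:
        if k.sw_time_ms > k.hw_time_ms and used + k.area_luts <= area_budget:
            hw.append(k)
            used += k.area_luts
        else:
            sw.append(k)
    return hw, sw

hw, sw = partition([Kernel("matmul", 12.0, 1.5, 8000),
                    Kernel("reduce", 3.0, 1.0, 5000),
                    Kernel("setup", 0.4, 0.5, 2000)],
                   area_budget=10000)
print([k.name for k in hw], [k.name for k in sw])
```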
Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts
Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing, and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse, as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse types of domain knowledge as text annotations. Then we develop a novel Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations in word embedding. We have evaluated our method on two cybersecurity text corpora: a malware description corpus and a Common Vulnerabilities and Exposures (CVE) corpus. Our evaluation results demonstrate the effectiveness of our method in learning domain-specific word embeddings.
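As a rough rendering of the framework's first step, encoding domain knowledge as text annotations, one can tag in-vocabulary terms with annotation tokens and train an off-the-shelf skip-gram model over the combined stream, so that words and their annotations share contexts. This is a simplified stand-in for the paper's WAE algorithm (whose update rule is not reproduced here); the vocabulary and corpus are toy examples.

```python
from gensim.models import Word2Vec

domain_vocab = {"ransomware": "MALWARE", "trojan": "MALWARE"}  # toy knowledge

def annotate(sentence):
    out = []
    for w in sentence:
        out.append(w)
        if w in domain_vocab:
            out.append("ANN_" + domain_vocab[w])  # annotation shares the context
    return out

corpus = [["the", "trojan", "drops", "ransomware", "payloads"],
          ["ransomware", "encrypts", "user", "files"],
          ["the", "trojan", "opens", "a", "backdoor"]]

model = Word2Vec([annotate(s) for s in corpus],
                 vector_size=50, window=3, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("ransomware", topn=3))
```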
A Novel Apex-Time Network for Cross-Dataset Micro-Expression Recognition
The automatic recognition of micro-expression has been boosted ever since the successful introduction of deep learning approaches. As researchers working on such topics move toward learning from the nature of micro-expression, the practice of using deep learning techniques has evolved from processing the entire video clip of a micro-expression to recognition on the apex frame. Using the apex frame gets rid of redundant video frames, but the relevant temporal evidence of the micro-expression is thereby left out. This paper proposes a novel Apex-Time Network (ATNet) to recognize micro-expression based on spatial information from the apex frame as well as on temporal information from its adjacent frames. Through extensive experiments on three benchmarks, we demonstrate the improvement achieved by learning such temporal information. Specifically, the model with such temporal information is more robust in cross-dataset validations.

Comment: 6 pages, 3 figures, 3 tables, code available, accepted in ACII 201
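The described combination of apex-frame spatial features with temporal evidence from adjacent frames can be sketched as a two-branch network. The layer sizes, the frame-difference encoding, and the fusion below are invented for illustration and are not ATNet's published architecture.

```python
import torch
import torch.nn as nn

class ApexTimeSketch(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.spatial = nn.Sequential(            # apex-frame branch
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.temporal = nn.Sequential(           # frame-difference branch
            nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, n_classes)

    def forward(self, apex, prev_frame, next_frame):
        # Temporal evidence: differences between the apex and its neighbors.
        t = torch.cat([apex - prev_frame, next_frame - apex], dim=1)
        return self.head(torch.cat([self.spatial(apex), self.temporal(t)], dim=1))

model = ApexTimeSketch()
frames = torch.randn(4, 3, 64, 64)
print(model(frames, frames, frames).shape)   # torch.Size([4, 3])
```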
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
Intel Xeon Phi is a recently released high-performance coprocessor featuring 61 cores, each supporting 4 hardware threads with 512-bit-wide SIMD registers, and achieving a peak theoretical performance of 1 Tflop/s in double precision. Many scientific applications, such as linear solvers, eigensolvers, and graph mining algorithms, involve operations on large sparse matrices. The core of most of these applications is the multiplication of a large, sparse matrix with a dense vector (SpMV). In this paper, we investigate the performance of the Xeon Phi coprocessor for SpMV. We first provide a comprehensive introduction to this new architecture and analyze its peak performance with a number of microbenchmarks. Although the design of a Xeon Phi core is not much different from those of the cores in modern processors, its large number of cores and its hyperthreading capability allow many applications to saturate the available memory bandwidth, which is not the case for many cutting-edge processors. Yet, our performance studies show that it is the memory latency, not the bandwidth, that creates a bottleneck for SpMV on this architecture. Finally, our experiments show that Xeon Phi's sparse kernel performance is very promising, and even better than that of cutting-edge general-purpose processors and GPUs.
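For reference, the kernel under study is plain CSR SpMV; the indexed gather x[cols[j]] is the irregular access that makes the kernel sensitive to memory latency rather than raw bandwidth. A minimal Python version, checked against SciPy:

```python
import numpy as np
import scipy.sparse as sp

def spmv_csr(vals, cols, rowptr, x):
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        for j in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[j] * x[cols[j]]   # random-access gather on x
    return y

A = sp.random(100, 100, density=0.05, format="csr", random_state=0)
x = np.ones(A.shape[1])
assert np.allclose(spmv_csr(A.data, A.indices, A.indptr, x), A @ x)
print("SpMV matches scipy")
```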