68,698 research outputs found
TTC: A Tensor Transposition Compiler for Multiple Architectures
We consider the problem of transposing tensors of arbitrary dimension and
describe TTC, an open source domain-specific parallel compiler. TTC generates
optimized parallel C++/CUDA C code that achieves a significant fraction of the
system's peak memory bandwidth. TTC exhibits high performance across multiple
architectures, including modern AVX-based systems (e.g.,~Intel Haswell, AMD
Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such
as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a
meaningful baseline implementation generated by external C++ compilers; the
results suggest that a domain-specific compiler can outperform its general
purpose counterpart significantly: For instance, comparing with Intel's latest
C++ compiler on the Haswell and Knights Corner architecture, TTC yields
speedups of up to and , respectively. We also showcase
TTC's support for multiple leading dimensions, making it a suitable candidate
for the generation of performance-critical packing functions that are at the
core of the ubiquitous BLAS 3 routines
CMOS Vision Sensors: Embedding Computer Vision at Imaging Front-Ends
CMOS Image Sensors (CIS) are key for imaging technol-ogies. These chips are conceived for capturing opticalscenes focused on their surface, and for delivering elec-trical images, commonly in digital format. CISs may incor-porate intelligence; however, their smartness basicallyconcerns calibration, error correction and other similartasks. The term CVISs (CMOS VIsion Sensors) definesother class of sensor front-ends which are aimed at per-forming vision tasks right at the focal plane. They havebeen running under names such as computational imagesensors, vision sensors and silicon retinas, among others. CVIS and CISs are similar regarding physical imple-mentation. However, while inputs of both CIS and CVISare images captured by photo-sensors placed at thefocal-plane, CVISs primary outputs may not be imagesbut either image features or even decisions based on thespatial-temporal analysis of the scenes. We may hencestate that CVISs are more âintelligentâ than CISs as theyfocus on information instead of on raw data. Actually,CVIS architectures capable of extracting and interpretingthe information contained in images, and prompting reac-tion commands thereof, have been explored for years inacademia, and industrial applications are recently ramp-ing up.One of the challenges of CVISs architects is incorporat-ing computer vision concepts into the design flow. Theendeavor is ambitious because imaging and computervision communities are rather disjoint groups talking dif-ferent languages. The Cellular Nonlinear Network Univer-sal Machine (CNNUM) paradigm, proposed by Profs.Chua and Roska, defined an adequate framework forsuch conciliation as it is particularly well suited for hard-ware-software co-design [1]-[4]. This paper overviewsCVISs chips that were conceived and prototyped at IMSEVision Lab over the past twenty years. Some of them fitthe CNNUM paradigm while others are tangential to it. Allthem employ per-pixel mixed-signal processing circuitryto achieve sensor-processing concurrency in the quest offast operation with reduced energy budget.Junta de AndalucĂa TIC 2012-2338Ministerio de EconomĂa y Competitividad TEC 2015-66878-C3-1-R y TEC 2015-66878-C3-3-
Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm as well as the
recently discovered direction optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.Comment: arXiv admin note: text overlap with arXiv:1104.451
Breadth First Search Vectorization on the Intel Xeon Phi
Breadth First Search (BFS) is a building block for graph algorithms and has
recently been used for large scale analysis of information in a variety of
applications including social networks, graph databases and web searching. Due
to its importance, a number of different parallel programming models and
architectures have been exploited to optimize the BFS. However, due to the
irregular memory access patterns and the unstructured nature of the large
graphs, its efficient parallelization is a challenge. The Xeon Phi is a
massively parallel architecture available as an off-the-shelf accelerator,
which includes a powerful 512 bit vector unit with optimized scatter and gather
functions. Given its potential benefits, work related to graph traversing on
this architecture is an active area of research.
We present a set of experiments in which we explore architectural features of
the Xeon Phi and how best to exploit them in a top-down BFS algorithm but the
techniques can be applied to the current state-of-the-art hybrid, top-down plus
bottom-up, algorithms.
We focus on the exploitation of the vector unit by developing an improved
highly vectorized OpenMP parallel algorithm, using vector intrinsics, and
understanding the use of data alignment and prefetching. In addition, we
investigate the impact of hyperthreading and thread affinity on performance, a
topic that appears under researched in the literature. As a result, we achieve
what we believe is the fastest published top-down BFS algorithm on the version
of Xeon Phi used in our experiments. The vectorized BFS top-down source code
presented in this paper can be available on request as free-to-use software
Recommended from our members
Cross-platform validation of notional baseline architecture models of naval electric ship power systems
To support efforts in assessing the relative merit of alternative power system architectures for future naval combatants, the Electric Ship Research and Development Consortium (ESRDC) has developed notional baseline models for each of the primary candidate architectures currently considered, medium-voltage DC (MVDC), conventional 60 Hz medium-voltage (MVAC), and high-frequency medium-voltage (HFAC). Initial efforts have focused on the development of a consistent set of component models, of which the system models can be comprised, and the basic definition of the system models. The broader objectives of the consortium, however, go beyond the definition of the baseline models. The focus is on the process by which the models are implemented in software and validated, the process by which the performance of the disparate system models are objectively and quantitatively assessed and compared, and, ultimately, the process by which the relative merits of the architectures may be assessed. This paper focuses specifically on cross-platform component validation.Center for Electromechanic
gCSP: A Graphical Tool for Designing CSP systems
For broad acceptance of an engineering paradigm, a graphical notation and a supporting design tool seem necessary. This paper discusses certain issues of developing a design environment for building systems based on CSP. Some of the issues discussed depend specifically on the underlying theory of CSP, while a number of them are common for any graphical notation and supporting tools, such as provisions for complexity management and design overview
- âŠ