9 research outputs found

    Parallel structurally-symmetric sparse matrix-vector products on multi-core processors

    Full text link
    We consider the problem of developing an efficient multi-threaded implementation of the matrix-vector multiplication algorithm for sparse matrices with structural symmetry. Matrices are stored using the compressed sparse row-column format (CSRC), designed for profiting from the symmetric non-zero pattern observed in global finite element matrices. Unlike classical compressed storage formats, performing the sparse matrix-vector product using the CSRC requires thread-safe access to the destination vector. To avoid race conditions, we have implemented two partitioning strategies. In the first one, each thread allocates an array for storing its contributions, which are later combined in an accumulation step. We analyze how to perform this accumulation in four different ways. The second strategy employs a coloring algorithm for grouping rows that can be concurrently processed by threads. Our results indicate that, although incurring an increase in the working set size, the former approach leads to the best performance improvements for most matrices.Comment: 17 pages, 17 figures, reviewed related work section, fixed typo

    Characterization and modeling of PIDX parallel I/O for performance optimization

    Get PDF
    pre-printParallel I/O library performance can vary greatly in re- sponse to user-tunable parameter values such as aggregator count, file count, and aggregation strategy. Unfortunately, manual selection of these values is time consuming and dependent on characteristics of the target machine, the underlying file system, and the dataset itself. Some characteristics, such as the amount of memory per core, can also impose hard constraints on the range of viable parameter values. In this work we address these problems by using machine learning techniques to model the performance of the PIDX parallel I/O library and select appropriate tunable parameter values. We characterize both the network and I/O phases of PIDX on a Cray XE6 as well as an IBM Blue Gene/P system. We use the results of this study to develop a machine learning model for parameter space exploration and performance prediction

    On Optimal Partitioning For Sparse Matrices In Variable Block Row Format

    Full text link
    The Variable Block Row (VBR) format is an influential blocked sparse matrix format designed to represent shared sparsity structure between adjacent rows and columns. VBR consists of groups of adjacent rows and columns, storing the resulting blocks that contain nonzeros in a dense format. This reduces the memory footprint and enables optimizations such as register blocking and instruction-level parallelism. Existing approaches use heuristics to determine which rows and columns should be grouped together. We adapt and optimize a dynamic programming algorithm for sequential hypergraph partitioning to produce a linear time algorithm which can determine the optimal partition of rows under an expressive cost model, assuming the column partition remains fixed. Furthermore, we show that the problem of determining an optimal partition for the rows and columns simultaneously is NP-Hard under a simple linear cost model. To evaluate our algorithm empirically against existing heuristics, we introduce the 1D-VBR format, a specialization of VBR format where columns are left ungrouped. We evaluate our algorithms on all 1626 real-valued matrices in the SuiteSparse Matrix Collection. When asked to minimize an empirically derived cost model for a sparse matrix-vector multiplication kernel, our algorithm produced partitions whose 1D-VBR realizations achieve a speedup of at least 1.18 over an unblocked kernel on 25% of the matrices, and a speedup of at least 1.59 on 12.5% of the matrices. The 1D-VBR representation produced by our algorithm had faster SpMVs than the 1D-VBR representations produced by any existing heuristics on 87.8% of the test matrices

    A fast sparse block circulant matrix vector product

    Full text link
    In the context of computed tomography (CT), iterative image reconstruction techniques are gaining attention because high-quality images are becoming computationally feasible. They involve the solution of large systems of equations, whose cost is dominated by the sparse matrix vector product (SpMV). Our work considers the case of the sparse matrices being block circulant, which arises when taking advantage of the rotational symmetry in the tomographic system. Besides the straightforward storage saving, we exploit the circulant structure to rewrite the poor-performance SpMVs into a high-performance product between sparse and dense matrices. This paper describes the implementations developed for multi-core CPUs and GPUs, and presents experimental results with typical CT matrices. The presented approach is up to ten times faster than without exploiting the circulant structure.Romero Alcalde, E.; Tomás Domínguez, AE.; Soriano Asensi, A.; Blanquer Espert, I. (2014). A fast sparse block circulant matrix vector product. En Euro-Par 2014 Parallel Processing. Springer. 548-559. doi:10.1007/978-3-319-09873-9_46S548559Bian, J., Siewerdsen, J.H., Han, X., Sidky, E.Y., Prince, J.L., Pelizzari, C.A., Pal, X.: Evaluation of sparse-view reconstruction from flat-panel-detector cone-beam ct. Physics in Medicine and Biology 55, 6575–6599 (2010)Dalton, S., Bell, N.: CUSP: A C++ templated sparse matrix library version 0.4.0 (2014), http://cusplibrary.github.com/Feldkamp, L., Davis, L., Kress, J.: Practical cone-beam algorithm. Journal of the Optical Society of America 1, 612–619 (1984)Ganine, V., Legrand, M., Michalska, H., Pierre, C.: A sparse preconditioned iterative method for vibration analysis of geometrically mistuned bladed disks. Computers & Structures 87(5-6), 342–354 (2009)Hara, A.K., Paden, R.G., Silva, A.C., Kujak, J.L., Lawder, H.J., Pavlicek, W.: Iterative reconstruction technique for reducing body radiation dose at CT: Feasibility study. American Journal of Roentgenology 193, 764–771 (2009)Heroux, M.A., Bartlett, R.A., Howle, V.E., Hoekstra, R.J., Hu, J.J., Kolda, T.G., Lehoucq, R.B., Long, K.R., Pawlowski, R.P., Phipps, E.T., Salinger, A.G., Thornquist, H.K., Tuminaro, R.S., Willenbring, J.M., Williams, A., Stanley, K.S.: An overview of the Trilinos project. ACM Trans. Math. Softw. 31(3), 397–423 (2005)Im, E.J., Yelick, K., Vuduc, R.: Sparsity: Optimization framework for sparse matrix kernels. International Journal of High Performance Computing Applications 18(1), 135–158 (2004)Jones, E., Oliphant, T., Peterson, P., et al.: SciPy: Open source scientific tools for Python (2001), http://www.scipy.org/Kaveh, A., Rahami, H.: Block circulant matrices and applications in free vibration analysis of cyclically repetitive structures. Acta Mechanica 217(1-2), 51–62 (2011)Kourtis, K., Goumas, G., Koziris, N.: Optimizing sparse matrix-vector multiplication using index and value compression. In: Proceedings of the 5th Conference on Computing Frontiers, CF 2008, pp. 87–96. ACM, New York (2008)Krotkiewski, M., Dabrowski, M.: Parallel symmetric sparse matrix–vector product on scalar multi-core CPUs. Parallel Computing 36(4), 181–198 (2010)Lee, B., Vuduc, R., Demmel, J., Yelick, K.: Performance models for evaluation and automatic tuning of symmetric sparse matrix-vector multiply. In: International Conference on Parallel Processing, ICPP 2004, vol. 1, pp. 169–176 (2004)Leroux, J.D., Selivanov, V., Fontaine, R., Lecomte, R.: Accelerated iterative image reconstruction methods based on block-circulant system matrix derived from a cylindrical image representation. In: Nuclear Science Symposium Conference Record, NSS 2007, vol. 4, pp. 2764–2771. IEEE (2007)NVIDIA: CUSPARSE library (2014), https://developer.nvidia.com/cusparsePan, X., Sidky, E.Y., Vannier, M.: Why do commercial CT scanners still employ traditional, filtered back-projection for image reconstruction? Inverse Problems 25, 123009 (2008)Rodríguez-Alvarez, M.J., Soriano, A., Iborra, A., Sánchez, F., González, A.J., Conde, P., Hernández, L., Moliner, L., Orero, A., Vidal, L.F., Benlloch, J.M.: Expectation maximization (EM) algorithms using polar symmetries for computed tomography CT image reconstruction. Computers in Biology and Medicine 43(8), 1053–1061 (2013)Sheep, L., Vardi, Y.: Maximum likelihood reconstruction for emmision tomography. IEEE Transactions on Medical Imaging 1, 113–122 (1982)Sidky, E.Y., Pan, X.: Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization. Physics in Medicine and Biology 53, 4777–4807 (2008)Soriano, A., Rodríguez-Alvarez, M.J., Iborra, A., Sánchez, F., Carles, M., Conde, P., González, A.J., Hernández, L., Moliner, L., Orero, A., Vidal, L.F., Benlloch, J.M.: EM tomographic image reconstruction using polar voxels. Journal of Instrumentation 8, C01004 (2013)Thibaudeau, C., Leroux, J.D., Pratte, J.F., Fontaine, R., Lecomte, R.: Cylindrical and spherical ray-tracing for ct iterative reconstruction. In: 2011 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), pp. 4378–4381 (2011)Vuduc, R., Demmel, J.W., Yelick, K.A.: OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series 16(1), 521 (2005)Vuduc, R.W., Moon, H.-J.: Fast sparse matrix-vector multiplication by exploiting variable block structure. In: Yang, L.T., Rana, O.F., Di Martino, B., Dongarra, J. (eds.) HPCC 2005. LNCS, vol. 3726, pp. 807–816. Springer, Heidelberg (2005)Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J.: Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing 35(3), 178–194 (2009

    Performance models for evaluation and automatic tuning of symmetric sparse matrix-vector multiply

    No full text
    We present optimizations for sparse matrix-vector multiply SpMV and its generalization to multiple vectors, SpMM, when the matrix is symmetric: (1) symmetric storage, (2) register blocking, and (3) vector blocking. Combined with register blocking, symmetry saves more than 50 % in matrix storage. We also show performance speedups of 2.1Ă— for SpMV and 2.6 Ă— for SpMM, when compared to the best non-symmetric register blocked implementation. We present an approach for the selection of tuning parameters, based on empirical modeling and search that consists of three steps: (1) Off-line benchmark, (2) Runtime search, and (3) Heuristic performance model. This approach generally selects parameters to achieve performance with 85 % of that achieved with exhaustive search. We evaluate our implementations with respect to upper bounds on performance. Our model bounds performance by considering only the cost of memory operations and using lower bounds on the number of cache misses. Our optimized codes are within 68 % of the upper bounds.

    Doctor of Philosophy

    Get PDF
    dissertationThe increase in computational power of supercomputers is enabling complex scientific phenomena to be simulated at ever-increasing resolution and fidelity. With these simulations routinely producing large volumes of data, performing efficient I/O at this scale has become a very difficult task. Large-scale parallel writes are challenging due to the complex interdependencies between I/O middleware and hardware. Analytic-appropriate reads are traditionally hindered by bottlenecks in I/O access. Moreover, the two components of I/O, data generation from simulations (writes) and data exploration for analysis and visualization (reads), have substantially different data access requirements. Parallel writes, performed on supercomputers, often deploy aggregation strategies to permit large-sized contiguous access. Analysis and visualization tasks, usually performed on computationally modest resources, require fast access to localized subsets or multiresolution representations of the data. This dissertation tackles the problem of parallel I/O while bridging the gap between large-scale writes and analytics-appropriate reads. The focus of this work is to develop an end-to-end adaptive-resolution data movement framework that provides efficient I/O, while supporting the full spectrum of modern HPC hardware. This is achieved by developing technology for highly scalable and tunable parallel I/O, applicable to both traditional parallel data formats and multiresolution data formats, which are directly appropriate for analysis and visualization. To demonstrate the efficacy of the approach, a novel library (PIDX) is developed that is highly tunable and capable of adaptive-resolution parallel I/O to a multiresolution data format. Adaptive resolution storage and I/O, which allows subsets of a simulation to be accessed at varying spatial resolutions, can yield significant improvements to both the storage performance and I/O time. The library provides a set of parameters that controls the storage format and the nature of data aggregation across he network; further, a machine learning-based model is constructed that tunes these parameters for the maximum throughput. This work is empirically demonstrated by showing parallel I/O scaling up to 768K cores within a framework flexible enough to handle adaptive resolution I/O

    Analysis and acceleration of data mining algorithms on high performance reconfigurable computing platforms

    Get PDF
    With the continued development of computation and communication technologies, we are overwhelmed with electronic data. Ubiquitous data in governments, commercial enterprises, universities and various organizations records our decisions, transactions and thoughts. The data collection rate is undergoing tremendous increase. And there is no end in sight. On one hand, as the volume of data explodes, the gap between the human being\u27s understanding of the data and the knowledge hidden in the data will be enlarged. The algorithms and techniques, collectively known as data mining, are emerged to bridge the gap. The data mining algorithms are usually data-compute intensive. On the other hand, the overall computing system performance is not increasing at an equal rate. Consequently, there is strong requirement to design special computing systems to accelerate data mining applications. FPGAs based High Performance Reconfigurable Computing(HPRC) system is to design optimized hardware architecture for a given problem. The increased gate count, arithmetic capability, and other features of modern FPGAs now allow researcher to implement highly complicated reconfigurable computational architecture. In contrast with ASICs, FPGAs have the advantages of low power, low nonrecurring engineering costs, high design flexibility and the ability to update functionality after shipping. In this thesis, we first design the architectures for data intensive and data-compute intensive applications respectively. Then we present a general HPRC framework for data mining applications: Frequent Pattern Mining(FPM) is a data-compute intensive application which is to find commonly occurring itemsets in databases. We use systolic tree architecture in FPGA hardware to mimic the internal memory layout of FP-growth algorithm while achieving higher throughput. The experimental results demonstrate that the proposed hardware architecture is faster than the software approach. Sparse Matrix-Vector Multiplication(SMVM) is a data-intensive application which is an important computing core in many applications. We present a scalable and efficient FPGA-based SMVM architecture which can handle arbitrary matrix sizes without preprocessing or zero padding and can be dynamically expanded based on the available I/O bandwidth. The experimental results using a commercial FPGA-based acceleration system demonstrate that our reconfigurable SMVM engine is more efficient than existing state-of-the-art, with speedups over a highly optimized software implementation of 2.5X to 6.5X, depending on the sparsity of the input benchmark. Accelerating Text Classification Using SMVM is performed in Convey HC-1 HPRC platform. The SMVM engines are deployed into multiple FPGA chips. Text documents are represented as large sparse matrices using Vector Space Model(VSM). The k-nearest neighbor algorithm uses SMVM to perform classification simultaneously on multiple FPGAs. Our experiment shows that the classification in Convey HC-1 is several times faster compared with the traditional computing architecture. MapReduce Reconfigurable Framework for Data Mining Applications is a pipelined and high performance framework for FPGA design based on the MapReduce model. Our goal is to lessen the FPGA programmer burden while minimizing performance degradation. The designer only need focus on the mapper and reducer modules design. We redesigned the SMVM architecture using the MapReduce Framework. The manual VHDL code is only 15 percent of that used in the customized architecture

    Performance Optimization With An Integrated View Of Compiler And Application Knowledge

    Get PDF
    Compiler optimization is a long-standing research field that enhances program performance with a set of rigorous code analyses and transformations. Traditional compiler optimization focuses on general programs or program structures without considering too much high-level application operations or data structure knowledge. In this thesis, we claim that an integrated view of the application and compiler is helpful to further improve program performance. Particularly, we study integrated optimization opportunities for three kinds of applications: irregular tree-based query processing systems such as B+ tree, security enhancement such as buffer overflow protection, and tensor/matrix-based linear algebra computation. The performance of B+ tree query processing is important for many applications, such as file systems and databases. Latch-free B+ tree query processing is efficient since the queries are processed in batches without locks. To avoid long latency, the batch size can not be very large. However, modern processors provide opportunities to process larger batches parallel with acceptable latency. From studying real-world data, we find that there are many redundant and unnecessary queries especially when the real-world data is highly skewed. We develop a query sequence transformation framework Qtrans to reduce the redundancies in queries by applying classic dataflow analysis to queries. To further confirm the effectiveness, we integrate Qtrans into an existing BSP-based B+ tree query processing system, PALM tree. The evaluations show that the throughput can be improved up to 16X. Heap overflows are still the most common vulnerabilities in C/C++ programs. Common approaches incur high overhead since it checks every memory access. By analyzing dozens of bugs, we find that all heap overflows are related to arrays. We only need to check array-related memory accesses. We propose Prober to efficiently detect and prevent heap overflows. It contains Prober-Static to identify the array-related allocations and Prober-Dynamic to protect objects at runtime. In this thesis, our contributions lie on the Prober-Static side. The key challenge is to correctly identify the array-related allocations. We propose a hybrid method. Some objects can be identified as array-related (or not) by static analysis. For the remaining ones, we instrument the basic allocation type size statically and then determine the real allocation size at runtime. The evaluations show Prober-Static is effective. Tensor algebra is widely used in many applications, such as machine learning and data analytics. Tensors representing real-world data are usually large and sparse. There are many sparse tensor storage formats, and the kernels are different with varied formats. These different kernels make performance optimization for sparse tensor algebra challenging. We propose a tensor algebra domain-specific language and a compiler to automatically generate kernels for sparse tensor algebra computations, called SPACe. This compiler supports a wide range of sparse tensor formats. To further improve the performance, we integrate the data reordering into SPACe to improve data locality. The evaluations show that the code generated by SPACe outperforms state-of-the-art sparse tensor algebra compilers
    corecore