
    QUERIES SERVICE TIME RESEARCH AND ESTIMATION DURING INFORMATION EXCHANGE IN MULTIPROCESSOR SYSTEMS WITH “UNI BUS” INTERFACE AND SHARED MEMORY

    Abstract. The article analyzes the problem of estimating the service time of queries (transactions) during information exchange in multiprocessor systems with a single shared ("uni bus") interface and shared memory. Its aim is to develop and study models, based on queueing systems and queueing networks, of the "processor-memory" subsystem and to estimate query service time during information exchange in shared-memory multiprocessor systems. The subject of the study is the analysis of the time delays caused by conflicts that occur during interprocessor exchange, when many processors contend for the shared bus and memory. The object of the study is the "processor-memory" subsystem of existing multiprocessor systems and the well-known variants of its architecture. The main task the authors set is to develop and study mathematical models of this subsystem and to estimate the processing time of incoming queries during information exchange in shared-memory systems. Mathematical models for studying query service time are proposed, and equations for determining the main probabilistic-temporal characteristics of the "processor-memory" subsystem are presented. These models are built on queueing network theory and probability theory. In conclusion, the authors summarize the work done: the models make it possible to estimate the main probabilistic-temporal characteristics of multiprocessor systems without building real prototypes, so the characteristics of new multiprocessor computer systems can be estimated, and the best configuration chosen, without constructing an expensive physical system.
    Keywords: simulation, analytical model, imitation model, queueing network system, transaction, read operation, write operation, multiprocessor system, “processor-memory” subsystem, memory architecture, memory bandwidth, memory controller, memory latency, buffer element
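
    As a rough illustration of the kind of probabilistic-temporal estimate such queueing models yield (the article's own equations are not reproduced here), the textbook M/M/1 single-server result below treats the shared bus and memory as a single server with aggregate query arrival rate λ and service rate μ:

        % M/M/1 approximation of the shared "processor-memory" path (illustrative only):
        % \lambda = aggregate query rate from all processors, \mu = bus/memory service rate.
        \begin{align*}
          \rho &= \frac{\lambda}{\mu} < 1 && \text{(bus/memory utilization)} \\
          W    &= \frac{\rho}{\mu(1-\rho)} && \text{(mean waiting time due to contention)} \\
          T    &= W + \frac{1}{\mu} = \frac{1}{\mu - \lambda} && \text{(mean query service time seen by a processor)}
        \end{align*}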

    Algorithms for Extracting Frequent Episodes in the Process of Temporal Data Mining

    An important aspect of the data mining process is the discovery of patterns that have a great influence on the studied problem. The purpose of this paper is to study frequent episode mining through the use of parallel pattern discovery algorithms. Parallel pattern discovery algorithms offer better performance and scalability, so they are of great interest to the data mining research community. In the following, several parallel and distributed frequent pattern mining algorithms on various platforms are highlighted, and a comparative study of their main features is presented. The study takes into account the new possibilities that arise with the Compute Unified Device Architecture (CUDA) offered by the latest generation of graphics processing units. Given their high performance, low cost and increasing number of features, GPUs are a viable solution for an efficient implementation of frequent pattern mining algorithms.
    Keywords: Frequent Pattern Mining, Parallel Computing, Dynamic Load Balancing, Temporal Data Mining, CUDA, GPU, Fermi, Thread
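
    A minimal sketch of the parallel counting step at the heart of such algorithms is given below (purely illustrative: the episode and window representations are simplified, and names such as occurs and count_episodes are not taken from the paper). Candidate episodes are claimed from a shared atomic counter, a simple form of dynamic load balancing; a CUDA implementation would distribute the same candidate/window work over thread blocks.

        // Illustrative C++ sketch: parallel support counting for candidate episodes
        // with dynamic load balancing via an atomic work index.
        #include <atomic>
        #include <thread>
        #include <vector>

        struct Episode { std::vector<char> events; };   // ordered event types

        // Does the episode occur as a subsequence of the window of events?
        bool occurs(const Episode& e, const std::vector<char>& window) {
            size_t i = 0;
            for (char ev : window)
                if (i < e.events.size() && ev == e.events[i]) ++i;
            return i == e.events.size();
        }

        std::vector<long> count_episodes(const std::vector<Episode>& candidates,
                                         const std::vector<std::vector<char>>& windows,
                                         unsigned workers) {
            std::vector<long> support(candidates.size(), 0);
            std::atomic<size_t> next{0};                // dynamic load balancing
            auto work = [&] {
                for (size_t c; (c = next.fetch_add(1)) < candidates.size(); )
                    for (const auto& w : windows)
                        if (occurs(candidates[c], w)) ++support[c];
            };
            std::vector<std::thread> pool;
            for (unsigned t = 0; t < workers; ++t) pool.emplace_back(work);
            for (auto& t : pool) t.join();
            return support;
        }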

    Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

    Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed-precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for overlapping communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
    Comment: 11 pages, 7 figures, to appear in the Proceedings of Supercomputing 2010 (submitted April 12, 2010)
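
    The overlap strategy can be pictured with a short, hypothetical MPI sketch (this is not QUDA's actual API; the update routines are passed in as placeholders): non-blocking halo exchanges are posted with the neighbouring ranks, the interior of the local lattice is updated while the messages are in flight, and the boundary faces are updated once the halos have arrived.

        // Hypothetical C++/MPI sketch of overlapping halo communication with
        // interior computation; update_interior/update_faces are placeholders.
        #include <mpi.h>
        #include <functional>
        #include <vector>

        void overlapped_halo_step(std::vector<double>& send_halo,
                                  std::vector<double>& recv_halo,
                                  int up, int down, MPI_Comm comm,
                                  const std::function<void()>& update_interior,
                                  const std::function<void()>& update_faces) {
            const int n = static_cast<int>(send_halo.size() / 2);
            MPI_Request reqs[4];
            // Post receives for the halo faces coming from both neighbours.
            MPI_Irecv(recv_halo.data(),     n, MPI_DOUBLE, down, 0, comm, &reqs[0]);
            MPI_Irecv(recv_halo.data() + n, n, MPI_DOUBLE, up,   1, comm, &reqs[1]);
            // Send our own boundary faces in the opposite directions.
            MPI_Isend(send_halo.data(),     n, MPI_DOUBLE, up,   0, comm, &reqs[2]);
            MPI_Isend(send_halo.data() + n, n, MPI_DOUBLE, down, 1, comm, &reqs[3]);

            update_interior();                       // overlap: needs no remote data
            MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
            update_faces();                          // depends on the received halos
        }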

    Graph Pattern Matching on Symmetric Multiprocessor Systems

    Graph-structured data can be found in nearly every aspect of today's world, be it road networks, social networks or the internet itself. From a processing perspective, finding comprehensive patterns in graph-structured data is a core processing primitive in a variety of applications, such as fraud detection, biological engineering or social graph analytics. On the hardware side, multiprocessor systems, which consist of multiple processors in a single scale-up server, are the next important wave on top of multi-core systems. In particular, symmetric multiprocessor (SMP) systems are characterized by the fact that each processor has the same architecture, e.g., every processor is a multi-core processor, and all processors share a common and huge main memory space. Moreover, large SMPs feature non-uniform memory access (NUMA), whose impact on the design of efficient data processing concepts should not be neglected. The efficient usage of SMP systems, which continue to grow in size, is an interesting and ongoing research topic. Current state-of-the-art architectural design principles provide different and partly disjoint suggestions on how data should be partitioned and how intra-process communication should be realized. In this thesis, we propose a new synthesis of four of the most well-known principles, Shared Everything, Partition Serial Execution, Data-Oriented Architecture and Delegation, to create the NORAD architecture, which stands for NUMA-aware DORA with Delegation. We built our research prototype, called NeMeSys, on top of the NORAD architecture to fully exploit the hardware capacities of SMPs for graph pattern matching. Being an in-memory engine, NeMeSys allows for online data ingestion as well as online query generation and processing through a terminal-based user interface. Storing a graph on a NUMA system inherently requires data partitioning to cope with the mentioned NUMA effect. Hence, we need to dissect the graph into a disjoint set of partitions, which can then be stored on the individual memory domains. This thesis analyzes the capabilities of the NORAD architecture to perform scalable graph pattern matching on SMP systems. To increase the system's performance, we further develop, integrate and evaluate suitable optimization techniques. That is, we investigate the influence of the inherent data partitioning, the interplay of messaging with and without sufficient locality information, and the actual partition placement on the NUMA sockets of the system. To underline the applicability of our approach, we evaluate NeMeSys against synthetic datasets and perform an end-to-end evaluation of the whole system stack on the real-world knowledge graph of Wikidata
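
    A minimal sketch of the delegation idea is shown below (names and data layout are illustrative, not NeMeSys code): each partition owns a NUMA-local slice of the adjacency lists, and instead of reading remote memory directly, a worker that needs the neighbours of a foreign vertex enqueues a request at the owning partition, which then expands the vertex against purely local data.

        // Illustrative C++ sketch of NUMA-aware partitioning with delegation.
        #include <cstdint>
        #include <functional>
        #include <mutex>
        #include <queue>
        #include <vector>

        constexpr unsigned kPartitions = 4;               // e.g. one per NUMA socket
        inline unsigned owner(uint32_t v) { return v % kPartitions; }

        struct Partition {
            std::vector<std::vector<uint32_t>> adjacency; // NUMA-local vertices only
            std::queue<uint32_t> inbox;                   // delegated expansion requests
            std::mutex inbox_lock;
        };

        // Delegate: ask the owning partition to expand vertex v.
        void request_expansion(std::vector<Partition>& parts, uint32_t v) {
            Partition& p = parts[owner(v)];
            std::lock_guard<std::mutex> g(p.inbox_lock);
            p.inbox.push(v / kPartitions);                // local index at the owner
        }

        // Owner side: expansions touch only NUMA-local adjacency data.
        void drain_inbox(Partition& p, const std::function<void(uint32_t)>& visit) {
            std::lock_guard<std::mutex> g(p.inbox_lock);
            while (!p.inbox.empty()) {
                uint32_t local = p.inbox.front(); p.inbox.pop();
                for (uint32_t neighbour : p.adjacency[local]) visit(neighbour);
            }
        }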

    Distributed real-time operating system (DRTOS) modeling in SpecC

    System-level design of an embedded computing system involves a multi-step process to refine the system from an abstract specification to an actual implementation by defining and modeling the system at various levels of abstraction. System-level design supports evaluating and optimizing the system early in design exploration. Embedded computing systems may consist of multiple processing elements, memories, I/O devices, sensors, and actuators. The selection of processing elements includes instruction-set processors and custom hardware units, such as application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). Real-time operating systems (RTOS) have been an industry standard in embedded systems for years and provide characteristics such as concurrency and support for timing constraints. Some existing system-level design languages, such as SpecC, can model an embedded system including an RTOS for a single processor. However, with the increasing number of processing elements in systems and with embedded platforms built around multiple processors, a distributed RTOS modeling mechanism is needed as part of the system-level design methodology. A distributed RTOS (DRTOS) provides services such as multiprocessor task scheduling, interprocess communication, synchronization, and distributed mutual exclusion. In this thesis, we develop a DRTOS model as an extension of the existing SpecC single-processor RTOS model to provide the basic functionality of a DRTOS implementation, and we present a refinement methodology for using our DRTOS model during system-level synthesis. The DRTOS model and refinement process are demonstrated in the SpecC SCE environment. The capabilities and limitations of the DRTOS modeling approach are presented
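
    As a plain C++ analogue of one of the DRTOS services listed above (this is not SpecC and not the thesis's model, only an illustration of the concept), the sketch below shows a blocking message channel that a DRTOS model could expose for interprocess communication between tasks mapped onto different processing elements:

        // Illustrative C++ sketch: a blocking inter-processor message channel.
        #include <condition_variable>
        #include <mutex>
        #include <queue>

        template <typename T>
        class Channel {
            std::queue<T> buffer_;
            std::mutex m_;
            std::condition_variable cv_;
        public:
            void send(T msg) {                   // called by the producing task
                { std::lock_guard<std::mutex> g(m_); buffer_.push(std::move(msg)); }
                cv_.notify_one();
            }
            T receive() {                        // blocks the consuming task until data arrives
                std::unique_lock<std::mutex> g(m_);
                cv_.wait(g, [&] { return !buffer_.empty(); });
                T msg = std::move(buffer_.front());
                buffer_.pop();
                return msg;
            }
        };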

    Exploiting cache locality at run-time

    With the increasing gap between processor and memory speeds, memory access has become a major performance bottleneck in modern computer systems. Recently, Symmetric Multi-Processor (SMP) systems have emerged as a major class of high-performance platforms. Improving the memory performance of parallel applications with dynamic memory-access patterns on SMP systems is a hard problem, and its solution is critical to the successful use of such systems because dynamic memory-access patterns occur in many real-world applications. This dissertation is aimed at solving this problem. Based on a rigorous analysis of cache-locality optimization, we propose a memory-layout-oriented run-time technique to exploit the cache locality of parallel loops. Our technique has been implemented in a run-time system. Using simulation and measurement, we show that our run-time approach achieves performance comparable to compiler optimizations for regular applications, whose load balance and cache locality can be well optimized by tiling and other program transformations. For applications with dynamic memory-access patterns, which are usually hard to optimize with static compiler techniques, our approach significantly improves memory performance. Several contributions are made in this dissertation. We present models to characterize the complexity of cache-locality optimization and a solution framework for it. We present an effective estimation technique for memory-access patterns to support efficient locality optimizations and information integration. We present a memory-layout-oriented run-time technique for locality optimization. We present efficient scheduling algorithms that trade off locality and load imbalance. Finally, we provide a detailed performance evaluation of the run-time technique
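
    One common way to realize a memory-layout-oriented run-time optimization is the inspector/executor pattern, sketched below purely for illustration (it is not claimed to be the dissertation's exact algorithm): record the order in which an irregular loop touches its data, pack the data in that order at run time, and then run the loop over the packed, cache-friendly layout.

        // Illustrative C++ sketch of inspector/executor data reordering.
        #include <vector>

        // Remap: pack the data in the order the loop will touch it, so the
        // executor streams through memory instead of jumping around.
        std::vector<double> remap(const std::vector<double>& data,
                                  const std::vector<int>& access_order) {
            std::vector<double> packed;
            packed.reserve(access_order.size());
            for (int i : access_order) packed.push_back(data[i]);
            return packed;
        }

        // Executor: the irregular loop, now expressed over the packed layout.
        double executor(const std::vector<double>& packed) {
            double sum = 0.0;
            for (double v : packed) sum += v;    // sequential, cache-friendly accesses
            return sum;
        }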

    Simulation models of shared-memory multiprocessor systems
