1,402 research outputs found

    Генерация потоковых сетей акторов поиска кратчайших путей для параллельной многоядерной реализации

    Get PDF
    Objectives. The problem of parallelizing computations on multicore systems is considered. On the Floyd – Warshall blocked algorithm of shortest paths search in dense graphs of large size, two types of parallelism are compared: fork-join and network dataflow. Using the CAL programming language, a method of developing actors and an algorithm of generating parallel dataflow networks are proposed. The objective is to improve performance of parallel implementations of algorithms which have the property of partial order of computations on multicore processors.Methods. Methods of graph theory, algorithm theory, parallelization theory and formal language theory are used.Results. Claims about the possibility of reordering calculations in the blocked Floyd – Warshall algorithm are proved, which make it possible to achieve a greater load of cores during algorithm execution. Based on the claims, a method of constructing actors in the CAL language is developed and an algorithm for automatic generation of dataflow CAL networks for various configurations of block matrices describing the lengths of the shortest paths is proposed. It is proved that the networks have the properties of rate consistency, boundedness, and liveness. In actors running in parallel, the order of execution of actions with asynchronous behavior can change dynamically, resulting in efficient use of caches and increased core load. To implement the new features of actors, networks and the method of their generation, a tunable multi-threaded CAL engine has been developed that implements a static dataflow model of computation with bounded sizes of buffers. From the experimental results obtained on four types of multi-core processors it follows that there is an optimal size of the network matrix of actors for which the performance is maximum, and the size depends on the number of cores and the size of graph.Conclusion. It has been shown that dataflow networks of actors are an effective means to parallelize computationally intensive algorithms that describe a partial order of computations over decomposed data. The results obtained on the blocked algorithm of shortest paths search prove that the parallelism of dataflow networks gives higher performance of software implementations on multicore processors in comparison with the fork-join parallelism of OpenMP.Цели. Рассматривается задача распараллеливания вычислений на многоядерных системах. Посредством блочного алгоритма Флойда – Уоршалла поиска кратчайших путей на плотных графах большого размера сравниваются два вида параллелизма: разветвление/слияние и сетевой потоковый. С использованием языка программирования CAL разрабатываются метод построения акторов потока данных и алгоритм генерации параллельных сетей акторов. Целью работы является повышение производительности параллельных сетевых реализаций алгоритмов, обладающих свойством частичного порядка вычислений, на многоядерных процессорах.Методы. Используются методы теории графов, теории алгоритмов, теории распараллеливания, теории формальных языков.Результаты. Доказаны утверждения о возможности переупорядочивания вычислений в блочном алгоритме Флойда – Уоршалла, способствующие повышению загрузки ядер при реализации алгоритма. На основе утверждений разработан метод построения акторов на языке CAL и предложен алгоритм автоматической генерации CAL-сетей потока данных для различных конфигураций матриц блоков, описывающих длины кратчайших путей. Доказано, что сети обладают свойствами согласованности, ограниченности и живучести. В акторах, работающих параллельно, порядок выполнения действий с асинхронным поведением может динамически меняться, что приводит к эффективному использованию кэшей и увеличению загрузки ядер. Для реализации новых возможностей акторов, сетей и метода их генерации разработан настраиваемый многопоточный CAL-движок, реализующий статическую модель потоковых вычислений с ограниченными размерами буферов. Из экспериментальных результатов, полученных на четырех типах многоядерных процессоров, следует, что существует оптимальный размер сетевой матрицы акторов, для которого производительность максимальна, и этот размер зависит от размера графа и количества ядер.Заключение. Показано, что сети акторов потока данных являются эффективным средством распарал-леливания алгоритмов с высокой вычислительной нагрузкой, описывающих частичный порядок вычислений над данными, декомпозированными на части. Результаты, полученные на блочном алгоритме поиска кратчайших путей, показали, что параллелизм сетей потока данных дает более высокую производительность программных реализаций на многоядерных процессорах по сравнению с параллелизмом разветвления/слияния стандарта OpenMP

    Practical Fine-grained Privilege Separation in Multithreaded Applications

    Full text link
    An inherent security limitation with the classic multithreaded programming model is that all the threads share the same address space and, therefore, are implicitly assumed to be mutually trusted. This assumption, however, does not take into consideration of many modern multithreaded applications that involve multiple principals which do not fully trust each other. It remains challenging to retrofit the classic multithreaded programming model so that the security and privilege separation in multi-principal applications can be resolved. This paper proposes ARBITER, a run-time system and a set of security primitives, aimed at fine-grained and data-centric privilege separation in multithreaded applications. While enforcing effective isolation among principals, ARBITER still allows flexible sharing and communication between threads so that the multithreaded programming paradigm can be preserved. To realize controlled sharing in a fine-grained manner, we created a novel abstraction named ARBITER Secure Memory Segment (ASMS) and corresponding OS support. Programmers express security policies by labeling data and principals via ARBITER's API following a unified model. We ported a widely-used, in-memory database application (memcached) to ARBITER system, changing only around 100 LOC. Experiments indicate that only an average runtime overhead of 5.6% is induced to this security enhanced version of application

    Leader Election in Anonymous Rings: Franklin Goes Probabilistic

    Get PDF
    We present a probabilistic leader election algorithm for anonymous, bidirectional, asynchronous rings. It is based on an algorithm from Franklin, augmented with random identity selection, hop counters to detect identity clashes, and round numbers modulo 2. As a result, the algorithm is finite-state, so that various model checking techniques can be employed to verify its correctness, that is, eventually a unique leader is elected with probability one. We also sketch a formal correctness proof of the algorithm for rings with arbitrary size

    A Domain Specific Language Based Approach for Generating Deadlock-Free Parallel Load Scheduling Protocols for Distributed Systems

    Get PDF
    In this dissertation, the concept of using domain specific language to develop errorree parallel asynchronous load scheduling protocols for distributed systems is studied. The motivation of this study is rooted in addressing the high cost of verifying parallel asynchronous load scheduling protocols. Asynchronous parallel applications are prone to subtle bugs such as deadlocks and race conditions due to the possibility of non-determinism. Due to this non-deterministic behavior, traditional testing methods are less effective at finding software faults. One approach that can eliminate these software bugs is to employ model checking techniques that can verify that non-determinism will not cause software faults in parallel programs. Unfortunately, model checking requires the development of a verification model of a program in a separate verification language which can be an error-prone procedure and may not properly represent the semantics of the original system. The model checking approach can provide true positive result if the semantics of an implementation code and a verification model is represented under a single framework such that the verification model closely represents the implementation and the automation of a verification process is natural. In this dissertation, a domain specific language based verification framework is developed to design parallel load scheduling protocols and automatically verify their behavioral properties through model checking. A specification language, LBDSL, is introduced that facilitates the development of parallel load scheduling protocols. The LBDSL verification framework uses model checking techniques to verify the asynchronous behavior of the protocol. It allows the same protocol specification to be used for verification and the code generation. The support to automatic verification during protocol development reduces the verification cost post development. The applicability of LBDSL verification framework is illustrated by performing case study on three different types of load scheduling protocols. The study shows that the LBDSL based verification approach removes the need of debugging for deadlocks and race bugs which has potential to significantly lower software development costs

    DAMNED: A Distributed and Multithreaded Neural Event-Driven simulation framework

    Full text link
    In a Spiking Neural Networks (SNN), spike emissions are sparsely and irregularly distributed both in time and in the network architecture. Since a current feature of SNNs is a low average activity, efficient implementations of SNNs are usually based on an Event-Driven Simulation (EDS). On the other hand, simulations of large scale neural networks can take advantage of distributing the neurons on a set of processors (either workstation cluster or parallel computer). This article presents DAMNED, a large scale SNN simulation framework able to gather the benefits of EDS and parallel computing. Two levels of parallelism are combined: Distributed mapping of the neural topology, at the network level, and local multithreaded allocation of resources for simultaneous processing of events, at the neuron level. Based on the causality of events, a distributed solution is proposed for solving the complex problem of scheduling without synchronization barrier.Comment: 6 page

    A Householder-based algorithm for Hessenberg-triangular reduction

    Full text link
    The QZ algorithm for computing eigenvalues and eigenvectors of a matrix pencil AλBA - \lambda B requires that the matrices first be reduced to Hessenberg-triangular (HT) form. The current method of choice for HT reduction relies entirely on Givens rotations regrouped and accumulated into small dense matrices which are subsequently applied using matrix multiplication routines. A non-vanishing fraction of the total flop count must nevertheless still be performed as sequences of overlapping Givens rotations alternately applied from the left and from the right. The many data dependencies associated with this computational pattern leads to inefficient use of the processor and poor scalability. In this paper, we therefore introduce a fundamentally different approach that relies entirely on (large) Householder reflectors partially accumulated into block reflectors, by using (compact) WY representations. Even though the new algorithm requires more floating point operations than the state of the art algorithm, extensive experiments on both real and synthetic data indicate that it is still competitive, even in a sequential setting. The new algorithm is conjectured to have better parallel scalability, an idea which is partially supported by early small-scale experiments using multi-threaded BLAS. The design and evaluation of a parallel formulation is future work

    CheckFence: Checking Consistency of Concurrent Data Types on Relaxed Memory Models

    Get PDF
    Concurrency libraries can facilitate the development of multithreaded programs by providing concurrent implementations of familiar data types such as queues or sets. There exist many optimized algorithms that can achieve superior performance on multiprocessors by allowing concurrent data accesses without using locks. Unfortunately, such algorithms can harbor subtle concurrency bugs. Moreover, they require memory ordering fences to function correctly on relaxed memory models. To address these difficulties, we propose a verification approach that can exhaustively check all concurrent executions of a given test program on a relaxed memory model and can verify that they are observationally equivalent to a sequential execution. Our Check- Fence prototype automatically translates the C implementation code and the test program into a SAT formula, hands the latter to a standard SAT solver, and constructs counterexample traces if there exist incorrect executions. Applying CheckFence to five previously published algorithms, we were able to (1) find several bugs (some not previously known), and (2) determine how to place memory ordering fences for relaxed memory models
    corecore