
    Admit your weakness: Verifying correctness on TSO architectures

    The final publication is available at http://link.springer.com/chapter/10.1007%2F978-3-319-15317-9_22.
    Linearizability has become the standard correctness criterion for fine-grained non-atomic concurrent algorithms; however, most approaches assume a sequentially consistent memory model, which is not always realised in practice. In this paper we study the correctness of concurrent algorithms on a weak memory model: the TSO (Total Store Order) memory model, which is commonly implemented by multicore architectures. Here, linearizability is often too strict, and hence we prove a weaker criterion, quiescent consistency, instead. Like linearizability, quiescent consistency is compositional, making it an ideal correctness criterion in a component-based context. We demonstrate how to model a typical concurrent algorithm, seqlock, and prove it quiescent consistent using a simulation-based approach. Previous approaches to proving correctness on TSO architectures have been based on linearizability, which makes it necessary to modify the algorithm's high-level requirements. Our approach is, to our knowledge, the first that proves correctness without the need for such a modification.
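    For readers unfamiliar with the algorithm named above, a minimal seqlock sketch in C++ (single writer, retrying readers) looks roughly like the following; the structure and names are illustrative and not taken from the paper.

        #include <atomic>
        #include <cstdint>

        // Minimal single-writer seqlock protecting two data words.
        // Readers retry while the sequence counter is odd (write in progress)
        // or has changed between their two reads of it.
        struct SeqLock {
            std::atomic<uint32_t> seq{0};
            std::atomic<uint64_t> d1{0}, d2{0};

            void write(uint64_t a, uint64_t b) {              // single writer only
                seq.store(seq.load(std::memory_order_relaxed) + 1,
                          std::memory_order_relaxed);         // counter becomes odd
                std::atomic_thread_fence(std::memory_order_release);
                d1.store(a, std::memory_order_relaxed);
                d2.store(b, std::memory_order_relaxed);
                std::atomic_thread_fence(std::memory_order_release);
                seq.store(seq.load(std::memory_order_relaxed) + 1,
                          std::memory_order_relaxed);         // counter even again
            }

            void read(uint64_t& a, uint64_t& b) {
                uint32_t s0, s1;
                do {
                    s0 = seq.load(std::memory_order_acquire);
                    a  = d1.load(std::memory_order_relaxed);
                    b  = d2.load(std::memory_order_relaxed);
                    std::atomic_thread_fence(std::memory_order_acquire);
                    s1 = seq.load(std::memory_order_relaxed);
                } while (s0 != s1 || (s0 & 1));               // retry on concurrent write
            }
        };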

    Property-Driven Fence Insertion using Reorder Bounded Model Checking

    Modern architectures provide weaker memory consistency guarantees than sequential consistency. These weaker guarantees allow programs to exhibit behaviours where the program statements appear to have executed out of program order. Fortunately, modern architectures provide memory barriers (fences) to enforce the program order between a pair of statements if needed. Due to the intricate semantics of weak memory models, the placement of fences is challenging even for experienced programmers. Too few fences lead to bugs whereas overuse of fences results in performance degradation. This motivates automated placement of fences. Tools that restore sequential consistency in the program may insert more fences than necessary for the program to be correct. Therefore, we propose a property-driven technique that introduces "reorder-bounded exploration" to identify the smallest number of program locations for fence placement. We implemented our technique on top of CBMC; however, in principle, our technique is generic enough to be used with any model checker. Our experimental results show that our technique is faster and solves more instances of relevant benchmarks as compared to earlier approaches.
    Comment: 18 pages, 3 figures, 4 algorithms. Version change reason: new set of results and publication ready version of FM 201
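    As a concrete illustration of the problem being automated (not an example taken from the paper), the classic store-buffering idiom below shows why a fence between a store and a subsequent load matters on weakly ordered architectures; remove the two fences and the outcome r0 == r1 == 0 becomes possible.

        #include <atomic>
        #include <cassert>
        #include <thread>

        std::atomic<int> x{0}, y{0};
        int r0 = 0, r1 = 0;

        void t0() {
            x.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // fence between store and load
            r0 = y.load(std::memory_order_relaxed);
        }

        void t1() {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = x.load(std::memory_order_relaxed);
        }

        int main() {
            std::thread a(t0), b(t1);
            a.join();
            b.join();
            assert(r0 == 1 || r1 == 1);  // with the fences, both reads cannot be 0
        }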

    Datacenter Traffic Control: Understanding Techniques and Trade-offs

    Datacenters provide cost-effective and flexible access to scalable compute and storage resources necessary for today's cloud computing needs. A typical datacenter is made up of thousands of servers connected with a large network and usually managed by one operator. To provide quality access to the variety of applications and services hosted on datacenters and to maximize performance, it is necessary to use datacenter networks effectively and efficiently. Datacenter traffic is often a mix of several classes with different priorities and requirements. This includes user-generated interactive traffic, traffic with deadlines, and long-running traffic. To this end, custom transport protocols and traffic management techniques have been developed to improve datacenter network performance. In this tutorial paper, we review the general architecture of datacenter networks, various topologies proposed for them, their traffic properties, general traffic control challenges in datacenters, and general traffic control objectives. The purpose of this paper is to bring out the important characteristics of traffic control in datacenters, not to survey all existing solutions (which is virtually impossible given the massive body of existing research). We hope to provide readers with a wide range of options and factors to consider when evaluating a variety of traffic control mechanisms. We discuss various characteristics of datacenter traffic control including management schemes, transmission control, traffic shaping, prioritization, load balancing, multipathing, and traffic scheduling. Next, we point to several open challenges as well as new and interesting networking paradigms. At the end of this paper, we briefly review inter-datacenter networks, which connect geographically dispersed datacenters, have been receiving increasing attention recently, and pose interesting and novel research problems.
    Comment: Accepted for Publication in IEEE Communications Surveys and Tutorial

    PABO: Mitigating Congestion via Packet Bounce in Data Center Networks

    In today's data centers, a diverse mix of throughput-sensitive long flows and delay-sensitive short flows is commonly present in shallow-buffered switches. Long flows can potentially block the transmission of delay-sensitive short flows, leading to degraded performance. Congestion can also be caused by the synchronization of multiple TCP connections for short flows, as typically seen in the partition/aggregate traffic pattern. While multiple end-to-end transport-layer solutions have been proposed, none of them have tackled the real challenge: reliable transmission in the network. In this paper, we fill this gap by presenting PABO -- a novel link-layer design that can mitigate congestion by temporarily bouncing packets to upstream switches. PABO's design fulfills the following goals: i) providing per-flow flow control on the link layer, ii) handling transient congestion without the intervention of end devices, and iii) gradually back-propagating the congestion signal to the source when the network is not capable of handling the congestion. Experimental results show that PABO provides a prominent advantage in mitigating transient congestion and achieves a significant gain in end-to-end delay.
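    The bouncing idea can be pictured with a small, purely illustrative C++ sketch (class and field names below are invented for illustration and are not PABO's actual design): when a packet arrives at an output queue that is already full, the switch sends it back towards the port it came from instead of dropping it, absorbing transient congestion inside the network.

        #include <cstddef>
        #include <deque>

        struct Packet { int flow_id; int arrival_port; };

        // Toy model of one switch output port that bounces packets upstream
        // when its queue is full (an illustration of the idea only).
        struct SwitchPort {
            std::deque<Packet> queue;
            std::size_t capacity;

            explicit SwitchPort(std::size_t cap) : capacity(cap) {}

            // Returns true if the packet was queued, false if it was bounced.
            bool enqueue_or_bounce(const Packet& p) {
                if (queue.size() < capacity) {
                    queue.push_back(p);     // normal forwarding path
                    return true;
                }
                bounce_upstream(p);         // transient congestion: send back
                return false;
            }

            void bounce_upstream(const Packet& p) {
                // A real switch would re-inject the packet on p.arrival_port so
                // that the upstream switch buffers it; left as a stub here.
                (void)p;
            }
        };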

    Efficient Micro-Mobility using Intra-domain Multicast-based Mechanisms (M&M)

    One of the most important metrics in the design of IP mobility protocols is handover performance. The current Mobile IP (MIP) standard has been shown to exhibit poor handover performance. Most other work attempts to modify MIP to slightly improve its efficiency, while others propose complex techniques to replace MIP. Rather than taking these approaches, we instead propose a new architecture for providing efficient and smooth handover, while being able to co-exist and inter-operate with other technologies. Specifically, we propose an intra-domain multicast-based mobility architecture, where a visiting mobile is assigned a multicast address to use while moving within a domain. Efficient handover is achieved using standard multicast join/prune mechanisms. Two approaches are proposed and contrasted. The first introduces the concept of proxy-based mobility, while the other uses algorithmic mapping to obtain the multicast address of visiting mobiles. We show that the algorithmic mapping approach has several advantages over the proxy approach, and provide mechanisms to support it. Network simulation (using NS-2) is used to evaluate our scheme and compare it to other routing-based micro-mobility schemes, CIP and HAWAII. The proactive handover results show that both M&M and CIP exhibit low handoff delay and packet reordering depth as compared to HAWAII. The reason for M&M's comparable performance with CIP is that both use bi-cast in proactive handover. M&M, however, handles multiple border routers in a domain, where CIP fails. We also provide a handover algorithm leveraging the proactive path setup capability of M&M, which is expected to outperform CIP in the case of reactive handover.
    Comment: 12 pages, 11 figure
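    The "algorithmic mapping" idea can be illustrated with a short, hypothetical C++ sketch (the concrete mapping below is invented for illustration and is not the mapping defined in the paper): a visiting mobile's home address is mapped deterministically into an administratively scoped multicast block, so every router in the domain can compute the mobile's multicast address without consulting a proxy.

        #include <cstdint>

        // Hypothetical mapping of a 32-bit IPv4 home address into the
        // administratively scoped multicast block 239.0.0.0/8. Any router in
        // the domain applying the same function obtains the same group address.
        uint32_t home_to_multicast(uint32_t home_addr) {
            return (239u << 24) | (home_addr & 0x00FFFFFFu);  // keep the low 24 bits
        }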

    Parallel structurally-symmetric sparse matrix-vector products on multi-core processors

    We consider the problem of developing an efficient multi-threaded implementation of the matrix-vector multiplication algorithm for sparse matrices with structural symmetry. Matrices are stored using the compressed sparse row-column format (CSRC), designed to profit from the symmetric non-zero pattern observed in global finite element matrices. Unlike classical compressed storage formats, performing the sparse matrix-vector product using the CSRC requires thread-safe access to the destination vector. To avoid race conditions, we have implemented two partitioning strategies. In the first one, each thread allocates an array for storing its contributions, which are later combined in an accumulation step. We analyze how to perform this accumulation in four different ways. The second strategy employs a coloring algorithm for grouping rows that can be concurrently processed by threads. Our results indicate that, although incurring an increase in the working set size, the former approach leads to the best performance improvements for most matrices.
    Comment: 17 pages, 17 figures, reviewed related work section, fixed typo
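    The first partitioning strategy can be sketched as follows in C++ with OpenMP (the storage layout and array names are our assumptions for illustration, not the paper's CSRC definition): the row part of each product is race-free because each row is owned by one thread, while the symmetric column contributions go into per-thread buffers that are combined in a final accumulation step.

        #include <omp.h>
        #include <vector>

        // y = A*x for a structurally symmetric matrix stored as its strict lower
        // triangle in CSR-like arrays (row_ptr, col_idx) with the lower value
        // val_l[k] = a(i,j) and its symmetric counterpart val_u[k] = a(j,i),
        // plus the diagonal. y must already have size n.
        void spmv_sym(int n,
                      const std::vector<int>& row_ptr,
                      const std::vector<int>& col_idx,
                      const std::vector<double>& diag,
                      const std::vector<double>& val_l,
                      const std::vector<double>& val_u,
                      const std::vector<double>& x,
                      std::vector<double>& y)
        {
            const int nthreads = omp_get_max_threads();
            std::vector<std::vector<double>> buf(nthreads,
                                                 std::vector<double>(n, 0.0));

            #pragma omp parallel
            {
                std::vector<double>& local = buf[omp_get_thread_num()];
                #pragma omp for
                for (int i = 0; i < n; ++i) {
                    double yi = diag[i] * x[i];
                    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
                        const int j = col_idx[k];
                        yi       += val_l[k] * x[j];   // row part: only this thread writes row i
                        local[j] += val_u[k] * x[i];   // column part: buffered to avoid races
                    }
                    y[i] = yi;
                }
            }

            // Accumulation step: combine the per-thread buffers into y.
            for (int t = 0; t < nthreads; ++t)
                for (int i = 0; i < n; ++i)
                    y[i] += buf[t][i];
        }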

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increased data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolving interface contention, or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
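    One representative transformation of this kind can be sketched as follows, assuming a Vitis/Vivado-style HLS toolflow (pragma spellings differ between vendors): a single floating-point accumulator creates a loop-carried dependency whose latency limits pipelining, so the accumulator is split into several partial sums that are reduced after the loop, letting the tool pipeline the main loop with an initiation interval of 1.

        // Accumulation interleaving sketch (Vitis/Vivado-style pragmas assumed).
        constexpr int K = 8;   // should cover the latency of the floating-point adder

        float reduce(const float* in, int n) {
            float partial[K] = {0.0f};
        #pragma HLS ARRAY_PARTITION variable=partial complete

            for (int i = 0; i < n; ++i) {
        #pragma HLS PIPELINE II=1
                partial[i % K] += in[i];   // each partial sum is updated only every K iterations
            }

            float sum = 0.0f;
            for (int k = 0; k < K; ++k)    // short final reduction
                sum += partial[k];
            return sum;
        }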

    A multi-paradigm language for reactive synthesis

    This paper proposes a language for describing reactive synthesis problems that integrates imperative and declarative elements. The semantics is defined in terms of two-player turn-based infinite games with full information. Currently, synthesis tools accept linear temporal logic (LTL) as input, but this description is less structured and does not facilitate the expression of sequential constraints. This motivates the use of a structured programming language to specify synthesis problems. Transition systems and guarded commands serve as imperative constructs, expressed in a syntax based on that of the modeling language Promela. The syntax allows defining which player controls data and control flow, and separating a program into assumptions and guarantees. These notions are necessary for input to game solvers. The integration of imperative and declarative paradigms allows using the paradigm that is most appropriate for expressing each requirement. The declarative part is expressed in the LTL fragment of generalized reactivity(1), which admits efficient synthesis algorithms, extended with past LTL. The implementation translates Promela to input for the Slugs synthesizer and is written in Python. The AMBA AHB bus case study is revisited and synthesized efficiently, identifying the need to reorder binary decision diagrams during strategy construction, in order to prevent the exponential blowup observed in previous work.
    Comment: In Proceedings SYNT 2015, arXiv:1602.0078
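    For context, specifications in the generalized reactivity(1) fragment mentioned above take the standard assume-guarantee shape, which in LaTeX notation reads:

        \[
          \Bigl(\theta_e \wedge \square\,\rho_e \wedge \bigwedge_{i=1}^{m} \square\lozenge J^e_i\Bigr)
          \;\rightarrow\;
          \Bigl(\theta_s \wedge \square\,\rho_s \wedge \bigwedge_{j=1}^{n} \square\lozenge J^s_j\Bigr)
        \]

    Here \theta_e and \theta_s are initial conditions, \rho_e and \rho_s are transition (safety) constraints of the environment and the system, and J^e_i and J^s_j are their recurrence (liveness) goals; separating a program into assumptions and guarantees, as the language's syntax allows, corresponds to the two sides of this implication.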