Admit your weakness: Verifying correctness on TSO architectures
“The final publication is available at http://link.springer.com/chapter/10.1007%2F978-3-319-15317-9_22”.
Linearizability has become the standard correctness criterion for fine-grained non-atomic concurrent algorithms; however, most approaches assume a sequentially consistent memory model, which is not always realised in practice. In this paper we study the correctness of concurrent algorithms on a weak memory model: the TSO (Total Store Order) memory model, which is commonly implemented by multicore architectures. Here, linearizability is often too strict, and hence we prove a weaker criterion, quiescent consistency, instead. Like linearizability, quiescent consistency is compositional, making it an ideal correctness criterion in a component-based context. We demonstrate how to model a typical concurrent algorithm, seqlock, and prove it quiescent consistent using a simulation-based approach. Previous approaches to proving correctness on TSO architectures have been based on linearizability, which makes it necessary to modify the algorithm’s high-level requirements. Our approach is the first, to our knowledge, for proving correctness without the need for such a modification.
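For readers unfamiliar with seqlock, the following is a minimal sketch of a common textbook formulation with a single writer. It is not taken from the paper and is not the paper's formal model. The names `SeqLock`, `c`, `x1`, `x2` and `run_demo` are illustrative assumptions. Python threads do not expose TSO store-buffer effects, so the sketch only shows the algorithm's intended behaviour under sequentially consistent execution.

```python
import threading


class SeqLock:
    """Single-writer seqlock protecting a pair of values (illustrative sketch)."""

    def __init__(self):
        self.c = 0   # sequence counter; an even value means no write is in progress
        self.x1 = 0
        self.x2 = 0

    def write(self, d1, d2):
        # Assumes exactly one writer thread, so no writer-side lock is needed.
        self.c += 1      # counter becomes odd: write in progress
        self.x1 = d1
        self.x2 = d2
        self.c += 1      # counter becomes even again: write complete

    def read(self):
        while True:
            c0 = self.c
            if c0 % 2:               # a write is in progress, retry
                continue
            d1, d2 = self.x1, self.x2
            if self.c == c0:         # no write started in between
                return d1, d2


def run_demo(writes=1000, reads=1000, readers=2):
    """Run one writer and several readers; return torn reads and the final pair."""
    lock = SeqLock()
    torn = []

    def writer():
        for i in range(1, writes + 1):
            lock.write(i, i)         # both fields always receive the same value

    def reader():
        for _ in range(reads):
            d1, d2 = lock.read()
            if d1 != d2:             # a mismatched pair would be a torn read
                torn.append((d1, d2))

    threads = [threading.Thread(target=writer)]
    threads += [threading.Thread(target=reader) for _ in range(readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return torn, lock.read()
```

Calling `run_demo()` returns the list of torn reads, which should be empty, together with the last pair written.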
Property-Driven Fence Insertion using Reorder Bounded Model Checking
Modern architectures provide weaker memory consistency guarantees than
sequential consistency. These weaker guarantees allow programs to exhibit
behaviours where the program statements appear to have executed out of program
order. Fortunately, modern architectures provide memory barriers (fences) to
enforce the program order between a pair of statements if needed. Due to the
intricate semantics of weak memory models, the placement of fences is
challenging even for experienced programmers. Too few fences lead to bugs
whereas overuse of fences results in performance degradation. This motivates
automated placement of fences. Tools that restore sequential consistency in the
program may insert more fences than necessary for the program to be correct.
Therefore, we propose a property-driven technique that introduces
"reorder-bounded exploration" to identify the smallest number of program
locations for fence placement. We implemented our technique on top of CBMC;
however, in principle, our technique is generic enough to be used with any
model checker. Our experimental results show that our technique is faster and
solves more instances of relevant benchmarks as compared to earlier approaches.
Comment: 18 pages, 3 figures, 4 algorithms. Version change reason: new set of
results and publication ready version of FM 201
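To make the idea of property-driven fence placement concrete, here is a small sketch that enumerates all executions of the classic store-buffering litmus test under a simplified TSO model (one FIFO store buffer per thread) and then searches by brute force for the smallest set of fence locations that rules out a given bad outcome. The brute-force search is a stand-in chosen for illustration; it is not the reorder-bounded technique or the CBMC-based implementation described in the abstract. The instruction encoding and the function names are assumptions.

```python
from itertools import combinations

# Instructions: ("store", var, value), ("load", var, register), ("fence",)


def outcomes(threads):
    """Return the set of final register tuples reachable under a simple TSO model."""
    n = len(threads)
    results, seen = set(), set()

    def explore(pcs, buffers, memory, regs):
        state = (pcs, buffers, memory, regs)
        if state in seen:
            return
        seen.add(state)
        finished = all(pcs[t] == len(threads[t]) for t in range(n))
        if finished and not any(buffers):
            results.add(regs)
            return
        for t in range(n):
            # Choice 1: flush the oldest buffered store of thread t to memory.
            if buffers[t]:
                var, val = buffers[t][0]
                new_mem = tuple(sorted({**dict(memory), var: val}.items()))
                new_bufs = buffers[:t] + (buffers[t][1:],) + buffers[t + 1:]
                explore(pcs, new_bufs, new_mem, regs)
            # Choice 2: execute the next instruction of thread t.
            if pcs[t] < len(threads[t]):
                ins = threads[t][pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if ins[0] == "store":
                    buf = buffers[t] + ((ins[1], ins[2]),)
                    explore(new_pcs, buffers[:t] + (buf,) + buffers[t + 1:], memory, regs)
                elif ins[0] == "load":
                    val = dict(memory).get(ins[1], 0)   # variables start at 0
                    for var, v in buffers[t]:           # own newest store wins
                        if var == ins[1]:
                            val = v
                    new_regs = tuple(sorted({**dict(regs), ins[2]: val}.items()))
                    explore(new_pcs, buffers, memory, new_regs)
                elif ins[0] == "fence" and not buffers[t]:
                    # A fence may only complete once the store buffer is empty.
                    explore(new_pcs, buffers, memory, regs)

    explore((0,) * n, ((),) * n, (), ())
    # Registers are sorted by name, so each outcome is (r0, r1, ...).
    return {tuple(v for _, v in r) for r in results}


def with_fences(threads, locations):
    """Insert a fence before instruction i of thread t for each (t, i) in locations."""
    return [
        [x for i, ins in enumerate(prog)
         for x in ((("fence",), ins) if (t, i) in locations else (ins,))]
        for t, prog in enumerate(threads)
    ]


def smallest_fence_set(threads, bad):
    """Brute-force search for a smallest fence set that excludes every bad outcome."""
    candidates = [(t, i) for t, prog in enumerate(threads) for i in range(1, len(prog))]
    for k in range(len(candidates) + 1):
        for locs in combinations(candidates, k):
            if not any(bad(o) for o in outcomes(with_fences(threads, set(locs)))):
                return set(locs)
    return None


# Store-buffering litmus test: the property forbids r0 == 0 and r1 == 0.
SB = [
    [("store", "x", 1), ("load", "y", "r0")],
    [("store", "y", 1), ("load", "x", "r1")],
]
```

With `SB` as input, `smallest_fence_set(SB, lambda o: o == (0, 0))` places one fence between the store and the load in each thread; a single fence is not enough in this model.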
Datacenter Traffic Control: Understanding Techniques and Trade-offs
Datacenters provide cost-effective and flexible access to scalable compute
and storage resources necessary for today's cloud computing needs. A typical
datacenter is made up of thousands of servers connected with a large network
and usually managed by one operator. To provide quality access to the variety
of applications and services hosted on datacenters and maximize performance, it
is necessary to use datacenter networks effectively and efficiently.
Datacenter traffic is often a mix of several classes with different priorities
and requirements. This includes user-generated interactive traffic, traffic
with deadlines, and long-running traffic. To this end, custom transport
protocols and traffic management techniques have been developed to improve
datacenter network performance.
In this tutorial paper, we review the general architecture of datacenter
networks, various topologies proposed for them, their traffic properties,
general traffic control challenges in datacenters and general traffic control
objectives. The purpose of this paper is to bring out the important
characteristics of traffic control in datacenters and not to survey all
existing solutions (as it is virtually impossible due to the massive body of
existing research). We hope to provide readers with a wide range of options and
factors while considering a variety of traffic control mechanisms. We discuss
various characteristics of datacenter traffic control including management
schemes, transmission control, traffic shaping, prioritization, load balancing,
multipathing, and traffic scheduling. Next, we point to several open challenges
as well as new and interesting networking paradigms. At the end of this paper,
we briefly review inter-datacenter networks that connect geographically
dispersed datacenters, which have been receiving increasing attention recently
and pose interesting and novel research problems.
Comment: Accepted for Publication in IEEE Communications Surveys and Tutorial
PABO: Mitigating Congestion via Packet Bounce in Data Center Networks
In today's data centers, a diverse mix of throughput-sensitive long flows and
delay-sensitive short flows is commonly present in shallow-buffered
switches. Long flows could potentially block the transmission of
delay-sensitive short flows, leading to degraded performance. Congestion can
also be caused by the synchronization of multiple TCP connections for short
flows, as typically seen in the partition/aggregate traffic pattern. While
multiple end-to-end transport-layer solutions have been proposed, none of them
have tackled the real challenge: reliable transmission in the network. In this
paper, we fill this gap by presenting PABO -- a novel link-layer design that
can mitigate congestion by temporarily bouncing packets to upstream switches.
PABO's design fulfills the following goals: i) providing per-flow based flow
control on the link layer, ii) handling transient congestion without the
intervention of end devices, and iii) gradually back propagating the congestion
signal to the source when the network is not capable of handling the
congestion. Experimental results show that PABO offers a clear advantage in
mitigating transient congestion and achieves a significant improvement in
end-to-end delay.
Efficient Micro-Mobility using Intra-domain Multicast-based Mechanisms (M&M)
One of the most important metrics in the design of IP mobility protocols is
the handover performance. The current Mobile IP (MIP) standard has been shown
to exhibit poor handover performance. Most other work attempts to modify MIP to
slightly improve its efficiency, while others propose complex techniques to
replace MIP. Rather than taking these approaches, we instead propose a new
architecture for providing efficient and smooth handover, while being able to
co-exist and inter-operate with other technologies. Specifically, we propose an
intra-domain multicast-based mobility architecture, where a visiting mobile is
assigned a multicast address to use while moving within a domain. Efficient
handover is achieved using standard multicast join/prune mechanisms. Two
approaches are proposed and contrasted. The first introduces the concept of
proxy-based mobility, while the other uses algorithmic mapping to obtain the
multicast address of visiting mobiles. We show that the algorithmic mapping
approach has several advantages over the proxy approach, and provide mechanisms
to support it. Network simulation (using NS-2) is used to evaluate our scheme
and compare it to other routing-based micro-mobility schemes - CIP and HAWAII.
The proactive handover results show that both M&M and CIP show low handoff
delay and packet reordering depth as compared to HAWAII. The reason for M&M's
comparable performance with CIP is that both use bi-cast in proactive handover.
M&M, however, handles multiple border routers in a domain, where CIP fails.
We also provide a handover algorithm leveraging the proactive path setup
capability of M&M, which is expected to outperform CIP in case of reactive
handover.
Comment: 12 pages, 11 figure
Parallel structurally-symmetric sparse matrix-vector products on multi-core processors
We consider the problem of developing an efficient multi-threaded
implementation of the matrix-vector multiplication algorithm for sparse
matrices with structural symmetry. Matrices are stored using the compressed
sparse row-column format (CSRC), designed for profiting from the symmetric
non-zero pattern observed in global finite element matrices. Unlike classical
compressed storage formats, performing the sparse matrix-vector product using
the CSRC requires thread-safe access to the destination vector. To avoid race
conditions, we have implemented two partitioning strategies. In the first one,
each thread allocates an array for storing its contributions, which are later
combined in an accumulation step. We analyze how to perform this accumulation
in four different ways. The second strategy employs a coloring algorithm for
grouping rows that can be concurrently processed by threads. Our results
indicate that, although incurring an increase in the working set size, the
former approach leads to the best performance improvements for most matrices.
Comment: 17 pages, 17 figures, reviewed related work section, fixed typo
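The first partitioning strategy described above (private per-thread buffers followed by an accumulation step) can be sketched as follows. The storage layout is an assumption made for illustration: a diagonal array `ad`, CSR-style row pointers `ia` and column indices `ja` for the strictly lower triangle, lower values `al`, and the mirrored upper values `au` sharing the same indices. The function name `csrc_matvec` and the simple contiguous row partition are also assumptions, and the sketch uses only one straightforward way of accumulating, not the four variants the abstract mentions.

```python
from concurrent.futures import ThreadPoolExecutor


def csrc_matvec(n, ad, ia, ja, al, au, x, num_threads=2):
    """Compute y = A x for a structurally symmetric A in an assumed CSRC-like layout."""
    # Split the rows into contiguous blocks, one block per worker.
    bounds = [n * t // num_threads for t in range(num_threads + 1)]

    def worker(t):
        local = [0.0] * n                  # private buffer: no other thread writes here
        for i in range(bounds[t], bounds[t + 1]):
            acc = ad[i] * x[i]             # diagonal contribution
            for k in range(ia[i], ia[i + 1]):
                j = ja[k]                  # column index with j < i
                acc += al[k] * x[j]        # lower entry A[i][j] contributes to y[i]
                local[j] += au[k] * x[i]   # mirrored entry A[j][i] contributes to y[j]
            local[i] += acc
        return local

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(worker, range(num_threads)))

    # Accumulation step: combine the per-thread buffers into the result vector.
    return [sum(p[i] for p in partials) for i in range(n)]


# Example matrix with a symmetric non-zero pattern but unsymmetric values:
#     [4 1 0]
# A = [2 5 3]
#     [0 6 7]
AD = [4.0, 5.0, 7.0]
IA = [0, 0, 1, 2]
JA = [0, 1]
AL = [2.0, 6.0]   # A[1][0], A[2][1]
AU = [1.0, 3.0]   # A[0][1], A[1][2]
```

The write to `local[j]` is the update that would race on a shared destination vector, since row `j` may belong to another thread. Private buffers avoid the race at the cost of `num_threads` extra vectors of length `n`, which matches the increase in working set size noted in the abstract.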
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
A multi-paradigm language for reactive synthesis
This paper proposes a language for describing reactive synthesis problems
that integrates imperative and declarative elements. The semantics is defined
in terms of two-player turn-based infinite games with full information.
Currently, synthesis tools accept linear temporal logic (LTL) as input, but
this description is less structured and does not facilitate the expression of
sequential constraints. This motivates the use of a structured programming
language to specify synthesis problems. Transition systems and guarded commands
serve as imperative constructs, expressed in a syntax based on that of the
modeling language Promela. The syntax allows defining which player controls
data and control flow, and separating a program into assumptions and
guarantees. These notions are necessary for input to game solvers. The
integration of imperative and declarative paradigms allows using the paradigm
that is most appropriate for expressing each requirement. The declarative part
is expressed in the LTL fragment of generalized reactivity(1), which admits
efficient synthesis algorithms, extended with past LTL. The implementation
translates Promela to input for the Slugs synthesizer and is written in Python.
The AMBA AHB bus case study is revisited and synthesized efficiently,
identifying the need to reorder binary decision diagrams during strategy
construction, in order to prevent the exponential blowup observed in previous
work.
Comment: In Proceedings SYNT 2015, arXiv:1602.0078