Software lock elision for x86 machine code
More than a decade after transactional memory became a topic of intense research,
there is neither transactional memory hardware nor any example of software
transactional memory use outside the research community. Using software transactional memory in large
pieces of software needs copious source code annotations and often means
that standard compilers and debuggers can no longer be used. At the same time,
overheads associated with software transactional memory fail to motivate
programmers to expend the needed effort to use software transactional
memory. The only way around the overheads in the case of general unmanaged code
is the anticipated availability of hardware support. On the other hand, architects
are unwilling to devote power and area budgets in mainstream microprocessors to
hardware transactional memory, pointing to transactional memory being a
"niche" programming construct. A deadlock has thus ensued that is blocking
transactional memory use and experimentation in the mainstream.
This dissertation covers the design and construction of a software transactional
memory runtime system called SLE_x86 that can potentially break this
deadlock by decoupling transactional memory from the programs that use it. Unlike
most other STM designs, SLE_x86 takes transparency, rather than performance, as its
core design principle. It operates at the level of x86 machine code, which makes it
immediately applicable to binaries for the popular x86 architecture. The only
requirement is that the binary synchronise using known
locking constructs or calls such as those in Pthreads or OpenMP
libraries. SLE_x86 provides speculative lock elision (SLE) entirely in
software, executing critical sections in the binary using transactional
memory. Optionally, the critical sections can also be executed without using
transactions by acquiring the protecting lock.
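As a rough illustration of speculative lock elision with a lock-acquiring fallback, here is a minimal Python sketch. It is not SLE_x86's design (which works on x86 machine code), and everything in it is hypothetical: `ElidedLock`, `Transaction`, `critical_section` and `max_attempts` are invented names, a Python dict stands in for memory, and commits are serialised by the original lock as a simplification. The critical section first runs with buffered writes and a log of the values it read; at commit the logged values are re-checked (value-based validation), and on a mismatch the section is retried, falling back to acquiring the lock after `max_attempts` failures. The body must therefore be safe to re-execute.

```python
import threading


class Transaction:
    """Speculative execution context: buffers writes, logs first-read values."""

    def __init__(self, memory):
        self.memory = memory
        self.reads = {}   # key -> value observed on first read (for validation)
        self.writes = {}  # key -> buffered value, invisible to other threads

    def read(self, key):
        if key in self.writes:            # read-your-own-writes
            return self.writes[key]
        if key not in self.reads:         # log only the first observation
            self.reads[key] = self.memory.get(key)
        return self.reads[key]

    def write(self, key, value):
        self.writes[key] = value


class ElidedLock:
    """Hypothetical elided lock: run the critical section speculatively and
    fall back to really acquiring the lock after repeated aborts."""

    def __init__(self, max_attempts=4):
        self._lock = threading.Lock()     # the lock the program originally used
        self.max_attempts = max_attempts
        self.fallbacks = 0                # how often the lock had to be taken

    def critical_section(self, memory, body):
        for _ in range(self.max_attempts):
            tx = Transaction(memory)
            body(tx.read, tx.write)       # runs without holding the lock
            # Simplification: commit under the lock. Validate that every value
            # read is unchanged, then publish the buffered writes at once.
            with self._lock:
                if all(memory.get(k) == v for k, v in tx.reads.items()):
                    memory.update(tx.writes)
                    return
            # Validation failed: a concurrent commit changed the read set; retry.
        with self._lock:                  # fallback path: plain mutual exclusion
            self.fallbacks += 1
            body(memory.get, memory.__setitem__)


if __name__ == "__main__":
    shared = {"counter": 0}
    lock = ElidedLock()

    def increment(read, write):
        write("counter", read("counter") + 1)

    def worker():
        for _ in range(1000):
            lock.critical_section(shared, increment)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(shared["counter"])  # 4000: no increment is lost
```

Because every commit and the fallback path hold the same lock, no update is lost. A real elision runtime would commit without acquiring the lock and would typically abort speculative sections while some thread actually holds it.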
The dissertation carefully analyses the performance impact of two demands: the x86
memory consistency model and the need to transparently instrument x86 machine code.
It shows that both problems can be overcome to reach a reasonable level of
performance, at which transparent software transactional memory can outperform a
lock. SLE_x86 can ensure that programs are ready for transactional memory in any
form, without being explicitly written for it.
Uniqueness of Optimal Mod 3 Circuits for Parity
We prove that the quadratic polynomials modulo 3
with the largest correlation with parity are unique up to
permutation of variables and constant factors. As a consequence of
our result, we completely characterize the smallest
$\mathrm{MAJ} \circ \mathrm{MOD}_3 \circ \mathrm{AND}_2$ circuits that compute parity, where a
$\mathrm{MAJ} \circ \mathrm{MOD}_3 \circ \mathrm{AND}_2$ circuit is one that has a
majority gate as output, a middle layer of $\mathrm{MOD}_3$ gates and a
bottom layer of AND gates of fan-in 2. We
also prove that the sub-optimal circuits exhibit a stepped behavior:
any sub-optimal circuit of this class that computes parity
must exceed the optimal size by at least a fixed constant factor.
This verifies, for the special case of $m = 3$,
two conjectures made
by Dueñez, Miller, Roy and Straubing (Journal of Number Theory, 2006) for general $\mathrm{MAJ} \circ \mathrm{MOD}_m \circ \mathrm{AND}_2$ circuits for any odd $m$. The correlation
and circuit bounds are obtained by studying the associated
exponential sums, based on some of the techniques developed
by Green (JCSS, 2004). We regard this as a step towards
obtaining tighter bounds both for the quadratic
case and for
higher degrees.
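To make this circuit class concrete, the following Python sketch evaluates a majority-of-MOD_3-of-AND_2 circuit and checks by brute force whether it computes parity. The representation (a list of `(const, terms)` gates), the function names and the conventions are assumptions of this sketch, not taken from the paper: a MOD_3 gate is taken to output 1 iff its input sum is 0 mod 3, constant-1 input wires are allowed, and a tie at the majority gate outputs 0. Each MOD_3 gate over AND_2 gates corresponds to a quadratic polynomial f mod 3, the gate firing when f(x) ≡ 0 (mod 3). The sketch only checks a given circuit; it does not search for optimal ones.

```python
from itertools import product


def mod3_gate(const, terms, x):
    """One MOD_3 gate fed by AND_2 gates.

    Each (i, j) in `terms` is an AND gate of fan-in 2 on x[i], x[j]; i == j
    passes x[i] through, since x AND x = x. `const` counts constant-1 input
    wires. Assumed convention: output 1 iff the input sum is 0 mod 3.
    """
    total = const + sum(x[i] & x[j] for i, j in terms)
    return 1 if total % 3 == 0 else 0


def eval_circuit(gates, x):
    """Majority of the MOD_3 gate outputs (strict majority; a tie gives 0)."""
    ones = sum(mod3_gate(const, terms, x) for const, terms in gates)
    return 1 if 2 * ones > len(gates) else 0


def computes_parity(gates, n):
    """Brute force over all 2^n inputs; only feasible for small n."""
    return all(eval_circuit(gates, x) == sum(x) % 2
               for x in product((0, 1), repeat=n))


if __name__ == "__main__":
    # A single MOD_3 gate over 1 + 2*x0 + 2*x1: the sum is 0 mod 3 exactly
    # when x0 + x1 == 1, so this tiny circuit computes parity on 2 bits.
    gate = (1, [(0, 0), (0, 0), (1, 1), (1, 1)])
    print(computes_parity([gate], 2))  # True
```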
ALLARM: Optimizing Sparse Directories for Thread-Local Data
Large-scale cache-coherent systems often impose unnecessary overhead on data that is thread-private for the whole of its lifetime. This overhead includes resources devoted to tracking the coherence state of the data, as well as unnecessary coherence messages sent over the interconnect. In this paper we show how the memory allocation strategy for non-uniform memory access (NUMA) systems can be exploited to remove all coherence-related traffic for thread-local data, as well as the need to track those cache lines in sparse directories. Our strategy is to allocate directory state only on a miss from a node in a different affinity domain from the directory. We call this ALLocAte on Remote Miss, or ALLARM. Our solution is entirely backward compatible with existing operating systems and software, and provides a means to scale cache coherence into the many-core era. On a mix of SPLASH2 and Parsec workloads, ALLARM improves performance by 13% on average while reducing dynamic energy consumption by 9% in the on-chip network and 15% in the directory controller. This is achieved through a 46% reduction in the number of sparse directory entries evicted.
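The ALLARM allocation rule can be illustrated with a toy model. This Python sketch is hypothetical (`SparseDirectory`, `entries_needed` and the trace format are invented) and models only which cache lines receive sparse-directory entries; it does not model capacity, evictions, or how ALLARM keeps untracked local copies coherent. A directory in affinity domain 0 allocates state only when a miss comes from another domain, so lines touched only by local threads never occupy an entry.

```python
from dataclasses import dataclass, field


@dataclass
class SparseDirectory:
    """Toy model of one NUMA node's sparse directory (allocation only).

    Capacity limits, evictions and the coherence protocol itself are not
    modelled; we only record which cache lines receive a directory entry.
    """
    domain: int                 # affinity domain this directory belongs to
    allarm: bool = True         # False = baseline: allocate on every miss
    entries: set = field(default_factory=set)

    def on_miss(self, line, requester_domain):
        """Apply the allocation policy for a single miss."""
        if self.allarm and requester_domain == self.domain:
            return              # local (e.g. thread-private) miss: no entry
        self.entries.add(line)


def entries_needed(misses, allarm, home_domain=0):
    """Replay (line, requester_domain) misses and count allocated entries."""
    directory = SparseDirectory(domain=home_domain, allarm=allarm)
    for line, requester in misses:
        directory.on_miss(line, requester)
    return len(directory.entries)


if __name__ == "__main__":
    # Ten thread-local lines touched only from domain 0, plus two lines
    # shared with a thread in domain 1.
    trace = [(line, 0) for line in range(10)] + [(100, 1), (101, 1)]
    print(entries_needed(trace, allarm=False))  # 12
    print(entries_needed(trace, allarm=True))   # 2
```

In this trace the baseline allocates twelve entries and the ALLARM rule allocates two. In a finite sparse directory, fewer allocated entries leave more room for shared lines, which is consistent with the reduction in evictions the abstract reports.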
Incomplete Quadratic Exponential Sums in Several Variables
We consider incomplete exponential sums in several variables of the form
$$S(f,n,m) = \frac{1}{2^n} \sum_{x_1 \in \{-1,1\}} \cdots \sum_{x_n \in \{-1,1\}} x_1 \cdots x_n \, e^{2\pi i f(x)/m},$$
where m > 1 is odd and f is a polynomial of degree
d with coefficients in Z/mZ. We investigate the conjecture, originating in a
problem in computational complexity, that for each fixed d and m the maximum
norm of S(f,n,m) converges exponentially fast to 0 as n grows to infinity. The
conjecture is known to hold in the case when m=3 and d=2, but existing methods
for studying incomplete exponential sums appear to be insufficient to resolve
the question for an arbitrary odd modulus m, even when d=2. In the present
paper we develop three separate techniques for studying the problem in the case
of quadratic f, each of which establishes a different special case of the
conjecture. We show that a bound of the required sort holds for almost all
quadratic polynomials, a stronger form of the conjecture holds for all
quadratic polynomials with no more than 10 variables, and for arbitrarily many
variables the conjecture is true for a class of quadratic polynomials having a
special form. Comment: 31 pages (minor corrections from original draft, references to new
results in the subject, publication information).
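For small n, S(f,n,m) can be evaluated directly, which makes the conjecture easy to probe numerically. The Python sketch below is a brute-force check, not one of the paper's techniques; it takes the exponent as f(x)/m because f has coefficients in Z/mZ, and `S` and `max_norm_quadratic` are illustrative names. Because x_i^2 = 1 on {-1,1}^n, a quadratic f reduces to linear and cross terms plus a constant, and the constant only multiplies S by a unit complex number. The search visits m^(n(n+1)/2) polynomials at 2^n points each, so only very small n are feasible.

```python
import cmath
from itertools import combinations, product
from math import prod


def S(f, n, m):
    """(1/2^n) * sum over x in {-1,1}^n of x_1*...*x_n * e^(2*pi*i*f(x)/m)."""
    total = 0j
    for x in product((-1, 1), repeat=n):
        total += prod(x) * cmath.exp(2j * cmath.pi * (f(x) % m) / m)
    return total / 2 ** n


def max_norm_quadratic(n, m):
    """Max |S(f, n, m)| over quadratic f with coefficients in Z/mZ.

    On {-1,1}^n, x_i^2 = 1, so a quadratic f reduces to linear terms a_i*x_i,
    cross terms b_ij*x_i*x_j and a constant; the constant only rotates S by a
    unit complex number, so it is dropped.
    """
    pairs = list(combinations(range(n), 2))
    best = 0.0
    for lin in product(range(m), repeat=n):
        for quad in product(range(m), repeat=len(pairs)):
            def f(x, lin=lin, quad=quad):
                return (sum(a * xi for a, xi in zip(lin, x))
                        + sum(b * x[i] * x[j] for b, (i, j) in zip(quad, pairs)))
            best = max(best, abs(S(f, n, m)))
    return best


if __name__ == "__main__":
    # f = x_1 + ... + x_n factorises, giving |S| = (sqrt(3)/2)^n for m = 3.
    for n in range(1, 5):
        print(n, abs(S(lambda x: sum(x), n, 3)), (3 ** 0.5 / 2) ** n)
    print(max_norm_quadratic(3, 3))
```

For f = x_1 + ... + x_n and m = 3 the sum splits into n identical one-variable sums, each of modulus √3/2, so |S| = (√3/2)^n. That is a single example of exponential decay; the conjecture concerns the maximum over all f of degree d.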
Bounds on an exponential sum arising in Boolean circuit complexity
We study exponential sums of the form $S = 2^{-n} \sum_{x \in \{0,1\}^n} e_m(h(x))\, e_q(p(x))$, where $e_k(y) = e^{2\pi i y/k}$, $m, q \in \mathbb{Z}^+$ are relatively prime, $p$ is a polynomial with coefficients in $\mathbb{Z}_q$, and $h(x) = a(x_1 + \cdots + x_n)$ for some $1 \le a < m$. We prove an upper bound of the form $2^{-\Omega(n)}$ on $|S|$. This generalizes a result of J. Bourgain, who established this bound in the case where q is odd. This bound has consequences in Boolean circuit complexity. © Académie des sciences. Published by Elsevier SAS. All rights reserved.
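The sum S can likewise be computed by brute force for small n. In this Python sketch, `circuit_sum` and `e` are illustrative names, p is passed as any integer-valued function of x, and the check on gcd(m, q) and a mirrors the hypotheses stated in the abstract. It evaluates S up to floating-point error but says nothing about the 2^{-Ω(n)} bound for large n.

```python
import cmath
from itertools import product
from math import gcd


def e(k, y):
    """e_k(y) = exp(2*pi*i*y/k)."""
    return cmath.exp(2j * cmath.pi * (y % k) / k)


def circuit_sum(n, m, q, a, p):
    """S = 2^-n * sum over x in {0,1}^n of e_m(a*(x_1+...+x_n)) * e_q(p(x))."""
    if gcd(m, q) != 1 or not 1 <= a < m:
        raise ValueError("need gcd(m, q) = 1 and 1 <= a < m")
    total = 0j
    for x in product((0, 1), repeat=n):
        total += e(m, a * sum(x)) * e(q, p(x))
    return total / 2 ** n


def pair_products(x):
    """Example p over Z_2: sum of x_i * x_j over all pairs i < j."""
    return sum(x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x)))


if __name__ == "__main__":
    # With p = 0 the sum factorises: S = ((1 + e_m(a)) / 2)^n.
    for n in range(1, 6):
        print(n, abs(circuit_sum(n, 3, 2, 1, lambda x: 0)))
    print(abs(circuit_sum(8, 3, 2, 1, pair_products)))
```

When p is identically zero the sum factorises as ((1 + e_m(a))/2)^n; with m = 3 and a = 1, |1 + e_3(1)| = 1, so |S| is exactly 2^{-n}.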