Reversible GANs for Memory-efficient Image-to-Image Translation
The Pix2pix and CycleGAN losses have vastly improved the qualitative and
quantitative visual quality of results in image-to-image translation tasks. We
extend this framework by exploring approximately invertible architectures which
are well suited to these losses. These architectures are approximately
invertible by design and thus partially satisfy cycle-consistency before
training even begins. Furthermore, since invertible architectures have constant
memory complexity in depth, these models can be built arbitrarily deep. We are
able to demonstrate superior quantitative output on the Cityscapes and Maps
datasets at a near-constant memory budget.
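To illustrate why an invertible layer needs no stored activations, here is a minimal sketch of an additive coupling block, one common building block for invertible networks. Whether the paper uses exactly this block is an assumption; the names `coupling_forward`, `coupling_inverse`, `f`, and `g` are hypothetical and not taken from the abstract. Note that this block is exactly invertible up to floating-point error, whereas the abstract describes architectures that are only approximately invertible.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Fn = Callable[[Vector], Vector]


def coupling_forward(x1: Vector, x2: Vector, f: Fn, g: Fn) -> Tuple[Vector, Vector]:
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def coupling_inverse(y1: Vector, y2: Vector, f: Fn, g: Fn) -> Tuple[Vector, Vector]:
    """Recover the inputs from the outputs alone, so nothing has to be cached."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


# f and g may be arbitrary (even non-invertible) functions; these are placeholders.
f = lambda v: [2.0 * t for t in v]
g = lambda v: [t * t for t in v]

y1, y2 = coupling_forward([1.0, 2.0], [3.0, 4.0], f, g)
x1, x2 = coupling_inverse(y1, y2, f, g)
```

Because the inputs can be recomputed from the outputs during the backward pass, the activation memory of a stack of such blocks does not have to grow with depth, which is the property the abstract refers to.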
Moving the shared memory closer to the processors: DDM
Multiprocessors with shared memory are considered more general and easier
to program than message-passing machines. The scalability is, however, in favor
of the latter. There are a number of proposals showing how the poor scalability
of shared memory multiprocessors can be improved by the introduction of private
caches attached to the processors. These caches are kept consistent with each
other by cache-coherence protocols.
In this paper we introduce a new class of architectures called Cache Only
Memory Architectures (COMA). These architectures provide the programming
paradigm of the shared-memory architectures, but are believed to be more scalable.
COMAs have no physically shared memory; instead, the caches attached to
the processors contain all the memory in the system, and their size is therefore
large. A datum is allowed to be in any or many of the caches, and will automatically
be moved to where it is needed by a cache-coherence protocol, which also
ensures that the last copy of a datum is never lost. The location of a datum in
the machine is completely decoupled from its address.
We also introduce one example of COMA: the Data Diffusion Machine (DDM).
The DDM is based on a hierarchical network structure, with processor/memory
pairs at its tips. Remote accesses generally cause only a limited amount of traffic
over a limited part of the machine.
The architecture is scalable in that there can be any number of levels in the
hierarchy, and that the root bus of the hierarchy can be implemented by several
buses, increasing the bandwidth.
Parallelizing RRT on distributed-memory architectures
This paper addresses the problem of improving the performance of the Rapidly-exploring Random Tree (RRT) algorithm by parallelizing it. For scalability reasons we do so on a distributed-memory architecture, using the message-passing paradigm. We present three parallel versions of RRT along with the technicalities involved in their implementation. We also evaluate the algorithms and study how they behave on different motion planning problems.
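For reference, the sequential RRT loop that such parallel versions start from can be sketched as follows. This is only the textbook baseline, not any of the three parallel algorithms from the paper; the obstacle-free unit square, the function name `rrt`, and all parameters are assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def rrt(start: Point, n_iters: int, step: float,
        seed: int = 0) -> Tuple[List[Point], Dict[int, Optional[int]]]:
    """Grow a tree in the obstacle-free unit square (assumed workspace)."""
    rng = random.Random(seed)
    nodes: List[Point] = [start]
    parent: Dict[int, Optional[int]] = {0: None}  # node index -> parent index
    for _ in range(n_iters):
        # 1. Sample a random configuration.
        sample = (rng.random(), rng.random())
        # 2. Find the nearest tree node (linear scan; real planners use a spatial index).
        near_idx = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        near = nodes[near_idx]
        d = math.dist(near, sample)
        if d == 0.0:
            continue
        # 3. Steer from the nearest node toward the sample by at most `step`.
        scale = min(step, d) / d
        new = (near[0] + (sample[0] - near[0]) * scale,
               near[1] + (sample[1] - near[1]) * scale)
        # 4. Add the new node (a collision check would go here).
        parent[len(nodes)] = near_idx
        nodes.append(new)
    return nodes, parent


nodes, parent = rrt(start=(0.5, 0.5), n_iters=200, step=0.05)
```

The nearest-neighbour search and the collision check omitted here are typically the costly steps, which is why they are natural candidates for distribution across processes.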
Memory performance of AND-parallel Prolog on shared-memory architectures
The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims with special emphasis on memory performance on a two-level shared-memory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 MLIPS on real applications exhibiting medium parallelism can be attained with current technology.
An Efficient Multiway Mergesort for GPU Architectures
Sorting is a primitive operation that is a building block for countless
algorithms. As such, it is important to design sorting algorithms that approach
peak performance on a range of hardware architectures. Graphics Processing
Units (GPUs) are particularly attractive architectures as they provide massive
parallelism and computing power. However, the intricacies of their compute and
memory hierarchies make designing GPU-efficient algorithms challenging. In this
work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway
mergesort algorithm. MMS employs a new partitioning technique that exposes the
parallelism needed by modern GPU architectures. To the best of our knowledge,
MMS is the first sorting algorithm for the GPU that is asymptotically optimal
in terms of global memory accesses and that is completely free of shared memory
bank conflicts.
We realize an initial implementation of MMS, evaluate its performance on
three modern GPU architectures, and compare it to competitive implementations
available in state-of-the-art GPU libraries. Despite these implementations
being highly optimized, MMS compares favorably, achieving performance
improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art
algorithms are susceptible to bank conflicts. We find that for certain inputs
that cause these algorithms to incur large numbers of bank conflicts, MMS can
achieve up to a 37.6% speedup over its fastest competitor. Overall, even though
its current implementation is not fully optimized, due to its efficient use of
the memory hierarchy, MMS outperforms the fastest comparison-based sorting
implementations available to date.
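The basic idea of a multiway mergesort, splitting the input into k sorted runs and merging them in a single pass, can be sketched sequentially as below. This does not reproduce MMS: the paper's partitioning technique, its GPU memory-access pattern, and its avoidance of bank conflicts are not modelled, and the names `multiway_merge`, `multiway_mergesort`, `k`, and `base` are hypothetical.

```python
import heapq
from typing import List, Sequence


def multiway_merge(runs: Sequence[Sequence[int]]) -> List[int]:
    """Merge k sorted runs in one pass using a min-heap of run heads."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out: List[int] = []
    while heap:
        value, run_idx, pos = heapq.heappop(heap)
        out.append(value)
        nxt = pos + 1
        if nxt < len(runs[run_idx]):
            heapq.heappush(heap, (runs[run_idx][nxt], run_idx, nxt))
    return out


def multiway_mergesort(data: Sequence[int], k: int = 4, base: int = 8) -> List[int]:
    """Sort by recursively splitting into at most k runs and merging them."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if len(data) <= base:
        return sorted(data)  # small inputs: fall back to the built-in sort
    chunk = -(-len(data) // k)  # ceil(len(data) / k)
    runs = [multiway_mergesort(data[i:i + chunk], k, base)
            for i in range(0, len(data), chunk)]
    return multiway_merge(runs)


values = [(37 * i) % 101 for i in range(50)]
result = multiway_mergesort(values)
```

With k-way merging the recursion depth is log_k(n) rather than log_2(n), so each element is moved through memory fewer times. This is the general reason multiway merging is attractive when memory traffic dominates.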
BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures
We introduce BriskStream, an in-memory data stream processing system (DSPS)
specifically designed for modern shared-memory multicore architectures.
BriskStream's key contribution is an execution plan optimization paradigm,
namely RLAS, which takes relative-location (i.e., NUMA distance) of each pair
of producer-consumer operators into consideration. We propose a branch and
bound based approach with three heuristics to resolve the resulting nontrivial
optimization problem. The experimental evaluations demonstrate that BriskStream
yields much higher throughput and better scalability than existing DSPSs on
multi-core architectures when processing different types of workloads.
Comment: To appear in SIGMOD'1