5,695 research outputs found
swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture
The flourish of deep learning frameworks and hardware platforms has been
demanding an efficient compiler that can shield the diversity in both software
and hardware in order to provide application portability. Among the exiting
deep learning compilers, TVM is well known for its efficiency in code
generation and optimization across diverse hardware devices. In the meanwhile,
the Sunway many-core processor renders itself as a competitive candidate for
its attractive computational power in both scientific and deep learning
applications. This paper combines the trends in these two directions.
Specifically, we propose swTVM that extends the original TVM to support
ahead-of-time compilation for architecture requiring cross-compilation such as
Sunway. In addition, we leverage the architecture features during the
compilation such as core group for massive parallelism, DMA for high bandwidth
memory transfer and local device memory for data locality, in order to generate
efficient code for deep learning application on Sunway. The experimental
results show the ability of swTVM to automatically generate code for various
deep neural network models on Sunway. The performance of automatically
generated code for AlexNet and VGG-19 by swTVM achieves 6.71x and 2.45x speedup
on average than hand-optimized OpenACC implementations on convolution and fully
connected layers respectively. This work is the first attempt from the compiler
perspective to bridge the gap of deep learning and high performance
architecture particularly with productivity and efficiency in mind. We would
like to open source the implementation so that more people can embrace the
power of deep learning compiler and Sunway many-core processor
Even faster sorting of (not only) integers
In this paper we introduce RADULS2, the fastest parallel sorter based on
radix algorithm. It is optimized to process huge amounts of data making use of
modern multicore CPUs. The main novelties include: extremely optimized
algorithm for handling tiny arrays (up to about a hundred of records) that
could appear even billions times as subproblems to handle and improved
processing of larger subarrays with better use of non-temporal memory stores
Generalization Strategies for the Verification of Infinite State Systems
We present a method for the automated verification of temporal properties of
infinite state systems. Our verification method is based on the specialization
of constraint logic programs (CLP) and works in two phases: (1) in the first
phase, a CLP specification of an infinite state system is specialized with
respect to the initial state of the system and the temporal property to be
verified, and (2) in the second phase, the specialized program is evaluated by
using a bottom-up strategy. The effectiveness of the method strongly depends on
the generalization strategy which is applied during the program specialization
phase. We consider several generalization strategies obtained by combining
techniques already known in the field of program analysis and program
transformation, and we also introduce some new strategies. Then, through many
verification experiments, we evaluate the effectiveness of the generalization
strategies we have considered. Finally, we compare the implementation of our
specialization-based verification method to other constraint-based model
checking tools. The experimental results show that our method is competitive
with the methods used by those other tools. To appear in Theory and Practice of
Logic Programming (TPLP).Comment: 24 pages, 2 figures, 5 table
POPE: Partial Order Preserving Encoding
Recently there has been much interest in performing search queries over
encrypted data to enable functionality while protecting sensitive data. One
particularly efficient mechanism for executing such queries is order-preserving
encryption/encoding (OPE) which results in ciphertexts that preserve the
relative order of the underlying plaintexts thus allowing range and comparison
queries to be performed directly on ciphertexts. In this paper, we propose an
alternative approach to range queries over encrypted data that is optimized to
support insert-heavy workloads as are common in "big data" applications while
still maintaining search functionality and achieving stronger security.
Specifically, we propose a new primitive called partial order preserving
encoding (POPE) that achieves ideal OPE security with frequency hiding and also
leaves a sizable fraction of the data pairwise incomparable. Using only O(1)
persistent and non-persistent client storage for
, our POPE scheme provides extremely fast batch insertion
consisting of a single round, and efficient search with O(1) amortized cost for
up to search queries. This improved security and
performance makes our scheme better suited for today's insert-heavy databases.Comment: Appears in ACM CCS 2016 Proceeding
System Support for Bandwidth Management and Content Adaptation in Internet Applications
This paper describes the implementation and evaluation of an operating system
module, the Congestion Manager (CM), which provides integrated network flow
management and exports a convenient programming interface that allows
applications to be notified of, and adapt to, changing network conditions. We
describe the API by which applications interface with the CM, and the
architectural considerations that factored into the design. To evaluate the
architecture and API, we describe our implementations of TCP; a streaming
layered audio/video application; and an interactive audio application using the
CM, and show that they achieve adaptive behavior without incurring much
end-system overhead. All flows including TCP benefit from the sharing of
congestion information, and applications are able to incorporate new
functionality such as congestion control and adaptive behavior.Comment: 14 pages, appeared in OSDI 200
Recommended from our members
Interconnect optimizations for nanometer VLSI design
textAs the semiconductor technology scales into deeper sub-micron domain, billions of transistors can be used on a single system-on-chip (SOC) makes interconnection optimization more important roughly for two reasons. First, congestion, power, timing in routing and buffering requirements make inter- connection optimization more and more challenging. Second, gate delay get- ting shorter while the RC delay gets longer due to scaling. Study of interconnection construction and optimization algorithms in real industry flows and designs ends up with interesting findings. One used to be overlooked but very important and practical problem is how to utilize over- the-block routing resources intelligently. Routing over large IP blocks needs special attention as there is almost no way to insert buffers inside hard IP blocks, which can lead to unsolvable slew/timing violations. In current design flows we have seen, the routing resources over the IP blocks were either dealt as routing blockages leading to a significant waste, or simply treated in the same way as outside-the-block routing resources, which would violate the slew constraints and thus fail buffering. To handle that, this work proposes a novel buffering-aware over-the- block rectilinear Steiner minimum tree (BOB-RSMT) algorithm which helps reclaim the “wasted” over-the-block routing resources while meeting user-specified slew constraints. Proposed algorithm incrementally and efficiently migrates initial tree structures with buffering-awareness to meet slew constraints while minimizing wire-length. Moreover, due to the fact that timing optimization is important for the VLSI design, in this work, timing-driven over-the-block rectilinear Steiner tree (TOB-RST) is also studied to optimize critical paths. This proposed TOB-RST algorithm can be used in routing or post-routing stage to provide high-quality topologies to help close timing. Then a follow-up problem emerges: how to accomplish the whole routing with over-the-block routing resources used properly. Utilizing over-the- block routing resources could dramatically improve the routing solution, yet require special attention, since the slew, affected by different RC on different metal layers, must be constrained by buffering and is easily violated. Moreover, even of all nets are slew-legalized, the routing solution could still suffer from heavy congestion problem. A new global router, BOB-Router, is to solve the over-the-block global routing problem through minimizing overflows, wire-length and via count simultaneously without violating slew constraints. Based on my completed works, BOB-RSMT and BOB-Router tremendously improve the overall routing and buffering quality. Experimental results show that proposed over-the-block rectilinear Steiner tree construction and routing completely satisfies the slew constraints and significantly outperforms the obstacle-avoiding rectilinear Steiner tree construction and routing in terms of wire-length, via count and overflows.Electrical and Computer Engineerin
- …