21,165 research outputs found
swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture
The flourish of deep learning frameworks and hardware platforms has been
demanding an efficient compiler that can shield the diversity in both software
and hardware in order to provide application portability. Among the exiting
deep learning compilers, TVM is well known for its efficiency in code
generation and optimization across diverse hardware devices. In the meanwhile,
the Sunway many-core processor renders itself as a competitive candidate for
its attractive computational power in both scientific and deep learning
applications. This paper combines the trends in these two directions.
Specifically, we propose swTVM that extends the original TVM to support
ahead-of-time compilation for architecture requiring cross-compilation such as
Sunway. In addition, we leverage the architecture features during the
compilation such as core group for massive parallelism, DMA for high bandwidth
memory transfer and local device memory for data locality, in order to generate
efficient code for deep learning application on Sunway. The experimental
results show the ability of swTVM to automatically generate code for various
deep neural network models on Sunway. The performance of automatically
generated code for AlexNet and VGG-19 by swTVM achieves 6.71x and 2.45x speedup
on average than hand-optimized OpenACC implementations on convolution and fully
connected layers respectively. This work is the first attempt from the compiler
perspective to bridge the gap of deep learning and high performance
architecture particularly with productivity and efficiency in mind. We would
like to open source the implementation so that more people can embrace the
power of deep learning compiler and Sunway many-core processor
A study of topologies and protocols for fiber optic local area network
The emergence of new applications requiring high data traffic necessitates the development of high speed local area networks. Optical fiber is selected as the transmission medium due to its inherent advantages over other possible media and the dual optical bus architecture is shown to be the most suitable topology. Asynchronous access protocols, including token, random, hybrid random/token, and virtual token schemes, are developed and analyzed. Exact expressions for insertion delay and utilization at light and heavy load are derived, and intermediate load behavior is investigated by simulation. A new tokenless adaptive scheme whose control depends only on the detection of activity on the channel is shown to outperform round-robin schemes under uneven loads and multipacket traffic and to perform optimally at light load. An approximate solution to the queueing delay for an oscillating polling scheme under chaining is obtained and results are compared with simulation. Solutions to the problem of building systems with a large number of stations are presented, including maximization of the number of optical couplers, and the use of passive star/bus topologies, bridges and gateways
On-Chip Transparent Wire Pipelining (invited paper)
Wire pipelining has been proposed as a viable mean to break the discrepancy between decreasing gate delays and increasing wire delays in deep-submicron technologies. Far from being a straightforwardly applicable technique, this methodology requires a number of design modifications in order to insert it seamlessly in the current design flow. In this paper we briefly survey the methods presented by other researchers in the field and then we thoroughly analyze the solutions we recently proposed, ranging from system-level wire pipelining to physical design aspects
Real-time detection of grid bulk transfer traffic
The current practice of physical science research has yielded a continuously growing demand for interconnection network bandwidth to support the sharing of large datasets. Academic research networks and internet service providers have provisioned their networks to handle this type of load, which generates prolonged, high-volume traffic between nodes on the network. Maintenance of QoS for all network users demands that the onset of these (Grid bulk) transfers be detected to enable them to be reengineered through resources specifically provisioned to handle this type of traffic. This paper describes a real-time detector that operates at full-line-rate on Gb/s links, operates at high connection rates, and can track the use of ephemeral or non-standard ports
Spread spectrum communication link using surface wave devices
A fast lock-up, 8-MHz bandwidth 8,000 bit per second data rate spread spectrum communication link breadboard is described that is implemented using surface wave devices as the primary signal generators and signal processing elements. It uses surface wave tapped delay lines in the transmitter to generate the signals and in the receiver to detect them. The breadboard provides a measured processing gain for Gaussian noise of 31.5 dB which is within one dB of the theoretical optimum. This development demonstrates that spread spectrum receivers implemented with surface wave devices have sensitivities and complexities comparable to those of serial correlation receivers, but synchronization search times which are two to three orders of magnitude smaller
Preliminary basic performance analysis of the Cedar multiprocessor memory system
Some preliminary basic results on the performance of the Cedar multiprocessor memory system are presented. Empirical results are presented and used to calibrate a memory system simulator which is then used to discuss the scalability of the system
Evaluation of the Cedar memory system: Configuration of 16 by 16
Some basic results on the performance of the Cedar multiprocessor system are presented. Empirical results on the 16 processor 16 memory bank system configuration, which show the behavior of the Cedar system under different modes of operation are presented
- …