
    swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture

    The flourishing of deep learning frameworks and hardware platforms demands an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor presents itself as a competitive candidate thanks to its attractive computational power in both scientific and deep learning applications. This paper combines the trends in these two directions. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures requiring cross-compilation, such as Sunway. In addition, we leverage architecture features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory for data locality, in order to generate efficient code for deep learning applications on Sunway. The experimental results show the ability of swTVM to automatically generate code for various deep neural network models on Sunway. The code automatically generated by swTVM for AlexNet and VGG-19 achieves average speedups of 6.71x and 2.45x over hand-optimized OpenACC implementations on convolution and fully connected layers, respectively. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and high-performance architecture, particularly with productivity and efficiency in mind. We would like to open source the implementation so that more people can embrace the power of deep learning compilers and the Sunway many-core processor.

    A study of topologies and protocols for fiber optic local area network

    The emergence of new applications requiring high data traffic necessitates the development of high-speed local area networks. Optical fiber is selected as the transmission medium due to its inherent advantages over other possible media, and the dual optical bus architecture is shown to be the most suitable topology. Asynchronous access protocols, including token, random, hybrid random/token, and virtual token schemes, are developed and analyzed. Exact expressions for insertion delay and utilization at light and heavy load are derived, and intermediate load behavior is investigated by simulation. A new tokenless adaptive scheme whose control depends only on the detection of activity on the channel is shown to outperform round-robin schemes under uneven loads and multipacket traffic and to perform optimally at light load. An approximate solution to the queueing delay for an oscillating polling scheme under chaining is obtained, and results are compared with simulation. Solutions to the problem of building systems with a large number of stations are presented, including maximization of the number of optical couplers and the use of passive star/bus topologies, bridges, and gateways.

    On-Chip Transparent Wire Pipelining (invited paper)

    Wire pipelining has been proposed as a viable means to overcome the discrepancy between decreasing gate delays and increasing wire delays in deep-submicron technologies. Far from being a straightforwardly applicable technique, this methodology requires a number of design modifications before it can be inserted seamlessly into the current design flow. In this paper we briefly survey the methods presented by other researchers in the field and then thoroughly analyze the solutions we recently proposed, ranging from system-level wire pipelining to physical design aspects.

    Real-time detection of grid bulk transfer traffic

    The current practice of physical science research has yielded a continuously growing demand for interconnection network bandwidth to support the sharing of large datasets. Academic research networks and internet service providers have provisioned their networks to handle this type of load, which generates prolonged, high-volume traffic between nodes on the network. Maintaining QoS for all network users demands that the onset of these (Grid bulk) transfers be detected so that they can be reengineered through resources specifically provisioned to handle this type of traffic. This paper describes a real-time detector that operates at full line rate on Gb/s links, handles high connection rates, and can track the use of ephemeral or non-standard ports.

    Spread spectrum communication link using surface wave devices

    A fast lock-up spread spectrum communication link breadboard with an 8-MHz bandwidth and an 8,000-bit-per-second data rate is described; it is implemented using surface wave devices as the primary signal generators and signal processing elements. It uses surface wave tapped delay lines in the transmitter to generate the signals and in the receiver to detect them. The breadboard provides a measured processing gain for Gaussian noise of 31.5 dB, which is within one dB of the theoretical optimum. This development demonstrates that spread spectrum receivers implemented with surface wave devices have sensitivities and complexities comparable to those of serial correlation receivers, but synchronization search times that are two to three orders of magnitude smaller.

    Preliminary basic performance analysis of the Cedar multiprocessor memory system

    Some preliminary basic results on the performance of the Cedar multiprocessor memory system are presented. Empirical results are presented and used to calibrate a memory system simulator, which is then used to discuss the scalability of the system.

    Evaluation of the Cedar memory system: Configuration of 16 by 16

    Some basic results on the performance of the Cedar multiprocessor system are presented. Empirical results on the 16-processor, 16-memory-bank system configuration, which show the behavior of the Cedar system under different modes of operation, are presented.