Highly parallel sparse Cholesky factorization
Several fine-grained parallel algorithms for computing the Cholesky factorization of a sparse matrix were developed and compared. The experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data-parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and used to analyze the algorithms.
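The abstract treats a dense factorization as the key subroutine of the sparse code. As a rough illustration of that kernel only, here is a minimal serial sketch of dense Cholesky factorization in plain Python. It is not the Connection Machine implementation or the authors' parallel algorithm; the function name `cholesky` and the 3x3 test matrix are assumptions made for this example.

```python
import math

def cholesky(a):
    """Return lower-triangular L with L * L^T == a.

    `a` is a symmetric positive definite matrix given as a list of rows.
    Hypothetical helper for illustration; serial, column-by-column.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            dot = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - dot) / L[j][j]
    return L

# Small sparse (tridiagonal) example matrix, chosen for illustration.
A = [[4.0, 2.0, 0.0],
     [2.0, 5.0, 2.0],
     [0.0, 2.0, 5.0]]
L = cholesky(A)
```

In a sparse setting, many such dense blocks arise independently, which is what makes it plausible to factor several of them at once on a processor grid as the abstract describes.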
Gaussian Belief Propagation Based Multiuser Detection
In this work, we present a novel construction for solving the linear
multiuser detection problem using the Gaussian Belief Propagation algorithm.
Our algorithm yields an efficient, iterative and distributed implementation of
the MMSE detector. We compare our algorithm's performance to a recent result
and show an improved memory consumption, reduced computation steps and a
reduction in the number of sent messages. We prove that recent work by
Montanari et al. is an instance of our general algorithm, providing new
convergence results for both algorithms.
Comment: 6 pages, 1 figure, appeared in the 2008 IEEE International Symposium
on Information Theory, Toronto, July 2008
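Linear multiuser detection reduces to solving a linear system, and Gaussian Belief Propagation can solve such a system iteratively by passing messages between variables. The sketch below is a generic, synchronous Gaussian BP solver for A x = b in plain Python, not the specific construction of the paper. The function name `gabp_solve` and the matrix and vector values are assumptions; the matrix is chosen diagonally dominant, a standard sufficient condition for convergence.

```python
def gabp_solve(A, b, iters=100):
    """Approximate the solution of A x = b with Gaussian belief propagation.

    A is symmetric (list of rows). alpha[k][i] and beta[k][i] hold the
    precision and potential of the message sent from node k to node i.
    Hypothetical helper for illustration.
    """
    n = len(b)
    alpha = [[0.0] * n for _ in range(n)]
    beta = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        new_alpha = [[0.0] * n for _ in range(n)]
        new_beta = [[0.0] * n for _ in range(n)]
        for i in range(n):
            # Combine the local term with all incoming messages.
            tot_a = A[i][i] + sum(alpha[k][i] for k in range(n))
            tot_b = b[i] + sum(beta[k][i] for k in range(n))
            for j in range(n):
                if j == i or A[i][j] == 0.0:
                    continue
                # Exclude what j itself sent to i, then marginalize x_i.
                a_excl = tot_a - alpha[j][i]
                b_excl = tot_b - beta[j][i]
                new_alpha[i][j] = -A[i][j] ** 2 / a_excl
                new_beta[i][j] = -A[i][j] * b_excl / a_excl
        alpha, beta = new_alpha, new_beta
    # Marginal mean of each variable is the solution estimate.
    return [
        (b[i] + sum(beta[k][i] for k in range(n)))
        / (A[i][i] + sum(alpha[k][i] for k in range(n)))
        for i in range(n)
    ]

# Illustrative system: a correlation-like matrix plus a noise term on the diagonal.
A = [[1.5, 0.2, 0.1],
     [0.2, 1.5, 0.3],
     [0.1, 0.3, 1.5]]
b = [1.0, -1.0, 1.0]
x = gabp_solve(A, b)
residual = [sum(A[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
```

Each node only uses its own row of the matrix and the messages from its neighbors, which is what makes a distributed implementation possible.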
Large-Scale Optical Neural Networks based on Photoelectric Multiplication
Recent success in deep neural networks has generated strong interest in
hardware accelerators to improve speed and energy consumption. This paper
presents a new type of photonic accelerator based on coherent detection that is
scalable to large networks and can be operated at high (GHz)
speeds and very low (sub-aJ) energies per multiply-and-accumulate (MAC), using
the massive spatial multiplexing enabled by standard free-space optical
components. In contrast to previous approaches, both weights and inputs are
optically encoded so that the network can be reprogrammed and trained on the
fly. Simulations of the network using models for digit- and
image-classification reveal a "standard quantum limit" for optical neural
networks, set by photodetector shot noise. This bound, which can be as low as
50 zJ/MAC, suggests performance below the thermodynamic (Landauer) limit for
digital irreversible computation is theoretically possible in this device. The
proposed accelerator can implement both fully-connected and convolutional
networks. We also present a scheme for back-propagation and training that can
be performed in the same hardware. This architecture will enable a new class of
ultra-low-energy processors for deep learning.
Comment: Text: 10 pages, 5 figures, 1 table. Supplementary: 8 pages, 5
figures, 2 tables
Scalability of broadcast performance in wireless network-on-chip
Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such a design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of the communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC.
AirSync: Enabling Distributed Multiuser MIMO with Full Spatial Multiplexing
The enormous success of advanced wireless devices is pushing the demand for
higher wireless data rates. Denser spectrum reuse through the deployment of
more access points per square mile has the potential to successfully meet the
increasing demand for more bandwidth. In theory, the best approach to density
increase is via distributed multiuser MIMO, where several access points are
connected to a central server and operate as a large distributed multi-antenna
access point, ensuring that all transmitted signal power serves the purpose of
data transmission, rather than creating "interference." In practice, while
enterprise networks offer a natural setup in which distributed MIMO might be
possible, there are serious implementation difficulties, the primary one being
the need to eliminate phase and timing offsets between the jointly coordinated
access points.
In this paper we propose AirSync, a novel scheme which provides not only time
but also phase synchronization, thus enabling distributed MIMO with full
spatial multiplexing gains. AirSync locks the phase of all access points using
a common reference broadcast over the air, in conjunction with a Kalman filter
which closely tracks the phase drift. We have implemented AirSync as a digital
circuit in the FPGA of the WARP radio platform. Our experimental testbed,
consisting of two access points and two clients, shows that AirSync is able to
achieve phase synchronization within a few degrees, and allows the system to
nearly achieve the theoretical optimal multiplexing gain. We also discuss MAC
and higher layer aspects of a practical deployment. To the best of our
knowledge, AirSync offers the first ever realization of the full multiuser MIMO
gain, namely the ability to increase the number of wireless clients linearly
with the number of jointly coordinated access points, without reducing the per
client rate.
Comment: Submitted to Transactions on Networking
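The abstract mentions a Kalman filter that tracks phase drift between access points. As a hedged illustration of that idea only, the sketch below tracks an unwrapped phase and its drift rate with a two-state Kalman filter in plain Python. The state model, the noise parameters `q` and `r`, the function name `track_phase`, and the synthetic measurements are all assumptions for this example; they are not taken from the AirSync implementation.

```python
import random

def track_phase(measurements, dt=1.0, q=1e-6, r=1e-2):
    """Track [phase, drift rate] from noisy unwrapped phase measurements.

    Hypothetical constant-drift model: phase advances by dt * rate per step.
    q is the process noise variance, r the measurement noise variance.
    Returns the final (phase, rate) estimate.
    """
    phi, omega = 0.0, 0.0
    # Covariance entries of the 2x2 matrix P, starting from a vague prior.
    p00, p01, p10, p11 = 100.0, 0.0, 0.0, 100.0
    for z in measurements:
        # Predict: x = F x and P = F P F^T + Q, with F = [[1, dt], [0, 1]].
        phi = phi + dt * omega
        n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
        n01 = p01 + dt * p11
        n10 = p10 + dt * p11
        n11 = p11 + q
        # Update with the phase measurement, H = [1, 0].
        s = n00 + r
        k0, k1 = n00 / s, n10 / s
        resid = z - phi
        phi += k0 * resid
        omega += k1 * resid
        p00, p01 = (1.0 - k0) * n00, (1.0 - k0) * n01
        p10, p11 = n10 - k1 * n00, n11 - k1 * n01
    return phi, omega

# Synthetic example: a phase ramp of 0.02 rad per step plus seeded Gaussian noise.
rng = random.Random(0)
noisy = [0.1 + 0.02 * k + rng.gauss(0.0, 0.05) for k in range(1, 201)]
phase_est, rate_est = track_phase(noisy)
```

Tracking the drift rate alongside the phase lets the filter predict the phase between reference broadcasts, rather than only smoothing past measurements. A real system would also need to handle phase wrapping, which this sketch ignores.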