Grant Number: N000140610342 http://www.coas.oregonstate.edu
APPROACH
We established a partnership with IBM to gain early access to Cell BE technology, including future releases along its evolutionary path. We partnered with OASIS, Inc. (Lexington, MA) to develop signal processing code for the Cell BE. We analyzed the algorithm and software, subdivided the work into tasks, and allocated these tasks to the processing elements. Managing the communications and scheduling of these tasks is another challenge with the Cell BE. Our primary contacts at OASIS, Inc. are Phil Abbott and Vince Premus. Will Dillon (OSU) is a PhD student in computer science working on this project under the supervision of Mike Bailey (OSU Engineering).
WORK COMPLETED
We acquired and installed an IBM Cell BE rack consisting of two blades, with two Cells per blade. Each Cell consists of one PowerPC core and eight Synergistic Processing Elements (SPEs). We installed the Rapid Mind development environment. After initial testing and system integration, we ported the acoustic signal processing application from OASIS, Inc. to the Cell BE. We have tested this application and begun to optimize it for the Cell BE.
Target detection, tracking, and identification are performed in three discrete steps, each with unique design specifications. The first step is "Data-Independent Beamforming" (Van Veen and Buckley, 1988), a frequency-domain beamforming operation across n beams (currently 64). The first and second derivatives of the beamformer output are computed and used to identify unique audio sources, which are added to a "target queue." The second step, "Statistically-Optimum Beamforming," collects the maximum auditory information from the beam in the direction of the desired target and places a "null" in the direction of the strongest source of interference. Finally, the resulting signal is passed to a machine learning operation that aims to identify the type of target: man-made, natural, etc.
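As an illustration of the first step, the following minimal Python sketch forms 64 conventional (delay-and-sum) beams from a single frequency-bin snapshot of a 16-channel array, then picks peaks using first and second differences. It is not the OASIS code. The uniform line array, the half-wavelength element spacing, the beams spaced evenly in sin(theta), the detection threshold, and all function names are assumptions made for this example.

```python
import cmath
import math

NUM_SENSORS = 16            # hydrophone channels (from the report)
NUM_BEAMS = 64              # beams currently formed (from the report)
SPACING_WAVELENGTHS = 0.5   # ASSUMPTION: half-wavelength element spacing


def steering_vector(sin_theta):
    """Per-sensor phase factors for a plane wave on a uniform line array."""
    return [cmath.exp(-2j * math.pi * SPACING_WAVELENGTHS * m * sin_theta)
            for m in range(NUM_SENSORS)]


def beam_power(snapshot):
    """Conventional beamformer output power for each beam.

    `snapshot` holds one complex FFT coefficient per sensor, all taken
    from the same frequency bin.
    """
    powers = []
    for b in range(NUM_BEAMS):
        # ASSUMPTION: beams are spaced uniformly in sin(theta) over [-1, 1).
        sin_theta = -1.0 + 2.0 * b / NUM_BEAMS
        weights = steering_vector(sin_theta)
        y = sum(w.conjugate() * x for w, x in zip(weights, snapshot))
        powers.append(abs(y / NUM_SENSORS) ** 2)
    return powers


def find_peaks(powers, threshold):
    """Return beam indices that are local maxima above `threshold`.

    A peak is where the first difference changes from positive to
    non-positive and the second difference is negative.
    """
    peaks = []
    for b in range(1, len(powers) - 1):
        d_prev = powers[b] - powers[b - 1]
        d_next = powers[b + 1] - powers[b]
        if d_prev > 0 and d_next <= 0 and (d_next - d_prev) < 0 \
                and powers[b] > threshold:
            peaks.append(b)
    return peaks
```

For a noise-free plane wave generated with `steering_vector(0.25)`, `find_peaks(beam_power(snapshot), 0.5)` flags only beam 40, the beam steered to sin(theta) = 0.25. In a real system the flagged beams would be appended to the target queue.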
Progress toward a completed system is twofold. First, the architecture and system design are nearly complete, and design documents for each subsystem are being drafted. Second, the initial programming for the first stage is complete. Each subsystem runs as a standard Unix task, and subsystems communicate through network sockets. The data-independent beamforming task runs as a network server, and all audio data is sent along a network link. Note that the "network link" may be virtual and confined to the on-board computer.
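The socket-based arrangement described above might look roughly like the following Python sketch, in which a server task accepts one connection over the loopback interface and reads a single block of audio samples. The wire format (a 4-byte sample count followed by 32-bit floats in network byte order) and all function names are assumptions for illustration; the report does not specify the actual protocol.

```python
import socket
import struct
import threading

FRAME_HEADER = struct.Struct("!I")  # ASSUMPTION: 4-byte sample count prefix


def recv_exact(conn, nbytes):
    """Read exactly `nbytes` from a stream socket."""
    buf = bytearray()
    while len(buf) < nbytes:
        chunk = conn.recv(nbytes - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the link mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def serve_one_frame(server_sock, results):
    """Server side: accept one client and decode one frame of samples."""
    conn, _ = server_sock.accept()
    with conn:
        (count,) = FRAME_HEADER.unpack(recv_exact(conn, FRAME_HEADER.size))
        payload = recv_exact(conn, 4 * count)
        results.append(list(struct.unpack(f"!{count}f", payload)))


def send_frame(address, samples):
    """Client side: send one frame of float samples to the server."""
    with socket.create_connection(address) as sock:
        sock.sendall(FRAME_HEADER.pack(len(samples)))
        sock.sendall(struct.pack(f"!{len(samples)}f", *samples))


def demo():
    """Run server and client in one process over the loopback interface."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    server.listen(1)
    results = []
    worker = threading.Thread(target=serve_one_frame, args=(server, results))
    worker.start()
    send_frame(server.getsockname(), [0.0, 0.5, -0.25])
    worker.join()
    server.close()
    return results[0]
```

Because the server binds to 127.0.0.1, this corresponds to the "virtual" link case in which all traffic stays on one machine. Pointing the client at a remote address would move the same exchange onto a physical network without changing either task.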
RESULTS
At present, the data-independent beamformer is heavily instrumented to collect run times for various operations. Optimization of this code base for the Cell BE has not yet begun; however, baseline timing runs have been compiled on a 2.4 GHz Core 2 Duo running Mac OS X. The most meaningful run time is the time taken to perform the FFT on the input data: the baseline system sustains 4.8 GFLOPS, which works out to about 600 μs. The initial FFT benchmarks on the QS20 (an IBM BladeCenter containing Cell BE processors) are in the 2.3 GFLOPS range, which by extrapolation corresponds to about 1.2 ms. At first look, the performance of the Cell BE seems sub-optimal. However, we are performing 16 one-dimensional FFTs, one for each hydrophone channel, whereas the FFT benchmark performs a single FFT. It is reasonable to assume that the performance of the FFT step may increase by up to 8 times when all 8 SPEs work on their own data, which could yield 18.4 GFLOPS. This would result in a theoretical FFT computation time of roughly 200 μs. The assumption seems reasonable when compared with the best observed FFT benchmark on the QS20, which achieved 39 GFLOPS on a three-dimensional FFT. These values are significant because the time spent computing on a data set must be less than the time taken to collect it. In this case, the FFTs are performed on 2048 samples, which represent 0.6 seconds of data. Thus computation is complete in about 3/10,000 of the time taken to collect the data, which shows how much computational capacity remains after the first step of processing.
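The extrapolation above can be reproduced with a few lines of Python. The sketch assumes that the FFT step represents a fixed amount of work (throughput multiplied by time on the baseline machine) and that throughput scales linearly with the number of SPEs. Both are simplifying assumptions, and the variable names are illustrative only. Under strict linear scaling the figures come out at about 1.25 ms on a single benchmark-rate FFT and about 157 μs at 18.4 GFLOPS, consistent with the rounded 1.2 ms and 200 μs quoted above.

```python
BASELINE_GFLOPS = 4.8          # Core 2 Duo FFT throughput (from the report)
BASELINE_FFT_TIME_S = 600e-6   # Core 2 Duo FFT time (from the report)
QS20_BENCHMARK_GFLOPS = 2.3    # initial QS20 FFT benchmark (from the report)
NUM_SPES = 8                   # SPEs per Cell (from the report)
BLOCK_DURATION_S = 0.6         # time to collect 2048 samples (from the report)


def fft_time_s(gflops):
    """Time for the FFT step at a given throughput, assuming fixed work."""
    work_gflop = BASELINE_GFLOPS * BASELINE_FFT_TIME_S
    return work_gflop / gflops


single_time = fft_time_s(QS20_BENCHMARK_GFLOPS)
# ASSUMPTION: ideal linear speedup across all SPEs.
scaled_gflops = QS20_BENCHMARK_GFLOPS * NUM_SPES
scaled_time = fft_time_s(scaled_gflops)
realtime_fraction = scaled_time / BLOCK_DURATION_S

print(f"QS20 at benchmark rate: {single_time * 1e3:.2f} ms")
print(f"QS20 with {NUM_SPES} SPEs: {scaled_gflops:.1f} GFLOPS, "
      f"{scaled_time * 1e6:.0f} us")
print(f"Fraction of collection time used: {realtime_fraction:.1e}")
```

The real-time requirement is met whenever `realtime_fraction` is below 1. Here it is on the order of a few ten-thousandths, in line with the 3/10,000 figure.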
IMPACT/APPLICATIONS
The evolution of computing power has followed twin paths of higher-frequency processors and symmetric multiprocessing. Although Moore's Law has held for the past several decades, we have reached the limits of the heat dissipation required by higher-speed chips built at higher densities. Design tradeoff decisions must be made. For example, some chip designs are moving to multiple-core architectures within a single chip, with each core running multiple processing threads. As with all multi-processor architectures, communications and latency are key determinants of overall processing throughput. These cores behave as programmable I/O devices or attached co-processors. Each of these co-processors can be optimized for complex functions such as signal processing, video, and audio. The next generation of computer processing architectures is now appearing in both technical and home computing systems. The challenge is that programming models must undergo fundamental changes in order to exploit these new multi-core architectures.
The new generation of chips, such as the IBM/Toshiba/Sony Cell BE, relies on task-specific processing elements to achieve the necessary throughput. Moreover, future versions will incorporate an increasing number of task-specific cores, capable of running tens of processing threads concurrently. Today, such tasks are traditionally handled by Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), which can be customized by the user to perform specific tasks but are hard to program.
In a sense, this architecture does not look much different from architectures of 20 years ago, in which a single computer had a separate, dedicated array processing board. In fact, many of the issues are the same, including understanding I/O from the specialized hardware and integrating optimized software libraries for the special-purpose hardware within the processing flow. The challenge now is that there is greater complexity and flexibility available for user applications. The advantage is that these capabilities are incorporated into a single "system on a chip," expanding the deployment and packaging opportunities far beyond what can be done with typical systems today. With the eventual appearance of low-power versions of these architectures, the ability to deploy enormous amounts of computational power into a wide variety of platforms could transform how we sense and respond to the environment.
RELATED PROJECTS
None.
