Multi-channel input signals are widely used in multiple-input multiple-output (MIMO) systems. This paper first introduces the principles of high-performance computing (HPC) optimization, then discusses key techniques for generating multi-channel input signals. An implementation on a multicore DSP is then given. Finally, the performance of the method is analyzed and evaluated, and its efficiency is verified on a test system.
INTRODUCTION
Multi-channel input signals are widely used in multiple-input multiple-output (MIMO) radar and communication systems and in distributed-aperture coherent-synthesis radar and communication systems. Building a real-time simulation system for such signals not only saves considerable manpower and material resources, but also solves many problems better than other methods.
For example, there are several ways to simulate the echo of a multiple-input radar system [1]. The first uses a DDS (direct digital synthesizer) chip, but it can generate only a few simple echoes and cannot meet the needs of complicated radar echoes. The second uses an FPGA (field-programmable gate array) chip, but it lacks flexibility [2]. The third uses a CPU or DSP (digital signal processor) chip [3]. This approach is flexible and precise and saves money and time, but it depends on a high-performance chip, a suitable programming model, and efficient algorithms.
PRINCIPLES OF HPC OPTIMIZATION

Introduction to the symmetric multiprocessor
Parallel system structure, the parallel programming model, and parallel optimization are the three key factors in implementing parallel computation. Fig. 1 shows the relation among the three factors [4].

The parallel system structure is the foundation of parallel computation. The SMP (symmetric multiprocessor) is a typical processor with this structure. Its structure is symmetric, and it uses parallelism identification and loop techniques. For memory access, it is characterized by a single operating system image, low access latency, and low communication latency. An SMP includes multiple processors, each of which can run an independent instruction stream. Representative parallel processors are the CPU-based IBM R50 and YH-2 and the DSP-based TMS320C6678 and TMS320C6609. Fig. 2 shows the elementary structure of an SMP [4].

A parallel programming model provides standardized programming rules, a complete programming interface, and a systematic implementation framework. These help the programmer take account of the characteristics of both the system structure and the application program. Such a model can be realized in two ways. The first is to adopt a brand-new programming language. The second is to take a mature programming language and extend it with parallelism. The latter is easier to program and more portable.

OpenMP is a shared-memory parallel programming model. It is a parallel framework built on top of a serial language. In general, the OpenMP API combines easily with C/C++ and Fortran. Fig. 3 sketches the flow of the OpenMP model: the main thread alternately forks child threads for the parts that can execute in parallel and joins them afterwards [5].
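The fork-join flow can be sketched as follows. This is a minimal illustration in Python, not OpenMP code for the DSP: the names `square_chunk` and `fork_join_squares` and the squaring workload are hypothetical, and Python threads only mimic the structure (serial region, fork, join, serial region) without the parallel speedup that OpenMP threads obtain on separate cores.

```python
from concurrent.futures import ThreadPoolExecutor


def square_chunk(chunk):
    """Work done by one child thread (hypothetical workload)."""
    return [x * x for x in chunk]


def fork_join_squares(data, num_workers=4):
    # Serial region: the main thread splits the iteration space into chunks.
    size = max(1, (len(data) + num_workers - 1) // num_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    # Fork: child threads process the chunks.
    # Join: leaving the `with` block waits for all of them to finish.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(square_chunk, chunks))

    # Serial region again: the main thread merges the partial results in order.
    return [y for part in parts for y in part]
```

In OpenMP the same shape is usually written as a parallel loop inside an otherwise serial function, so the serial code remains readable and the parallel region is added incrementally.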
Analysis of multi-channel concurrency
Computational parallelism usually takes three forms: task parallelism, data parallelism, and pipeline parallelism. Figure 4 shows all the tasks of multi-channel signal simulation. The tasks for the different kinds of signals are of the same type, and there is no dependency among them, so task parallelism can be realized. At the end, all kinds of signals are added together.
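A minimal sketch of this task-level decomposition is given below, assuming four independent generators whose outputs are summed sample by sample. The generator functions, their waveforms, and the name `simulate_channel` are placeholders chosen for illustration; they are not the signal models used in this paper.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the independent signal generators.
def object_echo(n):
    return [math.cos(0.1 * k) for k in range(n)]


def clutter(n):
    return [0.5 * math.sin(0.01 * k) for k in range(n)]


def jamming(n):
    return [0.2 * math.cos(0.3 * k) for k in range(n)]


def noise(n, seed=0):
    rng = random.Random(seed)  # seeded so that the sketch is repeatable
    return [rng.gauss(0.0, 0.1) for _ in range(n)]


def simulate_channel(n):
    tasks = [object_echo, clutter, jamming, noise]
    # The tasks share no data, so each one can run in its own thread.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task, n) for task in tasks]
        parts = [f.result() for f in futures]
    # Final step: add all kinds of signals together, sample by sample.
    return [sum(samples) for samples in zip(*parts)]
```

Because the only shared step is the final summation, no synchronization is needed while the components are being generated.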
Fig. 4. Tasks of multi-channel signal simulation: object echo, clutter signal (weather clutter, geography clutter), jamming signal (passive jamming, active jamming), and noise signal.

The algorithm can be divided into memory access, communication, and computation. These tasks form a pipeline: the output of one step is the input of the next, in a producer-consumer relationship. Pipeline parallelism can therefore be adopted to hide data I/O latency and communication latency.
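The producer-consumer pipeline can be sketched with bounded queues between the stages. This is an illustrative Python sketch under assumed names (`stage`, `run_pipeline`, and the `generate` and `output` callbacks are hypothetical); on the DSP the stages would be mapped to cores and DMA transfers rather than Python threads.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the stream


def stage(func, inbox, outbox):
    """Consume items from inbox, apply func, and produce them to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)
            return
        outbox.put(func(item))


def run_pipeline(control_words, generate, output):
    # Bounded queues keep a fast producer from running far ahead of its consumer.
    q_in = queue.Queue(maxsize=2)
    q_mid = queue.Queue(maxsize=2)
    q_out = queue.Queue()  # unbounded, so the last stage never blocks

    workers = [
        threading.Thread(target=stage, args=(generate, q_in, q_mid)),
        threading.Thread(target=stage, args=(output, q_mid, q_out)),
    ]
    for w in workers:
        w.start()

    # Stage 1 (this thread): read the control information and feed the pipeline.
    for word in control_words:
        q_in.put(word)
    q_in.put(_STOP)

    results = []
    while True:
        item = q_out.get()
        if item is _STOP:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results
```

While one item is being generated, the previous one is already being output, so the latency of each stage is hidden behind the others once the pipeline is full.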
Figure. Pipeline stages of signal generation: read control info (memory access), transmit control info, generate signal (computation), output signals to file (memory access).

Multi-channel signals must be output at the end. The signals of all channels are generated by the same method, so data parallelism can be realized.
Applicability of the three elements
The TMS320C6678 DSP is selected because of its SMP structure. It is a high-performance fixed/floating-point DSP based on TI's KeyStone multicore architecture. Incorporating the C66x DSP core, the device can run at a core speed of up to 1.25 GHz. For developers of a broad range of applications requiring high performance, such as mission-critical systems, medical imaging, and test and automation, the TMS320C6678 offers 10 GHz of cumulative DSP performance on a platform that is power-efficient and easy to use. The KeyStone architecture provides a programmable platform that integrates various subsystems and uses several components and techniques to maximize intra-device and inter-device communication, allowing the DSP resources to operate efficiently and seamlessly. The multicore shared memory controller allows access to shared and external memory directly, without drawing on switch fabric capacity.

The C6678 integrates a large amount of on-chip memory. In addition to 32 KB of L1 program and data cache, there is 512 KB of dedicated memory per core that can be configured as mapped RAM or cache. The device also integrates 4096 KB of multicore shared memory that can be used as shared L2 SRAM and/or shared L3 SRAM. All L2 memories incorporate error detection and error correction. For fast access to external memory, the device includes a 64-bit DDR3 external memory interface (EMIF) running at 1600 MHz with ECC DRAM support. High-performance program optimization and the OpenMP API, running on SYS/BIOS, provide efficient support for real-time applications on this device [6].
The multi-channel radar echo simulation can be decomposed as shown in Fig. 6. Task A is pattern control and synchronization; Task B is simulation of all kinds of signals; Task C is signal collection and synchronization; Task D is collection of the multi-channel signals and synchronization; Task E is signal pre-processing, which is outside the scope of this paper.
The parameters of the signal are the amplitude modulation factor A, the echo delay time, the pulse width T, the FM slope, the velocity of light c, the range of the object $R_0$, the speed of the object v, and the sample period $T_s$. Equation (4) shows the LFM signal after discretization.
The usual method evaluates the cos and sin functions directly for every sample, without regard to efficiency, and therefore cannot meet the needs of real-time simulation.
The phase of the signal can be arranged as a polynomial, taking the FM slope and the range of frequency variation into account.
Equation (6) is then obtained.
The polynomial also applies to monopulse, phase-coded pulse, and pulse-train signals, but the coefficients a are different in each case.
The initial values are given in equation (9). For object echoes such as monopulse, phase-coded pulse, and pulse-train signals, and for some active jamming, the complex computation can be transformed into an iterative form to which loop optimization techniques can be applied.
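The idea of replacing per-sample cos and sin calls with an iteration can be sketched as follows. The sketch assumes a quadratic phase polynomial phi[n] = a0 + a1*n + a2*n^2 in the sample index n; the coefficient names, the function names, and this exact form are assumptions for illustration and may differ from equations (4) to (9). Since phi[n+1] - phi[n] = a1 + a2*(2n + 1) and the second difference is the constant 2*a2, each new sample needs only two complex multiplications.

```python
import cmath
import math


def lfm_direct(a0, a1, a2, amplitude, count):
    """Reference method: one cos and one sin evaluation per sample."""
    out = []
    for n in range(count):
        phase = a0 + a1 * n + a2 * n * n
        out.append(amplitude * complex(math.cos(phase), math.sin(phase)))
    return out


def lfm_recurrence(a0, a1, a2, amplitude, count):
    """Iterative method: two complex multiplications per sample."""
    sample = amplitude * cmath.exp(1j * a0)   # s[0]
    step = cmath.exp(1j * (a1 + a2))          # exp(j * (phi[1] - phi[0]))
    step_ratio = cmath.exp(1j * 2.0 * a2)     # exp(j * second difference)
    out = []
    for _ in range(count):
        out.append(sample)
        sample *= step        # advance the phase by the first difference
        step *= step_ratio    # advance the first difference itself
    return out
```

For a baseband LFM pulse with an assumed chirp rate mu and sample period Ts, one would set a0 = a1 = 0 and a2 = pi * mu * Ts**2. Rounding errors accumulate along the recurrence, so for very long pulses the state may need to be re-initialized from the direct formula at intervals.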
Memory access overhead optimization
DMA is used in this design to move data between memories without CPU participation. Three high-speed EDMA controllers are integrated in the C6678 processor. The EDMA transfer speed between L2 or MSMC memory and DDR3 is shown in Table 1. Each result is the average of 100 measurements obtained by polling.
Latency hiding optimization
When memory access and communication incur long latency, measures must be taken to hide it. The design uses two methods: overlapping memory access with communication, and overlapping memory access with computation.
• Overlapping of memory access and communication
When memory access and communication are both involved, the most effective approach is double buffering. Two segments of data are stored in two different memory buffers. While one buffer is being filled with data, the other is used for communication. This mechanism effectively improves efficiency.
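A minimal sketch of the double-buffer (ping-pong) scheme is shown below. The function `double_buffered` and the `fetch` and `send` callbacks are hypothetical names standing in for the memory access and the communication step; a Python thread plays the role of the transfer engine that runs alongside the fetch.

```python
import threading


def double_buffered(block_ids, fetch, send):
    """Send block i from one buffer while block i + 1 is fetched into the other."""
    results = []
    if not block_ids:
        return results

    buffers = [fetch(block_ids[0]), None]  # prime the first buffer

    def transmit(buf):
        results.append(send(buf))

    for i in range(len(block_ids)):
        active = i % 2
        # Start communication on the buffer that is already full.
        sender = threading.Thread(target=transmit, args=(buffers[active],))
        sender.start()
        # Meanwhile, fill the other buffer with the next block.
        if i + 1 < len(block_ids):
            buffers[1 - active] = fetch(block_ids[i + 1])
        sender.join()  # a buffer is reused only after its transfer has finished
    return results
```

The `join` before the next iteration is the synchronization point: without it, a buffer could be overwritten while it is still being transmitted.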
• Overlapping of memory access and computation
The first method is overlapping EDMA (Enhanced DMA) transfers with computation. Because EDMA does not consume DSP core time, the core does not need to wait after starting an EDMA transfer, and computation can proceed simultaneously. Data independence and load balance must be considered here. The second method is loop blocking. Cache efficiency depends on spatial and temporal reuse of data, so the overhead of repeatedly loading the same data into the cache should be avoided.
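Loop blocking can be illustrated with the final summation of the signal components. In the sketch below (the function name and the default block size are assumptions), the output is processed one block at a time, so each output block is reused for every component before the loop moves on. Python itself does not expose the cache effect; the code only shows the loop structure that would be written in C on the DSP.

```python
def accumulate_blocked(components, block=1024):
    """Sum equally long signal components block by block."""
    if not components:
        return []
    n = len(components[0])
    total = [0.0] * n
    for start in range(0, n, block):          # outer loop over blocks
        stop = min(start + block, n)
        for comp in components:               # reuse the same output block
            for k in range(start, stop):      # inner loop stays within the block
                total[k] += comp[k]
    return total
```

The block size would normally be chosen so that one output block plus one input block fit in the L1 data cache or the local L2 memory.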
TEST AND ANALYSIS
Speedup was measured with and without the optimizations. As the amount of data increases, the speedup increases markedly. This indicates that the parallel optimization techniques are well suited to computation-intensive workloads.
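The trend can be made concrete with a simple timing model. The sketch below is not the measurement procedure of this paper: it assumes that the parallel run time is a fixed overhead plus the serial work divided by the number of cores, and every parameter value used with it is hypothetical.

```python
def speedup(t_serial, t_parallel):
    """Speedup is the serial run time divided by the parallel run time."""
    return t_serial / t_parallel


def modeled_speedup(samples, cores, cost_per_sample, fixed_overhead):
    # Assumed model: the work scales with the data size, while the overhead
    # of forking, synchronizing, and starting transfers does not.
    t_serial = samples * cost_per_sample
    t_parallel = fixed_overhead + t_serial / cores
    return speedup(t_serial, t_parallel)
```

Under this model the speedup grows with the data size and approaches the number of cores, because the fixed overhead is amortized over more work.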
CONCLUSION
An attempt has been made to optimize the parallelization of programs by considering the multicore architecture, the parallel programming model, and parallel programming techniques. These techniques comprise loop transformation, memory access optimization, communication optimization, and latency hiding, and they result in not only considerable speedup but also a much smaller hardware system.
