Radio transient signals are non-periodic, discrete signals produced by high-energy physical processes in space. The main challenges in transient signal detection are the speed and accuracy with which a signal can be detected. The Cumulative Sum (CUSUM) algorithm is employed in this paper for transient signal detection and is shown to meet these requirements. However, because ordinary software-based programs are unable to handle large-scale signal samples, this research focuses on implementing the CUSUM algorithm on a Field Programmable Gate Array (FPGA), a semi-customized integrated circuit that can greatly enhance the speed of detection and analysis. An FPGA-based system was therefore developed to implement the algorithm for efficient transient signal detection. A detection speed of 64 ns per sample set was achieved by implementing the algorithm on an Altera Cyclone IV device with a clock speed of 50 MHz. The analysis shows that the power consumption of the FPGA-based system can be reduced to 136.75 mW.
INTRODUCTION
Radio transient signals originate from high-energy physical processes in space, such as solar flares, supernovae, pulsars, quasars and active galaxies. Other speculated sources include evaporating black holes, colliding neutron stars and a number of unknown events. The detection of radio transients is challenging because of their short, non-periodic nature and the high risk of misdetection. It requires the backend of radio telescopes to be equipped with appropriate hardware and software. Generally, a de-dispersion procedure is used to improve detectability and test the properties of the signal. However, given the large volume of signal data from outer space, this method is computationally demanding and does not appear sufficiently robust. Therefore, an improved algorithm and method for detecting transient signals is developed in this work.
Traditionally, most of the developed solutions have been based on software programs targeted at general purpose processors. The shortfalls of these platforms are obvious. For example, they are usually constrained by a fixed number of processors, a limited operating speed and a fixed bandwidth, and are characterized by high power consumption. Most importantly, not all the resources on such platforms are used for transient signal processing, with part of the resources consumed by the operating system and software setup. A customized platform, by contrast, would be optimized solely for transient signal processing, thereby saving the cost of unnecessary components. In addition, the power consumption of the customized platform would be much lower than that of a general purpose computer. The most common technologies available to achieve this result are the Application Specific Integrated Circuit (ASIC) and the Field Programmable Gate Array (FPGA). An ASIC is an integrated circuit customized to perform a certain task. The advantages of ASICs include low power consumption, high operating frequency and high logic density. The disadvantages are that ASIC systems come with a high design cost and require specialized designer knowledge.
An FPGA is a semi-customized integrated circuit that addresses the shortcomings of fully customized circuits and overcomes the limited gate count of earlier programmable devices. The circuit is designed in a hardware description language (Verilog or VHDL), which allows easy layout and programming of the FPGA chip. Another advantage of using an FPGA is its capability of working as a co-processor for High Performance Computers (HPC). It is easier and more reasonable to implement algorithms on FPGAs, since designs on FPGAs can be migrated to HPCs or ASICs in most cases.
DETECTION ALGORITHM
Transient signal detection can be modeled as a complex stochastic process. Any abnormal signal causes a change in the model. The aim of detection is to monitor the difference between the input signals and a threshold. Currently, there are two ways to monitor those changes: one from the perspective of signal processing, the other from a statistical point of view.
Signal processing methods usually transform the sampled signal into the time domain or frequency domain and observe the changes. In [1], Cornel Loana provides an adaptive time-frequency method based on over-complete wavelet transform concepts, which leads to signal processing on the frequency bands of interest. This method is based on the fourth-order moment and is applied to each sub-band in order to establish the optimal weight for each sample. The results obtained demonstrate the capability of the proposed approach to accurately detect a transient signal when compared with other methods (e.g. Spectrogram or Standard Wavelet Transform) [2]. The author of [1] found that the commonly used discrete wavelet transform (DWT) was not well suited to this kind of signal processing problem. From a mathematical point of view, the DWT is generated by sampling, in the time-scale plane, a corresponding continuous wavelet transform (CWT). Although there are infinitely many possible discretizations of the CWT, the term DWT is commonly used to refer to the one associated with the dyadic sampling lattice [3]. In certain analyses this yields an orthogonal wavelet basis, and the use of an orthogonal representation loses signal characteristics [4]. To eliminate this drawback, the key factor is the use of a non-dyadic sampling structure, which in this case is the Over Complete Wavelet Transform (OCWT). However, this approach requires a more complicated data acquisition process and still needs to be improved [5].
The aim of many statistical methods is to discover the characteristics of the sampled signal. Soudlenkov [6] and Fridman [7] proposed the use of the Cumulative Sum (CUSUM) algorithm for transient signal detection as capable of meeting the necessary requirements. The Mann-Whitney U [8] and the Wilcoxon signed-rank [9] tests are both non-parametric statistical methods. The Mann-Whitney U test is used for assessing whether one of two independent samples tends to have larger values than the other, and is one of the best-known non-parametric significance tests. The Wilcoxon signed-rank test is used when comparing two related samples, or repeated measurements on a single sample, to assess whether their population means differ (i.e. a paired difference test). This method is based on the assumption that there is no significant difference between the overall distributions of the two samples. The main limitation of these methods is that they were originally designed for detecting single point changes. By contrast, the Mann-Kendall [10] and the CUSUM methods are particularly suitable for sequential analysis. Specifically, the Mann-Kendall method is used to measure the association between two measured quantities. It is easy to implement and widely used in the analysis of climate change. CUSUM is a sequential analysis technique typically used for monitoring change detection [10]. It has several advantages, including its relative simplicity, a graphical interpretation of results, and the ability to detect unusual patterns. It has been successfully used in fault detection, onset detection, and defect detection in mechanical systems. Both the Mann-Kendall and the CUSUM tests have parameters that need to be fixed at design time [11] in order to allow the test to detect changes. Specifically, the Mann-Kendall test requires a level of significance to be set, while the CUSUM test requires thresholds to be fixed in order to detect possible changes in statistical behavior.
One of the significant benefits of CUSUM for signal detection is its stability in the presence of regression behavior in signal sampling [10]. The CUSUM test was chosen as the method of implementation for this research due to its high detection accuracy and real-time computation features.
CUSUM is a detection procedure proposed by Page (1954) and Lorden (1971). It is a sequential analysis technique from statistical quality control, typically used for monitoring change detection. As its name implies, CUSUM involves the calculation of a cumulative sum (which makes it "sequential"). In this algorithm, a constant reference value is subtracted from each collected data point, and the resulting difference is added to the sum of the previous differences (the cumulated sum). The reference value is usually the average value, referred to as "M". The equation is given below:
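A form consistent with this description can be written as follows, assuming $x_n$ denotes the $n$-th sample, $S_n$ the cumulated sum after $n$ samples, and a zero starting value (the symbol $x_n$ and the initial condition are assumptions made here for illustration):

$$
S_0 = 0, \qquad S_n = S_{n-1} + (x_n - M), \quad n = 1, 2, \dots
$$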
The CUSUM test requires a reference threshold value h. When the value of Sn exceeds this threshold, an abrupt change is detected. Threshold selection is the key point of CUSUM detection. However, there is no fixed method for choosing it when detecting transient signals. The author chose the Standard Deviation (SD) method. Standard deviation is widely used in statistics to measure variability or diversity. It shows the extent of variation from the mean or expected value. As an important indicator of precision in statistics, it is widely used in quality control. The SD of a data set is the square root of its variance. The equation is:
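In the usual notation, and assuming the population form with $N$ samples $x_1, \dots, x_N$ in the sampling unit (the choice of $N$ rather than $N-1$ in the denominator is an assumption), the standard deviation is:

$$
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \bar{X} \right)^2 }
$$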
Here, X refers to the mean value of the sampling unit. In transient signal detection it is unrealistic to find an overall SD; however, one can choose a certain number of samples and use the SD of those samples to set the threshold. In this design, two times the SD of each sample set is selected as the threshold for that sample set. In a normal distribution, about 68% of all values lie within one SD of the mean and about 95% lie within two SD. In transient signal detection, if the deviation between two neighboring samples exceeds two times the SD, this indicates that one of the samples is abnormal.
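The thresholding rule described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware implementation: the function name `cusum_alarms`, the use of the sample-set mean as the reference value M, the population form of the SD, and the test data are all hypothetical choices.

```python
import math


def cusum_alarms(samples):
    """Return the indices at which |S_n| exceeds twice the sample-set SD."""
    n = len(samples)
    mean = sum(samples) / n  # reference value M (assumed: mean of the set)
    sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
    threshold = 2 * sd  # h = 2 * SD, as chosen in the text
    alarms = []
    s = 0.0
    for i, x in enumerate(samples):
        s += x - mean  # S_n = S_{n-1} + (x_n - M)
        if abs(s) > threshold:  # covers positive and negative excursions
            alarms.append(i)
    return alarms


# Hypothetical data: a flat sample set and one with a single spike.
quiet = [10] * 32
burst = [10] * 32
burst[5] = 100

print(cusum_alarms(quiet))      # no alarm expected for a flat set
print(cusum_alarms(burst)[:1])  # first alarm at the spike position
```

In this example the cumulated sum stays above the threshold for several samples after the spike, because it decays only gradually as later samples fall slightly below the set mean.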
The choice of sample set size is important because it affects detection accuracy. The standard error equation below expresses the relation between sample size and standard error.
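The standard definition of the standard error of the mean, assuming $\sigma$ is the standard deviation and $n$ the number of samples in a sample set (symbols chosen here for illustration), is:

$$
SE = \frac{\sigma}{\sqrt{n}}
$$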
This equation reflects the degree of dispersion of the samples. A smaller standard error means the samples are close to the overall average value; otherwise the samples are more dispersed. The more samples are chosen in each sample set, the closer its SD is to the overall SD. In practice, a sample set normally contains around 50 samples. Here, the author chooses 32 samples per sample set in the design. The main reason is to reduce the latency of sample data processing.
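As a rough indication of the cost of this choice, and assuming the standard error scales as $\sigma/\sqrt{n}$ with the same $\sigma$ in both cases, the ratio between the standard error for 32 samples and for 50 samples is:

$$
\frac{SE_{32}}{SE_{50}} = \frac{\sigma/\sqrt{32}}{\sigma/\sqrt{50}} = \sqrt{\frac{50}{32}} = 1.25
$$

Under that assumption, a 32-sample set has a standard error about 25% larger than a 50-sample set, in exchange for a shorter acquisition window.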
DESIGN STRUCTURES
The CUSUM core comprises several modules: the FIFO, Serial to Parallel, Standard Deviation and CUSUM algorithm modules. An asynchronous FIFO was adopted because the CUSUM module may operate with different sampling and processing frequencies. An asynchronous FIFO supports read and write operations under different clock frequencies, or under the same frequency with different clock phases. For a high-speed asynchronous FIFO, address control must be handled carefully. From the perspective of logic gate design, a simple "add" operation is complicated, as it involves "carry" and "flip" operations on the counter, and these operations easily generate glitches in a high-speed circuit [12]. Therefore, this module employs Gray code in its design.
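The property that motivates Gray-coded FIFO addresses can be checked with a short sketch. The function names and the 4-bit address width are assumptions for illustration; the actual address width of the FIFO is not stated.

```python
def bin_to_gray(value: int) -> int:
    """Convert a binary-coded address to Gray code."""
    return value ^ (value >> 1)


def gray_to_bin(gray: int) -> int:
    """Convert a Gray-coded address back to binary."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


ADDR_BITS = 4  # hypothetical address width
DEPTH = 1 << ADDR_BITS

for addr in range(DEPTH):
    nxt = (addr + 1) % DEPTH  # next address, including wrap-around
    changed = bin_to_gray(addr) ^ bin_to_gray(nxt)
    # Exactly one bit differs between consecutive Gray-coded addresses.
    assert bin(changed).count("1") == 1
    # The conversion is reversible.
    assert gray_to_bin(bin_to_gray(addr)) == addr

# In plain binary, the step from 7 to 8 flips four bits at once.
binary_flips = bin(7 ^ 8).count("1")
gray_flips = bin(bin_to_gray(7) ^ bin_to_gray(8)).count("1")
print(binary_flips, gray_flips)
```

Because only one bit changes per increment, an address sampled in the other clock domain during a transition is either the old or the new value, never an unrelated one.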
With Gray code, only one bit changes state from one position to the next. To determine the FIFO status, the read address (read_addr) is first converted to Gray code (read_gray). The write clock domain is then used to synchronize this read address (rag_wt_syn), which is converted back to binary code. Finally, the current write address (write_addr_pl) is delayed by one clock cycle, and the difference between the synchronized read address (rag_wt_syn) and the write address (write_addr_pl) gives the status of the FIFO. The write address is delayed by one clock cycle to keep it synchronized with the read address, because the read address takes one clock cycle to convert to Gray code.
The main function of the Serial to Parallel module is to parallelize the serial input data from the FIFO module and calculate the average of these samples. A 5-bit counter is used to control the amount of input data. Once the interval is complete, the entire data of this interval, plus its mean value, is output. Figure 1 shows one group of parameterized ALTMUTI_ADD functions.
Fig 1: MUTI_ADD_BASE Module
Each "MUTI_ADD_BASE" module can process four groups of input data. Therefore, eight MUTI_ADD_BASE modules in parallel can process the data of one sampling interval. This approach reduces the calculation time by consuming more resources, that is, it trades area for speed. The outputs of the MUTI_ADD_BASE modules are added together and divided by the total number of samples. The result is then sent to the parameterized SQRT megafunction to obtain the SD. A pipeline design is used here to minimize the clock cycles consumed by the "add" operation. It takes only three clock cycles to add the eight outputs together. Division is achieved via a shift operation.
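The datapath described above can be modeled in software as a rough sketch. It assumes that each base unit accumulates four squared deviations, that the adder tree sums pairs at each level, and that integer arithmetic is used throughout; the function names and test data are hypothetical, and `math.isqrt` merely stands in for the SQRT megafunction.

```python
import math


def adder_tree(values):
    """Sum values pairwise, one level per step; return (total, levels)."""
    levels = 0
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels


def standard_deviation_32(samples):
    """Integer SD of a 32-sample set using shifts instead of division."""
    assert len(samples) == 32
    mean = sum(samples) >> 5  # divide by 32 with a right shift of 5 bits
    # Eight units, each accumulating four squared deviations (assumed split).
    partial = [
        sum((x - mean) ** 2 for x in samples[g:g + 4])
        for g in range(0, 32, 4)
    ]
    total, levels = adder_tree(partial)  # 8 -> 4 -> 2 -> 1
    variance = total >> 5  # divide by 32 again
    return math.isqrt(variance), levels


samples = [10] * 32
samples[5] = 100
sd, levels = standard_deviation_32(samples)
print(sd, levels)
```

The three adder-tree levels correspond to the three clock cycles quoted for adding the eight outputs, assuming one pipeline register per level.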
In this case, division by thirty-two is achieved by shifting five bits to the right.
The CUSUM algorithm module detects abnormal signals in each sample set. It accumulates the differences between the sampled signals and their corresponding mean value. If the cumulative value is more than twice the standard deviation, the sampled signal is considered abnormal and is output for further processing. Otherwise, the output port is driven to the high-impedance state. As the sampled signal is signed, the cumulative value may be negative. This module must therefore be able to detect both positive and negative abnormal signals, so the two's complement method was used to set up the negative threshold.
MODULE ASSEMBLY
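The negative threshold can be illustrated with a small sketch of two's complement arithmetic. The 16-bit register width, the function names and the example threshold of 30 are assumptions for illustration; the actual widths used in the design are not stated.

```python
WIDTH = 16  # hypothetical register width
MASK = (1 << WIDTH) - 1


def twos_complement_negate(value: int) -> int:
    """Bit pattern of -value in WIDTH-bit two's complement."""
    return (~value + 1) & MASK


def to_signed(pattern: int) -> int:
    """Interpret a WIDTH-bit pattern as a signed integer."""
    return pattern - (1 << WIDTH) if pattern & (1 << (WIDTH - 1)) else pattern


def is_abnormal(cumulative: int, threshold: int) -> bool:
    """True if the signed cumulative sum lies outside +/- threshold."""
    negative_threshold = to_signed(twos_complement_negate(threshold))
    return cumulative > threshold or cumulative < negative_threshold


threshold = 30  # e.g. twice an SD of 15
print(hex(twos_complement_negate(threshold)))
print(is_abnormal(73, threshold), is_abnormal(-40, threshold), is_abnormal(-14, threshold))
```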
Module assembly is the key point in digital design. Unlike C or other high-level languages, in which all functions are executed sequentially, Verilog modules all run in parallel. This means that simply stitching these modules together does not achieve the expected operating results. Moreover, most of the IP cores supported by Quartus are multi-functional and do not provide communication signals such as "Start" and "Done", as these cores do not contain a sub module, functional module and control module. They can therefore only be used within "always @" statements inside modules to achieve the desired results, which makes sequential operation more difficult. There are two ways to solve this problem. The first is to rewrite two of the IP cores and add communication signals. The second is to use a parallel operation instead of a sequential operation. To save development time, the author adopted the second method. This method requires precise knowledge of the execution time of each module in order to build the pipeline structure through timing sequencing. Verification of each module shows that the Serial to Parallel module consumes 33 clock cycles, the Standard Deviation module 9 clock cycles and the CUSUM algorithm module another 33 clock cycles. The running times of these three modules are not equal, so they cannot be connected directly as a pipeline, because a data hazard would occur when old data is still being processed as new data comes in. Therefore, a latency of 24 clock cycles in total was added to the Standard Deviation module. That is, although the computation consumes only 9 clock cycles, the results are latched until the 33rd clock cycle. A Delay module was also added to this design.
The role of this module is to take the output from the Serial to Parallel module, delay it by 33 clock cycles to synchronize it with the Standard Deviation module, and then pass it to the CUSUM algorithm module. In this way, the design achieves a pipeline structure and a seamless connection of input and output data. Figure 2 shows the pipeline structure of these modules.
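The latency balancing described above can be sketched as follows. The stage latencies are those reported in the text; the dictionary keys, the `DelayLine` class and the use of a software queue to stand in for a register chain are assumptions for illustration.

```python
from collections import deque

# Clock cycles per stage, as reported for each module.
stage_cycles = {
    "serial_to_parallel": 33,
    "standard_deviation": 9,
    "cusum": 33,
}

# Pad every stage up to the slowest one so that all stages hand over together.
slot = max(stage_cycles.values())
padding = {name: slot - cycles for name, cycles in stage_cycles.items()}
print(padding)


class DelayLine:
    """Model of a register chain that delays its input by `depth` cycles."""

    def __init__(self, depth: int):
        self.regs = deque([None] * depth, maxlen=depth)

    def tick(self, value):
        out = self.regs[0]       # oldest entry leaves the chain
        self.regs.append(value)  # newest entry enters; maxlen drops the oldest
        return out


# The Delay module holds the parallelized data for 33 cycles so that it
# arrives at the CUSUM module together with the standard deviation.
delay = DelayLine(33)
outputs = [delay.tick(cycle) for cycle in range(40)]
print(outputs[32], outputs[33])
```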
VERIFICATION AND EVALUATION
Verification was carried out with ModelSim, a powerful HDL simulation tool that allows the inputs of the modules to be simulated and both output and internal signals to be viewed. From the internal registers, the parallelized data output by the Serial to Parallel module can be seen, with the mean value available at the 33rd clock cycle. Here, the outputs are separated into two groups: one group is passed to the Standard Deviation module and the other is sent to the Delay module to meet the timing requirement. The standard deviation calculation is completed at the 65th clock cycle, and the result is sent to the CUSUM algorithm module in the same clock cycle. The internal result register has been implemented as a "wire", so it does not need any extra clock cycles to drive the result to the next module. The Delay module passes the data that has been delayed by 33 clock cycles to the CUSUM algorithm module at the same time.
It can be seen from the above that the entire latency is 65 clock cycles. This means it takes 65 clock cycles to process one group of input signals. In this way the pipeline structure achieves a continuous processing flow, avoids wasting clock cycles, and maximizes processing speed. After observing several groups of sampling data, it was found that the data was transferred seamlessly without any loss. This module can therefore be considered to meet the design requirements.
The power consumption of the IP core was estimated with the PowerPlay Early Power Estimator tool provided by Altera. The power consumption is composed of static and dynamic power. Static power is the power consumed by leakage current. Dynamic power is the power the device consumes when it is actively operating. Power analysis was performed for implementations with 1, 2, 4 and 8 cores. Table 1 below gives the results.
Table 1. Power consumption results
One can see that the power consumption increases approximately linearly with the number of cores. This is because the resource utilization is directly proportional to the number of cores. Compared with computer-based transient signal detection, FPGA-based detection consumes less power, as it does not expend power on a software implementation.
DISCUSSION AND CONCLUSION
There is still room for improvement in this system before it can process several terabytes of data per day. These improvements can be achieved via multiprocessor operation and multi-core operation. A multiprocessor architecture can enhance processing speed. SOPC Builder allows users to add custom instructions to the Avalon Bus and build their own system around the NIOS II processor. The number of processors in a multiprocessor system is scalable, and processors can easily be added to or removed from the system. The Avalon Bus can easily control the on-chip memories that store the shared data in the detection process. Further, the detection workload can be seamlessly allocated to any number of processors. The resource limitation on extending such a multiprocessor system is mainly associated with the on-chip shared memories and logic resources. Therefore, more processors can easily be added to improve performance if more on-chip memory is available. However, one needs to balance the processing efficiency against the number of processors. This efficiency ratio is defined as [13]:
If the computational complexity is not sufficient to keep all processors running at the same time, adding more processors will decrease performance and waste resources. Another issue to be considered is whether to use one NIOS II processor to control multiple CUSUM cores, or multiple processors to process one CUSUM core. Selecting the wrong architecture will adversely affect processing speed. Here, using one NIOS II processor to control multiple CUSUM cores can be considered an intra-node architecture, while using multiple processors to process one CUSUM core can be considered an inter-node architecture. Generally, for computations of small or medium complexity, the intra-node architecture achieves higher performance than the inter-node architecture. This is due to the communication time required within the inter-node architecture, which slows down processing. If the computational complexity increases, the multiprocessor architecture will deliver improved performance, so these two forms of computing architecture need to be balanced in future work.
