Abstract: Much research has concentrated on providing processing throughput enhancements to existing algorithms. Many systems, however, have performance requirements that constrain their volume and/or power consumption. For volume and power constrained systems, throughput cannot be the only decision factor when selecting a computational engine. Typical studies can aid in the selection of computational engines that meet the throughput requirements of a system, but may be of little help with respect to the volume, power and thermal constraints. This paper offers a different perspective on the constrained design problem by emphasizing the costs associated with the power, size and Non-Recurring Engineering (NRE) of various computational engines. The computational engines researched in this paper are: Central Processing Unit (CPU), mobile CPU, Digital Signal Processor (DSP), and mobile Graphics Processing Unit (GPU). The architectures are compared against each other with respect to throughput, power, size and NRE costs. The authors hope that the process outlined in this paper may serve as a possible guideline for other Systems Engineers to perform similar Analyses of Alternatives of computational engines. Furthermore, the authors hope that the methods used for the relative performance evaluations will serve as a starting point to help shape policy in the selection of computational engines for future designs.
I. INTRODUCTION
When performing an Analysis of Alternatives (AoA) for the selection of the computational engine of a system, attention needs to be paid to the system constraints. Much research has concentrated on providing processing throughput enhancements to existing algorithms [1, 2, 3, 4], but many systems have performance requirements that constrain their volume and/or power consumption. Studies such as [1, 2, 3, 4] can aid in the selection of computational engines that meet the throughput requirements of a system, but may be of little help with respect to the volume, power and thermal constraints. If the limitations of the chosen architecture are not well understood beforehand, the results can be expensive and time consuming. Furthermore, if the benefits of each computational engine are not well understood beforehand, an inferior or inappropriate architecture may be chosen. This results in reduced system capability, thereby limiting the current and future software algorithms that can be implemented. Therefore, it is important to understand the limitations and benefits of existing hardware architectures and provide the best system design alternatives based on each system's specific performance requirements and constraints. This research is a direct result of this initiative and provides a methodology for performing AoAs of existing computer architectures for use in future Naval Systems. The intent is that this research may serve as a guideline and enable system engineers to choose the most appropriate architecture for use in their particular system. The guidelines will also be useful for system engineers whose applications are unconstrained, but the primary focus of this paper is the analysis of constrained designs: systems that are volume constrained, power constrained, or both.
The viable architectures analyzed in this study are: Central Processing Unit (CPU), mobile CPU, Digital Signal Processor (DSP), and mobile Graphics Processing Unit (GPU).
To help systems engineers and designers choose the appropriate architectures, this study provides the following contributions:
• Data on the software development Non-Recurring Engineering (NRE) cost of porting a C-based application to the DSP and GPU architectures, to aid in producing accurate NRE estimates and schedules;
• Architecture-based performance assessments related to power utilization, space utilization and SWaP (space, wattage and performance) [5], to aid in meeting system performance requirements and constraints;
• Architecture-specific overhead, such as GPU kernel function overhead, to better understand the complexity and limitations of the architectures.
A candidate algorithm has been chosen that performs signal processing on multiple raw data streams and utilizes Support Vector Machine (SVM) based classification [6, 7]. The candidate algorithm was chosen for its similarity with processing requirements for many naval systems as well as the research's applicability to the Wounded Warrior Program. This particular algorithm is a Neural Machine Interface (NMI) for volitional control of powered lower limb prostheses. An NMI application is both volume and power constrained, but also requires a significant amount of processing throughput, which poses many challenges [8]. To develop the candidate algorithm, the MATLAB model utilized in [6] and [7] was ported to an ANSI C baseline and its accuracy verified against the MATLAB model. The first candidate architecture to undergo a performance evaluation was the mobile CPU, because of its direct applicability to the NMI's constraints (i.e., high performance utilizing a small and low power device). The performance results for the mobile CPU based NMI were published in [8]. This paper provides the additional performance results for a CPU, DSP, and mobile GPU. Furthermore, it provides a performance comparison of all four architectures, thereby providing the basis of an AoA for the selection of hardware architectures.
The paper is organized as follows. The next section presents the Neural Machine Interface algorithm. Sections III, IV and V present our implementation and performance for the various architectures (i.e., computational engines). Sections VI, VII and VIII provide our constrained performance evaluations. We conclude our paper in Section IX.
II. NEURAL-MACHINE INTERFACE
This NMI utilizes a pattern recognition (PR) algorithm that identifies user locomotion intent based on seven (7) electromyographic (EMG) signals acquired from leg muscles and six (6) mechanical forces/moments data acquired from a 6 degrees-of-freedom (DOF) load cell mounted on the prosthetic device. Time domain based features are extracted from this data and provided to SVM-based gait phase classifiers for determination of user intent. A brief description of the NMI PR algorithm is provided below; a detailed description is available in [6] and [7].
A. Support Vector Machine Classification
SVM is a supervised learning classification technique whereby the selection of the features utilized for training and detection directly relates to the classification accuracy and the burden placed on the computational engine [9]. SVM supports the use of non-linear kernel functions [9], such as the Radial Basis Function (RBF), which provides the capability to better match the distribution of the feature sets. The chosen algorithm utilizes SVM with an RBF kernel function to provide its user intent classification. The features were chosen to provide high accuracy and minimize the burden on the computational engine [10].
B. Feature Extraction
In this study, four time-domain (TD) features (the mean absolute value, the number of zero crossings, the waveform length, and the number of slope sign changes) were extracted from EMG signals in each analysis window [10] . For mechanical data the mean, minimum, and maximum values in each analysis window were extracted as the features.
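The feature definitions above can be illustrated with a minimal Python sketch. It is not the authors' C implementation: the function names are hypothetical, and the zero-crossing and slope-sign-change tests use no amplitude threshold, which is an assumption (the exact definitions are those of [10]).

```python
def td_features(window):
    """Return (MAV, zero crossings, waveform length, slope sign changes)
    for one EMG analysis window (a list of samples, DC offset removed)."""
    n = len(window)
    # Mean absolute value of the samples in the window.
    mav = sum(abs(x) for x in window) / n
    # Zero crossing: two consecutive samples with opposite signs.
    # Assumption: no amplitude threshold is applied.
    zc = sum(1 for a, b in zip(window, window[1:]) if a * b < 0)
    # Waveform length: cumulative absolute difference between neighbours.
    wl = sum(abs(b - a) for a, b in zip(window, window[1:]))
    # Slope sign change: slope before and after a sample differ in sign.
    ssc = sum(
        1
        for a, b, c in zip(window, window[1:], window[2:])
        if (b - a) * (c - b) < 0
    )
    return mav, zc, wl, ssc


def mech_features(window):
    """Return (mean, minimum, maximum) for one mechanical channel window."""
    return sum(window) / len(window), min(window), max(window)
```

With seven EMG channels and six mechanical channels, this yields 7 x 4 + 6 x 3 = 46 features per analysis window, matching the feature vector length used by the classifier.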
C. Phase Dependent Pattern Recognition
The user's human locomotion is separated into four gait phases: initial double limb stance, single limb stance, terminal double limb stance, and swing [11] . Four separate detectors are trained, each with the data from a single corresponding gait phase. Data features are extracted from the raw EMG and mechanical signals during a sliding analysis window and fused into a single feature vector. A gait phase detector identifies the current gait phase in real-time, selects the corresponding gait sub-classifier, and forwards the feature vector to the classifier for final determination of user intent. In this study, a sliding analysis window of 150ms with a window increment of 50ms was utilized.
D. Performance Evaluation of the NMI
The performance evaluation of the NMI on the various architectures will be directly related to the average prediction time achieved by each architecture. For the purposes of the various evaluations, the prediction time is defined as the total time to execute feature extraction, normalization, gait phase detection and classification for a single analysis window.
III. CPU AND MOBILE CPU IMPLEMENTATION AND PERFORMANCE
The CPU and mobile CPU implementations were directly based on the C language implementation of the research performed in [8]. In [8], the goal was to create an NMI capable of meeting real-time constraints while executing on low power architectures. To help the lower power architectures meet real-time constraints, various common performance enhancement techniques were implemented. These enhancements included reduced dynamic memory management [12], loop unwinding [13], and inline function expansion [14], among others. The NMI's average prediction time during execution on an Intel Atom Z530 was 0.846ms. The Intel Atom Z530 CPU has a form factor of 13mm x 14mm and a maximum power utilization of 2.2 watts [15].
The current NMI CPU implementation was written to take advantage of single-core hyper-threaded [16] CPUs and is therefore not capable of taking full advantage of multi-core CPU architectures such as Intel's i5 and i7 CPUs. The closest CPU comparison to the execution on the Atom Z530 we had available was the Intel E7500 Core 2 Duo. Similarly to the previous study, the Intel E7500 allowed the Operating System (OS) to execute on one core, while the NMI executed on the second core. This helps minimize the impact of the OS on the NMI. The NMI's average prediction time during execution on an Intel E7500 was 0.605ms. The Intel E7500 CPU has a form factor of 37.5mm x 37.5mm and a maximum power utilization of 65 watts [17].
IV. DSP IMPLEMENTATION AND PERFORMANCE
The DSP implementation began with the mobile CPU C software baseline. The C baseline was modified and optimized to work with the Spectrum Digital TMS320C6713 board that utilizes a Texas Instruments TMS320C6713 DSP [18] at a clock speed of 225MHz. The development board was programmed in the C programming language using the provided Code Composer Studio integrated development environment.
For a professional with prior C programming experience, but no prior experience using Code Composer Studio, it took about 1 week to get a non-optimized program to match the mobile CPU version's execution time and accuracy. An additional 2 weeks of time was required to optimize the application to reach its maximum potential.
One optimization performed was to reduce the number of branches required by the program. The TMS320C6713 does not have any form of branch prediction. Instead, each branch function results in 5 stall operations being inserted into the pipeline [19] . When possible, the instances of nested if statements were merged into a single if statement, thus reducing the number of branches required for the same operation. The number of conditional loops was reduced by combining multiple operations into a single loop whenever possible. This also reduced the number of branches that occur within the program.
The most effective optimization was the activation of the L2 cache. The TMS320C6713 development board does not have L2 cache activated by default [20]. Instead, only a small L1 cache is used. Since external memory accesses are slow, it is beneficial to activate the L2 cache as long as the application execution does not result in a large number of cache misses. Enabling the L2 cache also requires that some internal memory be remapped to serve as the cache. In this case neither of these two issues was a factor, and the inclusion of L2 cache provided a major performance boost. This change required an additional two lines of code to be added to the program. The first instruction configures the board to use L2 cache, and the second instruction can be used to control the size of the L2 cache. In this case the best performance was achieved with the largest possible L2 cache. For the TMS320C6713 development board the largest possible L2 cache size is 64KB [20].
The optimized version of the DSP implementation resulted in an average time of 11.35ms per prediction with a standard deviation of 2.186ms. The feature extraction required an average of 6.887ms with a standard deviation of 838µs. The classification required an average of 4.472ms with a standard deviation of 1.778ms. The TMS320C6713 DSP has a form factor of 27mm x 27mm and a power utilization of approximately 1 watt [18].
V. MOBILE GPU IMPLEMENTATION AND PERFORMANCE
The mobile GPU implementation began with the mobile CPU software baseline. The C baseline was modified and optimized to take advantage of the Nvidia GeForce GT 540m architecture. The GPU utilized in this study is located within a Dell XPS laptop with an Intel i7-2720QM CPU. The GeForce GT 540m has 96 Compute Unified Device Architecture (CUDA) cores divided up into 2 separate streaming multiprocessors and runs at a clock speed of 1.3GHz [21] . The development of the application was performed in a Microsoft Visual Studio integrated development environment which provides CUDA programming capability.
One immediate difference between the GPU architecture and the DSP and CPU architectures is that the GPU is more of a highly optimized and parallelized co-processor to the CPU than a standalone architecture. Therefore, the GPU incurs the additional power and space overhead of the CPU or device it communicates with. Because the CPU power and form factor can vary, our analyses do not take this additional overhead into consideration. It is recommended that the systems engineer account for this implementation-specific overhead when performing the analyses within this paper.
Another disadvantage of the GPU is the overhead time required to launch a GPU kernel function. The CPU needs to communicate with the GPU in order to set up and run a CUDA kernel function, and this process takes a certain amount of time. If the kernel launch overhead approaches or exceeds the actual execution time of the kernel function, it becomes a detriment to the total execution time of the program. Therefore, if a kernel function performs few calculations, attention needs to be paid to how much time is spent in actual kernel function execution versus kernel launch overhead. In some cases it may be more advantageous to execute the less calculation-intensive functions on the CPU, thereby eliminating the kernel launch overhead. In our implementation, it was beneficial to execute the gait phase detection and the tallying of SVM votes on the CPU rather than the GPU. For this system, an average of 3µs of overhead time was required per GPU kernel launch. Our GPU implementation utilized nine (9) GPU kernel functions per prediction; therefore a total of 27µs per prediction is attributed to kernel function overhead.
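To put the launch overhead in perspective, the short Python sketch below relates it to the 0.193ms average GPU prediction time reported later in this section. The function names are hypothetical; the numbers are those quoted in the text.

```python
LAUNCH_OVERHEAD_US = 3.0      # average overhead per kernel launch (from the text)
KERNELS_PER_PREDICTION = 9    # kernel functions launched per prediction
GPU_PREDICTION_US = 193.0     # average GPU prediction time, 0.193 ms


def launch_overhead_us(kernels, per_launch_us=LAUNCH_OVERHEAD_US):
    """Total launch overhead, in microseconds, for a number of kernel launches."""
    return kernels * per_launch_us


def overhead_fraction(kernels, total_us, per_launch_us=LAUNCH_OVERHEAD_US):
    """Fraction of the total prediction time spent on kernel launch overhead."""
    return launch_overhead_us(kernels, per_launch_us) / total_us


if __name__ == "__main__":
    total = launch_overhead_us(KERNELS_PER_PREDICTION)
    share = overhead_fraction(KERNELS_PER_PREDICTION, GPU_PREDICTION_US)
    print(f"launch overhead: {total:.0f} us ({share:.1%} of a prediction)")
```

Under these figures the launch overhead is roughly one seventh of the prediction time, which illustrates why moving lightweight steps to the CPU can pay off.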
One important concept in GPU programming is the concept of organizing execution paths into grids, blocks, threads, and warps. When starting a kernel function the CPU specifies several parameters. The main parameters used are the number of threads per block, the number of blocks, and the number of grids of blocks. In this case there was only one grid, since we were using a single GPU board. Threads are grouped into blocks. Threads in the same block can share data and be synchronized whereas threads of different blocks cannot. Another important concept is warps. Threads are grouped into sets of 32 threads known as warps. Threads in the same warp are intrinsically synchronized and are scheduled together. When writing GPU code it is important to keep threads of the same warp following the same execution path to prevent divergent warps. When threads of the same warp execute different code the warp is said to be divergent and the operations are executed in a serialized fashion, thus missing the potential parallelism offered by the GPU. More detailed information on CUDA programming, grids, blocks, threads and warps can be found in [22] .
Our CUDA program begins its execution on the CPU and then the CPU initiates kernel functions that execute on the GPU. In this case the program begins by copying the raw EMG and load cell data to the GPU to be used during its execution. For each analysis window, the phase detection is performed on the CPU. Nine GPU kernel functions are used to perform the needed steps for the feature extraction, normalization, and to determine the one versus one SVM classifier votes. The votes are then copied from the GPU to the CPU where the actual one versus one SVM votes are tallied to determine the user intent for the given window. This NMI algorithm allows for a large amount of parallelization. The GPU's massively parallel architecture provides the capability to take advantage of this opportunity. As shown by Amdahl's Law [23] , the more parallelization that can be found in an application, the greater the increase in the performance of the application on a parallel architecture such as a GPU. To take advantage of the principles defined by Amdahl's Law we examined the NMI algorithm for every possible opportunity to exploit parallelism. The DC offset for each of the channels can be calculated and removed in parallel. Each of the 46 features that need to be extracted from the channel data can be calculated in parallel. Similarly, the approximately 200 to 400 SVM support vector dot products can be performed in parallel, and the 21 one versus one SVM classifiers that use the SVM dot product values can be performed in parallel.
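For reference, Amdahl's Law is commonly written as shown below. The symbols $p$ (the fraction of the execution time that can be parallelized) and $N$ (the number of parallel processing elements) are introduced here for illustration and do not appear in the original text.

$$S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}$$

As $N$ grows, the speedup $S$ approaches $1/(1 - p)$, so the serial fraction bounds the achievable gain. This is why each parallelizable step of the NMI algorithm listed above was moved to the GPU.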
The parallelization was further increased by utilizing the parallel reduction method [24, 25] to parallelize the work necessary to calculate the values. The parallel reduction method uses many threads to process a data set. For example, when finding the sum of a data set, each thread is first used to find the sum of two values in the data set. After the first stage, half of the threads hold partial sums. Then half of the threads with partial sums add their partial sums to that of one of the other threads. This process continues until one thread holds the sum of the entire data set. In this way the sum is found in a manner that maximizes the parallelism provided by the GPU [24, 25].
Fig. 1 shows the NMI algorithm's GPU implementation, data flow, and the workload separation between the CPU and GPU architectures. In Fig. 1, the portions of the algorithm allocated to the GPU are initiated by kernel functions launched by the CPU to perform the calculations. These kernel functions are launched with a set number of blocks and threads per block in order to best take advantage of the architecture of this particular GPU. For our program we utilized a block size of 96 threads. This provided enough threads to accomplish each given task.
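The parallel reduction method described above can be sketched as a sequential Python simulation. This is an illustration of the technique only, not the authors' CUDA kernels: each pass of the loop stands for one stage in which every active "thread" combines two values, and the function name is hypothetical.

```python
import operator


def parallel_reduce(values, op=operator.add):
    """Simulate a tree reduction.

    Returns (result, stages), where stages is the number of reduction
    stages that would each run in parallel on the GPU.
    """
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty data set")
    stages = 0
    while len(data) > 1:
        half = (len(data) + 1) // 2
        # "Thread" i combines element i with element i + half when it exists;
        # with an odd count, the middle element is carried over unchanged.
        data = [
            op(data[i], data[i + half]) if i + half < len(data) else data[i]
            for i in range(half)
        ]
        stages += 1
    return data[0], stages


if __name__ == "__main__":
    total, stages = parallel_reduce(range(1, 97))
    print(total, stages)
```

The same routine works for any associative operation, so passing `min` or `max` as `op` covers the mechanical-channel features as well as the sums.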
For the seven EMG channels, the DC offset first has to be removed. This is done by calculating the mean of each channel and subtracting the mean from each of the values. The mean of each channel is calculated in the first GPU kernel function. This function finds the means for all channels, including the mechanical channels, since the mean is a feature required from the mechanical channels. A second kernel function is then used to subtract the mean from the EMG channels.
For the feature extraction, 96 threads per block are used to calculate the features needed for a channel. Each block is responsible for extracting the features for 1 channel. Therefore 13 blocks are required, seven for the EMG channels and six for the mechanical features. All the features are calculated using the reduction method. The third kernel function is used for the extraction of the EMG features. In this function each thread reads in a single data value from a single EMG channel and then extracts all four EMG features for this data point (calculates the absolute value of the current data point, calculates the waveform length of the current data point relative to the prior data point, determines if the current data point is representative of a zero crossing, and determines if the current data point is representative of a slope sign change). Once these factors are known, the values can then be combined together using parallel reduction as previously described until one thread holds the feature values for the current window increment. The mechanical channels do not have to wait for the DC offset to be subtracted from the data so the features from these channels can begin to be extracted immediately while the EMG channels are still waiting. There is no need for a separate kernel function to calculate the mean of the mechanical channels as that is handled by the same kernel function that calculates the mean of each channel in order to remove the DC offset from each of the EMG channels. The fourth kernel function is used to extract the mechanical features and also utilizes the reduction method to increase parallelism. Each thread reads in one value from the current window. It then compares this value to the value in one other thread to determine which is the minimum and which is the maximum. This process continues until there are only two threads remaining at which point the min and max for the entire data set can be found. 
The fifth kernel function saves the features by loading them into an array to be processed by the later steps in the GPU implementation of the algorithm.
A sixth kernel function is utilized to calculate the SVM dot products and also utilizes the reduction method. Each block is allocated 96 threads. Each block of 96 threads is segmented into three warps of 32 threads. Each warp calculates the value of one SVM dot product; hence the three warps can calculate 3 SVM dot products in parallel. There are a total of 46 features in each SVM support vector and test vector, therefore 46 products and 45 sums are required for each dot product. Each warp performs the following steps:
Step 1. The 32 threads in the warp calculate the first 32 products.
Step 2. The first 14 threads in the warp calculate the final 14 products and sum them with their prior 14 products, resulting in the first 14 partial sums.
Step 3. The 32 values held by the warp are reduced into 16 partial sums.
Step 4. The remaining 16 partial sums are reduced into 8 partial sums.
Step 5. The remaining 8 partial sums are reduced into 4 partial sums.
Step 6. The remaining 4 partial sums are reduced into 2 partial sums.
Step 7. The remaining 2 partial sums are summed and become the final sum.
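The per-warp procedure can be checked with a small sequential Python simulation. It is a sketch under stated assumptions, not the authors' CUDA kernel: the list `lanes` stands in for the 32 threads of a warp, the function name is hypothetical, and the halving stages after Step 2 are assumed to proceed 32, 16, 8, 4, 2, 1.

```python
WARP_SIZE = 32


def warp_dot(support_vector, test_vector):
    """Simulate one warp computing a dot product of 33 to 64 elements.

    Returns (dot_product, additions) so the addition count can be compared
    with the 45 sums quoted in the text for 46 features.
    """
    n = len(support_vector)
    if n != len(test_vector) or not WARP_SIZE < n <= 2 * WARP_SIZE:
        raise ValueError("expected two vectors of equal length in 33..64")
    additions = 0
    # Step 1: each of the 32 lanes computes one of the first 32 products.
    lanes = [support_vector[i] * test_vector[i] for i in range(WARP_SIZE)]
    # Step 2: the first n - 32 lanes compute the remaining products and add
    # them to their own earlier product (14 lanes when n == 46).
    for i in range(n - WARP_SIZE):
        lanes[i] += support_vector[WARP_SIZE + i] * test_vector[WARP_SIZE + i]
        additions += 1
    # Steps 3 to 7: halve the number of active lanes until lane 0 holds the sum.
    width = WARP_SIZE // 2
    while width >= 1:
        for i in range(width):
            lanes[i] += lanes[i + width]
            additions += 1
        width //= 2
    return lanes[0], additions
```

For 46 features the simulation performs 14 + 16 + 8 + 4 + 2 + 1 = 45 additions, which agrees with the count stated above.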
We allocated a single SVM dot product to a warp of 32 threads to minimize warp divergence and synchronization issues. If threads from the same warp follow different execution paths then the threads are said to diverge. In the case that the threads diverge, they are executed in a serial fashion and thus do not take best advantage of the parallel processing provided by the GPU. The other advantage of using a warp to calculate a single SVM dot product is that there is no need to call any thread synchronization functions because the threads of a warp are naturally synchronized.
An eighth kernel function is executed at the same time as the SVM dot products are being calculated. This function does some necessary setup prior to the SVM classification. In this function some required variables are initialized to be used in the classification. The ninth kernel function performs the one versus one SVM classification. The 21 classifiers are executed using 21 blocks of 96 threads each. Each of the classifications is again done using parallel reduction. This produces the 21 votes that are copied back to the CPU in order to tally the final vote and determine the user intent for the current window.
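The CPU-side tally of the one versus one votes can be sketched as follows. Several details are assumptions made for illustration: 21 pairwise classifiers correspond to 7 classes only if every pair of classes has a classifier (the class count is not stated in the text), ties are broken in favor of the lowest class label, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

NUM_CLASSES = 7  # assumption: 21 = C(7, 2) pairwise classifiers
PAIRS = list(combinations(range(NUM_CLASSES), 2))


def tally_votes(votes):
    """Return the class with the most one-versus-one votes.

    votes holds one winning class label per pairwise classifier.
    Assumption: ties go to the lowest class label.
    """
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


if __name__ == "__main__":
    # Example: class 3 wins every pairing it takes part in.
    votes = [3 if 3 in pair else min(pair) for pair in PAIRS]
    print(len(PAIRS), tally_votes(votes))
```

Because the tally touches only 21 small integers, it is the kind of lightweight step for which a kernel launch would cost more than the work itself.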
This implementation takes advantage of the parallel nature of the GPU while at the same time avoiding one of its biggest disadvantages, the need to copy data back and forth between the GPU and the CPU [22] . With the method outlined above the program only requires one memory copy between the GPU and the CPU per prediction. This is done by keeping as much of the data as possible on the GPU and only copying data to the CPU at the very end of a prediction. Fig. 2 shows the GPU program flow for this implementation.
About a month and a half of work performed by a professional with prior C experience and no prior CUDA experience was required to produce a GPU implementation that matched the accuracy of the CPU implementation. An additional month and a half was required to produce code that could match and exceed the prediction speed accomplished by the CPU implementation. In the final optimized version of the GPU code, the program required an average of 0.193ms per prediction with a standard deviation of 21µs. An average of 51µs with a standard deviation of 0.2µs was required for the feature extraction and 74µs with a standard deviation of 18µs for the classification. The GeForce GT 540m GPU has a form factor of 29mm x 29mm and a maximum power utilization of 35 watts [26, 27].
VI. POWER CONSTRAINED ANALYSIS
This analysis will compare the computational performance of the architectures relative to their respective power utilization. We recommend that this analysis be performed for systems that are constrained to operate within a limited power or thermal range. This performance requirement is usually imposed when the lower power utilization will allow the device to operate for a longer period, there is a limited method to dissipate thermal energy, or the system has a limited power source. Some examples might be satellites, electric passenger vehicles, and electric autonomous vehicles.
For a computational performance measure we will utilize the number of floating point operations per second (FLOPS). The power utilization will be measured in watts. Of interest is the performance of each architecture per its respective power utilization, therefore (1) can be used to provide the relative performance of each architecture for our NMI algorithm. We intentionally utilized the same NMI algorithm for all the architectures to ensure that the number of floating point operations per prediction, the numerator in (1) below, is the same for all architectures implementations. Therefore to maximize the performance of any given architecture we must minimize the product of the prediction time and power utilization. Conversely, the architecture whose product of the prediction time and power utilization is the largest will be the worst performing architecture for this analysis. For this analysis, the Intel E7500 CPU architecture exhibited the worst performance with a prediction time of 0.605ms and a maximum power utilization of 65 watts. To provide a comparison between all the architectures we will take a ratio of each architecture's performance achieved by (1) relative to the worst performer (CPU). By comparing the architectures to the worst performer we can then provide a performance ratio utilizing (2).
$$\text{Performance}_{\text{power}} = \frac{N_{\text{FLOP}}}{t_{\text{pred}} \cdot P} \tag{1}$$
$$\text{Ratio}_{\text{power}} = \frac{\text{Performance}_{\text{power}}}{\text{Performance}_{\text{power,CPU}}} = \frac{t_{\text{pred,CPU}} \cdot P_{\text{CPU}}}{t_{\text{pred}} \cdot P} \tag{2}$$
where $N_{\text{FLOP}}$ is the number of floating point operations per prediction, $t_{\text{pred}}$ is the prediction time, and $P$ is the power utilization in watts.
Figure 2. GPU implementation program flow
Table I provides the results of the power constrained analysis. As can be seen, the Atom mobile CPU provided the highest performance, which was 21 times that of the CPU. Although the mobile CPU provided the highest performance, it does not automatically make it the best architecture choice; the system performance requirements need to be examined prior to making a final selection. This applies for all of the analyses performed in this paper. For example, for an NMI, the less power utilized the longer a patient is able to use the prosthesis without the need for changing and/or replacing the power cell. All of the architectures tested met the requirement for a 50ms prediction time. Since the DSP's power consumption is lower than that of the mobile CPU, it may be the better choice. Alternatively, since the prosthetic device we are targeting requires updates every 10ms, the goal is for the NMI to perform a prediction every 10ms. Based on a 10ms prediction time, the mobile CPU becomes the better choice. Furthermore, the mobile CPU provides additional expansion capability to augment the existing NMI algorithm to provide actual leg control and EMG signal anomaly detection in future design iterations.
It is important to note that these results are for the phase dependent NMI algorithm and that a different algorithm will probably result in different performance and rankings for the architectures. To ensure accurate results, it is recommended that the actual target algorithm, actual architecture power utilizations during algorithm execution, and actual architecture sizes be utilized to perform this and all of the other analyses in this paper. To provide an example of how to perform these analyses, we have only taken into account the computational engine and utilized the manufacturers' maximum advertised power consumption.
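The requirement-driven selection discussed in this section can be sketched in a few lines of Python: first discard the architectures that miss the prediction deadline, then pick the lowest-power survivor. The data are the prediction times and power figures quoted in Sections III to V; the function name is hypothetical, and a real AoA would weigh further criteria such as expansion capability.

```python
# (average prediction time in ms, power in watts) as quoted in the text.
# The DSP figure is an approximate power; the others are advertised maxima.
ARCHITECTURES = {
    "CPU (Intel E7500)": (0.605, 65.0),
    "Mobile CPU (Intel Atom Z530)": (0.846, 2.2),
    "DSP (TMS320C6713)": (11.35, 1.0),
    "Mobile GPU (GeForce GT 540m)": (0.193, 35.0),
}


def lowest_power_meeting(deadline_ms):
    """Return the lowest-power architecture whose prediction time meets
    the deadline, or None if no architecture qualifies."""
    feasible = {
        name: watts
        for name, (t_ms, watts) in ARCHITECTURES.items()
        if t_ms <= deadline_ms
    }
    if not feasible:
        return None
    return min(feasible, key=feasible.get)


if __name__ == "__main__":
    for deadline in (50.0, 10.0):
        print(deadline, "ms ->", lowest_power_meeting(deadline))
```

With a 50ms deadline the DSP is selected, while a 10ms deadline excludes the DSP (11.35ms) and the mobile CPU is selected, mirroring the reasoning above.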
VII. VOLUME CONSTRAINED ANALYSIS
This analysis will compare the computational performance of the architectures relative to the surface area that each would utilize on a circuit board assembly. We recommend that this analysis be performed for systems designs that are constrained to operate within a small volume. Some applications that can be volume constrained are networked surveillance cameras, digital photo frames and home automation devices.
For a computational performance measure, we will utilize the number of floating point operations per second. Of interest is the computational performance of each of the architectures relative to its surface area consumption, therefore (3) can be used to provide a measure of the relative performance of each of the architectures for our NMI algorithm. Similarly to the power constrained analysis, the NMI algorithm utilized the same number of floating point operations per prediction; therefore, to maximize the performance of any given architecture, we must minimize the product of the prediction time and architecture surface area. Conversely, the architecture whose product of the prediction time and surface area is the largest will be the worst performing architecture for this analysis. For this analysis, the Texas Instruments TMS320C6713 DSP architecture exhibited the worst performance with a prediction time of 11.35ms and a package dimension of 27mm by 27mm. To provide a comparison between all the architectures we will take a ratio of each architecture's performance as determined by (3) relative to the worst performer (DSP). By comparing the architectures to the worst performer we can then provide a performance ratio utilizing (4).
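Equations (3) and (4) are not reproduced in this text. A form consistent with the description above, and with the reported ratio of 53.7 for the mobile CPU, is sketched below; the symbols $N_{\text{FLOP}}$ (floating point operations per prediction), $t_{\text{pred}}$ (prediction time) and $A$ (package surface area) are assumed notation.

$$\text{Performance}_{\text{volume}} = \frac{N_{\text{FLOP}}}{t_{\text{pred}} \cdot A} \tag{3}$$

$$\text{Ratio}_{\text{volume}} = \frac{\text{Performance}_{\text{volume}}}{\text{Performance}_{\text{volume,DSP}}} = \frac{t_{\text{pred,DSP}} \cdot A_{\text{DSP}}}{t_{\text{pred}} \cdot A} \tag{4}$$

For the mobile CPU this gives $(11.35 \times 27 \times 27) / (0.846 \times 13 \times 14) \approx 53.7$.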
Table II provides the results of the volume constrained analysis. As can be seen, the mobile CPU provided the highest performance, which was 53.7 times that of the DSP. Again, the mobile CPU architecture appears to be the best alternative. Furthermore, being the smallest architecture chosen for this AoA, the mobile CPU provides a viable solution for mounting the final design into the prosthesis.
VIII. VOLUME AND POWER CONSTRAINED ANALYSIS
This analysis will compare the computational performance of the architectures using SWaP. We will examine the architectures' computational performance relative to their respective surface areas and power consumptions. We recommend that this analysis be performed for systems designs that are both volume and power constrained. Some applications that are both power and volume constrained are cell phones, tablets and neural-machine interfaces.
For a computational performance measure, we will utilize the number of floating point operations per second. Of interest is the computational performance of each architecture relative to its surface area and power consumption, therefore (5) can be used to provide the relative performance of each architecture for our NMI algorithm. Similarly to the prior constrained analyses, the NMI algorithm utilized the same number of floating point operations per prediction; therefore, to maximize the performance of any given architecture, we must minimize the product of the prediction time, the architecture surface area and the power consumption. Conversely, the architecture whose product of the prediction time, surface area and power consumption is the largest will be the worst performing architecture for this analysis. For this analysis, the Intel E7500 CPU architecture exhibited the worst performance with a prediction time of 0.605ms, a package dimension of 37.5mm by 37.5mm and a power consumption of 65 watts. To provide a comparison between all the architectures we will take a ratio of each architecture's performance as measured by (5) relative to the worst performer (CPU). By comparing the architectures to the worst performer we can then provide a performance ratio utilizing (6).
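Equations (5) and (6) are not reproduced in this text. A form consistent with the description above, and with the reported ratio of 163 for the mobile CPU, is sketched below; the symbols $N_{\text{FLOP}}$ (floating point operations per prediction), $t_{\text{pred}}$ (prediction time), $A$ (package surface area) and $P$ (power consumption) are assumed notation.

$$\text{Performance}_{\text{SWaP}} = \frac{N_{\text{FLOP}}}{t_{\text{pred}} \cdot A \cdot P} \tag{5}$$

$$\text{Ratio}_{\text{SWaP}} = \frac{\text{Performance}_{\text{SWaP}}}{\text{Performance}_{\text{SWaP,CPU}}} = \frac{t_{\text{pred,CPU}} \cdot A_{\text{CPU}} \cdot P_{\text{CPU}}}{t_{\text{pred}} \cdot A \cdot P} \tag{6}$$

For the mobile CPU this gives $(0.605 \times 37.5 \times 37.5 \times 65) / (0.846 \times 13 \times 14 \times 2.2) \approx 163$.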
Table III provides the results of the SWaP analysis. As can be seen, the mobile CPU provided the highest performance, which was 163 times that of the CPU. Similarly to the prior constrained performance analyses, it is important that the results from the SWaP performance evaluations are used in conjunction with the system performance requirements prior to making a final architecture selection. Based on the three constrained analyses and the future performance requirements of the NMI, the mobile CPU architecture appears to be the best selection.
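The three constrained analyses can be reproduced with the Python sketch below, using only the prediction times, package dimensions and power figures quoted in Sections III to V. The function and key names are hypothetical. Because the number of floating point operations per prediction is the same for every architecture, it cancels out of each ratio and does not appear in the code.

```python
# name: (prediction time in ms, power in watts, package area in mm^2)
ARCHITECTURES = {
    "CPU": (0.605, 65.0, 37.5 * 37.5),        # Intel E7500
    "Mobile CPU": (0.846, 2.2, 13.0 * 14.0),  # Intel Atom Z530
    "DSP": (11.35, 1.0, 27.0 * 27.0),         # TMS320C6713, approx. power
    "Mobile GPU": (0.193, 35.0, 29.0 * 29.0), # GeForce GT 540m
}


def performance_ratios(use_power, use_area):
    """Return {architecture: performance relative to the worst performer}.

    The cost of an architecture is its prediction time multiplied by the
    selected constraints; a lower cost means higher performance.
    """
    costs = {}
    for name, (t_ms, watts, area) in ARCHITECTURES.items():
        cost = t_ms
        if use_power:
            cost *= watts
        if use_area:
            cost *= area
        costs[name] = cost
    worst = max(costs.values())
    return {name: worst / cost for name, cost in costs.items()}


if __name__ == "__main__":
    analyses = {
        "power": performance_ratios(True, False),
        "volume": performance_ratios(False, True),
        "SWaP": performance_ratios(True, True),
    }
    for label, ratios in analyses.items():
        print(label, {name: round(r, 1) for name, r in ratios.items()})
```

Swapping in measured power draw during algorithm execution, as recommended in Section VI, only requires changing the values in `ARCHITECTURES`.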
IX. CONCLUSIONS
This paper presented a methodology of performing constrained AoAs for the selection of computational engines for future system designs. Various analyses were utilized to evaluate power, volume and both power/volume performance constraints. Guidance was provided on when to use each analysis and how to combine the results of the analyses with performance requirements to provide the appropriate computer architecture selection for future system designs. NRE was provided for the DSP and mobile GPU architectures to aid in properly planning such an analysis. As can be seen by the three man-month effort to port and optimize the NMI for use in a mobile GPU architecture, such analysis can be time consuming and expensive. We hope that the processes and analyses presented will help other systems engineers perform their own AoAs for their system. Furthermore, we hope that the methods used for the relative performance evaluations will serve as a starting point to help shape policy in the selection of computational engines for future designs.
Our future research includes the development of multi-core variants of the phase-dependent NMI algorithm, using various programming techniques. We plan to compare the performance of multi-core processors, such as the Intel i5 and i7 architectures, to that of the mobile CPU, CPU, DSP and mobile GPU architectures. Although the size and power consumption of these architectures may exclude them from candidacy for an NMI, the additional results will provide a more complete AoA. Furthermore, the parallel capability of the multi-core processors should provide a better comparison relative to the parallel GPU architecture. 
