1: Introduction

"Once the architecture begins to take shape, the sooner contextual constraints and sanity checks are made on assumptions and requirements, the better."
Eberhardt Rechtin, Systems Architecting: Creating & Building Complex Systems [1]

Commercial massively parallel processing (MPP) architectures offer a solution to TERAFLOP (one trillion operations per second) computing applications in the Navy. Computing density (TERAFLOP/cubic foot) and cost (dollars/TERAFLOP) have improved in recent years; however, the challenge of real-time embedded processing requirements poses a high risk for complex systems. The high risk relates to: inefficient match of applications to architectures, low availability of high throughput architectures, the accuracy of forecasted downward spiraling price projections, and immature software development tools (e.g., parallelizing compilers).

This paper proposes a generic approach to mitigate the risk when investing in a specific MPP architecture. The approach proposes a series of intermediate steps to assess the compatibility of the architecture and the requirements. Each step refines the assessment and leads to the final tradeoff study. The approach embodies reengineering considerations when pursuing new implementations for [...] Wide Aperture Array (WAA) beamforming problem serves as the initial case study for the application of the generic approach. The approach will evolve based on the comments, suggestions and progress of research performed in the ECS Program under the direction of the Naval Surface Warfare Center (NSWC).

[...]
6. assessing the compatibility of the architecture with the real-time embedded applications, and
7. selecting the appropriate design approach based on a trade-off analysis.

The tradeoff analysis incorporates an assortment of design tools to expedite and facilitate the decision making process. Examples of tools used to date include: VHSIC Hardware Description Language (VHDL), RDD-100 and OMTool. The approach will integrate the systems engineering tools developed under the direction of the Naval Surface Warfare Center (NSWC) within the Office of Naval Research's (ONR) Engineering of Complex Systems (ECS) Block Program as they become available.

The generic approach suits a myriad of applications ranging from radar to sonar systems. Accordingly, air, surface and subsurface platforms can benefit from the approach outlined in this paper. This paper uses the case study method to showcase the approach.

The case study method applies the generic approach to a practical application. This paper discusses one case study to demonstrate the utility of the approach. The research will explore additional case studies as interest arises.

This paper describes the case study, outlines the generic systems engineering design approach, presents high level architectural sizing techniques, and discusses detailed modeling and requirements allocation issues.

2: Overview of WAA case study

NUWC chose the WAA in-board electronics application for two reasons. First, reengineering the WAA system contributes to the incremental insertion of commercial off-the-shelf (COTS) equipment into submarine warfare systems [2]. Second, [...] Development's (DDR&E) thrust areas. Two of DDR&E's seven thrust areas emphasize Affordability, and Sea Control and Undersea Superiority [3]. Therefore, the reengineering of the existing WAA system responds to changing cost and commercial technology requirements.

This paper provides a brief description of the Wide Aperture Array full detection system. As interest arises, the approach will incorporate diversified case studies based on other existing Naval systems.

The cost effective implementation of the in-board electronics for a Wide Aperture Array full detection system serves as the primary objective for this case study. The WAA system can perform the detection function for a submarine. Sea test data indicates the Wide Aperture Array system can provide an attractive capability. In addition, current fiscal requirements within the Navy (i.e., the New Attack Submarine Program) have created the need to significantly reduce the cost of the existing WAA system.

The WAA full detection concept conceived by DePrimo and Choinski, and published in "An Efficient Approach to Systems Evolution (EASE)" [4], offers a cost effective way to implement the inboard electronics with additional full detection capability. This implementation serves as the case study for this paper. The implementation uses commercial massively parallel processing technology developed within the High [...]

Figure 1 illustrates the WAA signal processing specified for the reengineering process. The signal processing includes additions to the existing WAA system.

This paper proposes a generic approach to mitigate the risk when investing in a massively parallel architecture. The approach concentrates on the importance of constraints and sanity checks throughout the design process.

3: System engineering approach

Figure 2 embodies the intent of Rechtin's heuristic, as quoted at the beginning of the paper, by setting up a process to assess the compatibility of the architecture and [...]

The process starts with the definition of functional requirements and the selection of a candidate architecture. For reengineering problems like the Wide Aperture Array, the process includes a step to characterize the existing system. The existing system characterization provides the baseline for the tradeoff analysis.

An object oriented software design follows the functional specification. The inclusion of the object oriented design step translates functional requirements to objects suitable for software design. This step will determine if object oriented design facilitates software portability and reuse. In practice, a systems designer could bypass this step in favor of functionally based software design.

The sustained throughput and data rate estimates follow the object oriented design. The throughput and data rate estimates enable a preliminary architectural sizing using the performance data from existing libraries or static benchmarks. Static benchmarks provide single processor performance data for metrics like efficiency. Therefore, the architecture sizing obtained at this point allows for an initial assessment of the instantaneous throughput levels quoted by manufacturers.

Given the preliminary architectural sizing, the systems designer can perform a detailed analysis of the architectural requirements for the given application. [...]

In this manner, dynamic benchmarking reduces risk. The move from using a single processor to multiple processors differentiates dynamic benchmarking from static benchmarking. Dynamic benchmarking also introduces partitioning, input/output (I/O) issues, and event driven processing attributes.

In addition, the dynamic benchmarks validate the detailed architecture models simulated in this step. The concept of using modeling, simulation and benchmarking for architecture validation was first introduced by Muñoz of the Naval Undersea Warfare Center [5]. Figure 3 elaborates on the allocation process identified in figure 2.

After allocating the functions, the final tradeoff analysis uses a set of previously defined metrics to compare the performance of the proposed implementation to the existing baseline system. The results of the tradeoff analysis determine whether to accept, modify or eliminate the candidate architecture.

3.1: Metrics

The basis of the tradeoff analysis rests with the extraction and comparison of metrics. Modeling and simulation permit measurement of the metrics for the proposed system. The measured data can be compared to the existing system baseline data.

Ideally, the systems design engineer should baseline the existing system using metrics necessary to complete the tradeoff analysis. Under these conditions, the designer completes the tradeoff analysis by comparing the new and existing systems on equal footing.

Unfortunately, even the best documentation from a military system will fall short of supplying all the previously defined metrics for the tradeoff analysis. Design engineers document their work for development and not reengineering purposes. Therefore, the tradeoff analysis will embody comparisons between similar but not equivalent metrics.

[...]
3. generating an object oriented design,
4. establishing a preliminary architectural sizing from static benchmarks.

Once completed, these four steps lead to a detailed architectural design and development. Unlike the detailed design, the high level architectural sizing does not address software issues or partitioning of the functions.

4.1: Functional requirements definition

The functional requirements definition phase of the generic system engineering design approach results in system level specifications for the application. For the case study highlighted in the paper, a systems engineer [...] to capture the implementation of the existing WAA [...]

The reengineering process was initiated by using the design capture views established by the ECS Block [...]
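The preliminary architectural sizing described in Section 3 can be illustrated with a minimal sketch. It assumes the sizing amounts to dividing the required sustained throughput by the throughput one processor actually delivers (peak rate times benchmark efficiency). The function name `processors_needed` and every number below are hypothetical and are not taken from Table II.

```python
import math


def processors_needed(sustained_mflops, peak_mflops_per_processor, efficiency):
    """Estimate how many processors meet a sustained throughput requirement.

    Assumption (not stated in the paper): delivered throughput per processor
    is the manufacturer's peak rate multiplied by the efficiency measured
    with a static (single processor) benchmark.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    delivered = peak_mflops_per_processor * efficiency
    return math.ceil(sustained_mflops / delivered)


# Hypothetical inputs, for illustration only.
required_mflops = 2000.0   # sustained throughput estimate for the application
peak_mflops = 75.0         # peak rate quoted for one processor
efficiency = 0.25          # fraction of peak achieved on a static benchmark

count = processors_needed(required_mflops, peak_mflops, efficiency)
print(f"preliminary sizing: {count} processors")
```

A sizing of this kind does not address partitioning or communication, which is why the approach follows it with detailed modeling and dynamic benchmarks.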
2. Communication Bandwidth
A description of the I/O rate measured in MBytes/second.

3. Memory Bandwidth
A measure of the memory access requirements per unit time represented by Bytes/second.

4. FLOPS-I/O Ratio
A ratio which compares the computation load (MFLOPS) to the I/O (Bytes/sec) load.

5. Latency-FLOPS Product
A characterization of the ability to support communications requirements versus the computational bandwidth requirements of a module or architectural element.

6. Power/Weight/Volume
Values used to characterize the physical attributes of a system. Power is characterized by Watts, weight by pounds (lb) and volume by cubic feet (ft^3).

7. dB/Watts
A measure which combines process gain (dB), algorithm efficiency, dB/gate-Hz, technology cost, gate-Hz/watts, architecture efficiency, and percent duty cycle. An alternative is to use noise recognition differential (NRD) instead of process gain for a measure of sonar system performance.

8. Architecture Diameter
An integer which represents the maximum number of communication paths that a message or data may be required to travel from processor to processor.

9. Architecture Latency
The maximum time, in seconds, a message takes to propagate across the path that determines architecture diameter.

10. Processor Memory Ratio
A ratio that captures the memory available to an individual processor. For local memory systems the ratio would be the local memory per processor. For shared memory systems the ratio would be computed by dividing the total system memory by the number of processors and adding the amount of local cache memory per processor.

11. Average Message Size per Processor
A value computed by dividing the total number of message bytes sent during the time it takes to execute an algorithm by the number of processors.

12. Response Time
The time in seconds that is required to execute an algorithm. The time begins when the first processor starts executing and ends when the last processor stops executing.

13. Processor Utilization
A percentage computed by dividing the sum of the individual times that the processors are executing by the total time it takes to execute the algorithm, times the number of processors in the system: $(t_1 + t_2 + t_3 + \cdots + t_n) / (N\,T)$, where $t_i$ is the time processor $i$ spends executing, $N$ is the number of processors, and $T$ is the total time to execute the algorithm.

14. Program Size
The size in bytes of the program.
15. Speed Up
A value computed by dividing the response time for an algorithm executing on a single processor by the response time for an algorithm executing on several nodes in a system.

[...] for the baseline characterization of the Wide Aperture Array System case study.

Although the existing baseline uses a distributed processing architecture, some of the experiences can be carried over to the massively parallel processor architecture. For example, since trackers do not require large amounts of throughput, the MPP implementation for trackers probably would not change significantly. Figures 4-7 present four different views of the Wide Aperture Array System. Partitioning functions to resources has become the focal point for the case study because of its significance in massively parallel array architectures.

4.3: Object oriented design

The object oriented software design follows the functional specification and the existing system baseline. The inclusion of the object oriented step translates functional requirements to objects suitable for software design. Object oriented design should facilitate the reuse and portability of software.

Once the functional requirements have been designed, a software engineer determines the set of software objects necessary to achieve the desired functionality. An analysis [...]

OMTool will perform the functional to object oriented translation automatically; however, OMTool cannot perform the translation at this time. OMTool provides functional, object and data flow views for a given application. In addition, the tool produces C++ code. This paper neither endorses nor denounces the use of OMTool. Engineers working on the WAA case study use OMTool because of the features available for the given price range.

Figure 8 illustrates the object oriented array system design. Figure 9 expands the object oriented beamformer design. These diagrams represent a synopsis of the object oriented design which will be used to reengineer the WAA system. Note that although the object oriented design started with the WAA application in mind, the high level software suits any array processing problem using a 2 stage time delay beamformer.

In the future a systems engineer specifying the functional level requirements could expedite the object oriented design if a link was developed between tools like OMTool and RDD-100. The link could further automate the design process. In addition, the link would also ensure the consistency of requirements between object oriented and system design tools.

[...] floating point multiply and addition operations were calculated for the functions identified in figure 1. The sustained throughput estimates in Table II reflect these multiply and addition estimates coupled with input data rates, [...] Intel Corporation provided the efficiency and peak numbers in Table II [...]

Technology independence means that it is possible to retarget the software. Partitioning involves dividing the processing into pieces which can run on individual processing elements. Massively parallel architectures can have an assorted collection of heterogeneous analog or digital processors. The program that runs the real-time embedded system [...]

Scalability and partitioning are correlated. Good software partitioning methods generally lead to good scaling. Generally, the efficiency and scalability increase with effective software development techniques. Note, however, that this relationship is not linear and is algorithm dependent.

Despite continuing research efforts in parallel processing, two challenges exist for MPP architectures:

1. The MPP scalability problem presents a major obstacle. Efficiencies from benchmarks with large (thousands of processors) MPP architectures measure less than 10%. For vector processors like the Cray supercomputer, the efficiencies measure higher than 10%. These inefficiencies create a high incentive to increase the speed-up of MPP systems.

[...] awareness of the topology of the MPP architecture, but it can achieve the highest scalability and efficiency. Table IV describes the software techniques.

Unfortunately, automatic mapping technology for partitioning and allocation does not exist. Good performance in programming MPP architectures relies on tedious manual mapping methods, [...]

The four salient features for the portable massively parallel systems design (MPSD) method discussed in this paper include: [...]

With appropriate research and development in the engineering of complex systems, the software for commercial MPP architectures can achieve lower cost through partial portability. One objective of the Massively Parallel System Design task is to address detailed level MPP software mapping and portability.

5.1: Technology independence

Technology independence presents a significant hurdle to real-time embedded MPP architectures. Figure 10 shows one approach for attaining technology independence. In general, the objective and procedures are similar to other previous works. The uniqueness lies in the details of the methodology. The method concentrates on using commercially available tools whenever possible. Many of these tools have graphical user interfaces. Graphical interfaces facilitate the use of signal flow graphs for representing real-time embedded applications. The CMPP paradigm is discussed in this section of the paper.

Because of a large and multidimensional solution space, heuristic methods provide the first pass solution. Therefore, automation and design aids would expedite a [...] passing operations. The process is slow when the user has to do a manual mapping for all the pieces (thousands), as well as run the MPP execution to decide whether the mapping works.

[...] on full scale MPP to collect dynamic performance metric data, the benchmark is collected from model simulation. The full scale model then provides estimates of the
architectural performance in terms of the previously defined metrics.

[...] metric as the initial focus, since execution time is directly related to the speed up in MPP architectures. EXEC load, EXEC bandwidth, COMM load, and COMM bandwidth characterize these modules. The host program estimates EXEC load and COMM loads for all the partitioned pieces of a specific mapping. The collected data become model parameters to annotate the performance model before simulation. Each new partition requires repetition of the load estimation and extraction process. Any automation that can be added would be desirable.

EXEC modules and COMM modules are used to build the performance model with token networks. The token network handles multiple transmitters like real network situations. Presently the model can only handle Ethernet simulation. Construction of the performance model is done in the graphics mode. The VHDL feature simplifies the replication of thousands of identical modules. The Calibrated Mapping Performance Prediction (CMPP) paradigm hides many of the details of message passing so that the designer can concentrate on the partition and allocation problems. The right environment enables replication of the modules many times. This environment reduces the problem of scaling to thousands of processors.

The CMPP paradigm discussed in this paper used the VHDL environment. Note that VHDL is not used here for hardware design; instead VHDL allows the designer to construct the structure, simulate the performance, and collect metric data. Both PCs and workstations support VHDL environments at low cost. VHDL will be available for hardware and system design for a long time. In addition, constructs of the VHDL language can replicate modules as shown in Figure 12 in a straightforward manner. VHDL generic constructs also help annotate model parameters before simulation.

The manual EXEC/COMM load estimation and extraction is a disadvantage of the CMPP paradigm. An automatic procedure would strengthen the CMPP paradigm.

The CMPP paradigm allows the partition and allocation results to be portable to different types of MPPs. Remapping is necessary due to different network bandwidths, topologies and throughput rates in different MPPs. However, the CMPP minimizes the portability effort as much as possible. The CMPP paradigm reduces the effort needed to run a real-time embedded application on different MPP architectures.

[...] element is less than that of the application module, it is possible to fit the application module into the element. If the peak FLOPS-I/O ratio of an element is greater than the application module, the partition will encounter problems. Essentially, the FLOPS-I/O ratio characterizes the computational activities relative to communication activities. With this metric, it will be easier to analyze the results of different mapping processes by examining granularity. One definition of fine grain tasks refers to small FLOPS-I/O ratios. Fine grain application modules can only be efficiently accommodated in fine grain architecture elements.

The FLOPS-I/O ratio metric makes it possible to find a common partition of an application for a set of MPPs. The common partition usually cannot achieve the best speedup and efficiency in a specific MPP, but the partition can be accommodated in a number of MPPs. Further development of the CMPP paradigm will demonstrate this situation in the future.

A collection of Sparc workstations on an Ethernet was used to demonstrate the CMPP approach during 1993, since the researchers did not have access to a commercial MPP architecture. The researchers also used a message passing development environment called EXPRESS. EXPRESS addresses the portability challenge for the CMPP paradigm.

The development consists of three parts. First, the EXEC module characterizes a piece of the execution that occurred in the architecture element of the MPP. EXEC modules represent a source that generates a load token, a feed-through that accepts input tokens and produces output tokens, or a sink that consumes a load token. The following VHDL parameters characterize EXEC modules:

INST => unique module name
Unit => 1 Kbytes
Sizeinfo => statistic size in units
Throughput-info => (#/sec) statistic throughput rate
Latency-info => statistic delay (usec.)
* Duty-cycle-info => (#/sec) statistic duty cycles
* Only relevant for source EXEC modules.

Figure 13 shows a VHDL structure for the modules. The left most block depicts a source EXEC module, and the right most one a sink EXEC module. The INST generic describes a unique name for the module in the model. The size*unit characterizes the load of execution.
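The FLOPS-I/O fitting rule discussed in this section can be sketched in a few lines. The sketch takes the rule as the text states it: a module fits when the element's peak FLOPS-I/O ratio is less than the module's ratio. It reads the truncated first sentence in parallel with the complete second one, which is an assumption. The class `Load`, the function `module_fits`, and all load figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Load:
    """Computation and I/O load of an application module or an element."""
    name: str
    mflops: float            # computation load in MFLOPS
    io_bytes_per_sec: float  # I/O load in bytes per second

    def flops_io_ratio(self) -> float:
        # FLOPS-I/O ratio as defined in the metrics list: MFLOPS over bytes/sec.
        return self.mflops / self.io_bytes_per_sec


def module_fits(element_peak: Load, module: Load) -> bool:
    """Apply the fitting rule as worded in the text.

    The text does not say what happens when the two ratios are equal;
    this sketch conservatively treats equality as not fitting.
    """
    return element_peak.flops_io_ratio() < module.flops_io_ratio()


# Hypothetical element and modules, for illustration only.
element = Load("hypothetical_node", mflops=80.0, io_bytes_per_sec=40e6)
coarse = Load("hypothetical_coarse_module", mflops=10.0, io_bytes_per_sec=1e6)
fine = Load("hypothetical_fine_module", mflops=1.0, io_bytes_per_sec=4e6)

for module in (coarse, fine):
    verdict = "fits" if module_fits(element, module) else "likely problematic"
    print(f"{module.name}: ratio={module.flops_io_ratio():.2e} -> {verdict}")
```

Running the same check against the peak ratios of several candidate MPPs is one way to look for the common partition the text mentions.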
Systems Reengineering Technology Workshop, February 8-10, 1994
The term "unit" represents the basic data size, such as bytes or Kbytes. The throughput characterizes the speed of this EXEC module. The latency feature permits a more accurate delay account. Duty cycle is relevant if the EXEC module is a source that generates periodic loads, [...]

Figure 13. A Structure of EXEC Modules, COMM Modules, and Ethernet

Figure 13 shows two COMM modules called ebiu. The COMM module can receive or transmit to or from a local port. The data transfer on the glob[...] is also bi[...]

bustimeoutinfo => 10 sec

The ebiu is in turn built from two sub modules: Local and Globalnet. The sub module structure is shown in Figure 14. The VHDL environment can build these entities, module structures, and sub module structures hierarchically. Graphics windows permit editing, checking, and simulation. The bottom level behavior of the EXEC or COMM modules is written in VHDL.

The data from the dynamic benchmarks help refine the simulation models to reduce the risk associated with the scaling process. Due to the lack of availability of a commercial MPP architecture during 1993, the Naval Postgraduate School researchers explored the CMPP paradigm using Sun Workstations connected by Ethernet.

In addition to the Ethernet modeling, two beamformers were coded and tested. One beamformer used a frequency domain algorithm, and the other a time domain algorithm. The time domain algorithm reflects the type used in the Wide Aperture Array system.

The frequency domain beamformer demonstrated the advantage of using MPP systems. The hypothetical beamformer assumed 96 sensors in the system. Beam response covered 0 to 180 degrees with 1 degree [...] Figure 17 exhibits the computational and communication loads for the frequency domain beamformer. The loads were estimated and extracted as described in the CMPP paradigm in Figure 12. These estimates represent the loads for each processor.

The two main execution modules are: the FFT module and the Vector-Matrix product module. The other modules [...]

One important feature in the CMPP paradigm involves the calibration process for EXEC/COMM model parameters. [...] figure 17 was accumulated using the Sun operating tcov command. COMM loads were estimated using EXPRESS profilers. The parallel EXPRESS environment can also provide an event profile which shows the communication activities and the execution activities of the processors.

After the analysis, the next step is to construct a partition structure in the VHDL environment that simulates the performance. A structure for the 8-node partition was developed. The objective is to be able to predict performance such as execution time shown in figure 16. Progress is ongoing and encouraging.

EXPRESS will be used to map the application to the Sun Workstation environment. Table V reveals preliminary execution time data for the WAA beamformer program on three high speed computers: the Sparc 630MP (2 processors), the Navy TAC-3 (HP 900/730), and the Cray YMP/EL. The Cray yielded the best execution time, but the TAC-3 yielded the smallest execution code size.
The TAC-3 is about 10 times slower than the Cray YMP/EL, but requires 25 times less code.

4. scalability of massively parallel architectures,
5. availability of commercial massively parallel 
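
Measured execution times of the kind reported in Table V feed directly into two metrics defined earlier in this paper, speed up and processor utilization. The following is a minimal sketch of both calculations as worded in the metrics list. The function names are hypothetical, and the timing values are invented for illustration; they are not the Table V measurements.

```python
def speed_up(single_processor_time, multi_node_time):
    """Response time on one processor divided by response time on several nodes."""
    return single_processor_time / multi_node_time


def processor_utilization(busy_times, total_time):
    """Percentage utilization: sum of per-processor busy times over N * T.

    busy_times holds one entry per processor (so N = len(busy_times)),
    and total_time is the response time T of the whole algorithm.
    """
    n = len(busy_times)
    return 100.0 * sum(busy_times) / (n * total_time)


# Hypothetical timings in seconds, for illustration only.
t_single = 12.0                      # one processor
t_parallel = 4.0                     # four nodes, first start to last stop
busy = [3.0, 2.0, 1.0, 2.0]          # time each node spent executing

print(f"speed up: {speed_up(t_single, t_parallel):.1f}")
print(f"utilization: {processor_utilization(busy, t_parallel):.1f}%")
```

Low utilization alongside modest speed up, as in this made-up case, is the pattern that would point back to the partitioning and mapping step rather than to the processors themselves.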
