Specialising Systems-on-Chip (SOCs) for a particular application is an effective way of increasing the performance achievable for a given level of energy consumption. In fact, silicon manufacture costs are low enough that small, custom, entirely digital designs, up to and including multi-core microprocessor designs, can be manufactured cheaply in short manufacturing runs. Non-recurring engineering (NRE) costs are still prohibitive due to the high level of experience required from the design engineer and the vast size of the design space. This is true even when only pre-verified Commercial Off-the-Shelf (COTS) Intellectual Property (IP) blocks are used in the SOC design. In this paper we present a novel machine-learning based method of generating an application-specific SOC design and configuration. This approach is fully automated and can generate near-optimal application-specific SOC designs within hours rather than weeks and, hence, reduce both NRE costs and time-to-market significantly. Our methodology profiles key application characteristics using simulation of a small number of test systems and machine-learning based prediction to find likely optimal system designs for a given target application. We demonstrate the effectiveness of our automated design methodology using 82 workload applications, generate SOC designs with up to 10 cores and 8 memory banks, and show that our classifier averages up to 92% of the optimal design performance across our applications.
INTRODUCTION
It is becoming economical to manufacture multi-core designs as application-specific SOC implementations, despite the initial cost of setting up manufacturing of a new design. This is due to the falling prices of manufacture in mature technology nodes, notably 90 and 65nm, and soon 45nm. These technology nodes allow for multi-core systems in just a few mm², opening the door for cheap manufacture of reasonably complex computer systems. A problem that remains is how to efficiently optimise these systems for area, power and financial return, given the complexity of the design as a whole. A major part of these designs is the Network-on-Chip (NOC) architecture used in connecting the functional blocks.
One way to circumvent this issue is to manufacture a generic, high-powered design and use it for many applications. This is suitable for systems that have many purposes (consumer-controlled devices), but less so for embedded systems as each such system will draw more power and be larger than strictly necessary for the task at hand, implying financial costs to the buyer of such devices.
To reduce these costs we wish to manufacture application-specific silicon designs that, while manufactured in relatively small batches, result in a lower total cost to the customer. These specialised designs must have low design (NRE) costs to go with the now existing low manufacturing startup costs, so that generating a new design is as cheap as possible. Our work targets this design cost, investigating methods to automate the NOC design process to reduce the costs of application-specific SOC designs.
The number of possible combinations of IP, parameters, partitioning and layout for even a single-purpose design can be very large and selecting an optimal design can be a challenging task. Additionally, while COTS building blocks are generally of high quality, there is also the possibility of latent problems in the design which may only manifest in certain system-level configurations. Such errors can easily affect the bottom line of a project through a variety of mechanisms and should be avoided.
Thus, to ultimately save money we would like to be able to quickly, and automatically, determine if a given system configuration will work as intended and if there is a particularly efficient system for a given application. We are attempting to solve this problem using machine-learning techniques over the design space of possible system configurations.
The rest of this paper is organised as follows. We elucidate our goals in section 2. We discuss our approach to address these problems in section 3. We present details of our evaluation and the results of our experiments in section 4. Other, related research is presented in section 5, and we conclude the paper with section 6.
GOALS
First we aim to find which system configurations produce working systems, so as to avoid spending resources (silicon, synthesis time, simulation time) on designs that will predictably fail. Even if all individual pieces of IP in a SoC are verified separately, there are usually still integration issues due to vague specifications or other problems. Usually these issues are systematic, or can easily be made systematic, such that some combinations of blocks and connections work while others do not. We want to know without lengthy system simulation whether a given combination of blocks will work so as to save simulation and design time.
Secondly, we aim to predict (pre-simulation) whether a given design will fit within a given silicon, power, or monetary budget, so we can avoid spending resources on designs that do not meet these goals. For this problem we only concern ourselves with SoC designs that pass the first predictor, i.e. designs that have been predicted to work. Meeting the budgetary criteria is a necessity for designing commercially successful SoC systems.
Thirdly, having efficiently solved the aforementioned problems, we aim to use the predictive power this gives us to search for near-optimal SoC designs. We use machine learning methods to predict particularly appealing points in the SoC design space for particular applications, and use the previous two points to qualify them. This would enable fast, automated, design space evaluation to find custom, efficient designs for a new application.
Motivating Example
We evaluated our design space by extracting the energy and runtime figures of a single workload over 71 hardware designs. These are plotted in Figure 1. It can clearly be seen that the range of runtime and energy consumption, even for the same application, varies by an order of magnitude. Selecting a random design for this application is highly unlikely to result in optimal performance, and finding the optimal design even from the small set of working designs would be expensive.
This spread of performance makes the optimal design point hard to find. This is especially true when the design includes a few tens of separate logic blocks, making the design space very large. Using our approach instead drastically cuts down on the number of designs that need to be implemented, greatly reducing the time needed to find an optimal design for a new application.
METHODOLOGY
To meet our aims, we need to investigate and take measurements on a real, parameterisable, embedded system. Such systems are rare in the consumer market as most devices have already been finalised and there is no configurability left. We therefore implemented our own embedded system to achieve the configurability we needed. Our design consists of up to 16 embedded processor cores, linked to up to 8 memory banks and IO devices. We kept the interconnect topology options, core counts and memory bank counts configurable so that we could vary these options on a per-instance basis.
This design used off-the-shelf processor elements, available to us as pre-parameterised Verilog source. To this we added a custom-written, parameterisable AXI-based interconnect [2], based on a butterfly network topology. We further used abstract Verilog RAM models, which in an FPGA fabric translate to on-die memory blocks. We also implemented some in-FPGA IO devices as specialised Verilog, to connect these devices into the AXI interconnect fabric.
We synthesise the instantiated system for an FPGA, thus enabling both functional correctness checks and fast simulation in one step. Straight software simulation of a complete system is infeasible due to the demand for complete program executions, which would take an unreasonably long time. A downside of in-FPGA simulation is that we cannot gather arbitrary switching statistics without prior counter insertion. These counters are non-invasive to the logic function of the system, and have a very small area overhead. What we cannot obtain from such a simulation, short of low-level layout-aware device simulation, is accurate power figures.
After workload completion the inserted design counters are read out using a separate non-invasive scanchain. We thus obtain information on the performance of a particular hardware design for a particular program. By using these counters we derive the machine learning features we need, such as average cycles per IO request, average transactions per interconnect switch, and others. Importantly, we derive the two features we wish to optimise for: total runtime and total dynamic energy.
Total runtime is the time, in seconds, it takes for an entire workload to complete. Total dynamic energy is taken as the number of possible wire switchings in the AXI interconnect. The energy cost in the processing cores, IO devices and RAM blocks is static for a given workload, as they do the same work, and IO delays have a minimal impact due to clock gating. Thus the only difference in the dynamic energy is the actual switching in the interconnect.
From a machine-learning perspective our first question is a classification problem, with two classes (working and not working). Hence we tried the standard classification algorithms k-Nearest Neighbours and Support Vector Machines (with linear and polynomial kernels).
In k-NN classification, we pick a small integer k, and for each design to be classified we identify the k closest designs to it in feature-space. The classification of the new point is predicted to be the most common classification among these k points. In SVM classification we use the training data to construct hyperplanes that partition the configuration space (possibly after transforming said space by applying a "kernel" function to make the regions more regular and reduce dimension), then classify a new point by finding out which side of the hyperplanes it lies on. Both algorithms may also choose not to return a classification for a given feature vector, which indicates that the confidence in the prediction is too low to render a classification. These algorithms are described in standard machine-learning textbooks such as [12].
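The k-NN rule with abstention described above can be sketched as follows. This is a minimal illustration only: the function name `knn_classify`, the Euclidean distance, and the `min_agreement` confidence rule are assumptions made for the example, as the text does not state how confidence is measured.

```python
import math
from collections import Counter

def knn_classify(train, query, k=5, min_agreement=0.6):
    """Majority vote among the k nearest training points.

    train is a list of (feature_vector, label) pairs. Returns None when the
    winning label's share of the k votes is below min_agreement (an assumed
    confidence rule, not one taken from the paper).
    """
    # Sort training points by Euclidean distance to the query, keep k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    if count / len(nearest) < min_agreement:
        return None  # abstain: confidence too low to classify
    return label

# Toy two-dimensional feature vectors (illustrative values only).
train = [
    ((0.0, 0.0), "fail"), ((0.1, 0.2), "fail"), ((0.2, 0.1), "fail"),
    ((1.0, 1.0), "work"), ((0.9, 1.1), "work"), ((1.1, 0.9), "work"),
]
print(knn_classify(train, (0.05, 0.05), k=3))                      # fail
print(knn_classify(train, (0.55, 0.55), k=4, min_agreement=0.75))  # None
```

In the second call the four nearest neighbours split two against two, so the sketch abstains rather than guessing.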
Our second question is an optimisation problem, but standard optimisation approaches like genetic algorithms and simulated annealing would be prohibitively expensive, since they would require repeated synthesis and evaluation of newly-generated NoC designs. Hence we approximate it as another classification problem, where the classes to be selected from are known-good designs. We used the same feature vector as before, evaluated on a single reference NoC design. We considered the same range of classification algorithms for this task. The full workflow of our approach is shown in Figure 2. We train the machine learning model with the data generated in the thick arrow flow, and use the thin arrow flow for a previously unknown application. The steps in the thin arrow flow are not time consuming (some minutes to test a new application on a known working design, seconds to run the machine learning model) compared to the task of generating known working designs, and so overall the process takes far less time. The generated outputs are prediction accuracy data, which we present in Section 4, and prediction values, which are simple design identifiers.
EMPIRICAL EVALUATION
This section discusses first the technical details of the experiments carried out to address the problem, and then discusses the results.
Evaluation Methodology
Using the full range of parameters of our embedded system we arrive at 510,000 possible different designs. Some of these designs are far larger and more complex than necessary, or otherwise violate device constraints. We used synthesis for our chosen FPGA device as the design constraint; if a design worked under the FPGA synthesis constraints, it was regarded as a valid design. It would be conceptually simple to substitute a set silicon synthesis flow instead of the FPGA synthesis path, which would then implicitly use silicon manufacture constraints instead.
From our experimental design space we randomly chose 512 designs, covering 0.1% of the design space. By selecting randomly we ensured that this sample is indicative of the design space as a whole; we are treating the best and worst designs of these 512 as indicative of the best and worst in the entire space. We synthesised these 512 designs for the XC6VLX240T device, and recorded the results of the synthesis processes. Time taken for a completed synthesis of a single design is on the order of a few hours but is completely parallelizable. Designs that were too large or did not meet timing constraints in this device were regarded as faulty. Similarly, designs that completed synthesis within these parameters but which did not complete software tests were also regarded as faulty, and the reason noted. In total, out of 512 selected designs, 71 (13.8%) passed tests and were regarded as fully working.
We used statically compiled workloads composed of up to 9 benchmarks. The benchmarks we used to build our workloads were VITERBI, FFT, FBITAL, CONVEN and AUTCOR benchmarks from EEMBC-1, the COREMARK benchmark [4] and a specially composed IO-heavy application. These combined workloads were compiled together with a bare-metal proto-OS that handles the required system calls and functions to enable correct functionality of our benchmarks. The proto-OS code also handled startup and processor identification, enabling the same binary image to run on systems with different numbers of processors. Scheduling was handled on a task-by-task basis, so that each processor in a system got a static number of tasks to complete. Each workload binary was thus a self-hosting statically-scheduled OS image that was ready to run on a given design; the scheduling was considered as part of the input to our process. We generated two sets of workloads, one which was randomly scheduled and one that was statically scheduled with the longer-running tasks first. As there are many possible workloads that can be constructed from combinations of the chosen benchmarks, we randomly selected 35 unscheduled and 47 scheduled workloads from the total set to use for learning. It was expected that the scheduled workloads would provide a somewhat more even design space, so that machine learning would perform better, but little evidence for this was found in our results.
Counters                              Description
5× number of cores                    Counters for IO cycles, IO operations, committed instructions, etc.
10× number of interconnect switches   Traffic counters, counting cycles in which AXI channels are active.
Table 1: Summary of inserted counters
A single workload takes up to three minutes to complete. Table 1 summarises the counters inserted into our designs. From these counters we derive the values we require that we cannot measure directly, such as the global number of AXI switchings for a given channel. Overall, gathering the complete set of performance data for 71 working designs takes up 6 days of CPU time, but this can be parallelized trivially.
Total runtime is set as the time, in seconds, it takes for an entire workload to complete. Total dynamic energy is taken as the number of possible wire switchings in the AXI interconnect. We reason that the energy cost in the processing cores is static for a given workload, as it does the same work, and IO delays have a minimal impact, due to clock gating. The same argument holds for IO devices and RAM blocks; thus the only difference in the dynamic energy is the actual switching in the interconnect. Thus we are measuring direct change in dynamic energy with changing hardware design, rather than measuring total dynamic energy.
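As a rough illustration of how the two optimisation targets could be derived from the read-out counters, consider the sketch below. The function name `derive_metrics`, the counter layout (one list of channel-activity counts per interconnect switch), and the 100 MHz clock are all assumptions for the example; the energy-delay product is computed here simply as switching activity multiplied by runtime.

```python
def derive_metrics(total_cycles, clock_hz, switch_counters):
    """Derive runtime, an interconnect switching proxy, and their product.

    total_cycles    -- cycles from workload start to completion (assumed counter)
    clock_hz        -- system clock frequency in Hz (assumed known)
    switch_counters -- one list of channel-activity counts per switch
    """
    runtime_s = total_cycles / clock_hz
    # Dynamic-energy proxy: total activity summed over every switch and channel.
    switching = sum(sum(channels) for channels in switch_counters)
    return {
        "runtime_s": runtime_s,
        "switching": switching,
        "ed": switching * runtime_s,  # energy-delay product proxy
    }

# Illustrative values only: two switches, short counter lists for brevity.
metrics = derive_metrics(6_000_000_000, 100_000_000, [[10, 20, 30], [5, 5]])
print(metrics)
```

Because the core, IO and RAM energy is treated as constant per workload, only the switching term changes between designs in this sketch.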
Results
In this section we discuss our empirical results and findings in detail.
Predicting optimal designs
We predicted which designs were optimal for a given, new, application, and evaluated our results against the known best design for that application, comparing application performance against optimal application performance. Figure 3 shows the resultant performance on the predicted hardware design for 82 different composite applications, when predicting for energy consumption. In some cases a relative performance of 0% compared to the optimal design was recorded; these were the cases where our classifier could not predict a design with high enough confidence. Across our 82 applications, our best 5-NN predictor achieved on average 92% of the optimal energy performance. Figures 4 and 5 similarly show the performance of the predicted designs relative to the optimal design when predicting for runtime and energy-delay product (ED) respectively. We achieve average performance across our applications of 88% and 80% for these metrics. Again, a score of 0% indicates that our classifier did not render a classification with sufficient confidence to use.
By looking at which benchmarks were present in which workloads, and how these workloads fared in terms of predictions, we were able to determine that some applications are more difficult to predict. In particular, workloads including the COREMARK and VITERBI benchmarks fared badly in these predictions, as did workloads with a small number of benchmarks. The dips towards the ends of Figures 4 and 5 are all such small workloads including COREMARK and VITERBI. The reason these are hard to predict for may be that these applications do not have significant interconnect traffic to base a determination on.
Figure 7: Distribution of best designs from workloads, for Runtime, Energy and Energy-Delay product. Ordered by design ID and thus in approximate order of overall complexity.
We used the gathered workload indicators to find which design was the best reference design for our classifier. The classifier was trained to predict which design would be the best for a new, previously unseen, set of workload indicators. For each candidate reference design we used an N-fold (leave-one-out) training scheme; for each workload w the classifier was trained with the performance of all workloads except w, and then used to predict the best design for w. Our classifiers were evaluated using their aggregate accuracy over the workloads and the average performance of the designs they predicted for each workload.
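The hold-one-workload-out evaluation described above can be sketched as below. The record layout (`features`, `best`, `score`), the helper names, and the use of a 1-nearest-neighbour predictor are assumptions chosen to keep the example short; an abstaining classifier is scored as 0%, matching the convention used in the results.

```python
import math

def nearest_neighbour(train, features):
    """Hypothetical stand-in predictor: best design of the closest workload."""
    closest = min(train, key=lambda w: math.dist(w["features"], features))
    return closest["best"]

def leave_one_out(workloads, predict):
    """Return (accuracy, mean relative performance) over all workloads."""
    hits, relative = 0, []
    for i, held_out in enumerate(workloads):
        train = workloads[:i] + workloads[i + 1:]  # every workload except w
        guess = predict(train, held_out["features"])
        hits += guess == held_out["best"]
        # score maps design id to performance relative to the optimum (0..1);
        # an abstention (None) counts as 0.
        relative.append(0.0 if guess is None else held_out["score"][guess])
    n = len(workloads)
    return hits / n, sum(relative) / n

# Toy data: one feature per workload, two candidate designs (ids 1 and 2).
workloads = [
    {"features": (0.0,), "best": 1, "score": {1: 1.0, 2: 0.5}},
    {"features": (0.1,), "best": 1, "score": {1: 1.0, 2: 0.6}},
    {"features": (1.0,), "best": 2, "score": {1: 0.7, 2: 1.0}},
    {"features": (0.9,), "best": 2, "score": {1: 0.8, 2: 1.0}},
    {"features": (0.4,), "best": 2, "score": {1: 0.9, 2: 1.0}},
]
accuracy, mean_relative = leave_one_out(workloads, nearest_neighbour)
print(accuracy, mean_relative)
```

The last toy workload is mispredicted yet still reaches 90% of its optimum, which illustrates why accuracy and average relative performance are reported separately.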
Classifier performance varied greatly according to which reference design was chosen, as shown in Figure 6. By trying every known working design as a candidate reference design, we were able to achieve performance on average 92% that of the best known design while optimising for energy consumption, 88% that of the best known design while optimising for runtime, and 80% that of the best known design while optimising for energy-delay product.
We aim to predict which design is best for a new workload; to do so we first determined which of our 71 working designs was best, for runtime or interconnect switching activity, for each of our existing workloads. Figure 7 shows the distributions of the best designs (design IDs) against the number of workloads which are optimal on each design, using either switching energy, runtime, or energy-delay product as optimisation metrics. From this figure it is clear that the optimal design for a given application depends on whether we aim for energy efficiency, optimal runtime, or a combination of both, and that a design is unlikely to be optimal for all three metrics for any given application. This is further evidence that the interconnect design space is highly complex.
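Determining the best design per workload and counting how often each design wins can be sketched as follows. The data layout, the names `METRICS` and `best_design_histogram`, and the numbers are illustrative assumptions, not measurements from our experiments; lower values are taken to be better for all three metrics.

```python
from collections import Counter

# Each metric maps a per-design measurement record to a value to minimise.
METRICS = {
    "runtime": lambda m: m["runtime"],
    "energy": lambda m: m["energy"],
    "ed": lambda m: m["energy"] * m["runtime"],  # energy-delay product
}

def best_design_histogram(results, metric):
    """Count, per design id, the workloads for which that design is optimal.

    results maps workload name -> {design id -> measurement record}.
    """
    score = METRICS[metric]
    winners = (
        min(designs, key=lambda d: score(designs[d]))
        for designs in results.values()
    )
    return Counter(winners)

# Toy measurements for two workloads on two designs (ids 3 and 7).
results = {
    "w1": {3: {"runtime": 10.0, "energy": 500}, 7: {"runtime": 12.0, "energy": 300}},
    "w2": {3: {"runtime": 8.0, "energy": 400}, 7: {"runtime": 9.0, "energy": 450}},
}
for metric in METRICS:
    print(metric, dict(best_design_histogram(results, metric)))
```

Even in this tiny example the runtime-optimal design for the first workload differs from its energy-optimal one, mirroring the metric dependence discussed above.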
Predicting working designs
For predicting which designs work, we took the set of synthesised designs and randomly selected 12 of them for validation, and used the other 500 for training. We ran 1000 such training / validation cycles to average out any problems with the randomisation.
As a baseline we used a guessing predictor that predicts randomly, based solely on the relative partitioning of the training data set.
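A repeated random hold-out harness of this kind might look like the sketch below. The function names, the fixed seed, the toy data set, and the trivial always-reject predictor are assumptions for illustration; abstaining predictors are not handled, and any classifier exposing the same `fit_predict(train, features)` shape could be plugged in.

```python
import random

def repeated_holdout(samples, fit_predict, n_val=12, rounds=1000, seed=0):
    """Average accuracy, false-positive and false-negative rates.

    samples is a list of (features, works) pairs with works a bool;
    fit_predict(train, features) returns the predicted bool.
    """
    rng = random.Random(seed)
    correct = false_pos = false_neg = total = 0
    for _ in range(rounds):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        validation, train = shuffled[:n_val], shuffled[n_val:]
        for features, works in validation:
            guess = fit_predict(train, features)
            total += 1
            if guess == works:
                correct += 1
            elif guess:
                false_pos += 1  # predicted working, actually faulty
            else:
                false_neg += 1  # predicted faulty, actually working
    return correct / total, false_pos / total, false_neg / total

def always_faulty(train, features):
    """Trivial predictor used only to exercise the harness."""
    return False

# Toy data: 100 designs, every fifth one marked as working.
samples = [((float(i),), i % 5 == 0) for i in range(100)]
print(repeated_holdout(samples, always_faulty, n_val=12, rounds=50))
```

A predictor that never clears a design has no false positives by construction, so all of its errors appear as false negatives.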
This predictor managed an accuracy of 83% over 1000 trials, with 3.2% false positives and 13.7% false negatives. This is indicative of the extreme slant of the search space: we are trying to find the small portion of working designs among many non-working designs.
Using 7-NN to predict if a given design will synthesise correctly achieves an accuracy of 85.6%, with 0.6% false positives and 13.7% false negatives. This is an improvement on the baseline, particularly in terms of false positives. Using this predictor has the advantage that it will very rarely clear a design that will not work.
A linear SVM algorithm however manages an overall accuracy of 87.8%, with 5.3% false positives and 7% false negatives. This is 27% fewer mispredictions compared to our baseline, but at the cost of a higher false positive rate.
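The relative reduction in mispredictions can be checked directly from the false positive and false negative rates reported above. The helper names in this short sketch are illustrative; the percentages are the ones quoted in the text.

```python
def misprediction_rate(false_pos_pct, false_neg_pct):
    """Total misprediction rate, in percent of all predictions."""
    return false_pos_pct + false_neg_pct

rates = {
    "baseline": misprediction_rate(3.2, 13.7),
    "7-NN": misprediction_rate(0.6, 13.7),
    "linear SVM": misprediction_rate(5.3, 7.0),
}

def relative_reduction(name):
    """Fraction of baseline mispredictions removed by the named predictor."""
    return 1 - rates[name] / rates["baseline"]

for name in ("7-NN", "linear SVM"):
    print(f"{name}: {rates[name]:.1f}% mispredicted, "
          f"{100 * relative_reduction(name):.0f}% fewer than baseline")
```

The linear SVM removes roughly 27% of the baseline mispredictions and 7-NN roughly 15%, with 7-NN keeping the lower false positive rate.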
RELATED WORK
Using machine-learning techniques to optimise MPSoC systems is a new approach and little literature exists. Work on using these techniques for single-core systems exists [3], but these methods are not directly applicable to the more complex interconnect space.
Indirect tree networks form a specific subtype of NoC, which maps well to the embedded system-on-chip design. This and other general NoC concepts are discussed in [10] .
Interconnect partitioning and optimisation has been attempted before by analytical methods coupled with limited design-space exploration [11, 8] . We feel that these methods are not sufficient when it comes to power-aware design of application specific MPSoC systems due to their high abstraction level.
On-chip interconnects and network-on-chip architectures are largely treated as synonymous, and much effort has gone into the latter [7, 6, 1]. These approaches are however largely program-agnostic, being architecture driven, and furthermore concentrate on processor-to-processor communication rather than system throughput and performance.
Further regarding network-on-chip approaches, [5] presents and discusses regular interconnect networks using synthesisable constructs, but the authors limit themselves to mesh networks and generally investigate only the effect of the network topology. No mention is made of experiments with real programs running on the system.
[9] discusses a network topology very similar to the one implemented in this paper; the authors conclude from their limited experiments that it is a feasible topology, which has trouble competing with mesh networks on smaller devices but scales better. However, the evaluation here is under the conditions of an MPSoC and not an embedded device, and the conclusions may also be different when application-specific information is taken into account.
CONCLUSIONS AND FURTHER WORK
We have shown that using SVM and k-NN classifiers, we can tackle the problem of automated application-specific SOC design in a time-efficient manner. SVM, in particular, is suited for predicting working designs from gross system configuration information, including determining suitability and conformity to design constraints.
k-NN on the other hand is better at classifying the optimal design for a previously unseen workload. Using a k-NN classifier we achieve on average up to 92% of the optimal application performance across our 82 applications. Our classifiers, once trained, take a fraction of a second to classify a new set of features, enabling large savings of time in optimising designs for new applications.
Using this method of finding optimal application-specific SOC and NOC designs is very efficient for a new application on a design space that has been previously explored even lightly. In future we plan to improve our ability to find optimal design points, using a combination of classification, interpolation and standard optimisation techniques such as genetic algorithms. Due to the prohibitively high time requirements of synthesising and testing new designs this has not been possible so far.
