Bytecode folding is an effective technique for speeding up execution in Java virtual machines. This paper investigates a hardware implementation of the aforementioned technique on BlueJEP, a Java embedded processor. Since BlueJEP is a micro-programmed stack machine, we adopt a microinstruction oriented approach, folding up to four microinstructions (corresponding to up to four bytecodes, on occasion). A variety of processor versions for different subsets of folding patterns are implemented, simulated and synthesized on a Xilinx FPGA. The measurements and results show that, although the number of execution cycles is reduced, the critical path increase leads to a lower performance. Taking into account the device area, we conclude that for our case, adding a second processor may be preferred over hardware folding. In general, we observe that folding efficiency may only be evaluated properly on a real implementation, rather than using theoretical estimates, due to the increased complexity of the hardware.
INTRODUCTION
With the increasing popularity of Java as a programming environment, a larger range of hardware solutions have appeared, dedicated to raise the performance or lower the cost of Java powered systems. Among these, Java processors [7-9, 14, 15, 19] implement a considerable part of the Java Virtual Machine (JVM) in hardware. We are focusing in this paper on BlueJEP [4, 5], a micro-programmed Java embedded processor entirely specified in Bluespec System Verilog.
In Java processors, bytecodes are decoded and translated into native instructions or control signals on the fly, at runtime. By contrast, compilation based approaches, be it Just in Time (JiT) or Ahead of Time (AoT), carry out translation and optimization of a sequence of bytecodes before the system starts executing it. These have their own problems, as JiT requires resource hungry and unpredictable runtime compilation of Java, while AoT requires that all the code is available beforehand. For embedded systems, where hardware resources are limited and real-time predictability is often a requirement, native Java processors are a desirable choice. Nonetheless, compilation based approaches have the advantage of producing more efficient code, thus offering higher performance, since compiler optimizations are an integral part of the process.
Bytecode folding is a technique that has been suggested to improve the performance of Java processors [17]. In general, stack machines (such as the JVM) frequently move data in and out of the stack, as operations can be carried out only on the top locations of the stack. Multi-address machines are able to directly access a range of registers for any operation, avoiding push/pop-like operations. In fact, many of the Java processors are micro-programmed, building on multi-address cores (usually RISC) [8, 15]. Sequences of bytecodes may thus be translated (folded) into more efficient multi-address instructions that bypass the stack. Identifying the optimal length and type of sequences that can be folded (referred to as folding schemes) is a complex problem [16, 18]. The majority of Java processors do not implement bytecode folding, often to avoid predictability or resource issues. For PicoJava-II [15], instruction folding turns out to be the critical path of the processor pipeline [11]. This indicates that a theoretical estimate of the percentage of folded instructions is not enough to make a case for implementing folding. As we point out in this paper, a full simulation and synthesis process is required to get an accurate figure of the performance gains introduced by a folding scheme or by folding in general.
The remainder of the paper is organized as follows. Section 2 briefly describes the original BlueJEP architecture. Section 3 focuses on folding and examines the theoretical gain of different solutions for our case. Section 4 is dedicated to the actual hardware folding architecture and implementation, while section 5 presents the experimental setup and the results. Finally, conclusions are drawn in section 6.
ORIGINAL ARCHITECTURE
The Bluespec Java Embedded Processor (BlueJEP) is based on the Java Optimized Processor (JOP, [14]), but differs from it in several ways. BlueJEP is described in a relatively new hardware description language, the Bluespec extension of System Verilog (BSV, [1]). Specifications in BSV are at a higher level of abstraction than typical RTL Verilog/VHDL designs: they employ rules to describe functionality in a declarative way, are strongly typed and can be highly generic [6]. BSV has been successful in describing and exploring the architecture of various complex processors [2, 3]. Additionally, BlueJEP has a more complex pipeline than JOP (one additional stage, operand forwarding and bypass capabilities), a slightly different micro-instruction set and implements a different bus interface. Nevertheless, the execution model, memory organization and software tools are identical between the two processors. We briefly present the BlueJEP architecture next. An overview is shown in Figure 1, and a more detailed description can be found in [5].
BlueJEP is a micro-programmed pipelined processor, able to execute bytecodes either as single micro-instructions, as micro-programs or even as Java code, depending on their complexity. Its pipeline consists of six stages as follows:
5. The Execute stage carries out the actual operation, and sends the result to the last stage.
6. In the Write-back stage the result of a data moving instruction or operation will be stored in either the register file or on the stack.
Branches are speculatively executed as not taken in the BlueJEP pipeline. Faulty predictions cause a flush and a rollback to the state available just before the Write-back stage. Different specialized bytecode caches (as in JOP) can be used with the pipeline, the simplest being a single method cache that is filled on every method invocation and return [12]. More advanced method and data caches are likely to increase the performance of the processor, also influencing the folding efficiency. For an idea of the possible performance gains through caching, please consult [10].
FOLDING IN THEORY
One of the main drawbacks of a stack based architecture is that data is frequently shuffled in and out of the stack. For example, when adding two variables, the processor needs to fetch the values from their respective memory locations and push these onto the stack (ld, ld), perform the addition on the top two stack values and save the result back in the top of the stack (add). Finally, the top of the stack must be saved into the destination memory location (st). However, these four instructions could be folded into a single instruction on a three-address machine, thus saving valuable clock cycles and memory accesses.
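The ld, ld, add, st example can be sketched in a few lines of Python. This is only an illustration, not BlueJEP code: the instruction names come from the text, while the function names, the dictionary used as memory and the one-cycle-per-instruction cost are assumptions made for the sketch.

```python
# Minimal sketch: a stack machine versus a folded three-address operation.
# The opcodes (ld, add, st) follow the text; all other names are hypothetical.

def run_stack(program, memory):
    """Execute (opcode, operand) pairs on a stack machine; return the cycle count."""
    stack = []
    cycles = 0
    for op, arg in program:
        if op == "ld":          # push a variable onto the stack
            stack.append(memory[arg])
        elif op == "add":       # pop two values, push their sum
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "st":        # pop the top of the stack into a variable
            memory[arg] = stack.pop()
        else:
            raise ValueError(f"unknown opcode: {op}")
        cycles += 1             # assumption: one cycle per instruction
    return cycles


def run_folded(dst, src1, src2, memory):
    """Same computation as one three-address add; one cycle under the same assumption."""
    memory[dst] = memory[src1] + memory[src2]
    return 1


mem_stack = {"a": 2, "b": 3, "c": 0}
mem_folded = dict(mem_stack)
stack_cycles = run_stack(
    [("ld", "a"), ("ld", "b"), ("add", None), ("st", "c")], mem_stack
)
folded_cycles = run_folded("c", "a", "b", mem_folded)
```

Both variants leave the same value in `c`; the stack version spends four instruction slots on it, the folded version one.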
Bytecode vs. Micro-Instruction Folding
Typically, folding is carried out at bytecode level in the majority of the published approaches or processors [15]. Since BlueJEP implements Java bytecodes as micro-code of a smaller set of stack-based operations (see Listing 1 for some examples), we have the choice of folding at either bytecode level or micro-instruction level. At bytecode level, this would mean having specialized micro-programs for the folded instructions, and no folding would occur within a micro-program. At micro-instruction level, folding can be carried out across micro-programs, and even reaches bytecode level, since simple bytecodes such as load, store and arithmetic operations are implemented as single micro-instructions. We therefore chose to implement folding at the micro-instruction level.
Folding Model
In Sun's PicoJava [15] processor, up to four bytecodes can be folded according to pre-defined folding patterns which the executing bytecode stream is matched against. This is usually referred to as fixed-folding-pattern matching. A dynamic-folding-pattern matching was introduced in [17] , namely the Producer, Operator, Consumer (POC) folding model, which has been later refined in [16, 18] . In the dynamic model, the result of a folding can be again folded with other neighboring instructions. A comparison between the different types of folding models can be found in [16] .
In this paper we adopt a fixed-folding-pattern approach, similar to the one implemented in PicoJava. A description of the instruction classes follows (note that although the names and classes are partially similar to the POC model, we do not implement the POC approach):
• Producer produces one word of data and pushes this onto the stack. For example: iconst_0 which pushes the constant zero onto the stack.
• Operation pops the two top entries of the stack, performs an operation on them and pushes the result back onto the stack. For example: iadd which pops the two top entries of the stack and pushes the sum of these two integers onto the stack.
• Consumer pops the top of the stack and can optionally store the data into the memory. For example: istore_0 which pops the top entry of the stack and stores this in local variable 0.
• Special gathers all other instructions, that in principle will not be folded, such as branches. Note that in the dynamic POC model, branches are seen as operations and may be folded, while also breaking the folding sequence.
In theory it is possible to construct folding patterns that are practically unlimited in length, but since the resources available on the BlueJEP core are limited, we must restrict ourselves to specific patterns. In particular, we can only carry out two memory reads, one write and one operation in the same clock cycle. This limits our longest folding pattern to four micro-instructions, a sequence of producer, producer, operation, consumer (ppoc). Table 1 gathers the different folding patterns used on BlueJEP. 
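As a rough illustration of fixed-folding-pattern matching, the sketch below walks a stream of instruction classes and greedily groups it into folded issue slots. The class letters follow the list above (p, o, c, plus s for special). Only ppoc is named in the text; the other entries of `PATTERNS` are hypothetical stand-ins, since the contents of Table 1 are not reproduced here, and the greedy longest-first policy is likewise an assumption.

```python
# Sketch of fixed-folding-pattern matching over a string of instruction classes:
# 'p' producer, 'o' operation, 'c' consumer, 's' special (never folded).
# PATTERNS is a hypothetical set, tried longest first; only "ppoc" is from the text.
PATTERNS = ("ppoc", "poc", "pc")


def fold(classes, patterns=PATTERNS):
    """Greedily split a class string into folded groups and unfolded singles."""
    groups = []
    i = 0
    while i < len(classes):
        for pat in patterns:
            if classes.startswith(pat, i):
                groups.append(pat)          # whole pattern issues as one slot
                i += len(pat)
                break
        else:
            groups.append(classes[i])       # no pattern matched: issue unfolded
            i += 1
    return groups


trace = "ppocspcpoc"
groups = fold(trace)
```

For this ten-instruction trace the sketch produces four issue slots; the special instruction is issued on its own and ends the pattern preceding it.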
Theoretical Gain
The amount of foldable micro-instructions depends mainly on the executed Java program, which dictates the sequence of executed micro-instructions.
While benchmarks for small embedded Java processors are available (e.g. [13]), to estimate the folding gain we used an application that we considered to be more common for a Java environment. In particular, we use a garbage collector (stop-the-world, mark-compact as described in [4]) alongside a simple mutator that allocates/discards objects in linked lists.
To get some estimates on the possible gain from folding, we used a micro-instruction level trace of the execution of the test application, running on an unmodified version of BlueJEP. Obtaining the trace was relatively easy, since the BSV compiler can generate an executable from the BSV specification, and the BlueJEP code was instrumented to output the sequence of executed micro-instructions along with other relevant information.
There are two important figures that should give us an idea about the achievable gain through folding. The first one relates to the number of micro-instructions that are eliminated, and is the ratio between the number of initial and folded micro-instructions. This estimate, which we refer to as the theoretical gain, assumes that there are no data dependencies, that micro-instructions and data are always available without delay, and that we can issue one micro-instruction every clock cycle. Nevertheless, due to bus accesses that stall the pipeline and cache re-fills due to method calls and returns, BlueJEP can never average one micro-instruction every clock cycle. This means that during certain clock cycles the pipeline is stalled, due to data dependencies, bus accesses, etc. Keeping the number of idle clock cycles invariant between the initial and folded execution, we can factor this in to obtain a more realistic estimate of the speed-up, the adjusted gain.
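One possible reading of the two figures is sketched below. The formulas are an interpretation of the prose, not taken from the paper: the theoretical gain is assumed to be the initial count divided by the count remaining after folding, and the adjusted gain adds the same number of idle cycles to both sides. The function names and the example numbers are hypothetical.

```python
def theoretical_gain(initial_instr, folded_instr):
    """Ratio of micro-instructions issued before folding to those issued after."""
    return initial_instr / folded_instr


def adjusted_gain(initial_instr, folded_instr, idle_cycles):
    """Same ratio in clock cycles, with the idle (stall) cycles kept invariant."""
    return (initial_instr + idle_cycles) / (folded_instr + idle_cycles)


# Hypothetical numbers, chosen only to show the effect of the stall cycles.
initial, folded, idle = 1000, 800, 600
t_gain = theoretical_gain(initial, folded)      # 1000 / 800
a_gain = adjusted_gain(initial, folded, idle)   # 1600 / 1400
```

Because the idle cycles appear in both numerator and denominator, the adjusted gain is always closer to 1 than the theoretical gain whenever the pipeline stalls at all.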
We estimated these two different gains for different combinations of folding patterns, and give the most promising ones in Table 2. The most promising combination results when all folding patterns are used. However, implementing all the combinations in hardware can cost valuable device area and may result in a lower clock frequency. Using fewer folding patterns seems to lead to a gain similar to other larger/different combinations. Therefore, from the device area and clock frequency point of view, it makes sense to consider other combinations of folding patterns, besides the solution that implements all of them.
HARDWARE IMPLEMENTATION
While the actual folding would only require changes to the decode stage of the BlueJEP pipeline (stage 3 in Figure 1), more changes are needed in order to feed the folding mechanism. For instance, in the original architecture the fetch micro-instruction stage (stage 2) delivers at most one micro-instruction to the decode stage, yet folding can consume four micro-instructions at once. Therefore a buffer that can provide four (or fewer) micro-instructions at once is required at the input of the decode stage. The changes made to the architecture are detailed next, in the order they were introduced during the implementation.
Multiple Decode FIFOs
The standard FIFOs in the Bluespec library only support at most one dequeue and one enqueue per clock cycle. Therefore we use multiple parallel FIFOs, such that the decode stage is able to dequeue more than one micro-instruction per clock cycle. The FIFOs are fed as a circular buffer, each queueing every fourth micro-instruction in the sequence. In the best case, stage 2 can fetch and enqueue four micro-instructions at the same time as stage 3 dequeues and folds four micro-instructions.
The standard FIFOs are fairly easy to use in Bluespec System Verilog, since operations on them automatically generate the guards needed for the correct behavior. For example, rules containing an enqueue operation will only fire if the FIFO is not full, regardless of the location of the enqueue operation. However, when using multiple FIFOs of this kind in parallel, a single rule for handling them would block if any of the FIFOs are full, prohibiting any operations on the rest of the FIFOs. In particular, the fetch microinstruction stage (stage 2) would stall if any of the FIFOs going to the decode stage (stage 3) were full. Writing our own FIFOs, with slightly different guards and precedence of operations solved this problem.
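The behaviour of such a bank of parallel FIFOs can be modelled in software. The sketch below is a behavioural model in Python, not the BSV implementation: the class name `FifoBank`, its methods and the depth of two entries per FIFO are assumptions. It shows the two properties described above: items are distributed round-robin, and a full FIFO only stops the enqueue at that point instead of blocking the whole operation.

```python
from collections import deque


class FifoBank:
    """N bounded FIFOs filled round-robin, so each FIFO holds every N-th item."""

    def __init__(self, width=4, depth=2):
        self.fifos = [deque() for _ in range(width)]
        self.depth = depth
        self.head = 0   # index of the FIFO holding the oldest item
        self.tail = 0   # index of the FIFO receiving the next item

    def enqueue(self, items):
        """Accept as many items as fit this cycle (at most one per FIFO)."""
        accepted = 0
        for item in items[:len(self.fifos)]:
            fifo = self.fifos[self.tail]
            if len(fifo) >= self.depth:     # per-FIFO guard: stop, keep earlier enqueues
                break
            fifo.append(item)
            self.tail = (self.tail + 1) % len(self.fifos)
            accepted += 1
        return accepted

    def peek(self, count):
        """Return up to `count` oldest items in program order, without removing them."""
        out = []
        for k in range(min(count, len(self.fifos))):
            fifo = self.fifos[(self.head + k) % len(self.fifos)]
            if not fifo:                    # ran out of buffered items
                break
            out.append(fifo[0])
        return out

    def dequeue(self, count):
        """Remove and return up to `count` oldest items (at most one per FIFO)."""
        out = self.peek(count)
        for _ in out:
            self.fifos[self.head].popleft()
            self.head = (self.head + 1) % len(self.fifos)
        return out


bank = FifoBank(width=4, depth=2)
first = bank.enqueue(list("abcd"))      # fills one slot in each FIFO
second = bank.enqueue(list("efgh"))     # fills the second slot in each FIFO
blocked = bank.enqueue(list("ij"))      # every FIFO is full: nothing accepted
folded = bank.dequeue(2)                # decode consumes a two-instruction pattern
partial = bank.enqueue(list("ijk"))     # two FIFOs have room, the third is still full
window = bank.peek(4)                   # next candidates for pattern matching
```

In this model the decode stage would call `peek` to match patterns against the buffered micro-instructions and then `dequeue` exactly as many as the matched pattern consumes.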
Wider Fetch-Instruction Stage
With the decode stage (stage 3) now able to receive as many as four micro-instructions at once, more changes were required in the fetch micro-instruction stage (stage 2), to feed the four parallel FIFOs. Nevertheless, fetching multiple micro-instructions in the same clock cycle is not trivial, since it is uncertain where they should come from. The micro-instructions could all come from consecutive micro-addresses, consecutive bytecodes or a mix of the two. Additionally, the micro-ROM must also allow four simultaneous reads, but no special provisions in the BSV code were needed for this. The modified stage is able to enqueue four instructions in one clock cycle, if the fetch bytecode stage (stage 1) provides bytecodes sufficiently fast.
Multiple Bytecode FIFOs
The fetch instruction stage (stage 2) can only enqueue micro-instructions from several bytecodes when they are available. A circular buffer of multiple FIFOs was implemented between stages 1 and 2, similar to the one between stages 2 and 3.
Wider Fetch-Bytecode Stage
Minimal modifications to the bytecode cache and stage 1 were necessary in order to fill the bytecode buffer between stage 1 and 2 in one clock cycle.
The changes described above were made in such a way that a number of design parameters can be configured. This allowed us to evaluate different solutions of the folded processor relatively easily. As an example, Figure 2 depicts the changed stages for a specific configuration. The configurable options are as follows:
• Fetch instruction width: valid values are 1, 2 and 4. This option determines how many consecutive micro-instructions the fetch instruction stage (stage 2) can simultaneously look up and read from the micro-ROM. It also matches the number of FIFOs between the fetch bytecode (stage 1) and fetch instruction stages.
• Decode width: valid values are 1, 2 and 4. This option regulates the number of decode FIFOs, which offer patterns to stage 3. Note that this option must be at least the length of the longest pattern.
• Folding patterns: this option can be any combination of the different folding patterns in Table 1.
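The constraints on these options can be written down as a small check. This is a hypothetical Python helper for illustration; the names `FoldingConfig`, `fetch_width`, `decode_width` and `patterns` are not taken from the BSV sources, and patterns are represented as class strings such as "ppoc".

```python
from dataclasses import dataclass

VALID_WIDTHS = (1, 2, 4)   # allowed values for both widths, as listed above


@dataclass(frozen=True)
class FoldingConfig:
    fetch_width: int        # micro-instructions fetched per cycle by stage 2
    decode_width: int       # number of decode FIFOs in front of stage 3
    patterns: tuple = ()    # enabled folding patterns, e.g. ("ppoc",)

    def validate(self):
        """Raise ValueError if the configuration violates the stated constraints."""
        if self.fetch_width not in VALID_WIDTHS:
            raise ValueError(f"fetch width must be one of {VALID_WIDTHS}")
        if self.decode_width not in VALID_WIDTHS:
            raise ValueError(f"decode width must be one of {VALID_WIDTHS}")
        longest = max((len(p) for p in self.patterns), default=1)
        if self.decode_width < longest:
            raise ValueError("decode width is shorter than the longest pattern")
        return self


ok = FoldingConfig(fetch_width=4, decode_width=4, patterns=("ppoc",)).validate()

try:
    FoldingConfig(fetch_width=2, decode_width=2, patterns=("ppoc",)).validate()
    too_narrow_rejected = False
except ValueError:
    too_narrow_rejected = True
```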
The decode stage hosts the actual folding mechanism. In this stage the micro-instructions in the decode FIFOs are matched against the folding combinations. Using BSV, the pattern matching was achieved rather easily. See Listing 2 for a section of the BSV code folding the ppoc pattern. 
RESULTS
With the folding modifications to the processor architecture in place, we were able to inspect various configurations of the design. It was interesting to look both at the reduction in executed clock cycles and synthesis results, such as maximum clock frequency and design area.
Setup
For synthesis we employed the following flow. The BSV specification was compiled to Verilog using the Bluespec bsc version 2006.11. These Verilog files and the required BSV-provided libraries were synthesized directly and without modifications through the Xilinx ISE 9.1 tool chain, with optimizations for speed. As a target for synthesis we used the Xilinx Virtex 5 family (xc5vlx30-3). This flow was used to obtain actual figures for maximum clock speed and device area utilization (equivalent gate count).
To obtain accurate traces and take advantage of instrumented code, we also used the other flow available for the BSV files, namely compiling the BSV code directly to an executable on our Linux host platform. The executable is a high level simulation of the BSV design, able to run in a cycle accurate manner. Since the BlueJEP processor is only part of a system which needs to be complete for a correct functionality, simplified models of peripherals (memory, serial, etc.) were also described in BSV. These components were connected together using the bus interface of BlueJEP. Finally, to make sure the timing for bus and memory accesses models the real system, we matched them against ChipScope traces from implementations running the original BlueJEP processor. Thus we kept the timing of the processor I/O signals as accurate as possible in the final executable simulation.

Table 3 gathers a selection of the results for simulations and synthesis. The first eight columns describe the configuration of the design, in terms of the number of the fetch micro-instruction FIFOs, number of decode FIFOs and implemented folding patterns. The next two columns show the absolute results of simulation and synthesis, in terms of number of clock cycles (for the same stop-the-world garbage collection application from section 3.3) and maximum clock frequency reported by synthesis. The next two columns present these figures compared to the unmodified BlueJEP processor. The Size column gives the relative device area (in equivalent gate count) compared to the original BlueJEP (≈500k equivalent gate count). For more details on the performance and size of the initial BlueJEP please refer to [5]. Finally, the last two columns compare the designs using their absolute performance (Clock Cycles / Clock Frequency) and their performance per device area unit.
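The derived columns can be computed as in the following sketch. The formulas follow the column descriptions (execution time as clock cycles divided by clock frequency, then normalised by relative area); the function names and all numbers are hypothetical and are not values from Table 3.

```python
def execution_time(cycles, freq_mhz):
    """Execution time in seconds: clock cycles divided by clock frequency."""
    return cycles / (freq_mhz * 1e6)


def relative_performance(base, cfg):
    """Speed of cfg relative to base (values above 1 mean cfg is faster)."""
    return execution_time(base["cycles"], base["freq_mhz"]) / execution_time(
        cfg["cycles"], cfg["freq_mhz"]
    )


def performance_per_area(base, cfg):
    """Relative performance divided by device area relative to base."""
    return relative_performance(base, cfg) / (cfg["size"] / base["size"])


# Hypothetical figures: folding removes 20% of the cycles, but the clock
# frequency drops by 40% and the device area doubles.
baseline = {"cycles": 1_000_000, "freq_mhz": 100.0, "size": 1.0}
folding = {"cycles": 800_000, "freq_mhz": 60.0, "size": 2.0}

perf = relative_performance(baseline, folding)
perf_area = performance_per_area(baseline, folding)
```

With these made-up inputs the folded design needs fewer cycles yet runs slower overall, which is the kind of outcome the cycle count alone cannot reveal.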
Discussion
As shown in Table 3, none of the configurations performs better than the unmodified version of BlueJEP. The main reason is the enormous drop in clock frequency due to the added hardware required for the folding mechanism. For example, take the configuration with fetch instruction width [...]

Another important observation is that the device area increases as well. In fact, for the widest configuration (that would allow the most throughput and fold the most instructions), captured in the last row of the table, the device area more than doubles compared to the unmodified BlueJEP. Thus, looking per device area unit, the performance decreases even more.
To summarize, adding a folding mechanism to our original processor does not seem to pay off from either a performance or an area point of view. One can always add a second processor and obtain better performance than with hardware folding.
We make that statement with some reservations, however. First, a multi-method bytecode cache would increase the performance in general. Second, BSV generates rather large designs in the first place (for comparison, the original BlueJEP takes twice the device area used by JOP [4]). Rewriting the first three stages in VHDL, for better control over the device area and critical path, may yield better designs. Furthermore, splitting some pipeline stages would also reduce the critical path and may improve performance. Nevertheless, the theoretical figures from Table 2 limit the gains to a maximum of 64% at best. The area overhead required by the folding mechanism must be small enough to warrant such a limited performance improvement compared to a two-processor system. Modifications to the execution stage to allow several operations in parallel, such as adding more read/write ports to the stack, could open the possibility of using longer folding patterns. Nevertheless, these will require even wider buffers, and overall even more device area. For small embedded processors, this may not be a feasible option. Moreover, folding makes the execution time analysis much more complex than for processors without folding. This goes against the real-time predictability principle, a requirement for many embedded systems. Overall, for our case, making folding more efficient would require an extensive redesign of the architecture with marginal benefits. Instead of increasing the complexity of the processor, we see the real advantages coming from adding more processors to the system, which has become a natural trend today.
CONCLUSION
In this paper, we have investigated the impact of implementing a hardware folding mechanism on an already existing Java processor, BlueJEP. The architectural modifications implement fixed-pattern folding at micro-instruction level. Several configurations were evaluated, thanks to the capabilities offered by Bluespec System Verilog, both through cycle accurate simulations and synthesis for a Xilinx FPGA. Although in theory folding shows a significant reduction in the number of executed instructions, our results show that the critical path is increased to an extent where the absolute performance decreases. Furthermore, the device area for the configuration with the most folding is more than double the size of the unmodified processor. We conclude that using several simple processors instead of folding hardware would in fact be more efficient from the performance point of view.
