ing desktop calculator. The first transistor desktop calculator had been developed in Japan a few years earlier, in 1964. It was quickly put into commercial production, leading to a rapid growth in the calculator market and making Japan a major manufacturer of calculators. At that time, desktop calculators still used wired logic. For calculator technology to proceed, however, a new approach to logic design was needed-one that would allow the design, modification, and addition of new system functions in a short period of time.
Moving to ROM-Based Programmed-Logic
In 1968 at Busicom, I developed a printing desktop calculator with a ROM-based, macroinstruction programmable decimal computer architecture. Like a computer, the calculator consisted of:
input-output (I/O) devices com-■ ■ prising a keyboard, a display, and a printer a processor for data processing Figure 1 shows a block diagram of the CPU in this calculator. The CPU was multichip; its data path consisted of a 4-b serial ALU operating on decimal and binary data, an accumulator, a keyboard input register, a multiplication and division register; control registers (comprising counter, decimal point register, and flag status register), a timing module; and data memory.
The control unit consisted of an 8-b program counter with an address incrementer, a ROM for storing the program, an 8-b instruction register, an instruction decoder, and an instruction execution control module. The instruction set included 23 instructions for operations such as data transfer, data exchange, load immediate, clear, addition and subtraction of decimal data, shift, addition and subtraction of binary data, increment and decrement, flag control, unconditional jump, conditional jump, and print. The program size was less than 256 B. The I/O device control was implemented in wired logic. We designed the logic at the transistor level instead of the gate level so that it could be used for both the circuit and the layout designs.
LSI Enters the Scene
of control registers was increased to eight; these registers were collectively called the "index register." The instruction set was improved, and subroutine jump and processor stop instructions were added. Busicom planned to develop seven different types of LSI chips, one each for the program control, the arithmetic unit, the ROM for program, the shift register for data, the timing circuit, the printer control, and the output buffer. The printing calculator was to be constructed with nine LSI chips. The chips were to be used in a variety of applications-business calculators, scientific calculators, billing machines, teller machines, and cash registers.
Busicom 
Intel's Proposal
It was in late August 1969, when the project was at a standstill, that In responding to Intel's proposal, I emphasized the need to meet the demands of the application in the system architecture and instruction set.
With this emphasis in mind, as well as our goal of using only LSI chips to construct a system, Busicom and Intel agreed to develop a program ROM and a data RAM in addition to the processor. Then, we jointly defined a 4-b time-multiplexed bidirectional system bus for connecting the processor directly with both ROM and RAM. We also agreed to add a decimal adjust addition (DAA) instruction to be used after the binary addition of binary-coded decimal (BCD) data in multidigit decimal numbers. With DAA, the time required for onedigit of addition was reduced to 107 microseconds with a 750-kHz clock.
I conducted both a static evaluation and a dynamic evaluation of the proposed instruction set by utilizing a macroinstruction set and an application program such as would be used in the printing calculator. I also asked Busicom to measure the contact time of the keyboard's key. On the basis of these evaluations and measurement, I concluded that it would be critical to the success of the project to add a collection of new useful instructions, redefine some instructions, and delete unnecessary instructions; with these changes it would be possible to reduce the ROM size, increase performance, implement an interpreter, and achieve realtime control of I/O devices. Figure 2 shows the sources of the instructions in the final 4004 instruction set.
Developing the Instruction Set
The instruction development proceeded in the following sequence. For real-time control of I/O devices, I added the status of an external input pin (Test) as one of conditions of the conditional jump instruction (JCN). Next I added a 10-b static shift register LSI (SR) to serve not only for keyboard scanning but also as an output buffer for a printer.
Further, I added the keyboard process (KBP) instruction for converting a 4-b row data of the keyboard into 0, 1, 2, 3, or 4, or, if two or more keys were pressed simultaneously, into 15.
So that we could implement an interpreter program, we added a fetch indirect instruction (FIN) to load the 8-b data from ROM addressed by the pair register PR0 into the pair register PRn. Next I re defined the BBL instruction (branch back and load data to the accumulator) so that it loads a constant return value n into the accumulator on execution. Figure 3 shows the interpreter program. It uses PR0 as the program counter of the macro program while the PR1 register holds the start address in ROM for the macroinstruction to be executed. The interpreter first executes JCN, which checks the Test input to determine if there is a request from an external device. If there is, it jumps to the I/O control program; if not, the interpreter begins normal program execution.
First, the interpreter executes FIN to fetch the start address for the current macroinstruction from the ROM addressed by PR0 and stores this start address into PR1. Next, using the jump to subroutine instruction (JMS), it transfers execution to the top of the macroinstruction program; then it jumps to the current macroinstruction's program by using the register-indirect jump instruction (JIN) with PR1. In the last step of execution of the macroinstruction program, a nonzero return value is stored into the accumulator with BBL if the macroinstruction is a branch instruction, and the carry is set if the branch condition is not met.
On returning from the macroinstruction program, the interpreter updates PR0, processes the branch macroinstruction if any, and then returns to its start. The program size of interpretive routine in the interpreter was 16 B, the longest macroinstruction took 2 ms to execute, and real-time control of I/O devices was made possible by polling the synchronization signal of the printer via the Test input pin at 2-ms intervals in the interpreter. This became one of the keys to the success of the 4004 system.
Optimizing Instructions
The original proposal from Intel included the store instruction (STO) to transfer data from the accumulator to the index register, but it did not include the load instruction (LD) to transfer the data in the opposite direction. In optimizing the instruction set, the STO instruction was replaced with LD, and the exchange instruction (XCH), which swaps the contents of accumulator and index register, was added. The XCH instruction does not destroy the contents of either register. The addition of XCH and eight more RAM status registers per chip made up for the lack in number of registers in the index register.
The proposed shift instruction (SH) was replaced with the more useful rotate instruction (RA). To reduce program size further, the following instructions were also added: the clear-both instruction (CLB) to clear both the accumulator and the carry; the complement-carry instruction (CMC); the transfer-carry-subtract instruction (TCS) to be used for decimal subtraction; and the nooperation instruction (NOP) to be used for the software timer and the debugging. Finally, the designate command line instruction (DCL) and the chip select output pins were added for selecting RAM without an external decoder circuit. At the same time, keyboard input pins and the halt instruction (HLT) were deleted.
It took me three months to optimize the system configuration and the instruction set. However, with these major changes to the instruction set, I reduced program size by around 30%.
A Brief Return to Japan
I returned temporarily to Japan in December 1969 to finalize the details of the printing calculator's program and confirm the instruction set. Busicom and Intel formally signed a contract for the development of the LSI chips on February 6, 1970, with the development fee set at US$60,000. In mid-March of that year, Busicom sent Intel the formal functional specification and the instruction set, attaching detailed diagrams to avoid misunderstanding.
I visited Intel again to verify the logic in April 1970, only to find that the project had made little progress since I had returned to Japan-all Intel had done in the intervening months was to hire one development engineer and two layout designers for the project. There was no logic design engineer. Consequently, I joined the design group and took charge of the logic design of the 4004 CPU, logic simulation, layout checking, and test program generation. Busicom in Tokyo took charge of building a CPU emulator for logic verification.
I designed the logic at the transistor level instead of the gate level so that it could be used for both the circuit and the layout designs. First, I clearly defined the interface signals between the functional modules and then made a detailed functional block diagram of the 4004 CPU before proceeding with the detailed logic design. Figure 4 shows the block diagram of the 4004 CPU. This was an important step; back in the 1970s, the success of a microprocessor development depended largely on the quality of its detailed block diagram.
Reducing the Transistor Count
Because we chose a three-transistor DRAM cell for the address stack and the index register, each cell contained one read bit line and one write bit line. First of all, an address incrementer was built in the refresh circuit of the address stack and was executable in one state by using two interleaved clocks. Next, in order to calculate early the total number of transistors needed, the logic design started from those modules that used relatively large number of transistors. Those modules were the index register (402 transistors), the address stack with an address incrementer (374), the instruction register (72), the timing circuit (99), the command control (100), and the system bus interface control (92), which brought the total number of transistors used to 1139. This meant that it would be impossible to design the 4004 CPU within the 2,000-transistor limit that was first estimated by Intel in the proposal. The logic design was now reaching its completion. In order to simplify the logic-and thereby reduce the transistor count-we decided that any machine cycle would be composed of eight states (A1, A2, A3, M1, M2, X1, X2, and X3). A combinational logic circuit was used for instruction execution control, and constructed with signals of the instruction decoder (48 PLA terms), the instruction encoder (6 terms), state timing, and machine cycle timing. A major challenge at this point-and critical for logic simplification-was optimizing by balancing the selection of the instruction decoders and encoders with the number of combinational logic circuits.
Next, taking the overall layout into consideration, I carefully designed the ALU module (329 transistors), including the accumulator and decimal adjust circuit, the instruction decoder and encoder (269), and the instruction execution control module (500). Then, to reduce testing time, I decided to send the accumulator and the carry flag out to the system bus when it was idle. In the end, the number of transistors used for the 4004 totaled 2,237.
Only One Logic Error
In the logic simulation carried out in August 1970, only one logic error was found. Consequently, the layout of the 4004 was almost the same as the transistor-based logic circuit schematics that I drew in the last phase of the logic design. I have been feeling that after Busicom transferred the rights to the 4004 to Intel in 1974, Busicom's contribution to the 4004 development faded out.
Back to the Desktop Calculator
After completing the test program generation and the layout checking, I flew back to Japan in November 1970 and returned to my original work of developing the printing desktop calculator. At Busicom, I built an engineering prototype with the 4004 CPU, 4001 ROM emulator, 4002 RAM, 4003 SR, a card reader, and a control panel with display. The program for the calculator was completed in March 1971. The total program size was kept to within 1,000 B, with approximately 250 B allocated for the application program, 400 B for the macroinstruction program, and 350 B for the interpretive routine, keyboard control, and printer control. Figure 5 shows the motherboard of the 4004-based printing desktop calculator. Figure 6 shows the world's first microprocessor-based calculator.
The Moment of Truth
The long-awaited moment finally came in April 1971 when the 4004 CPU arrived at Haneda Airport. After some simple testing of the 4004 CPU, the 1-Kbyte program was loaded into the ROM emulator via a card reader. Now all I had to do was push and release the reset button of the engineering prototype. While not being afraid of failure, I was also fully aware that the outcome of the two-year project would be determined in that one moment. I pushed the reset button, but hesitated for a while before releasing it; once released, the final result would be inescapable.
After a deep breath, I took the plunge and released the button. The calculator activated, and the program started to run. The address display of the program counter indicated that the keyboard scan program was executing. Unable to wait any longer, I pressed the number keys and then the addition key. It felt like I waited forever for the printer to start. Finally, the printer roared into life, and the numbers and addition symbol I had input were printed out. I pressed some more number keys, the addition key, and the equal key at the end. "It's working!" I felt my heart pounding and my entire body flash hot with excitement, while my head alone remained sober. This was the very moment of the birth of the world's first microprocessor, the 4004, and the microprocessor-based system. I still remember vividly the feel of the reset button on the prototype model, the hopes and fears I felt, and my racing heartbeat when I released that button.
More Innovative Microprocessors
After the 4004-based calculator had been successfully developed, I moved to Ricoh Corp. There, using NEC's NEAC-M4 8-b minicomputer, I designed a controller for a high-speed printer and a production tester for a drum memory. Then I joined Intel in 1972 and developed the 8-b 8080 microprocessor, as well as the 8080 peripheral LSI chips as supervising engineer (Figure 7 ). I moved on to Zilog and developed the Z80, which later came to be called the ultimate 8-b microprocessor. I returned to Japan in 1980 after completing the development of the 16-b Z8000 microprocessor.
My life as a development engineer was enriched by my decade-long active involvement in developing innovative microprocessors, and I look back with nostalgia on those exciting pioneering days.
