Inside the PowerPC AS

In early 1991, Apple Computer was looking for a new microprocessor for its future computers. Apple believed that the future for small-computer processors was a Reduced Instruction Set Computer (RISC) design rather than the Complex Instruction Set Computer (CISC) designs then in use. Compared to CISC, RISC had fundamental benefits: It offered a substantial performance advantage, yet the chips were smaller and used less power. But major vendors still produced primarily CISC processors. Motorola, Apple's traditional supplier, had produced a RISC processor, but it was not a commercial success.

At the time, IBM had one of the industry's best RISC processors in its RISC System/6000 (RS/6000). These processors were all multichip, 32-bit designs, but IBM planned to announce a single-chip, or microprocessor, version (called RSC for RISC Single Chip) in early 1992. Hearing of Apple's search for a new RISC microprocessor, IBM decided to make a sales call at Apple. Would Apple like to use RISC microprocessors from IBM?

Once the shock of having IBM, a major competitor, try to sell it processors had worn off, Apple began to think that using IBM chips might not be such a bad idea. Motorola also was looking at a new RISC design, and Apple suggested that the three companies get together. In September 1991, IBM, Apple, and Motorola announced an alliance to pursue the development of several exciting new technologies, including a broad family of microprocessors based on the RSC and named the PowerPC.

From this small beginning eventually sprang the PowerPC AS that is the engine for the new AS/400s announced in June. This is the story of how the AS/400 got RISC.

The Evolution of PowerPC

Today's RISC processors owe much to the pioneering efforts of Seymour Cray as he designed the world's first supercomputer at Control Data Corporation in the early 1960s. In his design, Cray incorporated pipelining as a way to let the processor work on more than one instruction at a time (see the sidebar "Pipelining" for an overview of pipelining technology). Cray's design also used a simple instruction set, which improved pipeline performance. RISC processors today use this same basic design: a pipeline and fairly simple instructions. Cray also introduced hardware to allow instruction resequencing and conditional branch prediction so that the pipeline was kept as full as possible.

John Cocke, at the IBM Research Laboratory in San Jose, California, developed the concepts behind RISC in the late 1960s. Cocke was concerned with the complexity and expense of the hardware needed to keep pipelines full and believed that much of the responsibility for instruction sequencing could be given to optimizing compilers, thus simplifying the hardware and leading to lower costs. If the compilers could absorb this complexity, high-performance processing would no longer be the sole domain of supercomputers. This marriage of a simple instruction set and a carefully crafted compiler was the beginning of RISC as we know it today.

Unfortunately, Cocke's research project was canceled before he could prove his ideas. He got another chance in 1976 at IBM's Yorktown Research Laboratory in New York, where he was designing and building a high-speed telecommunications controller. Code-named the 801 after the building in which Cocke worked, it has generally been regarded as the first RISC computer. The 801 showed that compilers could take over the scheduling of a pipelined processor.

The first IBM product to use a descendant of the 801 was the PC RT. The 801 was also the base for RISC processors developed by other vendors. In the early 1980s, research projects under David Patterson at the University of California at Berkeley and under John Hennessy at Stanford University both had graduate students who had worked at IBM Research and knew about the 801. Patterson coined the term "RISC," and his project led to Sun's SPARC microprocessor. Hennessy's work led to the MIPS Computer Systems microprocessor. Meanwhile, at nearby Hewlett-Packard (HP), Joel Birnbaum, who had earlier managed the 801 group at IBM Research, led the effort to design the PA-RISC architecture.

Like the original 801, these early RISC processors used a single pipeline. Meanwhile, Cocke and others at IBM were working on dispatching multiple instructions each cycle to multiple pipelines from a conventional linear instruction stream, thereby further increasing performance. They called this a superscalar machine. The first superscalar RISC processor appeared in the RS/6000 in 1990. This architecture, named POWER for "Performance Optimization With Enhanced RISC," was the starting point for the joint effort between Apple, IBM, and Motorola.

IBM and Motorola, with help from Apple engineers, created the Somerset Design Center in Austin, Texas, to develop future PowerPC microprocessors. (Readers familiar with the story of King Arthur may recognize Somerset as the place warring factions went to make peace.) The first single-chip PowerPC processors were:

the PowerPC 601, a medium-performance microprocessor for desktop computer systems; as a transition from the POWER architecture to the PowerPC, it implements a superset of features from both architectures so that it can also be used in the RS/6000
the 602, an entry model intended for use in consumer products
the 603, a low-power-consumption microprocessor for desktop and portable computers
the 604, a high-performance microprocessor for uniprocessor and multiprocessor desktop computers and technical workstations
the 620, a 64-bit high-performance microprocessor for technical workstations, servers, and multiprocessor systems

Let's look at how these processors relate to those used in the new AS/400s.

The AS/400 Commercial RISC Processor

The AS/400's original processor architecture, which had the unwieldy name Internal Microprogrammed Interface or IMPI, was primarily a memory-to-memory architecture that was designed in the mid-1970s to support interactive, transaction-processing commercial applications. Data could be fetched from memory, modified by the processor, and returned to memory all in one instruction. (For more information about the IMPI, see "The AS/400 Architecture," August 1988.) Interactive, transaction-processing applications typically move lots of data but modify very little of it. Response time is vitally important, so a processor that can move lots of data and complete operations quickly is critical to the system's success. The AS/400 excels at this type of application.

In contrast, typical engineering/scientific applications are compute-intensive: They tend to do lots of computations on relatively small amounts of data. Their small instruction working sets have many tight loops executing floating-point arithmetic. Even their I/O is often sequential rather than random. A RISC processor, which works on data only in registers, is the best choice for these applications.

Although commercial applications are still the AS/400's mainstay, more and more compute-intensive applications are finding their way onto the system. Client/server computing, in particular, tends to require more compute-intensive operations than do interactive applications and has increased the need for the AS/400 to improve its compute-intensive performance.

It was obvious to us in Rochester that the AS/400 needed RISC processor characteristics to satisfy future computational performance needs. In 1990, we began an effort to add RISC features to the IMPI.

Many RISC processors at the time had 64-bit floating-point registers, but their integer registers, which are used for most commercial application operations, were only 32 bits. Because a RISC processor can move data only to and from its registers, the 32-bit register quickly becomes a bottleneck for the massive data transfers characteristic of commercial processing. CISC architectures such as IMPI that can move and process data without going through the registers essentially bypass the bottleneck. We decided that our future design needed a full 64-bit processor.

Another factor that influenced this decision was the AS/400's 48-bit address size: There was no way to squeeze a 48-bit address into a 32-bit register. A 64-bit processor would let us expand the address used in the IMPI to 64 bits, which our projections showed large future AS/400s would need.

The decision was made. We would start with the IMPI to minimize low-level operating-system code changes, expand it to 64 bits, and add RISC computational operations to create the first RISC processor designed exclusively for commercial computing. We called the processor C-RISC for "Commercial-RISC."

PowerPC Technology for the AS/400

Jack Kuehler, president of IBM in 1991, believed that by the end of the decade, all computers would use RISC microprocessors and that only a handful of companies would be building the microprocessors. He was betting that the PowerPC alliance would be one of the few survivors. But Kuehler couldn't understand why two of his laboratories were developing new RISC processors. The lab in Austin was working on the definition of the PowerPC, while the lab in Rochester was working on C-RISC. Convinced that the PowerPC was the right answer for both of us, he wanted to know why Rochester couldn't also use it.

We dutifully traveled to Armonk to explain the differences between processors designed for the AS/400 and other RISC designs. Although Kuehler didn't dispute our success in building commercial processors, neither did he buy our reasons, repeatedly sending us back for more data to prove our position.

After about our third visit to Armonk, Kuehler was finally convinced that we knew what we were talking about. Still, he sent us back one more time, giving us 90 days to bring him the answers to two questions: What changes to the PowerPC architecture were needed to make it suitable for the AS/400? And what would it cost to move the AS/400 to this new architecture?

Starting in early April 1991, I led a team of 10 of the very best people from Rochester to answer these questions. At the outset, it appeared that the requirements for a Rochester system were at odds with the goals for the PowerPC, so we decided to start with the PowerPC architecture and add the extensions we would need for the AS/400. The PowerPC architecture was defined to run in both a 32-bit and a 64-bit mode. We considered very few changes to the 32-bit subset, concentrating instead on the 64-bit mode.

Many of the changes we considered looked difficult to incorporate into the early versions of the PowerPC. We didn't even consider trying to extend the PowerPC 601, 603, or 604 designs, which supported only the 32-bit mode. The high-end AS/400s needed processors that could handle massive amounts of data, which required very wide data buses. It seemed doubtful that we could extend the 64-bit 620, because the chip was not dense enough to hold a processor of the size we needed at the high end of our line. We would have to build a multichip version of the architecture. Even for the low-end AS/400s, it wasn't clear what to use. We briefly considered using a variant of the 620, but Somerset couldn't deliver it in time, as their first priority was the 32-bit design.

The software picture for a PowerPC-based AS/400 looked much worse. Application software and system software above the Machine Interface (MI) were protected by the technology independence of the AS/400 architecture. But because the PowerPC instruction set was so different from the IMPI, the operating-system code below the MI would have to be converted. Some of it could be ported to the new hardware with the help of automated tools, but much of it would have to be rewritten, a major undertaking. We estimated that we'd need several hundred new system programmers to do the work on the RISC-based systems.

Finally, there was the problem of schedule. We had planned to ship the new C-RISC processors in mid-1994 and had already started to work on them for the Advanced Series. If we used PowerPC instead, the schedule for the RISC processors would have to slip into 1995 — it would take us that long to build the processors with the extensions we needed and to rewrite the internal code. We would have to ship the Advanced Series with the IMPI processors and ensure that they could be upgraded when the PowerPC processors were available.

In early July of 1991, we went back to Jack Kuehler with our results, expecting him to thank us for a good job, conclude that moving to the PowerPC was just too expensive, and send us back to build C-RISC. Instead, he gave us the green light and provided the development resources to make it happen. Suddenly, we had to totally change both the hardware and software directions for our Advanced Series.

By the end of July, we had reorganized the lab to concentrate on the new system. The design of our high-end processor, called Muskie, was Rochester's responsibility. (Large game fish of the pike family, muskies are found in the lakes of northern Minnesota and Wisconsin and survive by eating other fish.) It was to be a multichip implementation that could satisfy the demands of very large commercial computers. The design for the low end of the AS/400 line went to the IBM laboratory in Endicott, New York. Called Cobra, it was to be a single-chip, 64-bit design. (Most people think the design team at Endicott likes to name processors after snakes. Actually, they use the names of high-performance American sports cars.)

The effort to port the AS/400 microcode to the new processors also started in earnest. Here was the opportunity to redo the internals of the AS/400's operating system (the code below MI), something that hadn't been done since the original S/38 design. We decided to go all the way and use the latest object-oriented programming methodologies to achieve the most modern operating system in the industry. We needed to hire hundreds of new people, train them in object-oriented technologies, and rewrite the most critical part of the AS/400 operating system in a short time. Many industry observers wondered what we were up to with our massive hiring campaign for system programmers with C++ experience. Now they know.

The PowerPC Architecture

In most respects, the PowerPC architecture is fairly conventional, containing all the expected characteristics of a RISC architecture, including fixed-length instructions, register-to-register architecture, simple addressing modes, and a large set of registers. But other features set it apart.

As I've mentioned, the PowerPC has a full 64-bit architecture with a 32-bit subset. Required and optional instructions are defined for both the 32-bit and the 64-bit instruction sets. A mode switch lets 64-bit processors run 32-bit programs.

A superscalar implementation dispatches multiple instructions to multiple pipelines in a single clock cycle. The instructions can be executed concurrently and can even finish out of order, adding parallelism that can greatly increase overall performance. Instructions are dispatched simultaneously to three independent execution units: a branch unit, a fixed-point unit, and a floating-point unit. Figure 1 illustrates these units as well as the instruction cache, data cache, memory, and I/O space.

The PowerPC architecture defines an independent set of registers for each execution unit. Each instruction can be executed in only one type of unit. Thus, each unit has its own set of registers plus its own instruction set. An execution unit can also have more than one pipeline, allowing more than one of the unit's instructions to be executed simultaneously. The duplication of resources such as registers also means that minimal communication and synchronization are required between units. Execution units can adjust to the changing dynamics of an instruction stream and enable one instruction to slip past another and be completed out of order.
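To make the dispatch idea concrete, here is a minimal sketch in Python. It is not the PowerPC dispatch logic: it assumes strictly in-order dispatch, ignores data dependencies and out-of-order completion, and uses hypothetical names (`UNITS`, `dispatch_cycles`, `pipes_per_unit`). It only shows how a linear instruction stream is split into per-cycle groups when each execution unit can accept a limited number of instructions per cycle.

```python
from collections import Counter

# Hypothetical labels for the three execution units named in the article.
UNITS = ("branch", "fixed", "float")

def dispatch_cycles(stream, pipes_per_unit=None):
    """Group an in-order instruction stream into per-cycle dispatch groups.

    stream: list of unit names, one entry per instruction.
    pipes_per_unit: optional dict giving the number of pipelines per unit
    (assumed to be 1 for every unit unless overridden).
    """
    pipes = dict.fromkeys(UNITS, 1)
    pipes.update(pipes_per_unit or {})
    if any(count < 1 for count in pipes.values()):
        raise ValueError("every unit needs at least one pipeline")
    for unit in stream:
        if unit not in pipes:
            raise ValueError(f"unknown unit: {unit}")

    cycles = []
    i = 0
    while i < len(stream):
        used = Counter()   # pipelines already claimed in this cycle
        group = []
        # Dispatch in program order until the next instruction's unit is full.
        while i < len(stream) and used[stream[i]] < pipes[stream[i]]:
            used[stream[i]] += 1
            group.append(stream[i])
            i += 1
        cycles.append(group)
    return cycles

stream = ["fixed", "float", "branch", "fixed", "fixed", "float"]
one_pipe = dispatch_cycles(stream)
two_fixed_pipes = dispatch_cycles(stream, {"fixed": 2})
print(len(one_pipe), len(two_fixed_pipes))
```

In this toy stream, giving the fixed-point unit a second pipeline removes one dispatch cycle, which is the effect the paragraph above describes for units with more than one pipeline.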

Yet another distinguishing characteristic of the PowerPC architecture is the use of several compound instructions. The biggest drawback of RISC relative to CISC is code expansion: More RISC instructions, and thus more processor cycles, are required to perform the same task. Compound instructions can reduce that code expansion. Although RISC purists accused the PowerPC architects of "selling out" to the forces of CISC, the architects were simply recognizing the fact that certain operations, such as moving unaligned byte strings, occur frequently enough in commercial applications to benefit from optimization. If a compound instruction does the job but violates some unwritten rule of RISC purity, so be it. Rather than signaling a return to CISC architectures, compound instructions just show that nothing is ever black and white.

Some in the industry argue that the added complexity makes it difficult to achieve high clock rates. They believe better performance can be achieved by increasing clock rates rather than aggressively increasing instruction-level parallelism. To see the difference, look at HP's PA-RISC, which follows the high-clock-rate philosophy. A high-end PA-RISC processor can typically dispatch two instructions per cycle. The smaller PowerPC processors dispatch three instructions per cycle, and the high-end models can dispatch four or more. This added parallelism gives the PowerPC a throughput advantage at the cost of added complexity, which can slow the clock rate.

The debate over which is the best design philosophy continues. The two camps have been called "Speed Demons" (high clock rates) and "Brainiacs" (complexity). The clock rate, usually specified in megahertz (MHz, or millions of processor cycles per second), doesn't always indicate comparative performance. A 150 MHz Brainiac may easily outperform a 300 MHz Speed Demon; it all depends on the program being executed and the amount of instruction parallelism the compiler can exploit.
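The comparison can be written as a short worked inequality. The symbols are introduced here for illustration and are not from the original text: f is the clock rate, IPC is the average number of instructions completed per cycle on a given program, and the subscripts B and S stand for the Brainiac and the Speed Demon.

```latex
\[
  \text{throughput} = f \times \mathrm{IPC}
\]
\[
  f_B \,\mathrm{IPC}_B > f_S \,\mathrm{IPC}_S
  \;\Longleftrightarrow\;
  \frac{\mathrm{IPC}_B}{\mathrm{IPC}_S} > \frac{f_S}{f_B}
  = \frac{300\ \text{MHz}}{150\ \text{MHz}} = 2
\]
```

Under this simple model, the 150 MHz Brainiac wins whenever it sustains more than twice as many instructions per cycle as the 300 MHz Speed Demon on the program in question.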

From recent industry announcements, it appears that HP has succumbed to the siren song of complexity and that the scales may be tipping in favor of a Brainiac design such as the PowerPC. HP's new PA-8000 will join the Brainiac camp in 1996.

Extensions to the PowerPC

The most significant extension to the PowerPC architecture for the AS/400 is the support for memory tags. When we introduced the concept of single-level store with the S/38, we needed a mechanism to protect areas in memory used to store pointers that user applications should not directly modify. A special memory protection bit, called a tag bit, was associated with every word (32 bits) in memory. If a user directly modified any part of a pointer in memory, the hardware would turn off (set to 0) the associated tag bit, rendering that pointer invalid.

The tag bit had to be hidden and kept where user applications could not get at it directly. It could not be one of the data bits in the word, because a user application could change data bits. It had to be a separate bit, but where to keep it?

The S/38 used separate error-correction code bits for every word in memory and kept these bits in a part of memory that was not visible to programs above the MI. We added the tag bit to the error-correction code bits. When a user program modified a word in memory, the processor would automatically turn off the tag bit. If that word contained any part of a pointer, the pointer would become invalid. Only the operating system code below the MI had instructions to turn the tag bits on.

The AS/400 also uses the tag bits in memory. Because the original PowerPC architecture didn't recognize tag bits, we added a tags-active mode to the architecture. In this mode, the processor recognizes that tag bits exist, and it turns off the tag bit whenever a user modifies a word in memory. All AS/400s run PowerPC AS processors in tags-active mode.
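The tag-bit rule described above can be modeled in a few lines of Python. This is a toy model, not the hardware design: the class and method names are hypothetical, and the pointer size of four words is an assumption made for the sketch (the text says only that a pointer occupies tagged words). What it shows is the rule itself: any user store clears the tag on the word it touches, only a privileged operation sets tags, and a pointer is valid only while all of its words are still tagged.

```python
WORDS_PER_POINTER = 4  # assumption for this sketch, not stated in the article

class TaggedMemory:
    """Toy model of word-addressed memory with one hidden tag bit per word."""

    def __init__(self, size_words):
        self.words = [0] * size_words
        # Tags live outside the data bits, so user stores cannot set them.
        self.tags = [False] * size_words

    def user_store(self, addr, value):
        """An ordinary store: writes the word and always clears its tag."""
        self.words[addr] = value & 0xFFFFFFFF  # 32-bit words
        self.tags[addr] = False

    def system_store_pointer(self, addr, parts):
        """Stands in for a privileged (below-MI) operation that sets tags."""
        if len(parts) != WORDS_PER_POINTER:
            raise ValueError("a pointer must fill every pointer word")
        for offset, part in enumerate(parts):
            self.words[addr + offset] = part & 0xFFFFFFFF
            self.tags[addr + offset] = True

    def pointer_is_valid(self, addr):
        """A pointer is valid only if all of its words are still tagged."""
        span = self.tags[addr:addr + WORDS_PER_POINTER]
        return len(span) == WORDS_PER_POINTER and all(span)

mem = TaggedMemory(8)
mem.system_store_pointer(0, [1, 2, 3, 4])
valid_before = mem.pointer_is_valid(0)
mem.user_store(2, 99)          # user code overwrites part of the pointer
valid_after = mem.pointer_is_valid(0)
print(valid_before, valid_after)
```

The user store still succeeds as a data write; the only consequence is that the words can no longer be used as a pointer.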

The PowerPC architecture defines privileged operations and instructions used only by the operating system — application programs don't use these privileged operations and instructions. The tags-active mode enables the extensions that were added for the AS/400 and determines how the privileged operations and instructions are defined.

For example, the virtual memory address translation hardware needs to support both a single-level store with a single address space and a conventional store with a separate address space for each process. The tags-active mode tells the processor to use single-level store, while the tags-inactive mode tells it to use the conventional PowerPC address translation.

Other extensions for the AS/400 include decimal support instructions, some new load and store instructions, and some enhancements to an internal processor status register to improve branching.

AS/400 Processor Implementations

The two AS/400 processor implementations, Muskie and Cobra, support only the tags-active mode and only the AS/400 I/O structure. Thus they can run applications (which use only standard instructions), but not operating systems (which use unsupported optional instructions), written for a standard PowerPC processor. Any other operating system running on one of the AS/400 processors would have to use AS/400-specific facilities for such functions as I/O. Future AS/400 processors will implement both the tags-active and the tags-inactive modes and will support other I/O structures.

Figure 2a shows the processor used in each of the new models. Figure 2b shows the upgrade paths that will be available. An overview of the current processor implementations will show how they are designed to support an AS/400 environment.

The Muskie (A30) Processors

The Muskie or A30 processor is a single-module, multichip, pipelined, superscalar design intended for high-end AS/400s. It features a fixed-point unit that can dispatch and execute up to four instructions per cycle. The A30 comes in two versions, one with a cycle time of 8 nanoseconds (which equates to a clock rate of 125 MHz) and the other with a cycle time of only 6.5 nanoseconds (for a clock rate of 154 MHz). From a performance perspective, the faster fixed-point unit can execute more than 600 million instructions per second (MIPS). The implementation includes a floating-point unit with a peak rate of 500 million floating-point operations per second (MFLOPS), an 8 K (on-chip) instruction cache, a 256 K (on-module) cache, and support for up to 64 GB of main memory. The A30 also supports multiprocessor configurations.

With all of its support circuitry, the A30 uses seven chips, packaged on a single multichip module with more than 25 million transistors. One chip is an I/O control unit that technically is not a part of the processor. Figure 3 shows the six chips that make up the processor complex and the interconnections between them.

Muskie uses BiCMOS (bipolar/CMOS) technology, which is a good compromise between the performance of bipolar technology and the lower-temperature CMOS technology. Besides being the fastest AS/400 processor to date, the A30 is optimized for commercial processing. A few of its characteristics illustrate the point:

Commercial systems and servers must handle lots of data. The A30's 16-byte (128-bit) and 32-byte (256-bit) buses let it handle massive amounts of data and large numbers of instructions. Compare this to the typical 8-byte (64-bit) buses used in most high-performance RISC processors, which are designed for use in a workstation whose data requirements are far less demanding.
The cache memory is a bottleneck in most RISC designs, even if the system can move large amounts of data. The A30's 256 K, single-cycle data cache overcomes this bottleneck. The cache bandwidth of 5.34 GB per second and system bus bandwidth of 2.67 GB per second are double the bandwidth of other very-high-performance RISC processors designed for technical computing.
Since a branch instruction can stall a pipelined processor, RISC processors today implement a form of branch prediction, just as the supercomputers do. Branch prediction in most RISC processors is typically 80 percent to 90 percent accurate on technical workloads. For commercial workloads, which use far fewer loops, accuracy may be as low as 50 percent — a random guess would be as accurate. Rather than trying to guess which branch target will be needed, the A30 prefetches the instructions at all branch targets and begins to execute them — an approach called speculative execution. Speculative execution requires a very-high-bandwidth cache but achieves essentially 100 percent accuracy for all workloads.
Another important aspect of commercial processing is the need for high data integrity and high availability. The A30 implements full error-correction codes or parity on all off-chip signals. Parity schemes are also integrated into most of the control and data-flow logic on each chip. Typical workstation RISC processors rarely have anywhere near this level of error detection and correction.

The Cobra (A10) Processors

Like the A30 processors, the A10s implement the 64-bit extended PowerPC architecture. The A10 is also a superscalar design, to take advantage of instruction-level parallelism. Functionally, the two processor families execute the same application-level instruction set, but there are slight differences in the optional instructions that are implemented. For example, the A10 is intended for the middle and lower models of the AS/400, so instructions to support characteristics such as multiprocessing are not included.

Four versions of the Cobra processors have been built. The Cobra-0 was designed at Endicott and used only for testing. Another version was designed in Rochester for the Advanced 36 announced in 1994. This version wasn't announced as a PowerPC processor because some of the required instructions were left out (we called this processor Cobra-Lite).

The A10 processors announced in the new RISC AS/400 systems are the Cobra-CR (for "cost reduced") and the Cobra-4. The Cobra-CR runs only at 50 MHz and is used in the smallest AS/400s. The Cobra-4 runs at both 50 and 77 MHz and is used in the middle of the AS/400 model range.

The design objective for all the A10 processors was to integrate the processor and the memory interface on a single chip (the I/O bus interface is on a separate chip). To accomplish the integration, the Cobra family uses a CMOS technology called CMOS-4S. The result is an implementation that dissipates less heat than Muskie and so can be used in a smaller package. For this reason, the Cobra processors are the only ones packaged in the original boxes announced for the Advanced Series in 1994.

The initial A10 processors have clock rates of 50 and 77 MHz, but the design is capable of higher speeds. At 77 MHz, an A10 runs at 231 MIPS. To sustain this rate, it has a 4 K (on-chip) instruction cache and an 8 K (on-chip) data cache, both backed up by a 1 MB off-chip cache.
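The clock and MIPS figures quoted for the A30 and the A10 can be checked with simple arithmetic. The sketch below uses hypothetical helper names and assumes that the quoted MIPS numbers are peak rates, meaning the clock rate multiplied by the number of instructions started per cycle. The figure of three instructions per cycle for the A10 is inferred from 231 / 77 and is not stated directly in the text.

```python
def clock_mhz(cycle_time_ns):
    """Convert a cycle time in nanoseconds to a clock rate in MHz."""
    return 1000.0 / cycle_time_ns

def peak_mips(mhz, instructions_per_cycle):
    """Peak millions of instructions per second, assuming every cycle is full."""
    return mhz * instructions_per_cycle

a30_slow_mhz = clock_mhz(8.0)    # 8 ns cycle time
a30_fast_mhz = clock_mhz(6.5)    # 6.5 ns cycle time
a30_fast_mips = peak_mips(a30_fast_mhz, 4)  # four instructions per cycle
a10_mips = peak_mips(77, 3)      # three per cycle is an inferred assumption

print(a30_slow_mhz, round(a30_fast_mhz), round(a30_fast_mips), a10_mips)
```

Under these assumptions the numbers line up with the article: 8 ns gives 125 MHz, 6.5 ns gives about 154 MHz, four instructions per cycle at that rate gives a little over 600 MIPS, and 77 MHz at three per cycle gives 231 MIPS.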

Technology for the Future

The IBM/Apple/Motorola alliance intends to have the best-selling RISC processor in the industry. So far, it is succeeding. PowerPC processors are selling at more than 10 times the rate of their nearest RISC competitor. As 64-bit "industrial-strength" RISC processors capable of delivering the function and performance demanded by commercial systems and servers, the A10 and A30 are the first members of a family of RISC processors that will carry the AS/400 into the next century.

In a clear attempt to battle the PowerPC threat, Intel and HP formed their own alliance in 1994 to converge their lines in a new RISC processor. Their product will be available near the end of the decade, but some industry analysts have predicted that that will be too late. By then, the PowerPC will be the dominant RISC processor in the industry, which is just fine with IBM, Apple, and Motorola.

This article was excerpted from Inside the AS/400, by Frank G. Soltis, which is due out in September 1995 from Duke Press.

Frank Soltis of IBM Rochester conceived the technology for the AS/400's independent architecture and has been a central figure in the development of the S/38 and the AS/400 for more than 25 years. He holds a PhD in electrical engineering and is also a professor of computer engineering at the University of Minnesota.

Sidebar: Pipelining

The single most important invention that enabled greatly improved performance in processors was pipelining, an implementation technique that breaks the execution of a single instruction into smaller steps, each taking a fraction of the time needed to complete the entire instruction. Each step, called a pipe stage, performs a specific function in an instruction's execution. The stages are connected one to the next to form a pipeline, and each instruction must traverse the entire pipeline to be completely executed. Figure A is a diagram of a five-stage instruction pipeline.

Instructions enter one end of the pipeline, are processed through the stages, and exit the other end. A pipeline improves performance because multiple instructions can be in the pipeline at different stages simultaneously. In a single processor cycle, a new instruction enters the pipeline and another exits the pipeline.

Pipelined machines achieve their maximum performance when the pipeline is full, because an instruction is completed with every cycle. But if an instruction uses data stored by another instruction just ahead of it in the pipeline, the data may not be available in time for the following instruction, causing a stall in the pipeline and reducing performance.
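A rough cycle count shows why a full pipeline matters. The sketch below is a simplified model with hypothetical function names: it assumes every stage takes exactly one cycle and treats stalls as a lump sum of extra cycles, which is not how any particular processor accounts for them.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages, stall_cycles=0):
    """The first instruction fills the pipeline; after that, one instruction
    completes per cycle, plus any cycles lost to stalls."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles

# 100 instructions through a five-stage pipeline.
baseline = unpipelined_cycles(100, 5)
ideal = pipelined_cycles(100, 5)
with_stalls = pipelined_cycles(100, 5, stall_cycles=20)  # 20 is an arbitrary example
print(baseline, ideal, with_stalls)
```

With no stalls, the five-stage pipeline approaches one instruction per cycle, nearly five times the unpipelined rate; every stall cycle adds directly to the total.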

In the CDC 6600, Seymour Cray introduced hardware that enabled the processor to look at instructions later in the instruction stream and determine whether they could be started before an instruction that had to await a result. Allowing the hardware to rearrange instructions in the pipeline greatly improved the performance.

Another idea used by the supercomputers of the 1960s was branch prediction. A branch instruction can stall the pipeline until the system can determine the next instruction to use. The idea of branch prediction was to guess, based on experience, where the next instruction would come from. Sophisticated hardware could do branch prediction for scientific applications with remarkably good results.

All the specialized hardware to optimize pipeline performance added to the complexity and hardware costs of these systems. Although not a problem for a cost-is-no-object supercomputer, it prevented use of these techniques in ordinary systems.

— F.G.S.

Sidebar: The IMP in the AS/400

In the mid-1970s, I gave the name Internal Microprogrammed Interface, or IMPI, to the S/38's internal interface, assuming it would be changed before the system was announced. It wasn't, and the name has caused problems ever since.

People refer to the "IMPI interface," but of course the word "interface" is redundant. Someone once decided the last "I" in IMPI should stand for "instruction," but that didn't work either, because we talked about "IMPI instructions." Finally, someone solved the redundancy problem by dropping the last "I." IMP suddenly had no meaning and conjured up visions of a small mischievous demon in the system. That name didn't last long. The move to PowerPC is finally solving our internal-interface naming problem.

— F.G.S.

Sidebar: The Name Game

Our new AS/400 processor needed a new name. Because it was to be an extension of PowerPC, some proposed "PowerPC Plus," which we didn't like because some or all of the extensions could someday become part of the general PowerPC architecture. Besides, no two of the PowerPC processors have exactly the same instruction set — each implements a different set of the optional instructions. We decided to name them PowerPC AS and identify them with new numbers. The other PowerPC processors were all members of the "6xx" series, so we called ours the "Axx" series.

— F.G.S.