Following our look at AMD’s 2010 platform plans, this second installment of a four-part series digesting AMD’s 2009 Financial Analyst Day takes a look at the company’s new architectures on track for 2011.
For the last three years AMD has moved along with the basic K10 architecture with varying degrees of success. Where the architecture’s debut, the Phenom, can largely be considered a critical failure, the Phenom II has remained competitive in the face of two new Intel architectures and two die shrinks. That’s not bad for a company with a fraction of the investment capital. But the time is quickly coming when AMD will need to step up its game and fire a new round in the ongoing architecture war. That process begins at the end of 2010 when the firm starts sampling three new chip designs known as Bulldozer, Bobcat and Llano.
Bulldozer
For enthusiasts and servers, AMD plans to lead the way with the “Bulldozer” architecture. According to the company’s roadmaps, the first desktop CPU scheduled to launch with the new architecture is codenamed Zambezi, and it will offer four, six, or eight cores built on the company’s new 32nm process. The Bulldozer family should also see the introduction of the six- or eight-core Valencia and the 12- or 16-core Interlagos server parts.
Chips based on Bulldozer will scale across a number of what AMD calls “modules,” the architecture’s basic building block, each of which AMD counts as two cores. The slide above offers a high-level illustration of what a single Bulldozer module looks like, and it’s a pretty fascinating piece of engineering.
At a basic level, a Bulldozer module looks very similar to a single-core processor with simultaneous multi-threading (SMT), a technology which Intel famously introduced with the Pentium 4 and which lives on in today’s Core i7 as Hyper-Threading.
Intel’s implementation of SMT duplicates the architectural state (the part of a CPU which holds the condition of a process) but not the execution engine. This allows its processors to maximize execution resources by busying silicon that would otherwise lie idle, or by injecting threads into the pipeline in the event of a stall. In effect, Intel uses SMT to ensure that its processor is always busy crunching data, and today’s operating systems are increasingly intelligent at dispatching threads for this setup.
The “problem” with this implementation of SMT (one execution resource, duplicate registers) is that one instruction window tracks the dispatch, execution and retirement of both threads. It’s the processor equivalent of juggling—one of the balls is eventually going to drop.
Bulldozer puts SMT on steroids and offers a dedicated instruction window to each of the two threads issued to the module by the OS. The above diagram illustrates this perfectly: a shared front end (fetch/decode) can receive and dispatch two threads to a pair of independent integer schedulers.
The integer schedulers each have their own set of execution pipelines. The general consensus is that these pipelines are evenly split between store and ALU operations, but this is not confirmed, and such a split would actually reduce single-thread performance relative to the Phenom II. Each set of execution resources writes to its own dedicated chunk of L1 data cache, whereas today’s SMT-enabled cores write the results of both threads to the same L1 data cache.
AMD is calling each execution resource a “core,” making a single Bulldozer module a “dual core” chip; that convention is not strictly accurate, but it’s very close. Concurrent thread throughput on Bulldozer should be within arm’s reach of a dual-core CPU that does not use SMT.
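To make the contrast between one shared instruction window and a window per thread more concrete, here is a deliberately crude toy model in Python. Everything in it is an assumption made for illustration: the function name `run`, the window size of eight entries, the one-instruction-per-thread-per-cycle dispatch, and the latencies are all invented, and real hardware from either company manages these resources in far more sophisticated ways. The only point is to show how a stalled thread can crowd a second thread out of a shared window, while a dedicated window keeps the second thread moving.

```python
from collections import deque

def run(threads, capacity, shared):
    """Toy model: return the cycle on which each thread finishes.

    threads  : list of lists of instruction latencies, in cycles
    capacity : total number of instruction window entries
    shared   : True  -> one window of `capacity` entries for all threads
               False -> each thread gets capacity // len(threads) entries
    """
    pending = [deque(t) for t in threads]    # instructions not yet dispatched
    in_flight = [deque() for _ in threads]   # completion cycles, program order
    finish = [None] * len(threads)
    per_thread = capacity // len(threads)
    cycle = 0
    while any(f is None for f in finish):
        cycle += 1
        for i in range(len(threads)):
            # In-order retirement: only the oldest instruction may leave.
            while in_flight[i] and in_flight[i][0] <= cycle:
                in_flight[i].popleft()
            if finish[i] is None and not pending[i] and not in_flight[i]:
                finish[i] = cycle
        for i in range(len(threads)):
            # Occupancy is counted across all threads only in the shared case.
            used = sum(len(q) for q in in_flight) if shared else len(in_flight[i])
            limit = capacity if shared else per_thread
            if pending[i] and used < limit:
                in_flight[i].append(cycle + pending[i].popleft())
    return finish

# Hypothetical workload: thread 0 opens with a 100-cycle stall (think of a
# cache miss), thread 1 is a steady stream of 1-cycle instructions.
stalled = [100] + [1] * 20
steady = [1] * 20

shared_result = run([stalled, steady], capacity=8, shared=True)
dedicated_result = run([stalled, steady], capacity=8, shared=False)
print("shared window:    ", shared_result)
print("dedicated windows:", dedicated_result)
```

In this model the steady thread finishes far later with a shared window, because instructions queued behind the stalled one occupy entries it needs. With its own window it is unaffected by its neighbor's stall.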
Both execution blocks share an FPU/SIMD unit (“FP Scheduler”) with 128-bit FMAC support. Fused multiply-accumulate (FMAC) computes a multiply and an add with a single rounding step, which improves floating point precision, and it should give the chip a leg up on the Phenom II, which cannot perform fused multiply-add operations.
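The precision benefit of a fused operation comes from rounding once instead of twice. The short Python sketch below emulates that behavior in software with exact rational arithmetic; it says nothing about how Bulldozer implements FMAC in silicon, and the helper names `fused_multiply_add` and `separate_multiply_add` are made up for this illustration. The inputs are chosen so that the intermediate product cannot be represented exactly as a double.

```python
from fractions import Fraction

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a fused a*b + c: compute exactly, then round a single time."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the only rounding step

def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded, then the sum is rounded."""
    return a * b + c

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused path loses the small term before the add happens.
print("separate:", separate_multiply_add(a, b, c))
print("fused:   ", fused_multiply_add(a, b, c))
```

The unfused path returns zero here, while the emulated fused path keeps the tiny residual of -2^-60.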
A 128-bit FPU is also a natural choice as AMD has announced SSE5 for the Bulldozer, an instruction extension which has several 128-bit multimedia and 3-operand instructions. Bulldozer’s ability to crunch these instructions in a single cycle continues to be a source of some debate; one camp says the pipes are limited to 2×64 and 4×32, while others claim 1×128 is on tap as well. Fusing the 128-bit FPUs should allow the chip to crunch 256-bit Intel AVX instructions in a minimum of one cycle (if 1×128 is possible), or two cycles if the FPU’s pipes top out at 2×64.
Rounding out the picture, Bulldozer carries forward the Phenom II’s cache hierarchy, with all the pipelines feeding into shared pools of L2 and L3 cache.
Now, to bring it all back together: Zambezi will contain up to four of these “dual core” modules for a total of up to eight cores. Server parts will combine six or eight of them for a total of 12 or 16 cores. That’s up to eight dispatch units putting up to 16 fully independent threads in flight. This kind of thread concurrency simply does not exist on the desktop at this time.
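The module-to-core arithmetic is simple enough to write down. The snippet below is only a restatement of the counts quoted above under AMD's two-cores-per-module convention; the labels and the `cores` helper are our own shorthand, not AMD product names.

```python
CORES_PER_MODULE = 2  # AMD counts each integer execution block as a "core"

def cores(modules: int) -> int:
    """Cores (and independent threads) for a given number of modules."""
    return modules * CORES_PER_MODULE

# Module counts as quoted in the article; labels are informal.
configurations = {
    "desktop, four modules": 4,
    "server, six modules": 6,
    "server, eight modules": 8,
}

for label, modules in configurations.items():
    print(f"{label}: {cores(modules)} cores")
```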
As a bit of a reality check, however, there are some things we just don’t know:
- The instructions per cycle of the dispatch units;
- The depth of the processing pipelines;
- The configuration of the ambiguously-illustrated FPU;
- The pipeline configuration of the integer units;
- and how exactly AMD shares the FPU/SIMD scheduler with two instruction windows.
Until this information turns up, we just can’t know how Bulldozer will stack up against the Phenom II or Intel’s Sandy Bridge architecture, which arrives in the same time frame. In the interim, we can only admire the genuinely different architecture and speculate over the diagram’s many ambiguities.