What we know about AMD’s next-generation processors

You might be surprised to learn that AMD is just seven months away from releasing new CPUs based on not one, but three, new designs. The Phenom II that we have known for the past 17 months will soon be put out to pasture, never to be seen again. Its replacements are built for the server, the desktop, the notebook and the netbook.

Dubbed Bulldozer, Bobcat and Llano, the new processor designs are the final piece of AMD’s grand strategy to emerge from years of debt and struggle as a leaner, meaner company. For enthusiasts, they are something altogether more important: a clear sign that the fascinating war between AMD and Intel is about to go nuclear once again.

Bulldozer: the chip for enthusiasts

A block diagram of a single Bulldozer module, or core.

Chips based on Bulldozer will be scalable across any number of what AMD calls “modules” (shown above), each of which contains two CPU cores. It is postulated that each module is equipped with a technology called Cluster-Based Multi-Threading, or CMT.

To understand CMT, we must first have an understanding of its lesser sibling, Simultaneous Multi-Threading (SMT), which you are likely to know by Intel’s name: Hyper-Threading. Though Intel did not create the technology, its implementation is by far the most famous.

Intel’s implementation of SMT duplicates architectural states (the part of a CPU which holds the condition of a process) but not the execution engine. This allows its processors to maximize execution resources by busying silicon that would otherwise lie idle, or by injecting threads into the pipeline in the event of a stall.

To give a real-world analogy, Intel’s implementation of SMT is similar to an automobile assembly plant with only one assembly line capable of taking a car from parts to completion. At every stage of the assembly, however, workers are standing by with completed parts to keep the line moving if there’s a problem. The workers can’t build a car (they don’t have a line), but they can make sure that line is always moving the car on to the next step without issue.

Intel uses SMT in the same way: to ensure that the processor’s line is always busy moving to the next step, and today’s operating systems are increasingly intelligent at dispatching threads for this setup.

The “problem” with this implementation of SMT is that one instruction window tracks the dispatch, execution and retirement of both threads. Going back to the assembly line, it would be like putting one supervisor in charge of watching the line and the workers—that supervisor can’t watch for problems with the line and the workers at the same time. Something is bound to fail. On a CPU, as in an assembly line, failures lead to a reduction in apparent performance.
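To make the idea concrete, here is a minimal toy model of one execution unit shared by two threads. It is only a sketch of the general SMT principle described above, not a model of Intel's hardware: the thread strings, the `W`/`-` notation and both function names are our own assumptions for illustration.

```python
def run_back_to_back(threads):
    """No SMT: threads run one after another, so every stall idles the unit."""
    busy = sum(t.count("W") for t in threads)
    cycles = sum(len(t) for t in threads)
    return busy, cycles


def run_smt(thread_a, thread_b):
    """Toy SMT: two threads share one execution unit.

    'W' is work that needs the unit for one cycle; '-' is a stall that simply
    elapses. At most one 'W' issues per cycle, and the two threads take turns
    being offered the unit first.
    """
    threads = [thread_a, thread_b]
    pos = [0, 0]
    busy = cycles = 0
    first = 0
    while pos[0] < len(thread_a) or pos[1] < len(thread_b):
        cycles += 1
        unit_taken = False
        for i in (first, 1 - first):
            if pos[i] >= len(threads[i]):
                continue                      # this thread has finished
            if threads[i][pos[i]] == "-":
                pos[i] += 1                   # a stall passes on its own
            elif not unit_taken:
                pos[i] += 1                   # issue one unit of work
                unit_taken = True
                busy += 1
            # otherwise the thread is ready but must wait for the unit
        first = 1 - first
    return busy, cycles


a = "WW--WW--"   # hypothetical thread: work, stall, work, stall
b = "--WW--WW"   # hypothetical thread that stalls when the other works
print(run_back_to_back([a, b]))  # (busy cycles, total cycles)
print(run_smt(a, b))
```

With these two made-up threads, the unit is busy for 8 of 16 cycles when they run back to back, and for 8 of 8 cycles in the toy SMT case, because one thread's stalls are filled by the other thread's work. Real workloads rarely interleave this neatly, which is where the shared bookkeeping described above starts to matter.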

Each Bulldozer module, meanwhile, puts the plant on steroids not only by adding a second fully-functional assembly line, but by giving each line the ability to break one big stage down into several, parallel stages—little assembly lines that can be created, run, merged and closed on demand without sacrificing the efficiency of the main assembly line. This is CMT, and the Bulldozer can do it.

CMT is more efficient and performs more consistently than Hyper-Threading.

When a processor is done sending calculations through the pipeline, it stores that data in cache for programs to access (L1 DCache in the diagram below). In essence, these are the completed cars sitting in the parking lot waiting for transport. Intel processors have one parking lot that may contain a mix of cars and trucks, which reduces efficiency when a shipping company arrives to grab a shipment made exclusively of trucks. The Bulldozer plant has two parking lots, which gives that plant more flexibility to be efficient with storing and shipping.

From end to end, the entire Bulldozer plant can do more, and do it more intelligently than the plants AMD and Intel run today.

AMD Bulldozer

Going back to raw architecture, both of Bulldozer’s lines share a single floating point scheduler (cordoned in red), with two 128-bit FMAC pipelines. Fused multiply-accumulate (FMAC) gives the chip improved floating point precision, which grants Bulldozer a leg up on the Phenom II when it comes to calculating big equations more accurately and efficiently. And, when you realize that everything you do on a computer is a mathematical equation, you can see why this is important.
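The precision benefit comes from rounding once instead of twice. The sketch below emulates a fused multiply-add in plain Python by computing the exact result with `fractions.Fraction` and rounding a single time; it illustrates the arithmetic only and says nothing about how Bulldozer's FMAC pipes are built. The function names and input values are our own.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """a*b is rounded to a double, then the sum is rounded a second time."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so the exact product, 1 - 2**-60, does not fit in a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(unfused_multiply_add(a, b, c))  # the product rounds to 1.0 first
print(fused_multiply_add(a, b, c))    # keeps the tiny residue -2**-60
```

The unfused version returns 0.0 because the intermediate product is rounded to exactly 1.0 before the addition; the fused version preserves the small negative remainder.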

A 128-bit floating point pipe is also a natural choice as AMD has announced SSE5 for the Bulldozer, an instruction extension that has several 128-bit multimedia instructions. Fusing the 128-bit FPUs will also allow the chip to crunch 256-bit Intel AVX instructions in just one cycle. SSE5 and AVX alone will take these processors to a whole new level of performance when it comes to multimedia, encryption and scientific research.
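The width arithmetic behind that claim can be sketched in a few lines. The pipe and register widths come from the paragraph above; the helper names are ours, and how AMD actually schedules a 256-bit operation across the two pipes is not something the diagram tells us.

```python
FMAC_PIPE_BITS = 128     # width of one FMAC pipe (from the article)
PIPES_PER_MODULE = 2     # two pipes behind the shared FP scheduler
AVX_BITS = 256           # width of an AVX operation


def lanes(register_bits, element_bits):
    """How many values of a given size fit in one register."""
    return register_bits // element_bits


def pipes_needed(op_bits, pipe_bits=FMAC_PIPE_BITS):
    """Pipes that must be ganged together for one operation (rounded up)."""
    return -(-op_bits // pipe_bits)


print(lanes(FMAC_PIPE_BITS, 32))   # single-precision values per 128-bit pipe
print(lanes(AVX_BITS, 32))         # single-precision values per AVX operation
print(pipes_needed(AVX_BITS))      # pipes tied up by one 256-bit operation
```

One 256-bit operation occupies both pipes in a module, so under this simple reading the two cores in a module cannot each issue a 256-bit operation in the same cycle.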

Finally, the Bulldozer brings forward the Phenom II’s cache hierarchy by dumping all the pipelines into shared pools of L2 and L3 cache. These shared L2 and L3 caches give either core on a Bulldozer module access to completed calculations that can be pulled back in to speed up a new task. This is standard for today’s processors.

Your future Bulldozer CPU

The first enthusiast CPU to employ the Bulldozer design is currently codenamed Zambezi, and it will contain four of these dual core modules for a total of eight cores. We also know for a fact that Zambezi will use socket AM3, meaning anyone with a DDR3 Phenom II motherboard will be ready to rock with a BIOS upgrade.
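For readers who like to see the counting spelled out, here is a small sketch of the module layout as described so far. The class and field names are hypothetical; the numbers are the ones given in this article.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BulldozerModule:
    integer_cores: int = 2   # two cores per module
    fp_schedulers: int = 1   # one floating point scheduler shared by both
    fmac_pipes: int = 2      # two 128-bit FMAC pipelines


@dataclass(frozen=True)
class Chip:
    name: str
    modules: int
    module: BulldozerModule = field(default_factory=BulldozerModule)

    @property
    def cores(self):
        return self.modules * self.module.integer_cores

    @property
    def fp_schedulers(self):
        return self.modules * self.module.fp_schedulers

    @property
    def fmac_pipes(self):
        return self.modules * self.module.fmac_pipes


zambezi = Chip("Zambezi", modules=4)
print(zambezi.cores, zambezi.fp_schedulers, zambezi.fmac_pipes)
```

Four modules give eight integer cores but only four floating point schedulers, which is why the question of how the FPU is shared matters so much for performance.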

What about performance?

Unfortunately, there are some elements of the Bulldozer design that we just don’t understand yet, including:

  • How many cars the supervisor can send down the line at a time;
  • How many stages it takes to complete a car;
  • How AMD has configured the floating point unit (FPU) to run the numbers;
  • and how exactly AMD shares the single FPU amongst two independent assembly lines.

Until this information tips up, we just can’t know how Bulldozer will compare to today’s processors. In the interim, we can only admire the genuinely different architecture and speculate over the diagram’s many ambiguities.

Bobcat: the chip for netbooks

Next on the launch deck is AMD’s “Bobcat” architecture, a design aimed squarely at the class of products that ship today with CPUs like the Athlon Neo or the Intel Atom.

According to the company’s roadmaps, the first chip to launch with Bobcat architecture will be the 32nm Ontario APU, which combines two Bobcat modules and a rudimentary DirectX 11 chip on the same processor.

AMD Bobcat architecture

Each Bobcat module is a single core design, with one supervisor (int scheduler) and one assembly line, which consists of the I-Pipes, Ld-Pipe and St-pipe in the diagram above. These can be considered specialized workers—electricians versus mechanics, for example—that perform unique tasks on the car while it is rolling down the line. You’ll note that Bulldozer, too, had four pipelines per int scheduler, but we just don’t know what kind of workers they are yet.

The Bobcat’s integer pipe is paired with a dual-pipe FPU, ambiguously titled “A-Pipe” and “M-Pipe” in this diagram. We postulate that the “A” and “M” refer to the addition and multiplication/division floating point operations, respectively. The size of these pipelines—the number of bits they can calculate at a time—will not only determine what this processor is strongest at, but its complexity, and how it consumes power.

On the topic of power, AMD claims that Bobcat is capable of radiating less than 1 watt of heat, which could mean something around 0.5W. A chip at that wattage isn’t doing much more than sitting around on standby, but it’s a healthy number for users looking for laptop designs with a long standby life. In practice, Bobcat’s actual TDP should be around 5-10W, which is perfect for netbook-sized laptops.

On the point of performance, AMD says it’ll weigh in at “90% of today’s mainstream performance” at less than half of the die size. If AMD’s definition of mainstream is the Athlon II—an assumption that bears out in their platform roadmaps—then Bobcat is essentially an Athlon II in a (much) smaller, cooler and quieter package. Not bad.
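Taken at face value, AMD's two figures imply a lower bound on performance per unit of die area. This is back-of-the-envelope arithmetic on the quoted claim, not a measurement, and the variable names are ours.

```python
relative_performance = 0.90   # "90% of today's mainstream performance"
relative_die_area = 0.50      # upper bound, since AMD says "less than half"

# Because the real die area is below 0.50, the real ratio is at least this.
min_perf_per_area = relative_performance / relative_die_area
print(f"at least {min_perf_per_area:.1f}x the performance per unit of die area")
```

In other words, if the claim holds, Bobcat delivers at least 1.8 times the performance per unit of silicon of whatever AMD is counting as mainstream.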

Bobcat’s most remarkable feature is not its architecture, however, but its design process. AMD has designed the Bobcat via high-level synthesis, or HLS. HLS is a process by which a chip’s design begins its life as a set of behaviors coded by a programmer in C++. The code is then interpreted and synthesized by a machine that manufactures a processor that exhibits the behavior written by the programmer.

HLS is a fascinating way to rapidly design and produce a chip that can easily be modified or ported to other processes for outstanding flexibility in the market. The trade-off for this agility is frequency: Bobcat’s maximum clock speed with an HLS-driven design is about 20% lower than it could have been were it designed “by hand.”

All things considered, Bobcat will assuredly be faster than any ultra low-voltage chip in the market today; it will handily eclipse the Nano, the Atom and the Athlon Neo, by orders of magnitude on some metrics. Additionally, AMD’s decision to roll with HLS gives the firm the ability to respond to market conditions in ways its competitors simply cannot with current processes.

Fusion: the chip for notebooks and budget desktops

AMD’s acquisition of ATI Technologies was completed on October 26, 2006 and was accompanied by an official, and very important statement:

AMD plans to create a new class of x86 processor that integrates the central processing unit (CPU) and graphics processing unit (GPU) at the silicon level with a broad set of design initiatives collectively codenamed “Fusion.”

In other words, AMD announced that it would soon put a GPU and a CPU on a single chip. AMD calls these chips Accelerated Processing Units, or APUs. If you’re familiar with the CPU market, the APU might not be new to you: some of Intel’s Core i5 processors have a GPU onboard. Yes, Intel beat AMD to the punch, and it was almost a direct result of AMD’s financial hardship.

Despite yielding the first design wins to its chief rival, there is a silver lining for AMD’s APU initiative: even AMD’s slowest modern GPU bloody annihilates anything Intel has to offer. This includes the GPUs AMD plans to stick inside its processors, starting next year with Llano.

Llano

The Llano CPU is AMD’s first processor scheduled to adopt the Fusion APU design. Based on the die shots provided earlier this year, the chip strongly resembles an Athlon II X4 that has been shrunk from 45nm to 32nm to accommodate an onboard GPU.

This would make perfect sense given that Llano and Propus are both oriented for the mainstream. Marrying existing technologies manufactured at a smaller size is much easier than starting over with a brand new architecture when none is needed.

An uncanny resemblance: Propus (Left) and Llano (Right)

It is certainly worth noting that the above x-ray of the Llano is not complete; the bottom section of the chip has been cut off in press materials, meaning there’s even more silicon at play than we can see at this time.

However, judging from what we can see, the Llano APU will feature 512KB to 1MB of L2 cache per core, no L3 cache and six Radeon HD 5000-series units for a total of 480 stream processors.

In short, Llano is shaping up to be an Athlon II X4 with 66% of a Radeon HD 5750 on board. If that bears out, then it is more than capable of slugging Intel’s Clarkdale and Arrandale (Core i5) designs into the pavement without lifting much more than a few fingers.
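The 66% figure follows from simple division. The unit and stream processor counts are our reading of the die shot; the Radeon HD 5750's 720 stream processors is its published specification rather than something visible in the image.

```python
LLANO_GPU_UNITS = 6              # units counted in the die shot
LLANO_STREAM_PROCESSORS = 480    # total estimated above
HD5750_STREAM_PROCESSORS = 720   # published spec of the discrete card

per_unit = LLANO_STREAM_PROCESSORS // LLANO_GPU_UNITS
share = LLANO_STREAM_PROCESSORS / HD5750_STREAM_PROCESSORS

print(per_unit)          # stream processors per unit
print(f"{share:.1%}")    # Llano's share of a 5750's stream processors
```

That works out to 80 stream processors per unit and two thirds of a 5750's shader count. It says nothing about clock speed or memory bandwidth, both of which will also shape real-world performance.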

Recap

Before we head into our final thoughts, let’s take a moment to quickly summarize all the architectures that have been tossed around in this article.

Zambezi

Family: Bulldozer
Cores: 4 to 8
Process: 32nm
Socket: AM3
Onboard GPU?: No
Platform: Scorpius
Role: Performance Desktop
Launch date: Late 2010

Ontario

Family: Bobcat
Cores: 2-4
Process: 32nm
Socket: N/A
Onboard GPU?: Yes
Platform: Brazos
Role: Ultra Thins, Netbooks
Launch date: 2011

Llano

Family: Stars (Athlon II)
Cores: 4
Process: 32nm
Socket: N/A (AM3 rumored)
Onboard GPU?: Yes
Platform: Sabine (notebook), Lynx (desktop)
Role: Mainstream notebook, mainstream desktop
Launch date: 2011

Final thoughts

AMD has been saying that “the future is Fusion” for years, and the company is just now in a place with its capital and processes to realize that future. By 2011, AMD will completely revamp their desktop, laptop and netbook offerings with three innovative and purpose-built CPU designs, all of which can be paired with on-die GPUs if the market demands it.

You read that right: Llano isn’t the only design that can support an onboard GPU. AMD can pair Bulldozer and Bobcat modules with a GPU, too.

Now, AMD’s first generation Fusion won’t have the performance to take on the discrete GPU market, but the groundwork is being laid. It will start with mainstream and low-voltage in laptops and netbooks, respectively. Economical desktop designs aren’t out of the question either, but there are signs that something much bigger is in the works.

For example, Bulldozer may not be an APU now, but its relatively small floating point unit speaks to a future architecture that cedes floating point operations entirely to the GPU, a component that crushes the CPU in floating point performance.

And indeed, in conversations with AMD, this is the paradigm they have been working to kickstart: a computing ecosystem that recognizes CPUs and GPUs alike as valid processors for a program. They envision a day when processing tasks are easily and automatically sent to the best processor for the job.

We are just beginning on that road, the one that blurs the line between the CPU and the video card, but AMD appears poised to make a confident first step. They have the resources, they have the engineers, and they have the drive. AMD is extremely passionate about where they’re going with their market strategy; talking to engineers and representatives at all levels of the company reveals an infectious enthusiasm that can’t be manufactured or faked.

Do not believe for a moment that competition between AMD and Intel has waned: 2011 will be more exciting than ever.

Correction (5/19/2010): Astute readers have noted that we erroneously attributed socket C32 to the Bobcat, whose true socket remains unknown at this time. The story has been updated to reflect more current information.

Comments

  1. ardichoke
    I think I just got a bit of a nerd boner.
  2. BuddyJ
    Yeah. Bulldozer holds the OMGWTF crown previously occupied by Nehalem for me. Can't wait to see how it pans out.
  3. Cliff_Forster
    All of this and AMD has really improved its financial positioning this year. A really tough run seems to be in the rear-view mirror. They have fought back against insurmountable odds, and we will all benefit from it. Whether you choose AMD or not, having them in the market is paramount to continued progress in what has become a really competitive chip market again.

    Not just looking forward, but just think about where we are at, six cores for about $200 and, they are making money on them.... Unbelievable. Moore's law is not dead yet.
  4. Komete
    I'm drooling over the Bulldozer writeup. It sounds downright beastly. It makes me really glad I went with a Phenom II and a DDR3 board. I almost didn't. I hope we see some leaked benchmarks in the coming months.
  5. Bandrik
    I think Thrax did the unthinkable and actually made my heart race over a processor preview. All three of these architectures are exciting. It means a wonderful performance processor for me in the coming months, a nice netbook option next year, and the mainstream will be getting some awesome options in the Best Buy-level markets so my friends on a budget can still get on a game like the rest of us with integrated graphics that are worth their salt.

    Thrax, dude, awesome writeup, and I LOVED the lay-man's analogy of a vehicle assembly line.
  6. this is dizzy stuff folks Very nice write up guys, well done. Im running a 955 on AM3 and was considering a 1090T upgrade at the end of the year but with zambezi launching the 1090T options gone out the window. So the obvious question is, how did you get the bulldozer release date? Is this going to be announced officialy? Everything else i've heard about it has zambezi launching well into next year. Cheers
  7. PastramiOnRy Your article states that bulldozer (zambezi) will release in late 2010 and Llano will release in 2011. All other reports say the opposite; zambezi will release in 2011 and Llano will release in late 2010. Are you sure what you wrote is accurate?
  8. BernardP This is the best article I have read so far about the upcoming AMD processors. The comparison with car assembly helps to understand the concepts underlying the architecture. However, AFAIK, Bulldozer is set for release in "2011", with no indication of a Quarter or even Half.
  9. aussiebear There is a reason why AMD's products haven't been exciting in 2010; they've dedicated the bulk of their resources to 2011 products! :)

    "We also know for a fact that Zambezi will use socket AM3, meaning anyone with a DDR3 Phenom II motherboard will be ready to rock with a BIOS upgrade."

    => It is said to be using Socket AM3 Revision 2 specification. (Socket AM3+ ???)...It depends on how the electricals are implemented by the motherboard manufacturer; a user of an existing Socket AM3 motherboard *is very likely* to be able to do a drop-in install with the new "Zambezi" processor and a BIOS upgrade. Regardless, it explains why AMD is re-using the same 2010 chipsets in 2011 roadmaps. ie: Will keep using the 8xx series chipsets for Bulldozer architecture.

    "It is certainly worth noting that the above x-ray of the Llano is not complete; the bottom section of the chip has been cut off in press materials, meaning there’s even more silicon at play than we can see at this time."

    => Correct. What AMD doesn't show is that in Llano; the whole northbridge is now part of the processor. The motherboard will only have the SouthBridge. In Llano's case, the SouthBridge is codenamed "Hudson-D"...It also means you are very likely needing a new mobo! :(

    "However, judging from what we can see, the Llano APU will feature 512k-1MB L2 cache per core, no L3 cache and six Radeon HD 5000-series units for a total of 480 stream processors."

    => It's definitely 1MB L2 cache per core. But I'm not sure about the number of stream processors. Some say 400; others say 480.

    "In short, Llano is shaping up to be an Athlon II X4 with 66% of a Radeon HD 5750 on board. If that bears out, then it is more than capable slugging Intel’s Clarkdale and Arrandale (Core i5) designs into the pavement without lifting much more than a few fingers."

    => It's going to be a slightly tweaked version of the Athlon II X2/X3/X4 line, and the graphics part is going to be somewhere between the Radeon HD 5500 and 5600 series. Regardless, I agree it's going to raise the bar for IGP performance quite dramatically.

    Other things you've missed about Llano:

    * It will feature power saving technologies as similarly found on Intel's Core i3/5/i7 lines. ie: Power gating, dynamic speed scaling, etc. (Its likely to have a "Turbo mode".)

    * Is said to use 0.8V – 1.3V voltage.

    * Has target clock-speeds over 3.0GHz.

    * TDP numbers: 20W to 59W depending on the version.
    => Lower power dual-core version: 20W
    => Mainstream dual-core version: 30W
    => High-end dual-core, triple-core and quad-core chips: 35W and 59W.

    "For example, Bulldozer may not be an APU now, but its relatively small floating point unit speaks to a future architecture that cedes floating point operations entirely to the GPU, a component that crushes the CPU in floating point performance."

    => Its going to go beyond that. In the 2nd generation Fusion processor (somewhere in 2015); AMD has plans to incorporate GPU elements into the CPU core itself! Meaning there won't be distinct GPU and CPU sections on the processor like we see with the x-ray of Llano.
    => See here: http://www.xbitlabs.com/news/cpu/display/20100512150105_Second_Iteration_of_AMD_Fusion_Chips_Due_in_2015_AMD.html

    Anyway, I've been saving up for 2011 releases since the beginning of this year!
  10. Thrax
    Hi, Aussiebear! Thanks for taking the time to stop by and leave your fantastic comment.

    Your points about Llano are all true, and we've reported on many of them previously, but this article was designed to be a simple primer on the processors so consumers know what to expect.

    As their release dates draw closer, we will undoubtedly plunge deeper into their technical merits as you have done here.

    Isn't it all exciting?
  11. Eddy Dear Aussiebar,

    Thank you for the excellent read! I've been following the developments on these for a while, and I'm not quite sure whether I agree with your car manufacturing analogy.

    If I'm not mistaken, Intel's hyperthreading basically uses a second thread to feed the execution units while the other thread has stalled or doesn't use the full width. So, to keep in line with your assembly line: There are several robotic arms that assemble a car, working together whenever possible. However, sometimes the parts supply has stalled - that's when Intel's second parts supply brings in different parts for a different car for the arms to work on. This way the arms are busy more of the time, resulting in a more efficient execution.

    What AMD does is simply arranging for a second set of robotic arms right next to the original set, so creating two execution sites with two parts supplies. Some of the arms can temporarily help out on the other line (the FPUs doing AVX, as you explained), and there are many other ways in which these two assembly lines work together.

    So in essence, one module is a highly optimised dual core, that can really boost multithreaded performance at not that much cost. I believe that's CMT. I've never seen anything about the breaking down of the "main" thread in little parallel ones, and I don't think that's what meant with CMT. Perhaps it's speculative multithreading what you're on about? That one isn't expected for the first iteration of Bulldozer just yet, but could well emerge in future versions. I could be mistaken in all of this of course. :)

    Some minor corrections I would like to make are the following:

    - The L2 will be devoted to a single module, while the L3 is still for all modules and cores. So this is actually different from the current way, be it slightly.
    - Orochi will be made on 40 nm bulk (so, not SOI like Zambezi and Llano), not 32. 32 nm Bulk actually got canceled by both TSMC and Global Foundries.
    - Zambezi is planned for somewhere in 2011, not 2010.

    It's absolutely great to read the guys at AMD are excited though, that's a very good sign! I certainly grant them a victory, especially after all the nasty bullying by Intel. Intel will of course be quite the competitor still I'm sure - I'm pretty sure they know about AMD's plans better than any of us, and must be cooking something up to counter them right now.
  12. mirage
    aussiebear wrote:
    "For example, Bulldozer may not be an APU now, but its relatively small floating point unit speaks to a future architecture that cedes floating point operations entirely to the GPU, a component that crushes the CPU in floating point performance."

    => Its going to go beyond that. In the 2nd generation Fusion processor (somewhere in 2015); AMD has plans to incorporate GPU elements into the CPU core itself! Meaning there won't be distinct GPU and CPU sections on the processor like we see with the x-ray of Llano.
    => See here: http://www.xbitlabs.com/news/cpu/display/20100512150105_Second_Iteration_of_AMD_Fusion_Chips_Due_in_2015_AMD.html

    I can't wait to see this happening. :thumbup
  13. drasnor
    Thrax wrote:
    HLS is a fascinating way to rapidly design and produce a chip that can easily be modified or ported to other processes for outstanding flexibility in the market. The trade off for this agility is frequency—Bobcat’s maximum clockspeed with an HLS-driven design is about 20% lower than it could have been were it designed “by hand.”
    That's a pretty major achievement. The HLS systems I've seen at school deliver at best 20% of the performance possible with a manually-routed design.
