Dissecting Fermi, NVIDIA’s next generation GPU

NVIDIA has just revealed the first details of its next-generation video card during a keynote address delivered by CEO Jen-Hsun Huang.

They call it “Fermi,” and it’s NVIDIA’s newest and most radical GPU architecture. Huang positioned the card as the symbol of the company’s effort to embrace GPGPU computing, in which the video card takes on CPU-like tasks.

An x-ray of the Fermi GPU die.

The top down

We’ll begin by looking at NVIDIA’s simplified block diagram of their new architecture. These types of pictures don’t say much on the surface, but a little interpretation goes a long way.

Fermi in two dimensions.

As indicated by the firm’s whitepapers, the tall green structures sandwiched in blue are the GPU’s streaming multiprocessors. Each one of these SMs contains 32 CUDA cores which are reflected in the green rectangles. Each CUDA core is itself a mini CPU which can crunch floating point and integer numbers just like a desktop CPU can. We’ll talk more about these in a moment, as the SMs are Fermi’s true heart.
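To make that concrete, here’s a minimal CUDA sketch of our own (not NVIDIA’s code): every thread of this kernel lands on a CUDA core, which churns through the same mix of integer index math and floating-point arithmetic a desktop CPU would.

```cuda
#include <cuda_runtime.h>

// Each launched thread executes this function on one CUDA core:
// an integer index calculation followed by floating-point math.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // integer ALU work
    if (i < n)
        y[i] = a * x[i] + y[i];                     // floating-point work
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));  // contents left uninitialized;
    cudaMalloc(&y, n * sizeof(float));  // this sketch only shows the launch

    // 256 threads per block; the hardware spreads the blocks across the SMs.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```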

Flanking the SMs, you’ll notice six squares allotted for DRAM. These are the interconnects for the memory bus which allow the GPU to talk to the graphics card’s GDDR5 memory. Each block is a 64-bit memory interface, making for a 384-bit total memory interface. That’s narrower than the 512-bit bus offered on GPUs like the GTX 285, but it’s made up for by the fact that a 384-bit GDDR5 bus delivers roughly the same bandwidth as a 768-bit bus of older GDDR3. All told, Fermi’s memory bandwidth is significantly higher.
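As a rough sanity check on that claim, peak bandwidth is just bus width times per-pin data rate, and GDDR5 moves roughly twice the data per pin that older GDDR3 does. Using the 4.8Gbps figure cited later in this article, and assuming a typical ~2.4Gbps for GDDR3, the numbers work out like this:

```latex
\begin{align*}
\text{384-bit GDDR5:} &\quad \tfrac{384}{8} \times 4.8\,\mathrm{Gbps} \approx 230\,\mathrm{GB/s} \\
\text{768-bit GDDR3 at}~2.4\,\mathrm{Gbps:} &\quad \tfrac{768}{8} \times 2.4\,\mathrm{Gbps} \approx 230\,\mathrm{GB/s}
\end{align*}
```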

Sandwiched between a memory interface and a host interface is NVIDIA’s cutely-named “GigaThread” scheduler. On Fermi, 32 processing threads are bundled into “warps” in NVIDIA parlance. The GigaThread scheduler hands warps off to the streaming multiprocessors which do the work of sorting them out amongst themselves. In essence, GigaThread is a traffic cop at the intersection, while cores in the SMs do the actual lane shuffling.
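If you’re curious how that grouping looks from the software side, here’s a tiny device-side sketch of ours; CUDA exposes the warp width as the built-in warpSize constant (32 on this hardware):

```cuda
// Threads are grouped into warps of 32 consecutive threads within a block.
__global__ void show_warps(int *warp_of_thread)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // which warp within this block
    warp_of_thread[tid] = warp;          // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
}
```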

The GigaThread engine can also direct “microkernels,” or small applications, to work in parallel on each SM. Rather than leaving hardware idle by executing them one after another, GigaThread reorders them to pack efficiently into Fermi’s pipeline. The result is that processor cycles aren’t wasted on tasks that don’t fill an SM’s full processing capacity.
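Concurrent kernel execution is exposed to programmers through CUDA streams. Below is a hedged sketch, with two small kernels of our own invention, of how work could be queued so the hardware is free to pack both kernels onto the GPU at the same time:

```cuda
#include <cuda_runtime.h>

__global__ void small_kernel_a(float *d) { d[threadIdx.x] *= 2.0f; }
__global__ void small_kernel_b(float *d) { d[threadIdx.x] += 1.0f; }

int main()
{
    float *a, *b;
    cudaMalloc(&a, 256 * sizeof(float));
    cudaMalloc(&b, 256 * sizeof(float));

    // Work launched into different streams may run concurrently
    // when neither kernel fills the whole GPU on its own.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    small_kernel_a<<<1, 256, 0, s1>>>(a);  // a single block's worth of work
    small_kernel_b<<<1, 256, 0, s2>>>(b);  // free to overlap with the kernel above

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```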

GigaThread reorganizes program kernels for more efficient computing.

The last significant architectural element is a shared 768KB L2 cache. The technique is similar to the shared L3 cache used on multi-core CPUs. It gives each SM the ability to draw on data cached for other SMs, data that would otherwise sit idle if each SM had its own exclusive L2 cache. This model not only keeps information flowing, but amplifies effective bandwidth when multiple SMs are calling on the same body of data.

Fermi’s streaming multiprocessors

Just like a real CPU, each Fermi SM has an L1 cache, cores, a register file, and some silicon to oversee computation. Now, calling them “cores” is probably overly generous, but we can get a pretty decent idea of their power by looking at the total number of them. Fermi has 512 CUDA cores (32 in each of the 16 SMs), which compares favorably to the 240 cores featured in GPUs like the GTX 275. At the very least, that makes it a fair sight more powerful than the GPU it’s succeeding.

A Fermi streaming multiprocessor up close.

The L1 cache on Fermi can be configured to operate as 16KB of L1 cache and 48KB of shared memory, or 48KB of L1 cache and 16KB of shared memory. This architectural flexibility lets developers pick whichever split gives a given task more throughput, depending on whether it leans on cache or on shared memory.
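The CUDA runtime lets a developer state that preference per kernel through cudaFuncSetCacheConfig. A quick sketch of ours (the kernel name here is hypothetical):

```cuda
#include <cuda_runtime.h>

__global__ void stencil_kernel(float *data) { /* ... */ }

int main()
{
    // Ask for the 48KB shared / 16KB L1 split for a kernel that leans on shared memory...
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);

    // ...or flip it to 48KB L1 / 16KB shared for a kernel with scattered global loads.
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);
    return 0;
}
```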

You’ll notice also a region of the SM called the “warp scheduler.” Each warp scheduler works in concert with an instruction dispatch unit to distribute warps (groups of 32 threads) to sets of 16 CUDA cores in each SM. That makes for 1,024 actively executing threads on the GPU, with a total of 24,576 in flight if you count standby warps which pitch in to fill idle time. Today’s desktop CPUs top out at 12 or 16 threads, and even then only on architectures that haven’t shipped yet. Granted, that’s not an entirely fair comparison, as CPU threads are much more complex than GPU threads, but it should give you a sense of the scale we’re talking about.
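Here’s how those figures appear to break down, reconstructed from the numbers above (the 48 resident warps per SM is implied by the totals rather than stated outright):

```latex
\begin{align*}
16\ \text{SMs} \times 2\ \text{warps issuing} \times 32\ \text{threads/warp} &= 1{,}024\ \text{active threads} \\
16\ \text{SMs} \times 48\ \text{resident warps/SM} \times 32\ \text{threads/warp} &= 24{,}576\ \text{threads in flight}
\end{align*}
```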

Each core has also adopted fused multiply-add (FMA) operations. Long story short: it lets the GPU perform a multiply and an add in a single step. Older GPUs would multiply, round the number, add, then round the number again. If that sounds like a silly way to do it, that’s because it is. GT200 and Fermi simply compute x + (a × b) and round that result once. It saves a lot of time, and Fermi can do it at double precision, which means the rounded number is more accurate. When you absolutely have to get rounded numbers right, double precision is the way to go.
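In CUDA C terms, the difference looks like this. A small sketch of ours using the standard fma() math function (in practice the compiler can often make this substitution on its own):

```cuda
// Two roundings: the product is rounded, then the sum is rounded again.
__device__ double mad_two_roundings(double a, double b, double x)
{
    double product = a * b;      // rounded once here
    return product + x;          // rounded a second time here
}

// One rounding: fma() computes a*b+x exactly, then rounds the final result once.
__device__ double fused_multiply_add(double a, double b, double x)
{
    return fma(a, b, x);         // IEEE 754-2008 fused multiply-add
}
```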

Speaking of double precision, Fermi implements IEEE 754-2008-compliant double-precision floating point operations. As we discussed in our Radeon HD 5870 exposé, gigaFLOPS stands for one billion FLoating point Operations Per Second. A floating point operation is a basic calculation the processor leans on when crunching numbers, especially in “scientific” workloads like computer AI, video encoding and physics. Double-precision FLOPs ensure a high degree of accuracy in these calculations, which translates to more accurate rendering or encoding. We guess Fermi will land north of 700 billion double-precision FLOPS, while the HD 5870 weighs in at 544 billion. On the other hand, the HD 5870 will deliver an assbeating in the altogether less useful single-precision category, with nearly twice the performance.

Memory

The biggest change to memory architecture on Fermi is the addition of Error Checking and Correction, or ECC. ECC is an important component of enterprise memory because it can detect and correct errors in the data being held in memory. There are many ways these errors creep in, but most of them come down to how sensitive such tiny circuits are to interference. ECC will make Fermi significantly more useful to scientists who depend on accurate results.

In terms of the memory we can expect on a Fermi-based board, it could be as high as 12GB using 4Gb memory chips. Perhaps the enthusiast versions of these GPUs will come with 2GB, or even 4GB, of GDDR5 depending on its contract price at the time of production.

Lastly, that host interface block we ignored in the block diagram overview comes into play as well. NVIDIA has revamped the host interface to allow for concurrent bi-directional system/GPU transfers, which are fully overlapped with CPU and GPU processing time. This means that there’s an uninterrupted flow between the CPU and GPU when both are hard at work.
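In CUDA terms, that overlap is what asynchronous copies in streams are for. A minimal sketch, assuming pinned host memory and a trivial kernel of our own invention, that splits a buffer in half so the copies for one half can overlap the kernel working on the other:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;
}

int main()
{
    const int n = 1 << 20, half = n / 2;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, n * sizeof(float));  // pinned memory, needed for async copies
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Each stream uploads, processes, and downloads its own half of the data.
    // The copies in one stream are free to overlap the kernel running in the other.
    for (int k = 0; k < 2; ++k) {
        float *h = h_buf + k * half, *d = d_buf + k * half;
        cudaMemcpyAsync(d, h, half * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        scale<<<(half + 255) / 256, 256, 0, s[k]>>>(d, half);
        cudaMemcpyAsync(h, d, half * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }

    cudaDeviceSynchronize();
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```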

Running some numbers

The big question is, “Okay, but how does it perform?” Well, we don’t know that yet, but we can hazard a few guesses that may or may not be right; the back-of-envelope math behind them is sketched out just after the list.

  • Die size: The Radeon HD 5870 has a surface area of 334mm² with 2.15 billion transistors at 40nm. NVIDIA estimates 3 billion 40nm transistors for Fermi, roughly 40% more. With extremely sloppy math in tow, that pegs a Fermi die at around 465mm². That is by no means a small core.
  • Dies per wafer: We know that NVIDIA’s manufacturing partner (TSMC) uses 300mm (12″) wafers. Ignoring probable defects but accounting for the wafer’s unusable edge, that means each wafer will cough up roughly 120 dies.
  • Memory bandwidth: If Fermi uses the same 4.8Gbps GDDR5 used on the HD 5870 (153.6GB/s over a 256-bit bus), then the 50% wider Fermi memory bus should offer around 230GB/s.
  • Shader clockspeed: We’re just going to guess and say it’s similar to the GeForce GTX 285 at 1500MHz.
  • FLOP performance: NVIDIA claims 8x the peak double-precision FLOP performance of GT200b’s estimated 89 GFLOPS. If that’s true, Fermi offers around 712 double-precision GFLOPS. Alternatively, we estimate around 1,500 single-precision GFLOPS, which falls well shy of the HD 5870’s 2,720.
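For the curious, here’s the back-of-envelope math behind those guesses, using only the figures cited above. The dies-per-wafer line uses a common approximation that subtracts the wafer’s unusable edge, and the single-precision line assumes one FMA (two FLOPs) per core per clock at our guessed 1.5GHz shader clock:

```latex
\begin{align*}
\text{Die size:} &\quad 334\,\mathrm{mm^2} \times \tfrac{3.0}{2.15} \approx 466\,\mathrm{mm^2} \\
\text{Dies per 300mm wafer:} &\quad \frac{\pi \times 150^2}{466} - \frac{\pi \times 300}{\sqrt{2 \times 466}} \approx 152 - 31 \approx 121 \\
\text{Memory bandwidth:} &\quad \tfrac{384}{8} \times 4.8\,\mathrm{Gbps} \approx 230\,\mathrm{GB/s} \\
\text{Double precision:} &\quad 8 \times 89\,\mathrm{GFLOPS} \approx 712\,\mathrm{GFLOPS} \\
\text{Single precision:} &\quad 512\ \text{cores} \times 1.5\,\mathrm{GHz} \times 2 \approx 1536\,\mathrm{GFLOPS}
\end{align*}
```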

At the end of the day, we can only guess based on what NVIDIA has told us. We don’t have the shader clockspeed, the memory throughput, or the GPU’s core clockspeed. What NVIDIA delivers could be much higher or much lower than what we’ve scribbled down here.

Wrapping it all up

NVIDIA has seriously bent its will towards turning the next GeForce into something that looks more like a CPU than a GPU. With full C++, OpenCL, C, Fortran and DirectCompute language support, don’t be surprised to see all sorts of programs running on your GPU in the near future.

That said, the decision to announce the company’s newest architecture in the context of stream computing concerns us. This is a company that has positioned every prior GPU as a gaming card, even though more recent ones have had impressive GPGPU functionality as well.

But nothing ever stays the same, and NVIDIA has clearly changed gears to focus on its Tesla division, the very same division whose stream computing parts and appliances have so far produced miserable revenues. Why is NVIDIA banking on it? The fact that big green is looking to a nascent and, in most respects, very small market to grow its business is dangerously telling of its opinion on gaming.

With so many of today’s PC titles mere ports of popular console games, it’s no wonder that NVIDIA appears to fear market stagnation. But it’s not just ports that are hurting the industry. Display resolution stagnation, the rise of casual gaming and the looming shadow of console popularity are all having an impact as well.

Ultimately, it will take time for us to see and understand where NVIDIA is taking its business. Of course, that also requires an architecture which the company can sell to get it there, and for that we must wait until at least Q1 2010.

Comments

  1. UPSLynx
    This thing is impressive, yet horrifying. This is totally consistent with the trends we saw NVIDIA pushing for at SIGGRAPH with GPU computing.

    We're seeing a time when the line between the CPU and GPU is being blurred significantly. In theory, that's great news, what with all the crazy calculations needed for AI and physics and such.

    But at the same time, this could be a new approach that doesn't end well for gamers.

    GPU computing is awesome, especially in the professional fields. But I'm going to buy my GPU for gaming. That's all I want this thing to do. If the computing helps games, awesome. If the thing games exceptionally well AS WELL as assisting the CPU during non-gaming moments, I'm OK with that. If the card has mediocre yields in gaming while helping run MS Word better, then we're in trouble.
  2. QCH
    Did someone say Fermi? ;D
  3. Cliff_Forster
    I've been saying this for a while now: the GPU is the enthusiast part that is going to continue to evolve and impress us, while the CPU is starting to top out; all you can do is add more cores, and for a home user that is becoming fairly pointless. Now GPUs are showing their muscle. They can do more than just render images, and they are just now figuring out how they fit into the much larger picture. Think Intel dumped millions in research to develop their own GPUs just to compete in the PC gaming market? You know they didn't. They are doing it because they have to. The writing is on the wall: GPUs are where the biggest performance gains are possible, while the traditional x86 CPU, though still improving, is starting to top out. I think it's only a matter of time before GPUs are the guts of the system.
  4. Butters
  5. Leonardo
    The writing is on the wall: GPUs are where the biggest performance gains are possible, while the traditional x86 CPU, though still improving, is starting to top out.
    What you have expressed became palpable for me when I added GPU clients to my Folding@Home efforts. Seeing the amazing production the GPUs accomplished versus the CPU processing, I started paying attention to GPU-as-processor developments.
  6. Leonardo
    I hope Nvidia is successful with this technology. Five to ten years down the road it would benefit us all if there were a serious three-way competition - Nvidia, AMD/ATI, and Intel all producing top quality, high performance CPU-GPU processing units, or whatever they might be called at that time. The two-player general CPU market must evolve.

    The writing is definitely on the wall: 1) AMD purchased ATI, 2) Intel is getting serious with research and development for graphics processing, and 3) Nvidia is developing Fermi.
