Howdy, stranger! Ready to join the community? [log in]

Pipelining Explained


The word “pipelining” with
respect to a processor may be a familiar word to enthusiasts but many may not
know much beyond the term or even what it means. Learn now what pipelining
is and if it is something that will benefit your computing experience or not.

In the most basic terms pipelining is the ability of the processor to “cue”
data it may require while the processor is completing an instruction. Think
of it like an assembly line inside the processor except this assembly line
sees several parts of the finished product happen simultaneously. Pipelining
works as long as the right tools are used and the right parts are available
when needed. Conversely, the assembly line can experience slowdowns when the
wrong tool or wrong part is thought to be needed…when it isn’t.

The parts equal the sum of the Pipeline.

Instructions

All programs can be broken down into a fixed number of instructions, known
as the Instruction Set. These instructions include very basic arithmetic,
memory access, jumping within the program (branching) and other instructions.
Each of these instructions has a given path that are taken through a processor.

A very simple way of completing a program would be to figure out what instruction
the processor must complete, complete that instruction and then move on to
the next. The following is a simple example of a few instructions being completed
linearly.

Linear

Note the total number of cycles to complete the instruction is: Average Cycles
Per Instruction (x) * Number of Instructions.

This manner of completing instructions, while very simple, is rather inefficient
because several parts of the processor remain inactive while one instruction
is being completed. Wouldn’t it be great if we could use most of the
chip at the same time eliminating dormant parts of the hardware? This is
the purpose of pipelining.

Here is a brief overview of what makes pipelining
possible:

Every instruction has five basic divisions: Fetch, Decode, Execute, Memory
Access, and Store. “Fetching” involves accessing
memory that contains the program. The next part, Decode,
involves looking at the data obtained in the Fetch to determine what kind
of instruction is used, and what data will be necessary to complete the
instruction. Every instruction has the same Fetch and Decode because until
Decode has completed, there is no way of telling what the instruction is. Execution
is where any ALU (Arithmetic and Logic Unit)
operations (like Add, Multiply, AND, OR, etc) are performed. Memory
Access
is only used by instructions that store or
load data to/from memory. The final part of an instruction is Writing the
results (if applicable) to the registers within the processor. Each of these
divisions uses separate parts of the processor, and this is the basis for
pipelining. Here is an instruction broken into these five divisions with
some of what happens in each part.

Division

Pipelining

Pipelining is accomplished by adding registers in between the stages to hold
the results of each individual stage. After the first instruction completes
the Fetch stage it moves to the Decode stage and then the second instruction
starts the Fetch stage. The following is a picture to help illustrate the
process:

Pipelined

Note that now all the instructions have a fixed number of cycles to complete
which is equal to the number of stages. As the number of stages increases,
the amount of work that the processor does in a stage decreases along with
the time required for a stage. This allows the frequency of the processor
to increase. The time to execute N instructions with a pipeline of x stages
is t = x + (N
– 1)

Advantages of Pipelining:

  • More efficient use of processor
  • Higher frequencies possible.
  • Quicker time of execution of large number of
    instructions

Disadvantages:

  • Pipelining involves adding hardware to the chip
  • Pipelining requires a large
    instruction cache to keep the pipeline “full” and
    minimize stalling

Branching

Since branches are conditional there are two possible instructions that could
follow a branch instruction. If the branch is not taken the next instruction
is immediately after the branch. However, if the branch is taken, a different
instruction would follow the branch. This means that every branch that is
predicted incorrectly requires a flush of all the instructions that were fetched
after it. This is not a huge performance hit on a short pipeline or one that
has the branch determined early in the pipeline. If this is not the case
then a large number of instructions will be flushed, causing a large performance
hit. These bad predictions can be minimized using one of several branch prediction
techniques which can reach around 98% accuracy.

Dependencies

Consider executing the following two equations:

x = 5 + 6

y = x + 7

You cannot start to compute ‘y’ until ‘x’ has been
stored. Storing ‘x’ occurs in the final stage of the pipeline but
calculation of ‘y’ has already started. Therefore, the pipeline
must be stalled to allow the value of ‘x’ to be written, and then
calculation of ‘y’ can finish. These data dependencies occur frequently
in instructions, so special forwarding hardware is added to processors. This
minimizes the impact of these dependencies and often removing the need to stall
at all.

Pipelining: Good or Bad?

Pipelining, like most things in life, is good in moderation. Making a processor’s
pipeline too short causes a longer minimum clock period which hinders the
manufacturer’s
ability to ramp up the clock speed. Making the pipeline very long allows faster
clock speeds however it also increases the cost of stalls and flushes which
negatively affects performance and also increases the amount of resources
required to pipeline the processor. A pipeline that allows for a balance between
these two; a decent clock speed without severe performance hits from stalls
or flushes is the optimal design.

Works Cited

Patterson & Hennessy. Computer Organization & Design: The Hardware/Software
Interface, 2nd Edition.
McFarling, Scott. WRL Technical Note TN-36: Combining Branch Predictors, June
1993.

Share |

5 Comments:

  1. Unregistered
    Guest

    Greetings!!

    Nice, compact and informative article!

    Thank You!

  2. Straight_Man
    Playing with Virtual Painter

    One thing, differences between "pipe" and "pipelining" might be qualified in this way:

    Pipelining speaks to the strategies and tactics in design used in processing data into and through each pipe. HT-Capable CPUs that are also true multi-pipe (or multiple virtual CPUs on one die as we see it) will be seen as multiple CPUs to any O\S that is not HT-tuned and not 64 bit core.

    So if you see a CPU referred to as having two pipes or four or eight, it is multiple pipes. Pipelining is the throughput design expressed in L1 and\or L2 cache sizing(I include cahce as cache can be used for three things by a CPU, input, output, and very small amounts of pended work-- caching strategy is tightly tied to how a pipelining strategy works in reality), the instruction length of the pipe, and the bredth of the pipe (IE how many bits\word it uses natively), and how many steps each bit of a htread must go through to hit output side of pipe at CPU output.

    A well-designed multipipe CPU CAN process more than one O\S thread at the same time. Make it too long, what mediaman speaks of does happen-- threads get interrupted and even reset in part in mid thread sometimes if the pipe is too long as the pipe has to stop work while data and instructions trickle in too slowly and then resume. Too short, with high speed CPUs, RAM or cache gets overloaded (from output side of CPU pipe) and Windows then has to clear some more RAM and then the data that the CPU processes has to be written to HD (Virtual RAM or swap space on HD is used)while not actively in use.

    John D.

  3. Unregistered
    Guest

    hi my name is miled from lebanon i wild ask about pipeline :how many stage there is like(fetch....)because my friend tel me about intel (20 stage)
    please tel me
    my email is bodium2002 at hotmail dot com

  4. shwaip
    elaborate bot

    Depends on the processor, but most architectures are comprised of the 5 basic stages that I wrote about in the article. Each stage can be broken into smaller substages to do "less" in a clock cycle. I can't tell you specifically what the stage names or what the individual stages accomplish in a given processor. My advice would be to use google to find the information. You could also check the white papers at intel's site (http://www.intel.com).

    It's unlikely that you'll find out exactly what is done in each stage, as this sort of thing is often considered a "trade secret". They don't want to release the information to the public, because that means that they release it to competitors (AMD, Via, whomever).

  5. shwaip
    elaborate bot

    actually...check out this xbit labs article...

    http://www.xbitlabs.com/articles/cpu...urst-1_19.html

Hey, be nice. Icrontic is full of good people, we promise.

New Features on Icrontic: