Forcing Independent GPU for Increased Z Buffer?

A friend came to me with a rather ugly problem. Now, if we still had the brilliance that was 3DLabs, I'd have already solved it twice over. But we don't, so I'm faced with a hell of a problem.

They need >6GB of Z Buffer space. No, I am not at liberty to explain why. Normally in a 3DLabs scenario, I'd use multiple REALiZM 800s, since each VPU was an independent unit. Since this is offload work, there's no GenLock requirement.
Since it's >6GB combined, not per unit, my thought was to use 2x FirePro W7000s in independent (non-CrossFire) mode. Is that even possible? The S10k isn't an option due to cost and size.

Comments

  • I don't know what most of this means, but is there any way to optimize the software to need less Z Buffer space? Maybe use some sort of compression or queue?
  • I don't know what most of this means, but is there any way to optimize the software to need less Z Buffer space? Maybe use some sort of compression or queue?

    Yes and no, no and yes. (Bear in mind, I'm an old guy, so my terminology is 3DLabs-era.)
    Yes, because I've already crunched the numbers: given that a single Z buffer load is <3GB, it could. However, for this workload a second GPU with mirrored memory wouldn't offer significant benefit, especially given the W7000's double vertex throughput.
    No, because the software is already more or less correctly optimized. AMD and NV can say CrossFire/SLI is great, but the fact is that it's detrimental to some workloads, and this just happens to be one of them. Having two GPUs work on the same data set ultimately offers less performance than having two GPUs work independently, since the workload is highly parallelized.

    So I should probably explain that a bit better. Generally, a straight-up ~60-80% performance increase is good, yes. The problem is when you have an offload system that is properly parallelized à la the 3DLabs VPU design. What that means is that when a second VPU or GPU is provided (something AMD/NV are only starting to catch up to with Stream/CUDA), it works on a second data set. A third VPU/GPU works on a third data set, and so on. The obvious result is that instead of finishing a single set in ~70% of the time, the system finishes N sets (one per VPU/GPU) in 100% of the time. (Which is fine, because it's a standalone offload box.)
    These workload types are extremely rare, which is why I can't say as much about them as I'd like to. This particular workload was originally written to leverage independent Render/Geometry Processors, then VPUs, which means it's written to make heavy use of geometry units. That makes the W7000 the hands-down best choice. (Amusingly, this came up around the same time someone asked me about a Blender offload box too.) The problem is that if multiple W7000s can only operate in CrossFire, that's detrimental to the workflow and would require multiple 1U systems with one W7000 each instead of one 5U system with four or more W7000s.

    Heh. AMD should sponsor me to build a show-off render box for them, because who the hell wouldn't want 8x W7000s in a single mixed air/water chassis?
  • mertesn | I am Bobby Miller | Yukon, OK | Icrontian
    I have no idea whether it's possible or not, but I'll check with my PR contact and see if they can put me in touch with someone who can answer this.

    Out of curiosity, is there a reason CrossFire isn't a solution? The W7000 should support it.
  • mertesn said:

    I have no idea whether it's possible or not, but I'll check with my PR contact and see if they can put me in touch with someone who can answer this.

    Out of curiosity, is there a reason CrossFire isn't a solution? The W7000 should support it.

    CrossFire offers no functional benefit for this workload. Even if scaling were 100%+100% (which it isn't), the reduction in usable memory (remember that 4GB+4GB=4GB in CF, because each card mirrors the same data) reduces or eliminates parallel loads. The workflow is built around high parallelization: inputs X, Y, Z produce outputs X, Y, Z at around the same time. Processing them one after another (input X, output X, input Y, output Y, and so on) would actually slow things down.

    I haven't had time to talk to them about a code refactor yet, but given the input data, independent GPU operation with large memory still offers far more advantage than raw processing power. It might be more helpful to explain how this worked with the VPU/VSU setup.
    In the VPU/VSU configuration, the system and software relied on tight coupling (each VPU/VSU tightly linked to a specific CPU) and large multi-channel system memory. The calculation load was fed into the 128MB DirectBurst segment, while EVM streamed data from dedicated memory channels into the 512MB GDDR segment to get around the memory limitation. To my knowledge, FireGL/FirePro has NO equivalent to EVM, which gave each REALiZM card up to 256GB of effective buffer. (AFAIK the effective buffer is 2x GDDR, so a 4GB card would get 8GB.)
    Basically, each VPU was fed by 1-2 dedicated DDR2 channels and did not use non-local memory, which sadly isn't entirely possible with modern architectures. Bear in mind that the R800 VPU unit (two VPUs operating in parallel on the same 512MB GDDR segment) was rated at over 700 GFLOPS per card, and 'multi-VPU' did not parallelize like CF/SLI: memory was not mirrored. So the whole setup was designed around grinding through large datasets, with each processing unit working on a separate data set.
  • mertesn | I am Bobby Miller | Yukon, OK | Icrontian
    edited January 2013
    RootWyrm said:

    A friend came to me with a rather ugly problem. Now, if we still had the brilliance that was 3DLabs, I'd have already solved it twice over. But we don't, so I'm faced with a hell of a problem.

    They need >6GB of Z Buffer space. No, I am not at liberty to explain why. Normally in a 3DLabs scenario, I'd use multiple REALiZM 800s, since each VPU was an independent unit. Since this is offload work, there's no GenLock requirement.
    Since it's >6GB combined, not per unit, my thought was to use 2x FirePro W7000s in independent (non-CrossFire) mode. Is that even possible? The S10k isn't an option due to cost and size.

    I just heard back from someone who works at AMD's FirePro division. He says yes, it is possible to set up two W7000s that way.
  • RootWyrm | Icrontian
    edited January 2013
    mertesn said:

    I just heard back from someone who works at AMD's FirePro division. He says yes, it is possible to set up two W7000s that way.

    System now looks like:
    [four images]
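RootWyrm's throughput argument in the thread (a second GPU working on a second data set instead of helping with the first) can be put into rough numbers. This is a minimal Python sketch, assuming every data set takes the same time T on one GPU and taking the thread's ~70% figure as the fraction of T a linked CrossFire-style pair needs to finish one set. The function names are made up for illustration.

```python
# Rough throughput model: one data set takes time T on a single GPU.
# Assumption: a linked (CrossFire-style) group finishes ONE set in
# `linked_time_fraction * T`, while independent GPUs each finish their own set in T.


def linked_sets_per_window(linked_time_fraction: float) -> float:
    """Data sets finished per window of length T when all GPUs share one set at a time."""
    return 1.0 / linked_time_fraction


def independent_sets_per_window(num_gpus: int) -> float:
    """Data sets finished per window of length T when each GPU owns a separate set."""
    return float(num_gpus)


if __name__ == "__main__":
    # ~70% of T for the linked pair, per the thread; 0.5 would be perfect 2x scaling.
    for fraction in (0.70, 0.50):
        print(f"linked pair at {fraction:.0%} of T: {linked_sets_per_window(fraction):.2f} sets/T")
    print(f"two independent GPUs: {independent_sets_per_window(2):.2f} sets/T")
```

Under these assumptions the linked pair finishes about 1.43 sets per window of length T versus 2 for two independent GPUs, and even perfect 100%+100% scaling (a fraction of 0.5) would only draw level.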
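The "4+4=4" point about CrossFire memory can be checked with similar arithmetic. The sketch below assumes 4 GB (4096 MB) per W7000 and that a single Z buffer load must fit entirely on one card. The load sizes are hypothetical, picked only to match the thread's "each load under 3 GB, over 6 GB combined". Loads are placed with first-fit-decreasing, a standard bin-packing heuristic swapped in here purely for illustration, not anything the friend's software is known to do.

```python
# Usable memory when cards mirror each other (CrossFire-style) versus when each
# card holds its own data. Sizes are in MB; 4096 MB per card is an assumption.
from typing import List, Optional


def usable_memory_mb(cards_mb: List[int], mirrored: bool) -> int:
    """Mirrored cards hold identical copies, so the smallest card caps capacity."""
    return min(cards_mb) if mirrored else sum(cards_mb)


def place_loads(cards_mb: List[int], loads_mb: List[int]) -> Optional[List[int]]:
    """First-fit-decreasing: return the card index for each load, or None if any load won't fit."""
    free = list(cards_mb)
    placement = [-1] * len(loads_mb)
    # Try the largest loads first, keeping track of their original positions.
    for i in sorted(range(len(loads_mb)), key=lambda k: loads_mb[k], reverse=True):
        for card, space in enumerate(free):
            if loads_mb[i] <= space:
                free[card] -= loads_mb[i]
                placement[i] = card
                break
        else:
            return None  # no card has room for this load
    return placement


if __name__ == "__main__":
    cards = [4096, 4096]        # two W7000s (assumed 4 GB each)
    loads = [2900, 2600, 1200]  # hypothetical Z buffer loads, 6700 MB total
    # Mirrored: every card needs the whole working set, so the pair behaves like one card.
    mirrored_pool = usable_memory_mb(cards, True)
    print("mirrored:", mirrored_pool, place_loads([mirrored_pool], loads))
    # Independent: each card gets its own subset of the loads.
    print("independent:", usable_memory_mb(cards, False), place_loads(cards, loads))
```

With these numbers the mirrored pair cannot hold the working set at all, while the same two cards run independently hold it with some room to spare.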
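The VPU design described in the thread, where each processing unit works through its own data set and nothing is mirrored, maps onto a simple host-side dispatch pattern. This Python sketch is only an illustration: `process_on_device` is a hypothetical placeholder for the real per-GPU work (here it just sums numbers so the code runs anywhere), and round-robin assignment is an assumption, not necessarily how the friend's software schedules work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def process_on_device(device_id: int, dataset: Sequence[int]) -> int:
    """Hypothetical per-device job; a real version would run on GPU `device_id`."""
    return sum(dataset)


def run_independent(datasets: List[Sequence[int]], num_devices: int) -> List[Tuple[int, int]]:
    """Return (device_id, result) for each data set, in input order."""
    # Round-robin: set 0 -> device 0, set 1 -> device 1, ... then wrap around.
    assignment = [i % num_devices for i in range(len(datasets))]

    def worker(device_id: int) -> List[Tuple[int, int]]:
        # A device only ever touches the data sets assigned to it (no shared buffers).
        return [(i, process_on_device(device_id, datasets[i]))
                for i, dev in enumerate(assignment) if dev == device_id]

    results: List[Tuple[int, int]] = [(-1, 0)] * len(datasets)
    # One host thread per device, each working through its own list serially.
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        futures = [pool.submit(worker, d) for d in range(num_devices)]
        for future in futures:
            for i, value in future.result():
                results[i] = (assignment[i], value)
    return results


if __name__ == "__main__":
    print(run_independent([[1, 2], [3, 4], [5], [6, 7, 8]], num_devices=2))
```

Threads are used on the assumption that, in a real setup, the heavy lifting happens on the GPUs while each host thread mostly waits on its own card. With the CPU-bound placeholder here, Python's GIL means there is no actual speedup.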