Forcing Independent GPU for Increased Z Buffer?
Friend came to me with a rather ugly problem. Now, if we still had the brilliance that was 3DLabs, I'd have already solved it twice over. But we don't, so I'm faced with a hell of a problem.
They need >6GB of Z Buffer space. No, I am not at liberty to explain why. Normally in a 3DLabs scenario, I'd use multiple REALiZM 800's since each VPU was an independent unit. Since this is offload work, there's no GenLock requirement.
Since it's >6GB combined, not per unit, the thought I had was to use 2x Radeon W7000's in independent (non-CrossFire) mode. Is that even possible? The S10k's not an option due to cost and size.
Comments
Yes, because I already crunched the numbers: given that a single Z buffer load is <3GB, it could work. However, with this workload, a second GPU mirroring the same memory wouldn't offer a significant benefit, especially with the W7000's double vertex setup.
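To make the capacity argument concrete, here's a minimal sketch of the math. The 4GB figure is the W7000's VRAM; the 6.5GB total is my placeholder for ">6GB combined", since the exact number isn't stated in the thread.

```python
import math

# Assumptions (mine, not figures from the thread, beyond ">6GB combined"):
VRAM_GB = 4.0        # per-card VRAM on a FirePro W7000
TOTAL_ZBUF_GB = 6.5  # placeholder for the ">6GB combined" requirement

def independent_capacity(cards: int) -> float:
    # Non-CrossFire: each card holds its own slice of the Z buffer,
    # so capacity scales with card count.
    return cards * VRAM_GB

def mirrored_capacity(cards: int) -> float:
    # CrossFire-style mirroring duplicates the working set on every card,
    # so usable capacity never exceeds a single card's VRAM.
    return VRAM_GB

cards_needed = math.ceil(TOTAL_ZBUF_GB / VRAM_GB)
print(cards_needed)                             # -> 2
print(independent_capacity(2) > TOTAL_ZBUF_GB)  # -> True
print(mirrored_capacity(2) > TOTAL_ZBUF_GB)     # -> False
```

Which is the whole point of running the cards independent: two mirrored 4GB cards still only give you 4GB of addressable buffer.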
No, because the software is more or less correctly optimized as is. AMD and NV can claim CrossFire/SLI is great, but the fact is it's detrimental to some workloads, and this happens to be one of them. Having two GPUs working the same data set ultimately offers less performance than having two GPUs working independent sets, since the workload is highly parallelized.
So I should probably explain that a bit better. Generally a straight-up ~60-80% performance increase is good, yes. The problem is when you have an offload system that's properly parallelized a la the 3DLabs VPU design. What that means is that when a second VPU (something AMD/NV are only starting to catch up to with Stream/CUDA) or GPU is provided, it works on a second data set. Third VPU/GPU, third data set, and so on. The obvious result is that instead of finishing a single set in ~70% of the time, it finishes N sets in 100% of the time. (Which is fine because it's a standalone offload box.)
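The throughput difference above can be sketched in a few lines. The ~70% figure is the thread's rough number for a coupled two-GPU setup, not a benchmark.

```python
# Hedged sketch: sets completed per unit of time, two GPUs either way.
SETS = 2  # one data set per GPU in the independent case

# Coupled (CF/SLI): both GPUs grind one set, finishing it in ~70% of
# the single-GPU time -> throughput of 1/0.7 sets per unit time.
coupled_rate = 1 / 0.7

# Independent: each GPU finishes its own set in the full unit time
# -> 2 sets per unit time.
independent_rate = SETS / 1.0

print(round(coupled_rate, 2))           # -> 1.43
print(independent_rate)                 # -> 2.0
print(independent_rate > coupled_rate)  # -> True
```

Scale the set count up and the gap only widens, which is why independent operation wins for a batch offload box.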
These workload types are extremely rare, which is why I can't say as much as I'd like on it. This particular workload was originally written to leverage independent Render/Geometry Processors, then VPUs, which means it's built around exploiting geometry units, making the W7000 the hands-down best choice. (Amusingly, this came up around the same time as someone asking me about a Blender offload box.) The problem is that if the W7000 can only operate multiple cards in CrossFire, it's detrimental to the workflow and would require multiple 1U systems with one W7000 each instead of a 5U with 4+ W7000's.
Heh.. AMD should sponsor me to build a show-off render box for them. Because who the hell wouldn't want 8 x W7000's in a single mixed air/water chassis?
Out of curiosity, is there a reason Crossfire isn't a solution? The W7000 should support it.
I haven't had time to talk to them about a code refactor yet, but given the input data, there's still far more advantage in large-memory independent GPU operation than in raw processing power. It might be more helpful to explain how it worked with the VPU/VSU setup.
In the VPU/VSU configuration, the system and software relied on tight coupling (each VPU/VSU linked to a specific CPU) and large multi-channel system memory. Calculation load was fed into the 128MB DirectBurst segment while EVM streamed from dedicated memory channels into the 512MB GDDR segment to get around the memory limitation. To my knowledge, FireGL/FirePro does NOT have an equivalent to EVM, which gave each REALiZM card up to 256GB of effective buffer. (AFAIK effective buffer is 2 * GDDR, so 4GB would give 8GB.)
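For what it's worth, the "effective buffer is 2 * GDDR" rule of thumb works out like this. This is only the stated rule applied to the two memory sizes mentioned, not a claim about how EVM actually paged memory.

```python
# Rule of thumb from the comment above: effective buffer = 2 * GDDR.
def effective_buffer_gb(gddr_gb: float) -> float:
    return 2 * gddr_gb

print(effective_buffer_gb(0.5))  # REALiZM 800's 512MB segment -> 1.0 GB
print(effective_buffer_gb(4.0))  # the 4GB FirePro example -> 8.0 GB
```

Note the gap between that rule and EVM's 256GB figure: EVM was virtual addressing over system memory, which is a different mechanism entirely.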
Basically, each VPU was fed by 1-2 dedicated DDR2 channels and did not use non-local memory, something not entirely possible with modern architectures, sadly. Bear in mind that the R800 VPU unit (two VPUs operating in parallel on the same 512MB GDDR segment) was rated at over 700 GFLOPS per card, and 'multi-VPU' did not parallelize like CF/SLI: memory was not mirrored. The whole setup was designed around grinding large datasets, with each processing unit working through a separate data set.