Overclocked GPU failing work units?
Tim
Southwest PA Icrontian
I noticed something today while looking at my HFM monitor on my 4 GPU folding PC.
GPU1 has had 5 complete and 10 failed work units so far. The other 3 GPUs have never had a failed work unit, and 62 successful ones total.
I had the shaders overclocked in sync in EVGA Precision, all at 1650 MHz, up from the stock 1500 MHz.
Could there be something in GPU1 that makes it unstable enough at this overclock to fail work units? I un-synced the 4 GPUs and put GPU1 back to 1500 MHz. So far it's up to 8 passes and the same 10 fails.
We'll see what happens.
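For anyone who wants to compare GPUs by the numbers, here is a minimal Python sketch of the failure-rate arithmetic using the counts quoted above. The name `failure_rate` is just an illustrative helper, not something from HFM or the folding client.

```python
def failure_rate(passed, failed):
    """Return the fraction of attempted work units that failed."""
    total = passed + failed
    if total == 0:
        return 0.0  # nothing attempted yet
    return failed / total

# Counts quoted in the post above.
gpu1 = failure_rate(passed=5, failed=10)
others = failure_rate(passed=62, failed=0)

print(f"GPU1: {gpu1:.1%} failed")
print(f"Other 3 GPUs: {others:.1%} failed")
```

That works out to 10 of 15, roughly two thirds, for GPU1 versus none of 62 for the other three combined.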
EDIT: I checked the log file and saw the failures. Of the work units that failed, about half failed in the first 10%; the others were between 20% and 55% complete when they failed. Toward the end of the log, where I took out the overclock, the failures stop. Some of the error messages at the failure point were:
mdrun_gpu returned, NANs detected on GPU, Folding@Home core shutdown: UNSTABLE_MACHINE, Corestatus = 7A (122)
What's a NAN? And on EUE pauses, WHY is it a 24 hour shutdown? Give it 10 minutes to cool down the GPU and resume! I used to see EUE messages on HFM, but never knew what EUE was. If the GPU with EUE hadn't restarted after a little while, I'd shut down that GPU and restart it.
They say that if you do unethical things to gain points without completing work, they will remove points or suspend the user account. I have thrown about 40 EUEs in one day trying to figure out if I could get folding in SLI to work correctly, but I never lost any points or had a hold placed on my user name. Going through those events I saw how you could cheat, because I picked up an extra 3k that day from clearing logs and throwing more EUEs while still picking up partial points. This is an FYI for everyone about the possible issues with uncontrolled EUEs and the possible backlash if you exploit them.
Since I took the overclock off GPU1, I haven't seen it.
I may have to RMA this one GPU; we can't have all these failures wasting time.
The other 3 GPUs put together can go 50-60 work units with maybe only 2-3 failures at the most, sometimes not even that many.
I could try changing this GPU's position in the current computer and see if that helps any.
When you create Folding folders, like GPU1, GPU2, GPU3, etc, the computer assigns those numbers to the GPU slots, doesn't it? Not the video card itself?
In the case of single GPU video cards, it would start with the first GPU slot closest to the CPU, and count higher as it got further away?
And with dual GPU video cards, it would put 0 and 1 on the first PCI-E slot, 2 and 3 on the second, etc?
Where can I go to look at the voltages? I don't think there are any GPU voltage settings in the BIOS.
This power supply should be more than enough to handle this PC. 950 watt Corsair TX950W.
I moved the GPU in slot 1 to slot 4, and moved the one in slot 4 up to slot 1. This morning GPU4 had a pass/fail record of 5-3, and GPU1 was at 4-1. Then I remembered I had not changed the EVGA settings when I switched the cards, so slot 4 was overclocked again. I put 1 and 4 back to stock speeds; 2 and 3 are around 1613 MHz and 1650 MHz on the shaders and have not had problems.
If all else fails, pull the OC off of every card, then move up slowly by dividers with all cards together. As soon as you fail a WU, back down one step on the divider. You might also have to accept that some WUs will always fail in this setup.
Great job so far Tim.
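The step-up approach suggested above can be sketched in a few lines of Python. This is only an illustration under assumptions: `is_stable` is a hypothetical callback standing in for "fold at this clock for a while and see whether a WU fails", and the clock steps are made-up values between the stock and overclocked speeds mentioned in this thread.

```python
def highest_stable_step(steps, is_stable):
    """Walk up through `steps` in order and return the last step that
    passed, or None if even the first step fails."""
    last_good = None
    for step in steps:
        if not is_stable(step):
            break  # first failure: stay one step back
        last_good = step
    return last_good

# Hypothetical shader clocks in MHz, lowest first.
steps = [1500, 1550, 1600, 1650]

# Stand-in for a real folding test; here we pretend anything above
# 1600 MHz throws a failed WU.
result = highest_stable_step(steps, lambda mhz: mhz <= 1600)
print(result)
```

In practice the stability check is the slow part, since each step needs enough completed work units to mean anything.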
I used the EVGA overclocking utility to UNDERclock the card a bit, but it still didn't help.
I looked at the log file, and most of the failed ones were "exception thrown during guarded run", with less than 5% of the work unit finished. And most of those didn't even make it to 2%. There were a few "NAN detected" errors also.
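Counting these by hand gets old fast, so here is a rough Python sketch for tallying the failure messages. It assumes the log is plain text with one message per line, and the marker strings are just the ones quoted in this thread; the real log may word them differently, so adjust `FAILURE_MARKERS` to match.

```python
from collections import Counter

# Substrings quoted in this thread (assumed, not an official list).
FAILURE_MARKERS = (
    "exception thrown during guarded run",
    "NANs detected on GPU",
    "UNSTABLE_MACHINE",
)

def tally_failures(log_lines):
    """Count how many lines contain each marker, ignoring case."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for marker in FAILURE_MARKERS:
            if marker.lower() in lowered:
                counts[marker] += 1
    return counts

# Made-up sample lines for illustration only.
sample = [
    "Exception thrown during guarded run",
    "NANs detected on GPU",
    "Folding@Home core shutdown: UNSTABLE_MACHINE",
    "Completed 5%",
]
for marker, count in tally_failures(sample).items():
    print(f"{count} x {marker}")
```

To run it on a real log, pass an open file instead of the sample list, for example `tally_failures(open(path))` with `path` pointing at the GPU's log file.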
Looks like something is wrong with this GPU. If even underclocking can't fix it, I don't know what else I can do for it.
Is it possible that this GPU will work fine as a graphics card for general or gaming use, and just fails at folding? Or will it screw up playing WoW also?
Here's a possibility: when I got these GPUs, I took the heat sink assembly off of one of them, looked at everything, and put it back on. Maybe it was this one (I didn't keep track), and I messed up the heat sink contact surfaces so it's overheating now and failing? The GPU had those spongy tabs for each of the memory chips.
Maybe clean the heat sink surfaces, try some Arctic Silver 5, and see what happens?
I wonder what exactly is wrong with the GPU.
I'll have to contact the eBay seller company and see if it's not too late for an RMA. If it is, the 8800GTs are only $57.00 at their place so it's not very expensive for 4500-5000 PPD!
Total folding machines - 1 4870, 1 E7300 dual core, one TL-60 Athlon 64X2 in my DV6000 laptop, a 9650 quad core, and 2 8800GTs.
I may assemble the other K9A2 motherboard with the bad 8800GT and let it fold on the 8750 tri-core if I get bored enough.
The company I got the 8800GTs from said they couldn't duplicate the problem, probably because they don't even know what folding is, much less have it installed, but getting the refund was no problem.