Overclocked GPU failing work units?

Tim Southwest PA
edited June 2010 in Folding@Home
I noticed something today while looking at my HFM monitor on my 4 GPU folding PC.

GPU1 has had 5 complete and 10 failed work units so far. The other 3 GPUs have never had a failed work unit, and 62 successful ones total.

I had the shaders overclocked in sync in EVGA Precision, all at 1650 MHz, up from the stock 1500 MHz.

Could there be something in GPU1 that makes it unstable enough at this overclock to fail work packets? I un-synced the 4 GPUs and put GPU1 back to 1500. So far it's up to 8 pass and the same 10 fails.

We'll see what happens.
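
To put the pass/fail records above in perspective, here is a minimal Python sketch that turns them into success rates. The function name `success_rate` is just an illustration; the counts are the ones quoted in the post (5 complete and 10 failed on GPU1, 62 complete and none failed on the other three combined).

```python
def success_rate(passed, failed):
    """Fraction of work units that completed, given pass and fail counts."""
    total = passed + failed
    if total == 0:
        raise ValueError("no work units recorded")
    return passed / total

# Counts taken from the post above.
records = {"GPU1": (5, 10), "other 3 GPUs combined": (62, 0)}
for name, (passed, failed) in records.items():
    rate = success_rate(passed, failed)
    print(f"{name}: {rate:.1%} of {passed + failed} work units passed")
```

With so few work units on GPU1 the percentage is only a rough signal, but the gap to the other cards is large.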

Comments

  • Snarkasm Madison, WI
    edited April 2010
    Yes, overclocking can ruin work units, and yes, different cards have different levels of overclockability. Just because one card is good at 1650 doesn't mean all will be. Even if one card is good at 1650, that very same card may not be good at 1600. OCing needs patience.
  • _k P-Town, Texas
    edited April 2010
    The other thing you need to look at is where in the WUs it is failing. If it throws an error at the very end or very start of a WU, that usually indicates a system problem, but if it errors anywhere from 5%-95% complete, it usually indicates an OC issue. A completely unstable OC will get one set of projects and fail as soon as it loads them, so it never gets past 0%, and it will tend to hit the maximum EUE count and trigger the 24-hour shutdown. These are not exact guidelines, but most of what I have seen conforms to them.
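
As a rough illustration of the rule of thumb above, here is a minimal Python sketch that maps the progress at which a work unit failed to a likely cause. The function name and the choice to treat exactly 0% as its own case are assumptions made for this sketch; the 5% and 95% cut-offs come from the comment, which itself says these are not exact guidelines.

```python
def classify_failure(percent_done):
    """Guess a likely cause from how far a work unit got before failing.

    The 5% and 95% cut-offs follow the rule of thumb in the thread.
    Treating exactly 0% as a separate case is an assumption of this sketch.
    """
    if not 0 <= percent_done <= 100:
        raise ValueError("percent_done must be between 0 and 100")
    if percent_done == 0:
        return "unstable overclock (fails on load)"
    if percent_done < 5 or percent_done > 95:
        return "possible system problem"
    return "possible overclock issue"

# A few sample failure points.
for pct in (0, 2, 30, 55, 98):
    print(f"{pct:>3}% -> {classify_failure(pct)}")
```

A single failure says little; the pattern across many failed work units is what matters.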
  • Tim Southwest PA
    edited April 2010
    Since I took out its overclock, it has improved from a record of 5-10 to 12-10. That seems to have fixed it. Maybe I'll try to squeeze it up to 1550 or 1600 and watch the record. I can also check the text log and see where they were failing.

    EDIT: I checked the log file and saw the failures. Of those that failed, about half failed in the first 10%; the others were between 20% and 55% when they failed. Toward the end of the log, where I took out the overclock, the failures stop. Some of the error messages at the failure point were:

    mdrun_gpu returned, NANs detected on GPU, Folding@Home core shutdown: UNSTABLE_MACHINE, Corestatus = 7A (122)

    What's a NAN? And on EUE pauses, WHY is it a 24 hour shutdown? Give it 10 minutes to cool down the GPU and resume! I used to see EUE messages on HFM, but never knew what EUE was. If the GPU with EUE hadn't restarted after a little while, I'd shut down that GPU and restart it.
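
A small aside on the error text quoted above: "7A" and "122" in the Corestatus field are the same value, written in hexadecimal and in decimal. The Python sketch below pulls both out of the message and checks that they agree. The regular expression and the helper name `parse_corestatus` are assumptions for illustration; the string is copied from the post, and a real log file may lay these messages out differently.

```python
import re

# Error text as quoted in the post; the real log layout may differ.
LOG_LINE = ("mdrun_gpu returned, NANs detected on GPU, Folding@Home core "
            "shutdown: UNSTABLE_MACHINE, Corestatus = 7A (122)")

# Hypothetical pattern: a hex code followed by a decimal value in parentheses.
STATUS_RE = re.compile(r"Corestatus = ([0-9A-Fa-f]+) \((\d+)\)")

def parse_corestatus(line):
    """Return (hex value as int, decimal value as int), or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2))

hex_value, dec_value = parse_corestatus(LOG_LINE)
print(hex_value, dec_value, hex_value == dec_value)
```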
  • clifford_cooley Arkansas, USA
    edited April 2010
    I've seen NAN in programming before and it usually meant "Not A Number". I'm not sure if it would mean the same here or not. The "Not A Number" error would indicate that a math function tried to calculate two values and one was not a number so the calculation could not be performed and submitted the NAN error.
  • Snarkasm Madison, WI
    edited April 2010
    NaN is not a number, yes. That's exactly what happened. EUEs are early unit ends; it stops to give you time to evaluate what's wrong with your system before trying again. It doesn't do Stanford any good for you to turn in 50 EUEs. They'd much rather you just turn in 1 and notice your points have dropped and debug why.
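
For anyone who has not run into NaN before, here is a minimal Python sketch of how it behaves in ordinary floating-point math. It only illustrates the general concept; it does not show how the folding core detects NaNs, and the helper `contains_nan` is a made-up name for this example. In floating point, NaN usually appears as the result of an undefined operation, and once it appears it spreads through every later calculation.

```python
import math

def contains_nan(values):
    """True if any value in the sequence is NaN."""
    return any(math.isnan(v) for v in values)

inf = float("inf")
bad = inf - inf              # undefined result, so the answer is NaN
print(bad == bad)            # NaN is not equal to anything, even itself
print(math.isnan(bad + 1.0)) # NaN propagates through later arithmetic

# One bad value is enough to spoil a whole set of results.
print(contains_nan([1.0, 2.5, bad]))
print(contains_nan([1.0, 2.5, 3.0]))
```

This propagation is why a simulation that hits one NaN generally cannot produce a usable result from that point on.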
  • _k P-Town, Texas
    edited April 2010
    Just to add to what Snark said, even though he nailed the EUE thing: the other reason they have a 24-hour hold on clients that throw a large number of consecutive EUEs is to cut down on cheating.

    They say that if you do unethical things to gain points without completing work, they will remove points or suspend the user account. I have thrown about 40 EUEs in one day trying to figure out if I could get folding in SLI to work correctly, but I never lost any points or had a hold placed on my user name. Going through that, I saw how you could cheat, because I picked up an extra 3k that day from clearing logs and throwing more EUEs while still picking up partial points. This is an FYI for everyone about the possible issues with uncontrolled EUEs and the possible backlash if you exploit them.
  • Tim Southwest PA
    edited April 2010
    I've seen HFM reporting work units as EUE, but never knew what it meant until now.

    Since I took the overclock off GPU1, I haven't seen it.
  • Tim Southwest PA
    edited April 2010
    GPU1 is starting to show EUEs again. I looked at the log file and it failed 4 in a row, all before 30% of the work unit was done. And this is at the stock clock speeds.

    I may have to RMA this one GPU, we can't have all these failures wasting time.

    The other 3 GPUs put together can go 50-60 work units with maybe only 2-3 failures at the most, sometimes not even that many.
  • shwaip bluffin' with my muffin
    edited April 2010
    It could be overheating too. Also you may want to make sure all your voltages are in spec.
  • _k P-Town, Texas
    edited April 2010
    You might want to throw it in your other board by itself to make sure it is really the card throwing the errors and not the board/GPU combo.
  • Tim Southwest PA
    edited April 2010
    The other motherboard is just bare yet, in the box. I need to get all the other parts sometime; I just PM'd you on that.

    I could try changing this GPU's position in the current computer and see if that helps any.

    When you create Folding folders, like GPU1, GPU2, GPU3, etc, the computer assigns those numbers to the GPU slots, doesn't it? Not the video card itself?

    In the case of single GPU video cards, it would start with the first GPU slot closest to the CPU, and count higher as it got further away?

    And with dual GPU video cards, it would put 0 and 1 on the first PCI-E slot, 2 and 3 on the second, etc?

    Where can I go to look at the voltages? I don't think there are any GPU voltage settings in the BIOS.

    This power supply should be more than enough to handle this PC. 950 watt Corsair TX950W.
  • Tim Southwest PA
    edited April 2010
    What about the EVGA Precision overclocking utility? Does it list slots 1-4 in the order they are on the board, like 1 is closest to the CPU, and 4 is the farthest away? I'm sure it does not assign itself to a specific GPU and follow it if it is switched to a different slot.

    I moved the GPU in slot 1 to slot 4, and moved the one on 4 up to 1. This morning GPU4 had a pass / fail record of 5-3, and GPU1 was at 4-1. Then I remembered I had not changed the EVGA settings when I switched the cards, so slot 4 was overclocked again. I put 1 and 4 back to stock speeds, 2 and 3 are around 1613 and 1650 on the shaders and have not had problems.
  • _k P-Town, Texas
    edited April 2010
    Honestly, I don't remember for certain on these. I believe that with single-GPU cards it displays the GPU numbers in the same order as the slot numbers. The easy way to find out which card is where is to pause folding for a few minutes so the cards cool and the duty cycle on everything drops, then run the fan on 1 or 4 up to 100% with the other fans left at idle. This will show you which card is where relative to the way the software labels them.

    If all else fails, pull the OC off of every card and move up slowly by dividers with all cards together; as soon as you fail a WU, back down one step on the divider. You might also have to accept that some WUs will always fail in this setup.

    Great job so far Tim.
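
The step-up procedure described above can be sketched in Python as follows. This is only an outline of the logic: `passes_work_unit` is a hypothetical callback standing in for a person watching the logs at each setting, and the example threshold in the lambda is made up. The 1500 and 1650 values are the stock and overclocked shader speeds mentioned earlier in the thread.

```python
def highest_stable_step(steps, passes_work_unit):
    """Return the last clock step that passed, or None if even the first fails.

    steps: clock settings in ascending order, starting from stock.
    passes_work_unit: hypothetical callback; in practice this is someone
    checking whether work units complete at that setting.
    """
    stable = None
    for step in steps:
        if not passes_work_unit(step):
            break  # first failure: stay on the previous step
        stable = step
    return stable

# Made-up example: pretend anything above 1600 MHz fails a work unit.
shader_steps = [1500, 1550, 1600, 1650]
print(highest_stable_step(shader_steps, lambda mhz: mhz <= 1600))
```

In practice each step needs enough completed work units to be convincing, since a marginal overclock may pass several before it fails one.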
  • Tim Southwest PA
    edited May 2010
    That same GPU (now in slot 4) is blowing work units again. It was doing good for a few weeks, with over 90% success. Now it's failing almost every one, I had to take it out of EUE pauses 3 times since yesterday.

    I used the EVGA overclocking utility to UNDERclock the card a bit, but it still didn't help.

    I looked at the log file, and most of the failed ones were "exception thrown during guarded run", with less than 5% of the work unit finished. And most of those didn't even make it to 2%. There were a few "NAN detected" errors also.

    Looks like something is wrong with this GPU, if even underclocking can't fix it I don't know what else I can do for it.

    Is it possible that this GPU will work fine as a graphics card for general or gaming use, it just fails at folding? Or will it screw up playing WoW also?

    Here's a possibility - when I got these GPUs, I took the heat sink assembly off of one of them, looked at everything, and put it back on. Maybe it was this one (I didn't keep track), and I messed up the heat sink contact surfaces, so it's overheating now and failing? The GPU had those spongy tabs for each of the memory chips.

    Maybe clean the heat sink surfaces, try some Arctic Silver 5, and see what happens?
  • BuddyJ Dept. of Propaganda OKC
    edited May 2010
    Try reapplying your TIM (AS5 or whatever) and seeing if that helps. Sounds like you might also be having memory errors. Memtest the PC just to be sure.
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska
    edited May 2010
    "UNDERclock the card a bit, but it still didn't help"
    Assuming that GPU is not trapped in an airflow dead zone where it can't cool properly, I'd say you need to RMA that card if you haven't already done so.
  • Tim Southwest PA
    edited May 2010
    I changed the RAM sticks on the motherboard, it didn't make any difference. The one GPU needs replaced, other than that, all is fine.

    I wonder what exactly is wrong with the GPU.

    I'll have to contact the eBay seller company and see if it's not too late for an RMA. If it is, the 8800GTs are only $57.00 at their place so it's not very expensive for 4500-5000 PPD!
  • Tim Southwest PA
    edited June 2010
    I sent the 8800GT in for RMA yesterday, once a new one arrives I should be back to full folding capacity.
  • _k P-Town, Texas
    edited June 2010
    Did you ever get all the parts back and machines reassembled?
  • Tim Southwest PA
    edited June 2010
    I sent in one 8800GT for RMA and they refunded it instead of shipping a new one. Another 8800GT has also failed, so I'm only folding on 2 8800GTs at the moment.

    Total folding machines - 1 4870, 1 E7300 dual core, one TL-60 Athlon 64X2 in my DV6000 laptop, a 9650 quad core, and 2 8800GTs.

    I may assemble the other K9A2 motherboard with the bad 8800GT and let it fold on the 8750 tri-core if I get bored enough.

    The company I got the 8800GTs from said they couldn't duplicate the problem, because I'm sure they probably don't even know what folding is, much less install it, but getting the refund was no problem.