Sporadic NaNs on 9600GT?

lordbean Ontario, Canada
edited November 2009 in Folding@Home
Has anyone folded with this same card and had trouble with this? Once in a while, when my 9600GT downloads a new project, it goes into a NaN error loop that ends in an EUE (early unit end), and the client goes to sleep for 24 hours. Flushing the work files seems to correct the problem temporarily.

Steps taken so far: updated to the latest driver, downloaded RivaTuner and forced the fan to 80%.

Comments

  • Snarkasm Madison, WI
    edited September 2009
    NaNs are stability indicators. Check your GPU memory, heat levels, and processor. The card may just be dying.
  • lordbean Ontario, Canada
    edited September 2009
    The problem seems to be occurring on both my Phenom II rig and my E6600 rig. The SMP clients are running stably on both, but the 9600 GT in the E6600 and the second HD4850 in the Phenom will occasionally finish a unit, download a new one, and then go into a NaN loop. I'm starting to suspect it's a bug in the client.
  • _k P-Town, Texas
    edited September 2009
    Reinstall or delete everything except the .dll files.
  • lordbean Ontario, Canada
    edited September 2009
    If you mean the working folder, I do that each time; that's how the problem clears (at least temporarily). Everything except client.cfg and the CAL/CUDA DLLs gets deleted.
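    The flush described above can be sketched in Python as below. This is a minimal, hypothetical helper, not official Folding@home tooling: it assumes the client has been stopped first, that the folder you point it at is the client's working folder, and that the only things worth keeping are client.cfg and the *.dll files (the CAL/CUDA DLLs). Anything else, including the work subfolder and any FahCore executables, is deleted and would be re-downloaded by the client.

```python
from pathlib import Path
import shutil

# Hypothetical keep-list taken from the post: client.cfg plus any DLLs.
KEEP_NAMES = {"client.cfg"}
KEEP_SUFFIXES = {".dll"}


def flush_folder(folder: Path) -> list[str]:
    """Delete everything in `folder` except client.cfg and *.dll files.

    Assumes the folding client is NOT running while this executes.
    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for entry in folder.iterdir():
        name = entry.name
        if name.lower() in KEEP_NAMES or entry.suffix.lower() in KEEP_SUFFIXES:
            continue  # keep the config and the CAL/CUDA DLLs
        if entry.is_dir():
            shutil.rmtree(entry)  # e.g. the work/ subfolder and its contents
        else:
            entry.unlink()  # queue/unit files, FahCore executables, etc.
        removed.append(name)
    return sorted(removed)
```

    The keep-list is deliberately explicit so that nothing outside client.cfg and the DLLs survives; adjust it if your install keeps other files you care about.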
  • sgstair Reverse Engineer Redmond, WA
    edited September 2009
    I see this a lot on my primary machine, but almost never on the 3 other machines. From my research, this seems to be a bug in the nVidia driver for Vista/Win7, possibly related to multiple monitors.
    Still trying to track it down; I'll post if I do find anything concrete.
    I've cycled the GPU in this system several times and the old cards are now happily folding in other systems without hitting this issue, so it's almost certainly related to my current configuration.

    -Stephen
  • _k P-Town, Texas
    edited September 2009
    The only time I have ever gotten that error is when something is clocked way too high, either because I was busy doing something else with it, or because a new set of WUs came out that stressed the cores just enough that the current clock was no longer stable for the client. WUs do change even though the point value might not; there is huge variation between the series within any WU set.
  • sgstair Reverse Engineer Redmond, WA
    edited September 2009
    I just want to say that I've had cards in this machine that somewhat randomly alternate between folding without issue and EUEing. I've then put them in other machines, where they have folded without issue ever since. The only appreciable difference is that this system has 2 monitors; I haven't touched my GPU clocks.
    I seriously think there's a multi-monitor bug in the nVidia drivers causing all sorts of grief, and I'm working now to isolate it (it wouldn't be the first absolutely horrible bug in this area).
  • lordbean Ontario, Canada
    edited September 2009
    My own experience leads me to believe that a multi-monitor bug, if there is one, is not involved. In fact, the only system of mine that does not produce this error is my gaming rig with a GTX 285, and it has two monitors connected. I also sporadically get the problem on my HD4850s, which suggests it is not a driver glitch but a glitch somewhere in the folding app itself. My current suspicion is the machine ID. My GTX 285 has always been ID 5, because when I first installed folding I used 4 single-core clients on IDs 1-4, and it has never produced a NaN error. Since this is the only difference I can see in the graphics client on my gaming PC, I have reset the machine IDs on all my other graphics folding cores to 5 or 6, and I'm testing to see whether that fixes it.
  • lordbean Ontario, Canada
    edited September 2009
    Well, that theory has gone out the window. I just checked the log on my 9600GT, which has been running for about a week, and it has gone into a NaN/EUE loop twice as machine ID 5. This is driving me up the wall; I can't figure out why it happens.
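    Checking a week-long log for these loops by hand is tedious, so here is a small Python sketch that lists every log line reporting NaNs, with its line number, so the events can be matched against timestamps. The match text is an assumption based on the "NANs detected on GPU" message quoted later in this thread; the exact wording and the log file name on your client may differ, so adjust the pattern as needed.

```python
import re
from typing import Iterable

# Assumption: the client logs a line containing "NaNs detected" (any case)
# whenever a unit hits the problem; change this if your log differs.
NAN_PATTERN = re.compile(r"nans? detected", re.IGNORECASE)


def find_nan_events(lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line_number, text) for each log line that reports NaNs."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if NAN_PATTERN.search(line)
    ]
```

    To use it, open your client's log file (the path is up to you) with `open(path, encoding="utf-8", errors="replace")` and pass the file object to `find_nan_events`; the length of the result is the number of NaN reports.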
  • sgstair Reverse Engineer Redmond, WA
    edited September 2009
    There are -some- bad WUs, which NaN at random times.
    However, I typically see a NaNs detected message the instant the core starts, which I attribute to this bug (whatever it is).
    It may not be related to multiple monitors, but I'm not sure what else it could be: I have 3 systems that have never seen this bug, and this one sees it constantly, has seen it under both Vista and Win7, and is the only system with more than one monitor. I've also moved cards exhibiting this problem to other machines (twice now), where they fold without issue. So it's not likely hardware related (unless maybe it's mainboard related, but I'm not sure).

    It could also be related to use. Are your systems unattended most of the time? (Mine are set up as servers and don't see much user interaction.)
  • lordbean Ontario, Canada
    edited September 2009
    Use could be a factor. My gaming PC is the only one I really interact with on a daily basis. The other 2 systems pretty much sit there and fold.
  • sgstair Reverse Engineer Redmond, WA
    edited September 2009
    Except that in your experience, the one that gets used is the one that doesn't see this issue?
    I see the opposite. Not sure where to go with this.
  • lordbean Ontario, Canada
    edited September 2009
    I'm calling it weird, and quite possibly leaving it at that. I can't trace any consistent source of this issue, and it doesn't even seem to be the same between us.
  • sgstair Reverse Engineer Redmond, WA
    edited September 2009
    Fair enough :)
    I'll let you know if I find anything.
  • lordbean Ontario, Canada
    edited October 2009
    I noticed something unusual when I went to flush my 9600GT work files today. I was about to delete the contents of the work folder when I noticed that there were two cores present: FahCore_11.exe and FahCore_14.exe. I've never seen my GTX 285 do this before; it has only ever used one FahCore executable. It seems like something odd is going on when the program determines which core it should use.

    Edit: it seems this may be normal. I'm not sure why there has to be more than one executable for nVidia cards, but a Google search on each of the two cores brings up results suggesting both are nVidia execution cores.
  • Snarkasm Madison, WI
    edited October 2009
    New number crunching can require new cores. It knows which one to use - don't you worry your pretty little head.
  • lordbean Ontario, Canada
    edited October 2009
    Snarkasm wrote:
    New number crunching can require new cores. It knows which one to use - don't you worry your pretty little head.

    You say that, and yet I still have a serious problem with NaN loops on the 9600GT. I've had FurMark running in stability test mode on it for the last 40 minutes, and the card looks perfectly stable. It's warm, but nowhere near warm enough to be overheating.
  • _k P-Town, Texas
    edited October 2009
    The newer .exe is for the newer WUs. It uses better math, bursts its information, and keeps the power and heat down. All of my work folders have both .exe files in them; this is where you get the 1888-point WUs.
  • Trumandrummer Taylor Michigan
    edited October 2009
    Hmmm. I seem to be having the same problem sorta.
    My GPU folding has been acting up lately too. (on my gtx 260)
    It was working fine. But now it is returning "NANs detected on GPU" and it will stop folding for long periods of time.

    If I delete the files from the work folder and the "FahCore_11.exe" file, it will start up again fine, with no NaNs.

    Sometimes I can complete multiple WUs, and sometimes when it completes one it gets stuck and reports NaNs.

    EDIT:
    I have also been experiencing some crazy problems, though. When I am GPU folding, a high-pitched squeaking sound comes out of my speakers, so I have to turn them down. I read somewhere that this could be my power supply. It's only a 500 W unit powering a GTX 260 and an i7 920.
  • lordbean Ontario, Canada
    edited October 2009
    The noise from your speakers definitely sounds a bit off, but your description of the problem with your graphics folding is basically identical to what I'm experiencing with my 9600GT. It only goes into a NAN loop after having completed a WU and attempting to start a new one, and sometimes it'll work for days before it happens.
  • Trumandrummer Taylor Michigan
    edited October 2009
    lordbean wrote:
    The noise from your speakers definitely sounds a bit off, but your description of the problem with your graphics folding is basically identical to what I'm experiencing with my 9600GT. It only goes into a NAN loop after having completed a WU and attempting to start a new one, and sometimes it'll work for days before it happens.

    Yep. Exactly the same problem that I have been having. I can't seem to find a reason, or a fix other than deleting the work files, which is a pain, since I constantly have to monitor the folding now.

    Yeah, the noise problem is probably something different. It only happens when I am folding, though.
  • sgstair Reverse Engineer Redmond, WA
    edited October 2009
    Note that you don't actually have to delete the work files. Folding is just pretty silly when it comes to this problem. Restart the client once or twice and it should continue.
    Noise in your speakers is just as likely to be a side effect of a specific use pattern: it just happens to have a frequency component in the audible spectrum, and the audio system isn't sufficiently shielded from RF / power supply noise. Some electrical noise is guaranteed in a computer, so this more likely indicates lax attention to, or low-quality parts in, the audio circuits.
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska
    edited October 2009
    FahCore_11.exe and FahCore_14.exe
    It's completely normal to have both core executables in the client folder. The current crop of projects don't all use the same FahCore.
  • danball1976 Wichita Falls, TX
    edited October 2009
    The only time I tried running the GPU client on my MSI 9600GT OC, I wound up getting a BSOD, and haven't tried it since. Now that I got a BFG GTX260 OC MaxCore 55 896MB, I might try it again.
  • _k P-Town, Texas
    edited October 2009
    I keep getting this since I moved to Win7; it's really annoying.
  • lordbean Ontario, Canada
    edited October 2009
    I still haven't figured out what the hell causes the problem. It'll run fine for days sometimes, and then for no reason it'll EUE and sleep for 24 hours after starting a new unit.
  • sgstair Reverse Engineer Redmond, WA
    edited November 2009
    I wonder a bit if it's related to the motherboard chipset.
    The one system I see this on has an Asus M2N-SLI Deluxe (nForce 570 SLI Rev A1 chipset).
    I don't see this problem on an Asus M3A78-EM or a Gigabyte GA-MA78G-DS3H (AMD 780G chipset).
    I also don't see this problem on an x58 chipset board.