EUE and NaN errors

Snarkasm · December 2007

Everything was humming along smoothly up until last night when, in consecutive runs, I got an EUE, a NaN (ener[25]), and a NaN (ener[20]).

I'm running two instances of SMP with Affinity Changer, and the other instance runs flawlessly through the whole thing. You guys know I overclock, but given how stable it was before this and how cleanly the second instance is still running, I don't think it's that. So, that said, I've burned through 3 WUs with nothing to show for it, and I'm not sure what I should do about it. Restart the client? Leave it and see what happens? Report the offending units somewhere?

Relevant log files quoted below. Thanks for any help.

[00:49:25] Project: 2653 (Run 20, Clone 87, Gen 25)
[00:49:25] 
[00:49:25] Entering M.D.
[00:49:32] Rejecting checkpoint
[00:49:33] OPC
[00:49:33] Writing local files
[00:49:33] 
[00:49:33] Writing local files
[00:49:34] Extra SSE boost OK.
[00:49:35] Writing local files
[00:49:35] Completed 0 out of 500000 steps  (0 percent)
[01:02:27] Writing local files
[01:02:27] Completed 5000 out of 500000 steps  (1 percent)
.........
[08:40:53] Writing local files
[08:40:53] Completed 195000 out of 500000 steps  (39 percent)
[08:54:07] Writing local files
[08:54:07] Completed 200000 out of 500000 steps  (40 percent)
[09:05:38] Gromacs cannot continue further.
[09:05:38] Going to send back what have done.
[09:05:38] logfile size: 8784
[09:05:38] - Writing 9320 bytes of core data to disk...
[09:05:38]   ... Done.
[09:05:38] - Failed to delete work/wudata_01.arc
[09:05:38] Warning:  check for stray files
[09:07:38] 
[09:07:38] Folding@home Core Shutdown: EARLY_UNIT_END
[09:07:38] 
[09:07:38] Folding@home Core Shutdown: EARLY_UNIT_END
[09:07:41] CoreStatus = 7B (123)
[09:07:41] Client-core communications error: ERROR 0x7b
[09:07:41] Deleting current work unit & continuing...
[09:09:45] - Preparing to get new work unit...
.........
[09:10:17] Project: 2653 (Run 20, Clone 87, Gen 25)
[09:10:17] 
[09:10:18] Entering M.D.
[09:10:18] one 87, Gen 25)
[09:10:18] 
[09:10:18] Entering M.D.
[09:10:25] Rejecting checkpoint
[09:10:26] OPC
[09:10:26] Writing local files
[09:10:26] 
[09:10:26] Writing local files
[09:10:27] Extra SSE boost OK.
[09:10:27] Writing local files
[09:10:27] Completed 0 out of 500000 steps  (0 percent)
[09:23:37] Writing local files
[09:23:37] Completed 5000 out of 500000 steps  (1 percent)
.........
[11:01:12] Completed 45000 out of 500000 steps  (9 percent)
[11:13:23] Writing local files
[11:13:23] Completed 50000 out of 500000 steps  (10 percent)
[11:14:21] Quit 101 - NaN detected: (ener[25])
[11:14:21] 
[11:14:21] Simulation instability has been encountered. The run has entered a
[11:14:21]   state from which no further progress can be made.
[11:14:21] This may be the correct result of the simulation, however if you
[11:14:21]   often see other project units terminating early like this
[11:14:21]   too, you may wish to check the stability of your computer (issues
[11:14:21]   such as high temperature, overclocking, etc.).
[11:14:21] Going to send back what have done.
[11:14:21] logfile size: 8784
[11:14:21] - Writing 9334 bytes of core data to disk...
[11:14:21]   ... Done.
[11:14:21] No C.P. to delete.
[11:14:21] 
[11:14:21] Folding@home Core Shutdown: EARLY_UNIT_END
[11:14:21] 
[11:14:21] Folding@home Core Shutdown: EARLY_UNIT_END
[11:14:25] CoreStatus = 7B (123)
[11:14:25] Client-core communications error: ERROR 0x7b
[11:14:25] Deleting current work unit & continuing...
..........
[11:17:02] Project: 2653 (Run 20, Clone 87, Gen 25)
[11:17:02] 
[11:17:02] Entering M.D.
[11:17:09] Rejecting checkpoint
[11:17:10] OPC
[11:17:10] Writing local files
[11:17:11] 
[11:17:11] Writing local files
[11:17:12] Extra SSE boost OK.
[11:17:12] Writing local files
[11:17:12] Completed 0 out of 500000 steps  (0 percent)
[11:30:12] Writing local files
[11:30:12] Completed 5000 out of 500000 steps  (1 percent)
........
[23:48:20] Completed 310000 out of 500000 steps  (62 percent)
[00:00:06] Writing local files
[00:00:06] Completed 315000 out of 500000 steps  (63 percent)
[00:07:12] Quit 101 - NaN detected: (ener[20])
[00:07:12] 
[00:07:12] Simulation instability has been encountered. The run has entered a
[00:07:12]   state from which no further progress can be made.
[00:07:12] This may be the correct result of the simulation, however if you
[00:07:12]   often see other project units terminating early like this
[00:07:12]   too, you may wish to check the stability of your computer (issues
[00:07:12]   such as high temperature, overclocking, etc.).
[00:07:12] Going to send back what have done.
[00:07:12] logfile size: 8784
[00:07:12] - Writing 9334 bytes of core data to disk...
[00:07:12]   ... Done.
[00:07:12] - Failed to delete work/wudata_03.arc
[00:07:12] - Failed to delete work/wudata_03.xtc
[00:07:12] Warning:  check for stray files
[00:09:12] 
[00:09:12] Folding@home Core Shutdown: EARLY_UNIT_END
[00:09:12] 
[00:09:12] Folding@home Core Shutdown: EARLY_UNIT_END
[00:09:16] CoreStatus = 7B (123)
[00:09:16] Client-core communications error: ERROR 0x7b
[00:09:16] - Attempting to download new core...

At that point, it does download a new core and successfully engages it, and so far I'm up to 16% on this new run. All the runs are 2653s, (20, 87, 25), so is it possible it's an innate WU issue, or are any of you guys getting through these cleanly? All of them are 0x7b errors, which point to unstable OCing, but like I said, it ran for days and days before this smoothly, and the other instance is still running clean. They all say they'll report it to the server and send them the finished work, but nothing gets sent, it just deletes and downloads a new one.
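One way to check whether the failures all hit the same work unit is to group them by the Project/Run/Clone/Gen line that precedes each one. The following is a minimal sketch, assuming the log format is exactly what is quoted above; the function name and the failure labels are made up for illustration and are not part of the client.

```python
import re
from collections import defaultdict

# Patterns taken only from the log lines quoted in this thread.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
NAN_RE = re.compile(r"NaN detected: \((ener\[\d+\])\)")
CANNOT_CONTINUE = "Gromacs cannot continue further"


def failures_by_work_unit(log_lines):
    """Map each (project, run, clone, gen) to the failures logged under it."""
    failures = defaultdict(list)
    current = None  # work unit named by the most recent "Project:" line
    for line in log_lines:
        match = PROJECT_RE.search(line)
        if match:
            current = tuple(int(group) for group in match.groups())
            continue
        if current is None:
            continue  # ignore anything logged before the first work unit
        nan = NAN_RE.search(line)
        if nan:
            failures[current].append("NaN " + nan.group(1))
        elif CANNOT_CONTINUE in line:
            failures[current].append("Gromacs cannot continue")
    return dict(failures)


if __name__ == "__main__":
    # Hypothetical path: point this at wherever the client writes its log.
    with open("FAHlog.txt") as handle:
        for unit, events in failures_by_work_unit(handle).items():
            print(unit, events)
```

If every failure lands under a single key, the same work unit was reassigned each time, which fits a bad WU at least as well as an unstable overclock.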

Should I be getting credit for any of these? Should I be notifying anybody that this is happening? Do you guys think this is a hardware fault somewhere?

Thanks for your time. I can't wait to put Ubuntu back on this box so I can have a stable smp.

SPIKE09 · December 2007

Run qfix from a command line in the directory where FAH SMP is located. It will allow the partial result to be sent and will possibly alert the project to a problem.

1- stop your client
2- put qfix in your FAH folder ( download it from http://linuxminded.nl/?target=software- ... s.plc#qfix )
3- open a command line window
4- Change Directory to your FAH folder
5- run qfix
5bis- if you're folding the same WU as the one which failed (same project/run/gen/clone), note the number of the slot you're currently running
6- run your client with the -send all switch to send the recovered partial credit
6bis- if you're folding the same WU as the one which failed, delete your current WU by running the client with flag -delete XX, where XX is the number noted on 5bis-
7- close your command line window
8- restart your client like you used to run it
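
The command-line part of the steps above (5 through 6bis) can be sketched as a small dry-run helper that only builds the commands in order and does not execute anything. The client executable name is a placeholder, since the thread never names it; the -send all and -delete switches are the ones given in the steps, and the slot is passed through exactly as noted in step 5bis.

```python
def recovery_commands(client_exe, duplicate_slot=None):
    """Build the commands for steps 5, 6 and (optionally) 6bis, in order.

    client_exe     -- name of the FAH client executable (an assumption:
                      use whatever the client is called on your machine)
    duplicate_slot -- slot noted in step 5bis, as a string, or None if the
                      currently running WU is not the one that failed
    """
    commands = [
        ["qfix"],                     # step 5: repair the queue
        [client_exe, "-send", "all"], # step 6: send the recovered partial result
    ]
    if duplicate_slot is not None:
        # step 6bis: drop the re-downloaded copy of the failed WU
        commands.append([client_exe, "-delete", str(duplicate_slot)])
    return commands


if __name__ == "__main__":
    # "fah_client.exe" and the slot value are placeholders for illustration.
    for command in recovery_commands("fah_client.exe", duplicate_slot="XX"):
        print(" ".join(command))
```

Printing the commands first makes it easy to confirm the order before typing them into the command window with the client stopped.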

thanks to totow at the FCO for the partial instructions

Leonardo · December 2007

I have no experience with Qfix, so please pay close attention to Spike's advice.

It probably is just a bad WU. It happens. If you hadn't already performed numerous SMP work unit completions on your overclocked system, I would say that the overclocking has created an unstable system. Are you sure that there aren't any hardware components that have changed specification? Is your system on a UPS? Are you monitoring voltages?

I'm sure you are doing everything properly, but I had to ask.

Snarkasm · December 2007

Spike - Any idea if that's going to work given it's run through at least 3 iterations? I don't know if they're all the exact same unit (is that what the run/clone/gen numbers are?) or if it's moved on already. Since the EUE was on the first run, I don't know if it matters.

Leo, I sadly don't yet have a UPS; no voltage settings or hardware options have changed; I'm not watching terribly closely, but my core voltage is stable at where it was when it last got boosted.

Thanks for the thoughts, guys. I'll check out qfix in a couple minutes.

Leonardo · December 2007

Sometimes, no matter what we do, a bad work unit will slip through the cracks. It's either that or you've had some power spikes or dips. And yes, it is possible for one SMP from a dual client setup to experience an EUE with the other unit continuing on normally.

Also, you might want to pay a visit to Foldingforum.org and see if the bad work unit has been reported by others. Actually, SMP units seem to be much more stable than they were a few months ago.

Snarkasm · December 2007

Yeah, that's what I'd heard. I'll give it a little more time and see if it continues. I also just restarted, we'll see if that has any effect.

Leonardo · December 2007

If this problem proves to be a pattern, then you can pretty much rule out bad work units as the cause. If the problem persists, it will be a good indicator that you do not have a stable overclock or you have hardware that's not functioning correctly.

Snarkasm · December 2007

That's the obvious conclusion. It was rock solid Prime95 stable, and running smoothly prior to this. Maybe my room temperature is fluctuating and causing it. I dunno.

Only time shall tell! Thanks for the help.

SPIKE09 · December 2007

Yes, it should work fine. Open a command prompt and off you go. The recommendation for folding stability has always been Prime/Orthos stable, then back off 5% for FAH stable.
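
As a worked example of that rule of thumb, here is a tiny sketch. It assumes the 5% is taken off the clock speed that passed Prime/Orthos, which the post does not spell out; the function name and the sample clock are made up for illustration.

```python
def fah_stable_clock(stress_stable_mhz, backoff=0.05):
    """Clock to run FAH at, backing off from the stress-test-stable clock."""
    return stress_stable_mhz * (1.0 - backoff)


if __name__ == "__main__":
    # Hypothetical overclock that passed Prime/Orthos at 3000 MHz.
    print(round(fah_stable_clock(3000)))
```

For a 3000 MHz stress-test-stable overclock, that works out to roughly 2850 MHz for folding.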
