EUE and NaN errors
Snarkasm
Madison, WI Icrontian
Everything was humming along smoothly up until last night when, in consecutive runs, I got an EUE, a NaN (ener [25]) and a NaN (ener [20]).
I'm running two instances of SMP with Affinity Changer, and the other instance runs flawlessly through the whole thing. You guys know I overclock, but given how stable it was before this and how cleanly the second instance is still running, I don't think it's that. So, that said, I've burned through 3 WUs with nothing to show for it, and I'm not sure what I should do about it. Restart the client? Leave it and see what happens? Report the offending units somewhere?
Relevant log files quoted below. Thanks for any help.
[00:49:25] Project: 2653 (Run 20, Clone 87, Gen 25)
[00:49:25]
[00:49:25] Entering M.D.
[00:49:32] Rejecting checkpoint
[00:49:33] OPC
[00:49:33] Writing local files
[00:49:33]
[00:49:33] Writing local files
[00:49:34] Extra SSE boost OK.
[00:49:35] Writing local files
[00:49:35] Completed 0 out of 500000 steps (0 percent)
[01:02:27] Writing local files
[01:02:27] Completed 5000 out of 500000 steps (1 percent)
.........
[08:40:53] Writing local files
[08:40:53] Completed 195000 out of 500000 steps (39 percent)
[08:54:07] Writing local files
[08:54:07] Completed 200000 out of 500000 steps (40 percent)
[09:05:38] Gromacs cannot continue further.
[09:05:38] Going to send back what have done.
[09:05:38] logfile size: 8784
[09:05:38] - Writing 9320 bytes of core data to disk...
[09:05:38] ... Done.
[09:05:38] - Failed to delete work/wudata_01.arc
[09:05:38] Warning: check for stray files
[09:07:38]
[09:07:38] Folding@home Core Shutdown: EARLY_UNIT_END
[09:07:38]
[09:07:38] Folding@home Core Shutdown: EARLY_UNIT_END
[09:07:41] CoreStatus = 7B (123)
[09:07:41] Client-core communications error: ERROR 0x7b
[09:07:41] Deleting current work unit & continuing...
[09:09:45] - Preparing to get new work unit...
.........
[09:10:17] Project: 2653 (Run 20, Clone 87, Gen 25)
[09:10:17]
[09:10:18] Entering M.D.
[09:10:18] one 87, Gen 25)
[09:10:18]
[09:10:18] Entering M.D.
[09:10:25] Rejecting checkpoint
[09:10:26] OPC
[09:10:26] Writing local files
[09:10:26]
[09:10:26] Writing local files
[09:10:27] Extra SSE boost OK.
[09:10:27] Writing local files
[09:10:27] Completed 0 out of 500000 steps (0 percent)
[09:23:37] Writing local files
[09:23:37] Completed 5000 out of 500000 steps (1 percent)
.........
[11:01:12] Completed 45000 out of 500000 steps (9 percent)
[11:13:23] Writing local files
[11:13:23] Completed 50000 out of 500000 steps (10 percent)
[11:14:21] Quit 101 - NaN detected: (ener[25])
[11:14:21]
[11:14:21] Simulation instability has been encountered. The run has entered a
[11:14:21] state from which no further progress can be made.
[11:14:21] This may be the correct result of the simulation, however if you
[11:14:21] often see other project units terminating early like this
[11:14:21] too, you may wish to check the stability of your computer (issues
[11:14:21] such as high temperature, overclocking, etc.).
[11:14:21] Going to send back what have done.
[11:14:21] logfile size: 8784
[11:14:21] - Writing 9334 bytes of core data to disk...
[11:14:21] ... Done.
[11:14:21] No C.P. to delete.
[11:14:21]
[11:14:21] Folding@home Core Shutdown: EARLY_UNIT_END
[11:14:21]
[11:14:21] Folding@home Core Shutdown: EARLY_UNIT_END
[11:14:25] CoreStatus = 7B (123)
[11:14:25] Client-core communications error: ERROR 0x7b
[11:14:25] Deleting current work unit & continuing...
..........
[11:17:02] Project: 2653 (Run 20, Clone 87, Gen 25)
[11:17:02]
[11:17:02] Entering M.D.
[11:17:09] Rejecting checkpoint
[11:17:10] OPC
[11:17:10] Writing local files
[11:17:11]
[11:17:11] Writing local files
[11:17:12] Extra SSE boost OK.
[11:17:12] Writing local files
[11:17:12] Completed 0 out of 500000 steps (0 percent)
[11:30:12] Writing local files
[11:30:12] Completed 5000 out of 500000 steps (1 percent)
........
[23:48:20] Completed 310000 out of 500000 steps (62 percent)
[00:00:06] Writing local files
[00:00:06] Completed 315000 out of 500000 steps (63 percent)
[00:07:12] Quit 101 - NaN detected: (ener[20])
[00:07:12]
[00:07:12] Simulation instability has been encountered. The run has entered a
[00:07:12] state from which no further progress can be made.
[00:07:12] This may be the correct result of the simulation, however if you
[00:07:12] often see other project units terminating early like this
[00:07:12] too, you may wish to check the stability of your computer (issues
[00:07:12] such as high temperature, overclocking, etc.).
[00:07:12] Going to send back what have done.
[00:07:12] logfile size: 8784
[00:07:12] - Writing 9334 bytes of core data to disk...
[00:07:12] ... Done.
[00:07:12] - Failed to delete work/wudata_03.arc
[00:07:12] - Failed to delete work/wudata_03.xtc
[00:07:12] Warning: check for stray files
[00:09:12]
[00:09:12] Folding@home Core Shutdown: EARLY_UNIT_END
[00:09:12]
[00:09:12] Folding@home Core Shutdown: EARLY_UNIT_END
[00:09:16] CoreStatus = 7B (123)
[00:09:16] Client-core communications error: ERROR 0x7b
[00:09:16] - Attempting to download new core...
At that point, it does download a new core and successfully engages it, and so far I'm up to 16% on this new run. All three failed runs were the same work unit, Project 2653 (Run 20, Clone 87, Gen 25), so is it possible it's an innate WU issue, or are any of you guys getting through these cleanly? All of them are 0x7b errors, which point to unstable OCing, but like I said, it ran smoothly for days and days before this, and the other instance is still running clean. Each time the core says it's going to send back what it has done, but nothing gets sent; the client just deletes the unit and downloads a new one.
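For anyone wanting to check their own log for the same pattern, here is a rough Python sketch that pulls the failed attempts out of a log excerpt and counts them per work unit. It is not an official tool: the regular expressions are assumptions based only on the message shapes quoted in this thread, and the names `summarize_attempts`, `failures_per_wu` and `SAMPLE` are made up for illustration. It splits on the timestamps rather than on line breaks, so it also copes with a log that was pasted as one long line.

```python
import re
from collections import Counter

# Patterns match only the message shapes quoted in this thread (assumption).
TIMESTAMP = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]")
PROJECT = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
NAN = re.compile(r"NaN detected: \((ener\[\d+\])\)")
STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize_attempts(log_text):
    """Return one record per 'CoreStatus' line, tagged with the WU it ended."""
    parts = TIMESTAMP.split(log_text)
    records = []
    wu = steps = reason = None
    # parts[0] is any text before the first timestamp; after that the list
    # alternates between a timestamp and the message that followed it.
    for time, message in zip(parts[1::2], parts[2::2]):
        if (m := PROJECT.search(message)):
            wu, steps, reason = tuple(map(int, m.groups())), None, None
        elif (m := PROGRESS.search(message)):
            steps = int(m.group(1))
        elif (m := NAN.search(message)):
            reason = "NaN detected: " + m.group(1)
        elif "Gromacs cannot continue" in message:
            reason = "Gromacs cannot continue"
        elif (m := STATUS.search(message)):
            records.append({"time": time, "wu": wu, "last_step": steps,
                            "reason": reason, "status": m.group(1).upper()})
            reason = None
    return records


def failures_per_wu(records):
    """Count recorded attempts per (project, run, clone, gen) tuple."""
    return Counter(r["wu"] for r in records)


# A shortened excerpt of the log above, deliberately left on one line.
SAMPLE = (
    "[00:49:25] Project: 2653 (Run 20, Clone 87, Gen 25) "
    "[08:54:07] Completed 200000 out of 500000 steps (40 percent) "
    "[09:05:38] Gromacs cannot continue further. "
    "[09:07:41] CoreStatus = 7B (123) "
    "[09:10:17] Project: 2653 (Run 20, Clone 87, Gen 25) "
    "[11:13:23] Completed 50000 out of 500000 steps (10 percent) "
    "[11:14:21] Quit 101 - NaN detected: (ener[25]) "
    "[11:14:25] CoreStatus = 7B (123)"
)

if __name__ == "__main__":
    for record in summarize_attempts(SAMPLE):
        print(record)
```

Note that it records every CoreStatus line it sees, so on a full log you would want to filter on the `status` field yourself. If the same project/run/clone/gen tuple keeps turning up while other units finish, that is at least a hint to check whether others have reported that unit.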
Should I be getting credit for any of these? Should I be notifying anybody that this is happening? Do you guys think this is a hardware fault somewhere?
Thanks for your time. I can't wait to put Ubuntu back on this box so I can have a stable smp.
Comments
1- stop your client
2- put qfix in your FAH folder (download it from http://linuxminded.nl/?target=software- ... s.plc#qfix)
3- open a command line window
4- change directory to your FAH folder
5- run qfix
5bis- if you're folding the same WU as the one which failed (same project/run/gen/clone), note the number of the slot you're currently running
6- run your client with the -send all switch to send the recovered work for partial credit
6bis- if you're folding the same WU as the one which failed, delete your current WU by running the client with the flag -delete XX, where XX is the number noted in 5bis
7- close your command line window
8- restart your client the way you normally run it
thanks to totow at the FCO for the partial instructions
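If it helps to see steps 5 through 6bis spelled out, here is a small Python sketch that only builds the command lines in order and does not run anything. The `-send all` and `-delete XX` switches are the ones from the steps above; the function name `recovery_commands`, the client executable name and the assumption that qfix is run with no arguments are placeholders, so check them against your own install before using any of this.

```python
def recovery_commands(client_exe, qfix_exe="qfix", slot_to_delete=None):
    """Build the command lines for steps 5 to 6bis, in order, without running them.

    client_exe and qfix_exe are whatever the executables are called in your
    FAH folder (hypothetical here). slot_to_delete is the slot number noted
    in step 5bis, or None if the client is not folding the failed WU again.
    """
    commands = [
        [qfix_exe],                    # step 5: run qfix from inside the FAH folder
        [client_exe, "-send", "all"],  # step 6: send the recovered work
    ]
    if slot_to_delete is not None:
        # step 6bis: only when the same project/run/gen/clone was downloaded again
        commands.append([client_exe, "-delete", str(slot_to_delete)])
    return commands


if __name__ == "__main__":
    # "client.exe" is a stand-in name, not the real executable.
    for argv in recovery_commands("client.exe", slot_to_delete=3):
        print(" ".join(argv))
```

The commands come back as lists so they could be handed to `subprocess.run` with `cwd` set to the FAH folder, but with the client stopped it is probably safer to type them by hand the first time.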
It probably is just a bad WU. It happens. If you hadn't already performed numerous SMP work unit completions on your overclocked system, I would say that the overclocking has created an unstable system. Are you sure that there aren't any hardware components that have changed specification? Is your system on a UPS? Are you monitoring voltages?
I'm sure you are doing everything properly, but I had to ask.
Leo, I sadly don't yet have a UPS; no voltage settings or hardware options have changed; I'm not watching terribly closely, but my core voltage is stable at where it was when it last got boosted.
Thanks for the thoughts, guys. I'll check out qfix in a couple minutes.
Also, you might want to pay a visit to Foldingforum.org and see if the bad work unit has been reported by others. Actually, SMP units seem to be much more stable than they were a few months ago.
Only time shall tell! Thanks for the help.