SMP Risks
Qeldroma
Arid ZoneAh Member
Well, a worst-case scenario happened to me.
Left my machines running, went to work for the day, and the AC decided to crap out early on a 107F+ humid corker. One of my machines EUE'd going down, another looks like it lost the WU (see attached log- never seen this one before), and another shut down and now has no prayer of making the deadline. Likely three 1760-point WUs- 5280 points- lost, and I'm out more until the AC is fixed and one machine can be diagnosed.
Moral of story- SMP WUs are pretty touchy and have (AFAIC- too) short deadlines.
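For anyone wanting to check whether a stalled WU can still make its deadline, here is a rough sketch of the arithmetic. It assumes progress is roughly linear in time, and the function names and all the dates below are hypothetical, not taken from the client or from any real project deadline.

```python
from datetime import datetime, timedelta

def projected_finish(start, now, percent_done):
    """Extrapolate the finish time, assuming progress is linear in wall-clock time."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    elapsed = now - start
    # Total expected run time = elapsed time scaled up to 100 percent.
    total = elapsed * (100 / percent_done)
    return start + total

def will_make_deadline(start, now, percent_done, deadline):
    """True if the extrapolated finish time is on or before the deadline."""
    return projected_finish(start, now, percent_done) <= deadline

# Hypothetical numbers: a WU downloaded at midnight with a 4-day deadline.
start = datetime(2007, 9, 12, 0, 0)
deadline = start + timedelta(days=4)

# Halfway done after one day of folding: comfortably on schedule.
print(will_make_deadline(start, start + timedelta(hours=24), 50, deadline))
# Still halfway done after three days (machine was down): too late.
print(will_make_deadline(start, start + timedelta(hours=72), 50, deadline))
```

Downtime counts against the average rate here, which is pessimistic once the machine is back up, but it gives a quick read on whether a WU is worth finishing.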
Weird FAHLog.txt:
[00:07:23] Completed 500000 out of 500000 steps (100 percent)
[00:07:23] Writing final coordinates.
[00:07:24] Past main M.D. loop
[00:07:24] Will end MPI now
[00:08:24]
[00:08:24] Finished Work Unit:
[00:08:24] - Reading up to 3724128 from "work/wudata_04.arc": Read 3724128
[00:08:24] - Reading up to 1938028 from "work/wudata_04.xtc": Read 1938028
[00:08:24] goefile size: 0
[00:08:24] logfile size: 60873
[00:08:24] Leaving Run
[00:08:27] - Writing 5727429 bytes of core data to disk...
[00:08:27] ... Done.
[00:08:27] - Failed to delete work/wudata_04.sas
[00:08:27] - Failed to delete work/wudata_04.goe
[00:08:27] Warning: check for stray files
[00:08:27] - Shutting down core
Folding@Home Client Shutdown at user request.
Folding@Home Client Shutdown.
--- Opening Log file [September 13 00:14:06]
# SMP Client ##################################################################
###############################################################################
Folding@Home Client Version 5.91beta4
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: E:\FoldingSMP
Executable: E:\FoldingSMP\fah.exe
[00:14:06] - Ask before connecting: No
[00:14:06] - User name: QelDroma (Team 93)
[00:14:06] - User ID: 29BBB61E2F9CA704
[00:14:06] - Machine ID: 1
[00:14:06]
[00:14:06] Loaded queue successfully.
[00:14:06]
[00:14:06] + Processing work unit
[00:14:06] Core required: FahCore_a1.exe
[00:14:06] Core found.
[00:14:07] Working on Unit 04 [September 13 00:14:07]
[00:14:07] + Working ...
[00:14:30]
[00:14:30] *
*
[00:14:30] Folding@Home Gromacs SMP Core
[00:14:30] Version 1.74 (March 10, 2007)
[00:14:30]
[00:14:30] Preparing to commence simulation
[00:14:30] - Ensuring status. Please wait.
[00:14:47] - Looking at optimizations...
[00:14:47] - Working with standard loops on this execution.
[00:14:47] - Created dyn
[00:14:47] - Files status OK
[00:14:47]
[00:14:47] Folding@home Core Shutdown: MISSING_WORK_FILES
[00:14:47] Finalizing output
[00:14:47] OK
[00:16:47]
[00:16:47] Folding@home Core Shutdown: MISSING_WORK_FILES
[00:16:47] Finalizing output
[00:16:50] CoreStatus = 1 (1)
[00:16:50] Client-core communications error: ERROR 0x1
[00:16:50] Deleting current work unit & continuing...
From FAH Wiki:
The 0x0 and 0x1 errors are unknown errors - all errors that are known will end with some other error code and message, but those errors that Pande Group hasn't seen before or did not know about, will end with error 0x0 or 0x1.
Note: The WU data of an unknown error can not be trusted and by definition you'll never get any credit for it. If the 0x0 and 0x1 error cause is identified and classified as some sort of EUE then you'll start getting credit for such WUs. One possible cause of errors 0x1 and 0x0 is a hardware failure (which is why the software is unable to classify them). If a RAM failure is detected by the OS or for some reason the program wishes to allocate more memory and the OS refuses, the OS will terminate FAHcore_* and the client will no longer be able to communicate with the FAHcore producing Client-core communications error: ERROR 0x1
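If you have several machines to check, a small script can pull the interesting lines out of each FAHLog.txt. This is only a sketch based on the line formats visible in the log above; the function name and the event labels are made up for illustration, and other client versions may word these messages differently.

```python
import re

# Client log lines start with a timestamp such as "[00:16:50] ".
TIMESTAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s*(.*)$")
CORE_SHUTDOWN = re.compile(r"Folding@home Core Shutdown:\s*(\S+)", re.IGNORECASE)
COMM_ERROR = re.compile(r"Client-core communications error: ERROR (0x[0-9a-fA-F]+)")

def scan_log(lines):
    """Return (timestamp, kind, detail) tuples for lines worth a second look.

    Every core shutdown status is reported; the caller decides which
    statuses count as failures.
    """
    events = []
    for line in lines:
        match = TIMESTAMP.match(line.strip())
        if not match:
            continue  # banner lines and other text without a timestamp
        stamp, text = match.groups()
        shutdown = CORE_SHUTDOWN.search(text)
        if shutdown:
            events.append((stamp, "core_shutdown", shutdown.group(1)))
            continue
        comm = COMM_ERROR.search(text)
        if comm:
            events.append((stamp, "comm_error", comm.group(1)))
        elif text.startswith("- Failed to delete"):
            # The last token on the line is the file that could not be removed.
            events.append((stamp, "stray_file", text.split()[-1]))
    return events

if __name__ == "__main__":
    # Assumes FAHLog.txt sits in the current directory.
    with open("FAHLog.txt", encoding="utf-8", errors="replace") as handle:
        for event in scan_log(handle):
            print(*event)
```

Run against the log above, this would flag the two stray files from the shutdown, both MISSING_WORK_FILES shutdowns, and the final 0x1 error.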
I think I'll run Memtest for a while!
Sorry, gang- this sux- but should be back on line by the weekend.
Comments
Losing WUs is a common thing now...
Wow, sir- that's a lot. These were the first EUEs I've had since May- but unlike you I apparently did not receive any points for these.
Any idea what's up with yours?
Been there, done that. In this situation it dumped the WU and refused to start from the 99% complete backup.
I think current units are less stable than WUs three and four months ago. I never had much of any problem with WinSMP until about one month ago.
All the rigs passed diags- but the proof will be in the next day or two.
Spike- makes sense.
Now to try and lop off 30F of desert temps in the next couple of months ...
Same here, Leo. And it always seems to be at like 70-80%+ of the way through the WU that it goes south. There is just no consistency to them. I can complete 5 in a row, and then 1 will fail. Sometimes it won't make it more than a few % through, and it will do that several times and download a new core.
You can find it at FAH Tools
I was fiddling around with some software at the time and it froze. When I restarted the PC the WU started from 0% again.
I love the points that SMP gives, but I get frustrated that I can't use my computer to its full potential for fear of borking the WU. Kinda takes the fun out of it.
I did have it OC'd but reverted to stock speeds before installing the SMP client.
That's because the WU's data files are sensitive to a CRC check, so when an OS crashes or reboots improperly, the check disk procedure usually corrupts the WU's CRC data. This does not happen if the WU is on another partition AND it was not saving data at the time of the crash.
Try it out!
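To show why a CRC check is so unforgiving, here is a minimal sketch using Python's standard zlib.crc32. It is not the client's actual check (the file layout and the checksum the core really uses are not documented in this thread); the byte string below just stands in for a WU data file.

```python
import zlib

def crc32_of(data: bytes) -> int:
    """CRC-32 of the data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF

def is_intact(data: bytes, expected_crc: int) -> bool:
    """True if the data still matches the checksum recorded earlier."""
    return crc32_of(data) == expected_crc

# Stand-in for a checkpoint file; the real contents are binary simulation data.
checkpoint = b"frame 1234 coordinates " * 100
recorded = crc32_of(checkpoint)

# Simulate damage from a crash or a disk repair: flip one bit in one byte.
damaged = bytearray(checkpoint)
damaged[500] ^= 0x01
damaged = bytes(damaged)

print(is_intact(checkpoint, recorded))  # undamaged data passes
print(is_intact(damaged, recorded))     # a single flipped bit fails the check
```

A single changed bit is enough to fail the comparison, which fits with a WU being dumped after an unclean shutdown even when nearly all of its data survived.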