SMP Risks

Qeldroma Arid ZoneAh Member
edited September 2007 in Folding@Home
Well, a worst-case scenario happened to me.

Left my machines running, went to work for the day, and the AC decided to crap out early on a 107F-plus, humid corker. One of my machines EUE'd going down, another looks like it's lost the WU (see attached log- never seen this one before), and another shut down and now has no prayer of making the deadline. Likely three 1760-point WUs (5280 points) lost, and I'm out more until the AC is fixed and one machine can be diagnosed.

Moral of story- SMP WUs are pretty touchy and have (AFAIC- too) short deadlines.


Weird FAHLog.txt:

[00:07:23] Completed 500000 out of 500000 steps (100 percent)
[00:07:23] Writing final coordinates.
[00:07:24] Past main M.D. loop
[00:07:24] Will end MPI now
[00:08:24]
[00:08:24] Finished Work Unit:
[00:08:24] - Reading up to 3724128 from "work/wudata_04.arc": Read 3724128
[00:08:24] - Reading up to 1938028 from "work/wudata_04.xtc": Read 1938028
[00:08:24] goefile size: 0
[00:08:24] logfile size: 60873
[00:08:24] Leaving Run
[00:08:27] - Writing 5727429 bytes of core data to disk...
[00:08:27] ... Done.
[00:08:27] - Failed to delete work/wudata_04.sas
[00:08:27] - Failed to delete work/wudata_04.goe
[00:08:27] Warning: check for stray files
[00:08:27] - Shutting down core

Folding@Home Client Shutdown at user request.

Folding@Home Client Shutdown.

--- Opening Log file [September 13 00:14:06]

# SMP Client ##################################################################
###############################################################################

Folding@Home Client Version 5.91beta4

http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: E:\FoldingSMP
Executable: E:\FoldingSMP\fah.exe

[00:14:06] - Ask before connecting: No
[00:14:06] - User name: QelDroma (Team 93)
[00:14:06] - User ID: 29BBB61E2F9CA704
[00:14:06] - Machine ID: 1
[00:14:06]
[00:14:06] Loaded queue successfully.
[00:14:06]
[00:14:06] + Processing work unit
[00:14:06] Core required: FahCore_a1.exe
[00:14:06] Core found.
[00:14:07] Working on Unit 04 [September 13 00:14:07]
[00:14:07] + Working ...
[00:14:30]
[00:14:30] *
*
[00:14:30] Folding@Home Gromacs SMP Core
[00:14:30] Version 1.74 (March 10, 2007)
[00:14:30]
[00:14:30] Preparing to commence simulation
[00:14:30] - Ensuring status. Please wait.
[00:14:47] - Looking at optimizations...
[00:14:47] - Working with standard loops on this execution.
[00:14:47] - Created dyn
[00:14:47] - Files status OK
[00:14:47]
[00:14:47] Folding@home Core Shutdown: MISSING_WORK_FILES
[00:14:47] Finalizing output
[00:14:47] OK
[00:16:47]
[00:16:47] Folding@home Core Shutdown: MISSING_WORK_FILES
[00:16:47] Finalizing output
[00:16:50] CoreStatus = 1 (1)
[00:16:50] Client-core communications error: ERROR 0x1
[00:16:50] Deleting current work unit & continuing...


From FAH Wiki:

The 0x0 and 0x1 errors are unknown errors - all errors that are known will end with some other error code and message, but those errors that Pande Group hasn't seen before or did not know about, will end with error 0x0 or 0x1.

Note: The WU data of an unknown error can not be trusted and by definition you'll never get any credit for it. If the 0x0 and 0x1 error cause is identified and classified as some sort of EUE then you'll start getting credit for such WUs. One possible cause of errors 0x1 and 0x0 is a hardware failure (which is why the software is unable to classify them). If a RAM failure is detected by the OS or for some reason the program wishes to allocate more memory and the OS refuses, the OS will terminate FAHcore_* and the client will no longer be able to communicate with the FAHcore producing Client-core communications error: ERROR 0x1



I think I'll run Memtest for a while!

Sorry, gang- this sux- but should be back on line by the weekend.
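
For anyone who wants to check their own FAHLog.txt for the same pattern, here is a rough Python sketch that pulls out the lines that look like trouble. It is only an illustration: the marker strings are copied from the log above and are not a complete list of client errors, and the function name find_trouble is made up for this example, not part of any F@H tool.

```python
import re

# Matches client log lines of the form "[HH:MM:SS] message".
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s?(.*)$")

# Substrings copied from the log in this post; an assumption, not an
# exhaustive list of what the client can report.
MARKERS = (
    "Core Shutdown:",
    "Client-core communications error",
    "Failed to delete",
    "Deleting current work unit",
)


def find_trouble(log_text):
    """Return (timestamp, message) pairs for lines containing a marker."""
    hits = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # banner lines, blank lines, etc.
        stamp, msg = match.groups()
        if any(marker in msg for marker in MARKERS):
            hits.append((stamp, msg.strip()))
    return hits


# A few lines lifted from the log above.
sample = """\
[00:08:27] - Failed to delete work/wudata_04.sas
[00:14:47] Folding@home Core Shutdown: MISSING_WORK_FILES
[00:14:47] Finalizing output
[00:16:50] Client-core communications error: ERROR 0x1
[00:16:50] Deleting current work unit & continuing...
"""

for stamp, msg in find_trouble(sample):
    print(stamp, msg)
```

To run it against a real log, read the file into a string first and pass that to find_trouble instead of the sample.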

Comments

  • mmonnin Centreville, VA
    edited September 2007
    I lose at least that much every week. I used to run at around 2700-2800 points per day but I average around 2300 now. Reduced clocks and all, but EUEs have been coming in more steadily over the past couple of months.

    Losing WUs is a common thing now...
  • Qeldroma Arid ZoneAh Member
    edited September 2007
    mmonnin wrote:
    I lose at least that much every week. I used to run at around 2700-2800 points per day but I average around 2300 now. Reduced clocks and all, but EUEs have been coming in more steadily over the past couple of months.

    Losing WUs is a common thing now...

    Wow, sir- that's a lot. These were the first EUEs I've had since May- but unlike you I apparently did not receive any points for these :( .

    Any idea what's up with yours?
  • mmonnin Centreville, VA
    edited September 2007
    No points for SMP EUEs....
  • edcentric near Milwaukee, Wisconsin Icrontian
    edited September 2007
    I have two boxes running in my office. Every time IT tweaks the network they EUE, but oh the points.
  • SPIKE09 Scatland
    edited September 2007
    @Qeldroma: the first problem is probably due to the known issue that, even though the log says the WU finished, the FahCores will still work away for up to 10 minutes afterwards. I've never had a problem since I started just leaving it until it fails to send, then Ctrl-C, connect, and restart.
    Been there, done that: in this situation it dumped the WU and refused to start from the 99%-complete backup.
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited September 2007
    For a while, I was losing about 25% of the work units. My goodness, they are so fragile. In my case, it was network adapters that weren't steady enough. For all other purposes, the network adapters (wireless G) worked just fine. Now that I've fixed the networking 'problem', unit completion for me is 100%, as long as I remember to shut the units down correctly if I need to reboot.

    I think current units are less stable than WUs three and four months ago. I never had much of any problems with WinSMP until about one month ago.
  • Qeldroma Arid ZoneAh Member
    edited September 2007
    Well, replaced a radiator fan and got the AC running again and I'm all back up. Fortunately, my son's quad-core snuck out a WU before it went down and, get this, my 5 year-old AMD 3000 XP did not even shut down despite it folding 100% in about 120F ambient for a couple of hours (locally by the machine it was even hotter), and it turned in a WU (not SMP) this morning after getting everything back up. Tough critter- it will have a worthy replacement when the time comes :) .

    All the rigs passed diags- but the proof will be in the next day or two.

    Spike- makes sense.

    Now to try and lop off 30F of desert temps in the next couple of months ...
  • mmonnin Centreville, VA
    edited September 2007
    Leonardo wrote:
    I think current units are less stable than WUs three and four months ago. I never had much of any problems with WinSMP until about one month ago.

    Same here, Leo. And it always seems to be at like 70-80%+ of the way through the WU that it goes south. There is just no consistency to them. I can complete 5 in a row, and then 1 will fail. Sometimes it won't make it more than a few % through, and it will do that several times and then download a new core.
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited September 2007
    Starting about two weeks ago, after I ironed out the last network instabilities (at least what is unstable for Win SMP), I have not lost a unit out of four Win SMP client computers on the network.
  • Ultra-Nexus Buenos Aires, ARG
    edited September 2007
    Try a QFIX on that queue.dat and see if you can recover that WU. :)
  • Qeldroma Arid ZoneAh Member
    edited September 2007
    Try a QFIX on that queue.dat and see if you can recover that WU. :)

    You can find it at FAH Tools
  • Winga Mr South Africa Icrontian
    edited September 2007
    Well I got through 43% of my 3rd SMP WU and it went south on me.
    I was fiddling around with some software at the time and it froze. When I restarted the PC the WU started from 0% again.

    I love the points that SMP gives but I get frustrated that I can't use my computer to its full potential for fear of borking the WU. Kinda takes the fun out of it.
  • Ultra-Nexus Buenos Aires, ARG
    edited September 2007
    Are you overclocking?
  • Winga Mr South Africa Icrontian
    edited September 2007
    Are you overclocking?

    I did have it OC'd but reverted to stock speeds before installing the SMP client.
  • Ultra-Nexus Buenos Aires, ARG
    edited September 2007
    I stopped having these WU resets after a freeze or reboot by placing the F@H dir with all its contents in another partition different than C:

    That's because the WU's data files are sensitive to a CRC check, so when the OS crashes or reboots improperly, the check disk procedure usually corrupts the WU CRC data. This does not happen if the WU is on another partition AND it was not saving data at the time of the crash.

    Try it out! :)
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited September 2007
    I love the points that SMP gives but I get frustrated that I can't use my computer to its full potential for fear of borking the WU. Kinda takes the fun out of it.
    After a while, you'll get used to taking a bit more care when messing with network settings and restarting/shutting down the computer. It'll be second nature before you know it. It doesn't bother me any more.
    I stopped having these WU resets after a freeze or reboot by placing the F@H dir with all its contents in another partition different than C: That's because the WU's data files are sensitive to a CRC check, so when the OS crashes or reboots improperly, the check disk procedure usually corrupts the WU CRC data. This does not happen if the WU is on another partition AND it was not saving data at the time of the crash.
    Interesting, I hadn't even thought of that. All my main computers (sig) have almost all non-OS stuff on a separate partition. I've been doing that since the days of unstable Win98 first edition. I've continued the practice of keeping the C:\ OS partition as OS exclusive as possible. But with that said, machines on which I've run Folding on the OS partition - laptops, have experienced no higher frequency of burned units than the other computers. The laptops have run single and SMP Windows Folding clients.
  • Winga Mr South Africa Icrontian
    edited September 2007
    I stopped having these WU resets after a freeze or reboot by placing the F@H dir with all its contents in another partition different than C:

    That's because the WU's data files are sensitive to a CRC check, so when the OS crashes or reboots improperly, the check disk procedure usually corrupts the WU CRC data. This does not happen if the WU is on another partition AND it was not saving data at the time of the crash.

    Try it out! :)
    Sounds like a brilliant idea. I'm a third of the way through my current WU, so when and if it finishes the race without falling over I will move the folder as you suggested.
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited September 2007
    "CRC"

    ?
  • SPIKE09 Scatland
    edited September 2007
    Leonardo wrote:
    "CRC"

    ?
    "Cyclic Redundancy Check" check :D
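
For anyone else wondering what a CRC actually does, here is a small Python sketch of the general idea: compute a checksum when the data is written, recompute it later, and treat any mismatch as corruption. This only illustrates CRC-32 using Python's standard zlib module. Which checksum the F@H client uses and how it stores it are not stated in this thread, and the "checkpoint" bytes below are made up for the example.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """CRC-32 of a byte string as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Hypothetical stand-in for a work file; not real WU data.
original = b"checkpoint: step 215000 of 500000"
stored = crc32_of(original)  # saved alongside the data when it was written

# Simulate damage from an unclean shutdown by flipping a single bit.
damaged = bytearray(original)
damaged[5] ^= 0x01
damaged = bytes(damaged)

print("original matches stored CRC:", crc32_of(original) == stored)
print("damaged matches stored CRC: ", crc32_of(damaged) == stored)
```

The first check passes and the second fails, since a single flipped bit always changes a CRC-32. That is why one half-written file can be enough for a client to reject a checkpoint.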