SMP hanging at FINISHED_UNIT and WU is lost, solution
lsevald
Norway Icrontian
I don't know if this is an isolated problem with one of my computers. It has happened to me a few times, but only on my main workstation. But I thought I might as well post my findings in case someone else encounters this problem.
After completing a WU, the process just stops and sits there for hours:
[SIZE="2"][15:47:55] Project: 2653 (Run 18, Clone 153, Gen 0)
[15:47:55]
[15:47:56] Assembly optimizations on if available.
[15:47:56] Entering M.D.
[15:48:06] Rejecting checkpoint
[15:48:07] ProtWriting local files
[15:48:07] Extra SSE boost OK.
[15:48:08] Writing local files
[15:48:08] Completed 0 out of 500000 steps (0 percent)
[SNIP]
[05:58:45] Completed 495000 out of 500000 steps (99 percent)
[06:06:56] Writing local files
[06:06:56] Completed 500000 out of 500000 steps (100 percent)
[06:06:56] Writing final coordinates.
[06:06:57] Past main M.D. loop
[06:06:57] Will end MPI now
[06:07:57]
[06:07:57] Finished Work Unit:
[06:07:57] - Reading up to 3724560 from "work/wudata_06.arc": Read 3724560
[06:07:57] - Reading up to 1781612 from "work/wudata_06.xtc": Read 1781612
[06:07:57] goefile size: 0
[06:07:57] logfile size: 18086
[06:07:57] Leaving Run
[06:07:58] - Writing 5528658 bytes of core data to disk...
[06:07:58] ... Done.
[06:07:58] - Failed to delete work/wudata_06.sas
[06:07:58] - Failed to delete work/wudata_06.goe
[06:07:58] Warning: check for stray files
[06:07:58] - Shutting down core
[06:07:58]
[06:07:58] Folding@home Core Shutdown: FINISHED_UNIT
[06:07:58]
[06:07:58] Folding@home Core Shutdown: FINISHED_UNIT
Folding@Home Client Shutdown at user request. (Shut down manually after 2hrs just hanging there)
Folding@Home Client Shutdown.[/SIZE]
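The symptom in the log above (a FINISHED_UNIT core shutdown with no upload afterwards) can be spotted automatically. The following is only a rough sketch: the function name is made up for illustration, and the two strings it looks for are simply the ones that appear in the logs in this post.

```python
def finished_but_not_sent(log_text: str) -> bool:
    """Return True if the log shows a finished unit with no later upload success.

    Hypothetical helper: it only matches the two log strings quoted in this
    thread and knows nothing else about the client.
    """
    pending = False
    for line in log_text.splitlines():
        if "Core Shutdown: FINISHED_UNIT" in line:
            # The core reported a completed work unit.
            pending = True
        elif "Results successfully sent" in line:
            # A later upload succeeded, so nothing is left hanging.
            pending = False
    return pending
```

To use it, read the client's log file into a string and pass it in. A True result only means the log looks like the one above; it is not proof that the queue is damaged.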
I tried making a backup of the FAH client folder (before shutting it down the first time) and restoring it, but no matter what I do, when I restart the client (in the hope that it will pick up and send the completed WU) I get:
[SIZE="2"][09:05:34] - Ask before connecting: No
[09:05:34] - User name: lsevald(icrontic) (Team 93)
[09:05:34] - User ID: 2AAEB20F36956D74
[09:05:34] - Machine ID: 1
[09:05:34]
[09:05:34] Loaded queue successfully.
[09:05:34]
[09:05:34] + Processing work unit
[09:05:34] Core required: FahCore_a1.exe
[09:05:34] Core found.
[09:05:34] Working on Unit 06 [November 11 09:05:34]
[09:05:34] + Working ...
[09:05:35]
[09:05:35] *------------------------------*
[09:05:35] Folding@Home Gromacs SMP Core
[09:05:35] Version 1.74 (March 10, 2007)
[09:05:35]
[09:05:35] Preparing to commence simulation
[09:05:35] - Ensuring status. Please wait.
[09:05:52] - Assembly optimizations manually forced on.
[09:05:52] - Not checking prior termination.
[09:05:52]
[09:05:52] Folding@home Core Shutdown: MISSING_WORK_FILES
[09:05:52] Finalizing output
Folding@Home Client Shutdown at user request.
Folding@Home Client Shutdown.[/SIZE]
And if I try using the -send all switch, it just exits like there's nothing to send:
[SIZE="2"]Launch directory: D:\Program Files (x86)\FAH Windows SMP Client V1.01
Executable: fah
Arguments: -send all
[09:07:23] - Ask before connecting: No
[09:07:23] - User name: lsevald(icrontic) (Team 93)
[09:07:23] - User ID: 2AAEB20F36956D74
[09:07:23] - Machine ID: 1
[09:07:23]
[09:07:23] Loaded queue successfully.
[09:07:23] Attempting to return result(s) to server...
Folding@Home Client Shutdown.[/SIZE]
Inspecting the client work folder revealed that there was a results*.dat file, so why won't it send? My only idea was that something was wrong with the queue.dat file (it holds the work queue status). A Google search led me to this site. First I tried the qfix.exe utility, but that didn't work (though it did indicate I had a finished WU in the queue). Afterwards I tried qgen.exe. I made a backup of the client folder, and saved qgen.exe there. Then I renamed client.cfg to client.old, and queue.dat to queue.old. Running qgen.exe from a command prompt returned this:
[SIZE="2"]C:\Users\lsevald\Desktop\FAH Windows SMP Client V1.01>qgen
qgen v1.1
Found the following units to requeue:
index 6: + (finished) proj 2653, run 18, clone 153, gen 0
Designation:
UserName: lsevald(icrontic)
TeamNumber: 93
CPUID: 756D95360FB2AE2A
Constructing files for the folding environment and new queue:
index 6: + OK for upload; proj 2653, run 18, clone 153, gen 0
Units queued for processing: 0
Units queued for upload: 1
Errors: 0
done
C:\Users\lsevald\Desktop\FAH Windows SMP Client V1.01>[/SIZE]
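Before copying the regenerated queue.dat anywhere, it seems sensible to check the summary lines at the end of the qgen output. Here is a small sketch that pulls out the three counters shown above. The function names and the "at least one upload, zero errors" rule are my own assumptions, not something qgen documents.

```python
import re


def parse_qgen_summary(output: str) -> dict:
    """Extract the counters printed at the end of a qgen run (None if absent)."""
    patterns = {
        "processing": r"Units queued for processing:\s*(\d+)",
        "upload": r"Units queued for upload:\s*(\d+)",
        "errors": r"Errors:\s*(\d+)",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, output)
        summary[key] = int(match.group(1)) if match else None
    return summary


def looks_ready_for_upload(output: str) -> bool:
    """Assumed rule of thumb: no errors and at least one unit queued for upload."""
    summary = parse_qgen_summary(output)
    return summary["errors"] == 0 and (summary["upload"] or 0) >= 1
```

With the output above, the counters come out as 0 units for processing, 1 for upload and 0 errors, which matches the case where the finished WU was recovered.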
Qgen.exe generates a new client.cfg and queue.dat, but I only copied the new queue.dat file back to the main client folder, then I ran "fah -send all" again:
[SIZE="2"]Launch directory: D:\Program Files (x86)\FAH Windows SMP Client V1.01
Executable: fah
Arguments: -send all
[09:13:50] - Ask before connecting: No
[09:13:50] - User name: lsevald(icrontic) (Team 93)
[09:13:50] - User ID: 2AAEB20F36956D74
[09:13:50] - Machine ID: 1
[09:13:50]
[09:13:50] Loaded queue successfully.
[09:13:50] Attempting to return result(s) to server...
[09:13:50] + Attempting to send results
[09:16:28] + Results successfully sent
[09:16:28] Thank you for your contribution to Folding@Home.
[09:16:28] + Number of Units Completed: 364
Folding@Home Client Shutdown.[/SIZE]
Success! Afterwards I cleaned up the main client folder (deleted queue.dat and the work folder) to allow it to rebuild from scratch, just in case.
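For anyone who wants to script the recovery steps described in this post, here is a rough Python sketch. It assumes qgen.exe has already been saved into the copied folder and that the client executable is the `fah` shown in the logs. All function names and path parameters are hypothetical, and the final cleanup is deliberately left out, so treat it as an outline rather than a tested tool.

```python
import shutil
import subprocess
from pathlib import Path


def prepare_working_copy(client_dir: Path, work_copy: Path) -> None:
    """Copy the client folder, then move the old config and queue aside in the copy."""
    shutil.copytree(client_dir, work_copy)  # work_copy must not exist yet
    for name, old_name in (("client.cfg", "client.old"), ("queue.dat", "queue.old")):
        source = work_copy / name
        if source.exists():
            source.rename(work_copy / old_name)


def install_new_queue(work_copy: Path, client_dir: Path) -> None:
    """Copy only the regenerated queue.dat back; the original client.cfg stays."""
    new_queue = work_copy / "queue.dat"
    if not new_queue.exists():
        raise FileNotFoundError("no regenerated queue.dat in the working copy")
    shutil.copy2(new_queue, client_dir / "queue.dat")


def recover(client_dir: Path, work_copy: Path, qgen_exe: Path, fah_exe: Path) -> None:
    """Run the whole sequence: copy, rename, qgen, copy queue.dat back, send."""
    prepare_working_copy(client_dir, work_copy)
    # qgen is run with no arguments from inside the copy, as in the post.
    subprocess.run([str(qgen_exe)], cwd=work_copy, check=True)
    install_new_queue(work_copy, client_dir)
    # Same switch as used above: -send all
    subprocess.run([str(fah_exe), "-send", "all"], cwd=client_dir, check=True)
```

The file handling is split from the two program calls so the copy and rename steps can be tried on a scratch folder first. Make sure the client is stopped before running anything like this against the real folder.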
Comments
This thread deserves to be a sticky or linked somewhere.
Why it's picking on just your main rig is more of a mystery, but it could be an important clue.
EDIT: Previously I had problems with ~1/3 of the WUs (at its worst under XP, mostly FILE_IO_ERROR); now I guess it's more like 1/30. If it happened on a more regular basis I would have tried nailing it down (removing 2GB of RAM, trying a single HDD, removing the X-Fi, trying different AV software and so on).
But I'm thinking, this issue always seems to occur when I'm not using the computer. Maybe it could be something related to a power saving feature? I will look into that, and upgrade the NIC driver (onboard Realtek) while I'm at it.