SMP hanging at FINISHED_UNIT and WU is lost, solution

lsevald Norway Icrontian
edited November 2007 in Folding@Home
I don't know if this is an isolated problem with one of my computers. It has happened to me a few times, but only on my main workstation. But I thought I might as well post my findings in case someone else encounters this problem.

After completing a WU, the process just stops and sits there for hours:
[SIZE="2"][15:47:55] Project: 2653 (Run 18, Clone 153, Gen 0)
[15:47:55] 
[15:47:56] Assembly optimizations on if available.
[15:47:56] Entering M.D.
[15:48:06] Rejecting checkpoint
[15:48:07] ProtWriting local files
[15:48:07] Extra SSE boost OK.
[15:48:08] Writing local files
[15:48:08] Completed 0 out of 500000 steps  (0 percent)
[SNIP]
[05:58:45] Completed 495000 out of 500000 steps  (99 percent)
[06:06:56] Writing local files
[06:06:56] Completed 500000 out of 500000 steps  (100 percent)
[06:06:56] Writing final coordinates.
[06:06:57] Past main M.D. loop
[06:06:57] Will end MPI now
[06:07:57] 
[06:07:57] Finished Work Unit:
[06:07:57] - Reading up to 3724560 from "work/wudata_06.arc": Read 3724560
[06:07:57] - Reading up to 1781612 from "work/wudata_06.xtc": Read 1781612
[06:07:57] goefile size: 0
[06:07:57] logfile size: 18086
[06:07:57] Leaving Run
[06:07:58] - Writing 5528658 bytes of core data to disk...
[06:07:58]   ... Done.
[06:07:58] - Failed to delete work/wudata_06.sas
[06:07:58] - Failed to delete work/wudata_06.goe
[06:07:58] Warning:  check for stray files
[06:07:58] - Shutting down core
[06:07:58] 
[06:07:58] Folding@home Core Shutdown: FINISHED_UNIT
[06:07:58] 
[06:07:58] Folding@home Core Shutdown: FINISHED_UNIT

Folding@Home Client Shutdown at user request. (Shut down manually after 2hrs just hanging there)
Folding@Home Client Shutdown.


[/SIZE]
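For anyone who wants to catch this state automatically, here is a rough Python sketch of a log check. It is only an illustration, not anything the client provides: the function name is made up, and it assumes that a healthy client writes a "+ Attempting to send results" line some time after the FINISHED_UNIT shutdown line (both strings are taken from the logs in this post).

```python
# Log markers copied from the client output quoted in this post.
FINISHED = "Folding@home Core Shutdown: FINISHED_UNIT"
SEND_ATTEMPT = "+ Attempting to send results"


def looks_hung_after_finish(log_text: str) -> bool:
    """Return True if the last FINISHED_UNIT line has no upload attempt after it.

    Hypothetical helper: it only inspects the text, so the caller still has
    to decide how long the log must have been silent before calling it a hang.
    """
    lines = log_text.splitlines()
    last_finished = -1
    for i, line in enumerate(lines):
        if FINISHED in line:
            last_finished = i
    if last_finished < 0:
        return False  # no finished unit in this log at all
    tail = lines[last_finished + 1:]
    return not any(SEND_ATTEMPT in line for line in tail)
```

In practice you would probably combine this with the log file's modification time, so that a unit that finished a few seconds ago is not reported as hung.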

I tried making a backup of the FAH client folder (before shutting the client down the first time) and restoring it, but no matter what I do, when I restart the client (in the hope that it will pick up and send the completed WU), I get:
[SIZE="2"][09:05:34] - Ask before connecting: No
[09:05:34] - User name: lsevald(icrontic) (Team 93)
[09:05:34] - User ID: 2AAEB20F36956D74
[09:05:34] - Machine ID: 1
[09:05:34] 
[09:05:34] Loaded queue successfully.
[09:05:34] 
[09:05:34] + Processing work unit
[09:05:34] Core required: FahCore_a1.exe
[09:05:34] Core found.
[09:05:34] Working on Unit 06 [November 11 09:05:34]
[09:05:34] + Working ...
[09:05:35] 
[09:05:35] *------------------------------*
[09:05:35] Folding@Home Gromacs SMP Core
[09:05:35] Version 1.74 (March 10, 2007)
[09:05:35] 
[09:05:35] Preparing to commence simulation
[09:05:35] - Ensuring status. Please wait.
[09:05:52] - Assembly optimizations manually forced on.
[09:05:52] - Not checking prior termination.
[09:05:52] 
[09:05:52] Folding@home Core Shutdown: MISSING_WORK_FILES
[09:05:52] Finalizing output

Folding@Home Client Shutdown at user request.

Folding@Home Client Shutdown.


[/SIZE]

And if I try using the -send all switch, it just exits like there's nothing to send:
[SIZE="2"]Launch directory: D:\Program Files (x86)\FAH Windows SMP Client V1.01
Executable: fah
Arguments: -send all 

[09:07:23] - Ask before connecting: No
[09:07:23] - User name: lsevald(icrontic) (Team 93)
[09:07:23] - User ID: 2AAEB20F36956D74
[09:07:23] - Machine ID: 1
[09:07:23] 
[09:07:23] Loaded queue successfully.
[09:07:23] Attempting to return result(s) to server...

Folding@Home Client Shutdown.


[/SIZE]

Inspecting the client work folder revealed that there was a results*.dat file, so why wouldn't it send? My only idea was that something was wrong with the queue.dat file (it holds the work queue status). A Google search led me to this site. First I tried the qfix.exe utility, but that didn't work (though it did indicate I had a finished WU in the queue). Afterwards I tried qgen.exe. I made a backup of the client folder and saved qgen.exe there. Then, in the backup copy, I renamed client.cfg to client.old and queue.dat to queue.old. Running qgen.exe from a command prompt returned this:
[SIZE="2"]C:\Users\lsevald\Desktop\FAH Windows SMP Client V1.01>qgen
qgen v1.1

Found the following units to requeue:
  index 6: + (finished) proj 2653, run 18, clone 153, gen 0

Designation:
  UserName:    lsevald(icrontic)
  TeamNumber:  93
  CPUID:       756D95360FB2AE2A

Constructing files for the folding environment and new queue:
  index 6: + OK for upload; proj 2653, run 18, clone 153, gen 0

Units queued for processing: 0
Units queued for upload: 1
Errors: 0

done

C:\Users\lsevald\Desktop\FAH Windows SMP Client V1.01>


[/SIZE]
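The preparation I did by hand before running qgen could be scripted roughly like this. Treat it as a sketch under assumptions: the function name and the staging folder are hypothetical, the "work" subfolder name comes from the logs above, and the loose "*results*.dat" pattern is just a guess at matching the results file mentioned earlier. It does not run qgen.exe itself.

```python
import shutil
from pathlib import Path


def prepare_for_qgen(client_dir: Path, staging_dir: Path) -> list[str]:
    """Copy the client folder and move the old config and queue out of the way.

    Returns the names of any result files found in the copied work folder,
    so you can confirm there is something worth requeueing.
    """
    # Work on a copy so the original client folder stays untouched.
    # copytree raises an error if staging_dir already exists.
    shutil.copytree(client_dir, staging_dir)

    # Rename the files that qgen will regenerate.
    for name, backup in (("client.cfg", "client.old"), ("queue.dat", "queue.old")):
        src = staging_dir / name
        if src.exists():
            src.rename(staging_dir / backup)

    work = staging_dir / "work"
    if not work.is_dir():
        return []
    return sorted(p.name for p in work.glob("*results*.dat"))
```

After this you would drop qgen.exe into the staging folder and run it from a command prompt there, as shown in the output above.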

Qgen.exe generates a new client.cfg and queue.dat, but I only copied the new queue.dat file back to the main client folder, then I ran "fah -send all" again:
[SIZE="2"]Launch directory: D:\Program Files (x86)\FAH Windows SMP Client V1.01
Executable: fah
Arguments: -send all 

[09:13:50] - Ask before connecting: No
[09:13:50] - User name: lsevald(icrontic) (Team 93)
[09:13:50] - User ID: 2AAEB20F36956D74
[09:13:50] - Machine ID: 1
[09:13:50] 
[09:13:50] Loaded queue successfully.
[09:13:50] Attempting to return result(s) to server...


[09:13:50] + Attempting to send results
[09:16:28] + Results successfully sent
[09:16:28] Thank you for your contribution to Folding@Home.
[09:16:28] + Number of Units Completed: 364


Folding@Home Client Shutdown.


[/SIZE]

Success :) Afterwards I cleaned up the main client folder (deleted queue.dat and the work folder) to allow it to rebuild from scratch, just in case.
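The last steps (copy only the new queue.dat back, confirm the upload, then clean up) could look something like the sketch below. The function names are made up for illustration, the success string is the one from the log above, and running "fah -send all" is left as a manual step between the calls.

```python
import shutil
from pathlib import Path

# Success line copied from the client output in this post.
SENT_MARKER = "+ Results successfully sent"


def install_new_queue(staging_dir: Path, client_dir: Path) -> None:
    """Copy only the regenerated queue.dat back; the new client.cfg is left behind."""
    shutil.copy2(staging_dir / "queue.dat", client_dir / "queue.dat")


def results_were_sent(log_text: str) -> bool:
    """Check the client output for the success line."""
    return SENT_MARKER in log_text


def reset_client(client_dir: Path) -> None:
    """Delete queue.dat and the work folder so the client rebuilds them."""
    (client_dir / "queue.dat").unlink(missing_ok=True)
    shutil.rmtree(client_dir / "work", ignore_errors=True)
```

The order matters: call install_new_queue, run "fah -send all" yourself, and only call reset_client once results_were_sent is true for the client output, since the reset deletes the work folder that holds the results.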

Comments

  • mmonnin Centreville, VA
    edited November 2007
    Oooo nice, I've found one of my clients in a finished state after being there for hours as well.

    This thread deserves to be a sticky or linked somewhere.
  • Qeldroma Arid ZoneAh Member
    edited November 2007
    Thanks, lsevald. I'm not sure what the cause of this one is either (the F@H community forum is down right now, or I'd research it more), but this might have been useful in a couple of cases here.

    Why it's picking on just your main rig is more of a mystery, but it could be an important clue.
  • lsevald Norway Icrontian
    edited November 2007
    Yes, it's very strange. I have three very similarly configured rigs now (all running Vista64, with Intel chipsets and quad-core CPUs). But my workstation (the one that occasionally has this problem) is running 4GB RAM, RAID, NOD32 and a Soundblaster X-Fi; the two other rigs have 2GB RAM, a single SATA HDD, Avast antivirus and onboard Realtek sound. I have upgraded the motherboard, CPU and RAM, reinstalled Vista64 and upgraded all drivers, but with no change. It has gotten a lot better over the last 2-3 months though. Earlier I also got a lot of FILE_IO_ERRORs when starting the client (that problem is gone now). It's like it's occasionally doing something out of order. Also, the problem was a lot worse under WinXP. But I haven't tried WinXP for several months, so that might have been due to early and buggy SMP code :confused2

    EDIT: Previously I had problems with ~1/3 of the WUs (at its worst under XP, mostly FILE_IO_ERROR); now I guess it's more like 1/30. If it happened on a more regular basis I would have tried nailing it down (removing 2GB RAM, trying a single HDD, removing the X-Fi, trying different AV software and so on).
  • SPIKE09 Scatland
    edited November 2007
    NOD32 has a thread at the FCF and in other teams' folding sections; it has been known to cause hangs and dumped WUs. Another thing that really causes hangs is a change in network IP address.
  • lsevald Norway Icrontian
    edited November 2007
    Yes, I've seen reports of NOD32 causing issues too. I did try removing NOD when I had the FILE_IO_ERROR problem big time, but it didn't help (even got it on a clean system). The issue I have now might be different though. But I can live with this until the next big overhaul (especially since I'm able to salvage the WU now), and hopefully SMP is out of beta and Vista SP1 is released by then :)
  • SPIKE09 Scatland
    edited November 2007
    lsevald wrote:
    Yes, I've seen reports of NOD32 causing issues too. I did try removing NOD when I had the FILE_IO_ERROR problem big time, but it didn't help (even got it on a clean system). The issue I have now might be different though. But I can live with this until the next big overhaul (especially since I'm able to salvage the WU now), and hopefully SMP is out of beta and Vista SP1 is released by then :)
    Nice to see you're an optimist, Lasse :bigggrin:
  • lsevald Norway Icrontian
    edited November 2007
    Guess I'm in a good mood today :wtf:
  • lsevald Norway Icrontian
    edited November 2007
    I will look into the network IP address change problem, Spike. The only network-related log entry I can find in the Event Viewer is this Tcpip warning (Event ID 4227):
    TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.

    But I'm thinking that this issue always seems to occur when I'm not using the computer. Maybe it could be something related to a power-saving feature? I will look into that, and upgrade the NIC driver (onboard Realtek) while I'm at it.
  • broady81 Member
    edited November 2007
    I had the same problem when my router changed (refreshed) my IP address. It might be simple, but I solved it by reserving an IP for my PC and configuring my network settings to always use it. :rolleyes2 As lsevald mentioned, I also disabled power-saving features in the BIOS and through software. :bigggrin: Touch wood, it seems to be working, both when I was running Linux and now on Windows, :smiles: whereas I used to get FILE_IO_ERROR and hangs at the end of work units (along with numerous other random crashes). :sad2:
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited November 2007
    There's been some weird stuff lately. I've been having a problem with work units hanging after sending updates to the server. This happens not when the work units are completed, but at the periodic "-Autosend Completed." This has been happening on all of my home computers, three of which have had no software changes lately and no Folding configuration changes whatsoever. I am happy to say that I've not lost any work units. All I have to do is press Ctrl-C to shut down the client, then restart. The processing continues and the work units finish without problem. If it weren't for FAHMon, all the computers might be sitting idle at any given time with work units not being processed. I should add that all these incidents have been with Project 2653 work units. The last 18 hours have been good though - not one 'hung' unit.