Recouping lost WU progress

WeedoWeedo New
edited November 2004 in Folding@Home
One of my Win98 machines froze up sometime in the night. I couldn't do anything with it but hit the restart button to get it going again. When it came back on, the WU that was in progress started over at the beginning. It was a p1413_polyQ36x2 in water. It was at step 12, so you can imagine how much I have lost. This is the second time this has happened recently, and on big WUs.

Is there any way to recoup the lost work that was done? I can't see a way myself.

If you've been watching the folding gauntlet thread then you know this comes at a bad time for me.

Check out this excerpt from my folding log:


[04:06:18] Extra SSE boost OK.
[04:21:27] Writing local files
[04:21:27] Completed 225000 out of 2500000 steps (9)
[05:14:29] Writing local files
[05:14:29] Completed 250000 out of 2500000 steps (10)
[06:07:31] Writing local files
[06:07:31] Completed 275000 out of 2500000 steps (11)
[07:01:42] Writing local files
[07:01:42] Completed 300000 out of 2500000 steps (12)

Folding@home Client Shutdown.


--- Opening Log file [November 25 15:46:29]


# Windows Graphical Edition ###################################################
###############################################################################

Folding@home Client Version 4.00

http://folding.stanford.edu

###############################################################################
###############################################################################



[15:46:29] - Ask before connecting: No
[15:46:29] - User name: Weedo (Team 93)
[15:46:29] - User ID = 67039BD37145EE45
[15:46:29] - Machine ID: 1
[15:46:29]
[15:46:30] Loaded queue successfully.
[15:46:30] Initialization complete
[15:46:30] + Benchmarking ...
[15:46:34]
[15:46:34] + Processing work unit
[15:46:34] Core required: FahCore_78.exe
[15:46:34] Core found.
[15:46:35] Working on Unit 06 [November 25 15:46:35]
[15:46:35] + Working ...
[15:46:37]
[15:46:37] *

*
[15:46:37] Folding@Home Gromacs Core
[15:46:37] Version 1.70 (October 24, 2004)
[15:46:37]
[15:46:37] Preparing to commence simulation
[15:46:37] - Ensuring status. Please wait.
[15:46:54] - Looking at optimizations...
[15:46:54] - Working with standard loops on this execution.
[15:46:54] - Previous termination of core was improper.
[15:46:54] - Files status OK
[15:46:55] - Expanded 314991 -> 1888769 (decompressed 599.6 percent)
[15:46:55] - Checksums don't match (work/wudata_06.xtc)
[15:46:55] - Starting from initial work packet
[15:46:55]
[15:46:55] Project: 1413 (Run 42, Clone 13, Gen 0)
[15:46:55]
[15:46:55] Entering M.D.
[15:47:01] Protein: p1413_Q36x2 in water
[15:47:01]
[15:47:02] Writing local files
[15:47:02] Writing local files
[15:47:02] Completed 0 out of 2500000 steps (0)

Checksums don't match?
Starting from initial work packet?

It's the same WU as before.
It ends at step 12 then restarts at 0. It should pick up where it left off.

Maybe it's the way I restarted it. :scratch: :shakehead
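
To put a number on the loss, here is a minimal Python sketch that reads the "Completed N out of M steps" lines from the log above, finds the point where the step count drops, and estimates the compute time that was thrown away. The function names are hypothetical, and the estimate assumes the WU ran at a steady pace from step 0 and that the timestamps before the reset all fall on the same day.

```python
import re
from datetime import datetime

# Progress lines copied from the log excerpt above.
LOG_EXCERPT = """\
[04:21:27] Completed 225000 out of 2500000 steps (9)
[05:14:29] Completed 250000 out of 2500000 steps (10)
[06:07:31] Completed 275000 out of 2500000 steps (11)
[07:01:42] Completed 300000 out of 2500000 steps (12)
[15:47:02] Completed 0 out of 2500000 steps (0)
"""

PROGRESS = re.compile(r"\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps")


def parse_progress(text):
    """Return a list of (time, steps_done, steps_total) for each progress line."""
    rows = []
    for line in text.splitlines():
        match = PROGRESS.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            rows.append((stamp, int(match.group(2)), int(match.group(3))))
    return rows


def find_reset(rows):
    """Return the index of the first row whose step count went down, or None."""
    for i in range(1, len(rows)):
        if rows[i][1] < rows[i - 1][1]:
            return i
    return None


def estimate_lost_seconds(rows):
    """Estimate the seconds of work discarded at the reset (steady pace assumed)."""
    reset = find_reset(rows)
    if reset is None or reset < 2:
        return 0.0
    before = rows[:reset]
    elapsed = (before[-1][0] - before[0][0]).total_seconds()
    steps = before[-1][1] - before[0][1]
    return elapsed / steps * before[-1][1]


if __name__ == "__main__":
    rows = parse_progress(LOG_EXCERPT)
    print(f"Reset at progress line {find_reset(rows)}")
    print(f"Roughly {estimate_lost_seconds(rows) / 3600:.1f} hours of work lost")
```

With the excerpt above, each 1% took about 53 minutes, so the 12% that was discarded comes to roughly 10.7 hours of folding.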

Comments

  • TimTim Southwest PA Icrontian
    edited November 2004
    Right click on the red gear symbol in the taskbar, then click on Configure << Advanced, and set the checkpoint frequency to 3 minutes (most frequent).

    That should minimize data loss.

    I've been doing some overclocking experimentation, and I always right click on the red gear icon and hit "Pause work" before restarting the computer. I don't know if that helps, but it can't hurt.
  • Straight_ManStraight_Man Geeky, in my own way Naples, FL Icrontian
    edited November 2004
    Weedo wrote:
    One of my Win98 machines froze up sometime in the night. I couldn't do anything with it but hit the restart button to get it going again. When it came back on, the WU that was in progress started over at the beginning. It was a p1413_polyQ36x2 in water. It was at step 12, so you can imagine how much I have lost. This is the second time this has happened recently, and on big WUs.

    Is there any way to recoup the lost work that was done? I can't see a way myself.

    [15:46:55] - Checksums don't match (work/wudata_06.xtc)
    [15:46:55] - Starting from initial work packet
    [15:46:55]
    [15:46:55] Project: 1413 (Run 42, Clone 13, Gen 0)
    [15:46:55]
    [15:46:55] Entering M.D.
    [15:47:01] Protein: p1413_Q36x2 in water
    [15:47:01]
    [15:47:02] Writing local files
    [15:47:02] Writing local files
    [15:47:02] Completed 0 out of 2500000 steps (0)

    Checksums don't match?
    Starting from initial work packet?

    It's the same WU as before.
    It ends at step 12 then restarts at 0. It should pick up where it left off.

    Maybe it's the way I restarted it. :scratch: :shakehead

    Um, see if a new client version is available for Windows 98, for starters. No, under client 4.0 there is no way, and client 5 here is more stable, but I do not know if 5.03 is usable under 98 or 98 SE. If what you have is 98 SE, and client 5+ is Windows Me compatible, and you have run Windows Update on your 98 SE, then client 5.03 should be a lot more stable than 4.0.

    What probably happened, for some odd reason, is that it was writing to the .xtc file and crashed after starting that write but before it wrote its progress crosscheck checksums. Checksums, in this case, are probably an md5sum of the .xtc file after it got updated. This could be the HD flaking, RAM flaking, both, or a computer running too hot due to dust in the case, as well as simply the client or core malfunctioning. So, in this case, I would do the following, and this last has been known to cause this also:

    In client 4.0+ ONLY, I have had issues when the FAHlog.txt file gets beyond a certain size. In this case, I would look for a FAHlog.txt file of over 60K and rename it to FAHlog-Prev.txt with the client NOT running-- if the FAHlog.txt file is that big or bigger. Here, my FAHlog.txt crossing the 55-59K boundary used to hang client 4.0 sometimes. Client 5.0 was able to rename the log file and start a new one from the client itself. Client 4.0 used to LOCK, and often shutting down the client, renaming the file, and then letting it start up and make a new FAHlog.txt file would let it continue until it again reached 55-60K, and then I got to do it again. Once I had 30-plus iterations of that, I reported this, and the latest version 5.0 client does NOT do that.

    BUT, Folding hanging should not be the cause of whole computer hanging, except for one thing-- Folding sometimes, with some WUs, causes a warmer box than with others. The smaller WUs tend to heat boxes up less. The setting for that is to choose only small WUs in the Advanced part of the config script. If you do not have a console Windows client running, might be a good idea to run one for this box, simply to restrict what WUs it accepts. Older boxes, and especially OC'd older boxes, tend to handle smaller sized WUs BETTER for some reason. Older boxes also have less RAM to give the WU processing software and workspace needed in RAM while it is running. BIG WUs can eat a lot of RAM, and this one is complex enough that it might use lots of RAM while being worked on.

    The other thing to think of is to empty your recycle bin. If it is loaded up with over 1500 files, the recycle bin processes CAN lock and then Windows can lock. A very fragmented drive can also do this, or slow things down suddenly (causing processes to hang in weird ways like this), and a drive with file system errors can also result in things flaking, so you might run ScanDisk and see if any errors show up, then defrag the drive if not, and also check for malware and viruses. File system maintenance will be more intense when folding unless you tell Windows not to save in the recycle bin the file types Folding uses to store data, if you have tools for that. Essentially, for the file types that Folding uses, possibly sans the .txt type, you want the files NOT to go into the recycle bin and instead be deleted totally. Otherwise, your recycle bin WILL fill up quite rapidly, and over 1500 entries or overwritten entries in the recycle bin can hang Windows 98. IF you try to open your recycle bin and it locks Windows, restart, browse to the Recycled folder (or Recycle Bin folderS) in My Computer or Windows Explorer, and delete the files in small bunches; your recycle bin should then work again after a restart of Windows. I've run into this with many apps that replace files often, in 98 and Me.
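
The checksum explanation in the post above can be illustrated with a minimal Python sketch. This is not how FahCore_78 actually stores its data (the MD5 idea is only the poster's guess); it just shows why a checkpoint whose stored checksum no longer matches its contents gets thrown away, and one common way to make such a torn write less likely. The function names `save_checkpoint` and `load_checkpoint` and the single-file layout are assumptions made for illustration.

```python
import hashlib
import os


def save_checkpoint(path, payload):
    """Write digest + payload to a temp file, then swap it into place.

    A crash before os.replace() leaves the previous checkpoint untouched,
    rather than a half-updated file.
    """
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        handle.write(digest + b"\n" + payload)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the bytes to disk
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the payload if its stored digest matches, otherwise None."""
    try:
        with open(path, "rb") as handle:
            digest, _, payload = handle.read().partition(b"\n")
    except FileNotFoundError:
        return None
    if hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        return None  # "Checksums don't match": caller must start over
    return payload
```

A caller would resume from the payload when `load_checkpoint` returns one and fall back to the initial work packet when it returns `None`, which mirrors the "Checksums don't match" followed by "Starting from initial work packet" lines in the log.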
  • WeedoWeedo New
    edited November 2004
    Exactly how do I set the console version to only accept small WU's?

    How do you get to advanced part of the config script?

    It started another 1413 and I don't want it.



    I want it to run invisibly in the background but it shows in the taskbar.
  • Access_DeniedAccess_Denied tennessee
    edited November 2004
    This has happened to me with P2P software. My computer locked up after being on for about a week and I had to just shut it off. When I came back, my download (like 600MB of 700MB over dialup) had been erased. I looked in Norton Protected Files and it was there, so I restored it, but the P2P program wouldn't accept it anymore and would promptly delete it :shakehead .. I think it's something to do with the file being open; then when you shut down, it gets deleted.
  • WeedoWeedo New
    edited November 2004
    I've shut down the console, deleted and restarted several times. Can't find a way to set it to accept small WU's. It keeps restarting that 1413 which is destroying my progress. So I'm shutting it down until I figure it out.

    I put the console file in a folder on my C drive and executed it from there. It still appears on the taskbar. I thought the console was supposed to run invisibly.

    I need a console version for dummies tutorial.
  • botheredbothered Manchester UK
    edited November 2004
    I had a dodgy WU once. I was advised to delete everything in the work folder; then it got another, different work unit and started again.
  • WeedoWeedo New
    edited November 2004
    Hmmm, that didn't work for me, or maybe it does and Stanford keeps sending me P1413's anyway. I've lost about 2 days production on this machine. :bawling:
  • WeedoWeedo New
    edited November 2004
    After 3 days of hard folding here is where I stand:

    Barton: p217 @ 28% complete
    Tbird: p1408 @ 21% complete
    AthlonXP: p1405 @ 0% complete

    The Athlon has dumped several WU's and started over including a failed attempt to use the console version. This is terrible. I'm on the verge of declaring my jihad a failure. Sure these WU's are worth some points when complete but when will that be? 3 days and only 28%??!!!! On a Barton??!!! My 24 hr average will fall to nothing and my points for the week will be right behind it.

    Weedo is losing steam.
  • botheredbothered Manchester UK
    edited November 2004
    I had another bad 'do' with folding when a server went down for about a week. I thought it was my PCs, I handed no points in for ages. In the end I uninstalled F@H and reinstalled a newer version which connected to another server. I lost quite a few points but it seems to fold quicker now. Don't give up on the jihad weedo, it'll get sorted.
  • botheredbothered Manchester UK
    edited November 2004
    Well you're handing in points now dood. :thumbsup:
  • WeedoWeedo New
    edited November 2004
    I've only got 1 machine doing any good now. Here is the status of the other 3, 7.5 hrs after that last status report:

    Barton: p217 @ 29% ???
    Tbird: p1408 at 24%
    AthlonXP: p1405 @ 11%

    I'm going nowhere. It will be many days before any of these machines post. By then, I'll need that telescope again.

    :wave: Weedo waves goodbye to Bothered as he chokes on his dust. ;D


    :bawling:
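
For a rough sense of "many days", the two status reports can be turned into completion estimates. This is only a back-of-the-envelope Python sketch: it assumes each machine keeps the pace it showed over the 7.5 hours between reports, and whole-number percentages make the Barton figure especially crude. The function name `eta_hours` is hypothetical.

```python
def eta_hours(pct_then, pct_now, hours_between):
    """Hours left at the observed pace, or None if no progress was made."""
    gained = pct_now - pct_then
    if gained <= 0:
        return None
    rate = gained / hours_between  # percent per hour
    return (100 - pct_now) / rate


# (percent at first report, percent 7.5 hours later), from the posts above.
REPORTS = {
    "Barton p217": (28, 29),
    "Tbird p1408": (21, 24),
    "AthlonXP p1405": (0, 11),
}

if __name__ == "__main__":
    for machine, (then, now) in REPORTS.items():
        hours = eta_hours(then, now, 7.5)
        if hours is None:
            print(f"{machine}: no progress between reports")
        else:
            print(f"{machine}: about {hours / 24:.1f} more days")
```

At those rates the Barton would need about 22 more days, the Tbird about 8, and the AthlonXP about 2.5.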
  • WeedoWeedo New
    edited November 2004
    I just took a look at Extreme Overclocker stats. It's funny how it looks like I'm out producing you point wise and the trend shows me passing you, but the reality is you're starting to open up your lead substantially.

    I'm living on my past exploits now. That trend will change soon.
  • WeedoWeedo New
    edited November 2004
    OMG!!!! It just happened again!!! My Barton dropped 30% progress of a p217. I am totally :screwed:

    There is no way I can ever hope to continue my jihad. I don't think my Win 98 machines can handle the big WUs, but that's all I get sent to them. Perhaps if I added some memory, I don't know. I deleted FAH and reinstalled and it sent me a p1407.