Recouping lost WU progress
Weedo
One of my Win98 machines froze up sometime in the night. I couldn't do anything with it but hit restart to get it going again. When it came back on, the WU that was in progress started over at the beginning. It was a p1413_polyQ36x2 in water. It was at step 12, so you can imagine how much I have lost. This is the second time this has happened recently, and both times on big WUs.
Is there any way to recoup the lost work that was done? I can't see a way myself.
If you've been watching the folding gauntlet thread then you know this comes at a bad time for me.
Check out this excerpt from my folding log:
[04:06:18] Extra SSE boost OK.
[04:21:27] Writing local files
[04:21:27] Completed 225000 out of 2500000 steps (9)
[05:14:29] Writing local files
[05:14:29] Completed 250000 out of 2500000 steps (10)
[06:07:31] Writing local files
[06:07:31] Completed 275000 out of 2500000 steps (11)
[07:01:42] Writing local files
[07:01:42] Completed 300000 out of 2500000 steps (12)
Folding@home Client Shutdown.
--- Opening Log file [November 25 15:46:29]
# Windows Graphical Edition ###################################################
###############################################################################
Folding@home Client Version 4.00
http://folding.stanford.edu
###############################################################################
###############################################################################
[15:46:29] - Ask before connecting: No
[15:46:29] - User name: Weedo (Team 93)
[15:46:29] - User ID = 67039BD37145EE45
[15:46:29] - Machine ID: 1
[15:46:29]
[15:46:30] Loaded queue successfully.
[15:46:30] Initialization complete
[15:46:30] + Benchmarking ...
[15:46:34]
[15:46:34] + Processing work unit
[15:46:34] Core required: FahCore_78.exe
[15:46:34] Core found.
[15:46:35] Working on Unit 06 [November 25 15:46:35]
[15:46:35] + Working ...
[15:46:37]
[15:46:37] *
*
[15:46:37] Folding@Home Gromacs Core
[15:46:37] Version 1.70 (October 24, 2004)
[15:46:37]
[15:46:37] Preparing to commence simulation
[15:46:37] - Ensuring status. Please wait.
[15:46:54] - Looking at optimizations...
[15:46:54] - Working with standard loops on this execution.
[15:46:54] - Previous termination of core was improper.
[15:46:54] - Files status OK
[15:46:55] - Expanded 314991 -> 1888769 (decompressed 599.6 percent)
[15:46:55] - Checksums don't match (work/wudata_06.xtc)
[15:46:55] - Starting from initial work packet
[15:46:55]
[15:46:55] Project: 1413 (Run 42, Clone 13, Gen 0)
[15:46:55]
[15:46:55] Entering M.D.
[15:47:01] Protein: p1413_Q36x2 in water
[15:47:01]
[15:47:02] Writing local files
[15:47:02] Writing local files
[15:47:02] Completed 0 out of 2500000 steps (0)
Checksums don't match?
Starting from initial work packet?
It's the same WU as before.
It ends at step 12 then restarts at 0. It should pick up where it left off.
Maybe it's the way I restarted it. :shakehead
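For a rough sense of how much time the restart cost, the progress lines in the log excerpt above can be turned into an estimate. This is only a sketch in Python, not anything the client provides: the function names are made up for illustration, it assumes the timestamps do not cross midnight, and it assumes the earlier percent steps ran at the same pace as the four shown.

```python
import re

# Progress lines copied from the log excerpt above.
LOG_EXCERPT = """\
[04:21:27] Completed 225000 out of 2500000 steps (9)
[05:14:29] Completed 250000 out of 2500000 steps (10)
[06:07:31] Completed 275000 out of 2500000 steps (11)
[07:01:42] Completed 300000 out of 2500000 steps (12)
"""

PROGRESS = re.compile(
    r"\[(\d\d):(\d\d):(\d\d)\] Completed (\d+) out of (\d+) steps \((\d+)\)"
)


def parse_progress(text):
    """Return a list of (seconds_since_midnight, percent) pairs."""
    points = []
    for match in PROGRESS.finditer(text):
        hour, minute, second, _done, _total, percent = map(int, match.groups())
        points.append((hour * 3600 + minute * 60 + second, percent))
    return points


def seconds_per_percent(points):
    """Average wall-clock seconds per percent between first and last entry."""
    (t_first, p_first), (t_last, p_last) = points[0], points[-1]
    return (t_last - t_first) / (p_last - p_first)


def estimate_lost_hours(points):
    """Assume every percent up to the last one logged took the average time."""
    return seconds_per_percent(points) * points[-1][1] / 3600.0


if __name__ == "__main__":
    pts = parse_progress(LOG_EXCERPT)
    print(f"{seconds_per_percent(pts) / 60:.1f} minutes per percent")
    print(f"about {estimate_lost_hours(pts):.1f} hours of work lost")
```

With the lines above this works out to a bit over 53 minutes per percent, so restarting from 0 after reaching 12 throws away somewhere around 10 to 11 hours of crunching.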
Comments
That should minimize data loss.
I've been doing some overclocking experimentation, and I always right click on the red gear icon and hit "Pause work" before restarting the computer. I don't know if that helps, but it can't hurt.
Um, see if a new client version is available for Windows 98, for starters. No, under client 4.0 there is no way. Client 5 here is more stable, but I do not know if 5.03 is usable under 98 or 98 SE. If what you have is 98 SE, and client 5+ is Windows Me compatible, and you have run Windows Update on your 98 SE, then client 5.03 should be a lot more stable than 4.0.
What probably happened, for some odd reason, is that it was writing to the .xtc file and crashed after starting that write but before it wrote its progress cross-check checksums. The checksums, in this case, are probably an md5sum of the .xtc file after it got updated. This could be a flaky HD, flaky RAM, both, or a computer running too hot due to dust in the case, as well as simply the client or core malfunctioning. So, in this case, I would do the following, and this last item has been known to cause this also:
In client 4.0+ ONLY, I have had issues when the FAHlog.txt file gets beyond a certain size. In this case, I would look for a FAHlog.txt file of over 60K and, if it is that big or bigger, rename it to FAHlog-Prev.txt with the CLIENT not running. Here, my FAHlog.txt crossing the 55-59K boundary used to hang client 4.0 sometimes. Client 5.0 was able to rename the log file and start a new one from the client itself. Client 4.0 used to LOCK, and often shutting down the client, renaming the file, and then letting it start up and make a new FAHlog.txt file would let it continue until it again reached 55-60K, and then I got to do it again. Once I had 30-plus iterations of that, I reported this, and the latest version 5.0 client does NOT do that.
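The rename step above could be scripted along these lines. This is a minimal sketch in Python and not part of the Folding@home client: the function name and the folder argument are hypothetical, the 60K limit is just the figure from this post, and it should only be run while the client is shut down. Note that it overwrites any older FAHlog-Prev.txt.

```python
import os

# The ~60K threshold mentioned in the post above (an observation, not a
# documented client limit).
SIZE_LIMIT_BYTES = 60 * 1024


def rotate_log_if_large(folder, limit=SIZE_LIMIT_BYTES):
    """Rename FAHlog.txt to FAHlog-Prev.txt if it has reached the limit.

    Returns True if the log was renamed, False otherwise.
    Run this only while the client is NOT running.
    """
    log_path = os.path.join(folder, "FAHlog.txt")
    prev_path = os.path.join(folder, "FAHlog-Prev.txt")

    if not os.path.isfile(log_path):
        return False  # nothing to rotate
    if os.path.getsize(log_path) < limit:
        return False  # still small enough to leave alone

    # os.replace overwrites an existing FAHlog-Prev.txt.
    os.replace(log_path, prev_path)
    return True
```

Point it at the folder the client runs from, for example rotate_log_if_large(r"C:\FAH") if that is where yours lives (that path is only an example). The client should create a fresh FAHlog.txt the next time it starts, as described above.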
BUT, Folding hanging should not be the cause of the whole computer hanging, except for one thing: with some WUs, Folding makes for a warmer box than with others. The smaller WUs tend to heat boxes up less. The setting for that is to choose only small WUs in the Advanced part of the config script. If you do not have a console Windows client running, it might be a good idea to run one for this box, simply to restrict what WUs it accepts. Older boxes, and especially OC'd older boxes, tend to handle smaller WUs BETTER for some reason. Older boxes also have less RAM to give to the WU processing software and the workspace it needs while running. BIG WUs can eat a lot of RAM, and this one is complex enough that it might use lots of RAM while being worked on.
The other thing to think of is to empty your recycle bin. If it is loaded up with over 1500 files, the recycle bin processes CAN lock and then Windows can lock. A very fragmented drive can also do this, or slow things down suddenly (causing processes to hang in weird ways like this), and a drive with file system errors can also result in things flaking, so you might run scandisk and see if any errors show up, then defrag the drive if not, and also check for malware and viruses. File system maintenance will be more intense when folding unless you tell Windows not to save in the recycle bin the file types Folding uses to store data, if you have tools for that. Essentially, for the file types that Folding uses, possibly excepting .txt, you want the files NOT to go into the recycle bin and instead be deleted totally. Otherwise, your recycle bin WILL fill up quite rapidly, and over 1500 entries or overwritten entries in the recycle bin can hang Windows 98. If you try to open your recycle bin and it locks Windows, restart, browse to the Recycled folder (or Recycle Bin folders) in My Computer or Windows Explorer, and delete the files in small bunches. Your recycle bin should then work again after a restart of Windows. I've run into this with many apps that replace files often, in 98 and Me.
How do you get to advanced part of the config script?
It started another 1413 and I don't want it.
I want it to run invisibly in the background but it shows in the taskbar.
I put the console file in a folder on my C drive and executed it from there. It still appears on the taskbar. I thought the console was supposed to run invisibly.
I need a console version for dummies tutorial.
Barton: p217 @ 28% complete
Tbird: p1408 @ 21% complete
AthlonXP: p1405 @ 0% complete
The Athlon has dumped several WU's and started over including a failed attempt to use the console version. This is terrible. I'm on the verge of declaring my jihad a failure. Sure these WU's are worth some points when complete but when will that be? 3 days and only 28%??!!!! On a Barton??!!! My 24 hr average will fall to nothing and my points for the week will be right behind it.
Weedo is losing steam.
Barton: p217 @ 29% ???
Tbird: p1408 at 24%
AthlonXP: p1405 @ 11%
I'm going nowhere. It will be many days before any of these machines post. By then, I'll need that telescope again.
Weedo waves goodbye to Bothered as he chokes on his dust.
:bawling:
I'm living on my past exploits now. That trend will change soon.
There is no way I can ever hope to continue my jihad. I don't think my Win 98 machines can handle the big WUs, but that's all I get sent to them. Perhaps if I added some memory, I don't know. I deleted FAH and reinstalled and it sent me a p1407.