Apologies....

Straight_Man · May 2004

One of my Prescott pipes has a client that suffered an 18+ hour repeated seizure and looping. Core involved was a Core_78. WU involved is unknown, as the WU parms never showed in the verbose log and the error was an almost immediate CORE_OUTDATED abend (error hex code 6E) at core startup. Client showed invalid WU, but (I THINK) since checksums were valid for the data in the WU as to calc parms, it tried for 16+ hours to get cores until it simply stopped itself for one day. I was out, sleeping, and doing other life things while this happened. Only by saving logs was I able to partly do forensics on this....

If you folks save logs (and this incident bridged over two log files) and run into this, please join in at the thread-specific link at the bottom of this post on Folding's forums.... Preventing this is going to require some recoding of things at client, WU, and core to check required date ranges in WU headers and pull a new WU, instead of a 16+ hour loop followed by a 24-hour work stoppage imposed by the client.

The fix was the same core as last downloaded, plus a new WU download done by the client after three folding session stop\start cycles. By then enough time had elapsed for the client still present to do 80+% of a p1052 on one pipe of my Prescott box. Prescott was down to one functional console client for this time, and I apologize to mmonnin and the whole gang for not monitoring things more tightly.... GEEZE....

http://forum.folding-community.org/viewtopic.php?t=7868

Post sizes needed to doc this would exceed what is proper here, at about 10X what I normally post, so I chose a link and did the documenting on Folding's forums.... I also commented the log, and truncated some of the repeated things....

Note, if you are new to folding or the folding forum, you will need a user ID which you can create at Folding's forums.

John D.

csimon · May 2004

was that rig configured for beta wu's john?

mmonnin · May 2004

I thought I posted in this thread??
/me scratches head.

What I meant to say was that no apology is needed. No one can be at their computer all the time to monitor folding.

Straight_Man · May 2004

csimon wrote:

was that rig configured for beta wu's john?

Not really deliberately, no-- unless my config is what you-all use for Betas.... For my purposes I have verbose logging enabled using -verbosity 9, save logs for months back, and have -advmethods, -forceasm, -forcesse, -service, and -local active. This is one of two clients on a Prescott, but similar things have happened to me with a graphical client on a Barton, though in that case I reloaded the client core and wiped the WU instead of letting it try to fix itself through three shutdown\restarts of the client. Stock client; it got a Double Gromacs and a Core_79 and ran it fine, it is using the same downloaded Core_78 that it last grabbed in the reload loop, and I am folding Gromacs with it now.

Looks awful like I got a WU with a mangled core date or core spec in it, as it kept trying to grab cores and saying they were OUTDATED, which usually means the WU wants a newer core than what is there. BUT, it also kept saying invalid WU and grabbed more of the exact same core, down to the rev, anyhow. Circumstances were such that at least part of the time during this looping the Linux box was trying to get things from Germany to update itself, but the looping started first by the Linux logs' timestamps. AND, the router here was actually not very busy by its logs.

I have a bandwidth downward that blazes, an up bandwidth that is not bad, about like mid-grade DSL up (260-320 Kbits up), and in the 2.5+ Gigabit range down potentially-- on the FTP links I get from Sweden and German servers I can FTP at an average of 290 KBYTES per second, but that was not active, the Linux updater was trying to connect and never FTP'd anything.

Problem is all I can give the Pande group is what I gave them, unless they want about 112 KB of logs to look at just for this one incident. Nowhere in the logs is there an entry that tells me the WU server IP nor the WU spec. The core abended after download with a hex 6E code, I got a bunch of these repeated loops, and the client decided to autosleep for 24 hours. An hour into that time frame I closed out the client in an orderly manner, brought it back up, and it did more looping. I closed out the client ANOTHER time and it erased the WU and got another WU, did NOT get another CORE, and went merrily on its way and has been folding since then quite happily.

But a corrupt WU should not trigger this, period, not an 18 hour mess, much less the time it would have been if I had not checked it an hour after it autoslept for a day. Queue.dat is not corrupt, no corruption in logs, no trojans nor worms nor viruses on the box. Second pipe on Prescott is running same core, same client, and no such issues. I am gonna put it down to a WU, but the dang client seems to be unable to validate WU headers to get a date and core requirement cross-check versus even the present time. AND, since I cannot ID the WU, I cannot even tell Folding which WU is at fault, which is why I wanted others to chime in, in the hope of getting multiple timestamps for Folding admins to zero in on. They could then see from the logs on their end which WU it was, and check it for embedded errors as to the core needed and the core release date needed in the WU header, so others do not get this later on.

IF no errors of this sort show up at Folding's end, I simply would like three things from future clients: a reality-type validity check done on WU headers before the core is engaged, a verbose-mode log entry showing the WU ID, and a WU redownload triggered instead of a core run and reload loop, so this behavior can be stopped before it ties up other boxes for 18-36 hours. I lost 18 hours work plus some minutes on one client instance of two simultaneous consoles (started via .bat's launched through shortcuts in the Startup folder under Start Menu|Programs on the XP Prescott box); before and after the incident it can run two with ZERO reconfigging.
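To make the request concrete, here is a rough sketch of the sort of pre-flight check I mean. It is NOT the real client logic, and the header fields (wu_id, required_core, required_core_date) are pure guesses on my part, since nothing in my logs shows the actual WU header layout. The idea is just a date sanity check against present time, plus a cap on core re-downloads so the client dumps the WU instead of looping.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Action(Enum):
    RUN_CORE = "run core"
    FETCH_CORE = "fetch core"
    REDOWNLOAD_WU = "discard WU and fetch a new one"


@dataclass(frozen=True)
class WUHeader:
    # Hypothetical fields: the real WU header layout is not known from the logs.
    wu_id: str                 # would be logged before the core is started
    required_core: str         # e.g. "Core_78"
    required_core_date: date   # oldest core release the WU will accept


def next_action(header, installed_core, installed_core_date, today,
                core_fetches, max_core_fetches=1):
    """Decide what to do with a WU before the core is engaged."""
    # Reality check: a WU cannot require a core dated after today.
    if header.required_core_date > today:
        return Action.REDOWNLOAD_WU
    core_ok = (header.required_core == installed_core
               and installed_core_date >= header.required_core_date)
    if core_ok:
        return Action.RUN_CORE
    # Already re-fetched the core and it is still "outdated":
    # blame the WU rather than looping on core downloads.
    if core_fetches >= max_core_fetches:
        return Action.REDOWNLOAD_WU
    return Action.FETCH_CORE
```

With something like this, a WU whose header demands a core from the future gets thrown back at once, and a WU that still calls the core outdated after one fresh core download gets thrown back on the second pass, rather than after 16+ hours.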

For me the aggravating thing was not so much that the box lost points, though for Marc it might be the major issue and indeed this would affect that also, but that parts of this have happened 5 times that I know of, and enough similarities exist that I need some debug things done to bypass what happened and see if we can pinpoint what with some timely logging. A corrupt header in a WU is the only thing I can think of, and if it was corrupted in transit or by a WU create error matters less than that the client did not do the right thing given what HAD to have happened and that 18 hours of production was lost.

One reason I check up on all instances I run at least every 24 hours is to take logs that are almost at 60K and chunk them to backup renames, and where prev log also exists, it gets renamed first so I have numerically consecutive logs.
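For anyone who wants to script that log chunking instead of doing it by hand, a minimal sketch follows. The file names FAHlog.txt and FAHlog-Prev.txt, the log-0001.txt archive naming, and the 60,000 byte threshold are all assumptions here, so adjust them to whatever your install actually uses. It also assumes the client is stopped, or at least does not hold the log open, when the renames happen.

```python
import re
from pathlib import Path

# Hypothetical archive naming: log-0001.txt, log-0002.txt, ...
ARCHIVE_RE = re.compile(r"^log-(\d{4})\.txt$")


def next_index(log_dir: Path) -> int:
    """Return the next unused archive number in log_dir."""
    nums = [int(m.group(1)) for p in log_dir.iterdir()
            if (m := ARCHIVE_RE.match(p.name))]
    return max(nums, default=0) + 1


def archive_logs(log_dir, current="FAHlog.txt", previous="FAHlog-Prev.txt",
                 threshold=60_000):
    """Rename the logs to numbered backups once the current log is big enough.

    The previous log (if any) is renamed first so the archive numbers stay
    in chronological order. Returns the list of archive names created.
    """
    log_dir = Path(log_dir)
    cur = log_dir / current
    if not cur.exists() or cur.stat().st_size < threshold:
        return []
    moved = []
    idx = next_index(log_dir)
    for src in (log_dir / previous, cur):  # older log first
        if src.exists():
            dst = log_dir / f"log-{idx:04d}.txt"
            src.rename(dst)
            moved.append(dst.name)
            idx += 1
    return moved
```

Run it against each client's folder during the daily check; it does nothing until the current log reaches the threshold.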

csimon · May 2004

well with -advmethods you may have gotten a beta wu in its end stages of beta testing. too bad about the log.

Straight_Man · May 2004

Yep, if things like project ID could be logged before the core has started, or the WU ID at least, we could pinpoint the project without extensive log tracing on servers by timestamp, from the assignment time and server on through the system.

I am getting p1000 series WUs, but this one got a Core_78 download and I do not know if there are any such of those. AFAIK, what that client had worked on before this behavior was a Tinker that completed... That is what the log shows. Before that, a Gromacs which completed, and before that the same client had worked on a mix of Tinkers, Gromacs, and a few Double Gromacs.

HD did check good several ways, I have done a check of data with a full chkdsk run since then, nada wrong. AV runs daily, updates daily. Only thing an updated HJT finds is Alexa from time to time, and Adaware keeps saying box is clean with latest updates of that in place-- even as to defs.

I think that, and no corruptive actions by programs, tells me that the problem is not local corruption, reinforced by the many successful completes of WUs by this console install.

I am reasonably sure this was not a Tinker nor a Double Gromacs, simply due to the Core_78 download looping-- it never got Core_65, nor Core_79, per the log. None of the servers were individually down for all of this looping time frame AFAIK. THAT, I had looked at at least twice in this time frame, near beginning after I had last looked at client's logs, and near end when I cross-checked that on the Linux box while XP was coming back up.
