Ups & Downs

profdlp · December 2003

First the rant - without boring you with specifics, my humble Folding farm has been going through some tough times. Besides problems here at home, I've made two 200+ mile round-trips within the last two weeks to get my dad's and my daughter's rigs back in action. The Old Man has a bum HD, so I'll be going back when his RMA arrives...

The Specific Question:

My main computer has been dumping WU's right and left for the past week+. I've tried the following:

Wiped & reloaded the F@H program
Dumped cores (several times)
Stopped overclocking
Ran memtest (passed fully)

Here are excerpts from my log:

*******************************

[19:54:02] Quit 101 - Fatal error:
[19:54:02] Step 31327, time 62.654 (ps) LINCS WARNING
[19:54:02] relative constraint deviation after LINCS:
[19:54:02] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[19:54:02] Simulation instability has been encountered. The run has entered a…<snip>
[19:54:03] Folding@home Core Shutdown: EARLY_UNIT_END
[19:54:07] CoreStatus = 72 (114)
[19:54:07] Sending work to server

*******************************

[21:32:02] Gromacs exception handled
[21:32:02] Folding@home Core Shutdown: SPECIAL_EXIT
[21:32:05] CoreStatus = 65 (101)
[21:32:05] Core internal error: SPECIAL_EXIT

*******************************

[23:45:18] Quit 101 - Fatal error:
[23:45:18] Step 12387, time 24.774 (ps) LINCS WARNING
[23:45:18] relative constraint deviation after LINCS:
[23:45:18] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[23:45:18] Simulation instability has been encountered. The run has entered a…<snip>
[23:45:18] Folding@home Core Shutdown: EARLY_UNIT_END
[23:45:22] CoreStatus = 72 (114)
[23:45:22] Sending work to server

******************************

[14:43:27] Completed 85000 out of 500000 steps (17)
[14:45:13] Quit 101 - Fatal error:
[14:45:13] Step 85141, time 170.282 (ps) LINCS WARNING
[14:45:13] relative constraint deviation after LINCS:
[14:45:13] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[14:45:13] Simulation instability has been encountered. The run has entered a…<snip>
[14:45:16] Folding@home Core Shutdown: EARLY_UNIT_END
[14:45:20] CoreStatus = 72 (114)
[14:45:20] Sending work to server

******************************

When I was up at my dad's on Tuesday (system idle all day) it dumped 3 or 4 WU's...

Any suggestions????????? :banghead: :banghead:

The Good News: Got a few parts in and have had one of those "ripple effect" upgrades. Bottom line is that an Athlon 1200 has been replaced by an XP 2400+. Once I get the rest of the crap sorted out I should do better than ever.

primesuspect · December 2003

What client version are you using?

profdlp · December 2003

primesuspect had this to say
What client version are you using?

Should have mentioned:
Tried the 4.00, 3.25, and 3.24.

System is an Athlon 1200 on an Abit KT7A-RAID (not running RAID) with 512MB Crucial CAS2.

seversphere · December 2003

Can it make it thru any other stress programs? With that model Abit I'd check for bulging/leaking capacitors but usually if that's the case it will have problems even loading windows.

profdlp · December 2003

seversphere had this to say
Can it make it thru any other stress programs? With that model Abit I'd check for bulging/leaking capacitors but usually if that's the case it will have problems even loading windows.

No bulging caps (checked earlier today). It runs everything else just fine. Does UT (orig) and Age Of Empires II for hours, if need be. Haven't tried a stress benchmark, but haven't had any Windows errors at all. Just seems like F@H is cursed, and nothing else is affected...

csimon · December 2003

what flags are you using? If you're using -forceasm (or -forceSSE on FAH4) then try removing it ...

seversphere · December 2003

see what happens when you run two new client instances simultaneously. What happens if you underclock or run stock with less than stock vcore (i.e. same errors in FAH)? Is it happening with both the 1200 and xp2400 on the abit board?

profdlp · December 2003

csimon had this to say
what flags are you using? If you're using -forceasm (or -forceSSE on FAH4) then try removing it ...

No current flags. I have even tried adding them (though that seemed counterintuitive). Made no difference either way, but they are off now.

seversphere had this to say
see what happens when you run two new client instances simultaneously. What happens if you underclock or run stock with less than stock vcore (i.e. same errors in FAH)? Is it happening with both the 1200 and xp2400 on the abit board?

Not sure what you mean by the first part; how would I run two simultaneously? Haven't tried underclocking. My, has it come to that?

I'll mess with the voltages and see what happens. Also, I may have created some confusion by mentioning my other upgrades. Those are on other computers, this one has not changed a bit. The other comps are cranking out the WU's just fine.

This is what the comp did overnight:

[09:58:45] Completed 725000 out of 2500000 steps (29)
[11:05:28] Writing local files
[11:05:30] Completed 750000 out of 2500000 steps (30)
[11:05:47] Gromacs cannot continue further.
[11:05:47] Going to send back what have done.
[11:05:47] Folding@home Core Shutdown: EARLY_UNIT_END
[11:05:50] CoreStatus = 72 (114)
[11:05:50] Sending work to server
[11:06:16] + Working ...
[11:06:16]
[11:06:16] *------------------------------*
[11:06:16] Folding@home Gromacs Core
[11:06:16] Version 1.53 (October 2, 2003)
[11:06:16]
[11:06:16] Preparing to commence simulation
[11:06:16] - Read to use standard loops
[11:06:16] - Created dyn
[11:06:16] - Files status OK
[11:06:17] Project: 803 (Run 1, Clone 46, Gen 42)
[11:06:17]
[11:06:17] Entering M.D.
[11:06:24] Protein: p803_p53dimer803
[11:06:24]
[11:06:24] Writing local files
[11:06:27] Writing local files
[11:06:29] Completed 0 out of 500000 steps (0)
[11:31:19] Writing local files
[11:31:21] Completed 5000 out of 500000 steps (1)
[11:55:02] Quit 101 - Fatal error:
[11:55:02] Step 9961, time 19.922 (ps) LINCS WARNING
[11:55:02] relative constraint deviation after LINCS:
[11:55:02] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[11:55:02]
[11:55:02] Simulation instability has been encountered. The run has entered a
[11:55:02] state from which no further progress can be made.
[11:55:02] If you often see other project units terminating early like this
[11:55:02] too, you may wish to check the stability of your computer (issues
[11:55:02] such as high temperature, overclocking, etc.).
[11:55:02] Going to send back what have done.
[11:55:02] logfile size: 8337
[11:55:02] - Writing 9012 bytes of core data to disk...
[11:55:02] ... Done.
[11:55:02]
[11:55:02] Folding@home Core Shutdown: EARLY_UNIT_END
[11:55:06] CoreStatus = 72 (114)

a2jfreak · December 2003

prof: Download (if you don't already have it) Prime95 and run the Torture Test. If you get the newest version you should be able to choose which type of test . . . L2/RAM/etc. Choose the setting that stresses mainly your CPU (since you said memtest runs just fine) and that way you'll see if your CPU is sending the occasional wrong bit.

hypermood · December 2003

Checked for dust bunnies or processor voltage swings under load (i.e. is the PSU healthy) ?

t1rhino · December 2003

What are the temps like?

profdlp · December 2003

a2jfreak had this to say
...Download (if you don't already have it) Prime95 and run the Torture Test...

Great idea! I'll do it overnight tonight.

hypermood had this to say
Checked for dust bunnies or processor voltage swings under load (i.e. is the PSU healthy) ?

System has been totally cleaned. I'll try and keep an eye on MBM5 and see if I can spot anything.

t1rhino had this to say
What are the temps like?

Worth 1,000 words?:

seversphere · December 2003

the problem is your core voltage

But seriously, my suggestions are just to see what happens so we can compare results, not posed as solutions; sort of a process of elimination. I've only run into LINCS problems when RAM or the RAM bus/subsystem was unstable. For example, I ran a 128MB PC100 stick on a KT7A-R at 133 for a while, but it started to have errors; I brought it down to 124 and it was okay. A damaged or unstable (overclocked) CPU usually results in consistent errors when it's dumping work units: always dumping at a certain frame, or errors during the initial decompress and start of the first frame.
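
One way to check the pattern seversphere describes (dumps at a consistent step versus scattered ones) is to pull the failing step numbers out of the log. Below is a minimal Python sketch, assuming log lines formatted like the excerpts posted earlier in the thread. The names `summarize` and `SAMPLE` are illustrative only and are not part of the F@H client; the sample text is trimmed from the excerpts above.

```python
import re
from collections import Counter

# Matches lines such as:
# [19:54:02] Step 31327, time 62.654 (ps) LINCS WARNING
LINCS_RE = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\] Step (\d+), time [\d.]+ \(ps\)\s+LINCS WARNING"
)
# Matches lines such as:
# [19:54:03] Folding@home Core Shutdown: EARLY_UNIT_END
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")


def summarize(log_text):
    """Return (failing step numbers, Counter of core shutdown reasons)."""
    steps = []
    shutdowns = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        lincs = LINCS_RE.match(line)
        if lincs:
            steps.append(int(lincs.group(1)))
        shutdown = SHUTDOWN_RE.search(line)
        if shutdown:
            shutdowns[shutdown.group(1)] += 1
    return steps, shutdowns


# Lines trimmed from the log excerpts in this thread (assumed format).
SAMPLE = """\
[19:54:02] Step 31327, time 62.654 (ps) LINCS WARNING
[19:54:03] Folding@home Core Shutdown: EARLY_UNIT_END
[21:32:02] Folding@home Core Shutdown: SPECIAL_EXIT
[23:45:18] Step 12387, time 24.774 (ps) LINCS WARNING
[23:45:18] Folding@home Core Shutdown: EARLY_UNIT_END
[14:45:13] Step 85141, time 170.282 (ps) LINCS WARNING
[14:45:16] Folding@home Core Shutdown: EARLY_UNIT_END
"""

if __name__ == "__main__":
    steps, shutdowns = summarize(SAMPLE)
    print("LINCS failures at steps:", steps)
    print("Shutdown reasons:", dict(shutdowns))
```

If the step numbers cluster around the same value, that points toward the "consistent errors" case; widely scattered values, as in the excerpts here, fit the "unstable RAM or bus" case better. To run it against a full log, read the file's text and pass it to `summarize` in place of `SAMPLE`.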
