Ups & Downs

profdlp The Holy City Of Westlake, Ohio
edited December 2003 in Folding@Home
First the rant - without boring you with specifics, my humble Folding farm has been going through some tough times. Besides problems here at home, I've made two 200+ mile round-trips within the last two weeks to get my dad's and my daughter's rigs back in action. The Old Man has a bum HD, so I'll be going back when his RMA arrives... :rolleyes:

The Specific Question:

My main computer has been dumping WU's right and left for the past week+. I've tried the following:

Wiped & reloaded the F@H program
Dumped cores (several times)
Stopped overclocking... :(
Ran memtest (passed fully)

Here are excerpts from my log:
*******************************

[19:54:02] Quit 101 - Fatal error:
[19:54:02] Step 31327, time 62.654 (ps) LINCS WARNING
[19:54:02] relative constraint deviation after LINCS:
[19:54:02] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[19:54:02] Simulation instability has been encountered. The run has entered a…<snip>
[19:54:03] Folding@home Core Shutdown: EARLY_UNIT_END
[19:54:07] CoreStatus = 72 (114)
[19:54:07] Sending work to server

*******************************

[21:32:02] Gromacs exception handled
[21:32:02] Folding@home Core Shutdown: SPECIAL_EXIT
[21:32:05] CoreStatus = 65 (101)
[21:32:05] Core internal error: SPECIAL_EXIT

*******************************

[23:45:18] Quit 101 - Fatal error:
[23:45:18] Step 12387, time 24.774 (ps) LINCS WARNING
[23:45:18] relative constraint deviation after LINCS:
[23:45:18] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[23:45:18] Simulation instability has been encountered. The run has entered a…<snip>
[23:45:18] Folding@home Core Shutdown: EARLY_UNIT_END
[23:45:22] CoreStatus = 72 (114)
[23:45:22] Sending work to server

******************************

[14:43:27] Completed 85000 out of 500000 steps (17)
[14:45:13] Quit 101 - Fatal error:
[14:45:13] Step 85141, time 170.282 (ps) LINCS WARNING
[14:45:13] relative constraint deviation after LINCS:
[14:45:13] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[14:45:13] Simulation instability has been encountered. The run has entered a…<snip>
[14:45:16] Folding@home Core Shutdown: EARLY_UNIT_END
[14:45:20] CoreStatus = 72 (114)
[14:45:20] Sending work to server

******************************
When I was up at my dad's on Tuesday (system idle all day) it dumped 3 or 4 WU's...

Any suggestions????????? :banghead: :banghead:

The Good News: Got a few parts in and have had one of those "ripple effect" upgrades. Bottom line is that an Athlon 1200 has been replaced by an XP 2400+. Once I get the rest of the crap sorted out I should do better than ever.

Comments

  • primesuspect Beepin n' Boopin Detroit, MI Icrontian
    edited December 2003
    What client version are you using?
  • profdlp The Holy City Of Westlake, Ohio
    edited December 2003
    primesuspect had this to say
    What client version are you using?
    Should have mentioned:
    Tried the 4.00, 3.25, and 3.24.

    System is an Athlon 1200 on an Abit KT7A-RAID (not running RAID) with 512MB Crucial CAS2.
  • seversphere
    edited December 2003
    Can it make it thru any other stress programs? With that model Abit I'd check for bulging/leaking capacitors but usually if that's the case it will have problems even loading windows.
  • profdlp The Holy City Of Westlake, Ohio
    edited December 2003
    seversphere had this to say
    Can it make it thru any other stress programs? With that model Abit I'd check for bulging/leaking capacitors but usually if that's the case it will have problems even loading windows.
    No bulging caps (checked earlier today). It runs everything else just fine. Does UT (orig) and Age Of Empires II for hours, if need be. Haven't tried a stress benchmark, but haven't had any Windows errors at all. Just seems like F@H is cursed, and nothing else is affected...
  • csimon Acadiana Icrontian
    edited December 2003
    what flags are you using? If you're using -forceasm (or -forceSSE on FAH4) then try removing it ...
  • seversphere
    edited December 2003
    see what happens when you run two new client instances simultaneously. What happens if you underclock or run stock with less than stock vcore (i.e. same errors in FAH)? Is it happening with both the 1200 and xp2400 on the abit board?
  • profdlp The Holy City Of Westlake, Ohio
    edited December 2003
    csimon had this to say
    what flags are you using? If you're using -forceasm (or -forceSSE on FAH4) then try removing it ...
    No current flags. I have even tried adding them (though that seemed counterintuitive). Made no difference either way, but they are off now.

    seversphere had this to say
    see what happens when you run two new client instances simultaneously. What happens if you underclock or run stock with less than stock vcore (i.e. same errors in FAH)? Is it happening with both the 1200 and xp2400 on the abit board?
    Not sure what you mean by the first part. How would I run two simultaneously? Haven't tried underclocking :eek: , my, has it come to that? :rolleyes: I'll mess with the voltages and see what happens. Also, I may have created some confusion by mentioning my other upgrades. Those are on other computers; this one has not changed a bit. The other comps are cranking out the WU's just fine.

    This is what the comp did overnight:
    [09:58:45] Completed 725000 out of 2500000 steps (29)
    [11:05:28] Writing local files
    [11:05:30] Completed 750000 out of 2500000 steps (30)
    [11:05:47] Gromacs cannot continue further.
    [11:05:47] Going to send back what have done.
    [11:05:47] Folding@home Core Shutdown: EARLY_UNIT_END
    [11:05:50] CoreStatus = 72 (114)
    [11:05:50] Sending work to server
    [11:06:16] + Working ...
    [11:06:16]
    [11:06:16] *
    *
    [11:06:16] Folding@home Gromacs Core
    [11:06:16] Version 1.53 (October 2, 2003)
    [11:06:16]
    [11:06:16] Preparing to commence simulation
    [11:06:16] - Read to use standard loops
    [11:06:16] - Created dyn
    [11:06:16] - Files status OK
    [11:06:17] Project: 803 (Run 1, Clone 46, Gen 42)
    [11:06:17]
    [11:06:17] Entering M.D.
    [11:06:24] Protein: p803_p53dimer803
    [11:06:24]
    [11:06:24] Writing local files
    [11:06:27] Writing local files
    [11:06:29] Completed 0 out of 500000 steps (0)
    [11:31:19] Writing local files
    [11:31:21] Completed 5000 out of 500000 steps (1)
    [11:55:02] Quit 101 - Fatal error:
    [11:55:02] Step 9961, time 19.922 (ps) LINCS WARNING
    [11:55:02] relative constraint deviation after LINCS:
    [11:55:02] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
    [11:55:02]
    [11:55:02] Simulation instability has been encountered. The run has entered a
    [11:55:02] state from which no further progress can be made.
    [11:55:02] If you often see other project units terminating early like this
    [11:55:02] too, you may wish to check the stability of your computer (issues
    [11:55:02] such as high temperature, overclocking, etc.).
    [11:55:02] Going to send back what have done.
    [11:55:02] logfile size: 8337
    [11:55:02] - Writing 9012 bytes of core data to disk...
    [11:55:02] ... Done.
    [11:55:02]
    [11:55:02] Folding@home Core Shutdown: EARLY_UNIT_END
    [11:55:06] CoreStatus = 72 (114)
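
For anyone who would rather count the failures than eyeball them, here is a rough Python sketch that tallies the `Folding@home Core Shutdown:` lines in a log. The function name `tally_shutdowns` and the `SAMPLE_LOG` text are made up for illustration (the sample lines are copied from the excerpts in this thread), and the regex assumes nothing beyond the log format shown above.

```python
import re
from collections import Counter

# Matches lines such as "[19:54:03] Folding@home Core Shutdown: EARLY_UNIT_END".
# Assumption: every shutdown line follows the format seen in this thread.
SHUTDOWN_RE = re.compile(
    r"^\s*\[(\d{2}:\d{2}:\d{2})\]\s+Folding@home Core Shutdown:\s+(\w+)"
)


def tally_shutdowns(log_text):
    """Count how often each core shutdown reason appears in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SHUTDOWN_RE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Illustrative sample built from lines posted in this thread.
SAMPLE_LOG = """\
[19:54:03] Folding@home Core Shutdown: EARLY_UNIT_END
[19:54:07] CoreStatus = 72 (114)
[21:32:02] Folding@home Core Shutdown: SPECIAL_EXIT
[23:45:18] Folding@home Core Shutdown: EARLY_UNIT_END
"""

if __name__ == "__main__":
    for reason, count in tally_shutdowns(SAMPLE_LOG).most_common():
        print(reason, count)
```

To use it on a real log, read the client's log file into a string and pass that to `tally_shutdowns` in place of `SAMPLE_LOG`.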
  • a2jfreak Houston, TX Member
    edited December 2003
    prof: Download (if you don't already have it) Prime95 and run the Torture Test. If you get the newest version you should be able to choose which type of test . . . L2/RAM/etc. Choose the setting that stresses mainly your CPU (since you said memtest runs just fine) and that way you'll see if your CPU is sending the occasional wrong bit.
  • hypermood Smyrna, GA New
    edited December 2003
    Checked for dust bunnies or processor voltage swings under load (i.e. is the PSU healthy) ?
  • t1rhino Toronto
    edited December 2003
    What are the temps like?
  • profdlp The Holy City Of Westlake, Ohio
    edited December 2003
    a2jfreak had this to say
    ...Download (if you don't already have it) Prime95 and run the Torture Test...
    Great idea! I'll do it overnight tonight.
    hypermood had this to say
    Checked for dust bunnies or processor voltage swings under load (i.e. is the PSU healthy) ?
    System has been totally cleaned. I'll try and keep an eye on MBM5 and see if I can spot anything.
    t1rhino had this to say
    What are the temps like?
    Worth 1,000 words?:
  • edited December 2003
    the problem is your core voltage ;D

    but seriously, my suggestions are just to see what happens so we can compare results; they're not posed as solutions - sorta process of elimination. I've only run into LINCS problems when RAM or the RAM bus/subsystem was unstable. Like I ran a 128MB PC100 stick on a KT7A-R at 133 for a while, but it started to have errors, so I brought it down to 124 and it was okay. A damaged or unstable CPU (overclocked) usually results in consistent errors when it's dumping work units - always dumping at a certain frame, or errors during the initial decompress and start of the first frame.
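
One quick way to check the "always dumping at a certain frame" pattern is to pull the step number out of each LINCS warning and see whether the failures cluster. The Python below is only a sketch: `lincs_failure_steps` and `LINCS_SAMPLE` are hypothetical names, the sample lines are the four warnings posted earlier in this thread, and raw step counts are only a rough guide because the work units involved do not all have the same total number of steps.

```python
import re

# Matches lines such as "Step 31327, time 62.654 (ps) LINCS WARNING".
# Assumption: the warning line keeps the format seen in the logs above.
LINCS_RE = re.compile(r"Step (\d+), time [\d.]+ \(ps\)\s+LINCS WARNING")


def lincs_failure_steps(log_text):
    """Return the step number of every LINCS warning, in log order."""
    return [int(m.group(1)) for m in LINCS_RE.finditer(log_text)]


# The four LINCS warnings quoted in this thread.
LINCS_SAMPLE = """\
[19:54:02] Step 31327, time 62.654 (ps) LINCS WARNING
[23:45:18] Step 12387, time 24.774 (ps) LINCS WARNING
[14:45:13] Step 85141, time 170.282 (ps) LINCS WARNING
[11:55:02] Step 9961, time 19.922 (ps) LINCS WARNING
"""

if __name__ == "__main__":
    steps = lincs_failure_steps(LINCS_SAMPLE)
    print("failure steps:", steps)
    # A small spread would suggest failures at a consistent point;
    # a large one suggests they are scattered through the run.
    print("spread:", max(steps) - min(steps))
```

On the four warnings posted here the steps run from 9961 to 85141, so the failures look scattered rather than tied to one point in the run.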