SMP oddness

sgstair Reverse Engineer Redmond, WA Icrontian
edited November 2007 in Folding@Home
So, this is a bit strange. I'm not sure whether it's a single odd workunit, luck of the draw, or something else.

Here's a normal 2653 SMP WU:
about 30 minutes per frame at the start, settling down to ~20 minutes per frame for the rest of the run.
[00:14:04] Project: 2653 (Run 7, Clone 123, Gen 8)
...
[00:14:16] Completed 0 out of 500000 steps  (0 percent)
[00:44:06] Writing local files
[00:44:06] Completed 5000 out of 500000 steps  (1 percent)
[01:15:24] Writing local files
[01:15:24] Completed 10000 out of 500000 steps  (2 percent)
[01:45:40] Writing local files
[01:45:40] Completed 15000 out of 500000 steps  (3 percent)
[02:15:12] Writing local files
[02:15:12] Completed 20000 out of 500000 steps  (4 percent)
...
[02:01:54] Completed 295000 out of 500000 steps  (59 percent)
[02:23:12] Writing local files
[02:23:12] Completed 300000 out of 500000 steps  (60 percent)
[02:44:33] Writing local files
[02:44:33] Completed 305000 out of 500000 steps  (61 percent)
[03:05:55] Writing local files
[03:05:55] Completed 310000 out of 500000 steps  (62 percent)
Then the other day I started another 2653, and it did this:
the first several percent take ~55 minutes each, and the later ones also take ~55 minutes each.
[18:47:32] Project: 2653 (Run 20, Clone 115, Gen 10)
...
[18:47:43] Completed 0 out of 1000000 steps  (0 percent)
[19:41:36] Writing local files
[19:41:36] Completed 10000 out of 1000000 steps  (1 percent)
[20:35:30] Writing local files
[20:35:30] Completed 20000 out of 1000000 steps  (2 percent)
[21:30:19] Writing local files
[21:30:19] Completed 30000 out of 1000000 steps  (3 percent)
[22:28:09] Writing local files
[22:28:09] Completed 40000 out of 1000000 steps  (4 percent)
...
[21:07:32] Completed 810000 out of 1000000 steps  (81 percent)
[22:02:37] Writing local files
[22:02:37] Completed 820000 out of 1000000 steps  (82 percent)
[22:57:46] Writing local files
[22:57:46] Completed 830000 out of 1000000 steps  (83 percent)
[23:52:50] Writing local files
[23:52:50] Completed 840000 out of 1000000 steps  (84 percent)
I didn't notice this until today; it's been folding for about 3 days now, a lot longer than normal.
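Since the two workunits have different step counts (500000 versus 1000000), a 1 percent frame covers 5000 steps in the first log and 10000 in the second, so frame times alone are not directly comparable. Here is a minimal Python sketch, assuming the log format shown above, that normalizes both to minutes per 1000 steps. The function names and the single-midnight-wrap handling are assumptions for illustration, not part of the F@H client.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "[02:23:12] Completed 300000 out of 500000 steps  (60 percent)"
LINE = re.compile(r"\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps")

def frame_times(log_text):
    """Return (minutes, steps) for each pair of consecutive 'Completed' lines."""
    points = [(datetime.strptime(stamp, "%H:%M:%S"), int(done))
              for stamp, done, _total in LINE.findall(log_text)]
    frames = []
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        delta = t1 - t0
        if delta < timedelta(0):
            # Timestamps carry no date; assume at most one midnight wrap per frame.
            delta += timedelta(days=1)
        frames.append((delta.total_seconds() / 60, s1 - s0))
    return frames

def minutes_per_1000_steps(log_text):
    """Average wall-clock minutes spent per 1000 simulation steps."""
    frames = frame_times(log_text)
    total_minutes = sum(m for m, _ in frames)
    total_steps = sum(s for _, s in frames)
    return 1000 * total_minutes / total_steps

# Excerpts copied from the two logs in this post.
NORMAL = """
[02:01:54] Completed 295000 out of 500000 steps  (59 percent)
[02:23:12] Completed 300000 out of 500000 steps  (60 percent)
[02:44:33] Completed 305000 out of 500000 steps  (61 percent)
[03:05:55] Completed 310000 out of 500000 steps  (62 percent)
"""
SLOW = """
[21:07:32] Completed 810000 out of 1000000 steps  (81 percent)
[22:02:37] Completed 820000 out of 1000000 steps  (82 percent)
[22:57:46] Completed 830000 out of 1000000 steps  (83 percent)
[23:52:50] Completed 840000 out of 1000000 steps  (84 percent)
"""

if __name__ == "__main__":
    print(round(minutes_per_1000_steps(NORMAL), 2))
    print(round(minutes_per_1000_steps(SLOW), 2))
```

On these excerpts it comes out to roughly 4.3 minutes per 1000 steps for the normal unit and roughly 5.5 for the slow one, so the slow unit really is slower per step, though by less than the raw frame times suggest.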
Upon closer inspection, one of the fahcore_a1 processes was taking up 50% of CPU time and another only about 7%; the CPU times for the 4 processes are 77h, 38h, 28h, and 17h. I suspect the multi-CPU task splitting isn't very asymmetric (each process gets about the same share of the work), so having one process on one CPU and three on the other leaves the whole run bound by the slowest process.
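The suspicion above can be put into a toy model. This is an idealized sketch only: it assumes every process gets an equal share of work per step, each core divides its time evenly among its processes, and a step finishes when the slowest process does. It ignores synchronization overhead, memory bandwidth, and whatever the scheduler actually does.

```python
def step_time(procs_per_core, work_per_proc=1.0):
    """Idealized wall time per step.

    procs_per_core: number of worker processes pinned to each core.
    A process sharing its core with k-1 others runs at 1/k speed, so the
    step is bounded by the most crowded core.
    """
    busiest = max(procs_per_core)
    return work_per_proc * busiest

lopsided = step_time([1, 3])  # one process alone, three sharing the other core
balanced = step_time([2, 2])  # two processes on each core

if __name__ == "__main__":
    print(lopsided, balanced, lopsided / balanced)
```

Under these assumptions, moving from a 1+3 layout to a 2+2 layout would cut step time by a third. A real workunit may not behave this way.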
I've manually set the affinity for now to keep 2 processes on each core, but I'm considering using something like affinity changer. I didn't think it would be that much of an issue.
(Note: having set the affinity, I'll post back here after it's had a few more percentage points to churn, to see how much the frame time changed.)
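The 2-per-core layout can be expressed as a simple round-robin assignment. This sketch only computes the plan; the PIDs are hypothetical, and applying the plan is OS-specific (Task Manager's affinity dialog on Windows, or `os.sched_setaffinity(pid, {core})` in Python on Linux).

```python
def affinity_plan(pids, n_cores):
    """Assign PIDs to cores round-robin.

    Returns {pid: (core_index, bitmask)}, where bitmask has only the bit
    for core_index set, which is the usual form of an affinity mask.
    """
    plan = {}
    for i, pid in enumerate(sorted(pids)):
        core = i % n_cores
        plan[pid] = (core, 1 << core)
    return plan

if __name__ == "__main__":
    # Hypothetical PIDs standing in for the four fahcore_a1 processes.
    for pid, (core, mask) in affinity_plan([1004, 1001, 1003, 1002], 2).items():
        print(pid, core, bin(mask))
```

With four processes and two cores, each core ends up with exactly two processes.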

Comments

  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited November 2007
    Truly strange. Wish I had an answer for you. I've never seen that in months of SMP Folding with dual and quad cores.

    Are you sure there wasn't some process running in the background? Pardon me, but I have to ask: were you running a game or something else resource-intensive? SMP is demanding on RAM as well (but you've got tons of RAM, so that's not an issue).
  • csimon Acadiana Icrontian
    edited November 2007
    First of all you should have this line (or something quite similar):

    [12:04:00] Project: 2653 (Run 18, Clone 173, Gen 8)
    [12:04:00]
    [12:04:01] Assembly optimizations on if available.
    [12:04:01] Entering M.D.
    [12:04:07] Calling FAH init
    [12:04:08] Read topology
    [12:04:08] g local files
    [12:04:08] checkpoint)
    [12:04:08] Read checkpoint
    [12:04:08] Protein: Protein in POPC
    [12:04:08] Writing local files
    [12:04:09] Extra SSE boost OK.
    [12:04:10] Writing local files
    [12:04:10] Completed 0 out of 500000 steps (0 percent)

    If not, try stopping the client and then restarting.

    Without it, the "assembly optimizations" were not initialized on the CPU. To prevent this from happening, some of us use the -forceasm flag.
  • sgstair Reverse Engineer Redmond, WA Icrontian
    edited November 2007
    Yeah, it's got SSE - and to follow up, even after setting the affinity it's still running ~52 minutes a frame.
    [18:47:32] Project: 2653 (Run 20, Clone 115, Gen 10)
    [18:47:32] 
    [18:47:34] Entering M.D.
    [18:47:40] Rejecting checkpoint
    [18:47:41] Protein: Protein in POPC
    [18:47:41] Writing local files
    [18:47:42] Extra SSE boost OK.
    [18:47:43] Writing local files
    [18:47:43] Completed 0 out of 1000000 steps  (0 percent)
    ...
    [09:53:08] Completed 950000 out of 1000000 steps  (95 percent)
    [10:45:09] Writing local files
    [10:45:09] Completed 960000 out of 1000000 steps  (96 percent)
    [11:37:07] Writing local files
    [11:37:07] Completed 970000 out of 1000000 steps  (97 percent)
    [12:29:08] Writing local files
    [12:29:08] Completed 980000 out of 1000000 steps  (98 percent)
    
    I guess it's just the workunit, kinda strange though.
    I have another 2653 now (20/129/11) which is taking 22 mins/frame again.
  • mmonnin Centreville, VA
    edited November 2007
    Use the affinity setting app I posted in another thread. It will assign 2 fahcores to each CPU core. I'm guessing 1 fahcore has 1 CPU core and the other 3 fahcores are fighting for the last CPU core. Just a thought.
  • sgstair Reverse Engineer Redmond, WA Icrontian
    edited November 2007
    Well, now I'm not so sure that was the problem. While it is true that running 2 processes on each core results in lower kernel CPU times, I'm thinking that WU may have been slow even without the process imbalance.
    I'm devising a test to see if I can replicate the problem...