SMP oddness
sgstair
Reverse Engineer, Redmond, WA, Icrontian
So, this is a bit strange. I'm not sure whether it's just a single odd workunit, luck of the draw, or something else.
Here's a normal 2653 SMP WU -
about 30 minutes per frame or so at the start, settles down to ~20 mins per frame for the duration.
[00:14:04] Project: 2653 (Run 7, Clone 123, Gen 8)
...
[00:14:16] Completed 0 out of 500000 steps (0 percent)
[00:44:06] Writing local files
[00:44:06] Completed 5000 out of 500000 steps (1 percent)
[01:15:24] Writing local files
[01:15:24] Completed 10000 out of 500000 steps (2 percent)
[01:45:40] Writing local files
[01:45:40] Completed 15000 out of 500000 steps (3 percent)
[02:15:12] Writing local files
[02:15:12] Completed 20000 out of 500000 steps (4 percent)
...
[02:01:54] Completed 295000 out of 500000 steps (59 percent)
[02:23:12] Writing local files
[02:23:12] Completed 300000 out of 500000 steps (60 percent)
[02:44:33] Writing local files
[02:44:33] Completed 305000 out of 500000 steps (61 percent)
[03:05:55] Writing local files
[03:05:55] Completed 310000 out of 500000 steps (62 percent)
Then the other day when I started another 2653, it did this:
the first several percent take ~55 minutes each... and the later ones also take ~55 minutes each.
[18:47:32] Project: 2653 (Run 20, Clone 115, Gen 10)
...
[18:47:43] Completed 0 out of 1000000 steps (0 percent)
[19:41:36] Writing local files
[19:41:36] Completed 10000 out of 1000000 steps (1 percent)
[20:35:30] Writing local files
[20:35:30] Completed 20000 out of 1000000 steps (2 percent)
[21:30:19] Writing local files
[21:30:19] Completed 30000 out of 1000000 steps (3 percent)
[22:28:09] Writing local files
[22:28:09] Completed 40000 out of 1000000 steps (4 percent)
...
[21:07:32] Completed 810000 out of 1000000 steps (81 percent)
[22:02:37] Writing local files
[22:02:37] Completed 820000 out of 1000000 steps (82 percent)
[22:57:46] Writing local files
[22:57:46] Completed 830000 out of 1000000 steps (83 percent)
[23:52:50] Writing local files
[23:52:50] Completed 840000 out of 1000000 steps (84 percent)
I didn't notice this until just today; it's been folding for about 3 days now, a lot longer than normal.
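For anyone who wants to pull frame times out of a log like the ones above without doing the subtraction by hand, here is a rough Python sketch. It is my own helper, not part of the FAH client; the function name `frame_times` and the convention that a line containing only "..." marks a skipped stretch are assumptions of the sketch. It also reports steps per frame, since the two logs above use different step counts (500000 vs 1000000 total), so a "frame" is 5000 steps in one and 10000 in the other.

```python
import re

# Matches lines such as "[00:44:06] Completed 5000 out of 500000 steps (1 percent)".
FRAME_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\] Completed (\d+) out of (\d+) steps")


def frame_times(log_text):
    """Return (steps_in_frame, minutes_for_frame) for each pair of adjacent frames.

    A line containing only "..." marks an elided stretch of the log, so no
    frame time is computed across it.
    """
    frames = []
    prev = None  # (seconds since midnight, steps completed) of the previous frame
    for line in log_text.splitlines():
        line = line.strip()
        if line == "...":
            prev = None
            continue
        match = FRAME_RE.match(line)
        if match is None:
            continue  # e.g. "Writing local files"
        h, m, s, done, _total = (int(g) for g in match.groups())
        now = h * 3600 + m * 60 + s
        if prev is not None:
            # Timestamps carry no date, so wrap around midnight
            # (assumes a frame never takes 24 hours or more).
            elapsed = (now - prev[0]) % 86400
            frames.append((done - prev[1], elapsed / 60.0))
        prev = (now, done)
    return frames


# Excerpts copied from the two logs in this post.
NORMAL = """\
[00:14:16] Completed 0 out of 500000 steps (0 percent)
[00:44:06] Writing local files
[00:44:06] Completed 5000 out of 500000 steps (1 percent)
...
[02:23:12] Completed 300000 out of 500000 steps (60 percent)
[02:44:33] Writing local files
[02:44:33] Completed 305000 out of 500000 steps (61 percent)
"""

SLOW = """\
[18:47:43] Completed 0 out of 1000000 steps (0 percent)
[19:41:36] Writing local files
[19:41:36] Completed 10000 out of 1000000 steps (1 percent)
[20:35:30] Writing local files
[20:35:30] Completed 20000 out of 1000000 steps (2 percent)
"""

for name, log in (("normal", NORMAL), ("slow", SLOW)):
    for steps, minutes in frame_times(log):
        print(f"{name}: {steps} steps in {minutes:.1f} min")
```

Comparing minutes per step rather than minutes per percent is probably the fairer way to line the two workunits up.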
Upon closer inspection, one of the fahcore_a1 processes was taking up 50% of CPU time, and another only had about 7%. The CPU times for all 4 processes are 77h, 38h, 28h, and 17h. I suspect that the multi-CPU task splitting isn't very asymmetric (each process gets a similar share of the work), so having one process on one CPU and 3 on the other is causing the whole job to be bound by the slowest process.
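To put rough numbers on that suspicion, here is a toy Python calculation. The CPU hours are the ones quoted above; everything else is an assumption of the sketch, not something known about fahcore_a1: that each process gets an equal slice of work per step, that all processes must finish before the next step starts, and that processes sharing a core split it evenly. The helper name `relative_step_time` is made up for illustration.

```python
from collections import Counter

# Cumulative CPU hours of the four fahcore_a1 processes, as quoted in the post.
cpu_hours = [77, 38, 28, 17]
total = sum(cpu_hours)
shares = [h / total for h in cpu_hours]
for h, share in zip(cpu_hours, shares):
    print(f"{h:>3} h  ->  {share:.0%} of all CPU time")


def relative_step_time(core_of_process):
    """Toy model: time per step is set by the most crowded core.

    core_of_process lists, for each process, the index of the core it runs on.
    With equal work per process and even sharing of a core, a core hosting
    n processes needs n time units to finish one step for all of them.
    """
    return max(Counter(core_of_process).values())


lopsided = relative_step_time([0, 1, 1, 1])  # one process alone, three sharing
balanced = relative_step_time([0, 0, 1, 1])  # two processes per core
print(f"lopsided/balanced step time: {lopsided / balanced:.2f}x")
```

Under those assumptions the lopsided layout comes out 1.5x slower than two per core, so the model is only a rough sanity check and may not account for the whole difference in frame times.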
I've manually set the affinity for now, to keep 2 processes on each core, but I'm considering using something like affinity changer now. I didn't think it would be that much of an issue.
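For reference, the "2 on each core" arrangement can be sketched in Python like this. It is only an illustration of the idea, not what I actually ran: the PIDs are hypothetical, `plan_affinity` and `apply_affinity` are names invented here, and `os.sched_setaffinity` only exists on Linux, so on Windows you would need Task Manager or something like the third-party psutil package instead.

```python
import os
from collections import Counter


def plan_affinity(pids, n_cores):
    """Spread PIDs over cores round-robin so per-core counts differ by at most one."""
    return {pid: i % n_cores for i, pid in enumerate(sorted(pids))}


def apply_affinity(plan):
    """Pin each PID to its planned core (Linux only; needs permission on the PIDs)."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available on this platform")
    for pid, core in plan.items():
        os.sched_setaffinity(pid, {core})


# Hypothetical PIDs standing in for the four fahcore_a1 processes.
fahcore_pids = [4101, 4102, 4103, 4104]
plan = plan_affinity(fahcore_pids, n_cores=2)
print(plan)
print(Counter(plan.values()))  # how many processes land on each core
```

`apply_affinity(plan)` is deliberately not called here, since the PIDs are made up; with real PIDs it would do the pinning.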
(Note: having set the affinity, I'll post back here after it's had a few more percentage points to churn, to see how much the frame time changed.)
Comments
Are you sure there wasn't some process running in the background, or pardon me, but have to ask - running a game or something else resource intensive? You know, SMP is demanding of the RAM as well. (oh, but you've got tons of RAM...not an issue)
[12:04:00] Project: 2653 (Run 18, Clone 173, Gen 8)
[12:04:00]
[12:04:01] Assembly optimizations on if available.
[12:04:01] Entering M.D.
[12:04:07] Calling FAH init
[12:04:08] Read topology
[12:04:08] g local files
[12:04:08] checkpoint)
[12:04:08] Read checkpoint
[12:04:08] Protein: Protein in POPC
[12:04:08] Writing local files
[12:04:09] Extra SSE boost OK.
[12:04:10] Writing local files
[12:04:10] Completed 0 out of 500000 steps (0 percent)
If your log doesn't show the "Extra SSE boost OK." line, try stopping the client and then restarting it.
Without that line, "assembly optimizations" were not initialized on the CPU. To prevent this from happening, some of us use the -forceasm flag.
I have another 2653 now (20/129/11) which is taking 22 mins/frame again.
I'm devising a test to see if I can replicate the problem...