QMD core lockups common
csimon
Acadiana Icrontian
If you're requesting big WUs and use hyperthreading, dual processors, or both, be aware that you may well hit a timered checkpoint and a frozen core before the WU ever really begins. I'm not sure why this happens, but it seems to occur when the system tries to fold two QMD-core WUs at the same time ...only one will continue and the other will freeze, leaving your processor utilizing only 50% or 25% for folding, depending on your particular processor configuration.
I'm sure some of you can come up with other ideas, but I have a basic suggestion ...leave one client configured for big WUs and reconfigure the rest not to request them.
Also ...it seems the cause may be the huge bandwidth requirements of this particular core.
So ...my recommendation to anyone running more than one folding process per workstation: request big WUs on only one client and you shouldn't have to worry about this.
Again ...as far as I know this only pertains to running more than one process per workstation.
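For anyone who sets up several clients from a script, the workaround above comes down to something like the sketch below. It is only an illustration: the client names and the function name are made up, and it does not read or write any real client configuration file.

```python
def plan_big_wu_requests(client_names):
    """Map each client name to True/False: only the first client asks for big WUs.

    Hypothetical helper for illustration; it only builds a plan and does not
    change any actual Folding@home client settings.
    """
    return {name: (index == 0) for index, name in enumerate(client_names)}


# Example: four clients on one hyperthreaded dual-processor workstation.
plan = plan_big_wu_requests(["client1", "client2", "client3", "client4"])
for name, wants_big in plan.items():
    print(f"{name}: request big WUs = {'yes' if wants_big else 'no'}")
```

You would still have to reconfigure each client by hand to match the plan.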
Comments
It had happened before, and after a while it just started up like normal.
If you look around you will find a few more.
So baron do you use a hyperthreaded cpu or dualie?
The issue really only concerns running more than one instance of the QMD core, that is all.
Edit: apparently it may be more than one instance trying to fold on the same processor. So if you run dual procs you may be able to get away with QMDs and 2 non-QMDs ...I'm experimenting with that now, so I'll let you know my findings as soon as I have any.
I just thought it was weird that it made me re-config out of the blue.
EDIT: Reading that thread, I think I'll probably go ahead and disable big work units on one of my clients until I hear that the problem has been fixed.
Basically, this guy, Audioafectionado, has a dual Xeon with 2 GB of RAM, and he also ran into this problem while running 4 clients on it with the newest release of the QMD core (1.03, I think). Normally he shouldn't have any problem even with 4 QMD WUs running, because of the large amount of RAM on that machine, but he's experiencing stalls just like you two. He seems to think it's some kind of issue with RAM bandwidth sharing on Intel machines, since the Xeon shares its memory bandwidth between all physical and logical processors. I think he has turned off 2 clients on that machine, or configured a couple for no BP work, until Stanford gets to the bottom of the problem. As I remember it, he tried configuring 2 clients for non-BP work and still saw a significant slowdown in folding speed on the QMD work, so you might check what kind of points/hour return you get both ways.
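If you want to compare the points/hour return both ways, the arithmetic is simple enough to sketch. Everything here is an assumption for illustration: the function names are made up, a WU is taken to be 100 frames (one per percent shown in the log), and the point values and frame times in the example are invented placeholders, not measurements from anyone's machine. Plug in your own numbers.

```python
def points_per_hour(wu_points, minutes_per_frame, frames_per_wu=100):
    """Points earned per hour by one client at a steady frame time.

    frames_per_wu=100 is an assumption (one frame per percent of the WU).
    """
    hours_per_wu = minutes_per_frame * frames_per_wu / 60.0
    return wu_points / hours_per_wu


def machine_points_per_hour(clients):
    """Sum points/hour over (wu_points, minutes_per_frame) pairs, one per client."""
    return sum(points_per_hour(points, minutes) for points, minutes in clients)


# Placeholder numbers only: replace with the WU values and frame times you observe.
two_big_clients = [(600, 60), (600, 60)]       # both clients slowed down
one_big_one_small = [(600, 30), (150, 20)]     # one big WU, one regular WU

print("two big WUs:   ", round(machine_points_per_hour(two_big_clients), 2))
print("one big, one small:", round(machine_points_per_hour(one_big_one_small), 2))
```

Whichever configuration gives the higher total on your own hardware is the one to keep.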
How much physical ram? Sorry if you already stated it.
I'm so jealous, ever since I stopped getting them my production is down 5-800 points a day...
Had some unusual activity in one over the last day or so. The log showed around 26 min per frame. After thunderstorms and farm shutdowns, it went from 26 min to something like 12 hrs per frame. Finishing off the WU took about 48 hrs for the last 3 frames. :bawling:
[09:31:49] Completed 1980000 out of 2000000 steps (99)
[10:02:26] Timered checkpoint triggered.
[10:33:15] Timered checkpoint triggered.
[11:03:50] Timered checkpoint triggered.
[11:34:37] Timered checkpoint triggered.
[12:05:38] Timered checkpoint triggered.
[12:35:44] Timered checkpoint triggered.
[13:06:29] Timered checkpoint triggered.
[13:13:43] - Autosending finished units...
[13:13:43] Trying to send all finished work units
[13:13:43] + No unsent completed units remaining.
[13:13:43] - Autosend completed
[13:37:15] Timered checkpoint triggered.
[14:07:52] Timered checkpoint triggered.
[14:38:31] Timered checkpoint triggered.
[15:09:09] Timered checkpoint triggered.
[15:39:45] Timered checkpoint triggered.
[16:10:29] Timered checkpoint triggered.
[16:41:09] Timered checkpoint triggered.
[17:12:21] Timered checkpoint triggered.
[17:43:00] Timered checkpoint triggered.
[18:13:38] Timered checkpoint triggered.
[18:43:57] Timered checkpoint triggered.
[19:13:43] - Autosending finished units...
[19:13:43] Trying to send all finished work units
[19:13:43] + No unsent completed units remaining.
[19:13:43] - Autosend completed
[19:14:58] Timered checkpoint triggered.
[19:46:22] Timered checkpoint triggered.
[20:16:58] Timered checkpoint triggered.
[20:48:02] Timered checkpoint triggered.
[21:19:16] Timered checkpoint triggered.
[21:50:00] Timered checkpoint triggered.
[22:21:00] Timered checkpoint triggered.
[22:52:04] Timered checkpoint triggered.
[23:23:05] Timered checkpoint triggered.
[23:53:47] Timered checkpoint triggered.
[00:24:43] Timered checkpoint triggered.
[00:55:28] Timered checkpoint triggered.
It finally went in about 1 hr ago. Back to normal with 26min per frame. This is a P1911.
[05:29:55] Completed 0 out of 2000 steps (0)
[05:55:05] Completed 21 out of 2021 steps (1)
[05:55:05] Writing local files
[06:20:50] Completed 41 out of 2041 steps (2)
[06:20:50] Writing local files
[06:47:41] Completed 62 out of 2062 steps (3)
[06:47:41] Writing local files
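If you'd rather not eyeball the log, a short script can pull frame times out of the "Completed" lines and count how many timered checkpoints have piled up since the last progress line. This is a rough sketch that assumes the log format shown in this post; the function names are my own, and it only handles lines where the timestamp and "Completed" sit together at the start of a line.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "[05:55:05] Completed 21 out of 2021 steps (1)".
COMPLETED = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Completed \d+ out of \d+ steps")
CHECKPOINT = "Timered checkpoint triggered"


def frame_minutes(log_lines):
    """Minutes between consecutive 'Completed' lines (handles one midnight wrap)."""
    times = []
    for line in log_lines:
        match = COMPLETED.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    result = []
    for previous, current in zip(times, times[1:]):
        delta = current - previous
        if delta < timedelta(0):  # the log has no dates, so assume the next day
            delta += timedelta(days=1)
        result.append(delta.total_seconds() / 60.0)
    return result


def checkpoints_since_progress(log_lines):
    """Count timered checkpoints logged after the most recent 'Completed' line."""
    count = 0
    for line in log_lines:
        if COMPLETED.match(line.strip()):
            count = 0
        elif CHECKPOINT in line:
            count += 1
    return count


sample = [
    "[05:29:55] Completed 0 out of 2000 steps (0)",
    "[05:55:05] Completed 21 out of 2021 steps (1)",
    "[05:55:05] Writing local files",
    "[06:20:50] Completed 41 out of 2041 steps (2)",
    "[06:20:50] Writing local files",
    "[06:47:41] Completed 62 out of 2062 steps (3)",
]
print([round(m, 2) for m in frame_minutes(sample)])
```

A long run of checkpoints with no new "Completed" line, like the first log above, is the sign of a stalled core.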
And then we had this on the same puter in the GUI logfile.
This is why folding can be such a braindrain.
[16:15:35] *
*
[16:15:35] Folding@Home QMD Core
[16:15:35] Version 1.03 (Mar 24, 2005)
[16:15:35]
[16:15:35] Preparing to commence simulation
[16:15:35] - Assembly optimizations manually forced on.
[16:15:35] - Not checking prior termination.
[16:15:35] - Expanded 109804 -> 348141 (decompressed 317.0 percent)
[16:15:35] + New frame time estimate; Working...
[16:15:35]
[16:15:35] Project: 1901 (Run 0, Clone 29, Gen 13)
[16:15:35]
[16:15:35] Writing local files
[16:15:35] Extra SSE2 boost OK.
[16:15:35] Entering QMD...
[16:15:40] + New frame time estimate; Working...
[16:15:45] + New frame time estimate; Working...
[16:15:50] + New frame time estimate; Working...
[16:15:52] System: p1901_32_water_molecules
[16:15:52]
[16:15:52] Performing initial WF calculations
[16:15:52] - Number of total steps will change until convergence
[16:15:55] + New frame time estimate; Working...
[16:16:00] + New frame time estimate; Working...
[16:16:05] + New frame time estimate; Working...
[16:16:10] + New frame time estimate; Working...
[16:16:15] + New frame time estimate; Working...
[16:16:16] Completed 0 out of 2000 steps (0)
[16:16:21] + New frame time estimate; Working...
[16:16:26] + New frame time estimate; Working...
[16:16:31] + New frame time estimate; Working...
[16:16:36] + New frame time estimate; Working...
[16:16:41] + New frame time estimate; Working...
[16:16:46] + New frame time estimate; Working...
[16:16:51] + New frame time estimate; Working...
[16:16:56] + New frame time estimate; Working...
[16:17:01] + New frame time estimate; Working...
[16:17:06] + New frame time estimate; Working...
[16:17:11] + New frame time estimate; Working...
[16:17:16] + New frame time estimate; Working...
[16:17:22] + New frame time estimate; Working...
[16:17:27] + New frame time estimate; Working...
[16:17:32] + New frame time estimate; Working...
[16:17:37] + New frame time estimate; Working...
[16:17:42] + New frame time estimate; Working...
[16:17:47] + New frame time estimate; Working...
[16:17:53] + New frame time estimate; Working...
[16:17:58] + New frame time estimate; Working...
[16:18:03] + New frame time estimate; Working...
[16:18:08] + New frame time estimate; Working...
[16:18:13] + New frame time estimate; Working...
[16:18:18] + New frame time estimate; Working...
[16:18:23] + New frame time estimate; Working...
[16:18:28] + New frame time estimate; Working...
[16:18:33] + New frame time estimate; Working...
[16:18:38] + New frame time estimate; Working...
[16:18:43] + New frame time estimate; Working...
[16:18:48] + New frame time estimate; Working...
[16:18:54] + New frame time estimate; Working...
[16:18:59] + New frame time estimate; Working...
[16:19:04] + New frame time estimate; Working...
[16:19:09] + New frame time estimate; Working...
[16:19:14] + New frame time estimate; Working...
[16:19:19] + New frame time estimate; Working...
[16:19:24] + New frame time estimate; Working...
[16:19:29] + New frame time estimate; Working...
[16:19:34] + New frame time estimate; Working...
[16:19:39] + New frame time estimate; Working...
[16:19:44] + New frame time estimate; Working...
[16:19:49] + New frame time estimate; Working...
[16:19:54] + New frame time estimate; Working...
[16:19:59] + New frame time estimate; Working...
[16:20:04] + New frame time estimate; Working...
[16:20:09] + New frame time estimate; Working...
[16:20:14] + New frame time estimate; Working...
[16:20:19] + New frame time estimate; Working...
[16:20:25] + New frame time estimate; Working...
[16:20:30] + New frame time estimate; Working...
[16:20:35] + New frame time estimate; Working...
[16:20:40] + New frame time estimate; Working...
[16:20:45] + New frame time estimate; Working...
[16:20:50] + New frame time estimate; Working...
[16:20:56] + New frame time estimate; Working...
[16:21:01] + New frame time estimate; Working...
[16:21:06] + New frame time estimate; Working...
[16:21:07] Completed 21 out of 2021 steps (1)
[16:21:07] Writing local files
[16:21:11] + Writing 'sec_per_frame = 21.095238' to config
[16:21:11] + Working ...Timered checkpoint triggered.
[16:24:02] WF converged, jumping to MD
[16:24:02] Verifying checksum
[16:24:02] Finished
[16:24:20] Completed 33 out of 2033 steps (1)
[16:26:42] Completed 41 out of 2033 steps (2)
[16:26:42] Writing local files
[16:26:42] + Writing 'sec_per_frame = 16.549999' to config
[16:26:42] + Working ...Completed 61 out of 2033 steps (3)
[16:32:41] Writing local files
[16:32:45] + Writing 'sec_per_frame = 18.150000' to config
[16:32:45] + Working ...Completed 82 out of 2033 steps (4)
[16:38:56] Writing local files
[16:38:57] + Writing 'sec_per_frame = 17.714285' to config
[16:38:57] + Working ...Completed 102 out of 2033 steps (5)
[16:44:52] Writing local files
[16:44:54] + Writing 'sec_per_frame = 17.850000' to config
[16:44:54] + Working ...Completed 122 out of 2033 steps (6)
[16:50:50] Writing local files
[16:50:51] + Writing 'sec_per_frame = 17.850000' to config
[16:50:51] + Working ...Completed 143 out of 2033 steps (7)
[16:57:06] Writing local files
[16:57:09] + Writing 'sec_per_frame = 18.000000' to config
[16:57:09] + Working ...Completed 163 out of 2033 steps (8)
[17:03:04] Writing local files
[17:03:07] + Writing 'sec_per_frame = 17.850000' to config
[17:03:07] + Working ...+ New frame time estimate; Working...
[17:46:48] + New frame time estimate; Working...
[17:46:53] + New frame time estimate; Working...
[17:46:58] + New frame time estimate; Working...
[17:47:03] + New frame time estimate; Working...
[17:47:08] + New frame time estimate; Working...
[17:47:13] + New frame time estimate; Working...
[17:47:19] + New frame time estimate; Working...
[17:47:24] + New frame time estimate; Working...
[17:47:29] + New frame time estimate; Working...
[17:47:34] + New frame time estimate; Working...
[17:47:39] + New frame time estimate; Working...
[17:47:44] + New frame time estimate; Working...
[17:47:50] + New frame time estimate; Working...
[17:47:55] + New frame time estimate; Working...
[17:47:55] Completed 183 out of 2033 steps (9)
[17:47:55] Unit 1's deadline (January 19 16:15) has passed.
[17:47:55] Going to interrupt core and move on to next unit...
[17:47:55] Writing local files
[17:47:55] Unit 1's deadline (January 19 16:15) has passed.
[17:47:55] Going to interrupt core and move on to next unit...
[17:47:55] Waiting for the core to finish writing checkpoint files...
Not complaining ................IT happens.
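The log above goes quiet between 17:03:07 and 17:46:48, right before the deadline message shows up. A small script can flag silent stretches like that. This is just a sketch under the assumption that every line of interest starts with an [HH:MM:SS] stamp as in the pasted log; the function name and the 30-minute threshold are my own choices.

```python
import re
from datetime import datetime, timedelta

STAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]")


def silent_gaps(log_lines, threshold_minutes=30):
    """Return (start, end, minutes) for consecutive stamped lines further apart
    than threshold_minutes. Lines without a leading timestamp are skipped."""
    gaps = []
    previous = None
    for line in log_lines:
        match = STAMP.match(line.strip())
        if not match:
            continue
        current = datetime.strptime(match.group(1), "%H:%M:%S")
        if previous is not None:
            delta = current - previous
            if delta < timedelta(0):  # no dates in the log, so assume a midnight wrap
                delta += timedelta(days=1)
            minutes = delta.total_seconds() / 60.0
            if minutes > threshold_minutes:
                gaps.append((previous.strftime("%H:%M:%S"), match.group(1), minutes))
        previous = current
    return gaps


sample = [
    "[17:03:04] Writing local files",
    "[17:03:07] + Working ...+ New frame time estimate; Working...",
    "[17:46:48] + New frame time estimate; Working...",
    "[17:46:53] + New frame time estimate; Working...",
]
for start, end, minutes in silent_gaps(sample):
    print(f"no log output from {start} to {end} ({minutes:.1f} min)")
```

A gap only tells you the client logged nothing for a while; it could be a stalled core or simply the machine being shut down.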
I've read the FAQ for the QMD core at folding.stanford.edu and my computer can handle them. It has 2GB RAM, with at least 1.5G free most of the time if I am not editing video or photos.
Personally, I think anyone who has an AMD rig should be glad this is happening. The QMD core stuff still seems too beta to have been released, even to -advmethods, without first modifying the client so that QMD cores could be excluded from being chosen.
So you mean the evil monopolistic microchip giant... cough... Intel... cough, has successfully made it so "only" their pee-fours can get a QMD... I am so surprised.
AMD rules
Athlon 64 3400+ 2400mhz
It may be fixed by now with the new core 1.04, dated April 7, 2005.
The Gromacs core has also been updated to 1.81 as of April 6, 2005.
Essentially, I think the problem is that the QMD core taxes the system so hard that 2 instances of it are just too much, and that's why I recommended 1 QMD and 1 Gromacs or Tinker. This issue has really gotten a lot of attention over at Stanford, and a few of the really major donors actually threatened to quit altogether. I'm not sure what ever came of that, but as long as you're aware of the workaround your production should be good.
btw ...be on the lookout for my contracting thread since we just got the bulk of the furniture and I am having the yard leveled and contoured with fill dirt right now ...topsoil within the next few days weather permitting!