QMD core lockups common

csimon Acadiana Icrontian
edited August 2005 in Folding@Home
If you're requesting big WUs and using hyperthreading, dual processors, or both, you should be aware that you will quite possibly hit a timered checkpoint and a frozen core before the WU ever really begins. I'm not sure why this happens, but it seems to occur when the system tries to fold 2 QMD-core WUs at the same time ...only one will continue and the other will freeze, leaving your processor only 50% or 25% utilized for folding, depending on your particular processor configuration.
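
If you want to catch a frozen client without babysitting it, here's a rough watchdog sketch (Python, purely as illustration ...it assumes the stock FAHlog.txt wording you'll see quoted in the comments below, where a stalled core prints "Timered checkpoint triggered." with no new "Completed ... steps" lines):

    # frozen_core_check.py -- crude stall detector for a folding client log.
    # Assumption: a stalled QMD core keeps printing "Timered checkpoint
    # triggered." without any new "Completed N out of M steps" progress lines
    # (see the log excerpts quoted later in this thread).
    import sys

    def looks_frozen(log_path, window=20):
        """True if the last `window` lines show checkpoints but no progress."""
        with open(log_path, errors="replace") as f:
            tail = f.readlines()[-window:]
        checkpoints = any("Timered checkpoint triggered" in line for line in tail)
        progress = any("Completed" in line and "steps" in line for line in tail)
        return checkpoints and not progress

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            print(path, "-> possibly frozen" if looks_frozen(path) else "-> ok")

Point it at each instance, e.g. python frozen_core_check.py C:\FAH1\FAHlog.txt C:\FAH2\FAHlog.txt.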

I'm sure some of you can come up with ideas, but I have a basic suggestion ...leave one client configured for big WUs and reconfigure the rest not to request them.
Also ...it seems that what may be happening is that this particular core has huge memory-bandwidth requirements.
So ...my recommendation: if you run more than one folding process per workstation, only request one big WU and you shouldn't have to worry about this.

Again ...as far as I know this only pertains to more than one process per workstation.
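
If you script your installs, the reconfiguration can be automated along these lines. This is only a sketch: it assumes each client directory keeps an INI-style client.cfg with a bigpackets entry under [settings], and the paths are made up ...verify against your own files, since the interactive -config pass is the supported way to change this.

    # one_big_wu.py -- leave only the first instance requesting big WUs.
    # Assumption: each instance directory holds an INI-style client.cfg with
    # a "bigpackets" key under [settings]; the paths below are hypothetical.
    from configparser import ConfigParser

    instances = [r"C:\FAH1", r"C:\FAH2"]  # one client directory per instance

    for i, inst in enumerate(instances):
        cfg_path = inst + r"\client.cfg"
        cfg = ConfigParser()
        cfg.read(cfg_path)
        # Only instance 0 keeps requesting big work units.
        cfg.set("settings", "bigpackets", "yes" if i == 0 else "no")
        with open(cfg_path, "w") as f:
            cfg.write(f)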

Comments

  • TheBaron Austin, TX
    edited March 2005
    I saw this happening on my computer at home. I stopped both services and manually opened the client that had stopped. I found that it was requesting new config info... very confusing.

    It had happened before, and after a while it just started up like normal.
  • csimon Acadiana Icrontian
    edited March 2005
    Here is a thread on the subject over at the folding community: http://forum.folding-community.org/viewtopic.php?t=11734
    If you look around you will find a few more.

    So Baron, do you use a hyperthreaded CPU or a dualie?

    The issue really only concerns more than one instance of the QMD core, that is all.

    edit:\\ Apparently it may be more than one instance trying to fold on the same processor. So if you run dual procs you may be able to get away with 2 QMDs and 2 non-QMDs ...I'm experimenting with that now, so I'll let you know my findings as soon as I have any.
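
    If you want to test the same-processor theory yourself, you can pin each client by hand in Task Manager, or script it roughly like this (a sketch only ...psutil is a modern third-party package, not something the client ships with, and on an HT chip both logical CPUs share one physical core, so pinning mainly helps on true dual-processor boards):

    # pin_clients.py -- spread folding client processes across CPUs.
    # Sketch only: uses the third-party psutil package to illustrate the
    # affinity idea; substitute whatever your OS provides.
    import psutil

    cpu = 0
    for proc in psutil.process_iter(["pid", "name"]):
        name = proc.info["name"] or ""
        if "fah" in name.lower():  # match the client exe name on your box
            target = cpu % psutil.cpu_count()
            proc.cpu_affinity([target])
            print(f"pinned {name} (pid {proc.info['pid']}) to cpu {target}")
            cpu += 1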
  • TheBaron Austin, TX
    edited March 2005
    I am running hyperthreaded, but even with 2 QMD cores folding it DOES NOT use all of my physical RAM (though it gets close). I'm kind of curious what the actual problem is

    I just thought it was weird that it made me re-config out of the blue.

    EDIT: Reading that thread, I think I'll probably go ahead and disable big work units on one of my clients until I hear that the problem has been fixed
  • muddocktor
    edited March 2005
    There's also a thread about this issue over in the Overclockers folding forum. I was going to post about it this morning, but the damn sat connection on the rig was acting up again. :banghead:

    Basically, this guy, Audioafectionado, has a Xeon dually with 2 gigs of RAM, and he also ran into this problem while running 4 clients on it with the newest release of the QMD core (1.03 I think). Normally he shouldn't have any problem even with 4 QMD WUs running, because of the large amount of RAM in that machine, but he's experiencing stalls just like you two. He seems to think it's some kind of issue with sharing of memory bandwidth on Intel machines, since the Xeon splits its memory bandwidth between all physical and logical processors. I think he has turned off 2 clients on that machine, or configured a couple for no big-packet (BP) work, until Stanford gets to the bottom of the problem. I seem to remember that he tried configuring 2 clients for non-BP work and still saw a significant slowdown in folding speed on the QMD work, so you might check what kind of points/hour return you get both ways.
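
    Putting numbers on that comparison is simple arithmetic. Here's a sketch with made-up point values and frame times (placeholders only ...substitute whatever your own logs and the project pages report):

    # points_per_hour.py -- compare returns for two client mixes.
    # All point values and frame times below are placeholders, NOT real
    # WU numbers; plug in figures from your own logs.
    def pts_per_hour(wu_points, frames, sec_per_frame):
        """One WU worth wu_points, finished as `frames` frames of sec_per_frame seconds."""
        return wu_points / (frames * sec_per_frame / 3600.0)

    # hypothetical: two QMDs where one stalls, vs. one QMD plus a gromacs
    two_qmds = pts_per_hour(400, 100, 900) + pts_per_hour(400, 100, 7200)
    mixed = pts_per_hour(400, 100, 900) + pts_per_hour(600, 100, 1100)
    print(f"2x QMD (one stalling): {two_qmds:.1f} pts/hr")
    print(f"QMD + gromacs:         {mixed:.1f} pts/hr")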
  • TheBaronTheBaron Austin, TX
    edited March 2005
    Some of the QMD core WUs absolutely KILL my PPW on units running on my other client. I've only noticed this happen once or twice (since in general, if my CPU is at 100%, I assume everything is working as it should), but I could probably look at the logs and find out which WU did it.
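
    Something like this would do the digging ...a sketch whose line formats are inferred from the FAHlog excerpts quoted elsewhere in this thread, so adjust if your log differs:

    # wu_frame_times.py -- per-frame times and project IDs from FAHlog.txt.
    # Line formats inferred from the excerpts in this thread, e.g.
    #   [16:15:52] Project: 1901 (Run 0, Clone 29, Gen 13)
    #   [16:21:07] Completed 21 out of 2021 steps (1)
    import re, sys
    from datetime import datetime, timedelta

    STAMP = re.compile(r"\[(\d\d:\d\d:\d\d)\]")
    PROJ = re.compile(r"Project: (\d+)")
    FRAME = re.compile(r"Completed \d+ out of \d+ steps \(\d+\)")

    def report(path):
        project, last = "?", None
        for line in open(path, errors="replace"):
            ts = STAMP.match(line)
            if not ts:
                continue
            t = datetime.strptime(ts.group(1), "%H:%M:%S")
            m = PROJ.search(line)
            if m:
                project, last = m.group(1), None
            elif FRAME.search(line):
                if last is not None:
                    # logs carry no date, so wrap deltas across midnight
                    print(f"p{project}: {(t - last) % timedelta(days=1)} per frame")
                last = t

    report(sys.argv[1])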
  • csimon Acadiana Icrontian
    edited March 2005
    muddocktor wrote:
    Basically, this guy, Audioafectionado, has a Xeon dually with 2 gigs of RAM, and he also ran into this problem while running 4 clients on it with the newest release of the QMD core (1.03 I think).
    That sounds like the same rig I was having trouble on ...folding 4 instances.
    TheBaron wrote:
    I am running hyperthreaded, but even with 2 QMD cores folding it DOES NOT use all of my physical RAM (though it gets close).
    How much physical ram? Sorry if you already stated it.
  • TheBaron Austin, TX
    edited March 2005
    1.5 GB
  • csimon Acadiana Icrontian
    edited March 2005
    I'll have to read further ...perhaps you have enough RAM. I have 1GB per processor.
  • DanG I AM CANADIAN Icrontian
    edited March 2005
    I am running big WUs on a dual Xeon with HT at work. Two of the CPUs were chugging along with QMDs, one had a tinker, and one a gromacs. This is with 4GB of RAM. As of this morning (I just set up 3 of the 4 clients last night) everything was well.
  • csimon Acadiana Icrontian
    edited March 2005
    I'm running 2 QMDs and 2 600-point gromacs on mine w/ 2GB RAM and no issues ...the problem only seems to arise when I get 2 QMDs on the same processor.
  • DanG I AM CANADIAN Icrontian
    edited March 2005
    csimon wrote:
    I'm running 2 QMDs and 2 600-point gromacs on mine w/ 2GB RAM and no issues ...the problem only seems to arise when I get 2 QMDs on the same processor.


    I'm so jealous; ever since I stopped getting them, my production is down 500-800 points a day...
  • dragonV8 not here much New
    edited March 2005
    Those problems are surfacing in the Anderson household as well. All bar one computer runs with 512MB per WU; the HT computers have 1GB.

    Had some unusual activity in one over the last day or so. The log showed around 26 min per frame. After thunderstorms and farm shutdowns, it went from 26 min to something like 12 hrs per frame. Finishing off the WU took something like 48 hrs for the last 3 frames. :bawling:

    [09:31:49] Completed 1980000 out of 2000000 steps (99)
    [10:02:26] Timered checkpoint triggered.
    [10:33:15] Timered checkpoint triggered.
    [11:03:50] Timered checkpoint triggered.
    [11:34:37] Timered checkpoint triggered.
    [12:05:38] Timered checkpoint triggered.
    [12:35:44] Timered checkpoint triggered.
    [13:06:29] Timered checkpoint triggered.
    [13:13:43] - Autosending finished units...
    [13:13:43] Trying to send all finished work units
    [13:13:43] + No unsent completed units remaining.
    [13:13:43] - Autosend completed
    [13:37:15] Timered checkpoint triggered.
    [14:07:52] Timered checkpoint triggered.
    [14:38:31] Timered checkpoint triggered.
    [15:09:09] Timered checkpoint triggered.
    [15:39:45] Timered checkpoint triggered.
    [16:10:29] Timered checkpoint triggered.
    [16:41:09] Timered checkpoint triggered.
    [17:12:21] Timered checkpoint triggered.
    [17:43:00] Timered checkpoint triggered.
    [18:13:38] Timered checkpoint triggered.
    [18:43:57] Timered checkpoint triggered.
    [19:13:43] - Autosending finished units...
    [19:13:43] Trying to send all finished work units
    [19:13:43] + No unsent completed units remaining.
    [19:13:43] - Autosend completed
    [19:14:58] Timered checkpoint triggered.
    [19:46:22] Timered checkpoint triggered.
    [20:16:58] Timered checkpoint triggered.
    [20:48:02] Timered checkpoint triggered.
    [21:19:16] Timered checkpoint triggered.
    [21:50:00] Timered checkpoint triggered.
    [22:21:00] Timered checkpoint triggered.
    [22:52:04] Timered checkpoint triggered.
    [23:23:05] Timered checkpoint triggered.
    [23:53:47] Timered checkpoint triggered.
    [00:24:43] Timered checkpoint triggered.
    [00:55:28] Timered checkpoint triggered.


    It finally went in about 1 hr ago. Back to normal with 26min per frame. This is a P1911.

    [05:29:55] Completed 0 out of 2000 steps (0)
    [05:55:05] Completed 21 out of 2021 steps (1)
    [05:55:05] Writing local files
    [06:20:50] Completed 41 out of 2041 steps (2)
    [06:20:50] Writing local files
    [06:47:41] Completed 62 out of 2062 steps (3)
    [06:47:41] Writing local files

    And then we had this on the same puter in the GUI logfile.
    This is why folding can be such a braindrain. :scratch:

    [16:15:35] *
    *
    [16:15:35] Folding@Home QMD Core
    [16:15:35] Version 1.03 (Mar 24, 2005)
    [16:15:35]
    [16:15:35] Preparing to commence simulation
    [16:15:35] - Assembly optimizations manually forced on.
    [16:15:35] - Not checking prior termination.
    [16:15:35] - Expanded 109804 -> 348141 (decompressed 317.0 percent)
    [16:15:35] + New frame time estimate; Working...
    [16:15:35]
    [16:15:35] Project: 1901 (Run 0, Clone 29, Gen 13)
    [16:15:35]
    [16:15:35] Writing local files
    [16:15:35] Extra SSE2 boost OK.
    [16:15:35] Entering QMD...
    [16:15:40] + New frame time estimate; Working...
    [16:15:45] + New frame time estimate; Working...
    [16:15:50] + New frame time estimate; Working...
    [16:15:52] System: p1901_32_water_molecules
    [16:15:52]
    [16:15:52] Performing initial WF calculations
    [16:15:52] - Number of total steps will change until convergence
    [16:15:55] + New frame time estimate; Working...
    [16:16:00] + New frame time estimate; Working...
    [16:16:05] + New frame time estimate; Working...
    [16:16:10] + New frame time estimate; Working...
    [16:16:15] + New frame time estimate; Working...
    [16:16:16] Completed 0 out of 2000 steps (0)
    [16:16:21] + New frame time estimate; Working...
    [16:16:26] + New frame time estimate; Working...
    [16:16:31] + New frame time estimate; Working...
    [16:16:36] + New frame time estimate; Working...
    [16:16:41] + New frame time estimate; Working...
    [16:16:46] + New frame time estimate; Working...
    [16:16:51] + New frame time estimate; Working...
    [16:16:56] + New frame time estimate; Working...
    [16:17:01] + New frame time estimate; Working...
    [16:17:06] + New frame time estimate; Working...
    [16:17:11] + New frame time estimate; Working...
    [16:17:16] + New frame time estimate; Working...
    [16:17:22] + New frame time estimate; Working...
    [16:17:27] + New frame time estimate; Working...
    [16:17:32] + New frame time estimate; Working...
    [16:17:37] + New frame time estimate; Working...
    [16:17:42] + New frame time estimate; Working...
    [16:17:47] + New frame time estimate; Working...
    [16:17:53] + New frame time estimate; Working...
    [16:17:58] + New frame time estimate; Working...
    [16:18:03] + New frame time estimate; Working...
    [16:18:08] + New frame time estimate; Working...
    [16:18:13] + New frame time estimate; Working...
    [16:18:18] + New frame time estimate; Working...
    [16:18:23] + New frame time estimate; Working...
    [16:18:28] + New frame time estimate; Working...
    [16:18:33] + New frame time estimate; Working...
    [16:18:38] + New frame time estimate; Working...
    [16:18:43] + New frame time estimate; Working...
    [16:18:48] + New frame time estimate; Working...
    [16:18:54] + New frame time estimate; Working...
    [16:18:59] + New frame time estimate; Working...
    [16:19:04] + New frame time estimate; Working...
    [16:19:09] + New frame time estimate; Working...
    [16:19:14] + New frame time estimate; Working...
    [16:19:19] + New frame time estimate; Working...
    [16:19:24] + New frame time estimate; Working...
    [16:19:29] + New frame time estimate; Working...
    [16:19:34] + New frame time estimate; Working...
    [16:19:39] + New frame time estimate; Working...
    [16:19:44] + New frame time estimate; Working...
    [16:19:49] + New frame time estimate; Working...
    [16:19:54] + New frame time estimate; Working...
    [16:19:59] + New frame time estimate; Working...
    [16:20:04] + New frame time estimate; Working...
    [16:20:09] + New frame time estimate; Working...
    [16:20:14] + New frame time estimate; Working...
    [16:20:19] + New frame time estimate; Working...
    [16:20:25] + New frame time estimate; Working...
    [16:20:30] + New frame time estimate; Working...
    [16:20:35] + New frame time estimate; Working...
    [16:20:40] + New frame time estimate; Working...
    [16:20:45] + New frame time estimate; Working...
    [16:20:50] + New frame time estimate; Working...
    [16:20:56] + New frame time estimate; Working...
    [16:21:01] + New frame time estimate; Working...
    [16:21:06] + New frame time estimate; Working...
    [16:21:07] Completed 21 out of 2021 steps (1)
    [16:21:07] Writing local files
    [16:21:11] + Writing 'sec_per_frame = 21.095238' to config
    [16:21:11] + Working ...Timered checkpoint triggered.
    [16:24:02] WF converged, jumping to MD
    [16:24:02] Verifying checksum
    [16:24:02] Finished
    [16:24:20] Completed 33 out of 2033 steps (1)
    [16:26:42] Completed 41 out of 2033 steps (2)
    [16:26:42] Writing local files
    [16:26:42] + Writing 'sec_per_frame = 16.549999' to config
    [16:26:42] + Working ...Completed 61 out of 2033 steps (3)
    [16:32:41] Writing local files
    [16:32:45] + Writing 'sec_per_frame = 18.150000' to config
    [16:32:45] + Working ...Completed 82 out of 2033 steps (4)
    [16:38:56] Writing local files
    [16:38:57] + Writing 'sec_per_frame = 17.714285' to config
    [16:38:57] + Working ...Completed 102 out of 2033 steps (5)
    [16:44:52] Writing local files
    [16:44:54] + Writing 'sec_per_frame = 17.850000' to config
    [16:44:54] + Working ...Completed 122 out of 2033 steps (6)
    [16:50:50] Writing local files
    [16:50:51] + Writing 'sec_per_frame = 17.850000' to config
    [16:50:51] + Working ...Completed 143 out of 2033 steps (7)
    [16:57:06] Writing local files
    [16:57:09] + Writing 'sec_per_frame = 18.000000' to config
    [16:57:09] + Working ...Completed 163 out of 2033 steps (8)
    [17:03:04] Writing local files
    [17:03:07] + Writing 'sec_per_frame = 17.850000' to config
    [17:03:07] + Working ...+ New frame time estimate; Working...
    [17:46:48] + New frame time estimate; Working...
    [17:46:53] + New frame time estimate; Working...
    [17:46:58] + New frame time estimate; Working...
    [17:47:03] + New frame time estimate; Working...
    [17:47:08] + New frame time estimate; Working...
    [17:47:13] + New frame time estimate; Working...
    [17:47:19] + New frame time estimate; Working...
    [17:47:24] + New frame time estimate; Working...
    [17:47:29] + New frame time estimate; Working...
    [17:47:34] + New frame time estimate; Working...
    [17:47:39] + New frame time estimate; Working...
    [17:47:44] + New frame time estimate; Working...
    [17:47:50] + New frame time estimate; Working...
    [17:47:55] + New frame time estimate; Working...
    [17:47:55] Completed 183 out of 2033 steps (9)
    [17:47:55] Unit 1's deadline (January 19 16:15) has passed.
    [17:47:55] Going to interrupt core and move on to next unit...
    [17:47:55] Writing local files
    [17:47:55] Unit 1's deadline (January 19 16:15) has passed.
    [17:47:55] Going to interrupt core and move on to next unit...
    [17:47:55] Waiting for the core to finish writing checkpoint files...

    Not complaining :( ................It happens. :rolleyes:
  • csimon Acadiana Icrontian
    edited March 2005
    Well, I wasn't successful in getting more than 1 QMD to fold on the server, so now it's 1 QMD + 3 gromacs.
  • Medlock Miramar, Florida Member
    edited March 2005
    I'll look into this when I get home. I have been noticing something not quite right on my main folder.
  • rc1974 Grand Junction, CO
    edited April 2005
    What are you running to get QMD cores? I'm running the usual -advmethods and large WUs, but I have yet to get any QMD WUs. :(

    I've read the FAQ for the QMD core at folding.stanford.edu, and my computer can handle them. It has 2GB RAM, with at least 1.5GB free most of the time if I'm not editing video or photos.
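
    For reference, here's roughly how my instances get started (a sketch ...the directories are mine and the exe name is from my 5.04 install, so adjust both; each copy was configured separately with the client's interactive -config):

    # start_clients.py -- launch two console clients from separate directories.
    # Directory names are examples; each copy should already be configured
    # per-directory (machine ID, big-WU preference) via interactive -config.
    import subprocess

    for workdir in (r"C:\FAH1", r"C:\FAH2"):
        # -advmethods requests advanced/beta work such as the QMD core;
        # running each copy from its own directory keeps the clients apart.
        subprocess.Popen([workdir + r"\FAH504-Console.exe", "-advmethods"],
                         cwd=workdir)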
  • csimon Acadiana Icrontian
    edited April 2005
    rc1974 wrote:
    What are you running to get QMD cores? I'm running the usual -advmethods and large WU's, but I have yet to get any QMD WU's. :(
    pee-fours
  • edited April 2005
    Someone stated over in the Overclockers folding forums that the reason the QMD core work isn't being assigned to AMD procs is licensing problems with SSE2 on AMD procs, but I'm not sure how true that is. Since these fold so damn slowly without SSE2 optimizations, Stanford decided not to assign them to A64 and Opteron procs until the licensing issues are settled.

    Personally, I think anyone with an AMD rig should be glad this is happening, as I think the QMD core stuff is still too beta to be released even to -advmethods users without the client being modified so that the QMD cores can be excluded. :rolleyes:
  • rc1974 Grand Junction, CO
    edited April 2005
    csimon wrote:
    pee-fours


    So you mean the evil monopolistic microchip giant... cough... Intel... cough, has successfully made it so "only" their pee-fours can get a QMD... :eek: I am so surprised.

    AMD rules :thumbsup:

    Athlon 64 3400+ @ 2400MHz
  • Medlock Miramar, Florida Member
    edited April 2005
    Has this problem been fixed yet? My average PPW could be a bit higher on my P4 with 2 large units...
  • csimon Acadiana Icrontian
    edited April 2005
    TheGr81 wrote:
    Has this problem been fixed yet? My average PPW could be a bit higher on my P4 with 2 large units...
    I dunno, but if you fold one QMD and one gromacs you shouldn't have a problem ...that's essentially what I'm doing.

    It may be fixed by now with the new core 1.04, dated April 7, 2005.
    The gromacs core was also updated to 1.81 as of April 6, 2005.
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited August 2005
    I wish I had noticed this thread MONTHS ago. For the last two weeks, system 1 (HT enabled, 2 F@H instances) has been dumping work units left and right in both Folding instances. "Timered checkpoint...". I have now changed one of the client configs to not accept large packets.
  • csimon Acadiana Icrontian
    edited August 2005
    Leonardo wrote:
    I wish I had noticed this thread MONTHS ago. For the last two weeks, system 1 (HT enabled, 2 F@H instances) has been dumping work units left and right in both Folding instances. "Timered checkpoint...". I have now changed one of the client configs to not accept large packets.
    Sorry about your loss, Leon. I suppose the upside is that your points should be on the rise soon, since you now seem to have corrected the issue.

    Essentially, I think the problem is that the QMD core taxes the system so hard that 2 instances of it are just too much; that's why I recommended 1 QMD plus 1 gromacs or tinker. This issue got a lot of attention over at Stanford, and a few of the really major donators actually threatened to quit altogether. I'm not sure what ever came of that, but as long as you're aware of the workaround, your production should be good. :thumbsup:

    btw ...be on the lookout for my contracting thread since we just got the bulk of the furniture and I am having the yard leveled and contoured with fill dirt right now ...topsoil within the next few days weather permitting! :D