Crappy WU's, how long till they're gone?

DanG I AM CANADIAN Icrontian
edited February 2007 in Folding@Home
These 5-way melt things are killing production for everyone; when are the good ones coming back?

Comments

  • shwaip bluffin' with my muffin Icrontian
    edited January 2007
    DanG wrote:
    These 5-way melt things are killing production for everyone; when are the good ones coming back?

    Then there's not really much to complain about, right?
  • profdlp The Holy City Of Westlake, Ohio
    edited January 2007
    A week ago I had nine out of ten boxes on the good old 5-Ways. As of today I am down to one. I'd guess that we'll see other projects for at least a little while.

    For what it's worth, there has been a ruckus over in the Folding Community forums about their apparent low point yield relative to the time it takes to complete them. I wouldn't be surprised if when they return they have a higher point value. :)
  • edited January 2007
    Six of them are here. I am sure there will be more when these are complete :rolleyes2
  • Donut Maine New
    edited January 2007
    profdlp wrote:

    For what it's worth, there has been a ruckus over in the Folding Community forums about their apparent low point yield relative to the time it takes to complete them. I wouldn't be surprised if when they return they have a higher point value. :)

    Lurking around that forum yesterday, it seems there's also a ruckus about people dumping these WUs or just blocking the server so they don't get any.

    I want the points also, but we have to finish these UGLY ones before we get the "good" ones.

    3 out of my 5 Windows clients have them.
  • Thermalfish Melbourne, Australia
    edited January 2007
    I have nearly 50 melts folding atm.
  • edited January 2007
    Thermalfish wrote:
    I have nearly 50 melts folding atm.

    And don't tell me you have 50 CPUs all together ;D
  • Donut Maine New
    edited January 2007
    Thermalfish - Active Processors (within 7 days) = 69 :eek:
  • QCH Ancient Guru Chicago Area - USA Icrontian
    edited January 2007
    I'm in this for the long haul. As long as everyone is getting these, I'm OK with the low points. If there is a concerted effort by others to delete or block the servers issuing these... Grrrrrr :grr:
  • the_technocrat IC-MotY1 Indy Icrontian
    edited January 2007
    I can't imagine how many melts I have right now, but everyone else is getting them too, so in the grand scheme of things, not a big deal... I'm still where I usually am in terms of overall ppd in the 'Overall Users' rankings...
  • edited January 2007
    Well, Stanford's overly complicated bonus point system is causing all this discussion. What I spend is electricity and CPU time, with the intention of helping a good cause, that's all. I am not a researcher at Stanford who can appreciate the importance of the WU I am running; all of them are equal to me. The only way to get more points should be owning a faster computer. If the point counting turns into a lottery system, there will be cheaters for sure. :shakehead
  • urd
    edited January 2007
    Hmm... well, I don't know a lot about this, but all I know is WUs are something I need to process. For me, every 1% takes around 42 minutes on my overclocked 4000+ X2... I hope that's normal :p
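
    For reference, a quick back-of-the-envelope check of that figure (a rough sketch using only the 42-minute-per-1% value reported above; actual times vary by WU and machine):

        # Rough completion-time estimate from the ~42 minutes per 1% reported above.
        minutes_per_percent = 42
        total_minutes = minutes_per_percent * 100   # a WU is 100 one-percent frames
        hours = total_minutes / 60                  # 70.0 hours
        days = hours / 24                           # ~2.9 days
        print(f"~{hours:.0f} hours (~{days:.1f} days) per work unit")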
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited January 2007
    Urd, it all depends upon the specific work unit and the settings in your Folding@Home client's configuration file. If you are unsure, please open a separate thread with your concern and you'll get help.
  • Oriane Turn around.
    edited January 2007
    How do I say this simply? Below is the FAH project description of what is going on, but it may still be over a lot of our heads.

    These projects are mostly analytical, but they are vital in helping to validate the models they wish to use to simulate much larger and more complicated proteins. This may be extremely important in being able to reduce the workload they must issue to accomplish a task. They are using double precision, so this may be an advantage to those using a native 64-bit system and OS.

    I guess, bottom-line, projects 2124 and 2125 (for example) are about:
    Note that the time we save by replacing the explicit water molecules with an implicit model will eventually help us to study larger and larger proteins -- big proteins like HIV reverse transcriptase and viruses.
    This project, together with 2102/2103, will help us to answer the questions: how much solvent detail do we need, and can we enjoy a simulation speed-up using implicit water while still obtaining physically meaningful results? If the answer to the latter is yes (and surely it is to some degree), then 2106 versus 2102/2103 will help us to answer the question of how much detail we can dispense with: do we need sophisticated implicit solvent models, or will simpler ones suffice?

    If you are a math person, maybe you can understand. It is sort of like reducing variables by either dropping them out of the equation or replacing them with constants (a loose sketch of this idea follows at the end of this post).

    For those of you who are not, these things could well help in coming to solutions faster... and hopefully treatments and cures.

    Personally, all I can say to you all is-

    Thank you very much for bearing with it. Sometimes the dirtiest job is very important.

    Ref. Linky
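
    As a loose illustration of the variable-reduction idea above (the notation is illustrative, not FAH's actual formulation): explicit-water runs must simulate every solvent molecule, while an implicit model folds all of them into a single solvation term that depends only on the protein coordinates.

        % Illustrative sketch: r = protein coordinates, s_1 ... s_N = explicit waters
        E_{\text{explicit}} = E_{\text{protein}}(\mathbf{r}) + E_{\text{water}}(\mathbf{r}, \mathbf{s}_1, \dots, \mathbf{s}_N)
        \quad\longrightarrow\quad
        E_{\text{implicit}} = E_{\text{protein}}(\mathbf{r}) + \Delta G_{\text{solv}}(\mathbf{r})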
  • csimon Acadiana Icrontian
    edited January 2007
    I set up F@H on an old 2.0GHz Xeon workstation lying around the office this afternoon, and the first thing it downloaded was a double Gromacs. I should have taken note of the particulars, but man, that thing was flying... in no time at all it had crunched through 12 frames. It will be long gone by the time I get back to work tomorrow, but I'll try to note which WU it is. I love DGromacs. :csimon:
  • Thermalfish Melbourne, Australia
    edited February 2007
    mirage wrote:
    And don't tell me you have 50 CPUs all together ;D

    Yeah I do. 60 CPUs in two rooms. I have to manually turn on folding every morning, so I get to see what's being folded.

    I've been getting a lot of these supervillin-03s. But the point value seems to vary?
  • edited February 2007
    Thermalfish wrote:
    Yeah I do. 60 CPUs in two rooms. I have to manually turn on folding every morning, so I get to see what's being folded.

    I've been getting a lot of these supervillin-03s. But the point value seems to vary?

    I will be working on adding a dual-core Opteron this weekend. There is no hope I can catch your production, but this should increase mine by a little more than 800 PPD. I will say adios to you in about a week :)

    The 3038 supervillin jobs are crashing here. :eek: I submitted two incomplete jobs today from two different Windows machines after the simulations encountered NaN errors. There cannot be any system stability problem; those two computers have been up continuously for months without any errors. Two crashes with the same job on the same day must have been caused by the work units.
  • Oriane Turn around.
    edited February 2007
    mirage wrote:
    .... The 3038 supervillin jobs are crashing here. I submitted two incomplete jobs today from two different Windows machines after the simulations encountered NaN errors. There cannot be any system stability problem; those two computers have been up continuously for months without any errors. Two crashes with the same job on the same day must have been caused by the work units.

    I'm also folding p3038_supervillin-03s but without a problem. I’ve finished one and have another going now.

    A while back I had a different WU similarly abort, but the FAH server somehow caught the error, automatically replaced the core, and restarted the instance. Everything worked out after that. Check your log. If you aren't sure about what is going on, post it.

    I see you are aggressively overclocking machines; try backing off if the problem persists. I would really try that first. Overclocking and overheating (which can be caused by just being dirty, too) really make for a lot of aborted WUs.

    If everything checks out and they continue to abort, you still might have a corrupted core. Shut down your FAH instance and delete the core (mine uses FahCore_78.exe, but check your FAH log to be sure). The FAH client will download a new one from the server when you restart the client and it sees that the core is missing. You should not hurt anything by doing this (there's a rough sketch of this step right after this post).

    Again, if you get stuck, post your log.
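
    A minimal sketch of the delete-the-core step described above, assuming the client is fully shut down first (the install path below is hypothetical; the core filename comes from the posts and logs in this thread):

        # Sketch: remove a possibly corrupted core so the client re-downloads it.
        # Run only while the FAH client is stopped. The path is hypothetical.
        import os

        client_dir = r"C:\FAH"                              # wherever your client lives
        core = os.path.join(client_dir, "FahCore_78.exe")   # check your FAH log for the name

        if os.path.exists(core):
            os.remove(core)   # the client fetches a fresh copy on the next start
            print("Core deleted; restart the client.")
        else:
            print("No core file found; check the FAH log for the core name.")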
  • edited February 2007
    Oriane wrote:
    I'm also folding p3038_supervillin-03s but without a problem. I’ve finished one and have another going now.

    A while back I had a different WU similarly abort, but the FAH server somehow caught the error, automatically replaced the core, and restarted the instance. Everything worked out after that. Check your log. If you aren't sure about what is going on, post it.

    I see you are aggressively overclocking machines; try backing off if the problem persists. I would really try that first. Overclocking and overheating (which can be caused by just being dirty, too) really make for a lot of aborted WUs.

    If everything checks out and they continue to abort, you still might have a corrupted core. Shut down your FAH instance and delete the core (mine uses FahCore_78.exe, but check your FAH log to be sure). The FAH client will download a new one from the server when you restart the client and it sees that the core is missing. You should not hurt anything by doing this.

    Again, if you get stuck, post your log.

    The work units from the same project (3038) crashed on two different computers on the same day. These two computers have completed all of their work units without a single crash over the last several months. I checked the temperatures; they are normal too. It could still be something related to voltage fluctuations, cosmic radiation, etc., but my computers are as stable as they can be without expensive parts like UPS systems and ECC memory. I do not have any 3038 WUs right now. I will be monitoring a little closer and post here if I catch something. Thanks for your help :)
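
    For anyone monitoring the same way, a rough sketch of a log scan that tallies early-ended units per project (the match strings come from the log excerpt later in this thread; the filename is an assumption, so point it at your own client's log):

        # Sketch: count EARLY_UNIT_END shutdowns per project in a FAH client log.
        import re
        from collections import Counter

        failures = Counter()
        current_project = None

        with open("FAHlog.txt") as log:                  # hypothetical filename
            for line in log:
                m = re.search(r"Project: (\d+)", line)   # e.g. "Project: 3038 (Run 7, Clone 123, Gen 9)"
                if m:
                    current_project = m.group(1)
                if "EARLY_UNIT_END" in line and current_project:
                    failures[current_project] += 1

        for project, count in failures.most_common():
            print(f"Project {project}: {count} early-ended unit(s)")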
  • Thermalfish Melbourne, Australia
    edited February 2007
    Super-villins, super-villins EVERYWHERE.
  • Kentigern Milton Keynes UK
    edited February 2007
    I got them too - super-villins - what are UNK points?
    EMIII usually gives a number. :)
  • Donut Maine New
    edited February 2007
    UNK = Unknown; I guess EMIII doesn't have the point values for these yet.
  • Kentigern Milton Keynes UK
    edited February 2007
    Thanks Donut :)
  • profdlp The Holy City Of Westlake, Ohio
    edited February 2007
    The Supervillins I've seen are 186 points. EMIII took about a week after they started appearing on my computers before it properly identified the point value. :)
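
    For reference, combining that 186-point figure with the ~23-24 minute frame times visible in the log below gives a rough per-client estimate (both inputs are machine-specific, so this is a sketch, not EMIII's calculation):

        # Sketch: rough PPD for a 186-point WU at ~23.5 minutes per frame.
        points_per_wu = 186
        minutes_per_frame = 23.5   # one frame = 50,000 of 5,000,000 steps
        frames_per_wu = 100

        days_per_wu = minutes_per_frame * frames_per_wu / (60 * 24)   # ~1.63 days
        print(f"~{points_per_wu / days_per_wu:.0f} PPD per client")   # roughly 114 PPD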
  • edited February 2007
    mirage wrote:
    The work units from the same project (3038) crashed on two different computers on the same day. These two computers have completed all of their work units without a single crash over the last several months. I checked the temperatures; they are normal too. It could still be something related to voltage fluctuations, cosmic radiation, etc., but my computers are as stable as they can be without expensive parts like UPS systems and ECC memory. I do not have any 3038 WUs right now. I will be monitoring a little closer and post here if I catch something. Thanks for your help :)

    I can now confirm that P3038s are crashing everywhere. Another P3038 crashed today on a third computer, which had completed 321 WUs without a single crash. The problem lies with the WUs. Here is the log.

    [13:44:18] Completed 5000000 out of 5000000 steps (100)
    [13:44:18] Writing final coordinates.
    [13:44:18] Past main M.D. loop
    [13:45:18]
    [13:45:18] Finished Work Unit:
    [13:45:18] - Reading up to 232536 from "work/wudata_04.arc": Read 232536
    [13:45:18] - Reading up to 452628 from "work/wudata_04.xtc": Read 452628
    [13:45:18] goefile size: 0
    [13:45:18] logfile size: 249306
    [13:45:18] Leaving Run
    [13:45:23] - Writing 1132194 bytes of core data to disk...
    [13:45:23] ... Done.
    [13:45:23] - Shutting down core
    [13:45:23]
    [13:45:23] Folding@home Core Shutdown: FINISHED_UNIT
    [13:45:25] CoreStatus = 64 (100)
    [13:45:25] Sending work to server


    [13:45:25] + Attempting to send results
    [13:45:30] + Results successfully sent
    [13:45:30] Thank you for your contribution to Folding@Home.
    [13:45:30] + Number of Units Completed: 321

    [13:45:34] - Preparing to get new work unit...
    [13:45:34] + Attempting to get work packet
    [13:45:34] - Connecting to assignment server
    [13:45:35] - Successful: assigned to (171.65.103.160).
    [13:45:35] + News From Folding@Home: Welcome to Folding@Home
    [13:45:35] Loaded queue successfully.
    [13:45:37] + Closed connections
    [13:45:37]
    [13:45:37] + Processing work unit
    [13:45:37] Core required: FahCore_78.exe
    [13:45:37] Core found.
    [13:45:37] Working on Unit 05 [February 12 13:45:37]
    [13:45:37] + Working ...
    [13:45:37]
    [13:45:37] *------------------------------*
    [13:45:37] Folding@Home Gromacs Core
    [13:45:37] Version 1.90 (March 8, 2006)
    [13:45:37]
    [13:45:37] Preparing to commence simulation
    [13:45:37] - Assembly optimizations manually forced on.
    [13:45:37] - Not checking prior termination.
    [13:45:38] - Expanded 292064 -> 1461493 (decompressed 500.4 percent)
    [13:45:38] - Starting from initial work packet
    [13:45:38]
    [13:45:38] Project: 3038 (Run 0, Clone 883, Gen 6)
    [13:45:38]
    [13:45:38] Assembly optimizations on if available.
    [13:45:38] Entering M.D.
    [13:45:44] Protein: p3038_supervillin-03
    [13:45:44]
    [13:45:44] Writing local files
    [13:45:44] Extra SSE boost OK.
    [13:45:44] Writing local files
    [13:45:44] Completed 0 out of 5000000 steps (0)
    [14:08:52] Writing local files
    [14:08:52] Completed 50000 out of 5000000 steps (1)
    [14:32:00] Writing local files
    [14:32:00] Completed 100000 out of 5000000 steps (2)
    [14:55:13] Writing local files
    [14:55:13] Completed 150000 out of 5000000 steps (3)
    [15:18:23] Writing local files
    [15:18:23] Completed 200000 out of 5000000 steps (4)
    [15:44:10] Writing local files
    [15:44:10] Completed 250000 out of 5000000 steps (5)
    [16:08:42] Writing local files
    [16:08:42] Completed 300000 out of 5000000 steps (6)
    [16:32:56] Writing local files
    [16:32:56] Completed 350000 out of 5000000 steps (7)
    [16:58:12] Writing local files
    [16:58:12] Completed 400000 out of 5000000 steps (8)
    [17:22:53] Writing local files
    [17:22:53] Completed 450000 out of 5000000 steps (9)
    [17:46:55] Writing local files
    [17:46:55] Completed 500000 out of 5000000 steps (10)
    [18:10:36] Writing local files
    [18:10:36] Completed 550000 out of 5000000 steps (11)
    [18:34:08] Writing local files
    [18:34:08] Completed 600000 out of 5000000 steps (12)
    [18:57:50] Writing local files
    [18:57:50] Completed 650000 out of 5000000 steps (13)
    [19:22:07] Writing local files
    [19:22:07] Completed 700000 out of 5000000 steps (14)
    [19:46:43] Writing local files
    [19:46:43] Completed 750000 out of 5000000 steps (15)
    [20:10:48] Writing local files
    [20:10:49] Completed 800000 out of 5000000 steps (16)
    [20:35:15] Writing local files
    [20:35:15] Completed 850000 out of 5000000 steps (17)
    [21:00:05] Writing local files
    [21:00:05] Completed 900000 out of 5000000 steps (18)
    [21:25:41] Writing local files
    [21:25:41] Completed 950000 out of 5000000 steps (19)
    [21:53:17] Writing local files
    [21:53:17] Completed 1000000 out of 5000000 steps (20)
    [22:17:13] Writing local files
    [22:17:13] Completed 1050000 out of 5000000 steps (21)
    [22:40:50] Writing local files
    [22:40:50] Completed 1100000 out of 5000000 steps (22)
    [23:04:07] Writing local files
    [23:04:07] Completed 1150000 out of 5000000 steps (23)
    [23:27:14] Writing local files
    [23:27:14] Completed 1200000 out of 5000000 steps (24)
    [23:50:24] Writing local files
    [23:50:24] Completed 1250000 out of 5000000 steps (25)
    [00:13:37] Writing local files
    [00:13:38] Completed 1300000 out of 5000000 steps (26)
    [00:36:47] Writing local files
    [00:36:47] Completed 1350000 out of 5000000 steps (27)
    [00:59:56] Writing local files
    [00:59:56] Completed 1400000 out of 5000000 steps (28)
    [01:23:09] Writing local files
    [01:23:09] Completed 1450000 out of 5000000 steps (29)
    [01:46:19] Writing local files
    [01:46:19] Completed 1500000 out of 5000000 steps (30)
    [02:09:29] Writing local files
    [02:09:29] Completed 1550000 out of 5000000 steps (31)
    [02:32:43] Writing local files
    [02:32:43] Completed 1600000 out of 5000000 steps (32)
    [02:55:51] Writing local files
    [02:55:51] Completed 1650000 out of 5000000 steps (33)
    [03:19:00] Writing local files
    [03:19:00] Completed 1700000 out of 5000000 steps (34)
    [03:42:08] Writing local files
    [03:42:08] Completed 1750000 out of 5000000 steps (35)
    [04:05:23] Writing local files
    [04:05:23] Completed 1800000 out of 5000000 steps (36)
    [04:28:32] Writing local files
    [04:28:32] Completed 1850000 out of 5000000 steps (37)
    [04:51:45] Writing local files
    [04:51:45] Completed 1900000 out of 5000000 steps (38)
    [05:15:01] Writing local files
    [05:15:01] Completed 1950000 out of 5000000 steps (39)
    [05:38:12] Writing local files
    [05:38:12] Completed 2000000 out of 5000000 steps (40)
    [06:01:22] Writing local files
    [06:01:22] Completed 2050000 out of 5000000 steps (41)
    [06:24:36] Writing local files
    [06:24:36] Completed 2100000 out of 5000000 steps (42)
    [06:42:31] Gromacs cannot continue further.
    [06:42:31] Going to send back what have done.
    [06:42:31] logfile size: 108578
    [06:42:31] - Writing 109114 bytes of core data to disk...
    [06:42:31] ... Done.
    [06:42:31]
    [06:42:31] Folding@home Core Shutdown: EARLY_UNIT_END
    [06:42:35] CoreStatus = 72 (114)
    [06:42:35] Sending work to server


    [06:42:35] + Attempting to send results
    [06:42:37] + Results successfully sent
    [06:42:37] Thank you for your contribution to Folding@Home.
    [06:42:41] - Preparing to get new work unit...
    [06:42:41] + Attempting to get work packet
    [06:42:41] - Connecting to assignment server
    [06:42:42] - Successful: assigned to (171.65.103.160).
    [06:42:42] + News From Folding@Home: Welcome to Folding@Home
    [06:42:42] Loaded queue successfully.
    [06:42:44] + Closed connections
    [06:42:49]
    [06:42:49] + Processing work unit
    [06:42:49] Core required: FahCore_78.exe
    [06:42:49] Core found.
    [06:42:49] Working on Unit 06 [February 13 06:42:49]
    [06:42:49] + Working ...
    [06:42:49]
    [06:42:49] *------------------------------*
    [06:42:49] Folding@Home Gromacs Core
    [06:42:49] Version 1.90 (March 8, 2006)
    [06:42:49]
    [06:42:49] Preparing to commence simulation
    [06:42:49] - Assembly optimizations manually forced on.
    [06:42:49] - Not checking prior termination.
    [06:42:50] - Expanded 292498 -> 1461493 (decompressed 499.6 percent)
    [06:42:50] - Starting from initial work packet
    [06:42:50]
    [06:42:50] Project: 3038 (Run 7, Clone 123, Gen 9)
    [06:42:50]
    [06:42:50] Assembly optimizations on if available.
    [06:42:50] Entering M.D.
    [06:42:56] Protein: p3038_supervillin-03
    [06:42:56]
    [06:42:56] Writing local files
    [06:42:56] Extra SSE boost OK.
    [06:42:56] Writing local files
    [06:42:56] Completed 0 out of 5000000 steps (0)
    [07:06:27] Writing local files
    [07:06:27] Completed 50000 out of 5000000 steps (1)
    [07:29:55] Writing local files
    [07:29:55] Completed 100000 out of 5000000 steps (2)
  • Kentigern Milton Keynes UK
    edited February 2007
    I had the same problem with 3038 - am now on 3040 and 3042 and they are finishing OK.
  • edited February 2007
    Kentigern wrote:
    I had the same problem with 3038 - am now on 3040 and 3042 and they are finishing OK.

    Thanks for the info. I have lost about half a dozen P3038 jobs so far. I wish they would fail at the beginning, but they run for quite some time before failing :rolleyes: Murphy is in business, I guess :)
  • Leonardo Wake up and smell the glaciers Eagle River, Alaska Icrontian
    edited February 2007
    Perhaps I am completely wrong, but with that many 3038 failures, I can't help but think the subject computer(s?) might have an unknown stability problem. I've processed many of them on four different computers with 100% success rate. This is not a comparison of tech knowledge between you and me, but merely my encouragement for you to check your computer very closely. PSU output voltage instability? Physical memory instability? Power surges? CPU overheating? Less than rock stable overclock? Any of those conditions can wreck a work unit in progress.
  • edited February 2007
    Leonardo wrote:
    Perhaps I am completely wrong, but with that many 3038 failures, I can't help but think the subject computer(s?) might have an unknown stability problem. I've processed many of them on four different computers with 100% success rate. This is not a comparison of tech knowledge between you and me, but merely my encouragement for you to check your computer very closely. PSU output voltage instability? Physical memory instability? Power surges? CPU overheating? Less than rock stable overclock? Any of those conditions can wreck a work unit in progress.

    The failures are specific to P3038, and they happen on entirely different computers (one of them is a Dell Precision workstation with ECC RAM and a Xeon processor, not overclocked, of course). There is no reason (at least for me) to suspect any stability issue. Thanks for the thought, though.
  • csimon Acadiana Icrontian
    edited February 2007
    I lost a lot of WUs over the past few weeks while I was doing a lot of overclocking experiments. I had attributed it all to that, but maybe it wasn't 100% to blame after all.
  • edited February 2007
    Here is the link. Apparently, it is a common problem.