Eating WU
drasnor
Starship Operator, Hawthorne, CA, Icrontian
I don't know if this constitutes a pattern, but my system has been failing WUs over the past few weeks, roughly 1 in 10. I started keeping track of which ones fail, and here's the rundown.
[00:47:41] Project: 526 (Run 7, Clone 77, Gen 1)
[14:13:54] Project: 917 (Run 7, Clone 55, Gen 22)
[17:56:47] Project: 914 (Run 0, Clone 36, Gen 22)
[10:25:31] Project: 683 (Run 39, Clone 90, Gen 18)
[16:35:22] Project: 922 (Run 8, Clone 20, Gen 33)
[02:32:36] Project: 922 (Run 2, Clone 11, Gen 4)
[02:17:55] Project: 921 (Run 21, Clone 11, Gen 20)
[05:18:20] Project: 921 (Run 45, Clone 22, Gen 23)
[06:21:20] Project: 920 (Run 2, Clone 26, Gen 25)
[18:11:58] Project: 922 (Run 43, Clone 9, Gen 24)
[10:41:36] Project: 922 (Run 15, Clone 14, Gen 35)
[00:09:06] Project: 922 (Run 23, Clone 2, Gen 17)
[17:54:08] Project: 917 (Run 43, Clone 15, Gen 18)
They're mostly 917, 920, 921, and 922 projects. The 526 and 683 happened when I was having thermal issues earlier, but that isn't the case anymore. I figured I'd go ahead and see what you folks think. I'm sure Stanford takes notice when they get failed WU back.
-drasnor
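For anyone who wants to check the "mostly 917, 920, 921, and 922" impression against the list above, here is a minimal Python sketch that tallies the failures per project. The regular expression and the `tally_by_project` helper are illustrative assumptions, not part of the Folding@home client; the log lines are copied from the post.

```python
import re
from collections import Counter

# Failed work units as listed in the post above.
LOG = """\
[00:47:41] Project: 526 (Run 7, Clone 77, Gen 1)
[14:13:54] Project: 917 (Run 7, Clone 55, Gen 22)
[17:56:47] Project: 914 (Run 0, Clone 36, Gen 22)
[10:25:31] Project: 683 (Run 39, Clone 90, Gen 18)
[16:35:22] Project: 922 (Run 8, Clone 20, Gen 33)
[02:32:36] Project: 922 (Run 2, Clone 11, Gen 4)
[02:17:55] Project: 921 (Run 21, Clone 11, Gen 20)
[05:18:20] Project: 921 (Run 45, Clone 22, Gen 23)
[06:21:20] Project: 920 (Run 2, Clone 26, Gen 25)
[18:11:58] Project: 922 (Run 43, Clone 9, Gen 24)
[10:41:36] Project: 922 (Run 15, Clone 14, Gen 35)
[00:09:06] Project: 922 (Run 23, Clone 2, Gen 17)
[17:54:08] Project: 917 (Run 43, Clone 15, Gen 18)
"""

# Hypothetical pattern matching the "Project: N (Run R, Clone C, Gen G)" layout.
PATTERN = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def tally_by_project(log_text):
    """Return a Counter mapping project number -> number of failed WUs."""
    return Counter(int(m.group(1)) for m in PATTERN.finditer(log_text))


counts = tally_by_project(LOG)
for project, n in counts.most_common():
    print(f"Project {project}: {n} failed WU")
```

A tally like this only shows which projects fail most often on this one machine; without knowing how many WUs of each project were assigned, it cannot say whether those projects are failing at a higher rate.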
Comments
Errors by WU would be nice, or anything more specific, plus whether you are OCing and what the case, CPU, room ambient, and PSU exhaust temps are. The PSU interior is the hottest place in the computer, and a PSU with a failing fan or a load issue can cause seemingly random failures, so one way to detect a load or PSU fan issue early is to look at PSU temps relative to the case. If the PSU runs relatively high and keeps climbing over time, keep in mind that PSUs get less efficient as they get hotter. Cheaper PSUs often have cheaper fans, which fail faster, and are less effective when hot than at normal temps. Capacitors and transformers bleed a lot of heat, transformers more than caps, but they can heat the caps as well and generate more EMI when hot than when cool.
I would look at heat, voltages, client version, and Core version with Gromacs, and at stacks that include your telecom or network devices. Stack those and you can get bad downloads, since junk of kinds that are hard to isolate can creep into the download process. Include the PSU in your relative heat mapping; if the box has shut down thermally, that can be the PSU thermalling. A PSU is stressed to its limit when it thermals. Look for unusual relations between PSU, case, and CPU temps, and address whichever is relatively highest first. You are looking for heat buildup as a symptom of failure here.
I can explain more and get more specific if asked.
John.
Motherboard: MSI K8T Master2-FAR
Processor(s): 2x AMD Opteron 248 @ 2.2GHz, 1MB L2 cache
Memory: 1GB (2x512MB) Corsair TwinX PC3200RE-LLPT DDR400, 2-3-2-6 @ 2.65V
Graphics: nVidia GeForce 3 Ti500
Chassis: Lian Li PC-7 w/ all Panaflo H1A fans.
NIC: Integrated Broadcom gigabit LAN w/ boot ROM.
Storage: 2x Western Digital Caviar 160GB/8MB drives in RAID0 using SATA->PATA bridge boards and VIA Southbridge SATA RAID, 1x Western Digital SE 250GB/8MB for backing up the array.
Sound: Creative SoundBlaster Audigy 2 Platinum
PSU: Antec TruePower EPS12v 550W, partially sleeved.
Optical Drives: Sony CRX300E 16x DVD, 48x24x48 CD-R/RW. Sony DW-U14A 4x2 DVD+/-R/RW, 24x16 CD-R/RW, 32x8 CD read/DVD read.
Other:
Internally-mounted ATI Remote Wonder
Internally-mounted Atech Flash Pro-9 USB2 multi-reader/writer.
3.5" floppy drive.
5.25" floppy drive w/ blue activity LED.
USRobotics 56k V.92 hardware PCI modem.
ALi-chipset USB2 adapter on PCI.
This machine passed memtest86 without any trouble, though I haven't tried it with prime95 yet. It is using a three-week-old install of Windows XP Professional with Service Pack 1 installed and everything up to date. The BIOS is the original 1.0, since the 1.1 BIOS makes the machine very unstable and prone to hard lockups. There aren't any IRQ conflicts, and there shouldn't be, since this is an ACPI machine and all my peripherals support IRQ sharing.
Ambient: 22 C
CPU0: 46 C
CPU1: 43 C
Chassis: 36 C
PC Alert III shows my volts like so, but I'm dubious about using built-in monitoring hardware.
+3.3V: 3.13V
+5V: 4.93V
+12V: 11.80V
Vcore: 1.14V
Vdimm is set to 2.65V, though I don't know what it's at.
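To put those readings in context, here is a small Python sketch that computes how far each rail sits from nominal. The ±5% tolerance is an assumption taken from the usual ATX figure for the positive rails, and the helper names are made up for illustration. Since the numbers come from onboard monitoring, which may not be accurate, treat the result as a reason to measure with a multimeter rather than as a diagnosis.

```python
# Nominal rail voltages and the readings reported by PC Alert III above.
NOMINAL = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0}
MEASURED = {"+3.3V": 3.13, "+5V": 4.93, "+12V": 11.80}

# Assumption: +/-5% tolerance, the commonly quoted ATX figure for these rails.
TOLERANCE_PCT = 5.0


def deviation_pct(measured, nominal):
    """Signed deviation from nominal, as a percentage of nominal."""
    return (measured - nominal) / nominal * 100.0


def check_rails(measured, nominal, tolerance_pct):
    """Return {rail: (deviation_pct, within_tolerance)} for each rail."""
    report = {}
    for rail, volts in measured.items():
        dev = deviation_pct(volts, nominal[rail])
        report[rail] = (dev, abs(dev) <= tolerance_pct)
    return report


report = check_rails(MEASURED, NOMINAL, TOLERANCE_PCT)
for rail, (dev, ok) in report.items():
    status = "within" if ok else "OUTSIDE"
    print(f"{rail}: {dev:+.2f}% ({status} +/-{TOLERANCE_PCT:.0f}%)")
```

With these inputs the +5V and +12V readings are each less than 2% low, while the +3.3V reading comes out slightly more than 5% low, so that is the rail worth double-checking first.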
Current System Uptime: 2 days, 15 hrs, 22 mins, 30 secs as of this post.
This error is for Project: 922 (Run 2, Clone 11, Gen 4), which is p922_vpf913.
This error is for Project: 914 (Run 0, Clone 36, Gen 22), which is p914_vpf909.
This error is for Project: 683 (Run 39, Clone 90, Gen 18), which is p683_TZ2_NAT_EXP.
This error is for Project: 922 (Run 8, Clone 20, Gen 33), which is p922_vpf913.
This error is for Project: 922 (Run 15, Clone 14, Gen 35), which is p922_vpf913.
This error is for Project: 922 (Run 23, Clone 2, Gen 17), which is p922_vpf913.
This error is for Project: 917 (Run 43, Clone 15, Gen 18), which is p917_v2180pf909.
This error is for Project: 917 (Run 24, Clone 65, Gen 16), which is p917_v2180pf909.
This error is for Project: 921 (Run 21, Clone 11, Gen 20), which is p921_vpf912.
This error is for Project: 921 (Run 45, Clone 22, Gen 23), which is p921_vpf912.
This error is for Project: 920 (Run 2, Clone 26, Gen 25), which is p920_vpf910.
This error is for Project: 922 (Run 43, Clone 9, Gen 24), which is p922_vpf913.
-drasnor
An interesting discussion on Opterons and similar issues here:
http://www.abxzone.com/forums/showthread/t-59454.html
I was wondering about that too...
-drasnor
http://forum.folding-community.org/viewtopic.php?t=6970
-drasnor
What exactly is your system cooling configuration, air in and air out? Those heatsinks seem to be doing a fine job extracting heat from the CPUs, but your case temps should be a lot lower. At a 36C average case temp there are probably some mobo components running very hot, the voltage regulators to start with.
I think we need to work on getting more cool air in and getting the hot air out.
Regards
John.
All the panaflos in this case are high output versions, which is less than the ultra high output and greater than the medium output versions.
The VRMs near the ATX I/O shield have large finned aluminum HS on them, and there's a crappy yellow stock HS on my northbridge. There are some capacitors close to the northbridge though, so I'd have to mod a Vantec copper Iceberq to get it to fit. Anything taller interferes with the AGP slot.
I'm going to move the case fans over to the full on connector next time I crack the case (later tonight) and see how much a difference it makes. It's folding Tinkers right now, since the assignment server wasn't giving me anything else.
-drasnor
BTW, I keep forgetting to thank everyone for all the help they've given. I appreciate you popping over here pythagoras; I know most people wouldn't sign up for a different forum to solve someone else's problem.
-drasnor
Ok, what I do with Panaflos is this:
Fronts get mediums and rears get highs. BUT, I have two of each in addition to the PSU fan.
With what you have, you are probably pushing more air in than is being exhausted, so positive pressure is likely, and a 14 C rise in chassis temp over room ambient is WAY too high. I get 6 C higher inside the chassis compared to room ambient, even when the room is at 29-30 C, which DOES happen in Florida.
So, given what you have, one of two things would help:
Either use mediums for front and an ultra-high for back, or one medium and one high and an ultra-high exhaust. OR, do a top blowhole with a high in it and high out the back, and use two mediums in front.
Hot air expands, so you want more exhaust CFM than intake CFM. Let the PSU cool itself, and use additional fans to cool the rest of the case. If you get positive pressure as the air expands, you get trapped hot air, and that is what it looks like is happening to me.
ADD: At a guess, your PSU is on or close to the ragged edge of what it can handle temp-wise after two days running, and with a positive pressure situation you are also dragging hot air into the PSU. This is one reason the chassis temp needs to be lower as well; the PSU needs cooler air coming in than going out to cool itself well.
John D. We are getting a lot of Johns here, so I will use my last initial from now on.
I tend to agree with John D here; we need to find a way of getting the heat that your Swiftechs are extracting out of the case. If we can do that first, we can look at other areas later.
Regards
John W
Here's the new temps:
Ambient: 27C
Chassis: 36C
CPU1: 46C
CPU2: 44C
-drasnor
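The two sets of readings in this thread can be compared on the "rise over room ambient" basis John D uses above. This is a minimal Python sketch; the 6 C reference is John D's own figure for his boxes, not a specification, and the dictionary layout and function name are illustrative assumptions.

```python
# Temperatures in degrees C, as posted in this thread.
READINGS = {
    "first post": {"ambient": 22, "chassis": 36, "cpus": [46, 43]},
    "new temps": {"ambient": 27, "chassis": 36, "cpus": [46, 44]},
}

# Reference rise John D reports on his own machines (not a spec value).
REFERENCE_RISE_C = 6


def rise_over_ambient(reading):
    """Return (chassis_rise, [cpu_rises]) relative to room ambient."""
    ambient = reading["ambient"]
    chassis_rise = reading["chassis"] - ambient
    cpu_rises = [cpu - ambient for cpu in reading["cpus"]]
    return chassis_rise, cpu_rises


rises = {label: rise_over_ambient(r) for label, r in READINGS.items()}
for label, (chassis_rise, cpu_rises) in rises.items():
    excess = chassis_rise - REFERENCE_RISE_C
    print(f"{label}: chassis +{chassis_rise} C over ambient "
          f"({excess} C above the {REFERENCE_RISE_C} C reference), "
          f"CPUs +{cpu_rises} C over ambient")
```

On these numbers the chassis rise goes from 14 C to 9 C over ambient. The absolute chassis temperature is unchanged at 36 C, so the improvement only shows up once the warmer room is taken into account.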
Other than ducting them to one very high-capacity rear fan, I do not know a hyper-good way to get air away from the CPUs without heating the case. (Keep the HS fans in place; you want a PUSH-PULL through the duct, with the pull greater than or equal to the push, to keep from making a pressure backlash onto the heatsink fans, which should be pushing air into the duct.) BUT, I can tell you that for Opterons, for fast Bartons, and for P4s in the high 2GHz and up range, those CPU temps are NORMAL. The problem is that a duct would have to be custom built. Aluminum sheeting about 3/32 of an inch thick would probably be best. For the rear fan, try a big Delta, probably a 120 mm high-volume fan mounted on the outside of the case. Expect NOISE.
Right now, the P4 case is at 29 C here. The CPU, a P4 OC'd to about 3.2 GHz, has been in the 53-56 C range for three days running, and PWM or OTES on that box is at 41 C and has been floating from 39 C to 43 C. The Barton case is at 30 C with the CPU at 45 C right now. The Barton case floats at 29-32 C, and the CPU on the Barton box floats at 44-47 C. The Barton CPU is a 2500+ running slightly OC'd, almost at what a 2700+ would run at if there were such a thing; it detects as a 2600+. Oh, room ambient is about 25 C outside the case. Both boxes are hyper stable, and between them they generate 160-250 folding points a DAY depending on WU effectiveness.
One more thing: the P4s die at 74+ C for the speed range I am talking about now. The Bartons will go radically unstable at 58-67 C. I am going to P4s because they are in reality more heat tolerant than the Bartons are. The Opterons, with their metal heat spreader caps, should be more stable at high heat than the Bartons.
John D.
Yeah, these are acceptable temps and probably the best I'm going to get at a semi-low noise level for a dual processor air-cooled machine.
Thermaltake makes an 80mm flexible duct (think clothes dryer exhaust duct) for overclockers that's supposed to route cold air straight from a case intake to a CPU fan, but I could switch the flow of my CPU fans and install those as exhausts. I seriously doubt that would help more than maybe 1C, and it could hurt if it screws up airflow around the ducts or if there's back pressure at all. As it stands now, the entire back of my case is fan ports, so even moving to a bigger case isn't going to do much.
Even so, these are decent temperatures, especially given this is a dual-processor machine. I've reinstalled Windows a couple times since I got this machine, and I've had this problem on every install. If this is a software issue, it's out of my league.
-drasnor