Three different boards, RAM is fine - slots A1 and A2 unusable

Harudath · August 2011

Hi guys,

I'm completely stumped on this one. I turned my PC off and cleaned it with some compressed air, turned it on and it wouldn't POST. After a long time of fiddling with stuff I found that if any RAM is present in the first two slots then nothing happens.

B1 and B2 were fine when in use, have tested this with three different kits of RAM. So I bought a new motherboard; same problem, straight outta the box. Took that back, got a different one. And again. I can only assume when I did my RMAs they didn't actually test them because the odds of getting the same problem over three boards from two manufacturers is borderline absurd, no?

Original board was an Asrock P67 Extreme 4 1155, second was an ASUS P8Z68-V PRO, and third (current) is an Asrock Z68 Extreme 4 1155. I checked that my memory was compatible with the boards, and it is - this only leaves the CPU unless I'm missing something (which I am). However, it's working fine; I haven't noticed anything out of the ordinary and it works perfectly...

Suffice to say, HELP.

Specs are below, thanks Icrontic

PirateNinja · August 2011

If you are going for dual channel you need to use A1 and B1 OR A2 and B2. That shouldn't matter though, just a note. I'm not sure if you are trying to do that or not, your sig implies you probably use 3x4gb anyways.

So just to be clear:
What exact set of memory are you using?
Is it the Asrock Z68 Extreme4 that you have right now?

PirateNinja · August 2011

If I am reading about HyperX Grey correctly they are only sold in Dual Channel kits.
I'm not knowledgeable enough, but one theory is that the memory's profile only allows for it to work in sets of 2 or 4, but not a set of 3.

Have you tried just taking 2 sticks and using a1/b1 or a2/b2?

Harudath · August 2011

I've got two kits of two sticks, one set is 2x4GB, one set is 2x2GB, I've tried them all separately - if anything is present in the first two memory slots, regardless of setup, then the PC doesn't POST. Single channel with one module, dual channel, combinations of anything - if there's anything in the two slots closest to the CPU then nothing happens.
Yeah, current board is the Asrock Z68.

Memory is KHX1600C9D3X2K2/8GX and /4GX

Harudath · August 2011

Also tried some other triple channel memory I had lying around (just to see if it was those particular sticks, was only in single channel, tried each of the slots, same problem so it can't be the memory)

PirateNinja · August 2011

Sorry -- just to be sure.

Have you tried putting:

1x4gb stick in slot1
AND
1x4gb stick in slot3
WITH
slots2/4 empty

IF that does not work:

1x4gb stick in slot2
AND
1x4gb stick in slot4
WITH
slots1/3 empty

See picture for reference.

I would try these configurations and each time reset the bios using the clr cmos button on the back of the board.

It isn't possible that you have three motherboards with bad A1/B1 dimms, and that the first went bad after a cleaning. The cpu isn't playing a role in this. It's something to do with the memory and/or how it is being installed.

Harudath · August 2011

Yes, I've tried 1/3 and 2/4, with both sets of modules. My original P67 was working fine with 4GB in dual channel - it had been doing so since February. I turned it off to clean it (compressed, moisture-free air only), turned it back on again and the problems started.

PirateNinja · August 2011

Do you get a beep code when it doesn't post?

Edit: OR LED debug code?

Harudath · August 2011

45, which is apparently some unexplained (by the manufacturer) memory error, but having tried three different kits which work fine in other slots, on three different boards, from two different manufacturers... Only common factors here are the cpu, drives and graphics cards...

PirateNinja · August 2011

Did you clear the CMOS?

According to the manual:
0x3F-0x4E OEM post memory initialization codes

It sounds like it's trying to run the ram in a state or voltage it can't handle.

BIOS update or lower ram settings?

Harudath · August 2011

Um, where did you get 0x3F-0x4E from? The error I got was 45. I'll see what the voltage is set to, the modules want 1.65V so I suppose I'll change it from auto..

PirateNinja · August 2011

This is a range of values:
0x3F-0x4E
which include 0x45
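A quick way to see the range check, as a minimal sketch. It assumes the two characters on the debug LED are a hexadecimal byte, and the function names are made up for illustration; the range itself is the one quoted from the manual.

```python
# Range quoted from the board manual in this thread:
# 0x3F-0x4E "OEM post memory initialization codes" (both ends included).
OEM_POST_MEMORY_INIT = range(0x3F, 0x4E + 1)


def parse_debug_code(display: str) -> int:
    """Read the two characters shown on the debug LED as a hex byte (assumption)."""
    return int(display, 16)


def is_post_memory_init_code(display: str) -> bool:
    """True if the displayed code falls inside the manual's 0x3F-0x4E block."""
    return parse_debug_code(display) in OEM_POST_MEMORY_INIT


if __name__ == "__main__":
    for shown in ("45", "3F", "4E", "4F"):
        print(shown, is_post_memory_init_code(shown))
```

So a displayed "45" lands inside the block, even though the manual never lists it on its own line.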

Well you can really quickly rule out other common factors like hard drives and vid cards by removing them or swapping in replacements.

If lowering the voltage and/or timings to calm down the ram in dual channel mode doesn't work check out this thread I found:
http://www.overclock.net/intel-motherboards/1007125-asrock-p67-extreme6-code-45-a-2.html

RootWyrm · August 2011

Harudath wrote:

Um, where did you get 0x3F-0x4E from? The error I got was 45. I'll see what the voltage is set to, the modules want 1.65V so I suppose I'll change it from auto..

Yeah, I'm wondering the same; 3F is "Unable to Recover" and 4E is in the reserved block - as is 45. As usual, Newegg is full of it, as KHX1600C9D3X2K2 is a 1.5V +-0.075V module, period. It is not a 1.65V module.

Your read is absolutely right Harudath; it's the CPU. I can't say I've seen it before, but that is the only possible answer. There is absolutely NO other possibility after your testing. The on-die memory controller has lost the A channel.
Chances are pretty good it was defective from day one, and the cleaning was just incidental to when it let go. Time to make the rare Intel warranty claim.

PirateNinja · August 2011

RootWyrm wrote:

Yeah, I'm wondering the same; 3F is "Unable to Recover" and 4E is in the reserved block - as is 45. As usual, Newegg is full of it, as KHX1600C9D3X2K2 is a 1.5V +-0.075V module, period. It is not a 1.65V module.

Your read is absolutely right Harudath; it's the CPU. I can't say I've seen it before, but that is the only possible answer. There is absolutely NO other possibility after your testing. The on-die memory controller has lost the A channel.
Chances are pretty good it was defective from day one, and the cleaning was just incidental to when it let go. Time to make the rare Intel warranty claim.

Speaking in absolutes when you can't logically deduce them is fun.

Here is his manual:
ftp://174.142.97.10/manual/P67%20Extreme4.pdf
This is what it says:
0x3F-0x4E OEM post memory initialization codes
0x45 is in that group

It's unlikely it would make it that far in POST with there being a serious issue with the CPU.

I honestly think it could be the backing of the heatsink he is using making metal contact with the other side of the mobo's PCB and interfering with the first two DIMMs (A1/B1). It could be he just needs plastic washers, or his memory isn't getting the right voltage to run dual channel.

Take my thoughts or leave them. If I am completely missing some logic that makes it 100% the CPU, fine. You can always buy a new CPU and let us know what happens. I'm super curious.

Harudath · August 2011

I was thinking about getting that cheap-ass £50 socket 1155 CPU just to test it, seems less hassle than sending my only 2nd gen CPU off to the glue factory..
Kingston website says pretty much all DDR3 HyperX RAM is meant to run at 1.65v, including the two kits I have:

http://www.valueram.com/datasheets/KHX1600C9D3X2K2_4GX.pdf

http://www.valueram.com/datasheets/KHX1600C9D3X2K2_8GX.pdf

I'll have a closer look at the board and make sure nothing is making contact where it shouldn't be, though I was careful on all of the installs to ensure that there were no screws without plastic washers and nothing obvious was contacting the board where it shouldn't have been.

EDIT: That's the P67 manual, I'm currently using the Z68 though I doubt the error codes will have changed dramatically...

RootWyrm · August 2011

Harudath wrote:

I was thinking about getting that cheap-ass £50 socket 1155 CPU just to test it, seems less hassle than sending my only 2nd gen CPU off to the glue factory..
Kingston website says pretty much all DDR3 HyperX RAM is meant to run at 1.65v, including the two kits I have:

Bad wording on my part. Read spec sheet closer.
Default HyperX settings are JEDEC and have always been JEDEC. Unless XMP is specifically enabled or it is manually set, SPD programming is in fact, 1333MHz @ 9-9-9 @ 1.5V. This is true of all HyperX; SPD settings are conservative as hell and comply with JEDEC always. XMP profile and actual capability are different. Kingston does not offer any DIMM which does not default to a JEDEC compliant profile and never has. (Which is why I like the HyperX stuff.)
Unless set to XMP or otherwise, the HyperX DIMMs will operate normally at 1333MHz, 1.5V with no faults. These are the default values, if the board's BIOS is JEDEC compliant. (Sadly, most these days seem to completely ignore SPD and XMP.) However, they are capable of running at 1600MHz, 1.65V, and these are the settings programmed in the XMP profile. They are guaranteed to run at any voltage between 1.5V and 1.65V and any clock speed up to 1600MHz. If this sounds familiar, it should - the Kingston module method is straight up binning.
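To make the SPD-versus-XMP distinction concrete, here is a rough sketch using only the numbers given above (1333MHz at 1.5V by default, 1600MHz at 1.65V under XMP). The class and function names are hypothetical, and this is not how any BIOS actually reads SPD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryProfile:
    name: str
    speed_mhz: int
    voltage: float


# Values as stated in this post; names are illustrative only.
SPD_DEFAULT = MemoryProfile("JEDEC SPD default", 1333, 1.5)
XMP = MemoryProfile("XMP", 1600, 1.65)


def effective_profile(xmp_enabled: bool) -> MemoryProfile:
    """A compliant board uses the SPD default unless XMP is explicitly enabled."""
    return XMP if xmp_enabled else SPD_DEFAULT


def within_rated_envelope(speed_mhz: int, voltage: float) -> bool:
    """Check a manual setting against the range the post describes as guaranteed."""
    return (speed_mhz <= XMP.speed_mhz
            and SPD_DEFAULT.voltage <= voltage <= XMP.voltage)


if __name__ == "__main__":
    print(effective_profile(xmp_enabled=False))
    print(within_rated_envelope(1600, 1.65))
    print(within_rated_envelope(1866, 1.65))
```

With the board left on auto, the expected result under this model is 1333MHz at 1.5V, not the 1.65V printed on the kit.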

I'll have a closer look at the board and make sure nothing is making contact where it shouldn't be, though I was careful on all of the installs to ensure that there were no screws without plastic washers and nothing obvious was contacting the board where it shouldn't have been.

This is already ruled out; if it was a contact short it wouldn't have followed boards as it did. Modern boards don't require plastic washers. In fact, you're not supposed to use washers with them. The metal rings around the screw holes in a modern PCB are actually often used as an alternate/supplemental ground path. ATX 1.3 and later specification actually defines screw holes as a keepout area as well, so any compliant board (read: all of them) has sufficient clearance for screws and posts guaranteed. (This does not guarantee contact shorts don't happen from posts being present which they don't use, obviously. They have that option.)

PirateNinja wrote:

It's unlikely it would make it that far in POST with there being a serious issue with the CPU.

You'd be wrong there; modern QPI is remarkably resilient. The issue would seem to be a single channel of the IMC having failed, which wouldn't actually cause any issues unless that channel's used. Otherwise, there's likely no defect.
The other possibility, which I agree with you on almost, is shorting. He's not shorting DIMMs, but possibly shorting traces. However, if he was shorting traces or contacts on the board, there'd be issues on the B channel even with A channel removed. Not only that, but it wouldn't have followed three boards absolutely identically.
Because they don't actually provide detailed technical information, the 45 code is meaningless. It's in memory, okay, where and what? Is that SPD read or is that walking test? Voltage check? Who the hell knows! We don't even have a way to validate the 45 code. (BTW, it's Asrock specific - AMIBIOS8 doesn't specify for 45.) The one thing we can validate is that 45 is not during cache walk; if it was cache walk causing a fault, that's A) internal to CPU B) would cause problems on the B channel. BIG problems. Lockup, crashes, random power off, etcetera.

By process of elimination, the only possible remaining issue is the CPU's IMC A channel. Three boards, multiple DIMM, symptom only follows CPU. This rules out ALL other issues, period.
It cannot be socket damage - followed boards. It cannot be trace short - followed boards, didn't follow channels. It cannot be DIMM contact short - followed boards, didn't follow channels. It cannot be DIMM defect or incompatibility - didn't follow channels, followed boards. It cannot be board defect or failure - three boards, all behaved identically. It cannot be CPU cache - didn't follow channel.
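The elimination can be written down as a small table of predictions checked against what was observed. This is only a sketch: the three yes/no observations and the prediction for each suspect are a simplified paraphrase of the argument in this post, and every name in the code is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    follows_cpu_across_boards: bool   # same fault after each motherboard swap
    persists_across_dimm_kits: bool   # same fault with every memory kit tried
    channel_b_unaffected: bool        # B slots keep working throughout


# What Harudath reported: three boards, several kits, B1/B2 always fine.
OBSERVED = Pattern(True, True, True)

# What each suspect would be expected to produce (simplified assumption).
HYPOTHESES = {
    "board defect, socket damage or trace short": Pattern(False, True, True),
    "DIMM defect or incompatibility": Pattern(True, False, False),
    "CPU cache fault": Pattern(True, True, False),
    "CPU IMC channel A fault": Pattern(True, True, True),
}


def surviving(observed: Pattern, hypotheses: dict) -> list:
    """Keep only the suspects whose predicted pattern matches the observations."""
    return [name for name, predicted in hypotheses.items() if predicted == observed]


if __name__ == "__main__":
    print(surviving(OBSERVED, HYPOTHESES))
```

Under these assumed predictions only the IMC entry survives. The conclusion is only as good as the table, so a suspect that is missing from it (or given the wrong prediction) would not be caught.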

PirateNinja · August 2011

Harudath:
When you very first cleaned the system out, did you remove your HSF? I know you did when you swapped mobos, but I'm curious about the first time. Also what HSF do you use?

RootWyrm wrote:

This is already ruled out; if it was a contact short it wouldn't have followed boards as it did. Modern boards don't require plastic washers. In fact, you're not supposed to use washers with them. The metal rings around the screw holes in a modern PCB are actually often used as an alternate/supplemental ground path. ATX 1.3 and later specification actually defines screw holes as a keepout area as well, so any compliant board (read: all of them) has sufficient clearance for screws and posts guaranteed. (This does not guarantee contact shorts don't happen from posts being present which they don't use, obviously. They have that option.)

http://www.overclock.net/intel-motherboards/1007125-asrock-p67-extreme6-code-45-a-2.html

That person had the same problem (same mobo, same memory init error), on a modern PCB, and fixed it by replacing their HSF with one that had a protective attachment base.

I'm not saying it isn't the CPU, it's likely because it is one of the common denominators. However, it seems crazy to me that with all the time spent on testing standardized hardware configurations, nobody would think that it is a bad idea to ignore a "partially" faulty IMC on POST and wait to report a memory initialization problem after the CPU debugging is done. Just think of Intel testing their initial IMC units, it would be absurd to ignore failures on POST. But hey, crazier things have happened.

At this point we have a non-discrete memory initialization error which we unfortunately have to make assumptions on (like it is a short, a bad IMC, bad DIMMs, etc). As with all diagnosis it is time for process of elimination, so yes it will be good to see if the CPU replacement fixes this. For all we know this will have all been caused by ESD :P

RootWyrm · August 2011

PirateNinja wrote:

(link removed)

That person had the same problem (same mobo, same memory init error), on a modern PCB, and fixed it by replacing their HSF with one that had a protective attachment base.

OCN is not authoritative, nor is that indicative of anything other than the ugly fact that that particular user is too lazy to bother building a system right. Step one of mounting a heatsink is a clearance check, period. That's been true since 1996.

It also means absolutely nothing, period, because we have THREE DIFFERENT BOARDS. The symptoms would NOT reproduce identically across THREE ENTIRELY DIFFERENT MOTHERBOARDS if it were that. Stop latching onto this meaningless "45" - it is utterly pointless. We have 3 boards with identical symptoms - the error code from one matching up with some lazy kid's screwup means absolutely frigging nothing and never will.

I'm not saying it isn't the CPU, it's likely because it is one of the common denominators. However, it seems crazy to me that with all the time spent on testing standardized hardware configurations, nobody would think that it is a bad idea to ignore a "partially" faulty IMC on POST and wait to report a memory initialization problem after the CPU debugging is done. Just think of Intel testing their initial IMC units, it would be absurd to ignore failures on POST. But hey, crazier things have happened.

And how many years of BIOS engineering do you have? I have 2 years of OEM branding on AMIBIOS back in '97-'98; all modern ROMBIOS derives from that. Things that were not tested past a basic "is it present" test include memory controllers, L1 cache and L2 cache. You can happily boot an i430HX, i430TX, MVP3 or MVP4 chipset based board with a defective L2 cache. It will POST, it will boot, it will run Windows, and then it will crash and burn when you hit the bad L2 block. (BTW, OEM branding means about what you think, but you get full access to documentation and sufficient code to build a complete custom BIOS including restricting or granting access to settings and changing defaults.)
Non-Intel boards do not do a full CPU self-test because "it takes too long." Intel boards do partial MBIST of CPU and take up to 10 seconds just to get to BIOS POST. There is no "CPU debugging" - it's not a full or even partial MBIST. It's strictly "is it there? Does it execute base load as expected? Go." Otherwise, you'd spend all your time whining about how long it takes to get to POST. Standardization? HA. The grand myth. 80h is "standardized" and half the manufacturers insist on ignoring it and creating their own set, like Asrock did. Again, 45 is a non-standard code period, and not even supposed to be used. "Reserved" does NOT mean "okay for OEM use," it means "DO NOT USE THIS CODE."

Harudath · August 2011

I use an Arctic Freezer 13, which I didn't remove when I originally cleaned the PC - the compressed air had only arrived that day and there was a fair bit of dust I wanted to blast out, so I turned off the PC, blew away the dust and when I turned it back on I had the memory problem.
Pay day isn't for another week, so I'm afraid I can't test it until then... Phooey. Unless Ebuyer decided to give me £100 store credit, anyway... (they didn't >.>)

PirateNinja · August 2011

RootWyrm wrote:

OCN is not authoritative, nor is that indicative of anything other than the ugly fact that that particular user is too lazy to bother building a system right. Step one of mounting a heatsink is a clearance check, period. That's been true since 1996.

It also means absolutely nothing, period, because we have THREE DIFFERENT BOARDS. The symptoms would NOT reproduce identically across THREE ENTIRELY DIFFERENT MOTHERBOARDS if it were that. Stop latching onto this meaningless "45" - it is utterly pointless. We have 3 boards with identical symptoms - the error code from one matching up with some lazy kid's screwup means absolutely frigging nothing and never will.

And how many years of BIOS engineering do you have? I have 2 years of OEM branding on AMIBIOS back in '97-'98; all modern ROMBIOS derives from that. Things that were not tested past a basic "is it present" test include memory controllers, L1 cache and L2 cache. You can happily boot an i430HX, i430TX, MVP3 or MVP4 chipset based board with a defective L2 cache. It will POST, it will boot, it will run Windows, and then it will crash and burn when you hit the bad L2 block. (BTW, OEM branding means about what you think, but you get full access to documentation and sufficient code to build a complete custom BIOS including restricting or granting access to settings and changing defaults.)
Non-Intel boards do not do a full CPU self-test because "it takes too long." Intel boards do partial MBIST of CPU and take up to 10 seconds just to get to BIOS POST. There is no "CPU debugging" - it's not a full or even partial MBIST. It's strictly "is it there? Does it execute base load as expected? Go." Otherwise, you'd spend all your time whining about how long it takes to get to POST. Standardization? HA. The grand myth. 80h is "standardized" and half the manufacturers insist on ignoring it and creating their own set, like Asrock did. Again, 45 is a non-standard code period, and not even supposed to be used. "Reserved" does NOT mean "okay for OEM use," it means "DO NOT USE THIS CODE."

- POST debug codes mean nothing
- I have zero years programming BIOS, never claimed to have any
- Apparently I was latching on to a theory and not being open to other ideas
- The manual for his motherboard published false information
- Root knows exactly how 2011 BIOS/UEFI systems work because he worked in BIOS thirteen years ago when IMCs did not exist.
Got it.

Harudath -- if it turns out to be the CPU maybe you could make a warranty claim with Intel. It might be a good idea to attempt that now actually. Good luck and I hope you can fix / afford to fix it soon.

Harudath · August 2011

PC is still running fine and the warranty on the CPU (if it's one year only) is still good until late February 2012, and the world will have ended by then anyway - I'm in no rush just yet. I'll update you guys once I have managed to buy that spare CPU, since as we all know - going a week or two without a working PC is as close as mortals get to the deepest circle of hell... Dante obviously worked for AMD...

Thanks guys, will keep you posted

RootWyrm · August 2011

PirateNinja wrote:

(Trimmed personal attacks.)
Got it.

That is not what I said. There is also no reason to attempt personal attacks, especially when you have no idea what my current experience is. Regardless of all other factors, the absolute unequivocal fact is that this is basic troubleshooting, which Harudath has done correctly.

If it's followed in this particular pattern, there is nothing else it could be. Motherboards are mechanically different, not to mention we already knew two facts - one, it worked previously with no issues; this automatically rules out the heatsink. Two, the only change was blowing dust out of the system with canned air. Based on fact two, Harudath followed the correct path for diagnosis; replace DIMMs on presumption of damage or spontaneous failure: Negative. Relocate DIMMs on presumption of slot fault: Positive. Replace motherboard on presumption of defect or damage to DIMM slot: Negative. Relocate DIMMs again on presumption of slot fault: Positive. Repeat the last two again. Only one component has not been replaced - the CPU. This component is the only one directly related to memory which has not been substituted. Also eliminated by Harudath: socket damage, trace damage, installation error, and disk incompatibility - any one of these would have presented a different issue at some point.

Normally, I would use IPDT to confirm a defect. However, IPDT cannot test the IMC channel unless memory is associated to that specific channel. Because the system will not pass POST (on three different boards no less) with memory attached to A, IPDT cannot test A, only B, which will return a false OK. The IMC test is a walking+inversion write-read test and the CPU is passing BIST (which FYI does not change significantly between BIOS and EFI; EFI resides atop ROM to present a GUI for BIOS interaction and in theory reduce all the random crap that OEMs have propagated over the past 15 years.) Testing known working gets us nothing, and an IMC fault won't show under stress unless there's memory associated.
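For anyone unfamiliar with a walking+inversion write-read test, here is a toy sketch of the idea on a simulated memory. It is not IPDT's code; the `SimulatedMemory` class and its stuck-bit fault are invented purely to show the pattern.

```python
from typing import List, Optional, Tuple


class SimulatedMemory:
    """Toy byte-wide memory; optionally one data bit is stuck at 0."""

    def __init__(self, words: int, stuck_low_bit: Optional[int] = None):
        self._cells = [0] * words
        self._mask = 0xFF if stuck_low_bit is None else 0xFF & ~(1 << stuck_low_bit)

    def __len__(self) -> int:
        return len(self._cells)

    def write(self, addr: int, value: int) -> None:
        self._cells[addr] = value & self._mask

    def read(self, addr: int) -> int:
        return self._cells[addr]


def walking_inversion_test(mem: SimulatedMemory, width: int = 8) -> List[Tuple[int, int, int]]:
    """Write a walking-one pattern and its inverse, read back, collect mismatches.

    Returns (address, expected, actual) for every cell that read back wrong.
    """
    full = (1 << width) - 1
    failures = []
    for bit in range(width):
        pattern = 1 << bit
        for value in (pattern, ~pattern & full):
            for addr in range(len(mem)):
                mem.write(addr, value)
            for addr in range(len(mem)):
                got = mem.read(addr)
                if got != value:
                    failures.append((addr, value, got))
    return failures


if __name__ == "__main__":
    print(len(walking_inversion_test(SimulatedMemory(4))))
    print(len(walking_inversion_test(SimulatedMemory(4, stuck_low_bit=3))))
```

The sketch also shows the limitation described above: the test can only find faults in cells it is able to write and read, so a channel with no memory attached is never exercised.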

I'm not even going to guess why it failed. Spontaneous failures happen. Usually without warning and without ever finding an explanation. RCA would be nice, but not likely to happen. I would double check temperatures on the CPU with Everest or similar; you need to verify that Tcase is under 72.6C. (TJmax is not used on i7 Gen 2.) I don't see any reason it would be near or over it, so I'm chalking it up to random failure; see if Intel will advance ship a replacement. I honestly haven't RMA'd a CPU to them in ages, so no idea what current policies are.

PirateNinja · August 2011

Root, in all honesty I didn't make a personal attack. If anything I think you are a bit brash when you post and perhaps not open to others' ideas and I tried to sum that up.

If you want to carry on our conversation over PM fine (even though I'm a bit exhausted with it), otherwise I feel like we have crowded up this thread enough with conjunction, appeal to authority, and a ton of other illogical jib jab. All we needed to do was help Harudath out, and we got as far as we could with that yesterday.

I'm done posting in this thread, but Harudath I'd love to hear from you even post 2012 if and when you get this fixed. Root, I'll buy you a beer if it's the CPU.

Harudath · September 2011

PirateNinja, looks like you owe him a beer

The replacement socket 1155 CPU fixed it, and now with the new i7-2600k in everything's working fine, got all 12GB running in dual channel.

Thanks guys, greatly appreciated the help
