[Resolved] Looking for assistance with crashes, faults, bluescreens on a new build
Hi folks! This is a more thorough continuation of what I brought up in Discord: I recently built an all-AMD system, and when it works it's an absolute monster. However...
I experience intermittent bluescreens, no matter the configuration or part setup or drivers. Some more finicky games (WoW is the major one, pubg I think is included) also throw exceptions quite a bit. These happen under all types of load (idling, gaming, zipping large packages), with a huge range of time (two minutes after restart, four days of no problems, all in between). I've done a TON of testing and debugging and checking, to no avail. And so I turn to you!
I haven't found anything to reliably cause a crash. Today I had a crash from restarting from a crash, starting my music player, walking away to get the mail, and hearing it bluescreen and restart as I was sorting letters. It's crashed during stream viewing, youtube viewing, game playing, and playing music with nothing else open.
To date, it has NOT crashed while running memtests for more than a day, and putting the system under significant benchmarking strain doesn't cause it to happen more frequently, so far as I can tell.
Bluescreen codes seem random. KMODE_EXCEPTION_NOT_HANDLED, DRIVER_OVERRAN_STACK_BUFFER, PAGE_FAULT_IN_NONPAGED_AREA, several others.
Sometimes the bluescreen is yellow!
Sometimes the system just freezes and restarts without a code or dump.
WoW's errors are uniformly ACCESS_VIOLATIONs stating that pointers were to memory that couldn't be written to or read from. Research into this issue as its own thing has yielded no results so I do include it as a symptom here. No guarantees.
OS: Windows 10 Enterprise (currently trial)
PSU: EVGA Supernova 850W (site)
CPU: AMD Ryzen 7 3700X (site) with stock Wraith Cooler
Motherboard: Gigabyte X570 Gaming X Rev 1 (site)
RAM: G.Skill Ripjaw V DDR4-3200C16S-16GKV at XMP timings and voltage, x2 in dual channel (site) (QVL with motherboard)
Video: Gigabyte RX5700 8G (site)
Monitors: Dual AOC G2460 @ 144hz, 1920x1080 8-bit RGB via DisplayPort (have used HDMI as well)
4 HDDs: OCZ-Agility3s, WD Blacks, on sata, Mushkin Pilot-E on NVMe, each used solo in testing (including a known good from another machine that I formatted just for testing).
USB: Deathadder 3500 mouse, Microsoft 4000 Ergonomic keyboard
Sound: Tested with onboard, USB headphones, X-Fi Xtreme PCI card
RMAs: RMA'd the graphics card AND the motherboard, both came back as okay.
Temperature: Temps are stable, all remain under 72degC even when under load.
Voltages: No real voltage jitter seen during idle or under load.
Memory: memtest86 for 48 hours, memtest64 for 10 hours, windows 10 memtest tool twice, all pass.
Tried running at default speed and voltage.
Tried single sticks in each of the 4 slots available.
Never ran at higher voltage than XMP's 1.35V recommended.
CPU: The triangle lined up, seated with 0 force, lever locked into place, cpu was not loose after lock went in. Used thermal paste, seated heatsink, locked heatsink down.
Original purchase from newegg had totally legit-looking packaging but someone had stolen the 3700X and left a poorly-cleaned 1700X in its place. I sent that back and my current one totally shows up as a 3700X in hwinfo and other tools sooooooo...
Cabling: All cabling came from the rig this is upgrading which worked without issue.
Sata cables swapped with other known-good cables.
Modular power supply cables swapped out.
Checked for loops, kinks, cuts, weird looseness.
Tried using different sata power plugs in the power lines.
Tried different SATA slots on the motherboard (there are 6, tried 6).
Power: Motherboard has a 24-pin and an 8-pin power connector; both are plugged in.
Swapped in another known good power supply (Thermaltake Toughpower 750W Modular (site)). Issue still occurred, even after trying different modular cabling.
Tested grounding through case, and all motherboard screws, to the surge strip ground successfully.
Tested system laid out on non-conductive surface without a case at all.
Video: Swapped in a known-good Nvidia 970. Seeeeeemed to crash less, but no real data and still bluescreened.
Turned off all special settings via Radeon, set to single monitor 60hz 1920x1080 over HDMI to AOC monitor.
Turned PCIE to version 3.
Disabled hardware acceleration for browsing and discord.
Drivers and OS: Verified Win10 install iso via sha256sum. Installed via USB.
DISM and sfc come back clean (other than some noise that other places on the internet say is the windows default antivirus interfering and is completely normal).
Clean installs to various other hard drives tested solo (no other large disks) to no avail.
I've used Gigabyte's drivers (chipset, driver) for motherboard and video, and the AMD-available newer ones, in available sequence.
Also installed gigabyte's motherboard tools to no effect.
BIOS for motherboard has been updated and rolled back and updated again multiple times, testing each available new update.
Display Driver Uninstaller used in safe mode and in normal mode to remove video drivers, while not connected to internet, installing video drivers in normal mode without internet connection.
Just in case, even installed non-generics for the monitors to no avail.
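A minimal sketch of the ISO checksum check mentioned under "Drivers and OS", for anyone who wants to repeat it without sha256sum. The function names are hypothetical, and the expected checksum is assumed to come from wherever the ISO was published; nothing here is taken from the actual install.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a multi-gigabyte ISO never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iso_matches(path, expected_hex):
    """Compare a file's digest to a published checksum (case/whitespace tolerant)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Usage would be something like `iso_matches("Win10.iso", published_checksum)`, where both the file name and the checksum are placeholders.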
God that's a lot. Sorry. I've gone through quite a bit trying to get this working. Thanks for nosing through, and thanks to the folks in Discord who tried there too!
Let me know if there is more information you'd need or suggestions for things I haven't done yet. My hope is I've missed something elementary that I just don't know to do, or something. I do not have the kind of money that would allow for randomly buying parts, and nobody near me has current-gen parts they can spare for testing. I'm mostly ready to just switch back to the old parts until things have gotten better a couple months from now, but I want this to work, dammit!
Some images that might help:
I plugged in the 8 pin, 24 pin, heatsink fan, 8 pin and 6 pin on video as expected.
Hard drives are plugged in as one might expect (that dangling unhooked wire has been like that since the old build and no problems there). I've got a completely unplugged drive that I use for backups every month or so as well.
Nooooo I didn't forget to keep the motherboard from grounding against the case. This time.
Could be so many things, but to check something simple first: do you have any background apps running while gaming? All your crashes do seem gaming related? Recently I had a different issue, a strange network lag thing where occasionally I'd lag out of games, but I'd look and everything on the hardware and network end seemed fine, not like my network card crashed or anything. After troubleshooting, it turned out to be Citrix client software that was running on the front end; I had installed it for some work functions. I disabled that on startup and all my problems disappeared. Perhaps there are some background apps that you can disable and try gaming with them off to see if any of those may be causing a conflict?
I second what Cliff is saying, there is a lot of stuff that could be wrong with it right now but not enough info to make a solid decision.
Right now I would see if you can get "Heaven" benchmark to run at max settings for at least an hour without issues. If that passes fine I would move on to "Superposition" and run that for at least an hour as well. You can find both at https://benchmark.unigine.com/
If you're unable to get them solid for an hour each, switch over to "benchmarking" mode and post the results.
Does your BIOS have an option where it can dump all of its current settings to text file on a flash drive? My Asus does.
You should also consider analyzing a crashdump with the WinDBG tool. It may reveal a specific driver responsible for your BSODs. I've seen network drivers, capture cards and webcam drivers wreak havoc with a system.
I'll be running Heaven and then Superposition on highest settings for an hour each, and WinDBG is a thing I was looking for but couldn't find before I settled on WhoCrashed. I'm not scared of crash dumps so I'll happily look through those the next time I generate a real one (crashing multiple times a day with enormous crash dumps led me to turn the dump size down, but now that I can introspect them, right back to full).
Once those are done I'll see if I can generate a bios settings dump and zip it or something for here.
Weird, but the last time I ran into that it was a bad psu. No promises though. I think Thrax covered all the bases except that.
No crashes while running the benchmark programs. No crashes yet today for full memory dump debugging.
Bios does not have support for dumping out all settings, but if a few crash dumps don't get me anywhere, I'll just make one by hand.
I have already disabled almost everything listed in the task manager startup tab, but haven't run through services yet as that requires a lot more research to do safely.
I'll let you know how the next few crash dumps look in WinDBG.
@Strikes as you were able to run heaven and superposition without issues you can more or less rule out your GPU, motherboard/CPU or PSU being the main culprit. They're pretty good real world stress tests as compared to SuperPI or FurMark.
I agree, I'm thinking given those stress tests that it's less likely a hardware issue and more likely some kind of odd little software bug. Does not hurt to run a long memtest run overnight just to rule out any memory errors, but if that comes back clean I'd just back up and clean install Windows and all your hardware drivers.
I've reinstalled Win10 8 times. I'm exhausted from redoing basic configs, but I'll get ready to do it again. Maybe there's old driver cruft or something somehow somewhere.
Today's bluescreen: CLOCK_WATCHDOG_TIMEOUT (101) via amdppm.sys. The threads available in the dump were mostly waiting on context swaps that were waiting on interrupts that the processor quit giving out. Looks like just about anything can cause this, so no luck. Gonna try using Verifier one more time and seeing if WinDBG makes anything they do more apparent. Then cutting out any running service / program I don't actively use and seeing if I can get a near-idle crash again. Then on to another reinstall.
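Since the stop codes seem random, one way to look for a pattern is to tally them as dumps come in. This is only a sketch: the numeric values are the standard Windows bugcheck codes for the names mentioned in this thread, but the function name and the example log are hypothetical, not real crash history.

```python
from collections import Counter

# Bugcheck names seen in this thread, mapped to their standard stop codes.
BUGCHECK_CODES = {
    "KMODE_EXCEPTION_NOT_HANDLED": 0x1E,
    "PAGE_FAULT_IN_NONPAGED_AREA": 0x50,
    "DRIVER_OVERRAN_STACK_BUFFER": 0xF7,
    "CLOCK_WATCHDOG_TIMEOUT": 0x101,
}


def tally_bugchecks(names):
    """Return (name, code, count) tuples, most frequent first.

    Names missing from BUGCHECK_CODES get a code of None.
    """
    counts = Counter(names)
    return [(name, BUGCHECK_CODES.get(name), n) for name, n in counts.most_common()]


# Hypothetical log, for illustration only.
example_log = [
    "CLOCK_WATCHDOG_TIMEOUT",
    "PAGE_FAULT_IN_NONPAGED_AREA",
    "CLOCK_WATCHDOG_TIMEOUT",
]
for name, code, count in tally_bugchecks(example_log):
    print(name, hex(code) if code is not None else "unknown", count)
```

If one code or one driver starts to dominate, that narrows things down; an even spread across unrelated codes points more toward hardware than toward one bad driver.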
@Strikes when you say you're using verifier, are you referring to running sfc /scannow through the command prompt? If not, try running it and see if it helps.
@Thrax may be able to offer more insight on next steps
I know that the GPU was RMA'd and came back good, but for the sake of thoroughness, have you tested with alternate GPU, RAM, PSU, drive cables?
I've run into situations over the years where a component that tests good in one system will cause wild instability in another system. Recently I had a brand-new Dell Precision workstation laptop that would not wake from sleep mode. I spent days troubleshooting it. Turned out the Crucial RAM I installed in it did not agree with it, even though they would pass a memtest while installed in the computer and there were no other identifiable symptoms.
I assume that mobo has onboard graphics, so the first thing I'd do is pull the GPU and see if you have stability problems that persist in its absence. If they go away and then return when you put the GPU back, you can say 100% it's the GPU or the motherboard itself that is somehow not happy together, and then test with an alternate GPU to see if the problems persist there.
Does AMD have a routine that will test the CPU, internal mem and interconnects?
I had an issue in CPU where some of the buffer stopped talking to other parts.
I believe there is a possibility, however remote, that you may have RMAed a component and gotten back either the same component or another broken one.
It happened to me once. Maddening.
@MrTRiot Driver Verifier. No dice this time. I've already done sfc and DISM, both no joy. Thanks though! Thrax also offered helpful advice on testing PCIE generation and making sure it wasn't a DisplayPort cable issue, no joy there either.
@RyanMM Alternate GPU was the 970 it replaced. Alternate PSU was my minecraft server's box (also the previous gaming rig build's). I have indeed swapped all drive cables, SATA and power both. All no joy. I do not have access to other RAM without buying more and I cannot afford more at this moment, although if that changes I'd gladly test different sticks entirely. G.Skill has them on the QVL list so it'd be really weird, though, to have a defect like that. As you said
No onboard video (the motherboard DOES have an HDMI slot, but it's only active if the processor supports graphics processing, and the 3700X does not). I really wish that would point to just the motherboard, but my shitty experience with Newegg's processors this time around makes me reaaaaaally wonder if they didn't ship me someone else's 3700X after they overclocked the fuck out of it and then sent it back or something equally stupid. Testing a different motherboard is the same problem as RAM: can't afford a new current-gen motherboard just for testing and nobody near me has a spare to test with. Thanks for the advice though!
@edcentric No specific testing program for AMD products that I know of. That's most of what continuous benchmarking / loadtesting / memtest is supposed to test indirectly, and I don't have trouble there.
@primesuspect Gigabyte tested my motherboard for less than a day before they boxed it back up and mailed it back to me. It's got the same serial number sticker on it, and their email explicitly says, "Since we couldn't reproduce the issue you're RMA'ing for, we're sending your board back to you." Maddening is correct, considering how much I suspect this motherboard of being the issue. After I bought the current mobo, THEN newegg started getting a lot of low-star reviews, but newegg's motherboard return policy is pretty much just You Can't Once You Open The Box. No proof until I purchase another motherboard though, and that's not gonna happen for a while.
Driver Verifier sure did crash me a lot. The only thing that came up was that WoW was doing something that afunix didn't like, but re-updating my network drivers and even installing a wireless networking card didn't change anything, so I'm letting it go. My crashes are not WoW-specific and all other usage didn't seem to cause any trouble via Driver Verifier. I'm currently burning the Win10 trial to an actual physical DVD, installing a dvd drive, pulling one of my blank hard drives out of storage, and installing on that next.
I really think Gigabyte did you dirty on this. I think you have a bad motherboard on your hands. I would almost send them to this thread or copypaste it into another angry email.
Given everything that has been tried, you could try running SuperPI on all cores until it either crashes or starts spitting out a boatload of uncorrectable errors.
F@H would also work provided you switch the settings for it to be CPU only (and not dedicating a core to the GPU)
You could also post your heaven and superposition benchmarking scores and see if they're in an acceptable range for your system.
Both are long shots at this point.
One thing I had to research was what the X suffix meant for a CPU. Along the line, I discovered that X suffix CPUs could overclock well, but that they did not have builtin graphics. And the motherboard I ended up with was one that had builtin graphics that could be overclocked-- but not by much.
Superposition was 4300 which seems a little low for matching hardware but I'd imagine correct for an untweaked dual screen system. FPS ranges especially seem right. It also proved that my video card actually tops out at 76degC instead of 72, which is nothing worth noting other than that it tripped my sensor check so that's working.
Running Super Pi on each core one at a time for 16 logical cores is going to take two and a half hours of manually swapping affinities but seems like a worthwhile test to see if there's a janky core (provided the jank found can be triggered by execution of x87/NPX instructions). On 6 so far, then on to pulling all drives and doing the fresh install from CD.
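The per-core run described above amounts to pinning the test to one logical core at a time, and each core corresponds to a single-bit affinity mask. A rough sketch of the bookkeeping follows; the function names are made up for illustration, and the assumption is that the masks get fed to whatever affinity tool is in use (Task Manager, or a launcher that accepts a hexadecimal mask).

```python
def affinity_masks(logical_cores):
    """Return one hex affinity mask per logical core (bit i set for core i)."""
    return [format(1 << i, "X") for i in range(logical_cores)]


def minutes_per_core(total_minutes, logical_cores):
    """Split a total time budget evenly across cores."""
    return total_minutes / logical_cores


masks = affinity_masks(16)
print(masks[0], masks[15])        # first and last core masks
print(minutes_per_core(150, 16))  # budget per core for a 2.5 hour run
```

With 16 logical cores and two and a half hours total, that works out to a bit over nine minutes per core.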
@Straight_Man I didn't know that, but it'd make more sense of the X than I had before for sure. I did consider making a build that used built-in graphics (I am using a gaming laptop as my TV rig but it's just adequate instead of Cool Bookend Microsystem, you know), but my main rig was out of date enough that I wanted new hotness. My setup is definitely new!
@primesuspect I will do that once I finish even more very exhaustive exhausting testing. I rather imagine it'll get the same Oops LoL Shrug Emoji response they gave the first time and I just have to get myself into a place where I can take that again.
Apologies in advance if I suggest something that’s already been tried. I’m typing on a phone and don’t have a computer nearby.
This sounds like a super weird memory issue. Since you’re passing memtest and have RMA’ed the motherboard, that leaves the RAM itself, the PSU and the CPU.
Since your RAM has passed in different combinations of quantity and location, it’s probably fine.
The PSU is less likely if you’ve seen stable voltages. Was it stable across all of the lines? I had an issue with RAID controllers because either the 3.3V or 5V would quickly drop and return to normal sometimes.
This may seem out there, but I would say this is going to be an issue with your CPU. You’re having memory access violations and kernel mode BSODs. Perhaps it’s a dud memory controller or some other component of the processor that’s only being exercised under certain conditions.
1: change your memory timings off of XMP to one of the lower SPD settings. I have a Ryzen board that won’t POST if the RAM is set to anything close to XMP.
2: If you’re not running the latest BIOS, update. That board is new enough that this shouldn’t be an issue, but it never hurts to look.
Edit: to piggyback off of what @primesuspect said about the Gigabyte RMA, maybe place a mark on the board in an inconspicuous place...something like a sharpie mark on the edge. I’ve had to deal with that before, not from Gigabyte but I have seen it happen.
Um, if you have a battery backup, try plugging it in. I have one computer here that only works off battery backup.
Rock solid voltages. Temps also good. Note that this is over a period of 16 hours (no crashes because they're InTeRmItTeNt), including during PubG play. Well within 5% ATX tolerances for PSU behavior.
... Vcore minimum is 1v which is what it's supposed to be because hwinfo probing causes it to not report real core sleep right. Vcore max was 1.524??? I knew single core auto voltage was aggressive, but that seems really fucking high??? Temps didn't get crazy but jeez that's high. Core Voltage via SVI2 TFN reported a max of 1.487v which is within tolerance so maybe it's just a measurement difference.
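The "within 5% ATX tolerances" check on the main rails can be written down as a small sketch. The helper names and the sample readings below are hypothetical (they are not the numbers from the hwinfo log), and the flat 5% applies to the positive 12V, 5V and 3.3V rails only, not to Vcore.

```python
def rail_limits(nominal, tolerance=0.05):
    """Return the (low, high) voltage limits for a rail at the given tolerance."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


def within_tolerance(nominal, observed_min, observed_max, tolerance=0.05):
    """True if both the lowest and highest observed readings stay inside the limits."""
    low, high = rail_limits(nominal, tolerance)
    return low <= observed_min and observed_max <= high


# Hypothetical min/max readings per rail, for illustration only.
sample = {12.0: (11.9, 12.2), 5.0: (4.95, 5.08), 3.3: (3.28, 3.36)}
for nominal, (lo, hi) in sample.items():
    print(nominal, within_tolerance(nominal, lo, hi))
```

Checking the minimum and maximum over the whole logging period, rather than the average, is what catches a brief sag or spike on one rail.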
Super Pi didn't turn up anything.
@mertesn I have used stock memory settings (much, MUCH lower than XMP) to no avail. Definitely have tried every bios offered. As for the RMA, like I said, I did RMA it, and they were very very clear that 1) they tested the board by plugging it in and turning it on and it turned on and maybe they wiggled the mouse on their winxp test install and 2) they sent the original back and said as much.
@Straight_Man UPS died more than a year ago, but I did try without my powerstrip in the way, and with different wall outlets with and without the strip to no avail.
Another fresh install via CD on the way.
Sorry, I just know what I suggested.... I cannot come up with any other ideas.
I looked again at your photos, did you use a riser with no insulation washer?
@Strikes Unless we can either induce a crash or do something specific to solve them then we're basically shooting blind.
That being said, your core 7 power reaching past 11w seems a bit high considering all your other averages and maxes. It's typical for the first few cores to have high voltages due to varying multicore supports, but it's strange to see such a late core reach those levels even with stress testing.
His core voltages and watts are fine.
Back from the new install via CD and yet another hard drive, still crashing and bluescreening unpredictably there.
@Straight_Man No insulating washers because that defeats the purpose of properly grounding a motherboard to the case. I already tried running the system sans case earlier which I believe is essentially the same thing.
@MrTRiot I'm inclined to agree, although I've been past the easy checks for quite some time. AMD's listed specs for wattage and voltage look high when compared to previous / other CPUs but the spec's the spec and I'm in it.
Thanks to everybody for your help. On the one hand I'm sad that the issue's esoteric enough that even a community of tech-savvy people is stumped by it, but on the other I'm extremely grateful that I've received so much help from you all!
At this point, I believe it's better for my health and sanity to go back to my old components. I'll likely just sell off the parts I can't use. If I run brainlessly headlong into a pack of adds, or lose four hours of gameplay in a single player game that doesn't autosave, or have to hear the same 40,000th of a second of audio loop for over a minute, or debug a crash dump that isn't my own product, or have to tell my friends to recap the last four minutes of their stream, or use winlayoutmanager to reset my window positions again this year it'll be too soon
Bless you folks o7
Reading everything at this point, I think @primesuspect is right that the motherboard is the most likely culprit. The only thing I think is missing for thoroughness' sake is testing with alternate memory; I know you said you can't afford another pair right now, but think of it more like a rental - grab a set from a local retailer with a good return policy, test with the memory, and if you STILL have problems, I think you can safely say you've ruled out everything except the motherboard.
At which point you have to take Gigabyte to task and hope you can get to the level of tech support where they don't blow you off with a by the numbers diagnostic at their RMA facility.
Strikes-- every one of the mobos (about 750) I have handled has had some mount points which should not be grounded, and a lesser number that need to be grounded to case if they will run that way. If the mobo runs with the PSU plugged into it, AND with the mobo risers only fastened to a piece of plywood, THEN that mobo should be insulated at all mount points at case because in the case at hand that mobo will have wires in the power harness to ground it to the PSU. I have seen a few rare boards that should be grounded at all riser points, but those were very early ones. These days most of the boards with layers going to sockets are grounded by using traces going to the sockets internal and external from board. Not criticizing you, just sharing.
@Strikes -- if you have good relations with a computer store, most of them will test RAM for a small fee. They keep a RAM tester on hand for this.
As a follow-up, this thread led me to lean much harder on Gigabyte than I did the first time. They finally accepted a second RMA... and even paid for shipping this time. After a week and a half of whatever they did on their end, they swapped my board for a same-model replacement.
I have not had a single error, let alone a crash, since implementing the new board. So this journey of intermittent horrors comes to a satisfying conclusion. If only it didn't take 9 months of not being able to reliably use my new machine :<
Thanks again to everybody who helped me figure out this terrible issue!