foone,
@foone@digipres.club avatar

So I ran memtest86+ on my desktop.
It didn't error!
or succeed. It just... hung, 25 minutes in.

that's probably not a good sign

fuchsiii,
@fuchsiii@oxytodon.com avatar
kkarhan,
@kkarhan@mstdn.social avatar

@fuchsiii @foone did it crash or did it just freeze?

foone,
@foone@digipres.club avatar

@kkarhan @fuchsiii just froze

kkarhan,
@kkarhan@mstdn.social avatar

@foone @fuchsiii hopefully it didn't coil-whine, make noises or catch fire.

jonoabroad,

@foone because you have SOOOOOOOOO much memory it didn't finish?

foone,
@foone@digipres.club avatar

@jonoabroad look 640kb ought to be enough for anyone

jonoabroad,

@foone I still remember our first PC and just running through all the DOS commands.

Anywooo recover is a cursed command and I had to explain to dad how I'd broken the less than 24 hour old computer.

That was 40 years ago or so.

Yup yup yup

mmu_man,
@mmu_man@m.g3l.org avatar

@foone might want to try another (an older?) version, I recall seeing various hangs on various machines…

foone,
@foone@digipres.club avatar

@mmu_man I've had that before, but it was mainly because I was needing to run it on 386s and the like. This is a modern machine so there should be no problems.

mmu_man,
@mmu_man@m.g3l.org avatar

@foone well our ProLiant G6 disagrees, it doesn't like all versions…

foone,
@foone@digipres.club avatar

I'm gonna have to start pulling parts out and get it down to a minimal hardware setup and see if it still crashes.

like, I probably could disconnect the floppy drives. those shouldn't be needed just to run memtest

foone,
@foone@digipres.club avatar

even if they are clearly VITAL for day-to-day work

foone,
@foone@digipres.club avatar

yanked the odd hardware like GPU and hard drives, as well as the common stuff like DVI video capture device, both floppy drives, optical drive, and serial port

foone,
@foone@digipres.club avatar

I had to go find another keyboard though. my spare keyboard on hand is an AT keyboard, and while of course I have a (hand made!) AT-to-PS/2 adapter, this computer for some reason only has USB ports.

I can't understand these companies just expecting us to have obscure hardware on hand to diagnose their failures.

foone,
@foone@digipres.club avatar

I normally have it plugged into a KVM but I'm bypassing that on the extreme off-chance that there's some weird interaction between the KVM and the PC

foone,
@foone@digipres.club avatar

WE'VE GOT ERRORS! hitting a bunch of RAM problems around 31-39gb.

nastily that's divided between two banks. So... it's probably not just one pair of DIMMs going bad.
(and of course it's also possible it's some other problem that's causing RAM issues)

foone,
@foone@digipres.club avatar

yanked half the RAM to see if it still errors

foone,
@foone@digipres.club avatar

with half ram, I did two successful passes of memtest86+, no errors. time to check THE OTHER HALF (to confirm if it was just a needs-reseating/too much ram for the mobo problem)

foone,
@foone@digipres.club avatar

and it passes too.
So, either my motherboard has stopped being able to support 64gb, or it was just a reseating problem. Let's stick it back in and see.

foone,
@foone@digipres.club avatar

that's... probably not a good sign

AlesandroOrtiz,

@foone On the other hand, you created one of the coolest screenshots of Memtest86+

foone,
@foone@digipres.club avatar

I really don't like that it corrupted outside of the text window.
That shouldn't be happening if this is just a RAM problem. That's like... a GPU corruption problem.

Which is very bad because I'm currently using the CPU's on-die GPU

xabean,

@foone but remember your RAM is your VRAM right

foone,
@foone@digipres.club avatar

@xabean good point

foone,
@foone@digipres.club avatar

@manawyrm pointed out that my ram sticks weren't properly interleaved. So I'm trying memtest again with that fixed

foone,
@foone@digipres.club avatar

NOPE! it ran for 3 hours, threw a ton of memory errors, then crashed in the same matrixy way.
So, back to just one pair of sticks, and I'm gonna let memtest run for a while and see if that seems stable.

foone,
@foone@digipres.club avatar

OK so ran the half-ram test for nearly 8 hours with zero issues.
Now I've reattached the GPU, usb ports, serial port, floppy drives, and rust drives and optical drives (but not the m.2 ssds yet because they're boot drives) and I'm running it again, to confirm it's not a PSU problem

foone,
@foone@digipres.club avatar

No issues after running for 4 hours!
I'll leave it on overnight but this seems to be functional

foone,
@foone@digipres.club avatar

The only thing that's not back in is the two m.2 drives, which I doubt will make much difference.
We'll see!

foone,
@foone@digipres.club avatar

It turns out the GPU wasn't in use so I can't be sure it's not still a power problem.
So I fixed that, put in the m.2, and started again

foone,
@foone@digipres.club avatar

Memtest didn't complete, but for a very silly reason: I have my computer up on the desk and plugged into a different outlet for this testing, which means there's a power cable crossing the doorway.
When I left the room, moving the cable so I could go under it slid it out of the power supply and it shut off

foone,
@foone@digipres.club avatar

memtest completed and now I'm going full on stress testing: playing fullscreen video, 3D games, processing a bunch of shit in the background, some VMs. you know, the usual stuff I do on an average day

foone,
@foone@digipres.club avatar

I probably shouldn't try to run the VM that has 16gb of RAM allocated to it, though

foone,
@foone@digipres.club avatar

Well that lasted all of 30 minutes before the system hung.
Fuck. It's not just the RAM.

jernej__s,

@foone Ouch.

foone,
@foone@digipres.club avatar

So, it's probably one of motherboard, cpu, or PSU. At a stretch, it could be the GPU.

I have another spare GPU I could swap in. I have a near-identical CPU that I could swap in (it's in use, but I can temporarily borrow it).
PSU and mobo are trickier.
So, I'll have to try the easy ones first. Swap the GPU and see if windows still hard crashes like that, then the cpu, then start working on the others.

baishen,

@foone PSU tester? They're fairly cheap.

foone,
@foone@digipres.club avatar

@baishen I've got one somewhere but this seems too subtle a problem for it to detect. It's not failing to boot, it's just falling over after running for 20-30 minutes

baishen,

@foone I was hoping it would show if one of the rails was marginal.

Does it start right back up or need a cool off?

foone,
@foone@digipres.club avatar

@baishen starts up just fine

foone,
@foone@digipres.club avatar

If you'd like to help me get back online (and gay cats, of course), donations would help. I'm kinda broke and not having a working computer is not going to help.

https://ko-fi.com/fooneturing

Pibble,

@foone intel or amd? what gen? this looks legit like a cache instability issue (not cache per se, but ringbus/infinity fabric instability)

foone,
@foone@digipres.club avatar

@Pibble it's a core i7-8700k

Pibble,

@foone default uncore/ring ratio I assume. What do vccsa and vccio read (voltages, system agent and IO)? also how much memory, dual rank sticks, or single rank? 2 dimm or 4 dimm, what mobo?

Pibble,

@foone the only reason I ask is because 8th gen Intel memory controller likes to fall over when dealing with 4 sticks of dual rank memory aside from micron e die, and z1/4xx (6th-9th gen) boards were notorious for shoving a ton of sa and io voltage when all dimms were populated or xmp was enabled. And by a ton, I mean voltage that will degrade or kill the imc fairly quickly, like 6 months to a year.

foone,
@foone@digipres.club avatar

@Pibble that could definitely be it. this motherboard is about a year old

foone,
@foone@digipres.club avatar

okay today's first test: I yanked out my GPU and I'm running on just the internal GPU. I'm gonna load up some videos, VMs, 3D games, and a bunch of browser tabs. See if this falls over too

foone,
@foone@digipres.club avatar

I am not getting a large number of frames, and I only have one game running at the moment.

foone,
@foone@digipres.club avatar

somehow I got my youtube video playing over my actual speakers but one of the games playing out the HDMI and the little speakers on the monitor. that's weird.

foone,
@foone@digipres.club avatar

I AM STRESS TESTING THIS MACHINE AT NOT A LOT OF FRAMES A SECOND

ChlorideCull,

@foone Don't know why I expected your Minecraft skin to be a floppy.

foone,
@foone@digipres.club avatar

@ChlorideCull I think my minecraft skin actually predates my floppy disk hyperfixation, if that's even believable

foone,
@foone@digipres.club avatar

okay I've made it an hour running sans-GPU. That doesn't mean it's the GPU though. This machine is using way less power without the GPU... so it could still be a PSU related problem.

revk,
@revk@toot.me.uk avatar

@foone I love that processors can slow down when too hot, it is literally like people getting tired.

foone,
@foone@digipres.club avatar

so after running fine for about 3 hours with no GPU, I've gone out and bought a new... power supply.

yeah I don't think it's the GPU. And a flakey PSU could easily fail with the GPU and not without, since the power usage is way lower without a GPU in there

timixretroplays,
@timixretroplays@digipres.club avatar

@foone if it's not DNS it's PSU

foone,
@foone@digipres.club avatar

okay new PSU is in. That took way longer than it should.
Apparently between the RM650x and the RM850x, Corsair redesigned their modular cables, so I couldn't just swap the PSU and reuse the cables. So now I have a cable management nightmare, but it's running. Let's put the stress-testing pants on

foone,
@foone@digipres.club avatar

the worst part is that I forgot to double-check that the new PSU would come with the right cables to let me hook up my floppy drives. Thankfully, it did.

foone,
@foone@digipres.club avatar

that could have been a disaster

glyph,
@glyph@mastodon.social avatar

@foone good luck with the burn-in testing on the new PSU 🤞🏻

foone,
@foone@digipres.club avatar

well my 3.5" floppy drive is working. That's a good test

foone,
@foone@digipres.club avatar

changing my PSU seems to have confused Satisfactory into thinking I'm a different person and now I'm sitting on the floor of my own base. WHO ARE YOU?

foone,
@foone@digipres.club avatar

CRASHHANG.

foone,
@foone@digipres.club avatar

sticking in a different GPU.
Swapping out my Asus GeForce RTX3070 for an EVGA GeForce GT 1030

foone,
@foone@digipres.club avatar

ran an hour and 30 minutes on the other GPU with no crashes.

god damn it, IS it my GPU?

scottmichaud,
@scottmichaud@mastodon.gamedev.place avatar

@foone The visual artifacts in Memtest strongly suggest that something's wrong with the GPU.

Could be multiple failures, too.

foone,
@foone@digipres.club avatar

@scottmichaud that was using the onboard GPU, not the suspect one!

foone,
@foone@digipres.club avatar

taking a work meeting from the system under test
this is known as "living dangerously"

foone,
@foone@digipres.club avatar

okay. on the new PSU, with old GPU back in, but in a different PCIe slot. Let's see how this goes

foone,
@foone@digipres.club avatar

no crash in an hour with it in a different slot? weird.

foone,
@foone@digipres.club avatar

19 hours in the different slot, no crashes. Very strange.
So, theories:

  1. that slot was just bad/dirty. Possible, I guess? The other GPU worked fine in that slot, though.
  2. The GPU might be running at 8x PCIe instead of 16x PCIe. Maybe that pushes it over some timing/temperature threshold and makes it not crash?

foone,
@foone@digipres.club avatar

okay, GPU-z says it is indeed running at x8 3.0 speed, when it's capable of x16 4.0.

So, how much you wanna bet that if I fix that, the system will start crashing again?
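For a sense of scale on that x8 3.0 vs x16 4.0 gap, here is a rough back-of-the-envelope sketch. It assumes the published per-lane transfer rates (8 GT/s for PCIe 3.0, 16 GT/s for 4.0) and 128b/130b line encoding, and ignores all other protocol overhead, so real throughput will be somewhat lower. The function name `link_gb_per_s` is made up for illustration. Note that x16 4.0 is what the card is capable of; what a slot can actually negotiate also depends on the CPU and motherboard.

```python
# Approximate one-direction PCIe bandwidth, counting only line-encoding overhead.
# Assumption: per-lane rates of 8 GT/s (gen 3.0) and 16 GT/s (gen 4.0),
# both using 128b/130b encoding. Packet/protocol overhead is ignored.
GT_PER_S = {"3.0": 8.0, "4.0": 16.0}  # gigatransfers per second, per lane
ENCODING = 128 / 130                  # payload bits per transferred bit


def link_gb_per_s(gen: str, lanes: int) -> float:
    """Return approximate GB/s in one direction for a PCIe link."""
    bits_per_s = GT_PER_S[gen] * 1e9 * ENCODING * lanes
    return bits_per_s / 8 / 1e9


if __name__ == "__main__":
    negotiated = link_gb_per_s("3.0", 8)   # what GPU-z reported
    capable = link_gb_per_s("4.0", 16)     # what the card supports
    print(f"x8 3.0:  {negotiated:.2f} GB/s")
    print(f"x16 4.0: {capable:.2f} GB/s")
    print(f"ratio:   {capable / negotiated:.1f}x")
```

Halving the lanes and dropping a generation each cost a factor of two, so the negotiated link has roughly a quarter of the card's peak link bandwidth.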

tallawk,
@tallawk@mastodon.world avatar

@foone Are you by chance collecting a snowstorm of pcie errors in your system logs?

foone,
@foone@digipres.club avatar

@tallawk not that I've seen but I'm going to try and check

azonenberg,
@azonenberg@ioc.exchange avatar

@foone Other possibility: one of the higher lanes on the card has an electrical problem causing intermittent, undetected packet corruption.

By forcing it to run in x8 mode you avoid that.

foone,
@foone@digipres.club avatar

So it turns out I can't get it to do 16x in the other slot. My motherboard has 4 16x slots, but it does them in sequential order: if the first one is full, it gets 16x. if the first and second are full, they get 16x. if the second one is full and the first isn't, the second gets 8x.

yeah.

foone,
@foone@digipres.club avatar

So I swapped the card back to slot 1.
Interestingly, GPU-z says it's at 2.0 now instead of 3.0. Not sure why that is.

foone,
@foone@digipres.club avatar

wait no it doesn't. I can't read suddenly

foone,
@foone@digipres.club avatar

I'm gonna run my stress test with some performance logging on to see if it's overheating.
I did realize my card has a physical switch for "high fan" vs "quiet fan" and switched it to "high fan".
I can't imagine that'd be why it was crashing but maybe.

foone,
@foone@digipres.club avatar

also while looking around my BIOS, I realized I could clock my ram faster. It's running at 2133mhz and could go up to 3200mhz, supposedly.

I didn't test that out for obvious reasons.

foone,
@foone@digipres.club avatar

over an hour with the GPU back in the Crashy Slot and no crashes.

huh. Maybe it was temperature based?
My GPU isn't getting THAT hot, my fans aren't even maxing out.
GPU temp hit a max of 65C with a hot spot of 76C.
Those aren't out of range for a GPU under load, and they're not trending upward at all, it's stable.

foone,
@foone@digipres.club avatar

or it's still just a memory corruption problem and it's just VERY random and I need to test for longer

foone,
@foone@digipres.club avatar

it's now been nearly 5 hours with no crashes.
what the fuck?

(temps are about the same)

vxo,
@vxo@digipres.club avatar

@foone that certainly sounds like computers

StompyRobot,
@StompyRobot@mastodon.gamedev.place avatar

@foone you fixed it!

foone,
@foone@digipres.club avatar

@StompyRobot BUT HOW

karolherbst,
@karolherbst@chaos.social avatar

@foone your story sounds like my ampere gpu which also just crashes randomly (sometimes after hours, sometimes minutes), but that's a quadro a6000 I got from nvidia through work, other GPUs work just fine on the identical setup/slot.

Was on Linux though

foone,
@foone@digipres.club avatar

So it's now run for 23 hours... no system-crash.

minecraft did crash, but it's modded minecraft. it might have just done that on its own

foone,
@foone@digipres.club avatar

I also stuck the "bad" set of ram sticks into a spare computer and ran it on memtest.
16 hours, 9 full passes of memtest86+, zero errors.

foone,
@foone@digipres.club avatar

starting to think this is a motherboard problem

foone,
@foone@digipres.club avatar

so the remaining Questionable Hardware is:

  1. Motherboard
  2. CPU
  3. GPU
  4. Half the RAM

So that's, like, 1200$ if I wanted to replace it all. That's definitely not going to happen. So for now I think I'm just gonna have to wait to see if more shit breaks, but ordering more RAM is next on the list, since I'd like to be back up to full RAM and it'd be useful to know if adding RAM back in causes it to crash again

foone,
@foone@digipres.club avatar

actually, I can use the RAM in one of my other machines if it turns out to not be useful to fix my main PC. So I hit order on that

foone,
@foone@digipres.club avatar

New RAM installed. I had a fun moment where the system seemed to be completely dead... Turns out one of the modular cables slipped out while I was installing the RAM.
But no crashes so far!
I also updated the BIOS, to a version that has "dram stability improvements" listed on the changelog.

lewiscowles1986,
@lewiscowles1986@phpc.social avatar

@foone the more you write about this, the more it sounds exactly like what I went through with GPU firmware.

Every step you've done is identical, minus putting GPU into other machines and watching them crash / manifest the same bug.

The hanging during memtest. Everything. Although I did have more RAM. You even went for the same PSU I purchased. Corsair 850W

I guess we have different GPU manufacturers.

dougbarry,

@foone so you changed more than one thing at once?..

foone,
@foone@digipres.club avatar

@dougbarry no, but I summarized for mastodon rather than specify each change separately in multiple posts

Pibble,

@foone Yeah, like I was saying, a lot of the boards up until 10th gen were very not memory focused, so they didn't have great shielding or trace layout, you often even got really bad T-topology layouts or extremely long daisy chains that would fall over at anything above 2133 with 4 dimms.

Have you looked at the BIOS on the board? There may be a few updates that include "improved memory stability" and usually those are extremely helpful for running dimms made well after the board.

foone,
@foone@digipres.club avatar

@Pibble there is a BIOS update, it looks like. I'll have to try installing

SvenGeier,
@SvenGeier@mathstodon.xyz avatar

@foone
The only thing more insane than doing the same thing again and again expecting a different outcome is doing the same thing again and again and GETTING a different outcome... 🤷

cmdrmoto,
@cmdrmoto@hachyderm.io avatar
OtterMatic,
@OtterMatic@woof.group avatar

@foone could be re-seating the card made a difference as well.

foone,
@foone@digipres.club avatar

@OtterMatic it's possible, but I've pulled and re-inserted it like 5-6 times through this whole saga, and it was still crashing up until this last time

cw,
@cw@hachyderm.io avatar

@foone what are your core temps and clocks like in hwinfo? Heatsinks all snug? All internals unobstructed? Checked the board for any spicy caps? When you tested RAM, was it a matched pair? Anything overclocked or undervolted? Tried another PCIe slot for the GPU?

Worst case the motherboard is just dying, I had an X99 Asus board which was notorious for eating CPUs (which mine did, while simultaneously FUBARing itself). That's probably a niche edge case though 😅

foone,
@foone@digipres.club avatar

@cw I don't have the temps on the crashy setup, but I've not seen any temps outside of reasonable ranges. No cooling problems. RAM was matched pair, nothing should be overclocked or undervolted. I've not tried a new PCIe slot yet, that's definitely something I should try out. I think it'd cause my GPU to run at lower speed but at least I can confirm it works/doesn't work there

vxo,
@vxo@digipres.club avatar

@foone for a moment I thought about how I need to make sure I still have adapters in my collection to power 3.5" drives with the 0.1" center 4 pin JST header off of a full size power connector. aaahhhh!

flameeyes,
@flameeyes@mastodon.social avatar

@foone no two modular systems are compatible.

At least it sounds like Corsair didn't let you plug the old cables for which they swapped voltages.

jernej__s,

@flameeyes @foone Even if you keep the same model (this happened to me years ago with 600W Enermax – replacement had 5 pin connectors on PSU side instead of 6).

jernej__s,

@foone I had the same thing happen years ago when a 600W Enermax died – got it replaced under warranty, but the new one had different connectors on the PSU side (despite being the same model), so I had to redo all my cables…

apgarcia,
@apgarcia@fosstodon.org avatar

@foone one really good stress test, besides stress-ng, is to compile gcc...

jpm,
@jpm@aus.social avatar

@foone this is sounding like a memory clocking problem. XMP speeds are only (usually) for a single stick per channel, and down-clocking to the stock speed is (usually) required for multiple sticks per channel

jernej__s,

@foone I recently dealt with an Intel i5 with dying GPU. As long as the Intel driver was in use, the screen was full of artifacts, and moving windows around created random polygons all over the screen.

manawyrm,
@manawyrm@chaos.social avatar

@foone that's a fail :P

manawyrm,
@manawyrm@chaos.social avatar

@foone ... actually: with this behaviour: Clean the contacts on the RAM with IPA and the ones on the CPU as well.

manawyrm,
@manawyrm@chaos.social avatar

@foone it's a bit hard to tell from the DMI info, but is the RAM installed the right way round (each vendor together per memory channel)?

I would've expected Slots 0,1 to be A, Slots 2,3 to be B, not interleaved like this.

foone,
@foone@digipres.club avatar

@manawyrm No, I think it was interleaved. I thought I had corrected this when I upgraded it. Thanks for catching that!

lewiscowles1986,
@lewiscowles1986@phpc.social avatar

@foone
I had errors like this on an AMD Radeon card after a year, and it turned out to be shitty software. It cost me a heap (new RAM, power) to find out that a stupid driver update was the cause and by editing some files I could have the kernel overcome that.

If the RAM tests say it's fine, it might be crap software.

foone,
@foone@digipres.club avatar

@lewiscowles1986 I've been getting crashes in memtest and the BIOS, so definitely not a bad driver for me!
