ironicbadger,
@ironicbadger@techhub.social avatar

My NixOS experiment was going so well… genuinely I am a bit miffed by this!

Alas, I am having repeated hard lock ups with large zfs replications.

Things I have tried:

  • memtest for 5 hours. ✅
  • s-tui CPU stress test for 3 hours. ✅
  • booted in a live Ubuntu iso and performed the same zfs replication for 2hrs. No lock ups.

Because the machine just hard locks I get no stacktrace or kernel panic to help me here. No logs I can find or anything.

So it’s got to be one of the following.

  • A hardware issue. I’m not ruling this out because even though the majority of the system was untouched there is a new HBA (LSI 16i) in there. And a couple of new SSDs.
  • A software issue. Is there something in the NixOS implementation of zfs that’s causing this? Seems so unlikely but I think I owe it to try a different OS today for testing.

ZS,
@ZS@techhub.social avatar

@ironicbadger Have you heard of unRAID? Works great. 😜

vt52,
@vt52@ioc.exchange avatar

@ironicbadger I assume you've checked journalctl -b -1 after reboot?

Only thing that jumps out at me with your config is that you should be choosing kernel version based on your zfs dependency:

boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;  

However, I wouldn't expect that would be the source of your issue (unless you're picking up something older than you would otherwise).
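
For context, a minimal sketch of where that line might sit in a NixOS configuration module. It assumes ZFS is enabled through `boot.supportedFilesystems`; the `networking.hostId` value is a placeholder, not taken from the linked config.

```nix
{ config, ... }:
{
  # Assumption: ZFS support is enabled this way on the host.
  boot.supportedFilesystems = [ "zfs" ];

  # NixOS requires a host ID for ZFS; "deadbeef" is a placeholder value.
  networking.hostId = "deadbeef";

  # Pick the newest kernel that the selected ZFS package declares support for,
  # rather than whatever kernel the channel defaults to.
  boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;
}
```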

ironicbadger,
@ironicbadger@techhub.social avatar

@vt52 journalctl showed absolutely nothing. It must be an absolutely instant lockup somehow.

I didn’t try the kernel thing. Good point.

But I’ve reverted to Proxmox now. This is the “Morpheus” old Intel server that’s been running for 3+ years. Never had any lock ups like this before. I’m stopping short of blaming Nix fully quite yet, but it’s looking likely unfortunately.

jinxd,
@jinxd@fosstodon.org avatar

@ironicbadger hardware, software. Not sure a 3rd option exists ;)

ironicbadger,
@ironicbadger@techhub.social avatar

@jinxd sure it does. users!!

HankB,

@ironicbadger @jinxd When I worked on cars for a living (VW/Porsche/Audi in fact) it was "loose nut behind the steering wheel." 🤣

Did you do any disk stress testing aside from heavy ZFS loads? I'd look at that to differentiate between S/W and H/W. I had some issues with ZFS 2.1.11 but they did not result in a hard lockup. I've been on 2.3.3 on Debian for about a month and just went through hell fixing an unbootable system. If you use a bpool/rpool split DO NOT SNAPSHOT BPOOL.

ironicbadger,
@ironicbadger@techhub.social avatar

@HankB @jinxd i leave for SCaLE in 2 days and would prefer to leave my wife and 3 year old with a stable server! so in the interest of time, i’m installing proxmox right now and will test the replication all day today.
fingers, toes, eyes, legs all crossed it's not a hardware issue. at least software is easily fixed.

jinxd,
@jinxd@fosstodon.org avatar

@ironicbadger @HankB Is that the new server?

ironicbadger,
@ironicbadger@techhub.social avatar

@jinxd @HankB it’s my old Intel “Morpheus” media server. The only hardware change from the system running reliably the last 3+ years is a new HBA.
So it’s either that or NixOS. But so far the only situation that’s been able to repro is a NixOS install so that’s where I’m focused RN

jinxd,
@jinxd@fosstodon.org avatar

@ironicbadger @HankB Surely sounds like it's NixOS somehow. Especially since there was no issue when you tried the Ubuntu live ISO.

ironicbadger,
@ironicbadger@techhub.social avatar

@jinxd I'd really worked up a lot of momentum to put Nix as the primary and it does this to me? Hahah. Oh well.

sageofredondo,
@sageofredondo@mastodon.social avatar

@ironicbadger not necessarily. I have had hard restarts so severe that the kernel just restarted without ever reaching the code that prints the stack. That was when I was testing Core Scheduling while it was still out of tree.

If this hardware works on other distributions then it signals to me it is a software issue.

You need to post more info. Kernel and openzfs.

ironicbadger,
@ironicbadger@techhub.social avatar

@sageofredondo i'm going to install proxmox on the same hardware this morning, and test that all day.

i am running nixos on zfs mirrored root. i wonder if that could be related somehow?

still at the WTF stage! i do hope to be able to provide more info for upstream if it is software.

sageofredondo,
@sageofredondo@mastodon.social avatar

@ironicbadger what are the kernel and OpenZFS version numbers?

ironicbadger,
@ironicbadger@techhub.social avatar

@sageofredondo

nixos 23.11.5195.2be119add7b3

kernel 6.1.81

zfs 2.2.3-1
zfs-kmod 2.2.3-1

config files: https://github.com/ironicbadger/nix-config/tree/main/testing/mediaserver
