Re: Need help debugging SIGBUS crashes

public inbox for [email protected]  

From: Peter 'PMc' Much <[email protected]>
To: Jakub Wartak <[email protected]>
Cc: [email protected]
Subject: Re: Need help debugging SIGBUS crashes
Date: Tue, 17 Mar 2026 16:29:14 +0100
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAKZiRmyQz+jZWLC4GbyuCa6cjurS0nECgFbYVyjgxB3Hgo+VnQ@mail.gmail.com>
References: <[email protected]>
	<CAKZiRmyQz+jZWLC4GbyuCa6cjurS0nECgFbYVyjgxB3Hgo+VnQ@mail.gmail.com>

On Tue, Mar 17, 2026 at 02:50:25PM +0100, Jakub Wartak wrote:
! 
! Not an answer from a regular FreeBSD guy, but more questions:
! 
! So have you removed those ZFS patches or not? (You said You reverted only
! the NUMA ones.)

They are completely removed now. 

! Maybe those ZFS patches corrupt some memory and jemalloc just
! hits those regions? I would revert the kernel to stock.

Yes, I would, too, but I can't. There are patches for Kerberos
(FreeBSD 14 still uses that very old Heimdal implementation, which
is why I am kind of stuck with PG 15, and upgrading that one will
be a bit of work), and there are patches to make IPv6 fragmentation
work with the firewalls. In short, removing all of the patches would
make the SSO and networking fall apart entirely and leave the site
nonfunctional.

OTOH this crash seems to prefer happening in production. Last night
when it happened, the machine was busy rebuilding the OS etc. for
other nodes to upgrade to 14.4, and then I got bored and additionally
ran an LLM for entertainment. So the server had some 25 GB paged
out when the nightly housekeeping started to push daily log data
into the databases - which then led to the crash.

That means,
 A) I have no good idea how to properly reproduce such conditions
    in a test scenario, and
 B) it is not impossible that there is a bug (somewhere) that just
    doesn't usually hit orderly people who run their databases
    in rather overprovisioned conditions.

! Are You using hugepages? The jemalloc stack also contains "_large_", so can
! we assume jemalloc is using hugepages?

As far as I remember I once tried to, but hugepages with Postgres do
not work on FreeBSD. The docs also say:
   "this setting is supported only on Linux and Windows."

! I don't know if that might help, but last time I hunted down SIGBUS [0] it was
! due to our incorrect patches (causing NUMA hugepages imbalances across nodes;
! our patch has some pause there, but what I did to track it down was to stack
! trace to Linux's kernel do_sigbus() routine via eBPF). Possibly You could
! hijack/detect some traps and/or hijack some routines using DTrace that's in
! FreeBSD and that would get some hints?

Thank You, currently everything helps. :)
DTrace is super cool, but one needs to understand the code first
before getting useful insight from it.
So any approach will imply a bunch of work, and I am currently looking
for the shortest path to an unknown target. ;)

PMc
