Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1t63Dj-00AKQA-5r for pgsql-general@arkaria.postgresql.org; Wed, 30 Oct 2024 07:34:23 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1t63Dh-006LJF-CV for pgsql-general@arkaria.postgresql.org; Wed, 30 Oct 2024 07:34:21 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1t63Dg-006LJ6-Oc for pgsql-general@lists.postgresql.org; Wed, 30 Oct 2024 07:34:21 +0000 Received: from mail-ed1-f50.google.com ([209.85.208.50]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1t63Dd-003aQd-FC for pgsql-general@lists.postgresql.org; Wed, 30 Oct 2024 07:34:19 +0000 Received: by mail-ed1-f50.google.com with SMTP id 4fb4d7f45d1cf-5cacb76e924so8143058a12.0 for ; Wed, 30 Oct 2024 00:34:17 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1730273655; x=1730878455; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=YvfPVkX7x5Ynb/qMWHj8joclWhACVJhq6WQX2Anx0TA=; b=kd0ixT9NoFk1C02ZIyH/0muKRyr3DBM/oUCFFEhY85uYxp1/ef8LPqMgMCWTUJv8ZA siwNU5kmSY22cuQFeY8Z1Ku8YNOFegKrDBNseq5RI1gI84KSgs8F4vYJIEF8x9MD6WUG ZKK46Sy4wxbUol2sefqWXQ6wn24bPHx+2MOwpOH3TJhAD5iRsaXnE1QnJpbByxRpus+d jGmpJgXw8trSKI2qWzijchcipzBiCmRxmS01Y6IAV248xXijDey1dFyuMxIAydYdGH5z JMXGgB9qtMcHr8c2fD5R5adqhrcdRiJ5AxyjUq9bgjhX/24DIjZb673OMY9bFivMM2xD vMWw== X-Gm-Message-State: AOJu0YxcF1E3CXtaP/426dKZvTOXCGFmSPoMONRBzltAdC3c+cZG8O0i E29MH4G+EPQmgvixas9utejdMTFqeLMbTAhw3UmF0j7O2vJBmE3znOGQXKDKPr0F2+dLuxxgpSK QY2sK3yyZrz2ubiF2rJpp81yC0JYMYh7s X-Google-Smtp-Source: AGHT+IFZ59YQWXknbPLGOv8UI2bi2Ht0uhfVKi7i2k4294+wCUH9fd9dEZhw5Gkqw3Ps26dx2Hy8XtPj2+Lr5VTatrM= X-Received: by 2002:a17:906:c10d:b0:a9a:37fe:e7d0 with SMTP id a640c23a62f3a-a9de6199186mr1408059266b.64.1730273654750; Wed, 30 Oct 2024 00:34:14 -0700 (PDT) MIME-Version: 1.0 From: Ian J Cottee Date: Wed, 30 Oct 2024 07:34:03 +0000 Message-ID: Subject: Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS To: pgsql-general@lists.postgresql.org Content-Type: multipart/alternative; boundary="0000000000008e47d70625acba49" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000008e47d70625acba49 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hello everyone, I=E2=80=99ve been using postgres for over 25 years now and = never had any major issues which were not caused by my own stupidity. In the last 24 hours however I=E2=80=99ve had a number of issues on one client's server= which I assume are a bug in postgres or a possible hardware issue (they are running on a Linode) but I need some clarification and would welcome advice on how to proceed. I will also forward this mail to Linode support to ask them to check for any memory issues they can detect. This particular Postgres is running on Ubuntu LTS 22.04 and has the following version information: ``` PostgreSQL 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit ``` The quick summary is that over a 24 hour period I had the following errors appear in the postgres logs at different times causing the system processes to restart: - stuck spinlock detected - free(): corrupted unsorted chunks - double free or corruption (!prev) - corrupted size vs. prev_size - corrupted double-linked list - *** stack smashing detected ***: terminated - Segmentation fault Here=E2=80=99s the more detailed breakdown. On Monday evening this week, the following event occurred on the server ``` 2024-10-28 18:12:47.145 GMT [575437] xxx@xxx PANIC: stuck spinlock detected at LWLockWaitListLock, ./build/../src/backend/storage/lmgr/lwlock.c:913 ``` Followed by: ``` 2024-10-28 18:12:47.249 GMT [1880289] LOG: terminating any other active server processes 2024-10-28 18:12:47.284 GMT [1880289] LOG: all server processes terminated; reinitializing ``` And eventually ``` 2024-10-28 18:12:48.474 GMT [575566] xxx@xxx FATAL: the database system is in recovery mode 2024-10-28 18:12:48.476 GMT [575550] LOG: database system was not properly shut down; automatic recovery in progress 2024-10-28 18:12:48.487 GMT [575550] LOG: redo starts at DD/405E83A8 2024-10-28 18:12:48.487 GMT [575550] LOG: invalid record length at DD/405EF818: wanted 24, got 0 2024-10-28 18:12:48.487 GMT [575550] LOG: redo done at DD/405EF7E0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s 2024-10-28 18:12:48.515 GMT [1880289] LOG: database system is ready to accept connections ``` This wasn=E2=80=99t noticed by myself or any users as they tend to all be f= inished by 17:30. However later, ``` 2024-10-28 20:27:15.258 GMT [611459] xxx@xxx LOG: unexpected EOF on client connection with an open transaction 2024-10-28 21:01:05.934 GMT [620373] xxx@xxxx LOG: unexpected EOF on client connection with an open transaction free(): corrupted unsorted chunks 2024-10-28 21:15:02.203 GMT [1880289] LOG: server process (PID 623803) was terminated by signal 6: Aborted 2024-10-28 21:15:02.204 GMT [1880289] LOG: terminating any other active server processes ``` This time it could not recover and I didn=E2=80=99t notice until early the = next morning whilst doing some routine checks. ``` 2024-10-28 21:15:03.643 GMT [623807] LOG: database system was not properly shut down; automatic recovery in progress 2024-10-28 21:15:03.655 GMT [623807] LOG: redo starts at DD/47366740 2024-10-28 21:15:03.663 GMT [623807] LOG: invalid record length at DD/475452A0: wanted 24, got 0 2024-10-28 21:15:03.663 GMT [623807] LOG: redo done at DD/47545268 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s 2024-10-28 21:15:03.682 GMT [623829] xxx@xxx FATAL: the database system is in recovery mode double free or corruption (!prev) 2024-10-28 21:15:03.832 GMT [1880289] LOG: startup process (PID 623807) was terminated by signal 6: Aborted 2024-10-28 21:15:03.832 GMT [1880289] LOG: aborting startup due to startup process failure 2024-10-28 21:15:03.835 GMT [1880289] LOG: database system is shut down ``` When I noticed in the morning it was able to start without an issue. From googling it appeared to be a memory issue and I wondered if the problem was sorted now the server process had stopped completely and restarted. The problem was not sorted although all the above errors were recovered from automatically without any input from myself or the client=E2=80=99s noticin= g. ``` corrupted size vs. prev_size 2024-10-29 09:55:24.417 GMT [894747] LOG: background worker "parallel worker" (PID 947642) was terminated by signal 6: Aborted ``` ``` corrupted double-linked list 2024-10-29 13:14:28.322 GMT [894747] LOG: background worker "parallel worker" (PID 1019071) was terminated by signal 6: Aborted ``` ``` *** stack smashing detected ***: terminated 2024-10-28 15:24:30.331 GMT [1880289] LOG: background worker "parallel worker" (PID 528630) was terminated by signal 6: A\ borted ``` ``` 2024-10-28 15:40:26.617 GMT [1880289] LOG: background worker "parallel worker" (PID 533515) was terminated by signal 11: \ Segmentation fault 2024-10-28 15:40:26.617 GMT [1880289] DETAIL: Failed process was running: SELECT "formula_line".id FROM "formul\ ``` I rebooted the server at 18:30 and have had no further issues so far, although work has yet to start. When rebooting the server, postgres seemed to take a long time to terminate. Now there is one odd thing that has been happening recently. Due to a bug in my code I've had more deadlocks than would normally be expected. ``` 2024-10-29 19:26:51.680 GMT [71152] xxx@xxx ERROR: could not serialize access due to concurrent update ``` I believe I have fixed that bug in my code this morning and the errors above did not seem to coincide with the errors appearing but I'm raising it in case related. Comments and insights are warmly welcomed. Best regards Ian Cottee --0000000000008e47d70625acba49 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Hello everyone= , I=E2=80=99ve been using postgres for over 25 years now and never had any major issues which were not caused by my own stupidit= y. In the last 24 hours however I=E2=80=99ve had a number of issues on one cli= ent's server which I assume are a bug in postgres or a possible hardware issue (they are running on a L= inode) but I need some clarification and would welcome advice on how to proceed. I will also = forward this mail to Linode support to ask them to check for any memory iss= ues they can detect.=C2=A0

=C2=A0

This particular Postgres is running on Ubuntu LTS 22.04 and has the following version information:

=C2=A0

```

PostgreSQL 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit

```


The quick summary is = that over a 24 hour period I had the following errors appear in the postgres logs at different times causing the system processes to restart:

  • stuck spinlock detected=C2=A0
  • free(): corrupted unsorted chunks=C2= =A0
  • double free or corruption (!prev)
  • corrupted size vs. pr= ev_size=C2=A0
  • corrupted double-linked list=C2=A0
  • *** stack = smashing detected ***: terminated=C2=A0
  • Segmentation fault=C2=A0

=C2=A0

Here=E2=80=99s the more detailed breakdown.=C2=A0

=C2=A0

On Monday evening this week, the following event occurred on t= he server

=C2=A0

```

2024-10-28 18:12:47.145 GMT [575437] xxx@xxx PANIC: stuck spinlock detected at LWLockWaitListLock, ./build/../src/backend/storage/lmgr/lwlock.c:913

```

=C2=A0

Followed by:

=C2=A0

```

2024-10-28 18:12:47.249 GMT [1880289] LOG: terminating any other active server processes

2024-10-28 18:12:47.284 GMT [1880289] LOG: all server processes terminated; reinitializing

```

And eventually


```

2024-10-28 18:12:48.474 GMT [575566] xxx@xxx FATAL: the database system is in recovery mode

2024-10-28 18:12:48.476 GMT [575550] LOG: database system was not properly shut down; automatic recovery in progress

2024-10-28 18:12:48.487 GMT [575550] LOG: redo starts at DD/405E83A8

2024-10-28 18:12:48.487 GMT [575550] LOG: invalid record length at DD/405EF818: wanted 24, got 0

2024-10-28 18:12:48.487 GMT [575550] LOG: redo done at DD/405EF7E0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 = s

2024-10-28 18:12:48.515 GMT [1880289] LOG: database system is ready to accept connections

```

This wasn=E2=80=99t noticed by myself or any users as they ten= d to all be finished by 17:30.=C2=A0 However later,

=C2=A0

```

2024-10-28 20:27:15.258 GMT [611459] xxx@xxx LOG: unexpected EOF on client connection with an open transaction

2024-10-28 21:01:05.934 GMT [620373] xxx@xxxx LOG: unexpected EOF on client connection with an open transaction

free(): corrupted unsorted chunks

2024-10-28 21:15:02.203 GMT [1880289] LOG: server process (PID 623803) was terminated by signal 6: Aborted

2024-10-28 21:15:02.204 GMT [1880289] LOG: terminating any other active server processes

```

=C2=A0

This time it could not recover and I didn=E2=80=99t notice unt= il early the next morning whilst doing some routine checks.

=C2=A0

```

2024-10-28 21:15:03.643 GMT [623807] LOG: database system was not properly shut down; automatic recovery in progress

2024-10-28 21:15:03.655 GMT [623807] LOG: redo starts at DD/47366740

2024-10-28 21:15:03.663 GMT [623807] LOG: invalid record length at DD/475452A0: wanted 24, got 0

2024-10-28 21:15:03.663 GMT [623807] LOG: redo done at DD/47545268 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 = s

2024-10-28 21:15:03.682 GMT [623829] xxx@xxx FATAL: the database system is in recovery mode

double free or corruption (!prev)

2024-10-28 21:15:03.832 GMT [1880289] LOG: startup process (PID 623807) was terminated by signal 6: Aborted

2024-10-28 21:15:03.832 GMT [1880289] LOG: aborting startup due to startup process failure

2024-10-28 21:15:03.835 GMT [1880289] LOG: database system is shut down

```

=C2=A0

When I noticed in the morning it was able to start without an issue. From googling it appeared to be a memory issue and I wondered if = the problem was sorted now the server process had stopped completely and restar= ted. The problem was not sorted although all the above errors were recovered fro= m automatically without any input from myself or the client=E2=80=99s noticin= g.

=C2=A0

```

corrupted size vs. prev_size

2024-10-29 09:55:24.417 GMT [894747] LOG: background worker "parallel worker" (PID 947642) was terminated by signal 6: Aborte= d

```

=C2=A0

```

corrupted double-linked list

2024-10-29 13:14:28.322 GMT [894747] LOG: background worker "parallel worker" (PID 1019071) was terminated by signal 6: Abort= ed

```

=C2=A0

```

*** stack smashing detected ***: terminated

2024-10-28 15:24:30.331 GMT [1880289] LOG: background worker "parallel worker" (PID 528630) was terminated by signal 6: A\ bor= ted

```

=C2=A0

```

2024-10-28 15:40:26.617 GMT [1880289] LOG: background worker "parallel worker" (PID 533515) was terminated by signal 11: \

Segmentation fault

2024-10-28 15:40:26.617 GMT [1880289] DETAIL: Failed process was running: SELECT "formula_line".id FROM "formul\

```

=C2=A0

I rebooted the server at 18:30 and have had no further issues so far, although work has yet to start. When rebooting the server, postgres seemed to take a long time to terminate.


=

Now there is one odd thing that has been happening recently. D= ue to a bug in my code I've had more deadlocks than would normally be e= xpected.=C2=A0


```

202= 4-10-29 19:26:51.680 GMT [71152] xxx@xxx ERROR: =C2=A0could not serialize a= ccess due to concurrent update

```


=

I believe I have fixed that bug in my code this morning and th= e errors above did not seem to coincide with the errors appearing but I'= ;m raising it in case related.=C2=A0


C= omments and insights are warmly welcomed.=C2=A0


Best regards


Ian Cottee

<= p class=3D"MsoNormal" style=3D"margin:0cm;font-size:12pt;font-family:Aptos,= sans-serif">


=C2=A0

=C2=A0

=C2=A0

=C2=A0

=C2=A0

=C2=A0

--0000000000008e47d70625acba49--