Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tXbnC-00AkJd-K0 for pgsql-general@arkaria.postgresql.org; Tue, 14 Jan 2025 07:56:55 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tXbnA-00G1hJ-Al for pgsql-general@arkaria.postgresql.org; Tue, 14 Jan 2025 07:56:52 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tXbn9-00G1hB-US for pgsql-general@lists.postgresql.org; Tue, 14 Jan 2025 07:56:52 +0000 Received: from zproxy4.mail3d.it ([212.78.3.137]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1tXbn7-000JpW-2i for pgsql-general@lists.postgresql.org; Tue, 14 Jan 2025 07:56:51 +0000 Received: from zproxy4.mail3d.it (localhost.localdomain [127.0.0.1]) by zproxy4.mail3d.it (Postfix) with ESMTPS id 6EF5560897; Tue, 14 Jan 2025 08:56:48 +0100 (CET) Received: from localhost (localhost.localdomain [127.0.0.1]) by zproxy4.mail3d.it (Postfix) with ESMTP id 5EBA86088B; Tue, 14 Jan 2025 08:56:48 +0100 (CET) DKIM-Filter: OpenDKIM Filter v2.10.3 zproxy4.mail3d.it 5EBA86088B DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cleistech.it; s=1D99BE42-E8ED-11ED-9278-A8004466375A; t=1736841408; bh=IqwMhCPPEf0TOfNb+/eqNX53D7+POdTc1qW75uM1PbQ=; h=Message-ID:Date:MIME-Version:To:From; b=cTaaxIQmBCHeLLDtLH3jbwefbkl+2XFLvzsHbF5VmwN6YkZf+PY6qAk+6Htv1omvD T1ITEY4xIaZqZa1xNxwrLo1ye+ZV8ObOSvuZ0oCD01rKK/eAXml0DNs4qqT6FstsKT O1WJfdLfw89gdOwowzUm69Gh5G+k1ZG+bOWf/3fnsB53kaw6pZlcJv048pIQEUB+XI eLUfJ1qmGMjLFm88ePoMSf6ft3rMdXw1chEqXXzVPGzkA8LE9PAkW5x5043hiwEk4U WUKcFrbb4BmMbMl2HTh48YPnEEl+5bRCs8CQO++XHaj4W1427nboOVq/8LM+LYU8ym 8WXagVU3bKhEA== Received: from zproxy4.mail3d.it ([127.0.0.1]) by localhost (zproxy4.mail3d.it [127.0.0.1]) (amavis, port 10026) with ESMTP id LnB8OH-uSYTD; Tue, 14 Jan 2025 08:56:48 +0100 (CET) Received: from [192.168.0.20] (host-87-1-250-25.retail.telecomitalia.it [87.1.250.25]) by zproxy4.mail3d.it (Postfix) with ESMTPSA id 1B2BD60824; Tue, 14 Jan 2025 08:56:48 +0100 (CET) Message-ID: <54b91cf3-1d0c-4134-89c8-8e76ed46415c@cleistech.it> Date: Tue, 14 Jan 2025 08:56:47 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Intermittent errors when fetching cursor rows on PostgreSQL 16 To: Adrian Klaver , pgsql-general@lists.postgresql.org Cc: Massimo Catti , Livio Pizzolo References: <446423eb-4a4e-4135-bbb8-4d0e5c7aac3b@cleistech.it> <25d5fb16-9bb2-4ad1-969c-eaca03ccbaaf@aklaver.com> <6ce80aaa-970b-4432-938a-39a07f811599@cleistech.it> <9f60eb26-7d34-4228-bd78-74c6deb90e54@aklaver.com> <282c2a48-bb12-4486-b03d-563523cac81b@cleistech.it> <2645a89e-d661-4f2b-92b3-01154a78d535@aklaver.com> <54689a6a-839c-44c4-90b5-b9692e8e7cb0@aklaver.com> <4efe42a2-789c-4957-a564-25199869f6ec@cleistech.it> <382a1eec-2069-4010-bbdb-37260a1a53a7@cleistech.it> <8d65f84d-ddb3-4d0f-be05-44f443500e41@aklaver.com> Content-Language: it From: Enrico Schenone In-Reply-To: <8d65f84d-ddb3-4d0f-be05-44f443500e41@aklaver.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Il 13/01/25 18:26, Adrian Klaver ha scritto: > On 1/13/25 08:59, Enrico Schenone wrote: >> >> Il 13/01/25 17:19, Adrian Klaver ha scritto: >>> On 1/13/25 00:45, Enrico Schenone wrote: >>>> Hello, Adrian. >>>> As I said days ago, I have arranged a kind of stress test in=20 >>>> production environment. >>>> I wrote a program that loads a temporary table, loads 2049 rows=20 >>>> into them from a baseline_table and finally declare two nested=20 >>>> cursors. >>>> The first cursor is on the temp table as parent while the second is=20 >>>> on a lookup table as child. >>>> >>>> The program logic is the transposition of one fragment of several=20 >>>> production programs that was failing on cursors, and has to be=20 >>>> intended as a POC only. >>>> >>> >>> >>>> And Well, I'm quite confused: no error at all has been detected,=20 >>>> not only on the test programs but in the whole production system.=20 >>>> The error was completely disappeared. >>>> >>>> Then I have stopped the four tasks of the stress test leaving all=20 >>>> other services running for a week, and again no error at all. >>>> >>>> No setup was changed nor servers was rebooted, nor infrastructure=20 >>>> has been upgraded during the test period. >>> >>> You are absolutely sure about the above? >> I can say Yes. All test operations has been logged and verified=20 >> against the Postgresql log. >> The only component not under my control is the Provider's=20 >> Infrastructure, but=C2=A0 the infrastructure admin ensured me that no=20 >> operation at all has been made. I beleave him because it is a=20 >> reliable tecnician end a well known person. > > In your OP you stated: > > "Production environments can be: > > =C2=A0* Distinct application server and DB server on distinct subnets (= no > =C2=A0=C2=A0=C2=A0 dropped packet detected on firewall, no memory/disk/= network failure > =C2=A0=C2=A0=C2=A0 detected by "nmon" tool) > =C2=A0 * Distinct application server and DB server on same subnet (no=20 > firewall) > =C2=A0 * Same server for PostgreSQL and applications > " > > In all those cases are the various servers all running completely=20 > within the providers infrastructure? > No, the second production environment is On Premises at the customer=20 Datacenter under the same vmware hypervisor. I'll make more investigations on second and third environments by=20 increasing the verbosity of both DB and Application logs. > >>> Errors that 'fix' themselves are the most frustrating kind, as you=20 >>> know in the back of your mind they will likely pop up again. >> True, knocking again to my door ... I still can't beleave. > > Going forward one of three things are likely to happen: > > 1) The error never shows again. > > 2) It does show up again but in a manner that allows it to be traced. > > 3) The worst case, it plays hide and seek as previously. > > In our standard applications it normally plays hide because normally our=20 programs doesn't apply TRY/CATCH paradigm on FETCH. In 4Js Genero BDL a fetch can return only "success" or "not found"=20 status, so normally we don't catch for any exception unless we want to=20 detect locks. Furthermore for standard cursors we don't use the OPEN/FETCH/CLOSE=20 statements but simply a FOREACH ... END FOREACH statement that=20 masquerades in a more easy way the OPEN/FETCH/CLOSE process. We detected the problem looking at Postgresql log and investigating at=20 client side for any program exception or abort in FOREACH/END FOREACH=20 block by increasing log verbosity and applying the OPEN/FETCH/CLOSE and=20 TRY/CATCH paradigm on suspect cursors. > >> Thanks a lot for your interest in sharing my strange experience. >> Best regards. >> Enrico >> >> *Enrico Schenone* >> Software Architect > *Enrico Schenone* Software Architect