Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uuBkI-004GAR-Ga for pgsql-hackers@arkaria.postgresql.org; Thu, 04 Sep 2025 15:19:31 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uuBkH-00H8yz-Ki for pgsql-hackers@arkaria.postgresql.org; Thu, 04 Sep 2025 15:19:30 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uuBjS-00H3xb-NM for pgsql-hackers@lists.postgresql.org; Thu, 04 Sep 2025 15:18:39 +0000 Received: from mail.postgrespro.ru ([93.174.132.70]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1uuBjP-000WhD-0N for pgsql-hackers@lists.postgresql.org; Thu, 04 Sep 2025 15:18:37 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=postgrespro.ru; s=mx2023; t=1756999110; bh=fdhxlJPizP3YuTIEkvbUqyEDK9KVOuDzXVMGpy/K0sU=; h=Message-ID:Date:User-Agent:To:From:Subject:From; b=hYGzWSCDahCdzgo5xBNo3pIVk5qBf9GvttKvDLx0GNhyRMxtMOuPPPXxD+8vBec9b 7Wto/Bi8kMTk1YEemfWLN93NbbNwWePtCWyErprprSegxzeDfPvpn2XYDw3dRXQ3NT APhxbjQfF+MeBYL6RtE5RJp0Ezpgknfa/v1hPcUvLLoFGCMZoqmPrzasAvR1ane461 v9Fh2GFF1BeTdhUAGq2eQQJzQXXIzMvuZ6jmss/60qbJEraXGQq6PHHDK3vfxnRHF0 s7WAH7x4dyXlzKgTU+SBAjMLRmkgNDFScT7GwIRZZevV8DmCXo24O4+a/e3i2Obeb+ qYDe12lnbp2/g== Received: from [192.168.0.102] (unknown [176.99.84.183]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (Client did not present a certificate) (Authenticated sender: m.melnikov@postgrespro.ru) by mail.postgrespro.ru (Postfix/465) with ESMTPSA id B99F460676 for ; Thu, 4 Sep 2025 18:18:30 +0300 (MSK) Content-Type: multipart/mixed; boundary="------------oBwWm0M1Fb0cgbE0J0fbXR0o" 
Message-ID:
Date: Thu, 4 Sep 2025 18:18:30 +0300
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Language: en-US
To: pgsql-hackers@lists.postgresql.org
From: "Maksim.Melnikov"
Subject: Incorrect checksum in control file with pg_rewind test
X-KSMG-AntiPhishing: NotDetected, bases: 2025/09/04 14:47:00
X-KSMG-AntiSpam-Interceptor-Info: not scanned
X-KSMG-AntiSpam-Status: not scanned, disabled by settings
X-KSMG-AntiVirus: Kaspersky Secure Mail Gateway, version 2.1.0.7854, bases: 2025/09/04 12:10:00 #27762282
X-KSMG-AntiVirus-Status: NotDetected, skipped
X-KSMG-LinksScanning: not scanned, disabled by settings
X-KSMG-Message-Action: skipped
X-KSMG-Rule-ID: 1
List-Id:
List-Help:
List-Subscribe:
List-Post:
List-Owner:
List-Archive:
Archived-At:
Precedence: bulk

This is a multi-part message in MIME format.
--------------oBwWm0M1Fb0cgbE0J0fbXR0o
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Hi, hackers!

I got a test failure in the pg_rewind tests, and it seems we have a read/write race on the pg_control file. The test error is "incorrect checksum in control file". The build was compiled with the -DEXEC_BACKEND flag.

# +++ tap check in src/bin/pg_rewind +++
Bailout called.  Further testing stopped:  pg_ctl start failed
t/001_basic.pl ...............
Dubious, test returned 255 (wstat 65280, 0xff00)
All 20 subtests passed

2025-05-07 15:00:39.353 MSK [2002308] LOG:  starting backup recovery with redo LSN 0/2000028, checkpoint LSN 0/2000070, on timeline ID 1
2025-05-07 15:00:39.354 MSK [2002307] FATAL:  incorrect checksum in control file
2025-05-07 15:00:39.354 MSK [2002308] LOG:  redo starts at 0/2000028
2025-05-07 15:00:39.354 MSK [2002308] LOG:  completed backup recovery with redo LSN 0/2000028 and end LSN 0/2000138
2025-05-07 15:00:39.354 MSK [2002301] LOG:  background writer process (PID 2002307) exited with exit code 1
2025-05-07 15:00:39.354 MSK [2002301] LOG:  terminating any other active server processes
2025-05-07 15:00:39.355 MSK [2002301] LOG:  shutting down because restart_after_crash is off
2025-05-07 15:00:39.356 MSK [2002301] LOG:  database system is shut down
# No postmaster PID for node "primary_remote"
[15:00:39.438](0.238s) Bail out!  pg_ctl start failed

The failure occurred while restarting the primary node to check that the rewind went correctly. The error is very rare and difficult to reproduce.

It seems we have a race between the process that replays WAL at startup and updates the control file, and other sub-processes that were started with exec and read the control file. As a result, those sub-processes can read a partially updated file with an incorrect CRC. The reason is that LocalProcessControlFile doesn't acquire ControlFileLock, and it can't do so.

I found a thread where a similar issue was discussed for frontend programs:
https://www.postgresql.org/message-id/flat/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de

The decision there was to retry the control file read in case of CRC failures. Details can be found in commit 5725e4ebe7a936f724f21e7ee1e84e54a70bfd83. My suggestion is to use the same approach here.

The patch is attached.
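For readers who do not want to decode the attachment, the retry rule can be sketched in standalone C. This is only an illustration under stated assumptions, not the patch itself: DemoControlData, DemoSource, demo_checksum, demo_read_once and demo_read_with_retry are hypothetical names, the checksum is a toy hash rather than CRC-32C, and the short sleep between attempts (the patch calls pg_usleep(10000)) is left as a comment. The stop conditions follow the patch: accept a matching checksum, give up when the same bad checksum is seen twice in a row, and give up after 10 retries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_READ_RETRIES 10		/* same retry limit as the attached patch */

/* Hypothetical stand-in for the control file contents. */
typedef struct DemoControlData
{
	uint32_t	payload[4];
	uint32_t	checksum;		/* checksum over payload[] */
} DemoControlData;

/* Hypothetical callback: performs one (possibly torn) read into *out. */
typedef void (*demo_read_fn) (DemoControlData *out, void *arg);

/* Toy checksum (NOT CRC-32C); enough to detect a torn read in this demo. */
static uint32_t
demo_checksum(const DemoControlData *d)
{
	uint32_t	sum = 0x9E3779B9u;

	for (size_t i = 0; i < 4; i++)
		sum = (sum ^ d->payload[i]) * 16777619u;
	return sum;
}

/*
 * Re-read until the checksum matches.  Give up when the same bad checksum
 * is computed twice in a row (unlikely to be a concurrent write) or when
 * the retry limit is reached.  *attempts receives the number of reads.
 */
static bool
demo_read_with_retry(demo_read_fn read_once, void *arg,
					 DemoControlData *out, int *attempts)
{
	uint32_t	last_bad = 0;
	int			retries = 0;

	assert(read_once != NULL && out != NULL && attempts != NULL);

	for (;;)
	{
		uint32_t	computed;

		read_once(out, arg);
		computed = demo_checksum(out);
		*attempts = retries + 1;

		if (computed == out->checksum)
			return true;
		if ((retries > 0 && computed == last_bad) ||
			retries >= MAX_READ_RETRIES)
			return false;

		last_bad = computed;
		retries++;
		/* real code would sleep briefly here before reading again */
	}
}

/* Fake data source for the demo: the first torn_reads reads are damaged. */
typedef struct DemoSource
{
	int			torn_reads;		/* number of initial reads that are torn */
	bool		stable;			/* true: every torn read is damaged identically */
	int			calls;			/* reads performed so far */
} DemoSource;

static void
demo_read_once(DemoControlData *out, void *arg)
{
	DemoSource *src = (DemoSource *) arg;

	for (size_t i = 0; i < 4; i++)
		out->payload[i] = (uint32_t) (i + 1);
	out->checksum = demo_checksum(out);

	/* Simulate a torn read: payload changed, stored checksum stale. */
	if (src->calls < src->torn_reads)
		out->payload[0] ^= 0x100u + (src->stable ? 0u : (uint32_t) src->calls);
	src->calls++;
}
```

With a source whose first three reads are torn in different ways, demo_read_with_retry succeeds on the fourth read. With a source that is damaged identically on every read, it gives up after the second read, which is how persistent corruption is told apart from a concurrent write.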
Best regards, Maksim Melnikov --------------oBwWm0M1Fb0cgbE0J0fbXR0o Content-Type: text/x-patch; charset=UTF-8; name="v1-0001-Try-to-handle-torn-reads-of-pg_control-in-sub-pos.patch" Content-Disposition: attachment; filename*0="v1-0001-Try-to-handle-torn-reads-of-pg_control-in-sub-pos.pa"; filename*1="tch" Content-Transfer-Encoding: base64 RnJvbSBjN2U1NWMyOGJjZWNhN2FjM2E2NTk4NjBlMWYxOWQ1MjQzYzE0OTlhIE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBNYWtzaW0gTWVsbmlrb3YgPG0ubWVsbmlrb3ZAcG9z dGdyZXNwcm8ucnU+CkRhdGU6IFRodSwgNCBTZXAgMjAyNSAxNzozNzo0NyArMDMwMApTdWJq ZWN0OiBbUEFUQ0ggdjFdIFRyeSB0byBoYW5kbGUgdG9ybiByZWFkcyBvZiBwZ19jb250cm9s IGluIHN1YiBwb3N0bWFzdGVyCiBwcm9jZXNzZXMuCgpUaGUgc2FtZSBwcm9ibGVtIHdhcyBm aXhlZCBpbiA2M2E1ODIyMjJjNmIzZGIyYjExMDNkZGY2N2EwNGIzMWE4ZjhlOWJiLApidXQg Zm9yIGZyb250ZW5kcy4gQ3VycmVudCBjb21taXQgaXMgZml4aW5nIHRoaXMgcHJvYmxlbSBm b3IgY2FzZXMKd2hlbiBwZ19jb250cm9sIGZpbGUgaXMgcmVhZCBieSBmb3JrL2V4ZWMnZCBw cm9jZXNzZXMuCgpUaGVyZSBjYW4gYmUgcmFjZSBiZXR3ZWVuIHByb2Nlc3MsIHRoYXQgcmVw bGF5cyBXQUwgb24gc3RhcnQgYW5kCnVwZGF0ZSBjb250cm9sIGZpbGUgYW5kIG90aGVyIHN1 Yi1wcm9jZXNzZXMgdGhhdCByZWFkIGNvbnRyb2wgZmlsZQphbmQgd2VyZSBzdGFydGVkIHdp dGggZXhlYy4gQXMgdGhlIHJlc3VsdCBzdWItcHJvY2Vzc2VzIGNhbiByZWFkCnBhcnRpYWxs eSB1cGRhdGVkIGZpbGUgd2l0aCBpbmNvcnJlY3QgY3JjLiBUaGUgcmVhc29uIGlzIHRoYXQK TG9jYWxQcm9jZXNzQ29udHJvbEZpbGUgZG9uJ3QgYWNxdWlyZSBDb250cm9sRmlsZUxvY2sg YW5kIGl0IGNhbid0CmRvIGl0LgoKQ3VycmVudCBwYXRjaCBpcyBqdXN0IGNvcHktcGFzdGUg b2YgY2hhbmdlcywgYXBwbGllZCBmb3IgZnJvbnRlbmRzLAp3aXRoIGxpdHRsZSBhZGFwdGF0 aW9uLgotLS0KIHNyYy9iYWNrZW5kL2FjY2Vzcy90cmFuc2FtL3hsb2cuYyB8IDMzICsrKysr KysrKysrKysrKysrKysrKysrKysrKysrKy0KIDEgZmlsZSBjaGFuZ2VkLCAzMiBpbnNlcnRp b25zKCspLCAxIGRlbGV0aW9uKC0pCgpkaWZmIC0tZ2l0IGEvc3JjL2JhY2tlbmQvYWNjZXNz L3RyYW5zYW0veGxvZy5jIGIvc3JjL2JhY2tlbmQvYWNjZXNzL3RyYW5zYW0veGxvZy5jCmlu ZGV4IDdmZmIyMTc5MTUxLi45OGY5OTJhYTgxMiAxMDA2NDQKLS0tIGEvc3JjL2JhY2tlbmQv YWNjZXNzL3RyYW5zYW0veGxvZy5jCisrKyBiL3NyYy9iYWNrZW5kL2FjY2Vzcy90cmFuc2Ft 
L3hsb2cuYwpAQCAtNDM0Nyw2ICs0MzQ3LDE1IEBAIFJlYWRDb250cm9sRmlsZSh2b2lkKQog CWludAkJCWZkOwogCWNoYXIJCXdhbF9zZWdzel9zdHJbMjBdOwogCWludAkJCXI7CisJYm9v bAkJY3JjX29rOworI2lmZGVmIEVYRUNfQkFDS0VORAorCXBnX2NyYzMyYwlsYXN0X2NyYzsK KwlpbnQJCQlyZXRyaWVzID0gMDsKKworCUlOSVRfQ1JDMzJDKGxhc3RfY3JjKTsKKworcmV0 cnk6CisjZW5kaWYKIAogCS8qCiAJICogUmVhZCBkYXRhLi4uCkBAIC00NDExLDcgKzQ0MjAs MjkgQEAgUmVhZENvbnRyb2xGaWxlKHZvaWQpCiAJCQkJb2Zmc2V0b2YoQ29udHJvbEZpbGVE YXRhLCBjcmMpKTsKIAlGSU5fQ1JDMzJDKGNyYyk7CiAKLQlpZiAoIUVRX0NSQzMyQyhjcmMs IENvbnRyb2xGaWxlLT5jcmMpKQorCWNyY19vayA9IEVRX0NSQzMyQyhjcmMsIENvbnRyb2xG aWxlLT5jcmMpOworCisjaWZkZWYgRVhFQ19CQUNLRU5ECisKKwkvKgorCSAqIElmIHRoZSBz ZXJ2ZXIgd2FzIHdyaXRpbmcgYXQgdGhlIHNhbWUgdGltZSwgaXQgaXMgcG9zc2libGUgdGhh dCB3ZSByZWFkCisJICogcGFydGlhbGx5IHVwZGF0ZWQgY29udGVudHMgb24gc29tZSBzeXN0 ZW1zLiAgSWYgdGhlIENSQyBkb2Vzbid0IG1hdGNoLAorCSAqIHJldHJ5IGEgbGltaXRlZCBu dW1iZXIgb2YgdGltZXMgdW50aWwgd2UgY29tcHV0ZSB0aGUgc2FtZSBiYWQgQ1JDIHR3aWNl CisJICogaW4gYSByb3cgd2l0aCBhIHNob3J0IHNsZWVwIGluIGJldHdlZW4uICBUaGVuIHRo ZSBmYWlsdXJlIGlzIHVubGlrZWx5CisJICogdG8gYmUgZHVlIHRvIGEgY29uY3VycmVudCB3 cml0ZS4KKwkgKi8KKwlpZiAoIWNyY19vayAmJgorCQkocmV0cmllcyA9PSAwIHx8ICFFUV9D UkMzMkMoY3JjLCBsYXN0X2NyYykpICYmCisJCXJldHJpZXMgPCAxMCkKKwl7CisJCXJldHJp ZXMrKzsKKwkJbGFzdF9jcmMgPSBjcmM7CisJCXBnX3VzbGVlcCgxMDAwMCk7CisJCWdvdG8g cmV0cnk7CisJfQorI2VuZGlmCisKKwlpZiAoIWNyY19vaykKIAkJZXJlcG9ydChGQVRBTCwK IAkJCQkoZXJyY29kZShFUlJDT0RFX09CSkVDVF9OT1RfSU5fUFJFUkVRVUlTSVRFX1NUQVRF KSwKIAkJCQkgZXJybXNnKCJpbmNvcnJlY3QgY2hlY2tzdW0gaW4gY29udHJvbCBmaWxlIikp KTsKLS0gCjIuNDMuMAoK --------------oBwWm0M1Fb0cgbE0J0fbXR0o--