From: Nisha Moond
Date: Tue, 24 Mar 2026 22:21:35 +0530
Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
To: Fujii Masao
Cc: Amit Kapila, PostgreSQL Hackers

On Tue, Mar 24, 2026 at 2:45 PM Fujii Masao wrote:
>
> On Tue, Mar 24, 2026 at 3:00 PM Fujii Masao wrote:
> >
> > On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond wrote:
> > > Hi Fujii-san,
> > >
> > > I tried reproducing the wait scenario as you mentioned, but could not
> > > reproduce it.
> > > Steps I followed:
> > > 1) Place a debugger in the slotsync worker and hold it at
> > > fetch_remote_slots() ... -> libpqsrv_get_result()
> > > 2) Kill the primary.
> > > 3) Triggered promotion of the standby and release debugger from slotsync worker.
> > >
> > > The slot sync worker stops when the promotion is triggered and then
> > > restarts, but fails to connect to the primary. The promotion happens
> > > immediately.
> > > ```
> > > LOG: received promote request
> > > LOG: redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
> > > 0.02 s, elapsed: 4574.89 s
> > > LOG: last completed transaction was at log time 2026-03-23
> > > 17:13:15.782313+05:30
> > > LOG: replication slot synchronization worker will stop because
> > > promotion is triggered
> > > LOG: slot sync worker started
> > > ERROR: synchronization worker "slotsync worker" could not connect to
> > > the primary server: connection to server at "127.0.0.1", port 9933
> > > failed: Connection refused
> > > Is the server running on that host and accepting TCP/IP connections?
> > > ```
> > >
> > > I’ll debug this further to understand it better.
> > > In the meantime, please let me know if I’m missing any step, or if you
> > > followed a specific setup/script to reproduce this scenario.
> >
> > Thanks for testing!
> >
> > If you killed the primary with a signal like SIGTERM, an RST packet might have
> > been sent to the slotsync worker at that moment. That allowed the worker to
> > detect the connection loss and exited the wait state, so promotion could
> > complete as expected.
> >
> > To reproduce the issue, you'll need a scenario where the worker cannot detect
> > the connection loss. For example, you could block network traffic (e.g., with
> > iptables) between the primary and the slotsync worker. The key is to create
> > a situation where the worker remains stuck waiting for input for a long time.
>
> Here's one way to reproduce the issue using iptables:
>

Thank you, Fujii-san, for sharing the steps.
I am now able to reproduce the behavior where promotion gets stuck
because the slot sync worker remains in a wait loop.

As an experiment, I tried setting tcp_user_timeout to 7000 / 15000
(using slightly higher values for debugging).
With this setting, the TCP stack terminates the connection if data sent
to the primary remains unacknowledged beyond the configured timeout
(e.g., due to a network drop). In such cases, the slot sync worker exits
instead of waiting indefinitely.

With an appropriately tuned timeout, this could help avoid the promotion
issue by ensuring the worker does not remain stuck when the connection
to the primary is lost.

Thanks,
Nisha
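
For illustration only, here is a minimal sketch of the mechanism described
above. It is not taken from the PostgreSQL source. On Linux, a setting such
as tcp_user_timeout is backed by the TCP_USER_TIMEOUT socket option, which
takes a value in milliseconds. The Python snippet sets that option on a plain
socket; the helper name apply_tcp_user_timeout is hypothetical, and 7000
simply mirrors one of the values tried in the experiment.

```python
import socket


def apply_tcp_user_timeout(sock, timeout_ms):
    """Set TCP_USER_TIMEOUT (in milliseconds) on a TCP socket.

    Returns the value the kernel reports back, or None when the option
    is not available on this platform (it is Linux-specific).
    Hypothetical helper, for illustration only.
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return None
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
        return sock.getsockopt(socket.IPPROTO_TCP, opt)
    except OSError:
        # The running kernel or sandbox rejected the option.
        return None


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(apply_tcp_user_timeout(s, 7000))
```

The option only takes effect while transmitted data is still
unacknowledged: in that case the kernel closes the connection once the
timeout expires, and a blocked read then fails instead of waiting
indefinitely.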