Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tBXf5-004UjP-F7 for pgsql-hackers@arkaria.postgresql.org; Thu, 14 Nov 2024 11:05:18 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tBXf2-002qjv-Fh for pgsql-hackers@arkaria.postgresql.org; Thu, 14 Nov 2024 11:05:17 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tBXf2-002qjm-4k for pgsql-hackers@lists.postgresql.org; Thu, 14 Nov 2024 11:05:16 +0000 Received: from mail-pj1-x1036.google.com ([2607:f8b0:4864:20::1036]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1tBXev-001oa6-W8 for pgsql-hackers@lists.postgresql.org; Thu, 14 Nov 2024 11:05:15 +0000 Received: by mail-pj1-x1036.google.com with SMTP id 98e67ed59e1d1-2e2ed2230d8so421337a91.0 for ; Thu, 14 Nov 2024 03:05:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=timescale.com; s=google; t=1731582309; x=1732187109; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=zhWSBeDUTq/oEUP8dIw2AL4dV7Cl6JJgg5yr1zN82yI=; b=MSupe2rsbLsKA2eH/dHWiHhTgMnSaWeQTzYyymtdvsDwzRkVYRbIHncduQBMa4Z5BP z/0+NyhY7+qOtcEuokR1cBhvlq4DslwsBclDFdTmvAJPWX3KZhmBBpuQLReUwHYPi5iq E2RTtQYSMWNQFZpKl0meCHvbPDOY7nbof5JAlXlxUYF8fKeJQh784Hq7pSUJ5IdbE8EU KyLZbhCeuAjzIDFHEOdZYbr41F8KWbD9EK1JxNM9Gule2X/IETN4QT2x21TEVGtoA2X1 uV+T/gGsUNcph3qle96C05Q/PWeTp5v3GP+gupmpIzw/kxGFHnw3Pk0dULYWcg6tbHGp VKNg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1731582309; x=1732187109; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=zhWSBeDUTq/oEUP8dIw2AL4dV7Cl6JJgg5yr1zN82yI=; b=Qp2uxyBai3XpO9F05J4uQYwZ1tfgW0IkSzCgSOUKA0OiV3bPHNzevy8qIfRg76fkAI kOaRlswlSLv4iJ20gkBsdufU21FPwmkHe3IQlkh6gqNIbY/WiGFDeCQ5ZouXbGNonUNq Ji0P0JWRKybB7XdVjYml7lgbQru9/ulCcT6Xyf1db7JR426Zk+oOlE0L82LAjpOT3yE9 +eg4hlTPHlAfrNJ4HNWaWHj2UnJLfmm6rxdOPfla+hy0i1pcFbUbK5ZMp0Q7zOmKhglg Ul3kFtcP6wL78tev+tfH5n1En2A3720mhz+llzugqmzboNu57SeD9y1NtfDvfWIvexqo 55lQ== X-Gm-Message-State: AOJu0YwWsXNsRpNcl/jQC6ux49gGIpxLnUA6yYoSrD+zaEgDVjAwgs1h YDzIV5p97vt0dD43PFiDNXwhfM79hSSskh2LvGFzb+mkjbYImzKCydmni9XgUGjSyUm46+GO9dN vUBFMXvLFbfmCOOt4iIIOyHxGrblePE31vwu0kVXhZcHaLESM0U0rzg== X-Google-Smtp-Source: AGHT+IGeZQnimCPaMmtSquroR4AWSzXnODDDWFHsP22qJF5z+janf5zuVPYxv4GpSDNBBYifUslybhlBGkA4ySNpnCk= X-Received: by 2002:a17:90b:4c84:b0:2e2:d17e:1ef7 with SMTP id 98e67ed59e1d1-2ea0634651bmr1756401a91.3.1731582308508; Thu, 14 Nov 2024 03:05:08 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Aleksander Alekseev Date: Thu, 14 Nov 2024 14:04:57 +0300 Message-ID: Subject: Re: [PATCH] Refactor SLRU to always use long file names To: PostgreSQL Hackers Cc: Michael Paquier Content-Type: text/plain; charset="UTF-8" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi Michael, > The scans may be quite long as well, actually, which could be a > bottleneck. Did you measure the runtime with a maximized (still > realistic) pool of files for these SLRUs in the upgrade time? For > upgrades, data would be the neck. Good question. In theory SLRUs are not supposed to grow large and their size is a small fraction of the rest of the database. As an example CLOG ( pg_xact/ ) stores 2 bits per transaction. Since every SLRU has a dedicated directory and we scan just it, non-SLRU files don't affect the scan time. To make sure I asked several people to check how many SLRUs they have in the prod environment. The typical response looked like this: ``` $PGDATA/pg_xact: 191 segments $PGDATA/pg_commit_ts: 3 $PGDATA/pg_multixact/offsets: 148 $PGDATA/pg_multixact/members: 400 $PGDATA/pg_subtrans: 4 $PGDATA/pg_serial: 3 ``` This is a 800 Gb database. Interestingly larger databases (4.2Tb) may have much less SLRU segments (220 in total, most of them are pg_xact). And here is the *worst* case that was reported to me: ``` $PGDATA/pg_xact: 171 segments $PGDATA/pg_commit_ts: 3 $PGDATA/pg_multixact/offsets: 4864 $PGDATA/pg_multixact/members: 40996 $PGDATA/pg_subtrans: 5 $PGDATA/pg_serial: 3 ``` I was told this is a "1Tb+" database. For this user pg_upgrade will rename 45 000 files. I wrote a little script to check how much time it will take: ``` #!/usr/bin/env perl use strict; my $from = "test_0001.tmp"; my $to = "test_0002.tmp"; system("touch $from"); for my $i (1..45000) { rename($from, $to); ($from, $to) = ($to, $from); } ``` On my laptop I get 0.5 seconds. Note that I don't do scanning, only renaming, assuming that the recent should take most of the time. I think this should be multiplied by 10 to take into account the role of the filesystem cache and other factors. All in all in the absolutely worst case scenario this shouldn't take more than 5 seconds, in reality it will probably be orders of magnitude less. > Note that this also depends on the system endianness, see 039_end_of_wal.pl. Sure, I think I took it into account when using pack("L!"). My understanding is that "L" takes care of the endiness since I see special flags to force little- or big-endiness independently from the platform [1]. This of course should be tested in practice on different machines. Using an exclamation mark in "L!" was a mistake since cat_ver is not an int, but rather an uint32. > You don't really need the lookup part, actually? For lookup we already have the pg_controldata tool, that's not a problem. > Control file manipulation may be useful as a routine in Cluster.pm, > based on an offset in the file and a format to pack as argument? > [...] > It's one of these things I could see myself reuse to force a state in > the cluster and make a test cheaper, for example. > You would just need the part where > the control file is rewritten, which should be OK as long as the > cluster is freshly initdb'd meaning that there should be nothing that > interacts with the new value set. Agree. Still I don't see a good way of figuring out sizeof(ControlFileData) from Perl. The structure has int's in it (e.g. wal_level, MaxConnections, etc) thus the size is platform-dependent. The CRC should be placed at the end of the structure. If we want to manipulate MaxConnections etc their offsets are going to be platform-dependent as well. And my understanding is that the alignment is platform/compiler dependent too. I guess we are going to need either a `pg_writecontoldata` tool or `pg_controldata -w` flag. I wonder which option you find more attractive, or maybe you have better ideas? [1]: https://perldoc.perl.org/functions/pack -- Best regards, Aleksander Alekseev