public inbox for [email protected]help / color / mirror / Atom feed
Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup 4+ messages / 3 participants [nested] [flat]
* Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup @ 2025-04-04 17:34 Heikki Linnakangas <[email protected]> 0 siblings, 1 reply; 4+ messages in thread From: Heikki Linnakangas @ 2025-04-04 17:34 UTC (permalink / raw) To: Jelte Fennema-Nio <[email protected]>; Andres Freund <[email protected]>; +Cc: Tom Lane <[email protected]>; Tomas Vondra <[email protected]>; pgsql-hackers On 18/03/2025 01:08, Jelte Fennema-Nio wrote: > On Mon Feb 24, 2025 at 12:01 PM CET, Jelte Fennema-Nio wrote: >> Right after pressing send I realized I could remove two more useless >> lines... > > Rebased patchset attached (trivial conflict against pg_noreturn > changes). v7-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch: > /* > * A custom wrapper around the system() function that calls the necessary > * functions pre/post-fork. > */ > int > System(const char *command, uint32 wait_event_info) > { > int rc; > > fflush(NULL); > pgstat_report_wait_start(wait_event_info); > > if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND) > { > /* > * Set in_restore_command to tell the signal handler that we should > * exit right away on SIGTERM. This is done for the duration of the > * system() call because there isn't a good way to break out while it > * is executing. Since we might call proc_exit() in a signal handler, > * it is best to put any additional logic outside of this section > * where in_restore_command is set to true. > */ > in_restore_command = true; > > /* > * Also check if we had already received the signal, so that we don't > * miss a shutdown request received just before this. > */ > if (shutdown_requested) > proc_exit(1); > } > > rc = system(command); > > if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND) > in_restore_command = false; > > pgstat_report_wait_end(); > return rc; > } Let's move that 'in_restore_command' business to the caller. It's weird modularity for the function to implicitly behave differently for some callers. And 'wait_event_info' should only affect pgstat reporting, not actual behavior. I don't feel good about the function name. How about pg_system() or something? postmaster/startup.c also seems like a weird place for it; not sure where it belongs but probably not there. Maybe next to OpenPipeStream() in fd.c, or move both to a new file. v7-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch: > @@ -274,6 +275,7 @@ System(const char *command, uint32 wait_event_info) > { > int rc; > > + RestoreOriginalOpenFileLimit(); > fflush(NULL); > pgstat_report_wait_start(wait_event_info); > > @@ -303,6 +305,7 @@ System(const char *command, uint32 wait_event_info) > in_restore_command = false; > > pgstat_report_wait_end(); > + RestoreCustomOpenFileLimit(); > return rc; > } > Looks a bit funny that both functions are called Restore<something>(). Not sure what to suggest instead though. Maybe RaiseOpenFileLimit() and LowerOpenFileLimit(). > @@ -2724,6 +2873,19 @@ OpenPipeStream(const char *command, const char *mode) > ReleaseLruFiles(); > > TryAgain: > + > + /* > + * It would be great if we could call RestoreOriginalOpenFileLimit here, > + * but since popen() also opens a file in the current process (this side > + * of the pipe) we cannot do so safely. Because we might already have many > + * more files open than the original limit. > + * > + * The only way to address this would be implementing a custom popen() > + * that calls RestoreOriginalOpenFileLimit only around the actual fork > + * call, but that seems too much effort to handle the corner case where > + * this external command uses both select() and tries to open more files > + * than select() allows for. > + */ > fflush(NULL); > pqsignal(SIGPIPE, SIG_DFL); > errno = 0; What would it take to re-implement popen() with fork+exec? Genuine question; I have a feeling it might be complicated to do correctly on all platforms, but I don't know. -- Heikki Linnakangas Neon (https://neon.tech) ^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup @ 2025-04-13 19:30 Jelte Fennema-Nio <[email protected]> parent: Heikki Linnakangas <[email protected]> 0 siblings, 1 reply; 4+ messages in thread From: Jelte Fennema-Nio @ 2025-04-13 19:30 UTC (permalink / raw) To: Heikki Linnakangas <[email protected]>; Andres Freund <[email protected]>; +Cc: Tom Lane <[email protected]>; Tomas Vondra <[email protected]>; pgsql-hackers On Fri Apr 4, 2025 at 7:34 PM CEST, Heikki Linnakangas wrote: > Let's move that 'in_restore_command' business to the caller. It's weird > modularity for the function to implicitly behave differently for some > callers. I definitely agree with the sentiment, and it was what I originally planned to do. But then I went for this approach anyway because commit 8fb13dd6ab5b explicitely moved all code except for the actual call to system() out of the PreRestoreCommand()/PostRestoreCommand() section (which is also described in the code comment). So I didn't move the the in_restore_command stuff to the caller, and improved the function comment to call out this unfortunate coupling. > And 'wait_event_info' should only affect pgstat reporting, not > actual behavior. Given that we need to keep the restore command stuff in this function, I think the only other option is to add a dedicated argument for the restore command stuff, like "bool is_restore_command". But that would require every caller, except for the restore command, to pass an additional "false" as an argument. To me the additionaly noise that that adds seems like a worse issue than the non-purity we get by piggy-backing on the wait_event_info argument. > I don't feel good about the function name. How about pg_system() or > something? Done > postmaster/startup.c also seems like a weird place for it; > not sure where it belongs but probably not there. Maybe next to > OpenPipeStream() in fd.c, or move both to a new file. Moved it to fd.c > Looks a bit funny that both functions are called Restore<something>(). > Not sure what to suggest instead though. Maybe RaiseOpenFileLimit() and > LowerOpenFileLimit(). Changed them to UseOriginalOpenFileLimit() and UseOriginalOpenFileLimit() > What would it take to re-implement popen() with fork+exec? Genuine > question; I have a feeling it might be complicated to do correctly on > all platforms, but I don't know. I initially attempted to re-implement it, but after looking at the fairly complex FreeBSD implementation of popen[1] that suddenly seemed more trouble than it's worth. [1]: https://github.com/freebsd/freebsd-src/blob/c98367641991019bac0e8cd55b70682171820534/lib/libc/gen/po... Attachments: [text/x-patch] v8-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch (5.8K, 2-v8-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch) download | inline diff: From 5dbab2ccf0d74313dbc2cbaeb418346de8cc2f48 Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio <[email protected]> Date: Sun, 23 Feb 2025 16:52:29 +0100 Subject: [PATCH v8 1/3] Adds a helper for places that shell out to system() We need to call system() in a few places, and to do so safely we need some pre/post-fork logic. This encapsulates that logic into a single System helper function. The main reason to do this is that in a follow up commit we want to add even more logic pre and post-fork. --- src/backend/access/transam/xlogarchive.c | 29 ++-------------- src/backend/archive/shell_archive.c | 5 +-- src/backend/postmaster/startup.c | 1 + src/backend/storage/file/fd.c | 42 ++++++++++++++++++++++++ src/include/storage/fd.h | 1 + 5 files changed, 48 insertions(+), 30 deletions(-) diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 1ef1713c91a..c7640ec5025 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -158,27 +158,8 @@ RestoreArchivedFile(char *path, const char *xlogfname, (errmsg_internal("executing restore command \"%s\"", xlogRestoreCmd))); - fflush(NULL); - pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND); - - /* - * PreRestoreCommand() informs the SIGTERM handler for the startup process - * that it should proc_exit() right away. This is done for the duration - * of the system() call because there isn't a good way to break out while - * it is executing. Since we might call proc_exit() in a signal handler, - * it is best to put any additional logic before or after the - * PreRestoreCommand()/PostRestoreCommand() section. - */ - PreRestoreCommand(); - - /* - * Copy xlog from archival storage to XLOGDIR - */ - rc = system(xlogRestoreCmd); - - PostRestoreCommand(); - - pgstat_report_wait_end(); + /* Copy xlog from archival storage to XLOGDIR */ + rc = pg_system(xlogRestoreCmd, WAIT_EVENT_RESTORE_COMMAND); pfree(xlogRestoreCmd); if (rc == 0) @@ -325,11 +306,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName, /* * execute the constructed command */ - fflush(NULL); - pgstat_report_wait_start(wait_event_info); - rc = system(xlogRecoveryCmd); - pgstat_report_wait_end(); - + rc = pg_system(xlogRecoveryCmd, wait_event_info); pfree(xlogRecoveryCmd); if (rc != 0) diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c index 828723afe47..64c2c6aa760 100644 --- a/src/backend/archive/shell_archive.c +++ b/src/backend/archive/shell_archive.c @@ -75,10 +75,7 @@ shell_archive_file(ArchiveModuleState *state, const char *file, (errmsg_internal("executing archive command \"%s\"", xlogarchcmd))); - fflush(NULL); - pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND); - rc = system(xlogarchcmd); - pgstat_report_wait_end(); + rc = pg_system(xlogarchcmd, WAIT_EVENT_ARCHIVE_COMMAND); if (rc != 0) { diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c index 7149a67fcbc..eb79fda1fd9 100644 --- a/src/backend/postmaster/startup.c +++ b/src/backend/postmaster/startup.c @@ -33,6 +33,7 @@ #include "utils/guc.h" #include "utils/memutils.h" #include "utils/timeout.h" +#include "utils/wait_event_types.h" #ifndef USE_POSTMASTER_DEATH_SIGNAL diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 0e8299dd556..718d8079ad7 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2734,6 +2734,48 @@ OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode) return -1; /* failure */ } +/* + * A custom wrapper around the system() function that calls the necessary + * functions pre/post-fork. + * + * If WAIT_EVENT_RESTORE_COMMAND is passed as the wait_event_info, it will also + * call the necessary PreRestoreCommand/PostRerstoreCommand functions. It's + * unfortunate that we have to do couple the behaviour of this function so + * tighlty to the restore command logic, but it's the only way to make sure + * that we don't have additional logic inbetween the PreRestoreCommand and + * PostRestoreCommand calls. + */ +int +pg_system(const char *command, uint32 wait_event_info) +{ + int rc; + + fflush(NULL); + pgstat_report_wait_start(wait_event_info); + + if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND) + { + /* + * PreRestoreCommand() informs the SIGTERM handler for the startup + * process that it should proc_exit() right away. This is done for + * the duration of the system() call because there isn't a good way to + * break out while it is executing. Since we might call proc_exit() + * in a signal handler, it is best to put any additional logic before + * or after the PreRestoreCommand()/PostRestoreCommand() section. + */ + PreRestoreCommand(); + } + + rc = system(command); + + if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND) + PostRestoreCommand(); + + pgstat_report_wait_end(); + return rc; +} + + /* * Routines that want to initiate a pipe stream should use OpenPipeStream * rather than plain popen(). This lets fd.c deal with freeing FDs if diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index b77d8e5e30e..2d445674a1a 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -187,6 +187,7 @@ extern int pg_fsync_writethrough(int fd); extern int pg_fdatasync(int fd); extern bool pg_file_exists(const char *name); extern void pg_flush_data(int fd, off_t offset, off_t nbytes); +extern int pg_system(const char *command, uint32 wait_event_info); extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); base-commit: f09088a01d3372fdfe5da12dd0b2e24989f0caa6 -- 2.43.0 [text/x-patch] v8-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch (9.7K, 3-v8-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch) download | inline diff: From c43950afa51b22892211770a41d4be4609b7b9ac Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio <[email protected]> Date: Tue, 11 Feb 2025 19:15:36 +0100 Subject: [PATCH v8 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE) when necessary The default open file limit of 1024 on Linux is extremely low. The reason that this hasn't changed change is because doing so would break legacy programs that use the select(2) system call in hard to debug ways. So instead programs that want to opt-in to a higher open file limit are expected to bump their soft limit to their hard limit on startup. Details on this are very well explained in a blogpost by the systemd author[1]. There's also a similar change done by the Go language[2]. This starts bumping postmaster its soft open file limit when we realize that we'll run into the soft limit with the requested max_files_per_process GUC. We do so by slightly changing the meaning of the max_files_per_process GUC. The actual (not publicly exposed) limit is max_safe_fds, previously this would be set to: max_files_per_process - already_open_files - NUM_RESERVED_FDS After this change we now try to set max_safe_fds to max_files_per_process if the system allows that. This is deemed more natural to understand for users, because now the limit of files that they can open is actually what they configured in max_files_per_process. Adding this infrastructure to change RLIMIT_NOFILE when needed is especially useful for the AIO work that Andres is doing, because io_uring consumes a lot of file descriptors. Even without looking at AIO there is a large number of reports from people that require changing their soft file limit before starting Postgres, sometimes falling back to lowering max_files_per_process when they fail to do so[3-8]. It's also not all that strange to fail at setting the soft open file limit because there are multiple places where one can configure such limits and usually only one of them is effective (which one depends on how Postgres is started). In cloud environments its also often not possible for user to change the soft limit, because they don't control the way that Postgres is started. One thing to note is that we temporarily restore the original soft limit when shell-ing out to other executables. This is done as a precaution in case those executables are using select(2). [1]: https://0pointer.net/blog/file-descriptor-limits.html [2]: https://github.com/golang/go/issues/46279 [3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres [4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie [5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com [6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com [7]: https://github.com/abiosoft/colima/discussions/836 [8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9 --- src/backend/storage/file/fd.c | 199 +++++++++++++++++++++++++++++++--- 1 file changed, 184 insertions(+), 15 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 718d8079ad7..b3a58deef43 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -158,6 +158,13 @@ int max_files_per_process = 1000; */ int max_safe_fds = FD_MINFREE; /* default if not changed */ +#ifdef HAVE_GETRLIMIT +static bool saved_original_max_open_files; +static struct rlimit original_max_open_files; +static struct rlimit custom_max_open_files; +#endif + + /* Whether it is safe to continue running after fsync() fails. */ bool data_sync_retry = false; @@ -946,6 +953,152 @@ InitTemporaryFileAccess(void) #endif } +/* + * Returns true if the passed in highestfd is the last one that we're allowed + * to open based on our. This should only be called if + */ +static bool +IsOpenFileLimit(int highestfd) +{ +#ifdef HAVE_GETRLIMIT + if (!saved_original_max_open_files) + { + return false; + } + + return highestfd >= custom_max_open_files.rlim_cur - 1; +#else + return false; +#endif +} + +/* + * Increases the open file limit (RLIMIT_NOFILE) by the requested amount. + * Returns true if successful, false otherwise. + */ +static bool +IncreaseOpenFileLimit(int extra_files) +{ +#ifdef HAVE_GETRLIMIT + struct rlimit rlim; + + if (!saved_original_max_open_files) + { + return false; + } + + rlim = custom_max_open_files; + + /* If we're already at the max we reached our limit */ + if (rlim.rlim_cur == original_max_open_files.rlim_max) + return false; + + /* Otherwise try to increase the soft limit to what we need */ + rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max); + + if (setrlimit(RLIMIT_NOFILE, &rlim) != 0) + { + /* We made sure not to exceed the hard limit, so this shouldn't fail */ + ereport(WARNING, (errmsg("setrlimit failed: %m"))); + return false; + } + + custom_max_open_files = rlim; + + elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur); + + return true; +#else + return false; +#endif +} + +/* + * Saves the original open file limit (RLIMIT_NOFILE) the first time when this + * is called. If called again it's a no-op. + * + * Returns true if successful, false otherwise. + */ +static void +SaveOriginalOpenFileLimit(void) +{ +#ifdef HAVE_GETRLIMIT + int status; + + if (saved_original_max_open_files) + { + /* Already saved, no need to do it again */ + return; + } + + status = getrlimit(RLIMIT_NOFILE, &original_max_open_files); + if (status != 0) + { + ereport(WARNING, (errmsg("getrlimit failed: %m"))); + return; + } + + custom_max_open_files = original_max_open_files; + saved_original_max_open_files = true; + return; +#endif +} + +/* + * UseOriginalOpenFileLimit --- Makes the process use the original open file + * limit that was present at postmaster start. + * + * This should be called before spawning subprocesses that might use select(2) + * which can only handle file descriptors up to 1024. + */ +static void +UseOriginalOpenFileLimit(void) +{ +#ifdef HAVE_GETRLIMIT + if (!saved_original_max_open_files) + { + return; + } + + if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur) + { + /* Not changed, so no need to call setrlimit at all */ + return; + } + + if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0) + { + ereport(WARNING, (errmsg("setrlimit failed: %m"))); + } +#endif +} + +/* + * UseCustomOpenFileLimit --- Makes the process use our custom open file limit + * after that we configured based on the max_files_per_process GUC. + */ +static void +UseCustomOpenFileLimit(void) +{ +#ifdef HAVE_GETRLIMIT + if (!saved_original_max_open_files) + { + return; + } + + if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur) + { + /* Not changed, so no need to call setrlimit at all */ + return; + } + + if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0) + { + ereport(WARNING, (errmsg("setrlimit failed: %m"))); + } +#endif +} + /* * count_usable_fds --- count how many FDs the system will let us open, * and estimate how many are already open. @@ -969,38 +1122,39 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open) int highestfd = 0; int j; -#ifdef HAVE_GETRLIMIT - struct rlimit rlim; - int getrlimit_status; -#endif - size = 1024; fd = (int *) palloc(size * sizeof(int)); -#ifdef HAVE_GETRLIMIT - getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim); - if (getrlimit_status != 0) - ereport(WARNING, (errmsg("getrlimit failed: %m"))); -#endif /* HAVE_GETRLIMIT */ + SaveOriginalOpenFileLimit(); /* dup until failure or probe limit reached */ for (;;) { int thisfd; -#ifdef HAVE_GETRLIMIT - /* * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on * some platforms */ - if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1) - break; -#endif + if (IsOpenFileLimit(highestfd)) + { + if (!IncreaseOpenFileLimit(max_to_probe - used)) + break; + } thisfd = dup(2); if (thisfd < 0) { + /* + * Eventhough we do the pre-check above, it's still possible that + * the call to dup fails with EMFILE. This can happen if the last + * file descriptor was already assigned to an "already open" file. + * One example of this happening, is if we're already at the soft + * limit when we call count_usable_fds. + */ + if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used)) + continue; + /* Expect EMFILE or ENFILE, else it's fishy */ if (errno != EMFILE && errno != ENFILE) elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used); @@ -2750,6 +2904,7 @@ pg_system(const char *command, uint32 wait_event_info) { int rc; + UseOriginalOpenFileLimit(); fflush(NULL); pgstat_report_wait_start(wait_event_info); @@ -2772,6 +2927,7 @@ pg_system(const char *command, uint32 wait_event_info) PostRestoreCommand(); pgstat_report_wait_end(); + UseCustomOpenFileLimit(); return rc; } @@ -2805,6 +2961,19 @@ OpenPipeStream(const char *command, const char *mode) ReleaseLruFiles(); TryAgain: + + /* + * It would be great if we could call UseOriginalOpenFileLimit here, but + * since popen() also opens a file in the current process (this side of the + * pipe) we cannot do so safely. Because we might already have many more + * files open than the original limit. + * + * The only way to address this would be implementing a custom popen() that + * calls UseOriginalOpenFileLimit only around the actual fork call, but + * that seems too much effort to handle the corner case where this external + * command uses both select() and tries to open more files than select() + * allows for. + */ fflush(NULL); pqsignal(SIGPIPE, SIG_DFL); errno = 0; -- 2.43.0 [text/x-patch] v8-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch (1.5K, 4-v8-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch) download | inline diff: From 298c3c436eb4535df95d7efb0b17105cc6e6c770 Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio <[email protected]> Date: Wed, 12 Feb 2025 01:08:07 +0100 Subject: [PATCH v8 3/3] Reflect the value of max_safe_fds in max_files_per_process It is currently hard to figure out if max_safe_fds is significantly lower than max_files_per_process. This starts reflecting the value of max_safe_fds in max_files_per_process after our limit detection. We still want to have two separate variables because for the bootstrap or standalone-backend cases their values differ on purpose. --- src/backend/storage/file/fd.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index b3a58deef43..8cee38e6c17 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -1199,6 +1199,7 @@ set_max_safe_fds(void) { int usable_fds; int already_open; + char *max_safe_fds_string; /*---------- * We want to set max_safe_fds to @@ -1214,6 +1215,16 @@ set_max_safe_fds(void) max_safe_fds = Min(usable_fds, max_files_per_process); + /* + * Update GUC variable to allow users to see if the result is different + * than what the used value turns out to be different than what they had + * configured. + */ + max_safe_fds_string = psprintf("%d", max_safe_fds); + SetConfigOption("max_files_per_process", max_safe_fds_string, + PGC_POSTMASTER, PGC_S_OVERRIDE); + pfree(max_safe_fds_string); + /* * Take off the FDs reserved for system() etc. */ -- 2.43.0 ^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup @ 2025-10-29 10:11 Peter Eisentraut <[email protected]> parent: Jelte Fennema-Nio <[email protected]> 0 siblings, 1 reply; 4+ messages in thread From: Peter Eisentraut @ 2025-10-29 10:11 UTC (permalink / raw) To: Jelte Fennema-Nio <[email protected]>; Heikki Linnakangas <[email protected]>; Andres Freund <[email protected]>; +Cc: Tom Lane <[email protected]>; Tomas Vondra <[email protected]>; pgsql-hackers On 13.04.25 21:30, Jelte Fennema-Nio wrote: > On Fri Apr 4, 2025 at 7:34 PM CEST, Heikki Linnakangas wrote: >> Let's move that 'in_restore_command' business to the caller. It's >> weird modularity for the function to implicitly behave differently for >> some callers. > > I definitely agree with the sentiment, and it was what I originally > planned to do. But then I went for this approach anyway because commit > 8fb13dd6ab5b explicitely moved all code except for the actual call to > system() out of the PreRestoreCommand()/PostRestoreCommand() section > (which is also described in the code comment). > So I didn't move the the in_restore_command stuff to the caller, and > improved the function comment to call out this unfortunate coupling. >> And 'wait_event_info' should only affect pgstat reporting, not actual >> behavior. > > Given that we need to keep the restore command stuff in this function, I > think the only other option is to add a dedicated argument for the > restore command stuff, like "bool is_restore_command". But that would > require every caller, except for the restore command, to pass an > additional "false" as an argument. To me the additionaly noise that that > adds seems like a worse issue than the non-purity we get by > piggy-backing on the wait_event_info argument. > >> I don't feel good about the function name. How about pg_system() or >> something? This patch set is showing compiler warnings because pg_system() wasn't properly declared where needed. Please provide an update that builds cleanly. Also, it appears the patch for pgbench disappeared from the series. Was that intentional? ^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup @ 2025-11-03 17:04 Jelte Fennema-Nio <[email protected]> parent: Peter Eisentraut <[email protected]> 0 siblings, 0 replies; 4+ messages in thread From: Jelte Fennema-Nio @ 2025-11-03 17:04 UTC (permalink / raw) To: Peter Eisentraut <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; Andres Freund <[email protected]>; Tom Lane <[email protected]>; Tomas Vondra <[email protected]>; pgsql-hackers On Wed, 29 Oct 2025 at 11:11, Peter Eisentraut <[email protected]> wrote: > This patch set is showing compiler warnings because pg_system() wasn't > properly declared where needed. Please provide an update that builds > cleanly. It still compiled fine on my local branch, but indeed after a rebase not anymore. I guess some include got removed in some header in the meantime. Attached an updated version (which added an storage/fd.h include). > Also, it appears the patch for pgbench disappeared from the series. Was > that intentional? Yes, the pgbench change got committed for PG18 already (see Andres' last message in the thread). Attachments: [text/x-patch] v8-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch (6.0K, 2-v8-0001-Adds-a-helper-for-places-that-shell-out-to-system.patch) download | inline diff: From 9e478de009b5c1a77ca47fc268cb51adbb657dd8 Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio <[email protected]> Date: Sun, 23 Feb 2025 16:52:29 +0100 Subject: [PATCH v8 1/3] Adds a helper for places that shell out to system() We need to call system() in a few places, and to do so safely we need some pre/post-fork logic. This encapsulates that logic into a single System helper function. The main reason to do this is that in a follow up commit we want to add even more logic pre and post-fork. --- src/backend/access/transam/xlogarchive.c | 29 ++-------------- src/backend/archive/shell_archive.c | 6 ++-- src/backend/postmaster/startup.c | 1 + src/backend/storage/file/fd.c | 42 ++++++++++++++++++++++++ src/include/storage/fd.h | 1 + 5 files changed, 49 insertions(+), 30 deletions(-) diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 1ef1713c91a..c7640ec5025 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -158,27 +158,8 @@ RestoreArchivedFile(char *path, const char *xlogfname, (errmsg_internal("executing restore command \"%s\"", xlogRestoreCmd))); - fflush(NULL); - pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND); - - /* - * PreRestoreCommand() informs the SIGTERM handler for the startup process - * that it should proc_exit() right away. This is done for the duration - * of the system() call because there isn't a good way to break out while - * it is executing. Since we might call proc_exit() in a signal handler, - * it is best to put any additional logic before or after the - * PreRestoreCommand()/PostRestoreCommand() section. - */ - PreRestoreCommand(); - - /* - * Copy xlog from archival storage to XLOGDIR - */ - rc = system(xlogRestoreCmd); - - PostRestoreCommand(); - - pgstat_report_wait_end(); + /* Copy xlog from archival storage to XLOGDIR */ + rc = pg_system(xlogRestoreCmd, WAIT_EVENT_RESTORE_COMMAND); pfree(xlogRestoreCmd); if (rc == 0) @@ -325,11 +306,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName, /* * execute the constructed command */ - fflush(NULL); - pgstat_report_wait_start(wait_event_info); - rc = system(xlogRecoveryCmd); - pgstat_report_wait_end(); - + rc = pg_system(xlogRecoveryCmd, wait_event_info); pfree(xlogRecoveryCmd); if (rc != 0) diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c index 828723afe47..5dd8e2c7247 100644 --- a/src/backend/archive/shell_archive.c +++ b/src/backend/archive/shell_archive.c @@ -22,6 +22,7 @@ #include "archive/shell_archive.h" #include "common/percentrepl.h" #include "pgstat.h" +#include "storage/fd.h" static bool shell_archive_configured(ArchiveModuleState *state); static bool shell_archive_file(ArchiveModuleState *state, @@ -75,10 +76,7 @@ shell_archive_file(ArchiveModuleState *state, const char *file, (errmsg_internal("executing archive command \"%s\"", xlogarchcmd))); - fflush(NULL); - pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND); - rc = system(xlogarchcmd); - pgstat_report_wait_end(); + rc = pg_system(xlogarchcmd, WAIT_EVENT_ARCHIVE_COMMAND); if (rc != 0) { diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c index 27e86cf393f..783eb88e59d 100644 --- a/src/backend/postmaster/startup.c +++ b/src/backend/postmaster/startup.c @@ -33,6 +33,7 @@ #include "utils/guc.h" #include "utils/memutils.h" #include "utils/timeout.h" +#include "utils/wait_event_types.h" #ifndef USE_POSTMASTER_DEATH_SIGNAL diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index a4ec7959f31..9f3b58c3767 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2731,6 +2731,48 @@ OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode) return -1; /* failure */ } +/* + * A custom wrapper around the system() function that calls the necessary + * functions pre/post-fork. + * + * If WAIT_EVENT_RESTORE_COMMAND is passed as the wait_event_info, it will also + * call the necessary PreRestoreCommand/PostRerstoreCommand functions. It's + * unfortunate that we have to do couple the behaviour of this function so + * tighlty to the restore command logic, but it's the only way to make sure + * that we don't have additional logic inbetween the PreRestoreCommand and + * PostRestoreCommand calls. + */ +int +pg_system(const char *command, uint32 wait_event_info) +{ + int rc; + + fflush(NULL); + pgstat_report_wait_start(wait_event_info); + + if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND) + { + /* + * PreRestoreCommand() informs the SIGTERM handler for the startup + * process that it should proc_exit() right away. This is done for + * the duration of the system() call because there isn't a good way to + * break out while it is executing. Since we might call proc_exit() + * in a signal handler, it is best to put any additional logic before + * or after the PreRestoreCommand()/PostRestoreCommand() section. + */ + PreRestoreCommand(); + } + + rc = system(command); + + if (wait_event_info == WAIT_EVENT_RESTORE_COMMAND) + PostRestoreCommand(); + + pgstat_report_wait_end(); + return rc; +} + + /* * Routines that want to initiate a pipe stream should use OpenPipeStream * rather than plain popen(). This lets fd.c deal with freeing FDs if diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index b77d8e5e30e..2d445674a1a 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -187,6 +187,7 @@ extern int pg_fsync_writethrough(int fd); extern int pg_fdatasync(int fd); extern bool pg_file_exists(const char *name); extern void pg_flush_data(int fd, off_t offset, off_t nbytes); +extern int pg_system(const char *command, uint32 wait_event_info); extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); base-commit: ad25744f436ed7809fc754e1a44630b087812fbc -- 2.51.1 [text/x-patch] v8-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch (9.7K, 3-v8-0002-Bump-postmaster-soft-open-file-limit-RLIMIT_NOFIL.patch) download | inline diff: From 557a63fb212e89f0448b08025fb5c88c46ea198e Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio <[email protected]> Date: Tue, 11 Feb 2025 19:15:36 +0100 Subject: [PATCH v8 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE) when necessary The default open file limit of 1024 on Linux is extremely low. The reason that this hasn't changed change is because doing so would break legacy programs that use the select(2) system call in hard to debug ways. So instead programs that want to opt-in to a higher open file limit are expected to bump their soft limit to their hard limit on startup. Details on this are very well explained in a blogpost by the systemd author[1]. There's also a similar change done by the Go language[2]. This starts bumping postmaster its soft open file limit when we realize that we'll run into the soft limit with the requested max_files_per_process GUC. We do so by slightly changing the meaning of the max_files_per_process GUC. The actual (not publicly exposed) limit is max_safe_fds, previously this would be set to: max_files_per_process - already_open_files - NUM_RESERVED_FDS After this change we now try to set max_safe_fds to max_files_per_process if the system allows that. This is deemed more natural to understand for users, because now the limit of files that they can open is actually what they configured in max_files_per_process. Adding this infrastructure to change RLIMIT_NOFILE when needed is especially useful for the AIO work that Andres is doing, because io_uring consumes a lot of file descriptors. Even without looking at AIO there is a large number of reports from people that require changing their soft file limit before starting Postgres, sometimes falling back to lowering max_files_per_process when they fail to do so[3-8]. It's also not all that strange to fail at setting the soft open file limit because there are multiple places where one can configure such limits and usually only one of them is effective (which one depends on how Postgres is started). In cloud environments its also often not possible for user to change the soft limit, because they don't control the way that Postgres is started. One thing to note is that we temporarily restore the original soft limit when shell-ing out to other executables. This is done as a precaution in case those executables are using select(2). [1]: https://0pointer.net/blog/file-descriptor-limits.html [2]: https://github.com/golang/go/issues/46279 [3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres [4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie [5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com [6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com [7]: https://github.com/abiosoft/colima/discussions/836 [8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9 --- src/backend/storage/file/fd.c | 199 +++++++++++++++++++++++++++++++--- 1 file changed, 184 insertions(+), 15 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 9f3b58c3767..37de12fbb7e 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -158,6 +158,13 @@ int max_files_per_process = 1000; */ int max_safe_fds = FD_MINFREE; /* default if not changed */ +#ifdef HAVE_GETRLIMIT +static bool saved_original_max_open_files; +static struct rlimit original_max_open_files; +static struct rlimit custom_max_open_files; +#endif + + /* Whether it is safe to continue running after fsync() fails. */ bool data_sync_retry = false; @@ -943,6 +950,152 @@ InitTemporaryFileAccess(void) #endif } +/* + * Returns true if the passed in highestfd is the last one that we're allowed + * to open based on our. This should only be called if + */ +static bool +IsOpenFileLimit(int highestfd) +{ +#ifdef HAVE_GETRLIMIT + if (!saved_original_max_open_files) + { + return false; + } + + return highestfd >= custom_max_open_files.rlim_cur - 1; +#else + return false; +#endif +} + +/* + * Increases the open file limit (RLIMIT_NOFILE) by the requested amount. + * Returns true if successful, false otherwise. + */ +static bool +IncreaseOpenFileLimit(int extra_files) +{ +#ifdef HAVE_GETRLIMIT + struct rlimit rlim; + + if (!saved_original_max_open_files) + { + return false; + } + + rlim = custom_max_open_files; + + /* If we're already at the max we reached our limit */ + if (rlim.rlim_cur == original_max_open_files.rlim_max) + return false; + + /* Otherwise try to increase the soft limit to what we need */ + rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max); + + if (setrlimit(RLIMIT_NOFILE, &rlim) != 0) + { + /* We made sure not to exceed the hard limit, so this shouldn't fail */ + ereport(WARNING, (errmsg("setrlimit failed: %m"))); + return false; + } + + custom_max_open_files = rlim; + + elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur); + + return true; +#else + return false; +#endif +} + +/* + * Saves the original open file limit (RLIMIT_NOFILE) the first time when this + * is called. If called again it's a no-op. + * + * Returns true if successful, false otherwise. + */ +static void +SaveOriginalOpenFileLimit(void) +{ +#ifdef HAVE_GETRLIMIT + int status; + + if (saved_original_max_open_files) + { + /* Already saved, no need to do it again */ + return; + } + + status = getrlimit(RLIMIT_NOFILE, &original_max_open_files); + if (status != 0) + { + ereport(WARNING, (errmsg("getrlimit failed: %m"))); + return; + } + + custom_max_open_files = original_max_open_files; + saved_original_max_open_files = true; + return; +#endif +} + +/* + * UseOriginalOpenFileLimit --- Makes the process use the original open file + * limit that was present at postmaster start. + * + * This should be called before spawning subprocesses that might use select(2) + * which can only handle file descriptors up to 1024. + */ +static void +UseOriginalOpenFileLimit(void) +{ +#ifdef HAVE_GETRLIMIT + if (!saved_original_max_open_files) + { + return; + } + + if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur) + { + /* Not changed, so no need to call setrlimit at all */ + return; + } + + if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0) + { + ereport(WARNING, (errmsg("setrlimit failed: %m"))); + } +#endif +} + +/* + * UseCustomOpenFileLimit --- Makes the process use our custom open file limit + * after that we configured based on the max_files_per_process GUC. + */ +static void +UseCustomOpenFileLimit(void) +{ +#ifdef HAVE_GETRLIMIT + if (!saved_original_max_open_files) + { + return; + } + + if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur) + { + /* Not changed, so no need to call setrlimit at all */ + return; + } + + if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0) + { + ereport(WARNING, (errmsg("setrlimit failed: %m"))); + } +#endif +} + /* * count_usable_fds --- count how many FDs the system will let us open, * and estimate how many are already open. @@ -966,38 +1119,39 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open) int highestfd = 0; int j; -#ifdef HAVE_GETRLIMIT - struct rlimit rlim; - int getrlimit_status; -#endif - size = 1024; fd = (int *) palloc(size * sizeof(int)); -#ifdef HAVE_GETRLIMIT - getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim); - if (getrlimit_status != 0) - ereport(WARNING, (errmsg("getrlimit failed: %m"))); -#endif /* HAVE_GETRLIMIT */ + SaveOriginalOpenFileLimit(); /* dup until failure or probe limit reached */ for (;;) { int thisfd; -#ifdef HAVE_GETRLIMIT - /* * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on * some platforms */ - if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1) - break; -#endif + if (IsOpenFileLimit(highestfd)) + { + if (!IncreaseOpenFileLimit(max_to_probe - used)) + break; + } thisfd = dup(2); if (thisfd < 0) { + /* + * Eventhough we do the pre-check above, it's still possible that + * the call to dup fails with EMFILE. This can happen if the last + * file descriptor was already assigned to an "already open" file. + * One example of this happening, is if we're already at the soft + * limit when we call count_usable_fds. + */ + if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used)) + continue; + /* Expect EMFILE or ENFILE, else it's fishy */ if (errno != EMFILE && errno != ENFILE) elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used); @@ -2747,6 +2901,7 @@ pg_system(const char *command, uint32 wait_event_info) { int rc; + UseOriginalOpenFileLimit(); fflush(NULL); pgstat_report_wait_start(wait_event_info); @@ -2769,6 +2924,7 @@ pg_system(const char *command, uint32 wait_event_info) PostRestoreCommand(); pgstat_report_wait_end(); + UseCustomOpenFileLimit(); return rc; } @@ -2802,6 +2958,19 @@ OpenPipeStream(const char *command, const char *mode) ReleaseLruFiles(); TryAgain: + + /* + * It would be great if we could call UseOriginalOpenFileLimit here, but + * since popen() also opens a file in the current process (this side of the + * pipe) we cannot do so safely. Because we might already have many more + * files open than the original limit. + * + * The only way to address this would be implementing a custom popen() that + * calls UseOriginalOpenFileLimit only around the actual fork call, but + * that seems too much effort to handle the corner case where this external + * command uses both select() and tries to open more files than select() + * allows for. + */ fflush(NULL); pqsignal(SIGPIPE, SIG_DFL); errno = 0; -- 2.51.1 [text/x-patch] v8-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch (1.5K, 4-v8-0003-Reflect-the-value-of-max_safe_fds-in-max_files_pe.patch) download | inline diff: From 6bf857b0ddb43b2a5abb62bcdc5669debc96edab Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio <[email protected]> Date: Wed, 12 Feb 2025 01:08:07 +0100 Subject: [PATCH v8 3/3] Reflect the value of max_safe_fds in max_files_per_process It is currently hard to figure out if max_safe_fds is significantly lower than max_files_per_process. This starts reflecting the value of max_safe_fds in max_files_per_process after our limit detection. We still want to have two separate variables because for the bootstrap or standalone-backend cases their values differ on purpose. --- src/backend/storage/file/fd.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 37de12fbb7e..3cd99054c9f 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -1196,6 +1196,7 @@ set_max_safe_fds(void) { int usable_fds; int already_open; + char *max_safe_fds_string; /*---------- * We want to set max_safe_fds to @@ -1211,6 +1212,16 @@ set_max_safe_fds(void) max_safe_fds = Min(usable_fds, max_files_per_process); + /* + * Update GUC variable to allow users to see if the result is different + * than what the used value turns out to be different than what they had + * configured. + */ + max_safe_fds_string = psprintf("%d", max_safe_fds); + SetConfigOption("max_files_per_process", max_safe_fds_string, + PGC_POSTMASTER, PGC_S_OVERRIDE); + pfree(max_safe_fds_string); + /* * Take off the FDs reserved for system() etc. */ -- 2.51.1 ^ permalink raw reply [nested|flat] 4+ messages in thread
end of thread, other threads:[~2025-11-03 17:04 UTC | newest] Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2025-04-04 17:34 Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup Heikki Linnakangas <[email protected]> 2025-04-13 19:30 ` Jelte Fennema-Nio <[email protected]> 2025-10-29 10:11 ` Peter Eisentraut <[email protected]> 2025-11-03 17:04 ` Jelte Fennema-Nio <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox