Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vSRMl-00Ego2-32 for pgsql-hackers@arkaria.postgresql.org; Mon, 08 Dec 2025 02:52:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vSRMk-00EXLs-0I for pgsql-hackers@arkaria.postgresql.org; Mon, 08 Dec 2025 02:52:46 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vSRMj-00EXLk-24 for pgsql-hackers@lists.postgresql.org; Mon, 08 Dec 2025 02:52:46 +0000 Received: from mail-pl1-x62f.google.com ([2607:f8b0:4864:20::62f]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vSRMg-003gNN-2V for pgsql-hackers@lists.postgresql.org; Mon, 08 Dec 2025 02:52:44 +0000 Received: by mail-pl1-x62f.google.com with SMTP id d9443c01a7336-2980d9b7df5so45953375ad.3 for ; Sun, 07 Dec 2025 18:52:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1765162362; x=1765767162; darn=lists.postgresql.org; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:from:to:cc:subject:date:message-id:reply-to; bh=NaCkWyHWCt69NZRa+Ocni0gv+XDwinjL7/lG0/NoBrE=; b=RgeiwIS/nRvWVYoJRqcxek+IwlBY6YumiufJP6zXK7MBUuIAHUlCZbzMS1cZg5biMM yXqMZPpjKZeffn8fTIFIszAnPASTilM9TAf0MIJQBGOjZLHyx1q0sFHcQKu6QnNcG67f GagDCHiiIwwpHV1kYfNdcR/UREnszhKZRrbQ56fpQW8h7QyMX/qB5DuURo7GgmVMIYV1 /XnLnwLvnlG2LXsN/1azV7QShyNgQwclJKoREgEYqkH6xj7/xtvVlDLpcAHCwZ23dECx Z+qRay4EaXzFtUkgvemO6NXu4BLwsJ89qp/z9td4aR37BVmZk89tZ8lImnprNvYV6T60 SAjg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1765162362; x=1765767162; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=NaCkWyHWCt69NZRa+Ocni0gv+XDwinjL7/lG0/NoBrE=; b=CANjCWEdo5shqCeXV0++rtsPqN2GCcyMuyqGdMqkC97MqX1y5XKTNW1s4pxPJKVifF LLwySGr5NoMugCmFAXP2YZUdhIhp/9biOVuTIV5YgDokAnOw8q4ED+PO0GtkAiLIhqNZ tXBfdka297YLGvj7/5WKEHeLLhRFwVbPx+hSCPOV6ggmEpwMMkXQhCmJ2OLpmmArKHcJ edoSSomgyhfV2CsyWGk8+WDEs1n3nH+YjKMrlC+Icv0BONG2qtHwhSbT5j7hfCy84jKQ hFVLla2P4+zN1OwCxpDPZe5fZUcc9XO/wcQRPurp3W/mn33sXKRby18xXg86MdbPfZ/k EUqQ== X-Forwarded-Encrypted: i=1; AJvYcCWaVHeVWejNMURFscKhulQSOnRjs8Jdy272Djbe04JzG4PkzfyQUvXVlz275dki/YWUQg4T+ym03zAAUQj9@lists.postgresql.org X-Gm-Message-State: AOJu0YzjtXyhs0aQfJsG9/HW+7LymIRiMJr5IfS5aJAX5QPvRFNWLbsE MEKD0RZDVmKjeyO1LRcn9TCx6d4MKCkcsrLewWNbCTiUjb7+/SiKBly7 X-Gm-Gg: ASbGnctu04QbO+BlWNYVsfeVRTwm5tMWW43ebGK8+A8dzQBg9nVIgp4U6zGF4Aeqnbt mmo/dZvzB4VjaLL1pKei7X85kRyKA7SK8ySc6bBsrtlj2MWdn04BS/4eRBuz98yZxHn3ifPHRzb Eg5QpBDBDgExEESOaIlN83TPJlaBvcK3ooUlgxwe4i732Lp+1OZvbSAQRxmkNtKDCikaH3w3h7O /oNKR3t8U76KvmZQW2u4nWTJGlZysZmygPyXDGyJkX4T45zbM5XTEy4cbhMeN1EaW0dNWNiinQ6 TDJlIG99agop0lGsZyR/dEDliR3GJVd9nqIThPFL5IM48tK4+KEW5m/l6r79Kdm02ZlaNMc6Ce9 p4mQ0g0QxF7l2vTYpwpMyPvZlcVtbiXkxmpfm2aYV5BQDwf5nJqPdrTlyOIU6RUP/pCywsnTB+C bvmcipRm0hozIYGeBdlLQ= X-Google-Smtp-Source: AGHT+IFCLGanmxkj9uvbDH90ZwsN198fZcHnbkBzRxyz8L20rN4BjjbbGVCXfWqonEviHJfrwTqW9A== X-Received: by 2002:a17:903:240f:b0:297:d45b:6d97 with SMTP id d9443c01a7336-29df59a8867mr57472685ad.14.1765162361884; Sun, 07 Dec 2025 18:52:41 -0800 (PST) Received: from smtpclient.apple ([45.32.121.103]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-29daeae6b47sm108711315ad.94.2025.12.07.18.52.39 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Sun, 07 Dec 2025 18:52:41 -0800 (PST) From: Chao Li Message-Id: <9D3D4647-868B-4562-B382-D201478AD67B@gmail.com> Content-Type: multipart/mixed; boundary="Apple-Mail=_189EDA60-ABDE-4377-A6EB-4BAC38D353DB" Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3826.700.81\)) Subject: Re: tuple radix sort Date: Mon, 8 Dec 2025 10:52:02 +0800 In-Reply-To: Cc: Chengpeng Yan , "pgsql-hackers@lists.postgresql.org" , Peter Geoghegan To: John Naylor References: <269A8FB9-6D43-43CF-A6FE-52D28CBDB8A9@Outlook.com> <606C775A-4C1A-482B-BE7D-2E7A46AE14B9@gmail.com> X-Mailer: Apple Mail (2.3826.700.81) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --Apple-Mail=_189EDA60-ABDE-4377-A6EB-4BAC38D353DB Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 Hi John, I played the radix sort again during the weekend. First, I changed my direction and implemented the in-place switching in = the other way, where I did a way like chained-switching. Say starting = from item0, for example, switching item0 to item5, then check where = item5 should be switched to, and makes the switch, till an item is = switch to position 0. See my implementation in other-implemation.diff if = you are interested in it. This time, I eyeball checked the sort result = and confirmed the correction. But my implementation is slightly slower = than your implementation, based on the same test procedure I described = in my previous email, my implementation is roughly ~3% slower your = implementation. So I think that at least proves your current = implementation in v5 has been perfectly fine tuned. Then I went back to read your implementation again, I found a tiny = optimization, where we can move =E2=80=9Ccount sorted partitions=E2=80=9D = to before the =E2=80=9Cfor=E2=80=9D loop, which avoid sorted partition = to go through the =E2=80=9Cfor=E2=80=9D loop, and my test shows that the = movement may lead to ~1% improvement. See the change in = radixsort_tiny_optimizeation.diff. I also noticed that, there could be cases where target element is = already in the right partition, so that switching could be unnecessary. = However if we want to avoid such unnecessary switches, then we will need = to add a =E2=80=9Cif=E2=80=9D check. Given the total number of such = cases is tiny, the =E2=80=9Cif=E2=80=9D check would be more expensive = than performing blindly switching. I tried to add such a check like: ``` if (offset =3D=3D (size_t) (st - begin)) continue; /* already in correct position */ ``` With my test, it just makes the query ~3% slower. So, now I think I can wrap up this round of playing. My only suggestion = is radixsort_tiny_optimizeation.diff. I may revisit this patch again = once you make the entire patch set ready for review. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ --Apple-Mail=_189EDA60-ABDE-4377-A6EB-4BAC38D353DB Content-Disposition: attachment; filename=other-implemation.diff Content-Type: application/octet-stream; x-unix-mode=0644; name="other-implemation.diff" Content-Transfer-Encoding: 7bit commit ee29bccc4540f61dca283e4404ff140705628552 Author: Chao Li (Evan) Date: Sat Dec 6 07:03:14 2025 +0800 the other implmentation diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 028c5b71c27..5e3e49ba22e 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -632,6 +632,7 @@ typedef struct RadixPartitionInfo size_t count; size_t offset; }; + size_t begin_offset; size_t next_offset; } RadixPartitionInfo; @@ -2763,13 +2764,14 @@ static void radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *state) { RadixPartitionInfo partitions[256] = {0}; - uint8_t remaining_partitions[256] = {0}; + uint8_t remaining_partitions[256]; size_t total = 0; int num_partitions = 0; - int num_remaining; + //int num_remaining; SortSupport ssup = &state->base.sortKeys[0]; - size_t start_offset = 0; - SortTuple *partition_begin = begin; + SortTuple pending, tmp; + //size_t start_offset = 0; + //SortTuple *partition_begin = begin; /* count number of occurrences of each byte */ for (SortTuple *tup = begin; tup < begin + n_elems; tup++) @@ -2796,6 +2798,7 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st if (count != 0) { partitions[i].offset = total; + partitions[i].begin_offset = total; total += count; remaining_partitions[num_partitions] = i; num_partitions++; @@ -2803,7 +2806,7 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st partitions[i].next_offset = total; } - num_remaining = num_partitions; + //num_remaining = num_partitions; /* * Swap tuples to correct partition. @@ -2825,45 +2828,75 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st * previous loop iteration results in only one partition that hasn't been * counted as sorted, we know it's actually sorted and can exit the loop. */ - while (num_remaining > 1) + // while (num_remaining > 1) + // { + // /* start the count over */ + // num_remaining = num_partitions; + + // for (int i = 0; i < num_partitions; i++) + // { + // uint8 idx = remaining_partitions[i]; + + // for (SortTuple *st = begin + partitions[idx].offset; + // st < begin + partitions[idx].next_offset; + // st++) + // { + // size_t offset = partitions[st->current_byte].offset++; + // SortTuple tmp; + + // /* swap current tuple with destination position */ + // Assert(offset < n_elems); + // tmp = *st; + // *st = begin[offset]; + // begin[offset] = tmp; + + // CHECK_FOR_INTERRUPTS(); + // }; + + // /* count sorted partitions */ + // if (partitions[idx].offset == partitions[idx].next_offset) + // num_remaining--; + // } + // } + if (num_partitions > 1) { - /* start the count over */ - num_remaining = num_partitions; - - for (int i = 0; i < num_partitions; i++) + for (int i = 0; i < n_elems; i++) { - uint8 idx = remaining_partitions[i]; + SortTuple *st = begin + i; + + if (i >= partitions[st->current_byte].begin_offset && + i < partitions[st->current_byte].next_offset) + continue; - for (SortTuple *st = begin + partitions[idx].offset; - st < begin + partitions[idx].next_offset; - st++) + while (true) { - size_t offset = partitions[st->current_byte].offset++; - SortTuple tmp; + size_t offset = partitions[st->current_byte].offset; + + CHECK_FOR_INTERRUPTS(); + + /* target is the empty position, we are done */ + if (offset == i) + { + begin[i] = pending; + break; + } - /* swap current tuple with destination position */ Assert(offset < n_elems); tmp = *st; - *st = begin[offset]; + pending = begin[offset]; begin[offset] = tmp; - - CHECK_FOR_INTERRUPTS(); + st = &pending; + partitions[begin[offset].current_byte].offset ++; }; - - /* count sorted partitions */ - if (partitions[idx].offset == partitions[idx].next_offset) - num_remaining--; } } /* recurse */ - for (uint8_t *rp = remaining_partitions; - rp < remaining_partitions + num_partitions; - rp++) + for (int i = 0; i < num_partitions; i++) { - size_t end_offset = partitions[*rp].next_offset; - SortTuple *partition_end = begin + end_offset; - ptrdiff_t num_elements = end_offset - start_offset; + uint8_t rp = remaining_partitions[i]; + SortTuple *partition_begin = begin + partitions[rp].begin_offset; + size_t num_elements = partitions[rp].next_offset - partitions[rp].begin_offset; if (num_elements > 1) { @@ -2897,9 +2930,6 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st state); } } - - start_offset = end_offset; - partition_begin = partition_end; } } --Apple-Mail=_189EDA60-ABDE-4377-A6EB-4BAC38D353DB Content-Disposition: attachment; filename=radixsort_tiny_optimizeation.diff Content-Type: application/octet-stream; x-unix-mode=0644; name="radixsort_tiny_optimizeation.diff" Content-Transfer-Encoding: 7bit diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 028c5b71c27..5185d8403d5 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -2834,6 +2834,13 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st { uint8 idx = remaining_partitions[i]; + /* count sorted partitions */ + if (partitions[idx].offset == partitions[idx].next_offset) + { + num_remaining--; + continue; + } + for (SortTuple *st = begin + partitions[idx].offset; st < begin + partitions[idx].next_offset; st++) @@ -2841,6 +2848,9 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st size_t offset = partitions[st->current_byte].offset++; SortTuple tmp; + //if (offset == (size_t) (st - begin)) + // continue; /* already in correct position */ + /* swap current tuple with destination position */ Assert(offset < n_elems); tmp = *st; @@ -2849,10 +2859,6 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st CHECK_FOR_INTERRUPTS(); }; - - /* count sorted partitions */ - if (partitions[idx].offset == partitions[idx].next_offset) - num_remaining--; } } --Apple-Mail=_189EDA60-ABDE-4377-A6EB-4BAC38D353DB--