public inbox for [email protected]  
help / color / mirror / Atom feed
From: Chao Li <[email protected]>
To: John Naylor <[email protected]>
Cc: Chengpeng Yan <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Peter Geoghegan <[email protected]>
Subject: Re: tuple radix sort
Date: Mon, 8 Dec 2025 10:52:02 +0800
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <CANWCAZYzx7a7E9AY16Jt_U3+GVKDADfgApZ-42SYNiig8dTnFA@mail.gmail.com>
	<CANWCAZbsx5VY-F5RVvZ5H1abBWcinMY90gNra6EbdGS1CTx=7g@mail.gmail.com>
	<CANWCAZaN8kW6o7Ymc_jzPO_Z5SqekQTTNra3xTGYTH+cjrVp8g@mail.gmail.com>
	<[email protected]>
	<CANWCAZZiMGj6QuHfyBPv9at7xn23NPAz4nit=G6fr26V+h8MKg@mail.gmail.com>
	<[email protected]>
	<CANWCAZaKRzsYPn9YWRR8g5dkX-ep_jgyUb4r3KFekK5ZTNo7ew@mail.gmail.com>
	<[email protected]>
	<CANWCAZaBeDPfF0sCdSssRXv6nBZeRP8Z=OO1731zsoUC58U5_Q@mail.gmail.com>
	<[email protected]>

Hi John,

I played the radix sort again during the weekend.

First, I changed my direction and implemented the in-place switching in the other way, where I did a way like chained-switching. Say starting from item0, for example, switching item0 to item5, then check where item5 should be switched to, and makes the switch, till an item is switch to position 0. See my implementation in other-implemation.diff if you are interested in it. This time, I eyeball checked the sort result and confirmed the correction. But my implementation is slightly slower than your implementation, based on the same test procedure I described in my previous email, my implementation is roughly ~3% slower your implementation. So I think that at least proves your current implementation in v5 has been perfectly fine tuned.

Then I went back to read your implementation again, I found a tiny optimization, where we can move “count sorted partitions” to before the “for” loop, which avoid sorted partition to go through the “for” loop, and my test shows that the movement may lead to ~1% improvement. See the change in radixsort_tiny_optimizeation.diff.

I also noticed that, there could be cases where target element is already in the right partition, so that switching could be unnecessary. However if we want to avoid such unnecessary switches, then we will need to add a “if” check. Given the total number of such cases is tiny, the “if” check would be more expensive than performing blindly switching. I tried to add such a check like:
```
if (offset == (size_t) (st - begin))
 continue; /* already in correct position */
```
With my test, it just makes the query ~3% slower.

So, now I think I can wrap up this round of playing. My only suggestion is radixsort_tiny_optimizeation.diff. I may revisit this patch again once you make the entire patch set ready for review.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/






Attachments:

  [application/octet-stream] other-implemation.diff (4.7K, 2-other-implemation.diff)
  download | inline diff:
commit ee29bccc4540f61dca283e4404ff140705628552
Author: Chao Li (Evan) <[email protected]>
Date:   Sat Dec 6 07:03:14 2025 +0800

    the other implmentation

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 028c5b71c27..5e3e49ba22e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -632,6 +632,7 @@ typedef struct RadixPartitionInfo
 		size_t		count;
 		size_t		offset;
 	};
+	size_t		begin_offset;
 	size_t		next_offset;
 }			RadixPartitionInfo;
 
@@ -2763,13 +2764,14 @@ static void
 radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *state)
 {
 	RadixPartitionInfo partitions[256] = {0};
-	uint8_t		remaining_partitions[256] = {0};
+	uint8_t		remaining_partitions[256];
 	size_t		total = 0;
 	int			num_partitions = 0;
-	int			num_remaining;
+	//int			num_remaining;
 	SortSupport ssup = &state->base.sortKeys[0];
-	size_t		start_offset = 0;
-	SortTuple  *partition_begin = begin;
+	SortTuple pending, tmp;
+	//size_t		start_offset = 0;
+	//SortTuple  *partition_begin = begin;
 
 	/* count number of occurrences of each byte */
 	for (SortTuple *tup = begin; tup < begin + n_elems; tup++)
@@ -2796,6 +2798,7 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st
 		if (count != 0)
 		{
 			partitions[i].offset = total;
+			partitions[i].begin_offset = total;
 			total += count;
 			remaining_partitions[num_partitions] = i;
 			num_partitions++;
@@ -2803,7 +2806,7 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st
 		partitions[i].next_offset = total;
 	}
 
-	num_remaining = num_partitions;
+	//num_remaining = num_partitions;
 
 	/*
 	 * Swap tuples to correct partition.
@@ -2825,45 +2828,75 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st
 	 * previous loop iteration results in only one partition that hasn't been
 	 * counted as sorted, we know it's actually sorted and can exit the loop.
 	 */
-	while (num_remaining > 1)
+	// while (num_remaining > 1)
+	// {
+	// 	/* start the count over */
+	// 	num_remaining = num_partitions;
+
+	// 	for (int i = 0; i < num_partitions; i++)
+	// 	{
+	// 		uint8		idx = remaining_partitions[i];
+
+	// 		for (SortTuple *st = begin + partitions[idx].offset;
+	// 			 st < begin + partitions[idx].next_offset;
+	// 			 st++)
+	// 		{
+	// 			size_t		offset = partitions[st->current_byte].offset++;
+	// 			SortTuple	tmp;
+
+	// 			/* swap current tuple with destination position */
+	// 			Assert(offset < n_elems);
+	// 			tmp = *st;
+	// 			*st = begin[offset];
+	// 			begin[offset] = tmp;
+
+	// 			CHECK_FOR_INTERRUPTS();
+	// 		};
+
+	// 		/* count sorted partitions */
+	// 		if (partitions[idx].offset == partitions[idx].next_offset)
+	// 			num_remaining--;
+	// 	}
+	// }
+	if (num_partitions > 1)
 	{
-		/* start the count over */
-		num_remaining = num_partitions;
-
-		for (int i = 0; i < num_partitions; i++)
+		for (int i = 0; i < n_elems; i++)
 		{
-			uint8		idx = remaining_partitions[i];
+			SortTuple *st = begin + i;
+
+			if (i >= partitions[st->current_byte].begin_offset &&
+				i < partitions[st->current_byte].next_offset)
+				continue;
 
-			for (SortTuple *st = begin + partitions[idx].offset;
-				 st < begin + partitions[idx].next_offset;
-				 st++)
+			while (true)
 			{
-				size_t		offset = partitions[st->current_byte].offset++;
-				SortTuple	tmp;
+				size_t		offset = partitions[st->current_byte].offset;
+
+				CHECK_FOR_INTERRUPTS();
+
+				/* target is the empty position, we are done */
+				if (offset == i)
+				{
+					begin[i] = pending;
+					break;
+				}
 
-				/* swap current tuple with destination position */
 				Assert(offset < n_elems);
 				tmp = *st;
-				*st = begin[offset];
+				pending = begin[offset];
 				begin[offset] = tmp;
-
-				CHECK_FOR_INTERRUPTS();
+				st = &pending;
+				partitions[begin[offset].current_byte].offset ++;
 			};
-
-			/* count sorted partitions */
-			if (partitions[idx].offset == partitions[idx].next_offset)
-				num_remaining--;
 		}
 	}
 
 	/* recurse */
-	for (uint8_t *rp = remaining_partitions;
-		 rp < remaining_partitions + num_partitions;
-		 rp++)
+	for (int i = 0; i < num_partitions; i++)
 	{
-		size_t		end_offset = partitions[*rp].next_offset;
-		SortTuple  *partition_end = begin + end_offset;
-		ptrdiff_t	num_elements = end_offset - start_offset;
+		uint8_t		rp = remaining_partitions[i];
+		SortTuple *partition_begin = begin + partitions[rp].begin_offset;
+		size_t	num_elements = partitions[rp].next_offset - partitions[rp].begin_offset;
 
 		if (num_elements > 1)
 		{
@@ -2897,9 +2930,6 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st
 							state);
 			}
 		}
-
-		start_offset = end_offset;
-		partition_begin = partition_end;
 	}
 }
 


  [application/octet-stream] radixsort_tiny_optimizeation.diff (1.3K, 3-radixsort_tiny_optimizeation.diff)
  download | inline diff:
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 028c5b71c27..5185d8403d5 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2834,6 +2834,13 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st
 		{
 			uint8		idx = remaining_partitions[i];
 
+			/* count sorted partitions */
+			if (partitions[idx].offset == partitions[idx].next_offset)
+			{
+				num_remaining--;
+				continue;
+			}
+
 			for (SortTuple *st = begin + partitions[idx].offset;
 				 st < begin + partitions[idx].next_offset;
 				 st++)
@@ -2841,6 +2848,9 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st
 				size_t		offset = partitions[st->current_byte].offset++;
 				SortTuple	tmp;
 
+				//if (offset == (size_t) (st - begin))
+				//	continue;	/* already in correct position */
+
 				/* swap current tuple with destination position */
 				Assert(offset < n_elems);
 				tmp = *st;
@@ -2849,10 +2859,6 @@ radix_sort_tuple(SortTuple *begin, size_t n_elems, int level, Tuplesortstate *st
 
 				CHECK_FOR_INTERRUPTS();
 			};
-
-			/* count sorted partitions */
-			if (partitions[idx].offset == partitions[idx].next_offset)
-				num_remaining--;
 		}
 	}
 


view thread (39+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: tuple radix sort
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox