Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vJyAU-00ADW4-33 for pgsql-hackers@arkaria.postgresql.org; Fri, 14 Nov 2025 18:05:05 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vJyAS-007SLE-0V for pgsql-hackers@arkaria.postgresql.org; Fri, 14 Nov 2025 18:05:04 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vJyAR-007SL6-2T for pgsql-hackers@lists.postgresql.org; Fri, 14 Nov 2025 18:05:03 +0000 Received: from mail-wm1-x336.google.com ([2a00:1450:4864:20::336]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vJyAP-0078kC-2x for pgsql-hackers@lists.postgresql.org; Fri, 14 Nov 2025 18:05:02 +0000 Received: by mail-wm1-x336.google.com with SMTP id 5b1f17b1804b1-471191ac79dso24465975e9.3 for ; Fri, 14 Nov 2025 10:05:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1763143500; x=1763748300; darn=lists.postgresql.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=ZJGGCl8ONZ/YL5ThtP9JSxe97FOOYT2ErNnS1T1xcjg=; b=VSldASpe41lqcWp/YLE3xU+LAMe8YFdUEF/61d//ofVIqrf7/uzhnQxJYt5aVnDXGl uvc7evSPvaC5kGgONnUKsC26ZZUx2FTLhEYCIXXO5tsBLuoM+DbqYsCwrbaH7RPH4QJg aleAPJ63ejbtGB3dXXlqSqFBXuf5BvcO4Etpr4LckO3QH23LdbtyUH5vLVf9KL0sX+dS fufjt7sZDSsVn0kdH2gUgVamh0+NZ5ipTKZlDjKk+2bSkgD7v3EmknWtz6g50AUB5HZG BsRit3C3mNjxsgg+X0ISad4GIMRBKEpCi/KqlzNXYcfLsk8kIOkhQbWFJ+3vPQkqHhce fQ7w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1763143500; x=1763748300; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=ZJGGCl8ONZ/YL5ThtP9JSxe97FOOYT2ErNnS1T1xcjg=; b=LanKCDXqjIPVtUUe7OhXKbUwDAK3+zg7AYaqSoKzvxrMSXTPmh+dr9i5qTGC9hNXOt yoj0rI8l+6k1a5nYQiC4l7YH5pX/1ruMtXXUQ+Fa4lXv7sP0DbAAalEm6nPaoCMSGPwf LqTidK3kTSJm2p6ajOKCi/TXVgkXP6wZXVkIgyCceYGI4kRLe9xSLo7o5BuXVfIP+sty 9Q5ATX3YhGcHnsWpjDbx7RINpmB8SrixBLoMxxJf21kv1mjSFwyBd6uT6mvRp5/6brUz T+CyoKkJMKLQ7UHlax+UCIB9v9bvGXqt6AL90I8l4owTGF0zIetdra//NnslfjV86rNw jwcA== X-Gm-Message-State: AOJu0YyxtYuB7Jjpk4gWAOtkXt5JC4rP1dl/P6MC/zzL7DNEbh0qh+GP qAKzry5hZBTGvh2xBTCq1kInbzSelTvzFF2Kz5/OJJntOg5/qBFSdjAk X-Gm-Gg: ASbGncuhOJZh9tnjKq4Or5VPgVyn3dpG8s1TOthah4HJUndkYnUDsTyARtKbf0pIvSz M5deGFrJxlawpNTx9vKgWUjTbLvfGnuHpWag3vd7ht0I5SpMdD6ihpzQVW85erOjECHpXPWxtdM ecbUwpYU5EoB2RmR8NGyLYpFR3gnm1TJ9dvBb3tIiGi/YosFBlfo/7iXFiu2OhV65hDHiN+gi7A IFET+TxJM/CBbdjCSbZ91BnZieg8T/M+rzXhJiDLom1TmqpSZ2xZ6Qx1Yy7GjDupcB+wk4UN1tN 3tFi2xPCzvKHhlTMnCNTyvzwuTqfA7W4qIKSmTFtlBR/7XCw0HLURy5YlfFfbUGpTbMu/jVKrqL 2TaJn2Bj3OtBsDNYvl5MS1WC2qmkSkv1VFwssBwVB8/L5W0vUIvaR82X9c1GsWQky/93Lx5qaGN M+ X-Google-Smtp-Source: AGHT+IH8iSfy7pYN6kBh7agGI06kBmgFQZTBiWgwpBjjXuP9Z8FWT9t2qXpZDq3OmJaVmQ3ooypRiw== X-Received: by 2002:a05:600c:a05:b0:475:da1a:5418 with SMTP id 5b1f17b1804b1-4778fe55465mr37423515e9.1.1763143499545; Fri, 14 Nov 2025 10:04:59 -0800 (PST) Received: from [192.168.2.32] ([165.225.27.7]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-4779528bc65sm22001685e9.15.2025.11.14.10.04.58 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 14 Nov 2025 10:04:59 -0800 (PST) Message-ID: <97492f51-51cc-45cd-aeea-8ed6f48eccc2@gmail.com> Date: Fri, 14 Nov 2025 19:04:57 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: tuple radix sort To: John Naylor Cc: PostgreSQL Hackers , Peter Geoghegan References: Content-Language: en-US From: David Geier In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi John! On 13.11.2025 05:01, John Naylor wrote: > If that's the case then I suggest first seeing if dfd8e6c73ee made > things any worse. A simpler possible improvement is to use a similar > normalization step for the chars, if needed, then do the sort and > quinique with a specialization for unsigned chars. (We don't yet > specialize qunique, but that can be remedied). If you're interested, > please start a separate thread for that. It did but only a bit. I worked around it by having two sort specializations, one for signed and one for unsigned. I also wanted to try to use a hash table to filter out duplicates and then only sort the remaining unique trigams, which are, most of the times, a lot less. Generally speaking, the GIN code is death by a thousand cuts. I've got a patch coming up that cuts CREATE INDEX runtime in half for columns with relatively short strings and yields even better results for columns with longer strings. But that's not only changing the sort but requires a few changes in a couple of places. More details in the upcoming thread. I thought qunique() is already pretty optimal because it's defined in a header file. I believe that even the comparator gets inlined. What would be useful though is if qunique() used an equality comparator which only returns true/false instead of a sort comparator. In the GIN code this also shaved off a few percent. I'll take a closer look at qunique() at open a thread with the findings / ideas for changes. Anyways. In this context GIN was just one example for where a generic radix sort would be useful and there are certainly more. > > That's moving the goalposts too far IMO. I want to get to a place > where I feel comfortable with the decisions made, and that already > requires a lot of testing. Also, I don't want to risk introducing > abstractions that make future improvements to tuplesort more > cumbersome. On a quick glance it looks like you didn't specialize much. So the testing seems related to if the new algo introduces regressions, not if the abstraction would cause problems. So it should be possible to extract out the code fairly easily without invalidating your existing benchmark results. I understand that you want to make progress with the use case at hand but I feel like we're missing out on a lot of opportunity where the introduced code would also be very beneficial. Beyond that we could nicely test the new sort code in the spirit of test_rbtree.c and friends. Maybe you want to give it a 2nd thought. -- David Geier