Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w0lZv-002BrV-1s for pgsql-hackers@arkaria.postgresql.org; Thu, 12 Mar 2026 19:20:15 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w0lZu-00Gr0c-0F for pgsql-hackers@arkaria.postgresql.org; Thu, 12 Mar 2026 19:20:14 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w0lZt-00Gr0T-1o for pgsql-hackers@lists.postgresql.org; Thu, 12 Mar 2026 19:20:14 +0000 Received: from mail-dy1-x132d.google.com ([2607:f8b0:4864:20::132d]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w0lZq-00000001ofw-2nIi for pgsql-hackers@postgresql.org; Thu, 12 Mar 2026 19:20:12 +0000 Received: by mail-dy1-x132d.google.com with SMTP id 5a478bee46e88-2be1b5fe11cso2244670eec.0 for ; Thu, 12 Mar 2026 12:20:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=j-davis-com.20230601.gappssmtp.com; s=20230601; t=1773343210; x=1773948010; darn=postgresql.org; h=mime-version:user-agent:content-transfer-encoding:references :in-reply-to:date:cc:to:from:subject:message-id:from:to:cc:subject :date:message-id:reply-to; bh=KCc0FZqH5F0N4VDzlEYaRpGWV5YIg4WlXczKZHWAFJc=; b=1FBYqvJUmAmItZecQ0rE3w04hggcN1S9PefVGvydB3qFisIfrf+9IzOUc3Q5E5nwE5 Xj7dlKK+lY44SF+YdiUXK3uTd2GxDFKzRlwyuuGwZRhjlAGeQWyGJJktUunHf0z5dPm7 s0vSlkc1ujcCc62Xy83xBCZ5xy6GNGLSNPQPAuemvIGflZyYvNsk9n4ks+b6gEq6pW/a +r/lM9ChvrJCNwB02GWWAyBwfJQZf7VrkUWqeWPwkEL9C0RPq2H0HFN45GkUvRVPbLKi l8YzLVMQayOMrFubUqP1++QAYLOf8zjhNRG33/1Eny1wlrWrdNWepd2mF0xUTRmLWlWf F4DA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1773343210; x=1773948010; h=mime-version:user-agent:content-transfer-encoding:references :in-reply-to:date:cc:to:from:subject:message-id:x-gm-gg :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=KCc0FZqH5F0N4VDzlEYaRpGWV5YIg4WlXczKZHWAFJc=; b=Hym3+ibCZk1P/WDP34vbRU7ipzaCrgpBE1Rkkkk/wicLTOA4BV1abFEyKd3hb20nIP Bb7sm182IAZ1FDzXdRESRJzc4WTb3kQd/zgIRWa8sGXPb3YpzTBPJo8Dot7tXdV7zVnm RtS+kAN4aWfwlWMqJfzZNqy4we4aIhMtBnI7zOQFKCYBV2x7RknsF6mA1YRjw0ICKYZz b625uxRaIpZxoMHetGheeGbpl86bsu/AfQATfZf+bRUclJIjnf0O54QRjTM9Eu2s94ru 2D/2QagIdqKnjcJoAU/cw2o7yNqr5H6Ywm3Euj2plkIUC5tJbpGSCUYbr0u4CzQExstp ++UA== X-Gm-Message-State: AOJu0YyTrgG7t1zD1EsudGiVEuA5W2wMByfjPys77m8iJJcyW1eQsOsJ iAji3gD3eBT4jsKLmgYO4tC2WVN10I7EViG42JZwPtkCxw03+ZRbb+629UdBMw5dxQ== X-Gm-Gg: ATEYQzxxj09fRLluOzNqiCYWf5CWqUUZ1T4FSEskb/+irLQ1QRfPIyt2cu7evYQuZvQ xvbgb64WYtxMnuduEmMFbLlv2CMFU/VgsouHgiBFlMydfqi5GYYw2SWx+TVcu8r7AtGZX3EOmVR Joeakk5cZESQTy0cS7FRd8mRZ0nKAzR6k2wkyY+Q+Y7rIOvFk1bq/ee52zSRCjD571mbf3v3rFT Y7+YcuVC7D99KytVQXmOQCKZYhpr0YmRPNhILS7VYNWn238LzuwS/ELDMKzAHkpuDcGrCtFWLql JBcRYGj26iINqKpr9h/5+pS/3w+F4OWOssnpVomsXUEGQkxPUcc09+YR2RzInxLFnmSlj7+pfeN RgciI/u02hjVLYU5X9av5Cp4vBDgSyKIQQ0XOfyNOZgozlE1iyaWXiMk2j0mS939aPUzIX4Y2AB YwrD6ZULLuMu2GpSLIlrYvVsg79xTadrwU8ggqHJ6wcn472UqZn+o= X-Received: by 2002:a05:7301:9bcb:b0:2be:2f62:8bb6 with SMTP id 5a478bee46e88-2bea553a875mr359200eec.30.1773343209716; Thu, 12 Mar 2026 12:20:09 -0700 (PDT) Received: from jeff-laptop.lan (c-24-7-19-3.hsd1.ca.comcast.net. [24.7.19.3]) by smtp.gmail.com with ESMTPSA id 5a478bee46e88-2be9df326f6sm2584365eec.1.2026.03.12.12.20.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 12 Mar 2026 12:20:09 -0700 (PDT) Message-ID: Subject: Re: Change initdb default to the builtin collation provider From: Jeff Davis To: Robert Haas Cc: pgsql-hackers@postgresql.org Date: Thu, 12 Mar 2026 12:20:07 -0700 In-Reply-To: References: <47e1b4f72fe732c5ae85c6cf2c085b4e99a10120.camel@j-davis.com> <4309879ac305b1cf6b4d7b5fb85bc7b62c6ab768.camel@j-davis.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.52.3-0ubuntu1.1 MIME-Version: 1.0 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, 2026-03-12 at 10:04 -0400, Robert Haas wrote: > Yes. I think actually one of the big challenges right now is making > sure that when you initdb to do a pg_upgrade, you get the right > settings to make the upgrade work. pg_upgrade should copy the locale settings to the new cluster as of 9637badd9f. If there are still some rough edges here, let me know. >=20 > I don't have total information, but I think they mostly use a single > locale. If they have extremely specific needs, they are likely to end > up with ICU, else they pick a glibc locale. I have no idea how likely > that glibc locale is to match their environment. I wouldn't bet on it > being the norm, but I wouldn't bet against whatever they have in the > environment being more usable than "C". That's interesting. In other words, (in your sample) users aren't worried about the precise sort order in their native language; it's just that ASCII is particularly bad, and almost any "real" locale is more appealing. If the concern is mostly that ASCII is particularly bad, how much of that is because case is a high-order bit (i.e. 'A' < 'Z' < 'a' < 'z')? > >=20 > It's tough if people have range scans. Range scans using a natural language collation are dubious. It can't be for a prefix search; LIKE 'myprefix%' needs the index to be defined with text_pattern_ops (which is code point order), so the default isn't going to work for them anyway. (A prefix search can't be implemented with a range scan in natural language collation because, e.g. in the cs_CZ locale, 'cha' does not fall between 'ch' and 'ci'.) So how often is a range scan using a natural language collation actually useful? I'm sure there are some real cases, but I'd say it's usually a mistake and they are quite possibly getting wrong results. > Not everybody does, but they > also don't know whether or not they will want them when they're > making > setup choices. Picking a locale that matches their desired sort order > *in case* they end up using range scans in some queries feels like > the > "safe" coice. I have trouble understanding this perspective: slow all indexes down (and sorts, too), and risk index inconsistencies just in case someone ends up doing a range scan on one of the indexes? How is that safer? > >=20 > What I'm > most worried about is the population of users -- which I guess to be > large -- who do not have a strong preference but won't be happy with > something as dumb as "C". If even a small fraction of users create a > database using "C" unintentionally and load a terabyte of data into > it > before realizing that all their text indexes are sorting "wrong", I > suspect that's not going to be much fun. This is where we differ: even in that case, I believe all (or nearly all) of that user's indexes would be better. When you look at the conditions that must be true for an index with a natural language collation to be useful, it's certainly not the normal case, and I'd bet it's closer to "rare": * the use case must be real (not relying on faulty assumptions about lexicographical ordering) * the input data must be large enough to benefit from an index scan * one of the following must be true: - the index needs to be correlated with the heap order (seems unlikely; correlation usually happens with sequences, timestamps, etc., not natural language text values); or - needs to be eligible for an index only scan (plausible); or - the amount of data read must be small enough that correlation with the heap doesn't matter * the result data needs to be small enough for a human to consume it (otherwise why bother with natural language?) * the performance improvement must be enough to offset the penalty for equality searches and index maintenance While each of those is plausible, when combined, I think it's far from the typical case. It's perfectly reasonable to say the user may be upset about the way the final result order looks, but making all the index orderings worse is not a good way to fix that. Regards, Jeff Davis