Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vAx8J-008C1K-HA for pgsql-hackers@arkaria.postgresql.org; Mon, 20 Oct 2025 21:09:34 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1vAx8I-004IsL-GY for pgsql-hackers@arkaria.postgresql.org; Mon, 20 Oct 2025 21:09:33 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vAx8I-004Iqe-6l for pgsql-hackers@lists.postgresql.org; Mon, 20 Oct 2025 21:09:33 +0000 Received: from mail-pl1-x62a.google.com ([2607:f8b0:4864:20::62a]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vAx8E-003Ib9-38 for pgsql-hackers@postgresql.org; Mon, 20 Oct 2025 21:09:32 +0000 Received: by mail-pl1-x62a.google.com with SMTP id d9443c01a7336-26e68904f0eso47844475ad.0 for ; Mon, 20 Oct 2025 14:09:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1760994569; x=1761599369; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=mDfBrpz6cwj8pQTM9TsNkrkxuIrk+9N7tgV8ClpS+ZE=; b=Przx3g1csFG0C1XthZqU1rE+Z8fA/tX0/UBbMBp1/AdSAg5Bv46AoKZaRN1FWQhCdC 37VY8gB/aSOYrViOQ4ucSwEhekTIP2nnI4cmHZObt96IhxVcbIY+okKLdzY4j/r6tr6B nvLrsuMBQ+/wMUwa0CWXDZHf3xUe3xcpAjdkPCj06EgCxdPg6t09pORXi829mdd8mQfW C1oD5Ch92NDgxkX/btZQHTJcW5wHXBzdKdZUG5b7eEtTLFPz7xSEPBHW8sA7XQbVRpn4 qGc+veWl2neTID2RrUWt4LFZ6lpWNOElju4EFnQSPJ2qvb8/7WgMXHqDA8nc0Nz2VIkI l9OQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1760994569; x=1761599369; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=mDfBrpz6cwj8pQTM9TsNkrkxuIrk+9N7tgV8ClpS+ZE=; b=C9I5M5FsWt2AaNDSNkGmkeipPOPwg3Kv1JaHYUTA4WLNwUdWSG1gUZ6YlWrFrloSVE wsCIB5iUV01OvLPbKNiNPpQ5KnszWMtkqJ+1Pq5xZX/8StLjv4Q7kAebeXI3m4d4rvGh 311PQp9lr+H3QcCczUHSiO72Sdbyl4ekxMPEgX4v5+Jpg45+YoHvuMDmw0bMq8NxYVTk yTzwYYpranHS8Byy5ymNMv5GpURNM27MKejuv3S0GGaE9dslK2uLkY8WMHtEAKw38r/w RLwl51FCrodlIOXLGUYkRzY2QjJA70jlRaZPzpzjqjlsoGvoUfyPIEf6zB2e5iY7DCh8 Lf0w== X-Forwarded-Encrypted: i=1; AJvYcCXVMeyVwvtCvFYIn3FvtL9k3UITzmOS8sWuDtzrFy0Qm+NDDmREZHYutijiQT/Flg/VB+l9v+wDJdYwMPHZ@postgresql.org X-Gm-Message-State: AOJu0Yw4KXXVg71TJERXLiYjCs4YDCfDNjgSg81BWvzneU2Pya7KzeCU OXDoXoKQKLUH0FrcE/74nlRUt9lfra42mDLFGmKbnzdfTDPXyMHLP2k7lW7woYDFR885kBhobqC dXJoX57ttpmAlwXUH9iSGPwsMtsZAxe0= X-Gm-Gg: ASbGnctu/o5U67iaYoQBKIEyB23AeKK/gb65c1z/6/x0kweECuJmrzR5m+IqdqxqVYW 3/98p/4+OT7Jr3jDwfzPXU5ItwYPQrH0Ngiu/Ydb2FSpSDjoFW/zwu7kW4ObWf+kdYiD/PRri7X t2xz2unWhFO6MZ6wG6rn9GSfVqaV+w/OD092F28NhwrBlldn+odumJQ2ma8uJ6Dg4qw9ciFacSt S94A6XogNOuZK9xDPXGzY7delvpTaVcpdEziFZmUvRTBR2OD4SUr1v4INtPBg== X-Google-Smtp-Source: AGHT+IGlHUdPfuFH+x16OmWIeYpgda7CgMX3z81GHFA8YwBcQpbVucNfF8Ue+b3JUAJMEsLRaggnO8UzxOTxAwlInfA= X-Received: by 2002:a17:903:950:b0:268:cc5:5e4e with SMTP id d9443c01a7336-290c9c89320mr182771745ad.1.1760994568700; Mon, 20 Oct 2025 14:09:28 -0700 (PDT) MIME-Version: 1.0 References: <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net> <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net> <8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net> In-Reply-To: <8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net> From: Nazir Bilal Yavuz Date: Tue, 21 Oct 2025 00:09:27 +0300 X-Gm-Features: AS18NWDi1SuYbcjjwhZgmMmbkaEL-Sea-QGFQXD7U-YTHUJ5jV_DP5E6HlNDLOw Message-ID: Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD To: Andrew Dunstan Cc: Nathan Bossart , KAZAR Ayoub , Shinya Kato , pgsql-hackers@postgresql.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On Mon, 20 Oct 2025 at 23:32, Andrew Dunstan wrote: > > > On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote: > > On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote: > > On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote: > > With this heuristic the regression is limited by %2 in the worst case. > > My worry is that the worst case is actually quite common. Sparse data set= s > dominated by a lot of null values (and hence lots of special characters) = are > very common. Are people prepared to accept a 2% regression on load times = for > such data sets? > > Without knowing how common it is, I think it's difficult to judge whether > 2% is a reasonable trade-off. If <5% of workloads might see a small > regression while the other >95% see double-digit percentage improvements, > then I might argue that it's fine. But I'm not sure we have any way to > know those sorts of details at the moment. > > > I guess what I don't understand is why we actually need to do the test co= ntinuously, even using an adaptive algorithm. Data files in my experience u= sually have lines with fairly similar shapes. It's highly unlikely that you= will get the the first 1000 (say) lines of a file that are rich in special= characters and then some later significant section that isn't, or vice ver= sa. Therefore, doing the test once should yield the correct answer that can= be applied to the rest of the file. That should reduce the worst case regr= ession to ~0% without sacrificing any of the performance gains. I appreciat= e the elegance of what Bilal has done here, but it does seem like overkill. I think the problem is deciding how many lines to process before deciding for the rest. 1000 lines could work for the small sized data but it might not work for the big sized data. Also, it might cause a worse regressions for the small sized data. Because of this reason, I tried to implement a heuristic that will work regardless of the size of the data. The last heuristic I suggested will run SIMD for approximately (#number_of_lines / 1024 [1024 is the max number of lines to sleep before running SIMD again]) lines if all characters in the data are special characters. --=20 Regards, Nazir Bilal Yavuz Microsoft