From: Nazir Bilal Yavuz
Date: Wed, 22 Oct 2025 15:33:37 +0300
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Nathan Bossart
Cc: Andrew Dunstan, KAZAR Ayoub, Shinya Kato, pgsql-hackers@postgresql.org
References: <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net> <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net> <8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net>

Hi,

On Tue, 21 Oct 2025 at 21:40, Nathan Bossart wrote:
>
> On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
> > I think the problem is deciding how many lines to process before
> > deciding for the rest. 1000 lines could work for the small sized data
> > but it might not work for the big sized data. Also, it might cause
> > worse regressions for the small sized data.
>
> IMHO we have some leeway with smaller amounts of data. If COPY FROM for
> 1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
> unlikely users would be inconvenienced all that much. (Those numbers are
> completely made up in order to illustrate my point.)
>
> > Because of this reason, I
> > tried to implement a heuristic that will work regardless of the size
> > of the data. The last heuristic I suggested will run SIMD for
> > approximately (#number_of_lines / 1024 [1024 is the max number of
> > lines to sleep before running SIMD again]) lines if all characters in
> > the data are special characters.
>
> I wonder if we could mitigate the regression further by spacing out the
> checks a bit more. It could be worth comparing a variety of values to
> identify what works best with the test data.

Do you mean that instead of doubling the SIMD sleep, we should
multiply it by 3 (or another factor)? Or are you referring to
increasing the maximum sleep from 1024? Or possibly both?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
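
For illustration only, the two knobs asked about above (the growth factor
of the SIMD sleep and its maximum) can be compared with a small
simulation. This is a rough sketch, not the patch's code: the function
name, the initial sleep of 1 line, and the rule that the sleep grows
after every unhelpful SIMD attempt are assumptions made for the example.
It models the worst case from the thread, where every line is full of
special characters and SIMD never pays off, and counts how many lines
would still be attempted with SIMD.

```python
def count_simd_attempts(num_lines, factor=2, max_sleep=1024):
    """Count lines on which SIMD is attempted when SIMD never helps.

    Hypothetical model: after each unhelpful SIMD attempt, skip `sleep`
    lines on the scalar path, then multiply `sleep` by `factor`, capped
    at `max_sleep`.
    """
    attempts = 0
    sleep = 1  # assumed initial sleep, in lines
    line = 0
    while line < num_lines:
        attempts += 1   # try SIMD on this line (assumed not to help)
        line += 1
        line += sleep   # process the next `sleep` lines without SIMD
        sleep = min(sleep * factor, max_sleep)
    return attempts


if __name__ == "__main__":
    n = 1_000_000
    for factor, max_sleep in [(2, 1024), (3, 1024), (2, 4096)]:
        attempts = count_simd_attempts(n, factor, max_sleep)
        print(f"factor={factor} max_sleep={max_sleep}: "
              f"{attempts} SIMD attempts over {n} lines")
```

In this model the growth factor only shortens the ramp-up phase, while
the maximum sleep sets the steady-state rate of roughly
num_lines / max_sleep attempts, so for large inputs the maximum has
the larger effect on the count. Real measurements on the test data would
still be needed to pick values.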