Date: Tue, 21 Oct 2025 13:40:09 -0500
From: Nathan Bossart
To: Nazir Bilal Yavuz
Cc: Andrew Dunstan, KAZAR Ayoub, Shinya Kato, pgsql-hackers@postgresql.org
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD

On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
> I think the problem is deciding how many lines to process before
> deciding for the rest. 1000 lines could work for the small sized data
> but it might not work for the big sized data. Also, it might cause a
> worse regressions for the small sized data.

IMHO we have some leeway with smaller amounts of data. If COPY FROM for
1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
unlikely users would be inconvenienced all that much. (Those numbers are
completely made up in order to illustrate my point.)

> Because of this reason, I
> tried to implement a heuristic that will work regardless of the size
> of the data. The last heuristic I suggested will run SIMD for
> approximately (#number_of_lines / 1024 [1024 is the max number of
> lines to sleep before running SIMD again]) lines if all characters in
> the data are special characters.

I wonder if we could mitigate the regression further by spacing out the
checks a bit more. It could be worth comparing a variety of values to
identify what works best with the test data.

--
nathan
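
To make the "sleep before running SIMD again" idea from the quoted text concrete, here is a minimal sketch in C. It is not the code from the patch under discussion: the struct, the function names, and the doubling rule are all assumptions made for illustration, and only the cap of 1024 lines comes from the thread. The sketch counts how many lines would be attempted with SIMD for a given cap, which is one possible way to compare different spacings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical back-off state; none of these names come from the patch. */
typedef struct SimdBackoff
{
	size_t		max_skip;		/* cap on lines skipped between SIMD checks */
	size_t		skip_interval;	/* lines to skip after the next failed check */
	size_t		lines_to_skip;	/* lines left before SIMD is tried again */
} SimdBackoff;

static void
backoff_init(SimdBackoff *b, size_t max_skip)
{
	assert(max_skip >= 1);
	b->max_skip = max_skip;
	b->skip_interval = 1;
	b->lines_to_skip = 0;
}

/* Returns true if the next line should be parsed with the SIMD path. */
static bool
backoff_should_try_simd(SimdBackoff *b)
{
	if (b->lines_to_skip > 0)
	{
		b->lines_to_skip--;
		return false;
	}
	return true;
}

/* Record whether the SIMD attempt on the last line was worthwhile. */
static void
backoff_report(SimdBackoff *b, bool simd_helped)
{
	if (simd_helped)
	{
		/* SIMD paid off, so check again on the very next line. */
		b->skip_interval = 1;
		return;
	}

	/* Assumed policy: sleep, then double the interval up to the cap. */
	b->lines_to_skip = b->skip_interval;
	b->skip_interval *= 2;
	if (b->skip_interval > b->max_skip)
		b->skip_interval = b->max_skip;
}

/*
 * Simulate nlines input lines and count how many are attempted with SIMD.
 * simd_helps = false models data made up entirely of special characters.
 */
static size_t
count_simd_attempts(size_t nlines, size_t max_skip, bool simd_helps)
{
	SimdBackoff b;
	size_t		attempts = 0;

	backoff_init(&b, max_skip);
	for (size_t i = 0; i < nlines; i++)
	{
		if (!backoff_should_try_simd(&b))
			continue;
		attempts++;
		backoff_report(&b, simd_helps);
	}
	return attempts;
}
```

Under these assumptions, once the interval reaches the cap, the worst case settles at about one SIMD attempt per (max_skip + 1) lines, which is in line with the "number_of_lines / 1024" estimate in the quoted text. Calling count_simd_attempts() with several max_skip values shows how spacing out the checks changes that count, though only benchmarks on real test data can say which value works best.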