MIME-Version: 1.0
From: Shinya Kato
Date: Wed, 13 Aug 2025 15:21:06 +0900
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Nazir Bilal Yavuz
Cc: pgsql-hackers@postgresql.org
Content-Type: multipart/mixed; boundary="000000000000a217a8063c392bba"
Precedence: bulk

--000000000000a217a8063c392bba
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato wrote:
> > + * However, SIMD optimization cannot be applied in the following cases:
> > + * - Inside quoted fields, where escape sequences and closing quotes
> > + *   require sequential processing to handle correctly.
> >
> > I think you can continue SIMD inside quoted fields. Only important
> > thing is you need to set last_was_esc to false when SIMD skipped the
> > chunk.
>
> That's a clever point that last_was_esc should be reset to false when
> a SIMD chunk is skipped. You're right about that specific case.
>
> However, the core challenge is not what happens when we skip a chunk,
> but what happens when a chunk contains special characters like quotes
> or escapes. The main reason we avoid SIMD inside quoted fields is that
> the parsing logic becomes fundamentally sequential and
> context-dependent.
>
> To correctly parse a "" as a single literal quote, we must perform a
> lookahead to check the next character. This is an inherently
> sequential operation that doesn't map well to SIMD's parallel nature.
>
> Trying to handle this stateful logic with SIMD would lead to
> significant implementation complexity, especially with edge cases like
> an escape character falling on the last byte of a chunk.

Ah, you're right. My apologies, I misunderstood the implementation. It
appears that SIMD can be used even within quoted strings.

I think it would be better not to use the SIMD path when last_was_esc
is true. The next character is likely to be a special character, and
handling this case outside the SIMD loop would also improve
readability by consolidating the last_was_esc toggle logic in one
place. Furthermore, when inside a quote (in_quote) in CSV mode, the
detection of \n and \r can be disabled.

+ last_was_esc = false;

Regarding the implementation, I believe we must set last_was_esc to
false when advancing input_buf_ptr, as shown in the code below. For
this reason, I think it’s best to keep the current logic for toggling
last_was_esc.

+ int advance = pg_rightmost_one_pos32(mask);
+ input_buf_ptr += advance;

I've attached a new patch that includes these changes. Further
modifications are still in progress.
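
For anyone following the thread without applying the patch, the control
flow discussed above can be sketched in standalone C. This is only a
minimal illustration under stated assumptions, not the attached patch:
the names skip_plain_bytes, chunk_eq_mask and CHUNK are hypothetical, and
the sketch calls SSE2 intrinsics directly (with a scalar fallback)
instead of the wrappers from port/simd.h. It bails out when last_was_esc
is set, ignores \n and \r while in_quote is true, and stops at the first
special byte so that a scalar parser can handle it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CHUNK 16				/* bytes examined per step (assumed vector width) */

/* Bit i of the result is set when byte i of the 16-byte chunk equals c. */
static uint32_t
chunk_eq_mask(const unsigned char *p, char c)
{
#if defined(__SSE2__)
	__m128i		chunk = _mm_loadu_si128((const __m128i *) p);
	__m128i		match = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(c));

	return (uint32_t) _mm_movemask_epi8(match);
#else
	/* Portable fallback that builds the same bitmask byte by byte. */
	uint32_t	mask = 0;

	for (int i = 0; i < CHUNK; i++)
		if (p[i] == (unsigned char) c)
			mask |= (uint32_t) 1 << i;
	return mask;
#endif
}

/*
 * Hypothetical helper: starting at pos, skip whole chunks that contain no
 * special byte. Returns the offset of the first special byte found, or the
 * offset at which fewer than CHUNK bytes remain. The caller's scalar loop
 * is expected to continue from the returned offset.
 */
static size_t
skip_plain_bytes(const char *buf, size_t len, size_t pos,
				 bool is_csv, bool in_quote, bool last_was_esc,
				 char quotec, char escapec)
{
	assert(buf != NULL && pos <= len);

	/* After an escape, the next byte must be examined by the scalar path. */
	if (last_was_esc)
		return pos;

	while (len - pos >= CHUNK)
	{
		const unsigned char *p = (const unsigned char *) buf + pos;
		uint32_t	mask = 0;

		/* \n and \r are not special inside a quoted CSV field. */
		if (!in_quote)
			mask |= chunk_eq_mask(p, '\n') | chunk_eq_mask(p, '\r');

		if (is_csv)
		{
			mask |= chunk_eq_mask(p, quotec);
			if (escapec != '\0')
				mask |= chunk_eq_mask(p, escapec);
		}
		else
			mask |= chunk_eq_mask(p, '\\');

		if (mask != 0)
		{
			/* The lowest set bit marks the first special byte in the chunk. */
			size_t		advance = 0;

			while ((mask & 1) == 0)
			{
				mask >>= 1;
				advance++;
			}
			return pos + advance;
		}

		pos += CHUNK;			/* nothing special here, skip the chunk */
	}

	return pos;
}
```

A caller would use the returned offset as its new read position and
then fetch the next byte as usual. Because the helper never consumes a
special byte itself, the last_was_esc and in_quote toggling can stay in
the scalar code.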
--=20 Best regards, Shinya Kato NTT OSS Center --000000000000a217a8063c392bba Content-Type: application/octet-stream; name="v2-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch" Content-Disposition: attachment; filename="v2-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch" Content-Transfer-Encoding: base64 Content-ID: X-Attachment-Id: f_me9kzs9m0 RnJvbSA2OWUxNmY4YzdhNTJkOTY3Mzg1YTFkYzliMTYwMmJiZDQ0NzJkZjYwIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBTaGlueWEgS2F0byA8c2hpbnlhMTEua2F0b0BnbWFpbC5jb20+ CkRhdGU6IE1vbiwgMjggSnVsIDIwMjUgMjI6MDg6MjAgKzA5MDAKU3ViamVjdDogW1BBVENIIHYy XSBTcGVlZCB1cCBDT1BZIEZST00gdGV4dC9DU1YgcGFyc2luZyB1c2luZyBTSU1ECgotLS0KIHNy Yy9iYWNrZW5kL2NvbW1hbmRzL2NvcHlmcm9tcGFyc2UuYyB8IDcxICsrKysrKysrKysrKysrKysr KysrKysrKysrKysKIDEgZmlsZSBjaGFuZ2VkLCA3MSBpbnNlcnRpb25zKCspCgpkaWZmIC0tZ2l0 IGEvc3JjL2JhY2tlbmQvY29tbWFuZHMvY29weWZyb21wYXJzZS5jIGIvc3JjL2JhY2tlbmQvY29t bWFuZHMvY29weWZyb21wYXJzZS5jCmluZGV4IGIxYWU5N2I4MzNkLi5mMWE2ZWE4MWRkMSAxMDA2 NDQKLS0tIGEvc3JjL2JhY2tlbmQvY29tbWFuZHMvY29weWZyb21wYXJzZS5jCisrKyBiL3NyYy9i YWNrZW5kL2NvbW1hbmRzL2NvcHlmcm9tcGFyc2UuYwpAQCAtNzEsNyArNzEsOSBAQAogI2luY2x1 ZGUgIm1iL3BnX3djaGFyLmgiCiAjaW5jbHVkZSAibWlzY2FkbWluLmgiCiAjaW5jbHVkZSAicGdz dGF0LmgiCisjaW5jbHVkZSAicG9ydC9wZ19iaXR1dGlscy5oIgogI2luY2x1ZGUgInBvcnQvcGdf YnN3YXAuaCIKKyNpbmNsdWRlICJwb3J0L3NpbWQuaCIKICNpbmNsdWRlICJ1dGlscy9idWlsdGlu cy5oIgogI2luY2x1ZGUgInV0aWxzL3JlbC5oIgogCkBAIC0xMjU1LDYgKzEyNTcsMTQgQEAgQ29w eVJlYWRMaW5lVGV4dChDb3B5RnJvbVN0YXRlIGNzdGF0ZSwgYm9vbCBpc19jc3YpCiAJY2hhcgkJ cXVvdGVjID0gJ1wwJzsKIAljaGFyCQllc2NhcGVjID0gJ1wwJzsKIAorI2lmbmRlZiBVU0VfTk9f U0lNRAorCVZlY3RvcjgJCW5sID0gdmVjdG9yOF9icm9hZGNhc3QoJ1xuJyk7CisJVmVjdG9yOAkJ Y3IgPSB2ZWN0b3I4X2Jyb2FkY2FzdCgnXHInKTsKKwlWZWN0b3I4CQlicyA9IHZlY3RvcjhfYnJv YWRjYXN0KCdcXCcpOworCVZlY3RvcjgJCXF1b3RlOworCVZlY3RvcjgJCWVzY2FwZTsKKyNlbmRp ZgorCiAJaWYgKGlzX2NzdikKIAl7CiAJCXF1b3RlYyA9IGNzdGF0ZS0+b3B0cy5xdW90ZVswXTsK QEAgLTEyNjIsNiArMTI3MiwxMiBAQCBDb3B5UmVhZExpbmVUZXh0KENvcHlGcm9tU3RhdGUgY3N0 
YXRlLCBib29sIGlzX2NzdikKIAkJLyogaWdub3JlIHNwZWNpYWwgZXNjYXBlIHByb2Nlc3Npbmcg aWYgaXQncyB0aGUgc2FtZSBhcyBxdW90ZWMgKi8KIAkJaWYgKHF1b3RlYyA9PSBlc2NhcGVjKQog CQkJZXNjYXBlYyA9ICdcMCc7CisKKyNpZm5kZWYgVVNFX05PX1NJTUQKKwkJcXVvdGUgPSB2ZWN0 b3I4X2Jyb2FkY2FzdChxdW90ZWMpOworCQlpZiAocXVvdGVjICE9IGVzY2FwZWMpCisJCQllc2Nh cGUgPSB2ZWN0b3I4X2Jyb2FkY2FzdChlc2NhcGVjKTsKKyNlbmRpZgogCX0KIAogCS8qCkBAIC0x MzI4LDYgKzEzNDQsNjEgQEAgQ29weVJlYWRMaW5lVGV4dChDb3B5RnJvbVN0YXRlIGNzdGF0ZSwg Ym9vbCBpc19jc3YpCiAJCQluZWVkX2RhdGEgPSBmYWxzZTsKIAkJfQogCisjaWZuZGVmIFVTRV9O T19TSU1ECisJCS8qCisJCSAqIFVzZSBTSU1EIGluc3RydWN0aW9ucyB0byBlZmZpY2llbnRseSBz Y2FuIHRoZSBpbnB1dCBidWZmZXIgZm9yCisJCSAqIHNwZWNpYWwgY2hhcmFjdGVycyAoZS5nLiwg bmV3bGluZSwgY2FycmlhZ2UgcmV0dXJuLCBxdW90ZSwgYW5kCisJCSAqIGVzY2FwZSkuIFRoaXMg aXMgZmFzdGVyIHRoYW4gYnl0ZS1ieS1ieXRlIGl0ZXJhdGlvbiwgZXNwZWNpYWxseSBvbgorCQkg KiBsYXJnZSBidWZmZXJzLgorCQkgKgorCQkgKiBXZSBkbyBub3QgYXBwbHkgdGhlIFNJTUQgZmFz dCBwYXRoIGluIGVpdGhlciBvZiB0aGUgZm9sbG93aW5nIGNhc2VzOgorCQkgKiAtIFdoZW4gdGhl IHByZXZpb3VzbHkgcHJvY2Vzc2VkIGNoYXJhY3RlciB3YXMgYW4gZXNjYXBlIGNoYXJhY3Rlcgor CQkgKiAgIChsYXN0X3dhc19lc2MpLCBzaW5jZSB0aGUgbmV4dCBieXRlIG11c3QgYmUgZXhhbWlu ZWQgc2VxdWVudGlhbGx5LgorCQkgKiAtIFRoZSByZW1haW5pbmcgYnVmZmVyIGlzIHNtYWxsZXIg dGhhbiBvbmUgdmVjdG9yIHdpZHRoCisJCSAqICAgKHNpemVvZihWZWN0b3I4KSk7IFNJTUQgb3Bl cmF0ZXMgb24gZml4ZWQtc2l6ZSBjaHVua3MuCisJCSAqLworCQlpZiAoIWxhc3Rfd2FzX2VzYyAm JiBjb3B5X2J1Zl9sZW4gLSBpbnB1dF9idWZfcHRyID49IHNpemVvZihWZWN0b3I4KSkKKwkJewor CQkJVmVjdG9yOAkJY2h1bms7CisJCQlWZWN0b3I4CQltYXRjaDsKKwkJCXVpbnQzMgkJbWFzazsK KworCQkJLyogTG9hZCBhIGNodW5rIG9mIGRhdGEgaW50byBhIHZlY3RvciByZWdpc3RlciAqLwor CQkJdmVjdG9yOF9sb2FkKCZjaHVuaywgKGNvbnN0IHVpbnQ4ICopICZjb3B5X2lucHV0X2J1Zltp bnB1dF9idWZfcHRyXSk7CisKKwkJCS8qIFxuIGFuZCBcciBhcmUgbm90IHNwZWNpYWwgaW5zaWRl IHF1b3RlcyAqLworCQkJaWYgKCFpbl9xdW90ZSkKKwkJCQltYXRjaCA9IHZlY3Rvcjhfb3IodmVj dG9yOF9lcShjaHVuaywgbmwpLCB2ZWN0b3I4X2VxKGNodW5rLCBjcikpOworCisJCQlpZiAoaXNf 
Y3N2KQorCQkJeworCQkJCW1hdGNoID0gdmVjdG9yOF9vcihtYXRjaCwgdmVjdG9yOF9lcShjaHVu aywgcXVvdGUpKTsKKwkJCQlpZiAoZXNjYXBlYyAhPSAnXDAnKQorCQkJCQltYXRjaCA9IHZlY3Rv cjhfb3IobWF0Y2gsIHZlY3RvcjhfZXEoY2h1bmssIGVzY2FwZSkpOworCQkJfQorCQkJZWxzZQor CQkJCW1hdGNoID0gdmVjdG9yOF9vcihtYXRjaCwgdmVjdG9yOF9lcShjaHVuaywgYnMpKTsKKwor CQkJLyogQ2hlY2sgaWYgd2UgZm91bmQgYW55IHNwZWNpYWwgY2hhcmFjdGVycyAqLworCQkJbWFz ayA9IHZlY3RvcjhfaGlnaGJpdF9tYXNrKG1hdGNoKTsKKwkJCWlmIChtYXNrICE9IDApCisJCQl7 CisJCQkJLyoKKwkJCQkgKiBGb3VuZCBhIHNwZWNpYWwgY2hhcmFjdGVyLiBBZHZhbmNlIHVwIHRv IHRoYXQgcG9pbnQgYW5kIGxldAorCQkJCSAqIHRoZSBzY2FsYXIgY29kZSBoYW5kbGUgaXQuCisJ CQkJICovCisJCQkJaW50IGFkdmFuY2UgPSBwZ19yaWdodG1vc3Rfb25lX3BvczMyKG1hc2spOwor CQkJCWlucHV0X2J1Zl9wdHIgKz0gYWR2YW5jZTsKKwkJCX0KKwkJCWVsc2UKKwkJCXsKKwkJCQkv KiBObyBzcGVjaWFsIGNoYXJhY3RlcnMgZm91bmQsIHNvIHNraXAgdGhlIGVudGlyZSBjaHVuayAq LworCQkJCWlucHV0X2J1Zl9wdHIgKz0gc2l6ZW9mKFZlY3RvcjgpOworCQkJCWNvbnRpbnVlOwor CQkJfQorCQl9CisjZW5kaWYKKwogCQkvKiBPSyB0byBmZXRjaCBhIGNoYXJhY3RlciAqLwogCQlw cmV2X3Jhd19wdHIgPSBpbnB1dF9idWZfcHRyOwogCQljID0gY29weV9pbnB1dF9idWZbaW5wdXRf YnVmX3B0cisrXTsKLS0gCjIuNDcuMwoK --000000000000a217a8063c392bba--