From: Manni Wood
Date: Fri, 6 Mar 2026 10:59:52 -0600
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Andrew Dunstan
Cc: Nazir Bilal Yavuz, Nathan Bossart, KAZAR Ayoub, Neil Conway, Shinya Kato, PostgreSQL-development
In-Reply-To: <91acb778-42c4-44ef-8888-f18ad9b12a5b@dunslane.net>
References: <91acb778-42c4-44ef-8888-f18ad9b12a5b@dunslane.net>

Hello.

I ran Nazir's v11 patch on my x86 tower PC and my arm raspberry pi using the same build I've been using: meson with "debugoptimized", which translates to "-g -O2" gcc flags.

x86 NARROW old master (18bcdb75)
TXT :                 25909.060500 ms
CSV :                 28137.591250 ms
TXT with 1/3 escapes: 27794.177000 ms
CSV with 1/3 quotes:  34541.704750 ms

x86 NARROW v10
TXT :                 26416.331500 ms  -1.957890% regression
CSV :                 25318.727500 ms  10.018142% improvement
TXT with 1/3 escapes: 28608.007500 ms  -2.928061% regression
CSV with 1/3 quotes:  32805.627750 ms  5.026032% improvement

x86 NARROW v11
TXT :                 27212.945750 ms  -5.032545% regression
CSV :                 26985.971250 ms  4.092817% improvement
TXT with 1/3 escapes: 27216.510000 ms  2.078374% improvement
CSV with 1/3 quotes:  32817.267500 ms  4.992334% improvement


x86 WIDE old master (18bcdb75)
TXT :                 28778.426500 ms
CSV :                 35671.908000 ms
TXT with 1/3 escapes: 32441.549750 ms
CSV with 1/3 quotes:  47024.416000 ms

x86 WIDE v10
TXT :                 23067.046750 ms  19.846046% improvement
CSV :                 23259.092250 ms  34.797174% improvement
TXT with 1/3 escapes: 31796.098250 ms  1.989583% improvement
CSV with 1/3 quotes:  42925.792250 ms  8.715948% improvement

x86 WIDE v11
TXT :                 22571.305750 ms  21.568659% improvement
CSV :                 22711.524750 ms  36.332184% improvement
TXT with 1/3 escapes: 29236.453000 ms  9.879604% improvement
CSV with 1/3 quotes:  40022.110750 ms  14.890786% improvement



arm NARROW old master (18bcdb75)
TXT :                 10997.568250 ms
CSV :                 10797.549000 ms
TXT with 1/3 escapes: 10299.047000 ms
CSV with 1/3 quotes:  12559.385750 ms

arm NARROW v10
TXT :                 10467.816750 ms  4.816988% improvement
CSV :                 9986.288000 ms  7.513381% improvement
TXT with 1/3 escapes: 10323.173750 ms  -0.234262% regression
CSV with 1/3 quotes:  11843.611750 ms  5.699116% improvement

arm NARROW v11
TXT :                 10340.966250 ms  5.970429% improvement
CSV :                 10224.399500 ms  5.308144% improvement
TXT with 1/3 escapes: 10438.216750 ms  -1.351288% regression
CSV with 1/3 quotes:  11865.934000 ms  5.521383% improvement


arm WIDE old master (18bcdb75)
TXT :                 11825.771250 ms
CSV :                 13907.074000 ms
TXT with 1/3 escapes: 13430.691250 ms
CSV with 1/3 quotes:  17557.954500 ms

arm WIDE v10
TXT :                 9064.959000 ms  23.345727% improvement
CSV :                 9019.553250 ms  35.144134% improvement
TXT with 1/3 escapes: 12344.497250 ms  8.087402% improvement
CSV with 1/3 quotes:  15495.863750 ms  11.744482% improvement

arm WIDE v11
TXT :                 9001.442250 ms  23.882831% improvement
CSV :                 8940.928750 ms  35.709490% improvement
TXT with 1/3 escapes: 12049.668500 ms  10.282589% improvement
CSV with 1/3 quotes:  15277.843250 ms  12.986201% improvement
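
For anyone re-deriving the percentages: a minimal sketch of the arithmetic, assuming each figure is the change relative to the old-master timing of the same test (positive means faster, negative means a regression). The function name is hypothetical and is not part of the benchmark scripts.

```c
#include <stdio.h>

/*
 * Percent change relative to the old-master timing.
 * Positive: the patched build was faster. Negative: it regressed.
 * (Hypothetical helper, for illustration only.)
 */
static double
percent_improvement(double master_ms, double patched_ms)
{
	return (master_ms - patched_ms) / master_ms * 100.0;
}

/* Print one result line in roughly the same shape as the listing above. */
static void
report(const char *label, double master_ms, double patched_ms)
{
	double		pct = percent_improvement(master_ms, patched_ms);

	printf("%s %f ms  %f%% %s\n", label, patched_ms, pct,
		   pct < 0 ? "regression" : "improvement");
}
```

Calling report("TXT :", 25909.060500, 26416.331500) with the x86 NARROW old master and v10 timings should give a value close to the -1.957890% shown above.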

Best,

-Manni

On Thu, Mar 5, 2026 at 3:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:

> On 2026-03-04 We 10:15 AM, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Mon, 2 Mar 2026 at 22:55, Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> On Wed, Feb 25, 2026 at 05:24:27PM +0300, Nazir Bilal Yavuz wrote:
> >>> If anyone has any suggestions/ideas, please let me know!
> > I am able to fix the problem. My first assumption was that the
> > branching of SIMD code caused that problem, so I moved SIMD code to
> > the CopyReadLineTextSIMDHelper() function. Then I moved this
> > CopyReadLineTextSIMDHelper() to top of CopyReadLineText(), by doing
> > that we won't have any branching in the non-SIMD (scalar) code path.
> > This didn't solve the problem and then I realized that even though I
> > disable SIMD code path with 'if (false)', there is still regression
> > but if I comment all of the 'if (cstate->simd_enabled)' branch, then
> > there is no regression at all.
> >
> > To find out more, I compared assembly outputs of both and found out
> > the possible reason. What I understood is that the compiler can't
> > promote a variable to register, instead these variables live in the
> > stack; which is slower. Please see the two different assembly outputs:
> >
> > Slow code:
> >
> >         c = copy_input_buf[input_buf_ptr++];
> >      db0:    48 8b 55 b8             mov    -0x48(%rbp),%rdx
> >      db4:    48 63 c6                movslq %esi,%rax
> >      db7:    44 8d 66 01             lea    0x1(%rsi),%r12d
> >      dbb:    44 89 65 cc             mov    %r12d,-0x34(%rbp)
> >      dbf:    0f be 14 02             movsbl (%rdx,%rax,1),%edx
> >
> > Fast code:
> >
> >         c = copy_input_buf[input_buf_ptr++];
> >      d80:    49 63 c4                movslq %r12d,%rax
> >      d83:    45 8d 5c 24 01          lea    0x1(%r12),%r11d
> >      d88:    41 0f be 04 06          movsbl (%r14,%rax,1),%eax
> >
> > And the reason for that is sending the address of input_buf_ptr to a
> > CopyReadLineTextSIMDHelper(..., &input_buf_ptr). If I change it to
> > this:
> >
> > int            temp_input_buf_ptr = input_buf_ptr;
> > CopyReadLineTextSIMDHelper(..., &temp_input_buf_ptr);
> >
> > Then there is no regression. However, I am still not completely sure
> > if that is the same problem in the v10, I am planning to spend more
> > time debugging this.
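
A minimal, standalone sketch of the pattern described in the quoted text above. It assumes a hypothetical helper, skip_plain_bytes(), standing in for CopyReadLineTextSIMDHelper(). Copying the temporary back into input_buf_ptr is also an assumption, since the quoted snippet does not show that step. Only the temporary has its address taken, so the compiler remains free to keep the hot loop index in a register. Whether it actually does depends on the compiler and flags.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for CopyReadLineTextSIMDHelper(): advances *ptr
 * past bytes that need no special handling (here: anything other than a
 * newline or a backslash).
 */
static void
skip_plain_bytes(const char *buf, int len, int *ptr)
{
	while (*ptr < len && buf[*ptr] != '\n' && buf[*ptr] != '\\')
		(*ptr)++;
}

/* Returns the index just past the first newline, or len if there is none. */
static int
count_to_newline(const char *buf, int len)
{
	int			input_buf_ptr = 0;

	/* Only this copy has its address taken and passed to the helper. */
	int			temp_input_buf_ptr = input_buf_ptr;

	skip_plain_bytes(buf, len, &temp_input_buf_ptr);
	input_buf_ptr = temp_input_buf_ptr; /* assumed copy-back step */

	/* Scalar loop: the address of input_buf_ptr was never taken. */
	while (input_buf_ptr < len)
	{
		char		c = buf[input_buf_ptr++];

		if (c == '\n')
			break;
	}
	return input_buf_ptr;
}
```

Comparing the generated assembly for this version against one that passes &input_buf_ptr directly (for example with gcc -O2 -S) is one way to check whether the loop index is being spilled to the stack.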
> >
> >> A couple of random ideas:
> >>
> >> * Additional inlining for callers.  I looked around a little bit and didn't
> >> see any great candidates, so I don't have much faith in this, but maybe
> >> you'll see something I don't.
> > I agree with you. CopyReadLineText() is already quite a big function.
> >
> >> * Disable SIMD if we are consistently getting small rows.  That won't help
> >> your "wide & CSV 1/3" case in all likelihood, but perhaps it'll help with
> >> the regression for narrow rows described elsewhere.
> > I implemented this, two consecutive small rows disables SIMD.
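
A rough sketch of what such a heuristic could look like, not the code from the v11 patch. The struct, the function name, and the 32-byte threshold are all hypothetical. Only the simd_enabled flag name and the "two consecutive small rows" rule come from the thread. This sketch never re-enables SIMD once it is switched off, which is also an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define SMALL_ROW_BYTES	32		/* hypothetical cutoff for a "small" row */
#define SMALL_ROW_LIMIT	2		/* consecutive small rows before disabling */

/* Hypothetical per-COPY state; the real patch keeps its flag in cstate. */
typedef struct LineScanState
{
	bool		simd_enabled;
	int			consecutive_small_rows;
} LineScanState;

/* Call once per parsed row with that row's length in bytes. */
static void
note_row_length(LineScanState *st, size_t row_len)
{
	if (!st->simd_enabled)
		return;					/* already disabled, nothing to track */

	if (row_len < SMALL_ROW_BYTES)
	{
		/* Another small row in a row: disable SIMD at the limit. */
		if (++st->consecutive_small_rows >= SMALL_ROW_LIMIT)
			st->simd_enabled = false;
	}
	else
		st->consecutive_small_rows = 0; /* a wide row resets the streak */
}
```

A wide row between two small ones resets the counter, so SIMD is only switched off when small rows arrive back to back.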
> >
> >> * Surround the variable initializations with "if (simd_enabled)".
> >> Presumably compilers are smart enough to remove those in the non-SIMD paths
> >> already, but it could be worth a try.
> > Done.
> >
> >> * Add simd_enabled function parameter to CopyReadLine(),
> >> NextCopyFromRawFieldsInternal(), and CopyFromTextLikeOneRow(), and do the
> >> bool literal trick in CopyFrom{Text,CSV}OneRow().  That could encourage the
> >> compiler to do some additional optimizations to reduce branching.
> > I think we don't need this. At least the implementation with
> > CopyReadLineTextSIMDHelper() doesn't need this since branching will be
> > at the top and it will be once per line.
> >
> > I think v11 looks better compared to v10. I liked the
> > CopyReadLineTextSIMDHelper() helper function. I also liked it being at
> > the top of CopyReadLineText(), not being in the scalar path. This
> > gives us more optimization options without affecting the scalar path.
> >
> > Here are the new benchmark results, I benchmarked the changes with
> > both -O2 and -O3 and also both with and without 'changing
> > default_toast_compression to lz4' commit (65def42b1d5). Benchmark
> > results show that there is no regression and the performance
> > improvement is much bigger with 65def42b1d5, it is close to 2x for
> > text format and more than 2x for the csv format.
>
>
> I spent some time exploring different ideas for improving this, but
> found none that didn't cause regression in some cases, so good to go
> from my POV.
>
>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com



--
-- Manni Wood
EDB: https://www.enterprisedb.com