Message-ID: <gh-pgjdbc-pgjdbc-2963-c1819927745@github.com>
From: "zkorhone (@zkorhone)" <noreply+zkorhone@github.com>
To: "pgjdbc/pgjdbc" <noreply+pgjdbc-pgjdbc@github.com>
Date: Mon, 20 Nov 2023 22:45:58 +0000
Subject: Re: [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
In-Reply-To: <gh-pgjdbc-pgjdbc-2963@github.com>
References: <gh-pgjdbc-pgjdbc-2963@github.com>
List-Id: <gh-pgjdbc-pgjdbc.github.com>
X-GitHub-Author-Login: zkorhone
X-GitHub-Comment-Id: 1819927745
X-GitHub-Comment-Type: issue_comment
X-GitHub-Edited-At: 2023-11-20T23:13:41Z
X-GitHub-Issue: 2963
X-GitHub-Repo: pgjdbc/pgjdbc
X-GitHub-Type: comment
X-GitHub-Url: https://github.com/pgjdbc/pgjdbc/issues/2963#issuecomment-1819927745
Content-Type: text/plain; charset=utf-8

The combination of weird output from flamegraph and presence of unused codepath with `FilterOutputStream` in the same class that derailed me. As you mentioned, it is indeed BufferedOutputStream that is being invoked. There was also flaw in my assumption on how much data is being written per write to BufferedOutputStream.

It was fairly easy to modify the benchmark, so I just did that. I implemented what would be worse than worst-case. And results are here:

[Threads+Pool] FilteringOutputStream(NoLockBufferedOutputStream(OutputStream)): 90.575ms [ -56 ]
[Threads+Pool] FilteringOutputStream(BufferedOutputStream(OutputStream)): 621.572ms [ -56 ]
[No threads] FilteringOutputStream(NoLockBufferedOutputStream(OutputStream)): 445.047ms [ -56 ]
[No threads] FilteringOutputStream(BufferedOutputStream(OutputStream)): 3916.484ms [ -56 ]

Applying FilteringOutputStream forces writes to be done 1byte at a time (I did try also use of for loop instead of FilteringOutputStream to copy 1byte at a time from random input and results were identical).

So with very small writes unsynchronized version of BufferedOutputStream does make sense. If you're batching a lot of data without long character data, this could very well happen.

NoLockBufferedOutputStream that I provided is quite naive implementation. Looking at PGStream more closely, it might make sense to move logic from sendInteger2 etc. to OutputStream. This would allow to write directly to the buffer. Such implementation could _theoretically_ perform better than copying between two buffers. On the other hand HotSpot might be able to optimize such a trivial case, if buffer was created with new for every invocation. Reusing buffer that is declared as a PGStream is actually likely to perform worse. Field indicates that state should be kept. It would be very difficult for optimizer to be sure, that it's safe to use registers / stack allocations to copy data.

I'll add the modified benchmark.

Update: Be sure to run on Java17+. Performance on older Java versions differ from what can be observed on modern Java. Biased-locking was disabled by default in Java 15 and it was removed completely in Java 21.

[microbenchmark.java.txt](https://github.com/pgjdbc/pgjdbc/files/13420011/microbenchmark.java.txt)