[pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient

pgjdbc/pgjdbc GitHub issues and pull requests (mirror)  
help / color / mirror / Atom feed

[pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
9+ messages / 5 participants
[nested] [flat]

* [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
@ 2023-10-10 14:39 "plokhotnyuk (@plokhotnyuk)" <[email protected]>
  0 siblings, 0 replies; 9+ messages in thread

From: plokhotnyuk (@plokhotnyuk) @ 2023-10-10 14:39 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

**Describe the issue**

`BufferedOutputStream` in `PGStream` should be replaced to something more efficient

Currently it is hardcoded to be 8K size and to use an internal lock that adds too much overhead during execution of batched queries, especially with rewritten inserts.

Below is a flame-graph of CPU cycles reported by the async-profiler that shows more than 50% of CPU used for the buffer resizing and unlocking during the batch execution:

![image](https://github.com/pgjdbc/pgjdbc/assets/890289/891dfb43-d9cf-47d1-9eab-dcfff682d669)

**Driver Version** 

42.6.0

**Java Version**

OpenJDK 21

**OS Version**

Ubuntu 23.04 (Linux 6.2)

**PostgreSQL Version**

15

**To Reproduce**
1. Create a connection with the rewritten inserts option turned on
2. Prepare an insert statement with a dozen of placeholders for numbers
3. Add 10K batches
4. Commit
5. Repeat from 3rd item

**Expected behaviour**
No CPU cycles burned on buffer resizing and locks

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
@ 2023-10-10 15:01 ` "davecramer (@davecramer)" <[email protected]>
  7 siblings, 0 replies; 9+ messages in thread

From: davecramer (@davecramer) @ 2023-10-10 15:01 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

Thanks for the analysis. Care to provide a better implementation ?

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
@ 2023-11-17 00:09 ` "zkorhone (@zkorhone)" <[email protected]>
  7 siblings, 0 replies; 9+ messages in thread

From: zkorhone (@zkorhone) @ 2023-11-17 00:09 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

I looked at the implementation in Java 21. There should be no buffer resizing, if you use size parameter. Resizing only applies when size is not defined.

I think biggest optimization would be to replace FilterOutputStream with custom delegate. FilterOutputStream writes to given OutputStream one byte at a time. With BufferedOutputStream that means writing 1000byte array causes 1000lock and unlock calls. While removing locks by implementing custom BufferedOutputStream would fix this, implementation with FilterOutputStream would still write to memory in a very inefficient way (JVM might be able to optimize this, but why rely on that)

I would first try, if replacing FilterOutputStream does enough. If not, then consider unsynchronized version. Just be sure not to extend BufferedOutputStream when getting rid of FilterOutputStream. There seems to be virtual thread optimizations that are only present when getClass equals to BufferedOutputStream.class

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
@ 2023-11-20 11:52 ` "zkorhone (@zkorhone)" <[email protected]>
  7 siblings, 0 replies; 9+ messages in thread

From: zkorhone (@zkorhone) @ 2023-11-20 11:52 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

I did run some microbenchmarks for you:

[No threads] OutputStream: 150.25ms [ -56 ]
[No threads] BufferedOutputStream(OutputStream): 155.564ms [ -56 ]
[No threads] FilteringOutputStream(BufferedOutputStream(OutputStream)): 3919.915ms [ -56 ]
[Threads] OutputStream: 22.315ms [ -56 ]
[Threads] BufferedOutputStream(OutputStream): 21.27ms [ -56 ]
[Threads] FilteringOutputStream(BufferedOutputStream(OutputStream)): 626.437ms [ -56 ]
[Threads+Pool] OutputStream: 21.532ms [ -56 ]
[Threads+Pool] BufferedOutputStream(OutputStream): 27.802ms [ -56 ]
[Threads+Pool] FilteringOutputStream(BufferedOutputStream(OutputStream)): 613.602ms [ -56 ]

In above results:
* No threads - no threads were used
* Threads - a thread pool equal to half of size of available cores was used
* Threads+Pool - a thread pool equal to half of size of available cores was used and resource pooling was used
* OutputStream - data is written directly to target OutputStream
* BufferedOutputStream(OutputStream) - data is written via BufferedOutputStream to target OutputStream
* FilteringOutputStream(BufferedOutputStream(OutputStream)) - data is written via FilteringOutputStream to BufferedOutputStream and finally to target OutputStream

Note: Resource pooling is my guess on how postgres driver to perform when using connection pooling. I did this because in theory resource pooling could have impact on how HotSpot optimizes locks (lock elision). There's no way really to guarantee that my simulation is correct.

In results execution time is total execution time for test. I have 16 cpu cores, which in test results to 8 threads being used for running a threaded test. This explains why for threaded tests execution time is ~1/8 of single threaded tests.

Based on these results I'd suggest replacing FilteringOutputStream with custom OutputStream.

I also tried with custom version of BufferedOutputStream that doesn't implement locks. There are some gains (< 1%), but I wouldn't say they are significant enough to warrant custom implementation.

[microbenchmark.java.txt](https://github.com/pgjdbc/pgjdbc/files/13413623/microbenchmark.java.txt)

```
    static class NoLockBufferedOutputStream extends OutputStream {
        private final OutputStream dst;
        private final byte buffer[];
        private int length;

        public NoLockBufferedOutputStream(OutputStream dst) {
            this.dst = dst;
            this.buffer = new byte[8192];
            this.length = 0;
        }

        @Override
        public void write(int b) throws IOException {
            if (length == buffer.length) {
                flushBuffer();
            }
            buffer[length++] = (byte)b;
            if (length >= buffer.length) {
                flushBuffer();
            }
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            int capacityAfter = buffer.length - length - len;
            if (capacityAfter < 0) {
                int toCopy = buffer.length - length;
                appendToBuffer(b, off, toCopy);
                flushBuffer();

                off += toCopy;
                len -= toCopy;

                if (len >= buffer.length) {
                    // more than our buffer
                    dst.write(b, off, len);
                } else {
                    appendToBuffer(b, off, len);
                }
            } else {
                appendToBuffer(b, off, len);
            }
        }

        private void appendToBuffer(byte[] src, int off, int toCopy) {
            System.arraycopy(src, off, buffer, length, toCopy);
            length += toCopy;
        }

        private void flushBuffer() throws IOException {
            try {
                dst.write(buffer, 0, length);
            } finally {
                length = 0;
            }
        }

        @Override
        public void flush() throws IOException {
            flushBuffer();
            dst.flush();
        }

        @Override
        public void close() throws IOException {
            dst.close();
        }

    }
```

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
@ 2023-11-20 19:01 ` "vlsi (@vlsi)" <[email protected]>
  7 siblings, 0 replies; 9+ messages in thread

From: vlsi (@vlsi) @ 2023-11-20 19:01 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

> I think biggest optimization would be to replace FilterOutputStream with custom delegate. FilterOutputStream writes to given OutputStream one byte at a time

I think `BuferredOutputStrem` overrides `write(byte[], int, int)` method, so your analysis does not sound quite right: https://github.com/openjdk/jdk/blob/30d895383eda37ba1475825a258feb0a15da1526/src/java.base/share/cla...

In other words the flame graph looks like `PGStream.sendInteger2` calling `pgOutput.write(byte[])`.
`pgOutput.write` happens to be a subclass of `FilterOutputSteram`, however, `FilterOutputStream.write(byte[])` calls `write(b, 0, b.length)` which turns out to be `BufferedOutputStream`.

In other words, the flame graph suggests that in your case `PGStream.out` was `BufferedOutputStream`.

I suggest we reproduce the issue with JMH first and then attempt to fix it.

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
@ 2023-11-20 22:45 ` "zkorhone (@zkorhone)" <[email protected]>
  7 siblings, 0 replies; 9+ messages in thread

From: zkorhone (@zkorhone) @ 2023-11-20 22:45 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

The combination of weird output from flamegraph and presence of unused codepath with `FilterOutputStream` in the same class that derailed me. As you mentioned, it is indeed BufferedOutputStream that is being invoked. There was also flaw in my assumption on how much data is being written per write to BufferedOutputStream.

It was fairly easy to modify the benchmark, so I just did that. I implemented what would be worse than worst-case. And results are here:

[Threads+Pool] FilteringOutputStream(NoLockBufferedOutputStream(OutputStream)): 90.575ms [ -56 ]
[Threads+Pool] FilteringOutputStream(BufferedOutputStream(OutputStream)): 621.572ms [ -56 ]
[No threads] FilteringOutputStream(NoLockBufferedOutputStream(OutputStream)): 445.047ms [ -56 ]
[No threads] FilteringOutputStream(BufferedOutputStream(OutputStream)): 3916.484ms [ -56 ]

Applying FilteringOutputStream forces writes to be done 1byte at a time (I did try also use of for loop instead of FilteringOutputStream to copy 1byte at a time from random input and results were identical).

So with very small writes unsynchronized version of BufferedOutputStream does make sense. If you're batching a lot of data without long character data, this could very well happen.

NoLockBufferedOutputStream that I provided is quite naive implementation. Looking at PGStream more closely, it might make sense to move logic from sendInteger2 etc. to OutputStream. This would allow to write directly to the buffer. Such implementation could _theoretically_ perform better than copying between two buffers. On the other hand HotSpot might be able to optimize such a trivial case, if buffer was created with new for every invocation. Reusing buffer that is declared as a PGStream is actually likely to perform worse. Field indicates that state should be kept. It would be very difficult for optimizer to be sure, that it's safe to use registers / stack allocations to copy data.

I'll add the modified benchmark.

Update: Be sure to run on Java17+. Performance on older Java versions differ from what can be observed on modern Java. Biased-locking was disabled by default in Java 15 and it was removed completely in Java 21.

[microbenchmark.java.txt](https://github.com/pgjdbc/pgjdbc/files/13420011/microbenchmark.java.txt)

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
@ 2023-11-21 07:50 ` "vlsi (@vlsi)" <[email protected]>
  7 siblings, 0 replies; 9+ messages in thread

From: vlsi (@vlsi) @ 2023-11-21 07:50 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

> Looking at PGStream more closely, it might make sense to move logic from sendInteger2 etc. to OutputStream

That sounds plausible.

> I'll add the modified benchmark.

Frankly, I suggest using JMH for the benchmark as it helps avoiding many pitfalls, and it has integrated support for profilers.
I might try doing it sometime later, however,  I would focus on reviewing and merging the outstanding PRs first.

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
@ 2023-11-21 14:34 ` "BrentDouglasB1 (@BrentDouglasB1)" <[email protected]>
  7 siblings, 0 replies; 9+ messages in thread

From: BrentDouglasB1 (@BrentDouglasB1) @ 2023-11-21 14:34 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

I was quite interested to find this active ticket today, as I opened this repo with the intent to create one to discuss essentially the same patch as above (an unsynchonized clone of BufferedOutputStream for the usage in PGStream). I'm only bringing some anecdotal profiler images to the table at the moment, but I would be happy to assist creating some benchmarks/etc for the patch if desired. The pictures below are Zulu 21, PG 12, PGJDBC 42.5.4

This is a screenshot I have taken from a datadog profile of a benchmark for a real workload run using the unpatched driver.

![Screenshot 2023-11-21 at 6 17 08 PM](https://github.com/pgjdbc/pgjdbc/assets/83204206/0ac77dd0-ac89-4faa-86af-06eee31002cc)

And this is what the profile looks like over the same run after monkey patching PGStream with a copy of BufferedOutputStream with the `synchronized` keywords removed.

![Screenshot 2023-11-21 at 10 07 15 PM](https://github.com/pgjdbc/pgjdbc/assets/83204206/bc062f1b-8c96-4319-a4dc-9d49a7d4e21b)

Removing these locks seems a great start to reduce CPU usage in batch queries.

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient
@ 2024-12-12 04:20 ` "BrentDouglasB1 (@BrentDouglasB1)" <[email protected]>
  7 siblings, 0 replies; 9+ messages in thread

From: BrentDouglasB1 (@BrentDouglasB1) @ 2024-12-12 04:20 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

This is fixed in eeb37dda71a76f70db220e26ac9c51573d4b539e

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

end of thread, other threads:[~2024-12-12 04:20 UTC | newest]

Thread overview: 9+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2023-10-10 14:39 [pgjdbc/pgjdbc] issue #2963: BufferedOutputStream in PGStream should be replaced to something more efficient "plokhotnyuk (@plokhotnyuk)" <[email protected]>
2023-10-10 15:01 ` "davecramer (@davecramer)" <[email protected]>
2023-11-17 00:09 ` "zkorhone (@zkorhone)" <[email protected]>
2023-11-20 11:52 ` "zkorhone (@zkorhone)" <[email protected]>
2023-11-20 19:01 ` "vlsi (@vlsi)" <[email protected]>
2023-11-20 22:45 ` "zkorhone (@zkorhone)" <[email protected]>
2023-11-21 07:50 ` "vlsi (@vlsi)" <[email protected]>
2023-11-21 14:34 ` "BrentDouglasB1 (@BrentDouglasB1)" <[email protected]>
2024-12-12 04:20 ` "BrentDouglasB1 (@BrentDouglasB1)" <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox