[pgjdbc/pgjdbc] issue #3840: Periodic latency spikes to Postgres in large scale deployment

pgjdbc/pgjdbc GitHub issues and pull requests (mirror)  
help / color / mirror / Atom feed

[pgjdbc/pgjdbc] issue #3840: Periodic latency spikes to Postgres in large scale deployment
6+ messages / 3 participants
[nested] [flat]

* [pgjdbc/pgjdbc] issue #3840: Periodic latency spikes to Postgres in large scale deployment
@ 2025-10-17 11:39  "wgfm (@wgfm)" <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: wgfm (@wgfm) @ 2025-10-17 11:39 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

**Describe the issue**

We observe periodic latency spikes during regular operations, and it significantly impacts tail latencies of our database queries.

We’re running a monolithic application with a sharded database setup. We’re on AWS and use RDS and RDS Proxies extensively. Our application nodes are connected to 120+ database instances at a time. We see periodic latency spikes every minute for ~3 seconds. While these latency spikes are happening, we see that pgjdbc’s SharedTimer thread is `Object.wait()`ing on a random thread serving traffic:

<img width="1948" height="728" alt="Image" src="https://github.com/user-attachments/assets/aab1ca1d-31c4-4d70-9ebe-128457cbb500"; />

In the above image, threads calling the database show long SocketRead events while the SharedTimer thread shows long Java Monitor Read events. We observe long SocketRead events in all threads querying the database, and the vast majority of these SocketRead events are queries to Postgres. Other network calls, such as queries to Redis or to microservices don't show the same periodic latency.

We can't say for sure this is a driver issue yet, and we could use your help in either confirming that it could be, or rule it out completely. The driver is one of our primary suspects, because of the single SharedTimer thread shared by 120+ connection pools to our database instances. This could make it a resource contention issue.

One thing of note is that no matter which server node we profile, the latency spikes always happen at 1-2 seconds past the minute mark.

We have ruled out the following suspects:
- Nagle's algorithm. This is disabled by default in the driver version we use, and we haven't enabled it explicitly.
- Garbage collection events. These do not exhibit the same periodicity as our latency spikes.

We realise this is not a lot to go on. Any pointers at all would be greatly appreciated.

**Driver Version?** 

42.6.2

**Java Version?**

Corretto 17.0.16

**OS Version?**

Ubuntu 22.04.5

**PostgreSQL Version?**

15.12

**To Reproduce**

We have not been able to reproduce this locally.

**Expected behaviour**

We don't expect periodic latency spikes.

**Logs**

We don't have any relevant logs.

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #3840: Periodic latency spikes to Postgres in large scale deployment
@ 2025-10-17 11:48  "davecramer (@davecramer)" <[email protected]>
  4 siblings, 0 replies; 6+ messages in thread

From: davecramer (@davecramer) @ 2025-10-17 11:48 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

`One thing of note is that no matter which server node we profile, the latency spikes always happen at 1-2 seconds past the minute mark.` 
Is a pretty big red flag for me.
I don't think anyone else has reported this issue. Doesn't mean it doesn't exist, just I'd expect it to show up elsewhere if it was in the driver.

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #3840: Periodic latency spikes to Postgres in large scale deployment
@ 2025-10-17 12:01  "wgfm (@wgfm)" <[email protected]>
  4 siblings, 0 replies; 6+ messages in thread

From: wgfm (@wgfm) @ 2025-10-17 12:01 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

Thanks for your prompt reply. I expected as much. It looks like it's caused by some task on a repeated schedule. So far, we haven't found anything with this kind of periodicity.

Under what conditions would the SharedTimer thread wait on another thread for multiple seconds?

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #3840: Periodic latency spikes to Postgres in large scale deployment
@ 2025-10-17 12:14  "davecramer (@davecramer)" <[email protected]>
  4 siblings, 0 replies; 6+ messages in thread

From: davecramer (@davecramer) @ 2025-10-17 12:14 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

From what I can see, the shared time is being used to close an otherwise not closed connection. It could be waiting to send the close message? 

@vlsi may have other ideas

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #3840: Periodic latency spikes to Postgres in large scale deployment
@ 2025-10-18 11:35  "vlsi (@vlsi)" <[email protected]>
  4 siblings, 0 replies; 6+ messages in thread

From: vlsi (@vlsi) @ 2025-10-18 11:35 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

The driver uses a shared `SharedTimer` to implement statement cancellation via query timeout. OpenJDK uses `Object,wait()` in `java.util.TimerThread`.

In other words, it is expected that `SharedTimer` would wait in `Object.wait` in case the application uses something like `java.sql.Statement#setQueryTimeout` and executes a query.

I do not see how `TimerThread` code could result in "at 1-2 seconds past the minute mark"

Frankly, so far it looks like a scheduled activity in the application code that executes every minute.

The current `SharedTimer` does bother me a bit. For instance, it is suboptimal for executing a lot of queries with long timeouts and we currently use `purgeTimerTasks()` to remove expired tasks. However, I haven't observed workloads that would run into `purgeTimerTasks()` issue though.

It would be interesting to get some stacktraces/threaddumps/jfs/async-profiler results regarding the issue.

---

Technically speaking, `SharedTimer` indeed uses a single thread to fire its tasks. However, an unexpected slowness of a single task should not impact the latency for the rest:
* The tasks execute without holding the timer lock
* Individual app threads do not wait each task execution

---

I would suggest capturing the stack traces for "at 1-2 seconds past the minute mark".
For instance, async-profiler's heatmaps might help you: https://github.com/async-profiler/async-profiler/blob/master/docs/Heatmap.md 

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [pgjdbc/pgjdbc] issue #3840: Periodic latency spikes to Postgres in large scale deployment
@ 2025-10-20 07:21  "wgfm (@wgfm)" <[email protected]>
  4 siblings, 0 replies; 6+ messages in thread

From: wgfm (@wgfm) @ 2025-10-20 07:21 UTC (permalink / raw)
  To: pgjdbc/pgjdbc <[email protected]>

Unfortunately, I can't provide our JFRs, as those are classified. I may be able to walk you through one in a video call if that would be useful, but I'll have to get clearance first.

Here's a redacted stack trace for a socket read wait event, though nothing looks out of the ordinary here to me:

```
Thread.run()
ThreadPoolExecutor$Worker.run()
ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
MyShadowReader.read(long, long, boolean, boolean)
OurOpenTracingUtilKtKt.traceSync(String, Function1)
OurOpenTracingUtil.applyWithTracerSpan(String, Function)
OurOpenTracingUtil.doWithTracerSpan(String, Function)
OurOpenTracingUtil.doWithTracerSpan(String, Span, Function)
OurOpenTracingUtilKtKt$$Lambda$7552+0x000000080410e890.292814703.apply(Object)
OurOpenTracingUtilKtKt.traceSync$lambda$0(Function1, Span)
MyShadowReader$$Lambda$7573+0x000000080413ebe0.845997003.invoke(Object)
MyShadowReader.read$lambda$10(MyShadowReader, long, long, boolean, boolean, Span)
MyPostgresRepository.findOne(long, long, boolean, boolean)
MyQuery.get(long, long, boolean, boolean)
DefaultDSLContext.fetchOne(String, Object[])
ResultQueryTrait.fetchOne()
AbstractResultQuery.fetchLazyNonAutoClosing()
AbstractResultQuery.fetchLazy()
AbstractQuery.execute()
AbstractResultQuery.execute(ExecuteContext, ExecuteListener)
Tools.executeStatementAndGetFirstResultSet(ExecuteContext, int)
DefaultPreparedStatement.execute()
$Proxy579.execute()
OurConnectionMetricsDataSourceProxy$ConnectionMetricsInvocationHandler$StatementMetricsInvocationHandler.invoke(Object, Method, Object[])
OurConnectionMetricsDataSourceProxy$ConnectionMetricsInvocationHandler.executeMethod(Callable)
OurConnectionMetricsDataSourceProxy$ConnectionMetricsInvocationHandler$StatementMetricsInvocationHandler$$Lambda$6626+0x0000000803d43c48.764783219.call()
OurConnectionMetricsDataSourceProxy$ConnectionMetricsInvocationHandler$StatementMetricsInvocationHandler.lambda$invoke$0(Method, Object[])
Method.invoke(Object, Object[])
DelegatingMethodAccessorImpl.invoke(Object, Object[])
GeneratedMethodAccessor272.invoke(Object, Object[])
$Proxy579.execute()
OurQueryMetricsDataSourceProxy$ConnectionMetricsInvocationHandler$StatementMetricsInvocationHandler.invoke(Object, Method, Object[])
OurQueryMetricsDataSourceProxy$ConnectionMetricsInvocationHandler.executeMethod(Callable)
OurQueryMetricsDataSourceProxy$ConnectionMetricsInvocationHandler$StatementMetricsInvocationHandler$$Lambda$6627+0x0000000803d50000.1175461386.call()
OurQueryMetricsDataSourceProxy$ConnectionMetricsInvocationHandler$StatementMetricsInvocationHandler.lambda$invoke$0(Method, Object[])
Method.invoke(Object, Object[])
DelegatingMethodAccessorImpl.invoke(Object, Object[])
GeneratedMethodAccessor272.invoke(Object, Object[])
HikariProxyPreparedStatement.execute()
ProxyPreparedStatement.execute()
PgPreparedStatement.execute()
PgPreparedStatement.executeWithFlags(int)
PgStatement.execute(CachedQuery, ParameterList, int)
PgStatement.executeInternal(CachedQuery, ParameterList, int)
QueryExecutorImpl.execute(Query, ParameterList, ResultHandler, int, int, int, boolean)
QueryExecutorImpl.processResults(ResultHandler, int, boolean)
PGStream.receiveChar()
VisibleBufferedInputStream.read()
VisibleBufferedInputStream.ensureBytes(int)
VisibleBufferedInputStream.ensureBytes(int, boolean)
VisibleBufferedInputStream.readMore(int, boolean)
SSLSocketImpl$AppInputStream.read(byte[], int, int)
SSLSocketImpl.readApplicationRecord(ByteBuffer)
SSLSocketInputRecord.bytesInCompletePacket()
SSLSocketInputRecord.readHeader()
SSLSocketInputRecord.read(InputStream, byte[], int, int)
```

And for the Java Monitor Wait event:

```
TimerThread.run()
TimerThread.mainLoop()
Object.wait(long)
```

> I do not see how TimerThread code could result in "at 1-2 seconds past the minute mark"

That's what I expected from looking through the driver code. Thanks for the confirmation.

> SharedTimer would wait in Object.wait in case the application uses something like java.sql.Statement#setQueryTimeout

From what I've seen, our code only very rarely uses timeouts on queries, and mostly for queries that are run during manual administrative tasks. It could be that middlewares, frameworks, or libraries set timeouts, or expose timeouts with a different API, though. I'll have a closer look.

I'm okay closing the issue, as this looks like a dead end in my investigation. Thanks again for your support.

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2025-10-20 07:21 UTC | newest]

Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-10-17 11:39 [pgjdbc/pgjdbc] issue #3840: Periodic latency spikes to Postgres in large scale deployment "wgfm (@wgfm)" <[email protected]>
2025-10-17 11:48 ` "davecramer (@davecramer)" <[email protected]>
2025-10-17 12:01 ` "wgfm (@wgfm)" <[email protected]>
2025-10-17 12:14 ` "davecramer (@davecramer)" <[email protected]>
2025-10-18 11:35 ` "vlsi (@vlsi)" <[email protected]>
2025-10-20 07:21 ` "wgfm (@wgfm)" <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox