Hi Robert,
Thanks for the pg_plan_advice patches! I've tried hard to attack them, to get
them to fail in some catastrophic way. You seem to have hardened the code
well. I found only one concern for you to consider, a kind of memory leak
ratchet:
Once the system reaches a state of memory pressure where:
- The 8192-byte DSA size class is exhausted (needs a new DSM segment, OS refuses)
- Smaller size classes still have free space in existing superblocks
Then every single query that triggers advice collection will:
1. Successfully allocate an advice entry from existing free space
2. Enter store_shared_advice() and hit the same chunk boundary
3. Fail to allocate the chunk
4. Leak the advice entry
5. Reduce remaining free space in the small size classes
This continues until the small size classes are also exhausted, at which point
make_collected_advice() itself fails (the free space in the DSA area has been
consumed by leaked entries). The ratchet is self-reinforcing: each failure
guarantees the next failure while consuming more resources, assuming nobody
else is freeing memory at the same time.
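To make the ratchet concrete, here is a toy model in standalone C. It is not
the real DSA code: ToyArea and the toy_* functions are names I made up, each
size class is reduced to a simple counter, and it assumes every attempt needs
a new chunk (which matches the "same chunk boundary" case above, since the
boundary never moves while the chunk allocation keeps failing).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: two size classes with fixed budgets, not the real DSA. */
typedef struct ToyArea
{
    size_t      small_free;     /* free slots left in the small size classes */
    size_t      chunk_free;     /* free slots left in the 8192-byte class */
    size_t      leaked;         /* entries allocated but never linked or freed */
} ToyArea;

/* Stand-in for make_collected_advice(): takes one small-class slot. */
static bool
toy_make_entry(ToyArea *area)
{
    if (area->small_free == 0)
        return false;
    area->small_free--;
    return true;
}

/* Stand-in for the chunk allocation inside store_shared_advice(). */
static bool
toy_alloc_chunk(ToyArea *area)
{
    if (area->chunk_free == 0)
        return false;
    area->chunk_free--;
    return true;
}

/*
 * One query's worth of advice collection.  Returns true on success.  If the
 * chunk allocation fails, the entry is neither linked nor freed, so it leaks.
 */
static bool
toy_collect_once(ToyArea *area)
{
    if (!toy_make_entry(area))
        return false;
    if (!toy_alloc_chunk(area))
    {
        area->leaked++;
        return false;
    }
    return true;
}

/* Retry until the small classes are exhausted too; returns failed attempts. */
static size_t
toy_retry_until_exhausted(ToyArea *area)
{
    size_t      failures = 0;

    assert(area->chunk_free == 0);  /* precondition: already in ratchet state */
    while (area->small_free > 0)
    {
        if (!toy_collect_once(area))
            failures++;
    }
    return failures;
}
```

Starting from chunk_free = 0 and small_free = N, toy_retry_until_exhausted()
reports N failures and leaves leaked = N and small_free = 0: every retry fails
and every retry costs one more small slot.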
I looked for situations where something inside postgres would keep retrying
after the OOM, but I think the most likely trigger is simply an application
that treats OOM as a transient error and keeps retrying.
See the make_collected_advice() call in pg_collect_advice_save(): that is the
point where DSA memory has been allocated but not yet linked into any data
structure. Everything downstream from there (the four dsa_allocate0() calls
inside store_shared_advice()) can fail and leak it.
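One possible shape for closing that window, again as a standalone sketch
rather than a patch: release the entry whenever a later allocation fails.
MockArea, mock_alloc(), mock_free() and mock_save_advice() are hypothetical
stand-ins, and the 64-byte entry size is arbitrary. In the real code the
equivalent might be dsa_allocate_extended() with DSA_ALLOC_NO_OOM plus a
dsa_free() on the failure path, or a PG_TRY block around the downstream
calls, but I have not checked which fits the existing error handling better.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical mock allocator: refuses requests once it is at capacity. */
typedef struct MockArea
{
    int         capacity;       /* maximum number of live allocations */
    int         live;           /* allocations currently outstanding */
} MockArea;

static void *
mock_alloc(MockArea *area, size_t size)
{
    void       *p;

    if (area->live >= area->capacity)
        return NULL;
    p = calloc(1, size);
    if (p == NULL)
        return NULL;
    area->live++;
    return p;
}

static void
mock_free(MockArea *area, void *p)
{
    assert(p != NULL);
    free(p);
    area->live--;
}

/*
 * Allocate an entry, then the chunk that will hold it.  If the chunk cannot
 * be allocated, release the entry before reporting failure, so that nothing
 * stays allocated but unlinked.
 */
static bool
mock_save_advice(MockArea *area, void **entry_out, void **chunk_out)
{
    void       *entry = mock_alloc(area, 64);
    void       *chunk;

    if (entry == NULL)
        return false;
    chunk = mock_alloc(area, 8192);
    if (chunk == NULL)
    {
        mock_free(area, entry); /* close the leak window */
        return false;
    }
    *entry_out = entry;
    *chunk_out = chunk;
    return true;
}
```

With capacity = 1, every call to mock_save_advice() fails, but live returns
to 0 each time, so repeated retries no longer consume anything.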