Hi Michael (and Peter),
Thanks for the detailed feedback — this is really helpful.
I want to make sure I understand your main point: you're OK with a new `vartag_external`, but prefer we avoid increasing the heap TOAST pointer from 16 -> 20 bytes since every zstd-toasted value would pay +4 bytes in the main heap tuple.
I also realize the "compatibility" of the extended header doesn't buy us much — we'll need to support the existing 16-byte varatt_external forever for backward compatibility. Adding a 20-byte structure just means two formats to maintain indefinitely.
A couple of clarifying questions, assuming we go with a new vartag (e.g., `VARTAG_ONDISK_ZSTD`), keep the same 16-byte `varatt_external` payload, and use the vartag as the discriminator:
1. How should we handle future methods beyond zstd? One tag per method, or store a method id elsewhere (e.g., in the TOAST chunk header)?
2. And re: "as long as the TOAST value is 32 bits" — are you referring to the 30-bit extsize field in va_extinfo (i.e., avoid stealing bits from extsize for method encoding)?
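To make sure we mean the same layout, here is a minimal standalone sketch of the tag-as-discriminator idea. It is not patch code: the `sketch_` names are made up for illustration, the struct only mirrors the 16-byte `varatt_external` field layout as I understand it, and the value 19 for the new tag and the zstd method id are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the 16-byte varatt_external payload (Oid is a 32-bit unsigned). */
typedef struct
{
	int32_t		va_rawsize;		/* original data size, including header */
	uint32_t	va_extinfo;		/* low 30 bits: external size; top 2 bits: method */
	uint32_t	va_valueid;		/* OID of the value within the TOAST table */
	uint32_t	va_toastrelid;	/* OID of the TOAST table */
} sketch_toast_pointer;

/* The point of the proposal: the payload does not grow. */
static_assert(sizeof(sketch_toast_pointer) == 16, "payload must stay 16 bytes");

#define SKETCH_EXTSIZE_BITS	30
#define SKETCH_EXTSIZE_MASK	((1U << SKETCH_EXTSIZE_BITS) - 1)

#define SKETCH_VARTAG_ONDISK		18	/* existing VARTAG_ONDISK value */
#define SKETCH_VARTAG_ONDISK_ZSTD	19	/* HYPOTHETICAL: value not decided */

typedef enum
{
	SKETCH_METHOD_INVALID = -1,
	SKETCH_METHOD_PGLZ = 0,
	SKETCH_METHOD_LZ4 = 1,
	SKETCH_METHOD_ZSTD = 2		/* HYPOTHETICAL id, implied by the tag only */
} sketch_method;

/* External size: always the low 30 bits, whichever tag is used. */
static uint32_t
sketch_extsize(const sketch_toast_pointer *p)
{
	return p->va_extinfo & SKETCH_EXTSIZE_MASK;
}

/* Compression method: taken from the tag first, then from the top 2 bits. */
static sketch_method
sketch_method_for(uint8_t vartag, const sketch_toast_pointer *p)
{
	uint32_t	bits = p->va_extinfo >> SKETCH_EXTSIZE_BITS;

	if (vartag == SKETCH_VARTAG_ONDISK_ZSTD)
		return SKETCH_METHOD_ZSTD;	/* no bits taken from va_extinfo */
	if (vartag != SKETCH_VARTAG_ONDISK)
		return SKETCH_METHOD_INVALID;
	if (bits == 0)
		return SKETCH_METHOD_PGLZ;
	if (bits == 1)
		return SKETCH_METHOD_LZ4;
	return SKETCH_METHOD_INVALID;
}
```

With one tag per method, question 1 comes down to how many tag values we are willing to spend; the extsize field keeps its full 30 bits either way.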
| Test | Rows | Uncompressed | PGLZ | LZ4 | ZSTD | PGLZ/ZSTD | LZ4/ZSTD |
|------|------|--------------|------|-----|------|-----------|----------|
| T1: Large JSON (~18KB/row) | 500 | ~9,000 KB | 1496 KB | 1528 KB | 976 KB | 1.53x | 1.57x |
| T2: Repetitive Text (~246KB/row) | 500 | ~123,000 KB | 1672 KB | 648 KB | 248 KB | 6.74x | 2.61x |
| T3: MD5 Hash Data (~16KB/row) | 500 | ~8,000 KB | 8288 KB | 8232 KB | 4256 KB | 1.95x | 1.93x |
| T4: Server Logs (~3.5KB/row) | 1000 | ~3,500 KB | 400 KB | 352 KB | 456 KB | 0.88x | 0.77x |
Key findings (I guess well known at this point):
- ZSTD excels on repetitive/pattern-heavy data (T2: 6.7x smaller than PGLZ, 2.6x smaller than LZ4)
- For low-redundancy data (T3: MD5 hashes), ZSTD output is still ~2x smaller than both PGLZ and LZ4
- The T4 result showing zstd as "worse" is not about compression quality - it's about missing inline storage support. ZSTD actually compresses better, but pays unnecessary TOAST overhead.
I'll share the detailed benchmark script with the next patch revision. Separately, one potential path forward would be to replace pglz entirely; I can bring that up later in a different thread.
On Testing and Patch Structure
Agreed on both points:
- I'll use `compression_zstd.sql` following the `compression_lz4.sql` pattern (removing the test_toast_ext module)
- I'll split the GUC refactoring into a separate preparatory patch
Once you confirm which representation you're advocating, I'll respin accordingly.
Thanks,
Dharin