postgresql-interfaces/psqlodbc GitHub issues and pull requests (mirror)
help / color / mirror / Atom feed[postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
4+ messages / 2 participants
[nested] [flat]
* [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
@ 2026-05-12 03:42 "jarvis24young (@jarvis24young)" <[email protected]>
0 siblings, 0 replies; 4+ messages in thread
From: jarvis24young (@jarvis24young) @ 2026-05-12 03:42 UTC (permalink / raw)
To: postgresql-interfaces/psqlodbc <[email protected]>
SQLWCHAR-to-UTF-8 conversion currently treats any UTF-16 high surrogate as the start of a surrogate pair. It then advances to the next code unit and reads it unconditionally.
That can read past the caller-supplied length when a wide-character ODBC API receives a dangling high surrogate at the end of its input. The new regression test exercises this through the public `SQLPrepareW()` path with a guarded one-code-unit SQLWCHAR buffer, so the old implementation faults deterministically if it reads `wstr[1]`.
Fix this by only taking the surrogate-pair path when:
- the current code unit is a high surrogate,
- there is another code unit within `ilen`, and
- the next code unit is a low surrogate.
Otherwise the existing non-pair path is used, avoiding the out-of-bounds read.
Reproduction on the old implementation, using the same black-box test with ASan and a guarded buffer:
```text
ERROR: AddressSanitizer: SEGV on unknown address
The signal is caused by a READ memory access.
#0 ucs2_to_utf8 win_unicode.c:191
#1 SQLPrepareW odbcapiw.c:439
#2 SQLPrepareW libodbc.so.2
#3 main test/src/surrogate-pair-test.c:109
```
Tested after the fix:
```text
cd ~/psqlodbc-surrogate-oob-build/test
ODBCSYSINI=. ODBCINSTINI=./odbcinst.ini ODBCINI=./odbc.ini ./runsuite surrogate-pair --inputdir=.
TAP version 13
1..1
ok 1 - surrogate-pair
```
Also tested the target binary directly under ASan/UBSan with `detect_leaks=0`; it returns normally.
^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
@ 2026-05-14 11:48 ` "davecramer (@davecramer)" <[email protected]>
2 siblings, 0 replies; 4+ messages in thread
From: davecramer (@davecramer) @ 2026-05-14 11:48 UTC (permalink / raw)
To: postgresql-interfaces/psqlodbc <[email protected]>
In win_unicode.c, the ucs2_to_utf8() function converts SQLWCHAR (UTF-16) to UTF-8. When it encounters a high
surrogate (0xD800–0xDBFF), it unconditionally advances to wstr[1] and reads the next code unit — without checking:
1. Whether there is a next code unit within the caller-specified length (ilen)
2. Whether that next code unit is actually a low surrogate (0xDC00–0xDFFF)
This is an out-of-bounds read. If the input ends with a dangling high surrogate (which is invalid UTF-16 but must be
handled gracefully), the driver reads one SQLWCHAR past the buffer.
The Fix
// Before:
else if (surrog1_bits == (*wstr & surrog_check))
// After:
else if (surrog1_bits == (*wstr & surrog_check) &&
i + 1 < ilen &&
surrog2_bits == (wstr[1] & surrog_check))
Three conditions must all be true before taking the surrogate-pair path:
1. Current code unit is a high surrogate
2. There's at least one more code unit within bounds (i + 1 < ilen)
3. The next code unit is a low surrogate
If any condition fails, the code falls through to the non-pair path (which handles the code unit as a standalone
value — producing replacement or garbage UTF-8, but not crashing).
Assessment
Correctness: ✅ The fix is minimal and correct. The bounds check (i + 1 < ilen) prevents the OOB read. The
low-surrogate validation (surrog2_bits == (wstr[1] & surrog_check)) ensures we only combine when we actually have a
valid pair — a high surrogate followed by something that isn't a low surrogate is malformed input and shouldn't be
combined.
Test: ✅ Excellent test design. It uses mmap/VirtualAlloc to place the single SQLWCHAR at the end of a page with the
next page marked PROT_NONE/PAGE_NOACCESS. If the driver reads past the supplied length, it gets a SIGSEGV/access
violation — deterministic crash rather than silent corruption. Cross-platform (Unix + Win32).
Security relevance: This is a memory safety bug reachable from any wide-character ODBC API (SQLPrepareW,
SQLExecDirectW, etc.) when the application passes a length that doesn't include a trailing low surrogate. In
practice this is unlikely to be exploitable (it's a 2-byte over-read, not a write), but it's still a crash bug that
ASan catches.
One observation: The fallthrough path for a dangling high surrogate will encode it as a 3-byte UTF-8 sequence (the
raw surrogate code point). This is technically "WTF-8" rather than valid UTF-8. PostgreSQL will reject it anyway, so
it's fine — the driver doesn't crash, and the server returns an error. A future improvement could substitute U+FFFD,
but that's out of scope for this fix.
Verdict
Approve. Clean, minimal fix for a real OOB read with an excellent guarded-page regression test. Ship it.
^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
@ 2026-05-14 11:48 ` "davecramer (@davecramer)" <[email protected]>
2 siblings, 0 replies; 4+ messages in thread
From: davecramer (@davecramer) @ 2026-05-14 11:48 UTC (permalink / raw)
To: postgresql-interfaces/psqlodbc <[email protected]>
Can you please credit whatever AI you are using as well?
^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
@ 2026-05-14 12:12 ` "jarvis24young (@jarvis24young)" <[email protected]>
2 siblings, 0 replies; 4+ messages in thread
From: jarvis24young (@jarvis24young) @ 2026-05-14 12:12 UTC (permalink / raw)
To: postgresql-interfaces/psqlodbc <[email protected]>
> Can you please credit whatever AI you are using as well?
I used OpenAI ChatGPT/Codex to help draft parts of the analysis and review the patch. I should have credited that explicitly.
I have reviewed the final changes myself and I am responsible for the submitted code. I will add an AI assistance note to the PR description and do the same for
future AI-assisted contributions.
^ permalink raw reply [nested|flat] 4+ messages in thread
end of thread, other threads:[~2026-05-14 12:12 UTC | newest]
Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-05-12 03:42 [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining "jarvis24young (@jarvis24young)" <[email protected]>
2026-05-14 11:48 ` "davecramer (@davecramer)" <[email protected]>
2026-05-14 11:48 ` "davecramer (@davecramer)" <[email protected]>
2026-05-14 12:12 ` "jarvis24young (@jarvis24young)" <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox