[postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining

postgresql-interfaces/psqlodbc GitHub issues and pull requests (mirror)  
help / color / mirror / Atom feed

[postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
4+ messages / 2 participants
[nested] [flat]

* [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
@ 2026-05-12 03:42 "jarvis24young (@jarvis24young)" <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: jarvis24young (@jarvis24young) @ 2026-05-12 03:42 UTC (permalink / raw)
  To: postgresql-interfaces/psqlodbc <[email protected]>

SQLWCHAR-to-UTF-8 conversion currently treats any UTF-16 high surrogate as the start of a surrogate pair. It then advances to the next code unit and reads it unconditionally.

That can read past the caller-supplied length when a wide-character ODBC API receives a dangling high surrogate at the end of its input. The new regression test exercises this through the public `SQLPrepareW()` path with a guarded one-code-unit SQLWCHAR buffer, so the old implementation faults deterministically if it reads `wstr[1]`.

Fix this by only taking the surrogate-pair path when:

- the current code unit is a high surrogate,
- there is another code unit within `ilen`, and
- the next code unit is a low surrogate.

Otherwise the existing non-pair path is used, avoiding the out-of-bounds read.

Reproduction on the old implementation, using the same black-box test with ASan and a guarded buffer:

```text
ERROR: AddressSanitizer: SEGV on unknown address
The signal is caused by a READ memory access.
#0 ucs2_to_utf8 win_unicode.c:191
#1 SQLPrepareW odbcapiw.c:439
#2 SQLPrepareW libodbc.so.2
#3 main test/src/surrogate-pair-test.c:109
```

Tested after the fix:

```text
cd ~/psqlodbc-surrogate-oob-build/test
ODBCSYSINI=. ODBCINSTINI=./odbcinst.ini ODBCINI=./odbc.ini ./runsuite surrogate-pair --inputdir=.
TAP version 13
1..1
ok 1 - surrogate-pair
```

Also tested the target binary directly under ASan/UBSan with `detect_leaks=0`; it returns normally.

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
@ 2026-05-14 11:48 ` "davecramer (@davecramer)" <[email protected]>
  2 siblings, 0 replies; 4+ messages in thread

From: davecramer (@davecramer) @ 2026-05-14 11:48 UTC (permalink / raw)
  To: postgresql-interfaces/psqlodbc <[email protected]>

In win_unicode.c, the ucs2_to_utf8() function converts SQLWCHAR (UTF-16) to UTF-8. When it encounters a high
  surrogate (0xD800–0xDBFF), it unconditionally advances to wstr[1] and reads the next code unit — without checking:

  1. Whether there is a next code unit within the caller-specified length (ilen)
  2. Whether that next code unit is actually a low surrogate (0xDC00–0xDFFF)

  This is an out-of-bounds read. If the input ends with a dangling high surrogate (which is invalid UTF-16 but must be
  handled gracefully), the driver reads one SQLWCHAR past the buffer.

  The Fix

  // Before:
  else if (surrog1_bits == (*wstr & surrog_check))

  // After:
  else if (surrog1_bits == (*wstr & surrog_check) &&
           i + 1 < ilen &&
           surrog2_bits == (wstr[1] & surrog_check))

  Three conditions must all be true before taking the surrogate-pair path:

  1. Current code unit is a high surrogate
  2. There's at least one more code unit within bounds (i + 1 < ilen)
  3. The next code unit is a low surrogate

  If any condition fails, the code falls through to the non-pair path (which handles the code unit as a standalone
  value — producing replacement or garbage UTF-8, but not crashing).

  Assessment

  Correctness: ✅ The fix is minimal and correct. The bounds check (i + 1 < ilen) prevents the OOB read. The
  low-surrogate validation (surrog2_bits == (wstr[1] & surrog_check)) ensures we only combine when we actually have a
  valid pair — a high surrogate followed by something that isn't a low surrogate is malformed input and shouldn't be
  combined.

  Test: ✅ Excellent test design. It uses mmap/VirtualAlloc to place the single SQLWCHAR at the end of a page with the
  next page marked PROT_NONE/PAGE_NOACCESS. If the driver reads past the supplied length, it gets a SIGSEGV/access
  violation — deterministic crash rather than silent corruption. Cross-platform (Unix + Win32).

  Security relevance: This is a memory safety bug reachable from any wide-character ODBC API (SQLPrepareW,
  SQLExecDirectW, etc.) when the application passes a length that doesn't include a trailing low surrogate. In
  practice this is unlikely to be exploitable (it's a 2-byte over-read, not a write), but it's still a crash bug that
  ASan catches.

  One observation: The fallthrough path for a dangling high surrogate will encode it as a 3-byte UTF-8 sequence (the
  raw surrogate code point). This is technically "WTF-8" rather than valid UTF-8. PostgreSQL will reject it anyway, so
  it's fine — the driver doesn't crash, and the server returns an error. A future improvement could substitute U+FFFD,
  but that's out of scope for this fix.

  Verdict

  Approve. Clean, minimal fix for a real OOB read with an excellent guarded-page regression test. Ship it.

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
@ 2026-05-14 11:48 ` "davecramer (@davecramer)" <[email protected]>
  2 siblings, 0 replies; 4+ messages in thread

From: davecramer (@davecramer) @ 2026-05-14 11:48 UTC (permalink / raw)
  To: postgresql-interfaces/psqlodbc <[email protected]>

Can you please credit whatever AI you are using as well?

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
@ 2026-05-14 12:12 ` "jarvis24young (@jarvis24young)" <[email protected]>
  2 siblings, 0 replies; 4+ messages in thread

From: jarvis24young (@jarvis24young) @ 2026-05-14 12:12 UTC (permalink / raw)
  To: postgresql-interfaces/psqlodbc <[email protected]>

> Can you please credit whatever AI you are using as well?

I used OpenAI ChatGPT/Codex to help draft parts of the analysis and review the patch. I should have credited that explicitly.
I have reviewed the final changes myself and I am responsible for the submitted code. I will add an AI assistance note to the PR description and do the same for
future AI-assisted contributions.

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

end of thread, other threads:[~2026-05-14 12:12 UTC | newest]

Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-05-12 03:42 [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining "jarvis24young (@jarvis24young)" <[email protected]>
2026-05-14 11:48 ` "davecramer (@davecramer)" <[email protected]>
2026-05-14 11:48 ` "davecramer (@davecramer)" <[email protected]>
2026-05-14 12:12 ` "jarvis24young (@jarvis24young)" <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox