Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining

postgresql-interfaces/psqlodbc GitHub issues and pull requests (mirror)  
help / color / mirror / Atom feed

From: davecramer (@davecramer) <[email protected]>
To: postgresql-interfaces/psqlodbc <[email protected]>
Subject: Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
Date: Thu, 14 May 2026 11:48:08 +0000
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>

In win_unicode.c, the ucs2_to_utf8() function converts SQLWCHAR (UTF-16) to UTF-8. When it encounters a high
  surrogate (0xD800–0xDBFF), it unconditionally advances to wstr[1] and reads the next code unit — without checking:

  1. Whether there is a next code unit within the caller-specified length (ilen)
  2. Whether that next code unit is actually a low surrogate (0xDC00–0xDFFF)

  This is an out-of-bounds read. If the input ends with a dangling high surrogate (which is invalid UTF-16 but must be
  handled gracefully), the driver reads one SQLWCHAR past the buffer.

  The Fix

  // Before:
  else if (surrog1_bits == (*wstr & surrog_check))

  // After:
  else if (surrog1_bits == (*wstr & surrog_check) &&
           i + 1 < ilen &&
           surrog2_bits == (wstr[1] & surrog_check))

  Three conditions must all be true before taking the surrogate-pair path:

  1. Current code unit is a high surrogate
  2. There's at least one more code unit within bounds (i + 1 < ilen)
  3. The next code unit is a low surrogate

  If any condition fails, the code falls through to the non-pair path (which handles the code unit as a standalone
  value — producing replacement or garbage UTF-8, but not crashing).

  Assessment

  Correctness: ✅ The fix is minimal and correct. The bounds check (i + 1 < ilen) prevents the OOB read. The
  low-surrogate validation (surrog2_bits == (wstr[1] & surrog_check)) ensures we only combine when we actually have a
  valid pair — a high surrogate followed by something that isn't a low surrogate is malformed input and shouldn't be
  combined.

  Test: ✅ Excellent test design. It uses mmap/VirtualAlloc to place the single SQLWCHAR at the end of a page with the
  next page marked PROT_NONE/PAGE_NOACCESS. If the driver reads past the supplied length, it gets a SIGSEGV/access
  violation — deterministic crash rather than silent corruption. Cross-platform (Unix + Win32).

  Security relevance: This is a memory safety bug reachable from any wide-character ODBC API (SQLPrepareW,
  SQLExecDirectW, etc.) when the application passes a length that doesn't include a trailing low surrogate. In
  practice this is unlikely to be exploitable (it's a 2-byte over-read, not a write), but it's still a crash bug that
  ASan catches.

  One observation: The fallthrough path for a dangling high surrogate will encode it as a 3-byte UTF-8 sequence (the
  raw surrogate code point). This is technically "WTF-8" rather than valid UTF-8. PostgreSQL will reject it anyway, so
  it's fine — the driver doesn't crash, and the server returns an error. A future improvement could substitute U+FFFD,
  but that's out of scope for this fix.

  Verdict

  Approve. Clean, minimal fix for a real OOB read with an excellent guarded-page regression test. Ship it.

view thread (4+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: github://postgresql-interfaces/psqlodbc
  Cc: [email protected], [email protected]
  Subject: Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
  In-Reply-To: <<[email protected]>>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox