Message-ID: <gh-postgresql-interfaces-psqlodbc-187-c4450395724@github.com>
From: "davecramer (@davecramer)" <noreply+davecramer@github.com>
To: "postgresql-interfaces/psqlodbc" <noreply+postgresql-interfaces-psqlodbc@github.com>
Date: Thu, 14 May 2026 11:48:08 +0000
Subject: Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining
In-Reply-To: <gh-postgresql-interfaces-psqlodbc-187@github.com>
References: <gh-postgresql-interfaces-psqlodbc-187@github.com>
List-Id: <gh-postgresql-interfaces-psqlodbc.github.com>
X-GitHub-Author-Login: davecramer
X-GitHub-Comment-Id: 4450395724
X-GitHub-Comment-Type: issue_comment
X-GitHub-Issue: 187
X-GitHub-Repo: postgresql-interfaces/psqlodbc
X-GitHub-Type: comment
X-GitHub-Url: https://github.com/postgresql-interfaces/psqlodbc/pull/187#issuecomment-4450395724
Content-Type: text/plain; charset=utf-8

In win_unicode.c, the ucs2_to_utf8() function converts SQLWCHAR (UTF-16) to UTF-8. When it encounters a high
  surrogate (0xD800–0xDBFF), it unconditionally advances to wstr[1] and reads the next code unit — without checking:
  
  1. Whether there is a next code unit within the caller-specified length (ilen)
  2. Whether that next code unit is actually a low surrogate (0xDC00–0xDFFF)
  
  This is an out-of-bounds read. If the input ends with a dangling high surrogate (which is invalid UTF-16 but must be
  handled gracefully), the driver reads one SQLWCHAR past the buffer.
  
  The Fix
  
  // Before:
  else if (surrog1_bits == (*wstr & surrog_check))
  
  // After:
  else if (surrog1_bits == (*wstr & surrog_check) &&
           i + 1 < ilen &&
           surrog2_bits == (wstr[1] & surrog_check))
  
  Three conditions must all be true before taking the surrogate-pair path:
  
  1. Current code unit is a high surrogate
  2. There's at least one more code unit within bounds (i + 1 < ilen)
  3. The next code unit is a low surrogate
  
  If any condition fails, the code falls through to the non-pair path (which handles the code unit as a standalone
  value — producing replacement or garbage UTF-8, but not crashing).
  
  Assessment
  
  Correctness: ✅ The fix is minimal and correct. The bounds check (i + 1 < ilen) prevents the OOB read. The
  low-surrogate validation (surrog2_bits == (wstr[1] & surrog_check)) ensures we only combine when we actually have a
  valid pair — a high surrogate followed by something that isn't a low surrogate is malformed input and shouldn't be
  combined.
  
  Test: ✅ Excellent test design. It uses mmap/VirtualAlloc to place the single SQLWCHAR at the end of a page with the
  next page marked PROT_NONE/PAGE_NOACCESS. If the driver reads past the supplied length, it gets a SIGSEGV/access
  violation — deterministic crash rather than silent corruption. Cross-platform (Unix + Win32).
  
  Security relevance: This is a memory safety bug reachable from any wide-character ODBC API (SQLPrepareW,
  SQLExecDirectW, etc.) when the application passes a length that doesn't include a trailing low surrogate. In
  practice this is unlikely to be exploitable (it's a 2-byte over-read, not a write), but it's still a crash bug that
  ASan catches.
  
  One observation: The fallthrough path for a dangling high surrogate will encode it as a 3-byte UTF-8 sequence (the
  raw surrogate code point). This is technically "WTF-8" rather than valid UTF-8. PostgreSQL will reject it anyway, so
  it's fine — the driver doesn't crash, and the server returns an error. A future improvement could substitute U+FFFD,
  but that's out of scope for this fix.
  
  Verdict
  
  Approve. Clean, minimal fix for a real OOB read with an excellent guarded-page regression test. Ship it.