Message-ID: From: "davecramer (@davecramer)" To: "postgresql-interfaces/psqlodbc" Date: Thu, 14 May 2026 11:48:08 +0000 Subject: Re: [postgresql-interfaces/psqlodbc] PR #187: Validate UTF-16 surrogate pairs before combining In-Reply-To: References: List-Id: X-GitHub-Author-Login: davecramer X-GitHub-Comment-Id: 4450395724 X-GitHub-Comment-Type: issue_comment X-GitHub-Issue: 187 X-GitHub-Repo: postgresql-interfaces/psqlodbc X-GitHub-Type: comment X-GitHub-Url: https://github.com/postgresql-interfaces/psqlodbc/pull/187#issuecomment-4450395724 Content-Type: text/plain; charset=utf-8 In win_unicode.c, the ucs2_to_utf8() function converts SQLWCHAR (UTF-16) to UTF-8. When it encounters a high surrogate (0xD800–0xDBFF), it unconditionally advances to wstr[1] and reads the next code unit — without checking: 1. Whether there is a next code unit within the caller-specified length (ilen) 2. Whether that next code unit is actually a low surrogate (0xDC00–0xDFFF) This is an out-of-bounds read. If the input ends with a dangling high surrogate (which is invalid UTF-16 but must be handled gracefully), the driver reads one SQLWCHAR past the buffer. The Fix // Before: else if (surrog1_bits == (*wstr & surrog_check)) // After: else if (surrog1_bits == (*wstr & surrog_check) && i + 1 < ilen && surrog2_bits == (wstr[1] & surrog_check)) Three conditions must all be true before taking the surrogate-pair path: 1. Current code unit is a high surrogate 2. There's at least one more code unit within bounds (i + 1 < ilen) 3. The next code unit is a low surrogate If any condition fails, the code falls through to the non-pair path (which handles the code unit as a standalone value — producing replacement or garbage UTF-8, but not crashing). Assessment Correctness: ✅ The fix is minimal and correct. The bounds check (i + 1 < ilen) prevents the OOB read. The low-surrogate validation (surrog2_bits == (wstr[1] & surrog_check)) ensures we only combine when we actually have a valid pair — a high surrogate followed by something that isn't a low surrogate is malformed input and shouldn't be combined. Test: ✅ Excellent test design. It uses mmap/VirtualAlloc to place the single SQLWCHAR at the end of a page with the next page marked PROT_NONE/PAGE_NOACCESS. If the driver reads past the supplied length, it gets a SIGSEGV/access violation — deterministic crash rather than silent corruption. Cross-platform (Unix + Win32). Security relevance: This is a memory safety bug reachable from any wide-character ODBC API (SQLPrepareW, SQLExecDirectW, etc.) when the application passes a length that doesn't include a trailing low surrogate. In practice this is unlikely to be exploitable (it's a 2-byte over-read, not a write), but it's still a crash bug that ASan catches. One observation: The fallthrough path for a dangling high surrogate will encode it as a 3-byte UTF-8 sequence (the raw surrogate code point). This is technically "WTF-8" rather than valid UTF-8. PostgreSQL will reject it anyway, so it's fine — the driver doesn't crash, and the server returns an error. A future improvement could substitute U+FFFD, but that's out of scope for this fix. Verdict Approve. Clean, minimal fix for a real OOB read with an excellent guarded-page regression test. Ship it.