X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org Received: from localhost (av.hub.org [200.46.204.144]) by postgresql.org (Postfix) with ESMTP id CD1D89DCB62 for ; Fri, 9 Dec 2005 12:17:57 -0400 (AST) Received: from postgresql.org ([200.46.204.71]) by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024) with ESMTP id 18662-09 for ; Fri, 9 Dec 2005 12:17:56 -0400 (AST) X-Greylist: from auto-whitelisted by SQLgrey- Received: from svana.org (svana.org [203.20.62.76]) by postgresql.org (Postfix) with ESMTP id 011699DD725 for ; Fri, 9 Dec 2005 12:17:54 -0400 (AST) Received: from kleptog by svana.org with local (Exim 3.35 #1 (Debian)) id 1Ekkw9-00066t-00; Sat, 10 Dec 2005 03:17:33 +1100 Date: Fri, 9 Dec 2005 17:17:33 +0100 From: Martijn van Oosterhout To: Gregory Maxwell Cc: Bruce Momjian , Gavin Sherry , Peter Eisentraut , pgsql-hackers@postgresql.org Subject: Re: Upcoming PG re-releases Message-ID: <20051209161733.GC20352@svana.org> Reply-To: Martijn van Oosterhout References: <200512082244.jB8MiYT02161@candle.pha.pa.us> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="gE7i1rD7pdK0Ng3j" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.3.28i X-PGP-Key-ID: Length=1024; ID=0x0DC67BE6 X-PGP-Key-Fingerprint: 295F A899 A81A 156D B522 48A7 6394 F08A 0DC6 7BE6 X-PGP-Key-URL: X-Virus-Scanned: by amavisd-new at hub.org X-Spam-Status: No, score=0.111 required=5 tests=[AWL=0.111] X-Spam-Score: 0.111 X-Spam-Level: X-Archive-Number: 200512/483 X-Sequence-Number: 77331 --gE7i1rD7pdK0Ng3j Content-Type: multipart/mixed; boundary="ABTtc+pdwF7KHXCz" Content-Disposition: inline --ABTtc+pdwF7KHXCz Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Dec 08, 2005 at 05:54:35PM -0500, Gregory Maxwell wrote: > No, what is needed for people who care about fixing their data is a > loadable strip_invalid_utf8() that works in older versions.. then just > select * from bar where foo !=3D strip_invalid_utf8(foo); The function > would be useful in general, for example, if you have an application > which doesn't already have much utf8 logic, you want to use a text > field, and stripping is the behaviour you want. For example, lots of > simple web applications. Would something like the following work? It's written in pl/pgsql and does (AFAICS) the same checking as the backend in recent releases. Except the backend only supports up to 4-byte UTF-8 whereas this function checks upto six byte. For a six byte UTF-8 character, who is wrong? In any case, people should be able to do something like: SELECT field FROM table WHERE NOT utf8_verify(field,4); To check conformance with PostgreSQL 8.1. Note, I don't have large chunks of UTF-8 to test with but it works for the characters I tried with. Tested with 7.4. Have a nice day, --=20 Martijn van Oosterhout http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. --ABTtc+pdwF7KHXCz Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="utf8_verify.sql" create or replace function utf8_verify(bytea,integer) returns bool as ' DECLARE str ALIAS FOR $1; maxlen ALIAS FOR $2; strlen INTEGER; i integer; j INTEGER; len integer; chr integer; wchr integer; BEGIN i := 0; strlen := length(str); WHILE i < strlen LOOP -- Check leading byte chr := get_byte(str,i); IF chr < 128 THEN -- 0x00 - 0x80 - single byte len := 1; wchr := chr; ELSIF chr < 192 THEN -- 0x80 - 0xC0 - illegal RETURN false; ELSIF chr < 224 THEN -- 0xC0 - 0xE0 - two bytes len := 2; wchr := chr - 192; ELSIF chr < 240 THEN -- 0xE0 - 0xF0 - three bytes len := 3; wchr := chr - 224; ELSIF chr < 248 THEN -- 0xF0 - 0xF8 - four bytes len := 4; wchr := chr - 240; ELSIF chr < 252 THEN -- 0xF8 - 0xFC - five bytes len := 5; wchr := chr - 248; ELSIF chr < 254 THEN -- 0xFC - 0xFE - six bytes len := 6; wchr := chr - 252; ELSE RETURN false; -- FE and FF not currently defined END IF; IF i + len > strlen THEN RETURN false; END IF; IF len > maxlen THEN RETURN false; END IF; -- Check remaining characters j := 1; WHILE len > j LOOP chr := get_byte(str, i+j); IF chr < 128 OR chr >= 192 THEN RETURN false; END IF; wchr := (wchr << 6) + (chr - 192); j := j+1; END LOOP; -- Verify shortest possible string IF len = 1 AND wchr >= 128 THEN RETURN false; ELSIF len = 2 AND (wchr < 128 OR wchr >= 2048) THEN RETURN false; ELSIF len = 3 AND (wchr < 2048 OR wchr >= 65536) THEN RETURN false; ELSIF len = 4 AND (wchr < 65536 OR wchr >= 2097152) THEN RETURN false; ELSIF len = 5 AND (wchr < 2097152 OR wchr >= 67108864) THEN RETURN false; ELSIF len = 6 AND (wchr < 67108864 OR wchr >= 2147483648) THEN RETURN false; END IF; -- RAISE NOTICE ''Checked char offset %, OK (wchr=%,len=%)'', i, wchr, len; i := i+len; END LOOP; RETURN true; END; ' language plpgsql; --ABTtc+pdwF7KHXCz-- --gE7i1rD7pdK0Ng3j Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDma4cIB7bNG8LQkwRArOdAJ0dOOxMCY6hU/GcUyOyLnruNvHOqACgh0Ms K2RyErQji5qf4HgI1qpWfvE= =bA/A -----END PGP SIGNATURE----- --gE7i1rD7pdK0Ng3j--