X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org Received: from localhost (av.hub.org [200.46.204.144]) by postgresql.org (Postfix) with ESMTP id 939AA9DD784 for ; Fri, 9 Dec 2005 12:34:49 -0400 (AST) Received: from postgresql.org ([200.46.204.71]) by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024) with ESMTP id 23605-01 for ; Fri, 9 Dec 2005 12:34:48 -0400 (AST) X-Greylist: from auto-whitelisted by SQLgrey- Received: from candle.pha.pa.us (candle.pha.pa.us [64.139.89.126]) by postgresql.org (Postfix) with ESMTP id E6B849DD732 for ; Fri, 9 Dec 2005 12:34:46 -0400 (AST) Received: (from pgman@localhost) by candle.pha.pa.us (8.11.6/8.11.6) id jB9GYMb04814; Fri, 9 Dec 2005 11:34:22 -0500 (EST) From: Bruce Momjian Message-Id: <200512091634.jB9GYMb04814@candle.pha.pa.us> Subject: Re: Upcoming PG re-releases In-Reply-To: <20051209161733.GC20352@svana.org> To: Martijn van Oosterhout Date: Fri, 9 Dec 2005 11:34:22 -0500 (EST) CC: Gregory Maxwell , Gavin Sherry , Peter Eisentraut , pgsql-hackers@postgresql.org X-Mailer: ELM [version 2.4ME+ PL121 (25)] MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII X-Virus-Scanned: by amavisd-new at hub.org X-Spam-Status: No, score=0.008 required=5 tests=[AWL=0.008] X-Spam-Score: 0.008 X-Spam-Level: X-Archive-Number: 200512/494 X-Sequence-Number: 77342 Martijn van Oosterhout wrote: -- Start of PGP signed section. > On Thu, Dec 08, 2005 at 05:54:35PM -0500, Gregory Maxwell wrote: > > No, what is needed for people who care about fixing their data is a > > loadable strip_invalid_utf8() that works in older versions.. then just > > select * from bar where foo != strip_invalid_utf8(foo); The function > > would be useful in general, for example, if you have an application > > which doesn't already have much utf8 logic, you want to use a text > > field, and stripping is the behaviour you want. For example, lots of > > simple web applications. > > Would something like the following work? It's written in pl/pgsql and > does (AFAICS) the same checking as the backend in recent releases. > Except the backend only supports up to 4-byte UTF-8 whereas this > function checks upto six byte. For a six byte UTF-8 character, who is > wrong? > > In any case, people should be able to do something like: > > SELECT field FROM table WHERE NOT utf8_verify(field,4); > > To check conformance with PostgreSQL 8.1. Note, I don't have large > chunks of UTF-8 to test with but it works for the characters I tried > with. Tested with 7.4. I think the problem with any kind of function-call detection is that the data has to get into the database first, and it isn't clear how someone loading a failed dump would do that aside from modifying the column to bytea in the dump, loading it in, then fixing it. The iconv idea has the advantage that it can be fixed before loading into the database. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073