X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org Received: from localhost (av.hub.org [200.46.204.144]) by postgresql.org (Postfix) with ESMTP id AF0069DCB8C for ; Sun, 4 Dec 2005 14:56:44 -0400 (AST) Received: from postgresql.org ([200.46.204.71]) by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024) with ESMTP id 62740-03 for ; Sun, 4 Dec 2005 14:56:41 -0400 (AST) X-Greylist: from auto-whitelisted by SQLgrey- Received: from svana.org (unknown [203.20.62.76]) by postgresql.org (Postfix) with ESMTP id D694F9DCAB3 for ; Sun, 4 Dec 2005 14:56:41 -0400 (AST) Received: from kleptog by svana.org with local (Exim 3.35 #1 (Debian)) id 1Eiz0r-00014x-00; Mon, 05 Dec 2005 05:55:05 +1100 Date: Sun, 4 Dec 2005 19:55:05 +0100 From: Martijn van Oosterhout To: Gregory Maxwell Cc: pgsql-hackers@postgresql.org Subject: Re: Upcoming PG re-releases Message-ID: <20051204185505.GB698@svana.org> Reply-To: Martijn van Oosterhout References: <1133625371.9297.3.camel@localhost.localdomain> <200512031554.jB3Fs8h10927@candle.pha.pa.us> <20051204162520.GD10317@inuus.com> <8284.1133714056@sss.pgh.pa.us> <20051204164054.GE10317@inuus.com> <8437.1133715165@sss.pgh.pa.us> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="Yylu36WmvOXNoKYn" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.3.28i X-PGP-Key-ID: Length=1024; ID=0x0DC67BE6 X-PGP-Key-Fingerprint: 295F A899 A81A 156D B522 48A7 6394 F08A 0DC6 7BE6 X-PGP-Key-URL: X-Virus-Scanned: by amavisd-new at hub.org X-Spam-Status: No, score=0 required=5 tests=[none] X-Spam-Score: 0 X-Spam-Level: X-Archive-Number: 200512/190 X-Sequence-Number: 77038 --Yylu36WmvOXNoKYn Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Dec 04, 2005 at 12:19:32PM -0500, Gregory Maxwell wrote: > > That's exactly what's bothering me about it. If we recommend that > > we had better put a large THIS WILL DESTROY YOUR DATA warning first. > > The problem is that the data is not "invalid" from the user's point > > of view --- more likely, it's in some non-UTF8 encoding --- and so > > just throwing away some of the characters is unlikely to make people > > happy. >=20 > Nor is it even guarenteed to make the data load: If the column is > unique constrained and the removal of the non-UTF characters makes two > rows have the same data where they didn't before... >=20 > The way to preserve the data is to switch the column to be a bytea. Additionally, it's hard to suggest anything better without specific knowledge of the characters that are incorrect and how they got there. The ideal solution would be a way for people to identify problem data *before* they dump so they have an opportunity to fix it. Something like a module they can load and say: select val from table where not utf8_validate(val); This would allow people to examine the data while the system is still running and fix it. Maybe we can code something up in plpgsql? Slow as molasses but you'll be able to run it anywhere. Have a nice day, --=20 Martijn van Oosterhout http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. --Yylu36WmvOXNoKYn Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDkzuIIB7bNG8LQkwRAomhAJ0de5U+3wiv97zXnAmtmsbymIa+lgCfSRjM MQfmu14HY53JrTIrUjKCybo= =urQW -----END PGP SIGNATURE----- --Yylu36WmvOXNoKYn--