Date: Mon, 5 Dec 2005 15:43:59 +1100 (EST)
From: Gavin Sherry
To: Tom Lane
Cc: Paul Lindner, Bruce Momjian, Neil Conway, pgsql-hackers@postgresql.org
Subject: Re: Upcoming PG re-releases

Hi all,

On Sun, 4 Dec 2005, Tom Lane wrote:

> Paul Lindner writes:
> > To convert your pre-8.1 database to 8.1 you may have to remove and/or
> > fix the offending characters. One simple way to fix the problem is to
> > run your pg_dump output through the iconv command like this:
>
> >   iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
>
> Is that really a one-size-fits-all solution? Especially with -c?
>

It's definitely not one-size-fits-all. The reassuring thing is that
others have tried to deal with this problem before.

Omar Kilani and I have spent a few hours looking at the problem. For
situations where there is a lot of invalid encoding, manual fixing is
just not viable. The vim project has a kind of fuzzy encoding conversion
which accounts for a lot of the non-UTF8 sequences in UTF8 data. You can
use vim to modify your text dump as follows:

    vim -c ":wq! ++enc=utf8 fixed.dump" original.dump

Now, our testing of this is far from exhaustive, but it's a lot better
than just cutting the data from the original dump. Anyone suffering from
the problem should definitely check this out, particularly if they have a
non-trivial amount of data.

Thanks,

Gavin
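
For anyone weighing "iconv -c" (which silently drops bad bytes) against a
conversion that tries to keep the data, it may help to see how many invalid
sequences a dump actually contains. The following is a minimal sketch, not
vim's algorithm and not anything from the PostgreSQL tree. The function names
are hypothetical, and the Latin-1 fallback is only an assumption about where
stray bytes usually come from; it will guess wrong for data that originated
in some other encoding.

```python
import codecs


def find_invalid_utf8(data: bytes):
    """Return a list of (offset, bad_bytes) for each invalid UTF-8 sequence."""
    problems = []

    def record(exc):
        # Only decoding errors are expected here; re-raise anything else.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        problems.append((exc.start, exc.object[exc.start:exc.end]))
        # Substitute a placeholder and resume decoding after the bad bytes.
        return ("\ufffd", exc.end)

    codecs.register_error("record_invalid_utf8", record)
    data.decode("utf-8", errors="record_invalid_utf8")
    return problems


def repair_with_latin1_fallback(data: bytes) -> str:
    """Decode as UTF-8, reading any invalid bytes as Latin-1 instead.

    ASSUMPTION: stray bytes are Latin-1 text. This keeps the data rather
    than cutting it, but the result should still be reviewed by hand.
    """

    def fallback(exc):
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        return (bad.decode("latin-1"), exc.end)

    codecs.register_error("latin1_fallback", fallback)
    return data.decode("utf-8", errors="latin1_fallback")
```

Reading the dump with open("original.dump", "rb").read() and passing the
bytes to find_invalid_utf8 gives the offsets to inspect; an empty list means
the dump is already valid UTF-8 and needs no conversion at all.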