Date: Mon, 5 Dec 2005 15:43:59 +1100 (EST)
From: Gavin Sherry <swm@linuxworld.com.au>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Paul Lindner <lindner@inuus.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
	Neil Conway <neilc@samurai.com>, pgsql-hackers@postgresql.org
Subject: Re: Upcoming PG re-releases 
In-Reply-To: <8284.1133714056@sss.pgh.pa.us>
Message-ID: <Pine.LNX.4.58.0512051539390.7093@linuxworld.com.au>
References: <1133625371.9297.3.camel@localhost.localdomain>
	<200512031554.jB3Fs8h10927@candle.pha.pa.us>
	<20051204162520.GD10317@inuus.com> <8284.1133714056@sss.pgh.pa.us>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

Hi all,

On Sun, 4 Dec 2005, Tom Lane wrote:

> Paul Lindner <lindner@inuus.com> writes:
> > To convert your pre-8.1 database to 8.1 you may have to remove and/or
> > fix the offending characters.  One simple way to fix the problem is to
> > run your pg_dump output through the iconv command like this:
>
> >   iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
>
> Is that really a one-size-fits-all solution?  Especially with -c?
>

It's definitely not a one-size-fits-all solution. The reassuring thing is
that others have tried to deal with this problem before.

Omar Kilani and I have spent a few hours looking at the problem. For
situations where there is a lot of invalid encoding, manual fixing is just
not viable. The vim project has a kind of fuzzy encoding conversion which
accounts for a lot of the non-UTF8 sequences in UTF8 data. You can use vim
to modify your text dump as follows:

vim -c ":wq! ++enc=utf8 fixed.dump" original.dump

Now, our testing of this is far from exhaustive, but it's a lot better than
just cutting the invalid data out of the original dump, as iconv -c does.
Those suffering from the problem should definitely check this out,
particularly if they have a non-trivial amount of data.
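
To illustrate what a "fuzzy" conversion can look like, here is a minimal
Python sketch. It is not vim's actual algorithm. It assumes the stray
bytes are Latin-1, which may not hold for your data, and the names
repair_utf8 and latin1_fallback are made up for this example. Valid UTF8
sequences pass through unchanged; bytes that do not form valid UTF8 are
reinterpreted as Latin-1 instead of being dropped.

```python
import codecs


def _latin1_fallback(err):
    """Error handler: reinterpret undecodable bytes as Latin-1 (assumption)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    # err.object is the full input; start/end delimit the invalid bytes.
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume.
    return bad.decode("latin-1"), err.end


# Hypothetical handler name, registered for use with bytes.decode().
codecs.register_error("latin1_fallback", _latin1_fallback)


def repair_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as valid UTF8, keeping the invalid bytes as Latin-1."""
    return raw.decode("utf-8", errors="latin1_fallback").encode("utf-8")
```

Applied to a dump, this could be used as
repair_utf8(open("original.dump", "rb").read()), with the result written
to a new file opened in binary mode. As with the vim approach, the output
should be checked by eye, since a guess about the source encoding can be
wrong.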

Thanks,

Gavin