From: Bruce Momjian
Subject: Re: Upcoming PG re-releases
To: Bruce Momjian
CC: Tom Lane, Paul Lindner, Neil Conway, pgsql-hackers@postgresql.org
Date: Tue, 6 Dec 2005 15:25:13 -0500 (EST)
Message-Id: <200512062025.jB6KPDK02212@candle.pha.pa.us>
In-Reply-To: <200512061932.jB6JWFn24607@candle.pha.pa.us>

Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian writes:
> > > I have added your suggestions to the 8.1.X release notes.
> >
> > Did you read the followup discussion? Recommending -c without a large
> > warning seems a very bad idea.
>
> Well, I said it would remove invalid sequences. What else should we
> say?
>
>     This will remove invalid character sequences.
>
> I saw no clear solution that allowed sequences to be corrected. The release note text is:

    Some users are having problems loading UTF8 data into 8.1.X.
    This is because previous versions allowed invalid UTF8 sequences
    to be entered into the database, and this release properly
    accepts only valid UTF8 sequences. One way to correct a
    dumpfile is to use iconv -c -f UTF-8 -t UTF-8. This will remove
    invalid character sequences. iconv reads the entire input file
    into memory, so it might be necessary to split the dump into
    multiple smaller files for processing.

One nice solution would be for iconv to report the lines with errors
so you could correct them, but I see no way to do that. The only thing
you could do is to diff the old and new files to see the problems. Is
that helpful?

Here is new text I have used:

    Some users are having problems loading UTF8 data into 8.1.X.
    This is because previous versions allowed invalid UTF8 sequences
    to be entered into the database, and this release properly
    accepts only valid UTF8 sequences. One way to correct a
    dumpfile is to use iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql
    dumpfile.sql. The -c option removes invalid character
    sequences. A diff of the two files will show the sequences that
    are invalid. iconv reads the entire input file into memory, so
    it might be necessary to split the dump into multiple smaller
    files for processing.

It highlights the 'diff' idea.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
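
For dumps where the bad lines need to be located before anything is dropped, a line-by-line check can be sketched in Python. This is only an illustration under stated assumptions, not part of the release notes: the helper names find_invalid_utf8_lines and clean_utf8 are hypothetical, and Python's strict UTF-8 decoder is assumed to be an acceptable stand-in for the server's validation, which may differ in edge cases. The file is read in binary mode one line at a time, so the whole dump does not have to fit in memory.

```python
def find_invalid_utf8_lines(path):
    """Yield (line_number, byte_offset, raw_line) for each line of the
    file that is not valid UTF-8. Hypothetical helper for illustration."""
    with open(path, "rb") as f:
        # Splitting on b"\n" is safe: the byte 0x0A never occurs inside
        # a valid multi-byte UTF-8 sequence.
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first offending byte
                # within this line.
                yield lineno, exc.start, raw


def clean_utf8(src_path, dst_path):
    """Copy src_path to dst_path, dropping bytes that do not form valid
    UTF-8. Similar in spirit to iconv -c, but not guaranteed to produce
    byte-identical output."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for raw in src:
            text = raw.decode("utf-8", errors="ignore")
            dst.write(text.encode("utf-8"))
```

Calling find_invalid_utf8_lines("dumpfile.sql") gives the line numbers to inspect or repair by hand; clean_utf8("dumpfile.sql", "cleanfile.sql") writes a cleaned copy that can then be diffed against the original, as suggested above.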