public inbox for [email protected]
help / color / mirror / Atom feedFrom: Martijn van Oosterhout <[email protected]>
To: Gregory Maxwell <[email protected]>
Cc: [email protected]
Subject: Re: Upcoming PG re-releases
Date: Sun, 4 Dec 2005 19:55:05 +0100
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
On Sun, Dec 04, 2005 at 12:19:32PM -0500, Gregory Maxwell wrote:
> > That's exactly what's bothering me about it. If we recommend that
> > we had better put a large THIS WILL DESTROY YOUR DATA warning first.
> > The problem is that the data is not "invalid" from the user's point
> > of view --- more likely, it's in some non-UTF8 encoding --- and so
> > just throwing away some of the characters is unlikely to make people
> > happy.
>
> Nor is it even guarenteed to make the data load: If the column is
> unique constrained and the removal of the non-UTF characters makes two
> rows have the same data where they didn't before...
>
> The way to preserve the data is to switch the column to be a bytea.
Additionally, it's hard to suggest anything better without specific
knowledge of the characters that are incorrect and how they got there.
The ideal solution would be a way for people to identify problem data
*before* they dump so they have an opportunity to fix it. Something
like a module they can load and say:
select val from table where not utf8_validate(val);
This would allow people to examine the data while the system is still
running and fix it. Maybe we can code something up in plpgsql? Slow as
molasses but you'll be able to run it anywhere.
Have a nice day,
--
Martijn van Oosterhout <[email protected]> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
view thread (55+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected]
Subject: Re: Upcoming PG re-releases
In-Reply-To: <[email protected]>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox