Re: Logical replication, need to reclaim big disk space

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Achilleas Mantzios <[email protected]>
To: Moreno Andreo <[email protected]>
To: [email protected]
Subject: Re: Logical replication, need to reclaim big disk space
Date: Mon, 19 May 2025 21:49:53 +0300
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>

On 19/5/25 17:38, Moreno Andreo wrote:

>
>
> On 19/05/25 14:41, Achilleas Mantzios wrote:
>>
>>
>> On 5/19/25 09:14, Moreno Andreo wrote:
>>>
>>> On 16/05/25 21:33, Achilleas Mantzios wrote:
>>>> On 16/5/25 18:45, Moreno Andreo wrote:
>>>>
>>>>> Hi,
>>>>>     we are moving our old binary data approach, moving them from 
>>>>> bytea fields in a table to external storage (making database 
>>>>> smaller and related operations faster and smarter).
>>>>> In short, we have a job that runs in background and copies data 
>>>>> from the table to an external file and then sets the bytea field 
>>>>> to NULL.
>>>>> (UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
>>>>>
>>>>> This results, at the end of the operations, to a table that's less 
>>>>> than one tenth in size.
>>>>> We have a multi-tenant architecture (100s of schemas with 
>>>>> identical architecture, all inheriting from public) and we are 
>>>>> performing the task on one table per schema.
>>>>>
>>>> So? toasted data are kept on separate TOAST tables, unless those 
>>>> bytea cols are selected, you won't even touch them. I cannot 
>>>> understand what you are trying to achieve here.
>>>>
>>>> Years ago, when I made the mistake to go for a coffee and let my 
>>>> developers "improvise" , the result was a design similar to what 
>>>> you are trying to achieve. Years after, I am seriously considering 
>>>> moving those data back to PostgreSQL.
>>> The "related operations" I was talking about are backups and 
>>> database maintenance when needed, cluster/replica management, etc. 
>>> With a smaller database size they would be easier in timing and 
>>> effort, right?
>> Ok, but you'll lose replica functionality for those blobs, which 
>> means you don't care about them, correct me if I am wrong.
> I'm not saying I don't care about them, the opposite, they are 
> protected with Object Versioning and soft deletion, this should assure 
> a good protection against e.g. ransomware, if someone manages to get 
> in there (and if this happens, we'll have bigger troubles than this).
PostgreSQL has become very popular because of ppl who care about their data.
>>> We are mostly talking about costs, here. To give things their names, 
>>> I'm moving bytea contents (85% of total data) to files into Google 
>>> Cloud Storage buckets, that has a fraction of the cost of the disks 
>>> holding my database (on GCE, to be clear ).
>> May I ask the size of the bytea data (uncompressed) ?.
> single records vary from 150k to 80 MB, the grand total is more than 
> 8,5 TB in a circa 10 TB data footprint
>>> This data is not accessed frequently (just by the owner when he 
>>> needs to do it), so no need to keep it on expensive hardware.
>>> I've already read in these years that keeping many big bytea fields 
>>> in databases is not recommended, but might have misunderstood this.
>>
>> Ok, I assume those are unimportant data, but let me ask, what is the 
>> longevity or expected legitimacy of those ? I haven't worked with 
>> those just reading  :
>>
>> https://cloud.google.com/storage/pricing?_gl=1*1b25r8o*_up*MQ..&gclid=CjwKCAjwravBBhBjEiwAIr30VK...
>>
>> would you choose e.g. "*Anywhere Cache storage" ?
>> *
>>
> Absolutely not, this is *not* unimportant data, and we are using 
> Standard Storage, for 0,02$/GB/month + operations, that compared to a 
> 0.17$/GB/month of an SSD or even more for the Hyperdisks we are using, 
> is a good price drop.
How about hosting your data in your own storage and spend 0$/GB/month ?
>>
>> **
>>
>>> Another way would have been to move these tables to a different 
>>> tablespace, in cheaper storage, but it still would have been 3 times 
>>> the buckets cost.
>>>
>> can you actually mount those Cloud Storage Buckets under a supported 
>> FS in linux and just move them to tablespaces backed by this storage ?
>>
> Never tried, I mounted this via FUSE and had some simple operations in 
> the past, but not sure it can handle database operations in terms of 
> I/O bandwidth
>>
>>> Why are you considering to get data back to database tables?
>> Because now if we need to migrate from cloud to on-premise, or just 
>> upgrade or move the specific server which holds those data I will 
>> have an extra headache. Also this is a single point of failure, or 
>> best case a cause for fragmented technology introduced just for the 
>> sake of keeping things out of the DB.
> This is managed as an hierarchical disk structure, so the calling 
> server may be literally everywhere, it just needs an account (or a 
> service account) to get in there ,
and you are locked in a proprietary solution. and at their mercy of any 
future increases in cost.
>>>>
>>>>
>>>>> The problem is: this is generating BIG table bloat, as you may 
>>>>> imagine.
>>>>> Running a VACUUM FULL on an ex-22GB table on a standalone test 
>>>>> server is almost immediate.
>>>>> If I had only one server, I'll process a table a time, with a 
>>>>> nightly script, and issue a VACUUM FULL to tables that have 
>>>>> already been processed.
>>>>>
>>>>> But I'm in a logical replication architecture (we are using a 
>>>>> multimaster system called pgEdge, but I don't think it will make 
>>>>> big difference, since it's based on logical replication), and I'm 
>>>>> building a test cluster.
>>>>>
>>>> So you use PgEdge , but you wanna lose all the benefits of 
>>>> multi-master , since your binary data won't be replicated ...
>>> I don't think I need it to be replicated, since this data cannot be 
>>> "edited", so either it's there or it's been deleted. Buckets have 
>>> protections for data deletions or events like ransomware attacks and 
>>> such.
>>> Also multi-master was an absolute requirement one year ago because 
>>> of a project we were building, but it has been abandoned and now a 
>>> simple logical replication would be enough, but let's do one thing a 
>>> time.
>> Multi-master is cool, you can configure your pooler / clients to take 
>> advantage of this for full load balanced architecture, but if not a 
>> strict requirement , you can live without it, as so many of us, and 
>> employ other means of load balancing the reads.
> That's what we are doing, it's a really cool feature, but I 
> experienced (maybe because it uses old pglogical extension) that the 
> replication is a bit fragile, especially when dealing with those bytea 
> fields (when I ingest big loads, say 25-30 GB or more), it happened to 
> break replication, and recreating a replica from scratch with "normal 
> size" tables is not a big deal, since it can be achieved 
> automatically, because they normally fit in shared memory and can be 
> transferred by the replicator, but you can imagine what would be the 
> effort and the downtime necessary to create a base backup, transfer it 
> to the replica, build the DB and restart a 10-TB database (ATM we are 
> running with a 2-node cluster).
Break this in batches, use modern techniques for robust data loading, in 
smaller transactions, if you have to.
>>>>> I've been instructed to issue VACUUM FULL on both nodes, nightly, 
>>>>> but before proceeding I read on docs that VACUUM FULL can disrupt 
>>>>> logical replication, so I'm a bit concerned on how to proceed. 
>>>>> Rows are cleared one a time (one transaction, one row, to keep 
>>>>> errors to the record that issued them)
>> Mind if you shared the specific doc ?
> Obviously I can't find it from a quick search, I'll search deeper, I 
> don't think it went off a dream :-).
>>>>>
>>>> PgEdge is based on the old pg_logical, the old 2ndQuadrant 
>>>> extension, not the native logical replication we have since pgsql 
>>>> 10. But I might be mistaken.
>>> Don't know about this, it keeps running on latest pg versions (we 
>>> are about to upgrade to 17.4, if I'm not wrong), but I'll ask
>>>>> I read about extensions like pg_squeeze, but I wonder if they are 
>>>>> still not dangerous for replication.
>>>>>
>>>> What's pgEdge take on that, I mean the bytea thing you are trying 
>>>> to achieve here.
>>> They are positive, it's they that suggested to do VACUUM FULL on 
>>> both nodes... I'm quite new to replication, so I'm searching some 
>>> advise here.
>>
>> As I told you, pgEdge logical replication (old 2ndquadrant BDR) != 
>> native logical replication. You may look here :
>>
>> https://github.com/pgEdge/spock
>>
>> If multi-master is not a must you could convert  to vanilla 
>> postgresql and focus on standard physical and logical replication.
>>
> No, multimaster is cool, but as I said, the project has been 
> discontinued and it's not a must anymore. This is the first step, 
> actually. We are planning to return to plain PostgreSQL, or CloudSQL 
> for PostgreSQL, using logical replication (that seems the most 
> reliable of the two). We created a test case for both the options, and 
> they seem to be OK for now, even if I have still to do adequate stress 
> tests. And when I'll do the migration, I'd like to be migrating plain 
> data only and leave blobs where they are.

as you wish. But this design has inherent data infra fragmentation as 
you understand.

Personally I like to let the DB take care of the data, and I take care 
of the DB, not a plethora of extra systems that we need to keep 
connected and consistent.

>>>>> Thanks for your help.
>>>>> Moreno.-
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>

view thread (7+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Logical replication, need to reclaim big disk space
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox