Re: Logical replication, need to reclaim big disk space

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Moreno Andreo <[email protected]>
To: [email protected]
Subject: Re: Logical replication, need to reclaim big disk space
Date: Tue, 20 May 2025 13:12:42 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>



On 20/05/25 12:58, Achilleas Mantzios wrote:
>
> Στις 20/5/25 12:17, ο/η Moreno Andreo έγραψε:
>
>>
>>
>> On 19/05/25 20:49, Achilleas Mantzios wrote:
>>>
>>> On 19/5/25 17:38, Moreno Andreo wrote:
>>>
>>>>
>>>>
>>>> On 19/05/25 14:41, Achilleas Mantzios wrote:
>>>>>
>>>>>
>>>>> On 5/19/25 09:14, Moreno Andreo wrote:
>>>>>>
>>>>>> On 16/05/25 21:33, Achilleas Mantzios wrote:
>>>>>>> On 16/5/25 18:45, Moreno Andreo wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>     we are moving our old binary data approach, moving them 
>>>>>>>> from bytea fields in a table to external storage (making 
>>>>>>>> database smaller and related operations faster and smarter).
>>>>>>>> In short, we have a job that runs in background and copies data 
>>>>>>>> from the table to an external file and then sets the bytea 
>>>>>>>> field to NULL.
>>>>>>>> (UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = 
>>>>>>>> <uuid>)
>>>>>>>>
>>>>>>>> This results, at the end of the operations, to a table that's 
>>>>>>>> less than one tenth in size.
>>>>>>>> We have a multi-tenant architecture (100s of schemas with 
>>>>>>>> identical architecture, all inheriting from public) and we are 
>>>>>>>> performing the task on one table per schema.
>>>>>>>>
>>>>>>> So? toasted data are kept on separate TOAST tables, unless those 
>>>>>>> bytea cols are selected, you won't even touch them. I cannot 
>>>>>>> understand what you are trying to achieve here.
>>>>>>>
>>>>>>> Years ago, when I made the mistake to go for a coffee and let my 
>>>>>>> developers "improvise" , the result was a design similar to what 
>>>>>>> you are trying to achieve. Years after, I am seriously 
>>>>>>> considering moving those data back to PostgreSQL.
>>>>>> The "related operations" I was talking about are backups and 
>>>>>> database maintenance when needed, cluster/replica management, 
>>>>>> etc. With a smaller database size they would be easier in timing 
>>>>>> and effort, right?
>>>>> Ok, but you'll lose replica functionality for those blobs, which 
>>>>> means you don't care about them, correct me if I am wrong.
>>>> I'm not saying I don't care about them, the opposite, they are 
>>>> protected with Object Versioning and soft deletion, this should 
>>>> assure a good protection against e.g. ransomware, if someone 
>>>> manages to get in there (and if this happens, we'll have bigger 
>>>> troubles than this).
>>> PostgreSQL has become very popular because of ppl who care about 
>>> their data.
>> Yeah, it's always been famous for its robustness, and that's why I 
>> chose PostgreSQL more than 10 years ago, and, in spite of how a 
>> "normal" user treats his PC, we never had corruption (only where 
>> FS/disk were failing, but that's not PG fault)
>>>>>> We are mostly talking about costs, here. To give things their 
>>>>>> names, I'm moving bytea contents (85% of total data) to files 
>>>>>> into Google Cloud Storage buckets, that has a fraction of the 
>>>>>> cost of the disks holding my database (on GCE, to be clear ).
>>>>> May I ask the size of the bytea data (uncompressed) ?.
>>>> single records vary from 150k to 80 MB, the grand total is more 
>>>> than 8,5 TB in a circa 10 TB data footprint
>>>>>> This data is not accessed frequently (just by the owner when he 
>>>>>> needs to do it), so no need to keep it on expensive hardware.
>>>>>> I've already read in these years that keeping many big bytea 
>>>>>> fields in databases is not recommended, but might have 
>>>>>> misunderstood this.
>>>>>
>>>>> Ok, I assume those are unimportant data, but let me ask, what is 
>>>>> the longevity or expected legitimacy of those ? I haven't worked 
>>>>> with those just reading  :
>>>>>
>>>>> https://cloud.google.com/storage/pricing?_gl=1*1b25r8o*_up*MQ..&gclid=CjwKCAjwravBBhBjEiwAIr30VK...
>>>>>
>>>>> would you choose e.g. "*Anywhere Cache storage" ?
>>>>> *
>>>>>
>>>> Absolutely not, this is *not* unimportant data, and we are using 
>>>> Standard Storage, for 0,02$/GB/month + operations, that compared to 
>>>> a 0.17$/GB/month of an SSD or even more for the Hyperdisks we are 
>>>> using, is a good price drop.
>>> How about hosting your data in your own storage and spend 0$/GB/month ?
>> If we could host on our own hardware I'd not be here talking. Maybe 
>> we would have a 10-node full-mesh multimaster architecture with 
>> barman backup on 2 separate SANs.
>> But we are a small company that has to balance performance, 
>> consistency, security and, last but not latter, costs. And margins 
>> are tightening.
>>>
>>>>> **
>>>>>
>>>>>> Another way would have been to move these tables to a different 
>>>>>> tablespace, in cheaper storage, but it still would have been 3 
>>>>>> times the buckets cost.
>>>>>>
>>>>> can you actually mount those Cloud Storage Buckets under a 
>>>>> supported FS in linux and just move them to tablespaces backed by 
>>>>> this storage ?
>>>>>
>>>> Never tried, I mounted this via FUSE and had some simple operations 
>>>> in the past, but not sure it can handle database operations in 
>>>> terms of I/O bandwidth
>>>>>
>>>>>> Why are you considering to get data back to database tables?
>>>>> Because now if we need to migrate from cloud to on-premise, or 
>>>>> just upgrade or move the specific server which holds those data I 
>>>>> will have an extra headache. Also this is a single point of 
>>>>> failure, or best case a cause for fragmented technology introduced 
>>>>> just for the sake of keeping things out of the DB.
>>>> This is managed as an hierarchical disk structure, so the calling 
>>>> server may be literally everywhere, it just needs an account (or a 
>>>> service account) to get in there ,
>>> and you are locked in a proprietary solution. and at their mercy of 
>>> any future increases in cost.
>> Since we cannot host on our hardware, the only thing is to keep an 
>> eye on costs and migrate (yeah, more work) when it's becoming 
>> expensive. Every solution is proprietary, if you want to run on 
>> cloud. Even the VMs where PostgreSQL is running.
>>>>>>>
>>>>>>>
>>>>>>>> The problem is: this is generating BIG table bloat, as you may 
>>>>>>>> imagine.
>>>>>>>> Running a VACUUM FULL on an ex-22GB table on a standalone test 
>>>>>>>> server is almost immediate.
>>>>>>>> If I had only one server, I'll process a table a time, with a 
>>>>>>>> nightly script, and issue a VACUUM FULL to tables that have 
>>>>>>>> already been processed.
>>>>>>>>
>>>>>>>> But I'm in a logical replication architecture (we are using a 
>>>>>>>> multimaster system called pgEdge, but I don't think it will 
>>>>>>>> make big difference, since it's based on logical replication), 
>>>>>>>> and I'm building a test cluster.
>>>>>>>>
>>>>>>> So you use PgEdge , but you wanna lose all the benefits of 
>>>>>>> multi-master , since your binary data won't be replicated ...
>>>>>> I don't think I need it to be replicated, since this data cannot 
>>>>>> be "edited", so either it's there or it's been deleted. Buckets 
>>>>>> have protections for data deletions or events like ransomware 
>>>>>> attacks and such.
>>>>>> Also multi-master was an absolute requirement one year ago 
>>>>>> because of a project we were building, but it has been abandoned 
>>>>>> and now a simple logical replication would be enough, but let's 
>>>>>> do one thing a time.
>>>>> Multi-master is cool, you can configure your pooler / clients to 
>>>>> take advantage of this for full load balanced architecture, but if 
>>>>> not a strict requirement , you can live without it, as so many of 
>>>>> us, and employ other means of load balancing the reads.
>>>> That's what we are doing, it's a really cool feature, but I 
>>>> experienced (maybe because it uses old pglogical extension) that 
>>>> the replication is a bit fragile, especially when dealing with 
>>>> those bytea fields (when I ingest big loads, say 25-30 GB or more), 
>>>> it happened to break replication, and recreating a replica from 
>>>> scratch with "normal size" tables is not a big deal, since it can 
>>>> be achieved automatically, because they normally fit in shared 
>>>> memory and can be transferred by the replicator, but you can 
>>>> imagine what would be the effort and the downtime necessary to 
>>>> create a base backup, transfer it to the replica, build the DB and 
>>>> restart a 10-TB database (ATM we are running with a 2-node cluster).
>>> Break this in batches, use modern techniques for robust data 
>>> loading, in smaller transactions, if you have to.
>> Normally it's run via COPY commands, I can throttle COPY or break it 
>> in batches. At the moment, while the schema is offline, we disconnect 
>> replication from the bytea tables, feed them, wait for checkpoints to 
>> return normal and then resume replication between tables before 
>> putting schema online. This is safe, even if far from being 
>> optimized. It's a migration tool, it won't be used forever, just to 
>> move customers from their current architecture to new cloud one.
>>>>>>>> I've been instructed to issue VACUUM FULL on both nodes, 
>>>>>>>> nightly, but before proceeding I read on docs that VACUUM FULL 
>>>>>>>> can disrupt logical replication, so I'm a bit concerned on how 
>>>>>>>> to proceed. Rows are cleared one a time (one transaction, one 
>>>>>>>> row, to keep errors to the record that issued them)
>>>>> Mind if you shared the specific doc ?
>>>> Obviously I can't find it from a quick search, I'll search deeper, 
>>>> I don't think it went off a dream :-).
>>>>>>>>
>>>>>>> PgEdge is based on the old pg_logical, the old 2ndQuadrant 
>>>>>>> extension, not the native logical replication we have since 
>>>>>>> pgsql 10. But I might be mistaken.
>>>>>> Don't know about this, it keeps running on latest pg versions (we 
>>>>>> are about to upgrade to 17.4, if I'm not wrong), but I'll ask
>>>>>>>> I read about extensions like pg_squeeze, but I wonder if they 
>>>>>>>> are still not dangerous for replication.
>>>>>>>>
>>>>>>> What's pgEdge take on that, I mean the bytea thing you are 
>>>>>>> trying to achieve here.
>>>>>> They are positive, it's they that suggested to do VACUUM FULL on 
>>>>>> both nodes... I'm quite new to replication, so I'm searching some 
>>>>>> advise here.
>>>>>
>>>>> As I told you, pgEdge logical replication (old 2ndquadrant BDR) != 
>>>>> native logical replication. You may look here :
>>>>>
>>>>> https://github.com/pgEdge/spock
>>>>>
>>>>> If multi-master is not a must you could convert  to vanilla 
>>>>> postgresql and focus on standard physical and logical replication.
>>>>>
>>>> No, multimaster is cool, but as I said, the project has been 
>>>> discontinued and it's not a must anymore. This is the first step, 
>>>> actually. We are planning to return to plain PostgreSQL, or 
>>>> CloudSQL for PostgreSQL, using logical replication (that seems the 
>>>> most reliable of the two). We created a test case for both the 
>>>> options, and they seem to be OK for now, even if I have still to do 
>>>> adequate stress tests. And when I'll do the migration, I'd like to 
>>>> be migrating plain data only and leave blobs where they are.
>>>
>>> as you wish. But this design has inherent data infra fragmentation 
>>> as you understand.
>>>
>>> Personally I like to let the DB take care of the data, and I take 
>>> care of the DB, not a plethora of extra systems that we need to keep 
>>> connected and consistent.
>>>
>> We followed this idea when the application (old version) was on 
>> customer premises, so backups and operations were simple and getting 
>> in trouble (e.g. customer deleting a directory from their PC) has 
>> happened a very few times, just when they launched disk cleanup on 
>> windows :-)
>>
>> Now we host a full cloud solution, so we got rid of many potential 
>> problems generated by the end user, but bumped into other, as you 
>> certainly imagine. We have to keep it consistent, fast, reliable, 
>> keeping an eye on costs.
>> You are right, but the more I was working with this solution, the 
>> more I'm having the impression of dealing with something heavy, hard 
>> to mantain because of these rarely-accessed files that sum up most of 
>> my data. Maybe it's just my impression, maybe I need some expertise 
>> in an area that's still quite new for me.
>> At the moment that seems a good compromise between stability and 
>> costs. Maybe in one year I'll be in your position (considering 
>> getting everything back), but for now we are thinking forward in that 
>> way.
>>
> Makes perfect sense.
>> This been said, the original question :-)
>> Would be VACUUM FULL a risky operation? Has it to be done on all 
>> nodes, obviously in a low-traffic and low-access timing (night)?
> VACUUM affects the physical blocks. In a physical streaming 
> replication scenario that might (or not) potentiallyt affect read-only 
> queries on the hot standby (depending on usage and settings). Normally 
> I cannot see how a VACUUM (plain or FULL) would interact with logical 
> replication in any way. But again, since you run PgEdge specific, you 
> have to ask them.
Thanks. This makes me think I misread or misinterpreted something. They 
already suggested me that to use VACUUM FULL on both nodes, but that 
"thing" I read (or I'm convinced to have) made me think twice before 
crashing everything. Two experts' according words is quite enough for me.
I will start this evening and see what happens.

Thanks for the help and the very interesting discussion.
>>>
>>>>>>>> Thanks for your help.
>>>>>>>> Moreno.-
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>

view thread (7+ messages)

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Logical replication, need to reclaim big disk space
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox