Re: update faster way

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: update faster way
2+ messages / 2 participants
[nested] [flat]

* Re: update faster way
@ 2024-09-14 10:40 yudhi s <[email protected]>
  2024-09-14 17:23 ` Re: update faster way Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 2+ messages in thread

From: yudhi s @ 2024-09-14 10:40 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: pgsql-general <[email protected]>

On Sat, 14 Sept, 2024, 1:09 pm Laurenz Albe, <[email protected]>
wrote:

> On Sat, 2024-09-14 at 08:43 +0530, yudhi s wrote:
> > We have to update a column value(from numbers like '123' to codes like
> 'abc'
> > by looking into a reference table data) in a partitioned table with
> billions
> > of rows in it, with each partition having 100's millions rows. As we
> tested
>
> > for ~30million rows it's taking ~20minutes to update. So if we go by this
> > calculation, it's going to take days for updating all the values. So my
> > question is
> >
> > 1) If there is any inbuilt way of running the update query in parallel
> >    (e.g. using parallel hints etc) to make it run faster?
> > 2) should we run each individual partition in a separate session (e.g.
> five
> >    partitions will have the updates done at same time from 5 different
> >    sessions)? And will it have any locking effect or we can just start
> the
> >    sessions and let them run without impacting our live transactions?
>
> Option 1 doesn't exist.
> Option 2 is possible, and you can even have more than one session workingr
> on a single partition.
>
> However, the strain on your system's resources and particularly the row
> locks will impair normal database work.
>
> Essentially, you can either take an extended down time or perform the
> updates
> in very small chunks with a very low "lock_timeout" over a very long period
> of time.  If any of the batches fails because of locking conflicts, it has
> to be retried.
>
> Investigate with EXPLAIN (ANALYZE) why the updates take that long.  It
> could
> be a lame disk, tons of (unnecessary?) indexes or triggers, but it might as
> well be the join with the lookup table, so perhaps there is room for
> improvement (more "work_mem" for a hash join?)
>

Thank you so much Laurenz.

We have mostly insert/update happen on current day/live partition. So
considering that, if we will run batch updates(with batch size of 1000)
from five different sessions in parallel on different historical partition,
at any time they will lock 5000 rows and then commit. And also those rows
will not collide with each other. So do you think that approach can anyway
cause locking issues? We will ensure the update of live partition occurs
when we have least activity. So in that way we will not need extended down
time. Please correct me if wrong.

Never used lock_timeout though, but in above case do we need lock_timeout?

Regarding batch update with batch size of 1000, do we have any method
exists in postgres (say like forall statement in Oracle) which will do the
batch dml. Can you please guide me here, how we can do it in postgres.

And yes will need to see what happens in the update using explain analyze.
And I was trying to see, if we can run explain analyze without doing actual
update , but seems that is not possible.

>

^ permalink  raw  reply  [nested|flat] 2+ messages in thread

* Re: update faster way
  2024-09-14 10:40 Re: update faster way yudhi s <[email protected]>
@ 2024-09-14 17:23 ` Laurenz Albe <[email protected]>
  0 siblings, 0 replies; 2+ messages in thread

From: Laurenz Albe @ 2024-09-14 17:23 UTC (permalink / raw)
  To: yudhi s <[email protected]>; +Cc: pgsql-general <[email protected]>

On Sat, 2024-09-14 at 16:10 +0530, yudhi s wrote:
> > However, the strain on your system's resources and particularly the row
> > locks will impair normal database work.
> > 
> > Essentially, you can either take an extended down time or perform the updates
> > in very small chunks with a very low "lock_timeout" over a very long period
> > of time.  If any of the batches fails because of locking conflicts, it has
> > to be retried.
> > 
> > Investigate with EXPLAIN (ANALYZE) why the updates take that long.  It could
> > be a lame disk, tons of (unnecessary?) indexes or triggers, but it might as
> > well be the join with the lookup table, so perhaps there is room for
> > improvement (more "work_mem" for a hash join?)
> 
> We have mostly insert/update happen on current day/live partition. So
> considering that, if we will run batch updates(with batch size of 1000) from
> five different sessions in parallel on different historical partition, at any
> time they will lock 5000 rows and then commit. And also those rows will not
> collide with each other. So do you think that approach can anyway cause locking
> issues?

The updates won't lock with each other.  I thought that other database activity
might modify rows in these partitions.  If that is not the case, you don't need
to worry about locks.

In that case I would also choose a much higher batch size.

You should make sure to back off every now and then and VACUUM the partition,
so that you avoid excessive table bloat.

> We will ensure the update of live partition occurs when we have least activity.
> So in that way we will not need extended down time. Please correct me if wrong.

That sounds right.

> Never used lock_timeout though, but in above case do we need lock_timeout?

It can be useful if your updating process is blocked by a lock from the
application.  Setting the parameter to a low value will keep your update
from hanging for a long time and will throw an error instead.
Erroring out early reduces the danger of a deadlock.

> Regarding batch update with batch size of 1000, do we have any method exists
> in postgres (say like forall statement in Oracle) which will do the batch dml.
> Can you please guide me here, how we can do it in postgres.

I would certainly not perform the update row for row in PL/pgSQL code.
Perhaps something like this:

  DO
  $$DECLARE
     i bigint;
  BEGIN
     FOR i IN 1..1000000 by 100000 LOOP
        UPDATE tab SET ...
        WHERE id >= i AND id < i + 100000;

        COMMIT;
     END LOOP;
  END;$$;

  VACUUM tab;

Then repeat for the next million rows, and so on.

> And yes will need to see what happens in the update using explain analyze.
> And I was trying to see, if we can run explain analyze without doing
> actual update , but seems that is not possible. 

You can do it in a transaction and roll the transaction back.

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 2+ messages in thread

end of thread, other threads:[~2024-09-14 17:23 UTC | newest]

Thread overview: 2+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-09-14 10:40 Re: update faster way yudhi s <[email protected]>
2024-09-14 17:23 ` Laurenz Albe <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox