Re: update faster way

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: update faster way
3+ messages / 2 participants
[nested] [flat]

* Re: update faster way
@ 2024-09-14 11:24 Peter J. Holzer <[email protected]>
  2024-09-14 14:56 ` Re: update faster way yudhi s <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Peter J. Holzer @ 2024-09-14 11:24 UTC (permalink / raw)
  To: [email protected]

On 2024-09-14 16:10:15 +0530, yudhi s wrote:
> On Sat, 14 Sept, 2024, 1:09 pm Laurenz Albe, <[email protected]> wrote:
> 
>     On Sat, 2024-09-14 at 08:43 +0530, yudhi s wrote:
>     > We have to update a column value(from numbers like '123' to codes like
>     'abc'
>     > by looking into a reference table data) in a partitioned table with
>     billions
>     > of rows in it, with each partition having 100's millions rows. As we
>     tested
> 
>     > for ~30million rows it's taking ~20minutes to update.
[...]
>     > 2) should we run each individual partition in a separate session (e.g.
>     five
>     >    partitions will have the updates done at same time from 5 different
>     >    sessions)? And will it have any locking effect or we can just start
>     the
>     >    sessions and let them run without impacting our live transactions?
> 
>     Option 2 is possible, and you can even have more than one session workingr
>     on a single partition.
> 
>     However, the strain on your system's resources and particularly the row
>     locks will impair normal database work.
> 
>     Essentially, you can either take an extended down time or perform the
>     updates
>     in very small chunks with a very low "lock_timeout" over a very long period
>     of time.  If any of the batches fails because of locking conflicts, it has
>     to be retried.
> 
>     Investigate with EXPLAIN (ANALYZE) why the updates take that long.  It
>     could
>     be a lame disk, tons of (unnecessary?) indexes or triggers, but it might as
>     well be the join with the lookup table, so perhaps there is room for
>     improvement (more "work_mem" for a hash join?)
> 
> 
> 
> Thank you so much Laurenz.
> 
> We have mostly insert/update happen on current day/live partition. So
> considering that, if we will run batch updates(with batch size of 1000) from
> five different sessions in parallel on different historical partition, at any
> time they will lock 5000 rows and then commit.

If you are updating billions of rows in batches of 5000, that means you
are executing hundreds of thousands or millions of update statements.

Which in turn means that you want as little overhead as possible per
batch which means finding those 5000 rows should be quick. Which brings
us back to Igor's question: Do you have any indexes in place which speed
up finding those 5000 rows (the primary key almost certainly won't help
with that). EXPLAIN (ANALYZE) (as suggested by Laurenz) will certainly
help answering that question.

> And also those rows will not collide with each other. So do you think
> that approach can anyway cause locking issues?

No, I don't think so. With a batch size that small I wouldn't expect
problems even on the live partition. But of course many busy parallel
sessions will put additional load on the system which may or may not be
noticeable by users (you might saturate the disks writing WAL entries
for example, which would slow down other sessions trying to commit).


> Regarding batch update with batch size of 1000, do we have any method exists in
> postgres (say like forall statement in Oracle) which will do the batch dml. Can
> you please guide me here, how we can do it in postgres.

Postgres offers several server side languages. As an Oracle admin you
will probably find PL/pgSQL most familiar. But you could also use Perl
or Python or several others. And of course you could use any
programming/scripting language you like on the client side.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [email protected]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Attachments:

  [application/pgp-signature] signature.asc (833B, 2-signature.asc)
  download

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: update faster way
  2024-09-14 11:24 Re: update faster way Peter J. Holzer <[email protected]>
@ 2024-09-14 14:56 ` yudhi s <[email protected]>
  2024-09-15 22:18   ` Re: update faster way Peter J. Holzer <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: yudhi s @ 2024-09-14 14:56 UTC (permalink / raw)
  To: [email protected]

On Sat, Sep 14, 2024 at 4:55 PM Peter J. Holzer <[email protected]> wrote:

>
> Which in turn means that you want as little overhead as possible per
> batch which means finding those 5000 rows should be quick. Which brings
> us back to Igor's question: Do you have any indexes in place which speed
> up finding those 5000 rows (the primary key almost certainly won't help
> with that). EXPLAIN (ANALYZE) (as suggested by Laurenz) will certainly
> help answering that question.
>
> > And also those rows will not collide with each other. So do you think
> > that approach can anyway cause locking issues?
>
> No, I don't think so. With a batch size that small I wouldn't expect
> problems even on the live partition. But of course many busy parallel
> sessions will put additional load on the system which may or may not be
> noticeable by users (you might saturate the disks writing WAL entries
> for example, which would slow down other sessions trying to commit).
>
>
> > Regarding batch update with batch size of 1000, do we have any method
> exists in
> > postgres (say like forall statement in Oracle) which will do the batch
> dml. Can
> > you please guide me here, how we can do it in postgres.
>
> Postgres offers several server side languages. As an Oracle admin you
> will probably find PL/pgSQL most familiar. But you could also use Perl
> or Python or several others. And of course you could use any
> programming/scripting language you like on the client side.
>
>
 When you said *"(the primary key almost certainly won't help with that)", *I
am trying to understand why it is so ?
I was thinking of using that column as an incrementing filter and driving
the eligible rows based on that filter. And if it would have been a
sequence. I think it would have helped but in this case it's UUID , so I
may not be able to do the batch DML using that as filter criteria. but in
that case will it be fine to drive the update based on ctid something as
below? Each session will have the range of 5 days of data or five partition
data and will execute a query something as below which will update in the
batches of 10K and then commit. Is this fine? Or is there some better way
of doing the batch DML in postgres plpgsql?

DO $$
DECLARE
    l_rowid_array ctid[];
    l_ctid ctid;
    l_array_size INT := 10000;
    l_processed INT := 0;
BEGIN

    FOR l_cnt IN 0..(SELECT COUNT(*) FROM part_tab WHERE   part_date >
'1-sep-2024' and part_date < '5-sep-2024'
) / l_array_size LOOP
        l_rowid_array := ARRAY(
            SELECT ctid
            FROM part_tab
            WHERE part_date   > '1-sep-2024' and part_date < '5-sep-2024'
            LIMIT l_array_size OFFSET l_cnt * l_array_size
        );

        FOREACH l_ctid IN ARRAY l_rowid_array LOOP
            update  part_tab
            SET column1 = reftab.code
           FROM reference_tab reftab
            WHERE tab_part1.column1 = reftab.column1
            and ctid = l_ctid;
            l_processed := l_processed + 1;
        END LOOP;

        COMMIT;
    END LOOP;

END $$;


^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: update faster way
  2024-09-14 11:24 Re: update faster way Peter J. Holzer <[email protected]>
  2024-09-14 14:56 ` Re: update faster way yudhi s <[email protected]>
@ 2024-09-15 22:18   ` Peter J. Holzer <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Peter J. Holzer @ 2024-09-15 22:18 UTC (permalink / raw)
  To: [email protected]

On 2024-09-14 20:26:32 +0530, yudhi s wrote:
> 
> 
> On Sat, Sep 14, 2024 at 4:55 PM Peter J. Holzer <[email protected]> wrote:
> 
> 
>     Which in turn means that you want as little overhead as possible per
>     batch which means finding those 5000 rows should be quick. Which brings
>     us back to Igor's question: Do you have any indexes in place which speed
>     up finding those 5000 rows (the primary key almost certainly won't help
>     with that). EXPLAIN (ANALYZE) (as suggested by Laurenz) will certainly
>     help answering that question.
> 
>     > And also those rows will not collide with each other. So do you think
>     > that approach can anyway cause locking issues?
> 
>     No, I don't think so. With a batch size that small I wouldn't expect
>     problems even on the live partition. But of course many busy parallel
>     sessions will put additional load on the system which may or may not be
>     noticeable by users (you might saturate the disks writing WAL entries
>     for example, which would slow down other sessions trying to commit).
> 
> 
>     > Regarding batch update with batch size of 1000, do we have any method
>     exists in
>     > postgres (say like forall statement in Oracle) which will do the batch
>     dml. Can
>     > you please guide me here, how we can do it in postgres.
> 
>     Postgres offers several server side languages. As an Oracle admin you
>     will probably find PL/pgSQL most familiar. But you could also use Perl
>     or Python or several others. And of course you could use any
>     programming/scripting language you like on the client side.
> 
> 
> 
>  When you said "(the primary key almost certainly won't help with that)", I am
> trying to understand why it is so ? 

I was thinking that you would do something like 

    begin;

    update with a as (
        select id from the_table
        where :part_lower <= id and id < :part_upper and col_x = :old
        limit 5000
    )
    update the_table set col_x = :new
    from a where the_table.id = a.id;

    commit; 

in a loop until you you update 0 rows and then switch to the next
partition. That pretty much requires an index on col_x or you will need
a sequential scan to find the next 5000 rows to update.

Even if you return the ids and leed the last updated id back into the
loop like this:

    update with a as (
        select id from the_table
        where id > :n and col_x = :old
        order by id
        limit 5000
    )
    update the_table set col_x = :new
    from a where the_table.id = a.id;

that may lead to a lot of extra reads from the heap or the optimizer
might even decide it's better to go for a sequential scan.

The latter is pretty unlikely if you restrict the range of ids:

    update the_table set col_x = :new
    where :n <= id and id < :n + 5000 and col_x = :old;

but that will possible result in a lot of queries which don't update
anything at all but still need to read 5000 rows each.


> I was thinking of using that column as an incrementing filter and driving the
> eligible rows based on that filter. And if it would have been a sequence. I
> think it would have helped but in this case it's UUID , so I may not be able to
> do the batch DML using that as filter criteria.

You can order by uuid or compare them to other uuids. So my first two
approaches above would still work.

> but in that case will it be fine to drive the update based on ctid
> something as below? Each session will have the range of 5 days of data
> or five partition data and will execute a query something as below
> which will update in the batches of 10K and then commit. Is this fine?
> Or is there some better way of doing the batch DML in postgres
> plpgsql?
> 
> DO $$
> DECLARE
>     l_rowid_array ctid[];
>     l_ctid ctid;
>     l_array_size INT := 10000;
>     l_processed INT := 0;
> BEGIN
>  
>     FOR l_cnt IN 0..(SELECT COUNT(*) FROM part_tab WHERE   part_date >
> '1-sep-2024' and part_date < '5-sep-2024'
> ) / l_array_size LOOP
>         l_rowid_array := ARRAY(
>             SELECT ctid
>             FROM part_tab
>             WHERE part_date   > '1-sep-2024' and part_date < '5-sep-2024'
>             LIMIT l_array_size OFFSET l_cnt * l_array_size

Never use LIMIT and OFFSET without an ORDER BY, especially not when you
are updating the table. You may get some rows twice and some never.
ALso OFFSET means you are reading all those rows and then ignoring
them. I expect this to be O(n²).

>         );
>  
>         FOREACH l_ctid IN ARRAY l_rowid_array LOOP
>             update  part_tab
>             SET column1 = reftab.code
>            FROM reference_tab reftab

I see you are using a reference table and I think you mentioned that you
will be updating most rows. So that alleviates my concerns that you may
read lots of rows without updating them. But you still need an efficient
way to get at the next rows to update.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [email protected]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Attachments:

  [application/pgp-signature] signature.asc (833B, 2-signature.asc)
  download

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

end of thread, other threads:[~2024-09-15 22:18 UTC | newest]

Thread overview: 3+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-09-14 11:24 Re: update faster way Peter J. Holzer <[email protected]>
2024-09-14 14:56 ` yudhi s <[email protected]>
2024-09-15 22:18   ` Peter J. Holzer <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox