Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1spQJu-001Not-VJ for pgsql-general@arkaria.postgresql.org; Sat, 14 Sep 2024 10:48:04 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1spQJu-00A8cM-Nc for pgsql-general@arkaria.postgresql.org; Sat, 14 Sep 2024 10:48:02 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1spQJu-00A8cE-6n for pgsql-general@lists.postgresql.org; Sat, 14 Sep 2024 10:48:02 +0000 Received: from mail-qt1-x835.google.com ([2607:f8b0:4864:20::835]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1spQJr-0016ry-8F for pgsql-general@lists.postgresql.org; Sat, 14 Sep 2024 10:48:01 +0000 Received: by mail-qt1-x835.google.com with SMTP id d75a77b69052e-457ce5fda1aso24853371cf.1 for ; Sat, 14 Sep 2024 03:47:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1726310879; x=1726915679; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=I5Qd7AAYy7MXmQg5D5H31cf2YN02cmAA1VL9GzvGALM=; b=Op2A9enlP+W3rPq1XeXJfTS0iqQHQHQ30tisWQ5RdLrIhoxOyFzBOoipedtuDE+lHm bDS54or4Xornkh/7YJq64Dak9guj6iBM3qwz+RplS6vI49tXI/ZDk6zkFBsHGlm5uGbi md3hnsKG/wC53GG2CALNfMBXHAyhBaZZkO+A9Mi0oO758EbkLEdnR1GEDdtGnpv+ltya Nl98SGBjy35ElQpeJEBIjX5BQSBpMAtoxuVzOi7wTbPFrqsQMUmmLAY/4fvt4btrziEZ eHvnT8gboE8mA9mts4RkzytFsuv54OiG0m3onVfn/fjIOXBxabURoOtM5qgI2TGtIc2y Jnhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1726310879; x=1726915679; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=I5Qd7AAYy7MXmQg5D5H31cf2YN02cmAA1VL9GzvGALM=; b=cOZEtalyDdR8/9yS4Dkp4vAAM3KsqgpWTMppwv3XSEqi3h8QfIUuL1lSwdFKzb/GYx ObUOprNqCk9IEq1/XEgHTubxS7WpD3VjSG7062p97Kkc5psfgCMR1zC+qA8eG4cc5uun Vke0XOcUx+E2nqQuaunIMVvFZoRP99V4tfj+o+BhvViDkXPFQhpAzpyAyPRdGqOkUxxm YVimWhGBDWgXS+i3TYkstk3ztYwFmmJxLQeoQ0zwUp10P5ytZ6AuwI2v3z58HABGdQ7l zRBZYdsKxnn8lmC3pVNFsEkuOD9VkeGHxYnpHtJ7AW3gc+s/Nux/8F800VXwB+R+2+k1 NaGA== X-Gm-Message-State: AOJu0YxnQ2SaxZQZs5RK1nkT35RF2GbK7K2jejdTB4++yf3O+xQydvMw KxghGQhU7UO6Vkeoro6kzpRKhnE6CjCHzPRPdMsUmSBmJMOemswtxLiT6VEyDkRxA1syex30EtR FrQUGW4AxF9D0czdtKpX1BLDBREufrQ== X-Google-Smtp-Source: AGHT+IHD8yFmyZO7JS5Vu9AK01xyU74SnNxBIAguTf+jP03hnf/aCx6gxMoUrB9LxA92fFX1oLQRhyQpgy8i76j70ZA= X-Received: by 2002:a05:6870:b4a8:b0:261:360:746c with SMTP id 586e51a60fabf-27c3f24faa8mr6355192fac.19.1726310427727; Sat, 14 Sep 2024 03:40:27 -0700 (PDT) MIME-Version: 1.0 References: <1955d0e6cafd643520d282a74d9956340983074e.camel@cybertec.at> In-Reply-To: <1955d0e6cafd643520d282a74d9956340983074e.camel@cybertec.at> From: yudhi s Date: Sat, 14 Sep 2024 16:10:15 +0530 Message-ID: Subject: Re: update faster way To: Laurenz Albe Cc: pgsql-general Content-Type: multipart/alternative; boundary="000000000000d1214c062211f7da" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000d1214c062211f7da Content-Type: text/plain; charset="UTF-8" On Sat, 14 Sept, 2024, 1:09 pm Laurenz Albe, wrote: > On Sat, 2024-09-14 at 08:43 +0530, yudhi s wrote: > > We have to update a column value(from numbers like '123' to codes like > 'abc' > > by looking into a reference table data) in a partitioned table with > billions > > of rows in it, with each partition having 100's millions rows. As we > tested > > > for ~30million rows it's taking ~20minutes to update. So if we go by this > > calculation, it's going to take days for updating all the values. So my > > question is > > > > 1) If there is any inbuilt way of running the update query in parallel > > (e.g. using parallel hints etc) to make it run faster? > > 2) should we run each individual partition in a separate session (e.g. > five > > partitions will have the updates done at same time from 5 different > > sessions)? And will it have any locking effect or we can just start > the > > sessions and let them run without impacting our live transactions? > > Option 1 doesn't exist. > Option 2 is possible, and you can even have more than one session workingr > on a single partition. > > However, the strain on your system's resources and particularly the row > locks will impair normal database work. > > Essentially, you can either take an extended down time or perform the > updates > in very small chunks with a very low "lock_timeout" over a very long period > of time. If any of the batches fails because of locking conflicts, it has > to be retried. > > Investigate with EXPLAIN (ANALYZE) why the updates take that long. It > could > be a lame disk, tons of (unnecessary?) indexes or triggers, but it might as > well be the join with the lookup table, so perhaps there is room for > improvement (more "work_mem" for a hash join?) > Thank you so much Laurenz. We have mostly insert/update happen on current day/live partition. So considering that, if we will run batch updates(with batch size of 1000) from five different sessions in parallel on different historical partition, at any time they will lock 5000 rows and then commit. And also those rows will not collide with each other. So do you think that approach can anyway cause locking issues? We will ensure the update of live partition occurs when we have least activity. So in that way we will not need extended down time. Please correct me if wrong. Never used lock_timeout though, but in above case do we need lock_timeout? Regarding batch update with batch size of 1000, do we have any method exists in postgres (say like forall statement in Oracle) which will do the batch dml. Can you please guide me here, how we can do it in postgres. And yes will need to see what happens in the update using explain analyze. And I was trying to see, if we can run explain analyze without doing actual update , but seems that is not possible. > --000000000000d1214c062211f7da Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Sat, 14 Sept, 2024, 1:09 pm Laurenz Albe, <laurenz.albe@cybertec.at> wrote:
On Sat, 2024-09-14 at 08:43 +0530, yudhi s wrote:
> We have to update a column value(from numbers like '123' to co= des like 'abc'
> by looking into a reference table data) in a partitioned table with bi= llions
> of rows in it, with each partition having 100's millions rows. As = we tested

> for ~30million rows it's taking ~20minutes to update. So if we go = by this
> calculation, it's going to take days for updating all the values. = So my
> question is
>
> 1) If there is any inbuilt way of running the update query in parallel=
>=C2=A0 =C2=A0 (e.g. using parallel hints etc) to make it run faster? > 2) should we run each individual partition in a separate session (e.g.= five
>=C2=A0 =C2=A0 partitions will have the updates done at same time from 5= different
>=C2=A0 =C2=A0 sessions)? And will it have any locking effect or we can = just=C2=A0start the
>=C2=A0 =C2=A0 sessions and let them run without impacting our live tran= sactions?

Option 1 doesn't exist.
Option 2 is possible, and you can even have more than one session workingr<= br> on a single partition.

However, the strain on your system's resources and particularly the row=
locks will impair normal database work.

Essentially, you can either take an extended down time or perform the updat= es
in very small chunks with a very low "lock_timeout" over a very l= ong period
of time.=C2=A0 If any of the batches fails because of locking conflicts, it= has
to be retried.

Investigate with EXPLAIN (ANALYZE) why the updates take that long.=C2=A0 It= could
be a lame disk, tons of (unnecessary?) indexes or triggers, but it might as=
well be the join with the lookup table, so perhaps there is room for
improvement (more "work_mem" for a hash join?)


Thank you so much Laurenz.

We have mostly insert/update happen on current day/live partit= ion. So considering that, if we will run batch updates(with batch size of 1= 000) from five different sessions in parallel on different historical parti= tion, at any time they will lock 5000 rows and then commit. And also those = rows will not collide with each other. So do you think that approach can an= yway cause locking issues? We will ensure the update of live partition occu= rs when we have least activity. So in that way we will not need extended do= wn time. Please correct me if wrong.=C2=A0

Never used lock_timeout though, but in above case do we = need lock_timeout?=C2=A0

Regarding batch update with batch size of 1000, do we have any method exis= ts in postgres (say like forall statement in Oracle) which will do the batc= h dml. Can you please guide me here, how we can do it in postgres.

And yes will need to see what ha= ppens in the update using explain analyze. And I was trying to see, if we c= an run explain analyze without doing actual update , but seems that is not = possible.=C2=A0

=C2=A0


--000000000000d1214c062211f7da--