Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sbRxq-00Gech-Vv for pgsql-general@arkaria.postgresql.org; Tue, 06 Aug 2024 21:43:30 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sbRxp-0036cl-HZ for pgsql-general@arkaria.postgresql.org; Tue, 06 Aug 2024 21:43:29 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sbRxp-0036bW-3e for pgsql-general@lists.postgresql.org; Tue, 06 Aug 2024 21:43:29 +0000 Received: from mail-oo1-xc2c.google.com ([2607:f8b0:4864:20::c2c]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sbRxm-003Wjz-PD for pgsql-general@lists.postgresql.org; Tue, 06 Aug 2024 21:43:28 +0000 Received: by mail-oo1-xc2c.google.com with SMTP id 006d021491bc7-5d5f24d9df8so582087eaf.2 for ; Tue, 06 Aug 2024 14:43:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1722980605; x=1723585405; darn=lists.postgresql.org; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date:message-id:reply-to; bh=+kzxK8WcwVh/LfMYpnJo4UfMp9w1p2erevFLHl0gp8g=; b=cCNUxgfYSmGaza68BiTIIqGilZiaVkT+QhL8wFYCB3inJwCvBzuSro9uLG0uipQEc6 ztwME/Vky1XZnBgCxjqkCdf1daCgre7UD+Hofwnu9b6uvOjdIEcbJAPzF0wgyOpKYMIL rMJmQfvCBXM15U+IrF46xAvbjrg7e8EpOOaBr0McG07RtIDc3TAxbelbkPS3mfsudXOt dFIrOZKmo9A6nkWr0BWd5h39p83BS9qOmZ+TkmG35P8ayqrANyWIn+SR4NTbd6LmkgsO c9G4qt+MBWBd7dp5Wuv4smbZUmW+EL5DozmZ2oLQJL+Nw9PTrY0TUpssttn3/WUk4gHe Ckvg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722980605; x=1723585405; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=+kzxK8WcwVh/LfMYpnJo4UfMp9w1p2erevFLHl0gp8g=; b=KrX7KPxJXENyon/5dwACzKUm2QkB1Al66UBkt4wqo+XhAXkoBv9YIRrv+6wtvOlmhL 51vLdyDslpAXIoO2t8AHBuK3abkzzQm7lDNcwylTKDYZGcDb5UY/kG3aaFQETFNZXHKp XJETg1Hz7fz/2IUSJSRjIa9csttyhtv996AJg9T6aM56xXLJG0qQgwPWCG8vKMh/wQ2r /yJNfMRkui93OQu/gW9HiRMfjYupAc6e58wnGDvSrLIt7b0Ta6mCVgSR6pVexTXC8LT5 2Qvxhe7nttL+uP6effTyJqAhCGYw/P6Og77HsaGg5qcDb0UQowOhuJ9ZEIBbI8c2b4Ql 8Fyw== X-Gm-Message-State: AOJu0Yzqc9dCnZJVEyVnvsg7l3/hmy8H+X2y43oNs5zYHaQAc2d0ss7G yZgfj5fKlLbxIUyBZ2JbUy/94B4RjVTIv63m4HGWXCSQs2lL0F53FiAKKmgJHqYRNcJgbHBldB3 j9r6uFo8l2mqnngN8cVVdfwkAIeiPeg== X-Google-Smtp-Source: AGHT+IGevleDLH4fQFbMjPGUB80iOD4fTFf9eBmVpgyp8AgdqIGIszSAtfZfwV9xcTBHecaPdzjL5YxvCXSiklVOuic= X-Received: by 2002:a05:6870:d8ca:b0:260:ffcd:21fe with SMTP id 586e51a60fabf-26891d3dc41mr17335586fac.23.1722980604919; Tue, 06 Aug 2024 14:43:24 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Ron Johnson Date: Tue, 6 Aug 2024 17:43:13 -0400 Message-ID: Subject: Re: Standard of data storage and transformation To: pgsql-general Content-Type: multipart/alternative; boundary="000000000000e945ec061f0aae1b" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000e945ec061f0aae1b Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Tue, Aug 6, 2024 at 5:07=E2=80=AFPM yudhi s wrote: > Hi All, > We are having a use case in which we are having transaction data for > multiple customers in one of the Postgres databases(version 15.4) and we > are consuming it from multiple sources(batch file processing, kafka event > processing etc). It's currently stored in normalized form postgres > with constraints, indexes, partitions defined. This postgres database is > holding the transaction data for around a month or so. There are use case= s > of running online transaction search reports which will be mostly real ti= me > reporting and also some daily transaction batch reports based on customer= s > and also month end reports for customers. In target state it will hold > Approx. ~400 million transactions/day which can be billions of rows acros= s > multiple related parent/child tables. > > There is another requirement to send these customer transaction data to a= n > olap system which is in a snowflake database and there it will be persist= ed > for many years. The lag between the data in postgres/oltp and in snowflak= e > will be ~1hr. And any reporting api can query postgres for <1 month worth > of transaction data and if it needs to scan for >1month worth of > transaction data, it will point to the snowflake database. > > Now the question which we are wondering is , should we send the data as i= s > in normalized table form to snowflake and then there we transform/flatten > the data to support the reporting use case or should we first flatten or > transform the data in postgres itself and make it as another structure( f= or > example creating materialized views on top of base table) and only then > move that data to the snowflake? What is the appropriate standard and > downside if we do anything different. > Some thoughts: 0) How big are the records? 1) Materialized views add disk space overhead. 2) Materialized views are for when you query the same static over and over again. 3) IIUC, you'll be moving the data from PG to Snowflake just once. 4) Writing an MV to disk and then reading it only once doubles the IO requirements. 5) Thus, my first thought would be to extract the data from PG using a denormalizing "plain" view. 5a) If you can't make that Fast Enough, then obviously you must pull the normalized data from PG and denorm it elsewhere. You know your situation better than us. 6) Indices will be critical: not too many, but not too few. 7) Obviously consider partitioning, but note that too many partitions can make query planning MUCH slower. 7a) 31 days cut into hours means 744 partitions. That's a LOT. 7b) Partitioning every THREE hours means only 248 child tables. A lot, but much more manageable. 7c) That might well kill reporting performance, though, if it's for example one customer across many partitions. 8) You (hopefully) know what kind of queries will be run. Maybe partition by customer (or whatever) range and THEN by an hour range. 8a) You'd have to simultaneously run multiple extract jobs (on for each "customer" range), but that might not be too onerous, since then each hour partition would be smaller. 9) Testing. Nothing beats full-scale testing. --=20 Death to America, and butter sauce! Iraq lobster... --000000000000e945ec061f0aae1b Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Tue, Aug 6, 2024 at 5:07=E2=80=AFPM yu= dhi s <learnerdatabase99@= gmail.com> wrote:
Hi All,
We are hav= ing a use case in which we are having transaction data for multiple custome= rs in one of the Postgres databases(version 15.4) and we are consuming it f= rom multiple sources(batch file processing, kafka event processing etc). It= 's currently stored in normalized form=C2=A0postgres with=C2=A0constrai= nts,=C2=A0indexes, partitions defined. This postgres database is holding th= e transaction data for around a month or so. There are use cases of running= online transaction search reports which will be mostly real time reporting= and also some daily transaction batch reports based on customers and also = month end reports for customers. In target state it will hold Approx. ~400 = million transactions/day which can be billions of rows across multiple rela= ted parent/child tables.

There is another requirement to = send these customer transaction data to an olap system which is in a snowfl= ake database and there it will be persisted for many years. The lag between= the data in postgres/oltp and in snowflake will be ~1hr. And any reporting= api can query postgres for <1 month worth of transaction data and if it= needs to scan for >1month worth of transaction data, it will point to t= he snowflake database.

Now the question which we are wondering is , = should we send the data as is in normalized table form to snowflake and the= n there we transform/flatten the data to support the reporting use case or = should we first flatten or transform the data in postgres itself and make i= t as another structure( for example creating materialized views on top of b= ase table) and only then move that data to the snowflake? What is the appro= priate standard and downside if we do anything different.

Some thoughts:
0) How big are the = records?
1) Materialized views add disk space overhead.
2) Materialized views are for when you query the same static over and over= again.
3) IIUC, you'll be moving the data from PG to Snowfla= ke just once.
4) Writing an MV to disk and then reading it only o= nce doubles the IO requirements.
5) Thus, my first thought would = be to extract the data from PG using a denormalizing "plain"=C2= =A0view.=C2=A0
5a) If you can't make that Fast Enough, then o= bviously you must pull the normalized data from PG and denorm it elsewhere.= =C2=A0 You know your situation better than us.
6) Indices will be= critical: not too many, but not too few.
7) Obviously consider p= artitioning, but note that too many partitions can make query planning MUCH= slower.
7a) 31 days cut into hours means 744 partitions.=C2=A0 T= hat's a LOT.
7b) Partitioning=C2=A0every THREE hours means on= ly 248 child tables.=C2=A0 A lot, but much more manageable.
7c) T= hat might well kill reporting performance, though, if it's for example = one customer across many partitions.
8) You (hopefully) know what= kind of queries will be run.=C2=A0 Maybe partition by customer (or whateve= r) range and THEN by an hour range.
8a) You'd have to simulta= neously run multiple extract jobs (on for each "customer" range),= but that might not be too onerous, since then each hour partition would be= smaller.
9) Testing.=C2=A0 Nothing beats full-scale testing.

--
Death to Ameri= ca, and butter sauce!
Iraq lobster...
--000000000000e945ec061f0aae1b--