Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sbmtV-001Qr1-0r for pgsql-general@arkaria.postgresql.org; Wed, 07 Aug 2024 20:04:25 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sbmtT-009hVG-I9 for pgsql-general@arkaria.postgresql.org; Wed, 07 Aug 2024 20:04:23 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sbmtT-009hV6-5P for pgsql-general@lists.postgresql.org; Wed, 07 Aug 2024 20:04:23 +0000 Received: from mail-ot1-x333.google.com ([2607:f8b0:4864:20::333]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sbmtQ-003gSg-R4 for pgsql-general@lists.postgresql.org; Wed, 07 Aug 2024 20:04:22 +0000 Received: by mail-ot1-x333.google.com with SMTP id 46e09a7af769-704466b19c4so133414a34.0 for ; Wed, 07 Aug 2024 13:04:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1723061059; x=1723665859; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=jrjWp4YdvIgyqtYBGber+H5uGHDfxBoZtA4T71jP2WE=; b=AhbkiiI1qSfaKvsOvO3fFo14QYjHCLq8jUwSTfujSjxrjbYWmwuYe/qbsB/Mb2zZQz n9FeOhnrJHZXJMVWS2Bwvkjc6RK1TwUj69kN6YZaTqwZR8DTFVzaw1PwfHQxsQC4UYS3 ySzKenUK+PIsF/AaRMft0faEQTK+/UpUuwFSnNi0uSAMa5sl1o/i9PdfQdOUC9Q5Oqim rGNxtZ0z/c+C9HMZs5UVjQsGdpo01sZD9EJkUdNqzJltslF3Epe/FthfIpxPx4hGZdIO ogiPBqJI54A94rtgx6y549+AY3x62X5QfZzh1JT7c5d5xzzTr6wJAa1IwWz33yW0T44W qdWg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1723061059; x=1723665859; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=jrjWp4YdvIgyqtYBGber+H5uGHDfxBoZtA4T71jP2WE=; b=CpSp7JiRfMwY3RxeYFQF932a+q/OllIWtf8MQA4e0HPY1XoYRFH2wE5NRGQ84A83Rd vvQty3Fgbw9gBnv0hdqJOjc97NJZqBACJ9QKvLrI9BnW0JnFnsymEOCNCw3VR/m673xs Rp8akhIecXj7jI88VhnfnWT/ff7XLHaCFqcFBLA971QTFtsgDjdjRFUuRp8zaSQf156Y XIDMjCV70xuYpCo3TlsIIuuVPn1YvkO4augHw+yt9dqveZvUm3TailxSuAtoLYF5Xhbu qAsyiIa5EY6m7chgw8TDIBlTqZrN49V4xeXB/9DBwSkUK3S4e/HthY7w+xE8YCZCi6ai oyew== X-Gm-Message-State: AOJu0Yx5aO66nugpOsFTAvgOFLiL68xh8V5XSJDlhp8o15jNiy8DZHLN GFRRsCLU67q8cM2RxumWLboHfIQ+pTPneHNy++GLlTKxPJP3ZSr9UEyBTSABgeCR06pKA/biQ1J XS1kl/KPSdRK9IOmucN3bSEYXd0GRVqOc X-Google-Smtp-Source: AGHT+IHaXYNYax3xKYwc1m6BL9r6UZcvF/VmKu8AedbA5fExOJVVAnIrDv5+mUEBMFSQfadAfKI8pLOzzh1gHK3pBTQ= X-Received: by 2002:a05:6830:2586:b0:709:3d9e:f88b with SMTP id 46e09a7af769-709b99785fcmr21324097a34.30.1723061059138; Wed, 07 Aug 2024 13:04:19 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: yudhi s Date: Thu, 8 Aug 2024 01:34:05 +0530 Message-ID: Subject: Re: Standard of data storage and transformation To: Ron Johnson Cc: pgsql-general Content-Type: multipart/alternative; boundary="0000000000005b3a74061f1d6a0e" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000005b3a74061f1d6a0e Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Wed, Aug 7, 2024 at 3:13=E2=80=AFAM Ron Johnson wrote: > On Tue, Aug 6, 2024 at 5:07=E2=80=AFPM yudhi s > wrote: > >> Hi All, >> We are having a use case in which we are having transaction data for >> multiple customers in one of the Postgres databases(version 15.4) and we >> are consuming it from multiple sources(batch file processing, kafka even= t >> processing etc). It's currently stored in normalized form postgres >> with constraints, indexes, partitions defined. This postgres database is >> holding the transaction data for around a month or so. There are use cas= es >> of running online transaction search reports which will be mostly real t= ime >> reporting and also some daily transaction batch reports based on custome= rs >> and also month end reports for customers. In target state it will hold >> Approx. ~400 million transactions/day which can be billions of rows acro= ss >> multiple related parent/child tables. >> >> There is another requirement to send these customer transaction data to >> an olap system which is in a snowflake database and there it will be >> persisted for many years. The lag between the data in postgres/oltp and = in >> snowflake will be ~1hr. And any reporting api can query postgres for <1 >> month worth of transaction data and if it needs to scan for >1month wort= h >> of transaction data, it will point to the snowflake database. >> >> Now the question which we are wondering is , should we send the data as >> is in normalized table form to snowflake and then there we >> transform/flatten the data to support the reporting use case or should w= e >> first flatten or transform the data in postgres itself and make it as >> another structure( for example creating materialized views on top of bas= e >> table) and only then move that data to the snowflake? What is the >> appropriate standard and downside if we do anything different. >> > > Some thoughts: > 0) How big are the records? > 1) Materialized views add disk space overhead. > 2) Materialized views are for when you query the same static over and ove= r > again. > 3) IIUC, you'll be moving the data from PG to Snowflake just once. > 4) Writing an MV to disk and then reading it only once doubles the IO > requirements. > 5) Thus, my first thought would be to extract the data from PG using a > denormalizing "plain" view. > 5a) If you can't make that Fast Enough, then obviously you must pull the > normalized data from PG and denorm it elsewhere. You know your situation > better than us. > 6) Indices will be critical: not too many, but not too few. > 7) Obviously consider partitioning, but note that too many partitions can > make query planning MUCH slower. > 7a) 31 days cut into hours means 744 partitions. That's a LOT. > 7b) Partitioning every THREE hours means only 248 child tables. A lot, > but much more manageable. > 7c) That might well kill reporting performance, though, if it's for > example one customer across many partitions. > 8) You (hopefully) know what kind of queries will be run. Maybe partitio= n > by customer (or whatever) range and THEN by an hour range. > 8a) You'd have to simultaneously run multiple extract jobs (on for each > "customer" range), but that might not be too onerous, since then each hou= r > partition would be smaller. > 9) Testing. Nothing beats full-scale testing. > > > The table has ~100+ columns but I think the denormalized or the flatten table which we are planning to create will mostly have a large number of columns in it as that will be based on the columns from multiple tables joined together. However, the plan was to have the cooked data ready so as not to do the additional work in target or downstream. So I was thinking of a physical model for persisting the transformed data(using MV) rather than using a simple view which will have performance overhead. Because what is happening is , after the data moves to snowflake , people try to create their own version of the transformed table on top of these normalized tables(which we call as refiners) and then query those from UI or for reporting. And some people say we should avoid those downstream refiners and should do it here in source/postgres.Also the plan was to move the data from postgres once every hour. --0000000000005b3a74061f1d6a0e Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Wed, Aug 7, 2024 at 3:13=E2=80=AFAM Ro= n Johnson <ronljohnsonjr@gmai= l.com> wrote:
On Tue, = Aug 6, 2024 at 5:07=E2=80=AFPM yudhi s <learnerdatabase99@gmail.com> wrote:=
Hi All,
We are having a use case in which = we are having transaction data for multiple customers in one of the Postgre= s databases(version 15.4) and we are consuming it from multiple sources(bat= ch file processing, kafka event processing etc). It's currently stored = in normalized form=C2=A0postgres with=C2=A0constraints,=C2=A0indexes, parti= tions defined. This postgres database is holding the transaction data for a= round a month or so. There are use cases of running online transaction sear= ch reports which will be mostly real time reporting and also some daily tra= nsaction batch reports based on customers and also month end reports for cu= stomers. In target state it will hold Approx. ~400 million transactions/day= which can be billions of rows across multiple related parent/child tables.=

There is another requirement to send these customer tran= saction data to an olap system which is in a snowflake database and there i= t will be persisted for many years. The lag between the data in postgres/ol= tp and in snowflake will be ~1hr. And any reporting api can query postgres = for <1 month worth of transaction data and if it needs to scan for >1= month worth of transaction data, it will point to the snowflake database.
Now the question which we are wondering is , should we send the data = as is in normalized table form to snowflake and then there we transform/fla= tten the data to support the reporting use case or should we first flatten = or transform the data in postgres itself and make it as another structure( = for example creating materialized views on top of base table) and only then= move that data to the snowflake? What is the appropriate standard and down= side if we do anything different.

Some thoughts:
0) How big are the records?
1) Ma= terialized views add disk space overhead.
2) Materialized views a= re for when you query the same static over and over again.
3) IIU= C, you'll be moving the data from PG to Snowflake just once.
= 4) Writing an MV to disk and then reading it only once doubles the IO requi= rements.
5) Thus, my first thought would be to extract the data f= rom PG using a denormalizing "plain"=C2=A0view.=C2=A0
5= a) If you can't make that Fast Enough, then obviously you must pull the= normalized data from PG and denorm it elsewhere.=C2=A0 You know your situa= tion better than us.
6) Indices will be critical: not too many, b= ut not too few.
7) Obviously consider partitioning, but note that= too many partitions can make query planning MUCH slower.
7a) 31 = days cut into hours means 744 partitions.=C2=A0 That's a LOT.
7b) Partitioning=C2=A0every THREE hours means only 248 child tables.=C2=A0= A lot, but much more manageable.
7c) That might well kill report= ing performance, though, if it's for example one customer across many p= artitions.
8) You (hopefully) know what kind of queries will be r= un.=C2=A0 Maybe partition by customer (or whatever) range and THEN by an ho= ur range.
8a) You'd have to simultaneously run multiple extra= ct jobs (on for each "customer" range), but that might not be too= onerous, since then each hour partition would be smaller.
9) Tes= ting.=C2=A0 Nothing beats full-scale testing.


The table has ~100+ columns but I t= hink the denormalized or the flatten table which we are planning to create = will mostly have a large number of columns in it as that will be based on t= he columns from multiple tables joined together. However, the plan was to h= ave the cooked data ready so as not to do the additional work in target or = downstream. So I was thinking of a physical model for persisting the transf= ormed data(using=C2=A0MV) rather than using a simple view which will have p= erformance overhead.=C2=A0

Because what is happeni= ng is , after the data moves to snowflake , people try to create their own = version of the transformed table on top of these=C2=A0normalized tables(whi= ch we call as refiners) and then query those from UI or for reporting. And = some people say we should avoid those downstream refiners and should do it = here in source/postgres.Also the plan was to move the data from postgres on= ce every hour.
--0000000000005b3a74061f1d6a0e--