Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.600.62\))
Subject: Re: Design strategy for table with many attributes
From: Guyren Howe <guyren@gmail.com>
In-Reply-To: <CAKna9VbQ=o2Yhjo1EitaxzYpWAXe9vw6QtamzCsNNp2-rQOFSA@mail.gmail.com>
Date: Thu, 4 Jul 2024 16:29:13 -0700
Cc: "pgsql-general@lists.postgresql.org" <pgsql-general@lists.postgresql.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <2A2982A5-C93C-4A88-8ADE-6F3565D4B42A@gmail.com>
References: <CAKna9VbQ=o2Yhjo1EitaxzYpWAXe9vw6QtamzCsNNp2-rQOFSA@mail.gmail.com>
To: Lok P <loknath.73@gmail.com>
Archived-At: <https://www.postgresql.org/message-id/2A2982A5-C93C-4A88-8ADE-6F3565D4B42A%40gmail.com>
Precedence: bulk

Ultimately, the way you should store the data depends on how you will =
use it. When you retrieve these values, what are the different ways =
you=E2=80=99ll be using them?

Normalised representations are more flexible, and the pragmatic, =
performance-based consideration is that all the values in a row are =
always effectively retrieved together. So if you reasonably often have a =
query that only accesses the creation date and transaction id, then it =
will be pretty slow if you are also always retrieving 500 other columns =
you don=E2=80=99t need.

So: you might often pull all the values *other* than the attributes =
(creation date, creator, job information, whatever) together. This =
argues that those values should be in one table, and the attributes in =
another.

Will you usually be using *all* of the attributes for a particular =
transaction together in the same operation? It might make sense to store =
them in eg an array in that case. But this significantly reduces the =
speed of accessing particular attributes separately.

It is likely that you will want to slice things by particular named =
attribute across many transactions. This argues for the more normalised =
form, as does the general principle of doing things in the way that is =
most general and flexible.

When considering how the data will be used, please consider not only the =
developers of your current application(s), but also eg data analysts, =
managers, future applications etc.

The less specific you can be about how you want to use the data, the =
more you should lean toward fully normalising.

Note also that you can store your data in a normalised and maximally =
flexible form, but also use triggers or batch jobs to gather various =
permutations of the data for specific purposes. If you really do, say, =
both have some major part of your project that uses all the attributes =
on a given transaction together, but you also have other uses, you may =
want to store both the normalised/attribute table and the =E2=80=9Call =
the values together=E2=80=9D version.

Even if you want to store =E2=80=9Call the values together=E2=80=9D, it =
may well be better to use an array, JSON or HStore, rather than having a =
super-wide table. JSON would eg let you enumerate all the column names =
(for example) and employ Postgres=E2=80=99s really nice JSON query =
features.


> On Jul 4, 2024, at 12:37, Lok P <loknath.73@gmail.com> wrote:
>=20
> Hello,
> In one of the applications we are getting transactions in =
messages/events format and also in files and then they are getting =
parsed and stored into the relational database. The number of =
attributes/columns each transaction has is ~900+. Logically they are =
part of one single transaction and should be stored in one table as one =
single row. There will be ~500million such transactions each day coming =
into the system. And there will be approx ~10K peak write TPS and 5K =
read TPS in target state. This system has a postgres database as a =
"source of truth" or OLTP store. And then data moves to snowflakes for =
the olap store.
>=20
> Initially when the system was designed the number of attributes per =
transaction was <100 but slowly the business wants to keep/persist other =
attributes too in the current system and the number of columns keep =
growing.
>=20
> However, as worked with some database systems , we get few suggestions =
from DBA's to not have many columns in a single table. For example in =
oracle they say not to go beyond ~255 columns as then row chaining and =
row migration type of things are going to hunt us. Also we are afraid =
concurrent DMLS on the table may cause this as a contention point. So I =
wanted to understand , in such a situation what would be the best design =
approach we should use irrespective of databases? Or say, what is the =
maximum number of columns per table we should restrict? Should we break =
the single transaction into multiple tables like one main table and =
other addenda tables with the same primary key to join and fetch the =
results wherever necessary?
>=20
> Regards
> Lok