Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sPVtL-0051Rg-SM for pgsql-general@arkaria.postgresql.org; Thu, 04 Jul 2024 23:29:32 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sPVtJ-003t2d-OX for pgsql-general@arkaria.postgresql.org; Thu, 04 Jul 2024 23:29:30 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sPVtJ-003t2V-DT for pgsql-general@lists.postgresql.org; Thu, 04 Jul 2024 23:29:30 +0000 Received: from mail-oo1-xc36.google.com ([2607:f8b0:4864:20::c36]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sPVtI-000Sn7-E3 for pgsql-general@lists.postgresql.org; Thu, 04 Jul 2024 23:29:29 +0000 Received: by mail-oo1-xc36.google.com with SMTP id 006d021491bc7-5b97a071c92so497865eaf.1 for ; Thu, 04 Jul 2024 16:29:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1720135766; x=1720740566; darn=lists.postgresql.org; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=FRksWS19/sfgKsF6EQ/zxrrDwgxxv8B73KOV3Bw+PqM=; b=Wf1hqNaPqmSkfPWYASYSyKBGaNSiaS9xQP7JaQFDV9aynzRN3fvKO1oUHq4FG584aa k9xEFPtCcuY7DYZswPzgSqHLyfh13UU2j+E5XkilagISbHXTfM2kY+FZnETFLbzNSmeE izOBEnCgSl+O4hBNMnIe7LlXuo9YdSeu27vx6ZOA8h+GsVNtMHBAaDGdq4101WCjm8mA pW/bsMh5czItcEQOl5oX0qZAA/qf9NSl+BhgeyvIXTH5OzrrrHHjmoZVYv8Ms+h5CY4t KjZJLdsZ7eRWOJpAKKg+W6AYLwZKr+QYrj1srBaCf/717jihr8tOuvJW1+fnfaXHhu39 y+9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1720135766; x=1720740566; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=FRksWS19/sfgKsF6EQ/zxrrDwgxxv8B73KOV3Bw+PqM=; b=FT9BSbIghhgyouyU7DTIlO3kGg0jt4qp23tFU9pzgIICiRYbbKDwEduafdL1B7yNpl 2w/D2fZt9bpvC7jSlkY4FPuuLouo/vVZXddAP8uIPwog8d0ss5d1hC4aeqKpYjOmRDxs RkVgrTnuNJe0eN8NRSKZSY4Ye9o4LekVvjVG6IShCGT+Qr1g8WeKhoEMZ//t/9hONuPl PU3x2s7NCfyWZOgeX8/bzKvV1hDUyTNcdiCbzjae1hWKBIahDXSJPxLcA8EUDcTKNTpt novcfqadazRXL3GpfFMgvY0D0DvEEMycTvuoIj81UOVeQzFaqPTBn0MD0tHDzb15cYQn NGmA== X-Gm-Message-State: AOJu0YxtWzRMkvZxIjG/T8eDGRh9g/emndgZkjpuxHUTivRivm8Yx7jF nt5iBPB0Xz7teQ8QhHbUGWlFh0b63sQwWh06FBKjo1/fF58UtIvn+bCZxElP X-Google-Smtp-Source: AGHT+IEkwEcFihiv5yxHGXYiwWXKZ0ZuR0vRbGGjJuHfRCceiOQ1LyIKp7nS/Ka4TOvbEol4Q+wgYQ== X-Received: by 2002:a05:6359:5f8c:b0:19c:6472:42d5 with SMTP id e5c5f4694b2df-1aa98c39fccmr274142355d.12.1720135766099; Thu, 04 Jul 2024 16:29:26 -0700 (PDT) Received: from smtpclient.apple ([2600:8801:8600:3d2::12]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-72c6c8ebf0fsm8556446a12.61.2024.07.04.16.29.24 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 04 Jul 2024 16:29:25 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3774.600.62\)) Subject: Re: Design strategy for table with many attributes From: Guyren Howe In-Reply-To: Date: Thu, 4 Jul 2024 16:29:13 -0700 Cc: "pgsql-general@lists.postgresql.org" Content-Transfer-Encoding: quoted-printable Message-Id: <2A2982A5-C93C-4A88-8ADE-6F3565D4B42A@gmail.com> References: To: Lok P X-Mailer: Apple Mail (2.3774.600.62) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Ultimately, the way you should store the data depends on how you will = use it. When you retrieve these values, what are the different ways = you=E2=80=99ll be using them? Normalised representations are more flexible, and the pragmatic, = performance-based consideration is that all the values in a row are = always effectively retrieved together. So if you reasonably often have a = query that only accesses the creation date and transaction id, then it = will be pretty slow if you are also always retrieving 500 other columns = you don=E2=80=99t need. So: you might often pull all the values *other* than the attributes = (creation date, creator, job information, whatever) together. This = argues that those values should be in one table, and the attributes in = another. Will you usually be using *all* of the attributes for a particular = transaction together in the same operation? It might make sense to store = them in eg an array in that case. But this significantly reduces the = speed of accessing particular attributes separately. It is likely that you will want to slice things by particular named = attribute across many transactions. This argues for the more normalised = form, as does the general principle of doing things in the way that is = most general and flexible. When considering how the data will be used, please consider not only the = developers of your current application(s), but also eg data analysts, = managers, future applications etc. The less specific you can be about how you want to use the data, the = more you should lean toward fully normalising. Note also that you can store your data in a normalised and maximally = flexible form, but also use triggers or batch jobs to gather various = permutations of the data for specific purposes. If you really do, say, = both have some major part of your project that uses all the attributes = on a given transaction together, but you also have other uses, you may = want to store both the normalised/attribute table and the =E2=80=9Call = the values together=E2=80=9D version. Even if you want to store =E2=80=9Call the values together=E2=80=9D, it = may well be better to use an array, JSON or HStore, rather than having a = super-wide table. JSON would eg let you enumerate all the column names = (for example) and employ Postgres=E2=80=99s really nice JSON query = features. > On Jul 4, 2024, at 12:37, Lok P wrote: >=20 > Hello, > In one of the applications we are getting transactions in = messages/events format and also in files and then they are getting = parsed and stored into the relational database. The number of = attributes/columns each transaction has is ~900+. Logically they are = part of one single transaction and should be stored in one table as one = single row. There will be ~500million such transactions each day coming = into the system. And there will be approx ~10K peak write TPS and 5K = read TPS in target state. This system has a postgres database as a = "source of truth" or OLTP store. And then data moves to snowflakes for = the olap store. >=20 > Initially when the system was designed the number of attributes per = transaction was <100 but slowly the business wants to keep/persist other = attributes too in the current system and the number of columns keep = growing. >=20 > However, as worked with some database systems , we get few suggestions = from DBA's to not have many columns in a single table. For example in = oracle they say not to go beyond ~255 columns as then row chaining and = row migration type of things are going to hunt us. Also we are afraid = concurrent DMLS on the table may cause this as a contention point. So I = wanted to understand , in such a situation what would be the best design = approach we should use irrespective of databases? Or say, what is the = maximum number of columns per table we should restrict? Should we break = the single transaction into multiple tables like one main table and = other addenda tables with the same primary key to join and fetch the = results wherever necessary? >=20 > Regards > Lok