MIME-Version: 1.0
References: <CAEHBEOBuoMFWuhHM3L_Zr6o1enELju-Vns6Pknt4TT+6MFQOwQ@mail.gmail.com>
 <fd47b28c-8f6b-4bfb-a393-160c5c3de8c0@aklaver.com> <CAEHBEOD969YrbPH_z9OEmThWx3-w4sMMaHLhZLOQwqCwE8Y58Q@mail.gmail.com>
 <caf4e99941b3c83bb9eab91e33b144b826b68f79.camel@cybertec.at>
 <CAEHBEOBXzkGTqxQSYqmEFN5hbc=zsGWFpU9h8zf7AAPv4VdOWQ@mail.gmail.com>
 <c9699a6fd331d33864fc269060d8d961f784d827.camel@cybertec.at>
 <CAEHBEOBNoG8RkKuCcQQWkbYppMLMzA0MXq+s0kZ6wKWgD7+45Q@mail.gmail.com>
 <099b49ebae94e23f19afdad3f8c9c6e702a3a2d5.camel@cybertec.at>
 <CAEHBEODw8svX557pjB_EL-Os7KWtwi-9Uq=RuCkRKgHVZWw8Bw@mail.gmail.com>
 <6d7e1022-6404-4dab-8467-8d1f6e8b63cb@aklaver.com> <CAEHBEOCpxASoNn=u21kaqOn1A-4YPy_mVfgkEjT3wRT5G4ycbg@mail.gmail.com>
In-Reply-To: <CAEHBEOCpxASoNn=u21kaqOn1A-4YPy_mVfgkEjT3wRT5G4ycbg@mail.gmail.com>
From: Igor Korot <ikorot01@gmail.com>
Date: Wed, 5 Mar 2025 21:03:10 -0600
Message-ID: <CA+FnnTzjSDE9E=TF56F-EAp6u=oPH2vmGNrjN50H53dXrev1MA@mail.gmail.com>
Subject: Re: Quesion about querying distributed databases
To: me nefcanto <sn.1361@gmail.com>
Cc: Adrian Klaver <adrian.klaver@aklaver.com>, Laurenz Albe <laurenz.albe@cybertec.at>, 
	"pgsql-generallists.postgresql.org" <pgsql-general@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000aa50ba062fa3bfda"
Archived-At: <https://www.postgresql.org/message-id/CA%2BFnnTzjSDE9E%3DTF56F-EAp6u%3DoPH2vmGNrjN50H53dXrev1MA%40mail.gmail.com>
Precedence: bulk

--000000000000aa50ba062fa3bfda
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi,

On Wed, Mar 5, 2025, 8:44 PM me nefcanto <sn.1361@gmail.com> wrote:

> I once worked with a monolithic SQL Server database with more than 10
> billion records and about 8 Terabytes of data. A single backup took us more
> than 21 days. It was a nightmare. Almost everybody knows that scaling up
> has a ceiling, but scaling out has no boundaries.
>

But then you did the backups incrementally, correct?

That should not take the same amount of time...
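
To put rough numbers on that, here is a back-of-the-envelope sketch. The
8 TB and 21 days are taken from your message; the 1% daily change rate and
the function names are only my assumptions for illustration, so plug in
your own figures:

```python
# Back-of-the-envelope backup timing.
# 8 TB and 21 days come from the quoted message; the 1% daily change
# rate is an assumption made up for this sketch.
TB = 10**12  # bytes in a decimal terabyte


def throughput_bytes_per_sec(size_bytes: float, days: float) -> float:
    """Average throughput implied by backing up size_bytes in `days` days."""
    return size_bytes / (days * 86_400)


def incremental_hours(size_bytes: float, daily_change_fraction: float,
                      rate: float) -> float:
    """Hours needed to back up only the changed fraction at the same rate."""
    changed_bytes = size_bytes * daily_change_fraction
    return changed_bytes / rate / 3600


full_size = 8 * TB
rate = throughput_bytes_per_sec(full_size, 21)
hours = incremental_hours(full_size, 0.01, rate)

print(f"implied full-backup throughput: {rate / 10**6:.1f} MB/s")
print(f"incremental backup at 1% daily change: {hours:.1f} hours")
```

At the same throughput, the time scales with the amount of changed data,
so the incremental run is a small fraction of the 21 days.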


> Therefore I will never choose a monolithic database design unless it's a
> small project. But my examples are just examples. We predict 100 million
> records per year. So we have to design accordingly. And it's not just sales
> records. Many applications have requirements that are cheap data but vast
> in multitude. Consider a language-learning app that wants to store the
> known words of any learner. 10 thousand learners each knowing 2 thousand
> words means 20 million records. Convert that to 100 thousand learners each
> knowing 7 thousand words and now you almost have a billion records. Cheap,
> but necessary. Let's not dive into telemetry or time-series data.
>

Can you try and see if 1 server with 3 different databases will do?

Having 1 table per database per server is too ugly.
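
As a rough sketch of what I mean on the application side: every name below
(the module names, the host name, `dsn_for`, `hosts_in_use`) is made up for
illustration and does not come from your setup. Each module looks up its
own connection string, so all three databases can start on one server, and
a heavy one can be moved later by changing a single entry:

```python
# Hypothetical mapping of application modules to PostgreSQL databases.
# All names here are assumptions made up for this sketch.
DATABASES = {
    "products": {"host": "db1.example.internal", "dbname": "products"},
    "taxonomy": {"host": "db1.example.internal", "dbname": "taxonomy"},
    "sales": {"host": "db1.example.internal", "dbname": "sales"},
}


def dsn_for(module: str, port: int = 5432) -> str:
    """Build a keyword/value connection string for one module's database."""
    cfg = DATABASES[module]
    return f"host={cfg['host']} port={port} dbname={cfg['dbname']}"


def hosts_in_use() -> set[str]:
    """Return the distinct servers currently referenced by the mapping."""
    return {cfg["host"] for cfg in DATABASES.values()}


if __name__ == "__main__":
    print(dsn_for("taxonomy"))
    print(sorted(hosts_in_use()))
```

Note that this only covers where each database lives. A query that joins
across two of them still needs a foreign data wrapper or has to be combined
in the application, as discussed earlier in the thread.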

Also, please understand that every database is different, so each one works
and operates differently. What works well in one may not work well in another...

Thank you.


> We initially chose to break the database into smaller databases, because
> it seemed natural for our modularized monolith architecture. And it worked
> great for SQL Server. If you're small, we host them all on one server. If
> you get bigger, we can put heavy databases on separate machines.
>
> However, I don't have experience working with other types of database
> scaling. I have used table partitioning, but I have never used sharding.
>
> Anyway, that's why I asked you guys. However, encouraging me to go back to
> monolith without giving solutions on how to scale, is not helping. To be
> honest, I'm somehow disappointed by how the most advanced open source
> database does not support cross-database querying just like how SQL Server
> does. But if it doesn't, it doesn't. Our team should either drop it as a
> choice or find a way (by asking the experts who built it or use it) how to
> design based on its features. That's why I'm asking.
>
> One thing that comes to my mind, is to use custom types. Instead of
> storing data in ItemCategories and ItemAttributes, store them as arrays in
> the relevant tables in the same database. But then it seems to me that in
> this case, Mongo would become a better choice because I lose the relational
> nature and normalization somehow. What drawbacks have you experienced in
> that sense?
>
> Regards
> Saeed
>
> On Wed, Mar 5, 2025 at 7:38 PM Adrian Klaver <adrian.klaver@aklaver.com>
> wrote:
>
>> On 3/5/25 04:15, me nefcanto wrote:
>> > Dear Laurenz, the point is that I think if we put all databases into one
>> > database, then we have blocked our growth in the future.
>>
>> How?
>>
>> > A monolith database can be scaled only vertically. We have had huge
>> > headaches in the past with SQL Server on Windows and a single database.
>> > But when you divide bounded contexts into different databases, then you
>> > have the chance to deploy each database on a separate physical machine.
>> > That means a lot in terms of performance. Please correct me if I am wrong.
>>
>> And you add the complexity of talking across machines, as well as
>> maintaining separate machines.
>>
>> >
>> > Let's put this physical restriction on ourselves that we have different
>> > databases. What options do we have? One option that comes to my mind, is
>> > to store the ID of the categories in the Products table. This means that
>> > I don't need FDW anymore. And databases can be on separate machines. I
>> > first query the categories database first, get the category IDs, and
>> > then add a where clause to limit the product search. That could be an
>> > option. Array data type in Postgres is something that I think other
>> > RDBMSs do not have. Will that work? And how about attributes? Because
>> > attributes are more than a single ID. I should store the attribute key,
>> > alongside its value. It's a key-value pair. What can I do for that?
>>
>> You seem to be going out of the way to make your life more complicated.
>>
>> The only way you are going to find an answer is set up test cases and
>> experiment. My bet is a single server with a single database and
>> multiple schemas is where you end up, after all that is where you are
>> starting from.
>>
>>
>> >
>> > Thank you for sharing your time. I really appreciate it.
>> > Saeed
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe <laurenz.albe@cybertec.at
>> > <mailto:laurenz.albe@cybertec.at>> wrote:
>> >
>> >     On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
>> >      > That means a solid monolith database. We lose many goodies with that.
>> >      > As a real-world example, right now we can import a single database
>> >      > from the production to the development to test and troubleshoot data.
>> >
>> >     Well, can't you import a single schema then?
>> >
>> >      > What if we host all databases on the same server and use FDW. What
>> >      > happens in that case? Does it return 100 thousand records and join
>> >      > in the memory?
>> >
>> >     It will do just the same thing.  The performance could be better
>> >     because of the reduced latency.
>> >
>> >      > Because in SQL Server, when you perform a cross-database query
>> >      > (not cross-server) the performance is extremely good, proving that
>> >      > it does not return 100 thousand ItemId from Taxonomy.ItemCategories
>> >      > to join with ProductId.
>> >      >
>> >      > Is that the same case with Postgres too, If databases are located
>> >      > on one server?
>> >
>> >     No, you cannot perform cross-database queries without a foreign
>> >     data wrapper.  I don't see a reason why the statement shouldn't
>> >     perform as well as in SQL Server if you use schemas instead of
>> >     databases.
>> >
>> >     Yours,
>> >     Laurenz Albe
>> >
>>
>> --
>> Adrian Klaver
>> adrian.klaver@aklaver.com
>>
>>

--000000000000aa50ba062fa3bfda--