Re: Quesion about querying distributed databases

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Igor Korot <[email protected]>
To: me nefcanto <[email protected]>
Cc: Adrian Klaver <[email protected]>
Cc: Laurenz Albe <[email protected]>
Cc: pgsql-generallists.postgresql.org <[email protected]>
Subject: Re: Quesion about querying distributed databases
Date: Wed, 5 Mar 2025 21:03:10 -0600
Message-ID: <CA+FnnTzjSDE9E=TF56F-EAp6u=oPH2vmGNrjN50H53dXrev1MA@mail.gmail.com> (raw)
In-Reply-To: <CAEHBEOCpxASoNn=u21kaqOn1A-4YPy_mVfgkEjT3wRT5G4ycbg@mail.gmail.com>
References: <CAEHBEOBuoMFWuhHM3L_Zr6o1enELju-Vns6Pknt4TT+6MFQOwQ@mail.gmail.com>
	<[email protected]>
	<CAEHBEOD969YrbPH_z9OEmThWx3-w4sMMaHLhZLOQwqCwE8Y58Q@mail.gmail.com>
	<[email protected]>
	<CAEHBEOBXzkGTqxQSYqmEFN5hbc=zsGWFpU9h8zf7AAPv4VdOWQ@mail.gmail.com>
	<[email protected]>
	<CAEHBEOBNoG8RkKuCcQQWkbYppMLMzA0MXq+s0kZ6wKWgD7+45Q@mail.gmail.com>
	<[email protected]>
	<CAEHBEODw8svX557pjB_EL-Os7KWtwi-9Uq=RuCkRKgHVZWw8Bw@mail.gmail.com>
	<[email protected]>
	<CAEHBEOCpxASoNn=u21kaqOn1A-4YPy_mVfgkEjT3wRT5G4ycbg@mail.gmail.com>

Hi,

On Wed, Mar 5, 2025, 8:44 PM me nefcanto <[email protected]> wrote:

> I once worked with a monolithic SQL Server database with more than 10
> billion records and about 8 Terabytes of data. A single backup took us more
> than 21 days. It was a nightmare. Almost everybody knows that scaling up
> has a ceiling, but scaling out has no boundaries.
>

But then you did the backup incrementally correct?

That should not take the same amount of time...



> Therefore I will never choose a monolithic database design unless it's a
> small project. But my examples are just examples. We predict 100 million
> records per year. So we have to design accordingly. And it's not just sales
> records. Many applications have requirements that are cheap data but vast
> in multitude. Consider a language-learning app that wants to store the
> known words of any learner. 10 thousand learners each knowing 2 thousand
> words means 20 million records. Convert that to 100 thousand learners each
> knowing 7 thousand words and now you almost have a billion records. Cheap,
> but necessary. Let's not dive into telemetry or time-series data.
>

Can you try and see if 1 server with 3 different databases will do?

Having 1 table per database per server is too ugly.

Also please understand - every databae is different. And so it works and
operates differently. What work good in one may not work good in another...

Thank you.


> We initially chose to break the database into smaller databases, because
> it seemed natural for our modularized monolith architecture. And it worked
> great for SQL Server. If you're small, we host them all on one server. If
> you get bigger, we can put heavy databases on separate machines.
>
> However, I don't have experience working with other types of database
> scaling. I have used table partitioning, but I have never used sharding.
>
> Anyway, that's why I asked you guys. However, encouraging me to go back to
> monolith without giving solutions on how to scale, is not helping. To be
> honest, I'm somehow disappointed by how the most advanced open source
> database does not support cross-database querying just like how SQL Server
> does. But if it doesn't, it doesn't. Our team should either drop it as a
> choice or find a way (by asking the experts who built it or use it) how to
> design based on its features. That's why I'm asking.
>
> One thing that comes to my mind, is to use custom types. Instead of
> storing data in ItemCategories and ItemAttributes, store them as arrays in
> the relevant tables in the same database. But then it seems to me that in
> this case, Mongo would become a better choice because I lose the relational
> nature and normalization somehow. What drawbacks have you experienced in
> that sense?
>
> Regards
> Saeed
>
> On Wed, Mar 5, 2025 at 7:38 PM Adrian Klaver <[email protected]>
> wrote:
>
>> On 3/5/25 04:15, me nefcanto wrote:
>> > Dear Laurenz, the point is that I think if we put all databases into
>> one
>> > database, then we have blocked our growth in the future.
>>
>> How?
>>
>> > A monolith database can be scaled only vertically. We have had huge
>> > headaches in the past with SQL Server on Windows and a single database.
>> > But when you divide bounded contexts into different databases, then you
>> > have the chance to deploy each database on a separate physical machine.
>> > That means a lot in terms of performance. Please correct me if I am
>> wrong.
>>
>> And you add the complexity of talking across machines, as well as
>> maintaining separate machines.
>>
>> >
>> > Let's put this physical restriction on ourselves that we have different
>> > databases. What options do we have? One option that comes to my mind,
>> is
>> > to store the ID of the categories in the Products table. This means
>> that
>> > I don't need FDW anymore. And databases can be on separate machines. I
>> > first query the categories database first, get the category IDs, and
>> > then add a where clause to limit the product search. That could be an
>> > option. Array data type in Postgres is something that I think other
>> > RDBMSs do not have. Will that work? And how about attributes? Because
>> > attributes are more than a single ID. I should store the attribute key,
>> > alongside its value. It's a key-value pair. What can I do for that?
>>
>> You seem to be going out of the way to make your life more complicated.
>>
>> The only way you are going to find an answer is set up test cases and
>> experiment. My bet is a single server with a single database and
>> multiple schemas is where you end up, after all that is where you are
>> starting from.
>>
>>
>> >
>> > Thank you for sharing your time. I really appreciate it.
>> > Saeed
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe <[email protected]
>> > <mailto:[email protected]>> wrote:
>> >
>> >     On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
>> >      > That means a solid monolith database. We lose many goodies with
>> that.
>> >      > As a real-world example, right now we can import a single
>> database
>> >      > from the production to the development to test and troubleshoot
>> data.
>> >
>> >     Well, can't you import a single schema then?
>> >
>> >      > What if we host all databases on the same server and use FDW.
>> What
>> >      > happens in that case? Does it return 100 thousand records and
>> join
>> >      > in the memory?
>> >
>> >     It will do just the same thing.  The performance could be better
>> >     because of the reduced latency.
>> >
>> >      > Because in SQL Server, when you perform a cross-database query
>> >      > (not cross-server) the performance is extremely good, proving
>> that
>> >      > it does not return 100 thousand ItemId from
>> Taxonomy.ItemCategories
>> >      > to join with ProductId.
>> >      >
>> >      > Is that the same case with Postgres too, If databases are located
>> >      > on one server?
>> >
>> >     No, you cannot perform cross-database queries without a foreign
>> >     data wrapper.  I don't see a reason why the statement shouldn't
>> >     perform as well as in SQL Server if you use schemas instead of
>> >     databases.
>> >
>> >     Yours,
>> >     Laurenz Albe
>> >
>>
>> --
>> Adrian Klaver
>> [email protected]
>>
>>

view thread (19+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Quesion about querying distributed databases
  In-Reply-To: <CA+FnnTzjSDE9E=TF56F-EAp6u=oPH2vmGNrjN50H53dXrev1MA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox