Re: Quesion about querying distributed databases

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Quesion about querying distributed databases
19+ messages / 8 participants
[nested] [flat]

* Re: Quesion about querying distributed databases
@ 2025-03-05 09:55 Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-06 08:17 ` Re: Quesion about querying distributed databases Achilleas Mantzios - cloud <[email protected]>
  0 siblings, 2 replies; 19+ messages in thread

From: Laurenz Albe @ 2025-03-05 09:55 UTC (permalink / raw)
  To: me nefcanto <[email protected]>; +Cc: Adrian Klaver <[email protected]>; [email protected]

On Wed, 2025-03-05 at 12:57 +0330, me nefcanto wrote:
> Right now this data is in MariaDB, on separate databases (schema) but on one
> server. The solution in this situation is to have a cross-database query.
> (this is the status quo of our application)
> 
> Now our team has decided to migrate to Postgres. However, we realized that
> Postgres does not support cross-database queries. And if we want to do so,
> we should use FDW. So, we thought we might as well put databases on separate
> servers for scalability if we have to write more code. That's the reason
> behind this question.

In MySQL, the terms "database" and "schema" are used for the same thing.
Not so in PostgreSQL.  I think you should migrate the data into different
schemas in a single database, pretty much like you had it in MySQL.
Then you don't need a foreign data wrapper, and I bet the query can
perform as well as it did on MySQL.

Yours,
Laurenz Albe







^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
@ 2025-03-05 10:48 ` me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  1 sibling, 1 reply; 19+ messages in thread

From: me nefcanto @ 2025-03-05 10:48 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: Adrian Klaver <[email protected]>; [email protected]

Dear Laurenz,

That means a solid monolith database. We lose many goodies with that. As a
real-world example, right now we can import a single database from the
production to the development to test and troubleshoot data.

What if we host all databases on the same server and use FDW. What happens
in that case? Does it return 100 thousand records and join in the memory?

Because in SQL Server, when you perform a cross-database query (not
cross-server) the performance is extremely good, proving that it does not
return 100 thousand ItemId from Taxonomy.ItemCategories to join with
ProductId.

Is that the same case with Postgres too, If databases are located on one
server?

Regards
Saeed

^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
@ 2025-03-05 11:48   ` Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  0 siblings, 1 reply; 19+ messages in thread

From: Laurenz Albe @ 2025-03-05 11:48 UTC (permalink / raw)
  To: me nefcanto <[email protected]>; +Cc: Adrian Klaver <[email protected]>; [email protected]

On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
> That means a solid monolith database. We lose many goodies with that.
> As a real-world example, right now we can import a single database
> from the production to the development to test and troubleshoot data.

Well, can't you import a single schema then?

> What if we host all databases on the same server and use FDW. What
> happens in that case? Does it return 100 thousand records and join
> in the memory?

It will do just the same thing.  The performance could be better
because of the reduced latency.

> Because in SQL Server, when you perform a cross-database query
> (not cross-server) the performance is extremely good, proving that
> it does not return 100 thousand ItemId from Taxonomy.ItemCategories
> to join with ProductId.
> 
> Is that the same case with Postgres too, If databases are located
> on one server?

No, you cannot perform cross-database queries without a foreign
data wrapper.  I don't see a reason why the statement shouldn't
perform as well as in SQL Server if you use schemas instead of
databases.

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
@ 2025-03-05 12:15     ` me nefcanto <[email protected]>
  2025-03-05 14:34       ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 15:21       ` Re: Quesion about querying distributed databases Greg Sabino Mullane <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  0 siblings, 3 replies; 19+ messages in thread

From: me nefcanto @ 2025-03-05 12:15 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: Adrian Klaver <[email protected]>; [email protected]

Dear Laurenz, the point is that I think if we put all databases into one
database, then we have blocked our growth in the future.
A monolith database can be scaled only vertically. We have had huge
headaches in the past with SQL Server on Windows and a single database.
But when you divide bounded contexts into different databases, then you
have the chance to deploy each database on a separate physical machine.
That means a lot in terms of performance. Please correct me if I am wrong.

Let's put this physical restriction on ourselves that we have different
databases. What options do we have? One option that comes to my mind, is to
store the ID of the categories in the Products table. This means that I
don't need FDW anymore. And databases can be on separate machines. I first
query the categories database first, get the category IDs, and then add a
where clause to limit the product search. That could be an option. Array
data type in Postgres is something that I think other RDBMSs do not have.
Will that work? And how about attributes? Because attributes are more than
a single ID. I should store the attribute key, alongside its value. It's a
key-value pair. What can I do for that?

Thank you for sharing your time. I really appreciate it.
Saeed

On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe <[email protected]>
wrote:

> On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
> > That means a solid monolith database. We lose many goodies with that.
> > As a real-world example, right now we can import a single database
> > from the production to the development to test and troubleshoot data.
>
> Well, can't you import a single schema then?
>
> > What if we host all databases on the same server and use FDW. What
> > happens in that case? Does it return 100 thousand records and join
> > in the memory?
>
> It will do just the same thing.  The performance could be better
> because of the reduced latency.
>
> > Because in SQL Server, when you perform a cross-database query
> > (not cross-server) the performance is extremely good, proving that
> > it does not return 100 thousand ItemId from Taxonomy.ItemCategories
> > to join with ProductId.
> >
> > Is that the same case with Postgres too, If databases are located
> > on one server?
>
> No, you cannot perform cross-database queries without a foreign
> data wrapper.  I don't see a reason why the statement shouldn't
> perform as well as in SQL Server if you use schemas instead of
> databases.
>
> Yours,
> Laurenz Albe
>

^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
@ 2025-03-05 14:34       ` Laurenz Albe <[email protected]>
  2 siblings, 0 replies; 19+ messages in thread

From: Laurenz Albe @ 2025-03-05 14:34 UTC (permalink / raw)
  To: me nefcanto <[email protected]>; +Cc: Adrian Klaver <[email protected]>; [email protected]

On Wed, 2025-03-05 at 15:45 +0330, me nefcanto wrote:
> Dear Laurenz, the point is that I think if we put all databases into one database,
> then we have blocked our growth in the future.

Hard to say.
If you want to shard for horizontal scaling, that usually only works well
if there are few interconnections between the different shards.  If you end
up joining data from different shards, you usually lose.

> A monolith database can be scaled only vertically. We have had huge headaches
> in the past with SQL Server on Windows and a single database.

Without knowing what the headaches were, it is hard to answer something here.

> But when you divide bounded contexts into different databases, then you have
> the chance to deploy each database on a separate physical machine. That means
> a lot in terms of performance. Please correct me if I am wrong.

I don't know if you are wrong.  But it seems like it is not working well,
is it?  Perhaps you can explain how splitting up the data might result in
better performance.  Is that just a guess or do you have a reason to think so?

> Let's put this physical restriction on ourselves that we have different
> databases. What options do we have?

I don't know.  Based on what you showed: perhaps a complete re-design?

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
@ 2025-03-05 15:21       ` Greg Sabino Mullane <[email protected]>
  2 siblings, 0 replies; 19+ messages in thread

From: Greg Sabino Mullane @ 2025-03-05 15:21 UTC (permalink / raw)
  To: me nefcanto <[email protected]>; +Cc: Laurenz Albe <[email protected]>; Adrian Klaver <[email protected]>; [email protected]

On Wed, Mar 5, 2025 at 7:15 AM me nefcanto <[email protected]> wrote:

> I think if we put all databases into one database, then we have blocked
> our growth in the future.
>

I think this is premature optimization. Your products table has 100,000
rows. That's very tiny for the year 2025. Try putting everything on one box
with good indexes and you might be surprised at the performance.

> A monolith database can be scaled only vertically.
>

Postgres scales well vertically. Plus, you can have streaming replicas to
distribute the read queries (like the one given here) across many machines.

> We have had huge headaches in the past with SQL Server on Windows and a
> single database.
> But when you divide bounded contexts into different databases, then you
> have the chance to deploy each database on a separate physical machine.
> That means a lot in terms of performance.
>

I get your concern, but if the data is inter-related, it really is best to
have them on the same server (and same database, and same schema). Then
Postgres can devise a really efficient plan. You can also use Citus to
start sharding things across multiple physical servers if your database
gets very large.

Let's put this physical restriction on ourselves that we have different
> databases. What options do we have?
>

Your main option is FDW, which will never perform as well as a single
server. Plus, you have the additional headache of trying to coordinate data
updates atomically across different servers. The other option is to have
the application do the work, e.g. pull a list of things from one server,
use that to build a query against another one, etc. Definitely not ideal.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
@ 2025-03-05 16:08       ` Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2 siblings, 1 reply; 19+ messages in thread

From: Adrian Klaver @ 2025-03-05 16:08 UTC (permalink / raw)
  To: me nefcanto <[email protected]>; Laurenz Albe <[email protected]>; +Cc: [email protected]

On 3/5/25 04:15, me nefcanto wrote:
> Dear Laurenz, the point is that I think if we put all databases into one 
> database, then we have blocked our growth in the future.

How?

> A monolith database can be scaled only vertically. We have had huge 
> headaches in the past with SQL Server on Windows and a single database.
> But when you divide bounded contexts into different databases, then you 
> have the chance to deploy each database on a separate physical machine. 
> That means a lot in terms of performance. Please correct me if I am wrong.

And you add the complexity of talking across machines, as well as 
maintaining separate machines.

> 
> Let's put this physical restriction on ourselves that we have different 
> databases. What options do we have? One option that comes to my mind, is 
> to store the ID of the categories in the Products table. This means that 
> I don't need FDW anymore. And databases can be on separate machines. I 
> first query the categories database first, get the category IDs, and 
> then add a where clause to limit the product search. That could be an 
> option. Array data type in Postgres is something that I think other 
> RDBMSs do not have. Will that work? And how about attributes? Because 
> attributes are more than a single ID. I should store the attribute key, 
> alongside its value. It's a key-value pair. What can I do for that?

You seem to be going out of the way to make your life more complicated.

The only way you are going to find an answer is set up test cases and 
experiment. My bet is a single server with a single database and 
multiple schemas is where you end up, after all that is where you are 
starting from.


> 
> Thank you for sharing your time. I really appreciate it.
> Saeed
> 
> 
> 
> 
> 
> On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe <[email protected] 
> <mailto:[email protected]>> wrote:
> 
>     On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
>      > That means a solid monolith database. We lose many goodies with that.
>      > As a real-world example, right now we can import a single database
>      > from the production to the development to test and troubleshoot data.
> 
>     Well, can't you import a single schema then?
> 
>      > What if we host all databases on the same server and use FDW. What
>      > happens in that case? Does it return 100 thousand records and join
>      > in the memory?
> 
>     It will do just the same thing.  The performance could be better
>     because of the reduced latency.
> 
>      > Because in SQL Server, when you perform a cross-database query
>      > (not cross-server) the performance is extremely good, proving that
>      > it does not return 100 thousand ItemId from Taxonomy.ItemCategories
>      > to join with ProductId.
>      >
>      > Is that the same case with Postgres too, If databases are located
>      > on one server?
> 
>     No, you cannot perform cross-database queries without a foreign
>     data wrapper.  I don't see a reason why the statement shouldn't
>     perform as well as in SQL Server if you use schemas instead of
>     databases.
> 
>     Yours,
>     Laurenz Albe
> 

-- 
Adrian Klaver
[email protected]







^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
@ 2025-03-06 02:43         ` me nefcanto <[email protected]>
  2025-03-06 03:03           ` Re: Quesion about querying distributed databases Igor Korot <[email protected]>
  2025-03-06 06:22           ` Re: Quesion about querying distributed databases Ron Johnson <[email protected]>
  2025-03-06 08:04           ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-06 13:31           ` Re: Quesion about querying distributed databases Greg Sabino Mullane <[email protected]>
  0 siblings, 4 replies; 19+ messages in thread

From: me nefcanto @ 2025-03-06 02:43 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: Laurenz Albe <[email protected]>; [email protected]

I once worked with a monolithic SQL Server database with more than 10
billion records and about 8 Terabytes of data. A single backup took us more
than 21 days. It was a nightmare. Almost everybody knows that scaling up
has a ceiling, but scaling out has no boundaries.

Therefore I will never choose a monolithic database design unless it's a
small project. But my examples are just examples. We predict 100 million
records per year. So we have to design accordingly. And it's not just sales
records. Many applications have requirements that are cheap data but vast
in multitude. Consider a language-learning app that wants to store the
known words of any learner. 10 thousand learners each knowing 2 thousand
words means 20 million records. Convert that to 100 thousand learners each
knowing 7 thousand words and now you almost have a billion records. Cheap,
but necessary. Let's not dive into telemetry or time-series data.

We initially chose to break the database into smaller databases, because it
seemed natural for our modularized monolith architecture. And it worked
great for SQL Server. If you're small, we host them all on one server. If
you get bigger, we can put heavy databases on separate machines.

However, I don't have experience working with other types of database
scaling. I have used table partitioning, but I have never used sharding.

Anyway, that's why I asked you guys. However, encouraging me to go back to
monolith without giving solutions on how to scale, is not helping. To be
honest, I'm somehow disappointed by how the most advanced open source
database does not support cross-database querying just like how SQL Server
does. But if it doesn't, it doesn't. Our team should either drop it as a
choice or find a way (by asking the experts who built it or use it) how to
design based on its features. That's why I'm asking.

One thing that comes to my mind, is to use custom types. Instead of storing
data in ItemCategories and ItemAttributes, store them as arrays in the
relevant tables in the same database. But then it seems to me that in this
case, Mongo would become a better choice because I lose the relational
nature and normalization somehow. What drawbacks have you experienced in
that sense?

Regards
Saeed

On Wed, Mar 5, 2025 at 7:38 PM Adrian Klaver <[email protected]>
wrote:

> On 3/5/25 04:15, me nefcanto wrote:
> > Dear Laurenz, the point is that I think if we put all databases into one
> > database, then we have blocked our growth in the future.
>
> How?
>
> > A monolith database can be scaled only vertically. We have had huge
> > headaches in the past with SQL Server on Windows and a single database.
> > But when you divide bounded contexts into different databases, then you
> > have the chance to deploy each database on a separate physical machine.
> > That means a lot in terms of performance. Please correct me if I am
> wrong.
>
> And you add the complexity of talking across machines, as well as
> maintaining separate machines.
>
> >
> > Let's put this physical restriction on ourselves that we have different
> > databases. What options do we have? One option that comes to my mind, is
> > to store the ID of the categories in the Products table. This means that
> > I don't need FDW anymore. And databases can be on separate machines. I
> > first query the categories database first, get the category IDs, and
> > then add a where clause to limit the product search. That could be an
> > option. Array data type in Postgres is something that I think other
> > RDBMSs do not have. Will that work? And how about attributes? Because
> > attributes are more than a single ID. I should store the attribute key,
> > alongside its value. It's a key-value pair. What can I do for that?
>
> You seem to be going out of the way to make your life more complicated.
>
> The only way you are going to find an answer is set up test cases and
> experiment. My bet is a single server with a single database and
> multiple schemas is where you end up, after all that is where you are
> starting from.
>
>
> >
> > Thank you for sharing your time. I really appreciate it.
> > Saeed
> >
> >
> >
> >
> >
> > On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
> >      > That means a solid monolith database. We lose many goodies with
> that.
> >      > As a real-world example, right now we can import a single database
> >      > from the production to the development to test and troubleshoot
> data.
> >
> >     Well, can't you import a single schema then?
> >
> >      > What if we host all databases on the same server and use FDW. What
> >      > happens in that case? Does it return 100 thousand records and join
> >      > in the memory?
> >
> >     It will do just the same thing.  The performance could be better
> >     because of the reduced latency.
> >
> >      > Because in SQL Server, when you perform a cross-database query
> >      > (not cross-server) the performance is extremely good, proving that
> >      > it does not return 100 thousand ItemId from
> Taxonomy.ItemCategories
> >      > to join with ProductId.
> >      >
> >      > Is that the same case with Postgres too, If databases are located
> >      > on one server?
> >
> >     No, you cannot perform cross-database queries without a foreign
> >     data wrapper.  I don't see a reason why the statement shouldn't
> >     perform as well as in SQL Server if you use schemas instead of
> >     databases.
> >
> >     Yours,
> >     Laurenz Albe
> >
>
> --
> Adrian Klaver
> [email protected]
>
>

^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
@ 2025-03-06 03:03           ` Igor Korot <[email protected]>
  3 siblings, 0 replies; 19+ messages in thread

From: Igor Korot @ 2025-03-06 03:03 UTC (permalink / raw)
  To: me nefcanto <[email protected]>; +Cc: Adrian Klaver <[email protected]>; Laurenz Albe <[email protected]>; pgsql-generallists.postgresql.org <[email protected]>

Hi,

On Wed, Mar 5, 2025, 8:44 PM me nefcanto <[email protected]> wrote:

> I once worked with a monolithic SQL Server database with more than 10
> billion records and about 8 Terabytes of data. A single backup took us more
> than 21 days. It was a nightmare. Almost everybody knows that scaling up
> has a ceiling, but scaling out has no boundaries.
>

But then you did the backup incrementally correct?

That should not take the same amount of time...



> Therefore I will never choose a monolithic database design unless it's a
> small project. But my examples are just examples. We predict 100 million
> records per year. So we have to design accordingly. And it's not just sales
> records. Many applications have requirements that are cheap data but vast
> in multitude. Consider a language-learning app that wants to store the
> known words of any learner. 10 thousand learners each knowing 2 thousand
> words means 20 million records. Convert that to 100 thousand learners each
> knowing 7 thousand words and now you almost have a billion records. Cheap,
> but necessary. Let's not dive into telemetry or time-series data.
>

Can you try and see if 1 server with 3 different databases will do?

Having 1 table per database per server is too ugly.

Also please understand - every databae is different. And so it works and
operates differently. What work good in one may not work good in another...

Thank you.


> We initially chose to break the database into smaller databases, because
> it seemed natural for our modularized monolith architecture. And it worked
> great for SQL Server. If you're small, we host them all on one server. If
> you get bigger, we can put heavy databases on separate machines.
>
> However, I don't have experience working with other types of database
> scaling. I have used table partitioning, but I have never used sharding.
>
> Anyway, that's why I asked you guys. However, encouraging me to go back to
> monolith without giving solutions on how to scale, is not helping. To be
> honest, I'm somehow disappointed by how the most advanced open source
> database does not support cross-database querying just like how SQL Server
> does. But if it doesn't, it doesn't. Our team should either drop it as a
> choice or find a way (by asking the experts who built it or use it) how to
> design based on its features. That's why I'm asking.
>
> One thing that comes to my mind, is to use custom types. Instead of
> storing data in ItemCategories and ItemAttributes, store them as arrays in
> the relevant tables in the same database. But then it seems to me that in
> this case, Mongo would become a better choice because I lose the relational
> nature and normalization somehow. What drawbacks have you experienced in
> that sense?
>
> Regards
> Saeed
>
> On Wed, Mar 5, 2025 at 7:38 PM Adrian Klaver <[email protected]>
> wrote:
>
>> On 3/5/25 04:15, me nefcanto wrote:
>> > Dear Laurenz, the point is that I think if we put all databases into
>> one
>> > database, then we have blocked our growth in the future.
>>
>> How?
>>
>> > A monolith database can be scaled only vertically. We have had huge
>> > headaches in the past with SQL Server on Windows and a single database.
>> > But when you divide bounded contexts into different databases, then you
>> > have the chance to deploy each database on a separate physical machine.
>> > That means a lot in terms of performance. Please correct me if I am
>> wrong.
>>
>> And you add the complexity of talking across machines, as well as
>> maintaining separate machines.
>>
>> >
>> > Let's put this physical restriction on ourselves that we have different
>> > databases. What options do we have? One option that comes to my mind,
>> is
>> > to store the ID of the categories in the Products table. This means
>> that
>> > I don't need FDW anymore. And databases can be on separate machines. I
>> > first query the categories database first, get the category IDs, and
>> > then add a where clause to limit the product search. That could be an
>> > option. Array data type in Postgres is something that I think other
>> > RDBMSs do not have. Will that work? And how about attributes? Because
>> > attributes are more than a single ID. I should store the attribute key,
>> > alongside its value. It's a key-value pair. What can I do for that?
>>
>> You seem to be going out of the way to make your life more complicated.
>>
>> The only way you are going to find an answer is set up test cases and
>> experiment. My bet is a single server with a single database and
>> multiple schemas is where you end up, after all that is where you are
>> starting from.
>>
>>
>> >
>> > Thank you for sharing your time. I really appreciate it.
>> > Saeed
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe <[email protected]
>> > <mailto:[email protected]>> wrote:
>> >
>> >     On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
>> >      > That means a solid monolith database. We lose many goodies with
>> that.
>> >      > As a real-world example, right now we can import a single
>> database
>> >      > from the production to the development to test and troubleshoot
>> data.
>> >
>> >     Well, can't you import a single schema then?
>> >
>> >      > What if we host all databases on the same server and use FDW.
>> What
>> >      > happens in that case? Does it return 100 thousand records and
>> join
>> >      > in the memory?
>> >
>> >     It will do just the same thing.  The performance could be better
>> >     because of the reduced latency.
>> >
>> >      > Because in SQL Server, when you perform a cross-database query
>> >      > (not cross-server) the performance is extremely good, proving
>> that
>> >      > it does not return 100 thousand ItemId from
>> Taxonomy.ItemCategories
>> >      > to join with ProductId.
>> >      >
>> >      > Is that the same case with Postgres too, If databases are located
>> >      > on one server?
>> >
>> >     No, you cannot perform cross-database queries without a foreign
>> >     data wrapper.  I don't see a reason why the statement shouldn't
>> >     perform as well as in SQL Server if you use schemas instead of
>> >     databases.
>> >
>> >     Yours,
>> >     Laurenz Albe
>> >
>>
>> --
>> Adrian Klaver
>> [email protected]
>>
>>


^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
@ 2025-03-06 06:22           ` Ron Johnson <[email protected]>
  3 siblings, 0 replies; 19+ messages in thread

From: Ron Johnson @ 2025-03-06 06:22 UTC (permalink / raw)
  To: pgsql-general

On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <[email protected]> wrote:

> I once worked with a monolithic SQL Server database with more than 10
> billion records and about 8 Terabytes of data. A single backup took us more
> than 21 days. It was a nightmare.
>

25 years ago (meaning *much* slower hardware), I managed a 1TB database.
Backups took about 4 hours.  Could have gotten it down to two hours if I'd
wanted to use more tape drives.

Right now, I manage a 5TB database.  Backups take 110 minutes, and that's
when using one channel for all IO, writing to not the fastest NAS, and
other 3+TB databases backing up to it at the same time.

> Almost everybody knows that scaling up has a ceiling
>

And that ceiling is much, much higher than you think it is.

> , but scaling out has no boundaries.
>

Except for complexity and fragility. I bet I could get good scaled up
performance out of the amount of hardware you're using to scale out.

-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
@ 2025-03-06 08:04           ` Laurenz Albe <[email protected]>
  2025-03-06 08:45             ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  3 siblings, 1 reply; 19+ messages in thread

From: Laurenz Albe @ 2025-03-06 08:04 UTC (permalink / raw)
  To: me nefcanto <[email protected]>; Adrian Klaver <[email protected]>; +Cc: [email protected]

On Thu, 2025-03-06 at 06:13 +0330, me nefcanto wrote:
> I once worked with a monolithic SQL Server database with more than 10 billion
> records and about 8 Terabytes of data. A single backup took us more than 21 days.
> It was a nightmare. Almost everybody knows that scaling up has a ceiling, but
> scaling out has no boundaries.

I hear you, and I agree with that.

> We initially chose to break the database into smaller databases, because it
> seemed natural for our modularized monolith architecture. And it worked great
> for SQL Server. If you're small, we host them all on one server. If you get
> bigger, we can put heavy databases on separate machines.

So you mean that you had those databases on different servers?
How would a cross-database query work in that case?  It must be something
akin to foreign data in PostgreSQL.

If that worked well, then it should also work well with PostgreSQL and
foreign data wrappers.  Look at the execution plan you got on SQL Server
and see where PostgreSQL chooses a different plan.  Then try to improve that.
We can try to help if we see actual plans.

> However, I don't have experience working with other types of database
> scaling. I have used table partitioning, but I have never used sharding.

Well, if you split the data into several databases, that *was* sharding.

> Anyway, that's why I asked you guys. However, encouraging me to go back to
> monolith without giving solutions on how to scale, is not helping. To be
> honest, I'm somehow disappointed by how the most advanced open source
> database does not support cross-database querying just like how SQL Server
> does. But if it doesn't, it doesn't. Our team should either drop it as a
> choice or find a way (by asking the experts who built it or use it) how
> to design based on its features. That's why I'm asking.

Excluding options from the start is limiting yourself.  Consider using
other, better databases than PostgreSQL (if you can find them).

It is difficult to come up with a concrete design based on the information
you provided.  Perhaps you should get a consultant; the mailing list does
not seem to be the right format for that request.

Typically, you split the data in a ways that they have few interconnections,
for example per customer, so that you don't regularly end up joining data
from different databases (shards).

> One thing that comes to my mind, is to use custom types. Instead of storing
> data in ItemCategories and ItemAttributes, store them as arrays in the
> relevant tables in the same database.

Don't ever store arrays in the database.  It will be a nightmare.
You seem to be drawn to questionable data design...

Yours,
Laurenz Albe

^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-06 08:04           ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
@ 2025-03-06 08:45             ` me nefcanto <[email protected]>
  2025-03-06 12:01               ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 19+ messages in thread

From: me nefcanto @ 2025-03-06 08:45 UTC (permalink / raw)
  To: Laurenz Albe <[email protected]>; +Cc: Adrian Klaver <[email protected]>; [email protected]

Dear Laurenz

> I hear you, and I agree with that.

Thank you. Such a relief.

> If that worked well, then it should also work well with PostgreSQL and
foreign data wrappers.

You're right. We had problems with cross-server queries on SQL Server and
MariaDB too. It seems that cross-server queries are not solved by any
engine. But we had no problem with cross-database queries. That's where it
worked well both on SQL Server and MariaDB. It seems that for
cross-database queries, Postgres returns the entire result set from the
other database to this database and then performs joins locally. It seems
that for Postgres it's not different if the foreign database is on the same
machine, or it's on another machine. I just say so by seeing the queries
and asking questions about them. I have not performed a test yet.

> Well, if you split the data into several databases, that *was* sharding.

The way I understood it, sharding is when you split the database by rows,
not by tables. Examples choose a column like Tenant or User or Date as the
base of sharding. Never have I seen an example that stores Orders on one
database and Customers on another database and call it sharding. I don't
know, but we might call it distributed databases.

 > Consider using other, better databases than PostgreSQL (if you can find
them).

That's the point here. If we can't design a good thing on Postgres, then we
stick back to MariaDB. That's why we're researching and testing. As I
mentioned above, Postgres is amazing at some points but lacks some simple
things that other engines expose out of the box.

> Perhaps you should get a consultant; the mailing list does not seem to be
the right format for that request.

We have done that over the last decade. For SQL Server and then for
MariaDB. We have come up with some very practical and useful designs.
Separating CLOBs from main tables, storing UUID only as the name of files
to match the cloud storage, storing date-times as UTC, using bigint
everywhere even for small tables for consistency, denormalizing enum
storage (storing text instead of numeric value) even in large tables, etc.
etc.

But to choose a technology, we do have enough literacy and experience. It's
just some simple questions and answers. If I know that FDW works
differently for same-server databases, then I know that we will migrate.

> Don't ever store arrays in the database.  It will be a nightmare.

This is a very interesting claim. May I ask you to share its problems and
your experience?

On Thu, Mar 6, 2025 at 11:34 AM Laurenz Albe <[email protected]>
wrote:

> On Thu, 2025-03-06 at 06:13 +0330, me nefcanto wrote:
> > I once worked with a monolithic SQL Server database with more than 10
> billion
> > records and about 8 Terabytes of data. A single backup took us more than
> 21 days.
> > It was a nightmare. Almost everybody knows that scaling up has a
> ceiling, but
> > scaling out has no boundaries.
>
> I hear you, and I agree with that.
>
>
> > We initially chose to break the database into smaller databases, because
> it
> > seemed natural for our modularized monolith architecture. And it worked
> great
> > for SQL Server. If you're small, we host them all on one server. If you
> get
> > bigger, we can put heavy databases on separate machines.
>
> So you mean that you had those databases on different servers?
> How would a cross-database query work in that case?  It must be something
> akin to foreign data in PostgreSQL.
>
> If that worked well, then it should also work well with PostgreSQL and
> foreign data wrappers.  Look at the execution plan you got on SQL Server
> and see where PostgreSQL chooses a different plan.  Then try to improve
> that.
> We can try to help if we see actual plans.
>
> > However, I don't have experience working with other types of database
> > scaling. I have used table partitioning, but I have never used sharding.
>
> Well, if you split the data into several databases, that *was* sharding.
>
> > Anyway, that's why I asked you guys. However, encouraging me to go back
> to
> > monolith without giving solutions on how to scale, is not helping. To be
> > honest, I'm somehow disappointed by how the most advanced open source
> > database does not support cross-database querying just like how SQL
> Server
> > does. But if it doesn't, it doesn't. Our team should either drop it as a
> > choice or find a way (by asking the experts who built it or use it) how
> > to design based on its features. That's why I'm asking.
>
> Excluding options from the start is limiting yourself.  Consider using
> other, better databases than PostgreSQL (if you can find them).
>
> It is difficult to come up with a concrete design based on the information
> you provided.  Perhaps you should get a consultant; the mailing list does
> not seem to be the right format for that request.
>
> Typically, you split the data in a ways that they have few
> interconnections,
> for example per customer, so that you don't regularly end up joining data
> from different databases (shards).
>
> > One thing that comes to my mind, is to use custom types. Instead of
> storing
> > data in ItemCategories and ItemAttributes, store them as arrays in the
> > relevant tables in the same database.
>
> Don't ever store arrays in the database.  It will be a nightmare.
> You seem to be drawn to questionable data design...
>
> Yours,
> Laurenz Albe
>

^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-06 08:04           ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-06 08:45             ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
@ 2025-03-06 12:01               ` Laurenz Albe <[email protected]>
  0 siblings, 0 replies; 19+ messages in thread

From: Laurenz Albe @ 2025-03-06 12:01 UTC (permalink / raw)
  To: me nefcanto <[email protected]>; +Cc: Adrian Klaver <[email protected]>; [email protected]

On Thu, 2025-03-06 at 12:15 +0330, me nefcanto wrote:
> We had problems with cross-server queries on SQL Server and MariaDB too.
> It seems that cross-server queries are not solved by any engine. But we
> had no problem with cross-database queries. That's where it worked well
> both on SQL Server and MariaDB.

But then you always worked with a monolithic system.  Splitting over several
"databases" (we call them schemas) does not allow you to scale horizontally.

> > Well, if you split the data into several databases, that *was* sharding.
> 
> The way I understood it, sharding is when you split the database by rows,
> not by tables. Examples choose a column like Tenant or User or Date as the
> base of sharding. Never have I seen an example that stores Orders on one
> database and Customers on another database and call it sharding.

Right.  And I don't think that your data model is good.  It won't scale
well, because you don't get more tables as you get more data.


> But to choose a technology, we do have enough literacy and experience.
> It's just some simple questions and answers. If I know that FDW works
> differently for same-server databases, then I know that we will migrate.

It doesn't work any differently.

> > Don't ever store arrays in the database.  It will be a nightmare.
> 
> This is a very interesting claim. May I ask you to share its problems
> and your experience?

If you store foreign keys in arrays, you have no referential integrity,
and any join over the array will be complicated and slow.

If you store many values in an array and you want to update one of these
values, you have to retrieve the whole array, construct a new array and
store that.  You cannot modify individual array elements, only the whole
thing.

Of course there are ways to use arrays that are not harmful.  It's just
that almost always when I see someone use an array as table column, it
is an abuse of technology.

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
@ 2025-03-06 13:31           ` Greg Sabino Mullane <[email protected]>
  2025-03-06 15:47             ` Re: Quesion about querying distributed databases Igor Korot <[email protected]>
  3 siblings, 1 reply; 19+ messages in thread

From: Greg Sabino Mullane @ 2025-03-06 13:31 UTC (permalink / raw)
  To: me nefcanto <[email protected]>; +Cc: Adrian Klaver <[email protected]>; Laurenz Albe <[email protected]>; [email protected]

On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <[email protected]> wrote:

> Anyway, that's why I asked you guys. However, encouraging me to go back to
> monolith without giving solutions on how to scale, is not helping.
>

We did. In addition to the ongoing FDW discussion, I mentioned read-only
replicas and Citus. As far as *how* to scale vertically, we can offer
general advice (more hardware resources, ramdisks for temp stuff, OS-level
tuning, separate disk mounts). But a lot of it is tuning Postgres for your
specific situation and your specific bottlenecks. Which we are happy to
help with. Once we convince you to not throw the baby out with the
bathwater. :)

8 Terabytes of data. A single backup took us more than 21 days


Something was fundamentally wrong there.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-06 13:31           ` Re: Quesion about querying distributed databases Greg Sabino Mullane <[email protected]>
@ 2025-03-06 15:47             ` Igor Korot <[email protected]>
  2025-03-06 18:16               ` Re: Quesion about querying distributed databases Ron Johnson <[email protected]>
  0 siblings, 1 reply; 19+ messages in thread

From: Igor Korot @ 2025-03-06 15:47 UTC (permalink / raw)
  To: Greg Sabino Mullane <[email protected]>; +Cc: me nefcanto <[email protected]>; Adrian Klaver <[email protected]>; Laurenz Albe <[email protected]>; pgsql-generallists.postgresql.org <[email protected]>

Hi,

On Thu, Mar 6, 2025, 7:32 AM Greg Sabino Mullane <[email protected]> wrote:

> On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <[email protected]> wrote:
>
>> Anyway, that's why I asked you guys. However, encouraging me to go back
>> to monolith without giving solutions on how to scale, is not helping.
>>
>
> We did. In addition to the ongoing FDW discussion, I mentioned read-only
> replicas and Citus. As far as *how* to scale vertically, we can offer
> general advice (more hardware resources, ramdisks for temp stuff, OS-level
> tuning, separate disk mounts). But a lot of it is tuning Postgres for your
> specific situation and your specific bottlenecks. Which we are happy to
> help with. Once we convince you to not throw the baby out with the
> bathwater. :)
>
> 8 Terabytes of data. A single backup took us more than 21 days
>
>
> Something was fundamentally wrong there.
>

It could happen on an old and drained hardware... 😀

Thank you.


> Cheers,
> Greg
>
> --
> Crunchy Data - https://www.crunchydata.com
> Enterprise Postgres Software Products & Tech Support
>
>


^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-06 13:31           ` Re: Quesion about querying distributed databases Greg Sabino Mullane <[email protected]>
  2025-03-06 15:47             ` Re: Quesion about querying distributed databases Igor Korot <[email protected]>
@ 2025-03-06 18:16               ` Ron Johnson <[email protected]>
  2025-03-29 09:15                 ` Re: Quesion about querying distributed databases Kevin Stephenson <[email protected]>
  0 siblings, 1 reply; 19+ messages in thread

From: Ron Johnson @ 2025-03-06 18:16 UTC (permalink / raw)
  To: pgsql-generallists.postgresql.org <[email protected]>

On Thu, Mar 6, 2025 at 10:47 AM Igor Korot <[email protected]> wrote:

> Hi,
>
> On Thu, Mar 6, 2025, 7:32 AM Greg Sabino Mullane <[email protected]>
> wrote:
>
>> On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <[email protected]> wrote:
>>
>>> Anyway, that's why I asked you guys. However, encouraging me to go back
>>> to monolith without giving solutions on how to scale, is not helping.
>>>
>>
>> We did. In addition to the ongoing FDW discussion, I mentioned read-only
>> replicas and Citus. As far as *how* to scale vertically, we can offer
>> general advice (more hardware resources, ramdisks for temp stuff, OS-level
>> tuning, separate disk mounts). But a lot of it is tuning Postgres for your
>> specific situation and your specific bottlenecks. Which we are happy to
>> help with. Once we convince you to not throw the baby out with the
>> bathwater. :)
>>
>> 8 Terabytes of data. A single backup took us more than 21 days
>>
>>
>> Something was fundamentally wrong there.
>>
>
> It could happen on an old and drained hardware... 😀
>

8TB databases existed 20+ years ago.  Like always, the hardware must fit
the application.

21 days to backup a database absolutely means *many* things were improperly
sized and configured.

-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!


^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-06 13:31           ` Re: Quesion about querying distributed databases Greg Sabino Mullane <[email protected]>
  2025-03-06 15:47             ` Re: Quesion about querying distributed databases Igor Korot <[email protected]>
  2025-03-06 18:16               ` Re: Quesion about querying distributed databases Ron Johnson <[email protected]>
@ 2025-03-29 09:15                 ` Kevin Stephenson <[email protected]>
  2025-03-29 15:09                   ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  0 siblings, 1 reply; 19+ messages in thread

From: Kevin Stephenson @ 2025-03-29 09:15 UTC (permalink / raw)
  To: Ron Johnson <[email protected]>; +Cc: pgsql-generallists.postgresql.org <[email protected]>; [email protected] <[email protected]>

Bumping this old thread to clarify a few points.

As an initial note on terminology, a "server" can be called a server, an instance, a server instance, or in PostgreSQL's (PG hereafter) case, a "DB Cluster." They all are used interchangeably in the wild to mean the same thing, with perhaps "server" being the most ambiguous. And that thing is a running process that is listening to (typically) a single port and handing connections off to individual threads or processes (or some other construct). The comments below use the word "instance" for consistency and brevity.

MariaDB
As mentioned by others, MariaDB (and MySQL proper, both hereafter referred to as MySQL) instances only support a single database. The CREATE DATABASE statement is a de facto synonym for CREATE SCHEMA (it is disappointing that MySQL and MariaDB docs still do not mention this significant fact). To disabuse those who believe MySQL support multiple databases, query any of the INFORMATION_SCHEMA tables and for all those that have the *_CATALOG column (CATALOG is a synonym for DATABASE), notice every row in every table that shows a catalog column, they all say "def", as in default. Further, the identifier used in CREATE DATABASE will show up in INFORMATION_SCHEMA tables in *_SCHEMA columns.

MySQL supports one and two-part object naming for DML and it supports foreign data access using the Federated (deprecated in MariaDB) and FederatedX (MariaDB) storage engines. One-part names use the currently selected schema (via USE statement) along with <object_name>. Two-part names are <schema_name>.<object_name>. Foreign data (e.g. tables) are also accessed using two-part names and there are limitations on what queries can be used with them (I believe they can be organized in a way the one-part names would work as well). I'm unaware of how advanced the pushdown optimizations have become, but the concept of pushdown is critical to understand both for the MySQL storage engines as well as for the Federated* "foreign data wrappers" (FDW).

To summarize, all MySQL instances have a single database called "def" and there is no way to create another database on a single instance at this time (MySQL 9.2, MariaDB 11.8). To have more than one database currently requires running additional instances of MySQL on a different port. The idea that there are "high performance cross-database queries" in MySQL is simply incorrect at this time. There are high-performance cross-schema queries in MySQL, as there are in Microsoft SQL Server (MSSQL hereafter) and PG.

Note: MariaDB is planning on supporting real databases in 12.0 and they will be called CATALOGs, as some other RDMSs do. See:
https://mariadb.com/kb/en/create-catalog/

It's a big TBD on how well this will perform with InnoDB and other storage engines for cross-catalog queries (if that will even possible).

MSSQL (On-prem, VM, Azure SQL Managed Instance)
MSSQL supports one, two, three and four-part object naming.

One part names are the <object_name> and use the user's currently selected database (via USE <database_name> command or as set in the connection string), along with the user's default schema as a prefixes, and that schema prefix is typically dbo (similar to public in PG) to resolve object names (which behind the scenes are always three-part names: <database_name>.<schema_name>.<object_name>). The default schema can be changed, similar to the PG search path being changed..

Two-part names are <schema_name>.<object_name> and again use the user's currently selected database to get the fully qualified three-part name.

Where MSSQL differs from both MySQL and PG is in the support for three-part names. These are <database_name>.<schema_name>.<object_name>, and you can in fact run real "cross-database" MSSQL queries without FDW trickery. Historically, I believe there were some light limitations on their usage, such as in Foreign Keys and Triggers, but that may no longer be the case.

Linked Servers are the MSSQL near equivalent of FDW and are used in four-part names: <linked server name>.<database_name>.<schema_name>.<object_name>.

Two other MSSQL concepts are CONTAINED DATABASES and Azure SQL Database (with "Elastic Queries" as FDW, shockingly in "preview" for about a decade now). Those are beyond the scope of this email.

PG
On the PG side (which I know the least about), it does have real databases with schemas, and does support FDW. But since it uses a shared WAL (at least MSSQL separates logs per database, unclear on InnoDB, et. al.), and it uses a shared security system, I don't understand why it doesn't support cross-database queries. But that question would be best asked in a separate thread (any maybe it's already in the archives anyways). Putting all relevant objects in a single database with different schemas has already been mentioned as a workaround.

As far as some of the other back and forth on database (and system) architecture, that's really an enormous (and specialty) topic. Having previously worked on one of the largest MSSQL installations in the world, I have not seen anything in PG that would prevent a proper scaling strategy as long as "DB as API" was not being attempted, but rather microservices or some other pattern was used, and frequent massive JOINs cross-instance or cross-server (actual servers) weren't being attempted. All three platforms discussed here will have performance problems with that.

Perhaps the original poster would benefit from reviewing one of the gold standards on this topic, "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems"
https://www.amazon.com/dp/1449373321

Thanks,
Kevin

________________________________
From: Ron Johnson <[email protected]>
Sent: Thursday, March 6, 2025 10:16 AM
To: pgsql-generallists.postgresql.org <[email protected]>
Subject: Re: Quesion about querying distributed databases

On Thu, Mar 6, 2025 at 10:47 AM Igor Korot <[email protected]<mailto:[email protected]>> wrote:
Hi,

On Thu, Mar 6, 2025, 7:32 AM Greg Sabino Mullane <[email protected]<mailto:[email protected]>> wrote:
On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <[email protected]<mailto:[email protected]>> wrote:
Anyway, that's why I asked you guys. However, encouraging me to go back to monolith without giving solutions on how to scale, is not helping.

We did. In addition to the ongoing FDW discussion, I mentioned read-only replicas and Citus. As far as *how* to scale vertically, we can offer general advice (more hardware resources, ramdisks for temp stuff, OS-level tuning, separate disk mounts). But a lot of it is tuning Postgres for your specific situation and your specific bottlenecks. Which we are happy to help with. Once we convince you to not throw the baby out with the bathwater. :)

8 Terabytes of data. A single backup took us more than 21 days

Something was fundamentally wrong there.

It could happen on an old and drained hardware... 😀

8TB databases existed 20+ years ago.  Like always, the hardware must fit the application.

21 days to backup a database absolutely means many things were improperly sized and configured.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!


^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 10:48 ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 11:48   ` Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
  2025-03-05 12:15     ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-05 16:08       ` Re: Quesion about querying distributed databases Adrian Klaver <[email protected]>
  2025-03-06 02:43         ` Re: Quesion about querying distributed databases me nefcanto <[email protected]>
  2025-03-06 13:31           ` Re: Quesion about querying distributed databases Greg Sabino Mullane <[email protected]>
  2025-03-06 15:47             ` Re: Quesion about querying distributed databases Igor Korot <[email protected]>
  2025-03-06 18:16               ` Re: Quesion about querying distributed databases Ron Johnson <[email protected]>
  2025-03-29 09:15                 ` Re: Quesion about querying distributed databases Kevin Stephenson <[email protected]>
@ 2025-03-29 15:09                   ` Adrian Klaver <[email protected]>
  0 siblings, 0 replies; 19+ messages in thread

From: Adrian Klaver @ 2025-03-29 15:09 UTC (permalink / raw)
  To: Kevin Stephenson <[email protected]>; Ron Johnson <[email protected]>; +Cc: pgsql-generallists.postgresql.org <[email protected]>; [email protected] <[email protected]>

On 3/29/25 02:15, Kevin Stephenson wrote:
> Bumping this old thread to clarify a few points.
> 
> As an initial note on terminology, a "server" can be called a server, an 
> instance, a server instance, or in PostgreSQL's (PG hereafter) case, a 
> "DB Cluster." They all are used interchangeably in the wild to mean the 
> same thing, with perhaps "server" being the most ambiguous. And that 
> thing is a running process that is listening to (typically) a single 
> port and handing connections off to individual threads or processes (or 
> some other construct). The comments below use the word "instance" for 
> consistency and brevity.
> 
> MariaDB
> As mentioned by others, MariaDB (and MySQL proper, both hereafter 
> referred to as MySQL) instances only support a single database. The 
> CREATE DATABASE statement is a de facto synonym for CREATE SCHEMA (it is 
> disappointing that MySQL and MariaDB docs still do not mention this 
> significant fact). To disabuse those who believe MySQL support multiple 

Actually they do:

https://dev.mysql.com/doc/refman/9.2/en/create-database.html

"CREATE DATABASE creates a database with the given name. To use this 
statement, you need the CREATE privilege for the database. CREATE SCHEMA 
is a synonym for CREATE DATABASE."

https://mariadb.com/kb/en/create-database/

"CREATE DATABASE creates a database with the given name. To use this 
statement, you need the CREATE privilege for the database. CREATE SCHEMA 
is a synonym for CREATE DATABASE."

They don't expound on what that means in real terms.


> databases, query any of the INFORMATION_SCHEMA tables and for all those 
> that have the *_CATALOG column (CATALOG is a synonym for DATABASE), 
> notice every row in every table that shows a catalog column, they all 
> say "def", as in default. Further, the identifier used in CREATE 
> DATABASE will show up in INFORMATION_SCHEMA tables in *_SCHEMA columns.


-- 
Adrian Klaver
[email protected]







^ permalink  raw  reply  [nested|flat] 19+ messages in thread

* Re: Quesion about querying distributed databases
  2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
@ 2025-03-06 08:17 ` Achilleas Mantzios - cloud <[email protected]>
  1 sibling, 0 replies; 19+ messages in thread

From: Achilleas Mantzios - cloud @ 2025-03-06 08:17 UTC (permalink / raw)
  To: [email protected]


On 3/5/25 11:55, Laurenz Albe wrote:
> On Wed, 2025-03-05 at 12:57 +0330, me nefcanto wrote:
>> Right now this data is in MariaDB, on separate databases (schema) but on one
>> server. The solution in this situation is to have a cross-database query.
>> (this is the status quo of our application)
>>
>> Now our team has decided to migrate to Postgres. However, we realized that
>> Postgres does not support cross-database queries. And if we want to do so,
>> we should use FDW. So, we thought we might as well put databases on separate
>> servers for scalability if we have to write more code. That's the reason
>> behind this question.
> In MySQL, the terms "database" and "schema" are used for the same thing.
> Not so in PostgreSQL.  I think you should migrate the data into different
> schemas in a single database, pretty much like you had it in MySQL.
> Then you don't need a foreign data wrapper, and I bet the query can
> perform as well as it did on MySQL.
Same thing in MS SQL, as of version 2017 or so.
>
> Yours,
> Laurenz Albe
>
>
>






^ permalink  raw  reply  [nested|flat] 19+ messages in thread

end of thread, other threads:[~2025-03-29 15:09 UTC | newest]

Thread overview: 19+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-03-05 09:55 Re: Quesion about querying distributed databases Laurenz Albe <[email protected]>
2025-03-05 10:48 ` me nefcanto <[email protected]>
2025-03-05 11:48   ` Laurenz Albe <[email protected]>
2025-03-05 12:15     ` me nefcanto <[email protected]>
2025-03-05 14:34       ` Laurenz Albe <[email protected]>
2025-03-05 15:21       ` Greg Sabino Mullane <[email protected]>
2025-03-05 16:08       ` Adrian Klaver <[email protected]>
2025-03-06 02:43         ` me nefcanto <[email protected]>
2025-03-06 03:03           ` Igor Korot <[email protected]>
2025-03-06 06:22           ` Ron Johnson <[email protected]>
2025-03-06 08:04           ` Laurenz Albe <[email protected]>
2025-03-06 08:45             ` me nefcanto <[email protected]>
2025-03-06 12:01               ` Laurenz Albe <[email protected]>
2025-03-06 13:31           ` Greg Sabino Mullane <[email protected]>
2025-03-06 15:47             ` Igor Korot <[email protected]>
2025-03-06 18:16               ` Ron Johnson <[email protected]>
2025-03-29 09:15                 ` Kevin Stephenson <[email protected]>
2025-03-29 15:09                   ` Adrian Klaver <[email protected]>
2025-03-06 08:17 ` Achilleas Mantzios - cloud <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox