Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1dT8K1-0000B4-63 for pgsql-performance@arkaria.postgresql.org; Thu, 06 Jul 2017 15:04:01 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1dT8K0-0003DW-PD for pgsql-performance@arkaria.postgresql.org; Thu, 06 Jul 2017 15:04:00 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1dT8K0-0003DN-D5 for pgsql-performance@postgresql.org; Thu, 06 Jul 2017 15:04:00 +0000 Received: from mail-qt0-x231.google.com ([2607:f8b0:400d:c0d::231]) by magus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1dT8Jw-0008ML-ON for pgsql-performance@postgresql.org; Thu, 06 Jul 2017 15:03:59 +0000 Received: by mail-qt0-x231.google.com with SMTP id i2so4525960qta.3 for ; Thu, 06 Jul 2017 08:03:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=MaJ/Q34CSUi6Yfl3En+XQebYzl11pvbJiHISkT92zyE=; b=ZSF+60lj6A+GWSXWFgOSpf/ybcPy5Gg0QSoM6tXl1w7XtLaYGIVMknM1002MmoZT25 +PlekwrBHKzUXV7mvsgtMI/NWpg7hl0k6aT0HHbtfYGZzJg7r16RAn+R9kKDYCYXnT7Y owwMCJwaQxDK8Q8zQ84vIILW0VxT1z8c7xd17JvKC6EDFP/h9gUPQDNUugXsFh4s+OxG f421xeYif54m04OnVM5I7FPKQKuPjFfq47XDm+x2HLog2XSxgIAt5H8YbPSbB/4NSe9+ 7RN722cq42UjXu4WRpMBNzzl1UTxBTw/tmlfd/a7p22QtIdsgxpPmv7uqwM7tVrvSs8z h59g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=MaJ/Q34CSUi6Yfl3En+XQebYzl11pvbJiHISkT92zyE=; b=XwF5IkHHjCpYNzqbEg4ofF5AIWsSIQa5PY3/BbQmbGn9of/r5RJvuSbdvl/zSMzBd4 FYq9DdhZK8IU0V75ConS2lIWuik6+oY4jlU2Xosnmht968L5OnjGrSfZpKLg71mi9j9o hJ85PrY1NQyt+eyRE7CEqHnBxauvz3WY4oOPlHKcX5FuHrBDGy7lswUfpa/25mWimKTe 0toz2fBynp8UQIZBDbazbl51q105cuQPyKoMFEbYU3EeOi7TA+YPEe5IykdxZdYKZd49 SvrtUrywX6ff7rReYJa5fS1BL257Zx6M7z5YU0gSBjuSiFRX5TPr8yVMf0PaF1dv+DlP eIWA== X-Gm-Message-State: AKS2vOzPlCd8I/cJKAdonTGdLvcOLH9fjMCeku1GP2ZfY4+ifm4mva6j bEScrcJ76QQFbiN5E760Dg/CWF0IdA== X-Received: by 10.200.0.81 with SMTP id i17mr62554450qtg.199.1499353434692; Thu, 06 Jul 2017 08:03:54 -0700 (PDT) MIME-Version: 1.0 Received: by 10.12.177.40 with HTTP; Thu, 6 Jul 2017 08:03:54 -0700 (PDT) From: Rick Otten Date: Thu, 6 Jul 2017 11:03:54 -0400 Message-ID: Subject: partitioning materialized views To: "pgsql-performa." Content-Type: multipart/alternative; boundary="f403045e736ade248d0553a76c02" List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org --f403045e736ade248d0553a76c02 Content-Type: text/plain; charset="UTF-8" I'm pondering approaches to partitioning large materialized views and was hoping for some feedback and thoughts on it from the [perform] minds. PostgreSQL 9.6.3 on Ubuntu 16.04 in the Google Cloud. I have a foreign table with 250M or so rows and 50 or so columns, with a UUID as the primary key. Queries to the foreign table have high latency. (From several minutes to more than an hour to run) If I create a materialized view of this FT, including indexes, it takes about 3-4 hours. If I refresh the materialized view concurrently, it takes 4-5 DAYS. When I run "refresh materialized view concurrently", it takes about an hour for it to download the 250M rows and load them onto the SSD tempspace. At that point we flatline a single core, and run I/O on the main tablespace up pretty high, and then stay that way until the refresh is complete. In order to speed up the concurrent refreshes, I have it broken into 4 materialized views, manually partitioned (by date) with a "union all view" in front of them. Refreshing the data which is changing regularly (new data, in one of the partitions) doesn't require refreshing the entire data set. This works fairly well, and I can refresh the most recent partition in 1 - 2 hours (daily). However, sometimes I have to reach back in time and refresh the deeper partitions. This is taking 3 or more days to complete, even with the data broken into 4 materialized views. This approache lets me refresh all of the partitions at the same time, which uses more cores at the same time (and more tempspace), [I'd like to use as much of my system resources as possible to get the refresh to finish faster.] Unfortunately I am finding I need to refresh the deeper data more and more often (at least once per week), and my table growth is going to jump from adding 3-5M rows per day to adding 10-20M rows per day over the next month or two. Waiting 3 or 4 days for the deeper data to be ready for consumption in PostgreSQL is no longer acceptable to the business. It doesn't look like partman supports partitioning materialized views. It also doesn't look like PG 10's new partitioning features will work with materialized views (although I haven't tried it yet). Citus DB also seems to be unable to help in this scenario. I could create new materialized views every time I need new data, and then swap out the view that is in front of them. There are other objects in the database which have dependencies on that view. In my experiments so far, "create and replace" seems to let me get away with this as long as the columns don't change. Alternatively, I could come up with a new partitioning scheme that lets me more selectively run "refresh concurrently", and run more of those at the same time. I was leaning towards this latter solution. Suppose I make a separate materialized view for each month of data. At the beginning of each month I would have to make a new materialized view, and then add it into the "union all view" on the fly. I would then need a "refresh all" script which refreshed as many of them concurrently as I am willing to dedicate cores to. And I need some handy ways to selectively refresh specific months when I know data for a particular month or set of months changed. So, I actually have 2 of these 250M row tables in the Foreign Database, that I want to do this with. And maybe more coming soon? I'm curious if I'm overlooking other possible architectures or tools that might make this simpler to manage. Similarly, could I construct the "union all view" in front of the partitions to be partition aware so that the query planner doesn't try to look in every one of the materialized views behind it to find the rows I want? If I go with the monthly partition, I'll start with about 36 materialized views behind the main view. --f403045e736ade248d0553a76c02 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
I'm pondering approaches to partitioning large materia= lized views and was hoping for some feedback and thoughts on it from the [p= erform] minds.

PostgreSQL 9.6.3 on Ubuntu 16.04 in the G= oogle Cloud.

I have a foreign table with 250M or s= o rows and 50 or so columns, with a UUID as the primary key.=C2=A0 Queries = to the foreign table have high latency. =C2=A0(From several minutes to more= than an hour to run)

If I create a materialized v= iew of this FT, including indexes, it takes about 3-4 hours. =C2=A0

If I refresh the materialized view concurrently, it takes= 4-5 DAYS.

When I run "refresh materialized v= iew concurrently", it takes about an hour for it to download the 250M = rows and load them onto the SSD tempspace. =C2=A0 At that point we flatline= a single core, and run I/O on the main tablespace up pretty high, and then= stay that way until the refresh is complete.

In o= rder to speed up the concurrent refreshes, I have it broken into 4 material= ized views, manually partitioned (by date) with a "union all view"= ; in front of them.=C2=A0 Refreshing the data which is changing regularly (= new data, in one of the partitions) doesn't require refreshing the enti= re data set.=C2=A0 This works fairly well, and I can refresh the most recen= t partition in 1 - 2 hours (daily).

However, somet= imes I have to reach back in time and refresh the deeper partitions.=C2=A0 = This is taking 3 or more days to complete, even with the data broken into 4= materialized views.=C2=A0 This approache lets me refresh all of the partit= ions at the same time, which uses more cores at the same time (and more tem= pspace), =C2=A0 [I'd like to use as much of my system resources as poss= ible to get the refresh to finish faster.] =C2=A0 Unfortunately I am findin= g I need to refresh the deeper data more and more often (at least once per = week), and my table growth is going to jump from adding 3-5M rows per day t= o adding 10-20M rows per day over the next month or two.=C2=A0 Waiting 3 or= 4 days for the deeper data to be ready for consumption in PostgreSQL is no= longer acceptable to the business.

It doesn't= look like partman supports partitioning materialized views.=C2=A0 It also = doesn't look like PG 10's new partitioning features will work with = materialized views (although I haven't tried it yet). =C2=A0 Citus DB a= lso seems to be unable to help in this scenario.

I= could create new materialized views every time I need new data, and then s= wap out the view that is in front of them.=C2=A0 There are other objects in= the database which have dependencies on that view. In my experiments so fa= r, "create and replace" seems to let me get away with this as lon= g as the columns don't change.

Alternatively, = I could come up with a new partitioning scheme that lets me more selectivel= y run "refresh concurrently", and run more of those at the same t= ime.

I was leaning towards this latter solution.= =C2=A0
=C2=A0
Suppose I make a separate materialized vi= ew for each month of data.=C2=A0 At the beginning of each month I would hav= e to make a new materialized view, and then add it into the "union all= view" on the fly.

I would then need a "= refresh all" script which refreshed as many of them concurrently as I = am willing to dedicate cores to.=C2=A0 And I need some handy ways to select= ively refresh specific months when I know data for a particular month or se= t of months changed.

So, I actually have 2 of thes= e 250M row tables in the Foreign Database, that I want to do this with.=C2= =A0 And maybe more coming soon?=C2=A0

I'm curi= ous if I'm overlooking other possible architectures or tools that might= make this simpler to manage.


Simil= arly, could I construct the "union all view" in front of the part= itions to be partition aware so that the query planner doesn't try to l= ook in every one of the materialized views behind it to find the rows I wan= t? =C2=A0 If I go with the monthly partition, I'll start with about 36 = materialized views behind the main view.


<= /div>

--f403045e736ade248d0553a76c02--