Re: Proposal: Adding compression of temporary files

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Filip Janus <[email protected]>
To: Tomas Vondra <[email protected]>
Cc: [email protected]
Subject: Re: Proposal: Adding compression of temporary files
Date: Sun, 24 Nov 2024 15:53:38 +0100
Message-ID: <CAFjYY++22tmGDSac_YPA_NLJ4-cEJo7kS3SxXFHf7ZFTnwO-QA@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <CAFjYY+LAxqRpaJpukBx9KjeyhZvMu6O4xcCUU9gw1SFGsUTcDA@mail.gmail.com>
	<CAFjYY+JJ3x-QUBpSYr5eTdapERhS9Nw3SEAH+QnBB=kypoXUJw@mail.gmail.com>
	<[email protected]>

    -Filip-


st 20. 11. 2024 v 1:35 odesílatel Tomas Vondra <[email protected]> napsal:

> Hi,
>
> On 11/18/24 22:58, Filip Janus wrote:
> > ...
> >     Hi all,
> >     Postgresql supports data compression nowadays, but the compression of
> >     temporary files has not been implemented yet. The huge queries can
> >     produce a significant amount of temporary data that needs to
> >     be stored on disk
> >     and cause many expensive I/O operations.
> >     I am attaching a proposal of the patch to enable temporary files
> >     compression for
> >     hashjoins for now. Initially, I've chosen the LZ4 compression
> >     algorithm. It would
> >     probably make better sense to start with pglz, but I realized it
> late.
> >
>
> Thanks for the idea & patch. I agree this might be quite useful for
> workloads generating a lot of temporary files for stuff like sorts etc.
> I think it will be interesting to think about the trade offs, i.e. how
> to pick the compression level - at some point the compression ratio
> stops improving while paying more and more CPU time. Not sure what the
> right choice is, so using default seems fine.
>
> I agree it'd be better to start with pglz, and only then add lz4 etc.
> Firstly, pglz is simply the built-in compression, supported everywhere.
> And it's also simpler to implement, I think.
>
> >     # Future possible improvements
> >     Reducing the number of memory allocations within the dumping and
> >     loading of
> >     the buffer. I have two ideas for solving this problem. I would
> >     either add a buffer into
> >     struct BufFile or provide the buffer as an argument from the caller.
> >     For the sequential
> >     execution, I would prefer the second option.
> >
>
> Yes, this would be good. Doing a palloc+pfree for each compression is
> going to be expensive, especially because these buffers are going to be
> large - likely larger than 8kB. Which means it's not cached in the
> memory context, etc.
>
> Adding it to the BufFile is not going to fly, because that doubles the
> amount of memory per file. And we already have major issues with hash
> joins consuming massive amounts of memory. But at the same time the
> buffer is only needed during compression, and there's only one at a
> time. So I agree with passing a single buffer as an argument.
>
> >     # Future plan/open questions
> >     In the future, I would like to add support for pglz and zstd.
> >     Further, I plan to
> >     extend the support of the temporary file compression also for
> >     sorting, gist index creation, etc.
> >
> >     Experimenting with the stream mode of compression algorithms. The
> >     compression
> >     ratio of LZ4 in block mode seems to be satisfying, but the stream
> >     mode could
> >     produce a better ratio, but it would consume more memory due to the
> >     requirement to store
> >     context for LZ4 stream compression.
> >
>
> One thing I realized is that this only enables temp file compression for
> a single place - hash join spill files. AFAIK this is because compressed
> files don't support random access, and the other places might need that.
>
> Is that correct? The patch does not explain this anywhere. If that's
> correct, the patch probably should mention this in a comment for the
> 'compress' argument added to BufFileCreateTemp(), so that it's clear
> when it's legal to set compress=true.
>

I will add the description there.


> Which other places might compress temp files? Surely hash joins are not
> the only place that could benefit from this, right?
>

Yes, you are definitely right. I have chosen the hash joins as a POC
because
there are no seeks besides seeks at the beginning of the buffer.
I have focused on hashjoins, but there are definitely also other places
where
the compression could be used. I want to add support in other places
in the feature.


> Another thing is testing. If I run regression tests, it won't use
> compression at all, because the GUC has "none" by default, right? But we
> need some testing, so how would we do that? One option would be to add a
> regression test that explicitly sets the GUC and does a hash join, but
> that won't work with lz4 (because that may not be enabled).


Right, it's "none" by default. My opinion is that we would like to test
every supported compression method, so I will try to add environment
variable as
you recommended.


>
> Another option might be to add a PG_TEST_xxx environment variable that
> determines compression to use. Something like PG_TEST_USE_UNIX_SOCKETS.
> But perhaps there's a simpler way.
>
> >     # Benchmark
> >     I prepared three different databases to check expectations. Each
> >     dataset is described below. My testing demonstrates that my patch
> >     improves the execution time of huge hash joins.
> >     Also, my implementation should not
> >     negatively affect performance within smaller queries.
> >     The usage of memory needed for temporary files was reduced in every
> >      execution without a significant impact on execution time.
> >
> >     *## Dataset A:*
> >     Tables*
> >     *
> >     table_a(bigint id,text data_text,integer data_number) - 10000000 rows
> >     table_b(bigint id, integer ref_id, numeric data_value, bytea
> >     data_blob) - 10000000 rows
> >     Query:  SELECT *  FROM table_a a JOIN table_b b ON a.id <http://
> >     a.id> = b.id <http://b.id;;
> >
> >     The tables contain highly compressible data.
> >     The query demonstrated a reduction in the usage of the temporary
> >     files ~20GB -> 3GB, based on this reduction also caused the
> execution
> >     time of the query to be reduced by about ~10s.
> >
> >
> >     *## Dataset B:*
> >     Tables:*
> >     *
> >     table_a(integer id, text data_blob) - 1110000 rows
> >     table_b(integer id, text data_blob) - 10000000 rows
> >     Query:  SELECT *  FROM table_a a JOIN table_b b ON a.id <http://
> >     a.id> = b.id <http://b.id;;
> >
> >     The tables contain less compressible data. data_blob was generated
> >     by a pseudo-random generator.
> >     In this case, the data reduction was only ~50%. Also, the execution
> >     time was reduced
> >     only slightly with the enabled compression.
> >
> >     The second scenario demonstrates no overhead in the case of enabled
> >     compression and extended work_mem to avoid temp file usage.
> >
> >     *## Dataset C:*
> >     Tables
> >     customers (integer,text,text,text,text)
> >     order_items(integer,integer,integer,integer,numeric(10,2))
> >     orders(integer,integer,timestamp,numeric(10,2))
> >     products(integer,text,text,numeric(10,2),integer)
> >
> >     Query: SELECT p.product_id, p.name <http://p.name;, p.price,
> >     SUM(oi.quantity) AS total_quantity, AVG(oi.price) AS avg_item_price
> >     FROM eshop.products p JOIN eshop.order_items oi ON p.product_id =
> >     oi.product_id JOIN
> >     eshop.orders o ON oi.order_id = o.order_id WHERE o.order_date >
> >     '2020-01-01' AND p.price > 50
> >     GROUP BY p.product_id, p.name <http://p.name;, p.price HAVING
> >     SUM(oi.quantity) > 1000
> >     ORDER BY total_quantity DESC LIMIT 100;
> >
> >     This scenario should demonstrate a more realistic usage of the
> database.
> >     Enabled compression slightly reduced the temporary memory usage, but
> >     the execution
> >     time wasn't affected by compression.
> >
> >
> >     +------------+-------------------------+-----------------------
> >     +------------------------------+
> >     |  Dataset   | Compression.       | temp_bytes         | Execution
> >     Time (ms)   |
> >     +------------+-------------------------+-----------------------
> >     +----------------------------- +
> >     | A             | Yes                        |  3.09 GiB
> >     | 22s586ms           | work_mem  = 4MB
> >     |                | No                         |  21.89 GiB
> >     | 35s                       | work_mem  = 4MB
> >     +------------+-------------------------+-----------------------
> >     +----------------------------------------
> >     | B             | Yes                        |  333 MB
> >     | 1815.545 ms       | work_mem = 4MB
> >     |                 | No                        |  146  MB
> >       | 1500.460 ms        | work_mem = 4MB
> >     |                 | Yes                       |  0 MB
> >         | 3262.305 ms        | work_mem = 80MB
> >     |                 | No                        |  0 MB
> >        | 3174.725 ms         | work_mem = 80MB
> >     +-------------+------------------------+------------------------
> >     +-------------------------------------
> >     | C             | Yes                       | 40 MB
> >     | 1011.020 ms        | work_mem = 1MB
> >     |                | No                        |  53
> >     MB                 |  1034.142 ms        | work_mem = 1MB
> >     +------------+------------------------+------------------------
> >     +--------------------------------------
> >
> >
>
> Thanks. I'll try to do some benchmarks on my own.
>
> Are these results fro ma single run, or an average of multiple runs?


It is average from multiple runs.

Do
> you maybe have a script to reproduce this, including the data generation?


I am attaching my SQL file for database preparation. I also did further
testing
with two other machines( see attachment huge_tables.rtf ).

>
> Also, can you share some information about the machine used for this? I
> expect the impact to strongly depends on memory pressure - if the temp
> file fits into page cache (and stays there), it may not benefit from the
> compression, right?
>

If it fits into the page cache due to compression, I would consider it as a
benefit from compression.
I performed further testing on machines with different memory sizes.
Both experiments showed that compression was beneficial for execution time.
The execution time reduction was more significant in the case of the
machine that had
less memory available.

Tests were performed on:
MacBook PRO M3 36GB - MacOs
Virtual machine ARM64 10GB/ 6CPU - Fedora 39


>
> regards
>
> --
> Tomas Vondra
>
>


Attachments:

  [text/rtf] huge_table.rtf (10.0K, 3-huge_table.rtf)
  download | inline:
{\rtf1\ansi\ansicpg1252\cocoartf2820
\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fnil\fcharset0 Menlo-Regular;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;}
{\*\expandedcolortbl;;\csgray\c0;}
\paperw11900\paperh16840\margl1440\margr1440\vieww35800\viewh20300\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs22 \cf2 \CocoaLigature0 MacBook PRO M3 36GB \
\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\
\
huge_new=# set temp_file_compression = "no";\
SET\
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0
huge_new=# SELECT datname, temp_bytes                                                                                                                                                                                                                       FROM pg_stat_database;\
    datname    |  temp_bytes  \
---------------+--------------\
               |            0\
 postgres      |    154000000\
 fjanus        |            0\
 template1     |            0\
 template0     |            0\
 hashjoin_test |            0\
 huge_tables   | 157592582113\
 random        |    890168658\
 eshop         |     56335229\
 huge_new      |            0\
(10 rows)\
\
huge_new=#  EXPLAIN ANALYZE SELECT a.id, a.data, b.data FROM table_a a JOIN table_b b ON a.id = b.id;                                                                                                                                                                                                                         QUERY PLAN                                                                   \
-----------------------------------------------------------------------------------------------------------------------------------------------\
 Gather  (cost=1318666.18..4309880.44 rows=10000008 width=816) (actual time=113489.050..168871.466 rows=10000000 loops=1)\
   Workers Planned: 7\
   Workers Launched: 7\
   ->  Hash Join  (cost=1317666.18..3308879.64 rows=1428573 width=816) (actual time=113528.579..161470.256 rows=1250000 loops=8)\
         Hash Cond: (b.id = a.id)\
         ->  Parallel Seq Scan on table_b b  (cost=0.00..1139683.45 rows=2857145 width=412) (actual time=0.638..9547.653 rows=2500000 loops=8)\
         ->  Hash  (cost=655556.08..655556.08 rows=10000008 width=412) (actual time=113206.367..113206.367 rows=10000000 loops=8)\
               Buckets: 32768  Batches: 1024  Memory Usage: 4486kB\
               ->  Seq Scan on table_a a  (cost=0.00..655556.08 rows=10000008 width=412) (actual time=0.524..9326.244 rows=10000000 loops=8)\
 Planning Time: 1.720 ms\
 Execution Time: 169180.107 ms\
(11 rows)\
\
huge_new=# SELECT datname, temp_bytes / 1024 / 1024 AS temp_bytes_mb FROM pg_stat_database;\
    datname    | temp_bytes_mb \
---------------+---------------\
               |             0\
 postgres      |           146\
 fjanus        |             0\
 template1     |             0\
 template0     |             0\
 hashjoin_test |             0\
 huge_tables   |        150291\
 random        |           848\
 eshop         |            53\
 huge_new      |         41234\
(10 rows)\
\
huge_new=# set temp_file_compression = "lz4";\
SET\
huge_new=# SELECT pg_stat_reset();\
 pg_stat_reset \
---------------\
 \
(1 row)\
\
huge_new=#  EXPLAIN ANALYZE SELECT a.id, a.data, b.data FROM table_a a JOIN table_b b ON a.id = b.id;                                                                                                                                                                                                                         QUERY PLAN                                                                   \
-----------------------------------------------------------------------------------------------------------------------------------------------\
 Gather  (cost=1318666.18..4309880.44 rows=10000008 width=816) (actual time=93632.254..128635.086 rows=10000000 loops=1)\
   Workers Planned: 7\
   Workers Launched: 7\
   ->  Hash Join  (cost=1317666.18..3308879.64 rows=1428573 width=816) (actual time=93657.750..123759.947 rows=1250000 loops=8)\
         Hash Cond: (b.id = a.id)\
         ->  Parallel Seq Scan on table_b b  (cost=0.00..1139683.45 rows=2857145 width=412) (actual time=0.478..5278.765 rows=2500000 loops=8)\
         ->  Hash  (cost=655556.08..655556.08 rows=10000008 width=412) (actual time=93328.973..93328.973 rows=10000000 loops=8)\
               Buckets: 32768  Batches: 1024  Memory Usage: 4486kB\
               ->  Seq Scan on table_a a  (cost=0.00..655556.08 rows=10000008 width=412) (actual time=0.957..6605.046 rows=10000000 loops=8)\
 Planning Time: 0.807 ms\
 Execution Time: 128980.688 ms\
(11 rows)\
\
huge_new=# SELECT datname, temp_bytes / 1024 / 1024 AS temp_bytes_mb FROM pg_stat_database;\
    datname    | temp_bytes_mb \
---------------+---------------\
               |             0\
 postgres      |           146\
 fjanus        |             0\
 template1     |             0\
 template0     |             0\
 hashjoin_test |             0\
 huge_tables   |        150291\
 random        |           848\
 eshop         |            53\
 huge_new      |          1937\
(10 rows)\
\
huge_new=# SELECT pg_size_pretty(pg_database_size('huge_new'));\
 pg_size_pretty \
----------------\
 13 GB\
(1 row)\
\
\
\
\
Virtual machine ARM64 10GB/ 6CPU\
\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\'97\
\
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0
huge=# SELECT datname,\
       temp_bytes / 1024 / 1024 AS temp_bytes_mb\
FROM pg_stat_database;\
     datname      | temp_bytes_mb \
------------------+---------------\
                  |             0\
 postgres         |             0\
 template1        |             0\
 template0        |             0\
 huge             |             0\
(11 rows)\
\
huge=# set temp_file_compression = "lz4";\
SET\
huge=# EXPLAIN ANALYZE SELECT a.id, a.data, b.data FROM table_a a JOIN table_b b ON a.id = b.id;\
                                                                  QUERY PLAN                                                                   \
-----------------------------------------------------------------------------------------------------------------------------------------------\
 Gather  (cost=1318666.18..4967281.40 rows=10000008 width=816) (actual time=16515.365..35155.453 rows=10000000 loops=1)\
   Workers Planned: 2\
   Workers Launched: 2\
   ->  Hash Join  (cost=1317666.18..3966280.60 rows=4166670 width=816) (actual time=16497.442..31216.586 rows=3333333 loops=3)\
         Hash Cond: (b.id = a.id)\
         ->  Parallel Seq Scan on table_b b  (cost=0.00..1194445.40 rows=8333340 width=412) (actual time=0.873..3916.384 rows=6666667 loops=3)\
         ->  Hash  (cost=655556.08..655556.08 rows=10000008 width=412) (actual time=16479.002..16479.003 rows=10000000 loops=3)\
               Buckets: 32768  Batches: 1024  Memory Usage: 4486kB\
               ->  Seq Scan on table_a a  (cost=0.00..655556.08 rows=10000008 width=412) (actual time=0.356..9466.043 rows=10000000 loops=3)\
 Planning Time: 0.814 ms\
 Execution Time: 35431.791 ms\
(11 rows)\
\
huge=# SELECT datname,\
       temp_bytes / 1024 / 1024 AS temp_bytes_mb\
FROM pg_stat_database;\
     datname      | temp_bytes_mb \
------------------+---------------\
                  |             0\
 postgres         |             0\
 template1        |             0\
 template0        |             0\
 fjanus           |             0\
 test             |             0\
 d22              |             0\
 hashjoin_test    |            16\
 compression_test |           196\
 cmptest          |             1\
 huge             |           921\
(11 rows)\
\
huge=# set temp_file_compression = "no";\
SET\
huge=# SELECT pg_stat_reset();\
 pg_stat_reset \
---------------\
 \
(1 row)\
\
huge=# EXPLAIN ANALYZE SELECT a.id, a.data, b.data FROM table_a a JOIN table_b b ON a.id = b.id;\
                                                                   QUERY PLAN                                                                   \
------------------------------------------------------------------------------------------------------------------------------------------------\
 Gather  (cost=1318666.18..4967281.40 rows=10000008 width=816) (actual time=23555.046..69838.589 rows=10000000 loops=1)\
   Workers Planned: 2\
   Workers Launched: 2\
   ->  Hash Join  (cost=1317666.18..3966280.60 rows=4166670 width=816) (actual time=23542.253..59981.044 rows=3333333 loops=3)\
         Hash Cond: (b.id = a.id)\
         ->  Parallel Seq Scan on table_b b  (cost=0.00..1194445.40 rows=8333340 width=412) (actual time=0.397..13283.039 rows=6666667 loops=3)\
         ->  Hash  (cost=655556.08..655556.08 rows=10000008 width=412) (actual time=23520.499..23520.500 rows=10000000 loops=3)\
               Buckets: 32768  Batches: 1024  Memory Usage: 4486kB\
               ->  Seq Scan on table_a a  (cost=0.00..655556.08 rows=10000008 width=412) (actual time=0.378..15375.891 rows=10000000 loops=3)\
 Planning Time: 1.231 ms\
 Execution Time: 70140.003 ms\
(11 rows)\
\
huge=# SELECT datname,\
       temp_bytes / 1024 / 1024 AS temp_bytes_mb\
FROM pg_stat_database;\
     datname      | temp_bytes_mb \
------------------+---------------\
                  |             0\
 postgres         |             0\
 template1        |             0\
 template0        |             0\
 fjanus           |             0\
 test             |             0\
 d22              |             0\
 hashjoin_test    |            16\
 compression_test |           196\
 cmptest          |             1\
 huge             |         20655\
(11 rows)\
\
huge=# SELECT pg_size_pretty(pg_database_size('huge'));\
 pg_size_pretty \
----------------\
 13 GB\
(1 row)\
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0
\
\
\
\
}

  [application/octet-stream] lz4.sql (1.6K, 4-lz4.sql)
  download

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Proposal: Adding compression of temporary files
  In-Reply-To: <CAFjYY++22tmGDSac_YPA_NLJ4-cEJo7kS3SxXFHf7ZFTnwO-QA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox