Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sHYjk-0094BJ-4y for pgsql-general@arkaria.postgresql.org; Thu, 13 Jun 2024 00:54:44 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sHYjg-00AYDW-Nl for pgsql-general@arkaria.postgresql.org; Thu, 13 Jun 2024 00:54:41 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sHYjg-00AYDO-AY for pgsql-general@lists.postgresql.org; Thu, 13 Jun 2024 00:54:41 +0000 Received: from mail-wr1-x441.google.com ([2a00:1450:4864:20::441]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sHYjZ-001NqV-Cd for pgsql-general@lists.postgresql.org; Thu, 13 Jun 2024 00:54:40 +0000 Received: by mail-wr1-x441.google.com with SMTP id ffacd0b85a97d-35f0d49a9ebso45179f8f.1 for ; Wed, 12 Jun 2024 17:54:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=pkm-inc.com; s=google; t=1718240071; x=1718844871; darn=lists.postgresql.org; h=content-transfer-encoding:subject:from:to:content-language :user-agent:mime-version:date:message-id:from:to:cc:subject:date :message-id:reply-to; bh=rVxSC+9swUe0ixX20kJRsN7h+F46jxj/4NS84Bwr7jQ=; b=E2uMU0imLky0CGYgs2DZ4KsPb7lEQTWcHiaFvl/tYtL4bwXPhfiDYxVOYkMuUomr1G PJtJMqHgkPFLFg96YBDNK0nn+19XC3iM2YXjEirhmTrwUmCqFV1lzG9ASH+GaEJhzRz5 jL2czOv5oLcrxWl7MVPxEq+y8qSVSLo+u3oRmkAmhvsPyZjAp4XfOeqsc7MZhb3GPZfP 6o5WSHYnMXxc3/Zu0r/UkDK7xsFyb7Dwew9HWQcoTlkFAcYEUDaLFUGWhi7lHlarmzbk zveWh4yNm67CT5UD8snxnOf0t3HqUZm+ssv9MBx8sjSbMo8tKO/eGS4BSXYQacZnU6FJ acKg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718240071; x=1718844871; h=content-transfer-encoding:subject:from:to:content-language :user-agent:mime-version:date:message-id:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=rVxSC+9swUe0ixX20kJRsN7h+F46jxj/4NS84Bwr7jQ=; b=CfgYOZvFYqTnfWjivKnua/qjkjZebCH7kATZIGKUVBPMFTpbLQDVdjGuPvlcy5jUQF bFDnNATkFQdVdgCSKbxj8V+G1HxBEHoX/3T9OAYj731pXkBE26KuGje0VOECYl+10FB0 ZB2RFPFXyVKX9yke1e3dFSAUENjFR58ZdSe5p30mb836gpgyxvvSswOst/cu0VNWyQM7 QUhrlxJCSm/viETeCbOlE+VVzQHdSP/JfRsE8uVqTbX8PUEYvvIRpN76kKubW7pdwZnL 63UYBjoYqOU+w64c9qP3zjPIxgNZaafgwwHcaznMcWvVdyo3CoSomvx0vWAm2JWSr2Lg tnOA== X-Gm-Message-State: AOJu0YymSBGovrctDdP5MukT/OT+jTsAiIZRvV8MHFSvj8u5Lfx3OGru x3Ax3pSmbJBL9t3Rbrb1MAl2ytCKoDd0bLqHCOmjH7tvRom6/wlr3dWKiDV+b8DyuVD/TpeJmE1 aNUNOLedv X-Google-Smtp-Source: AGHT+IEFU+/QRoJMe2Bpe6he/V/P6SQf8SRzPjfgsCYrNfAQRuNfyxHo+uSTkU7fAlBIFSGtHIVtbA== X-Received: by 2002:a05:600c:1c14:b0:421:7dc3:9a1d with SMTP id 5b1f17b1804b1-422866c4ac7mr22539575e9.4.1718240070932; Wed, 12 Jun 2024 17:54:30 -0700 (PDT) Received: from [10.8.0.8] (93-86-103-73.dynamic.isp.telekom.rs. [93.86.103.73]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-422874e74c7sm42580535e9.47.2024.06.12.17.54.30 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 12 Jun 2024 17:54:30 -0700 (PDT) Message-ID: <514b55e0-342c-4ec9-b6b4-544af0415dd6@pkm-inc.com> Date: Thu, 13 Jun 2024 02:54:29 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Content-Language: en-US To: pgsql-general@lists.postgresql.org From: =?UTF-8?Q?Dragan_Milivojevi=C4=87?= Subject: Is NVMe RAID useless (performance-wise) with PostgreSQL? Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi all, While building a new PostgreSQL server, I realized that the performance with a single disk is the same or better than with a RAID0 4-disk array. All benchmarks were conducted using pgbench with a scaling factor of 2000. For a typical run with pgbench -j 4 -c 512 -P 60 -r -T 300 -b tpcb-like, these are the results: single disk run: latency average = 17.524 ms latency stddev = 6.904 ms tps = 28870 iostat: avg-cpu:  %user   %nice %system %iowait  %steal   %idle           26.10    0.00   21.02    2.95    0.00   49.93 | Device  |    r/s   |  rMB/s | rrqm/s | %rrqm | r_await | rareq-sz |    w/s   |  wMB/s | wrqm/s | %wrqm | w_await | wareq-sz |   f/s  | f_await | aqu-sz | %util | |---------|:--------:|:------:|:------:|:-----:|:-------:|:--------:|:--------:|:------:|:------:|:-----:|:-------:|:--------:|:------:|:-------:|:------:|:-----:| | nvme0n1 | 28641.27 | 255.00 |   0.00 |  0.00 |    0.16 | 9.12 | 27665.67 | 458.09 |   0.00 |  0.00 |    0.09 |    16.96 | 251.47 |    1.69 |   7.69 | 98.08 | RAID0 4 disk, 4K chunk: latency average = 22.269 ms latency stddev = 10.825 ms tps = 22742 avg-cpu:  %user   %nice %system %iowait  %steal   %idle           23.63    0.00   19.63    1.53    0.00   55.21 | Device  |    r/s   |  rMB/s |  rrqm/s | %rrqm | r_await | rareq-sz |    w/s    |  wMB/s | wrqm/s | %wrqm | w_await | wareq-sz |   f/s  | f_await | aqu-sz | %util | |---------|:--------:|:------:|:-------:|:-----:|:-------:|:--------:|:---------:|:------:|:------:|:-----:|:-------:|:--------:|:------:|:-------:|:------:|:-----:| | md127   | 55359.93 | 216.25 |    0.00 |  0.00 |    0.09 | 4.00 | 105629.07 | 412.61 |   0.00 |  0.00 |    0.04 |     4.00 |   0.00 |    0.00 |   9.02 | 93.76 | | nvme1n1 | 12763.33 |  54.03 | 1067.47 |  7.72 |    0.08 | 4.33 |  26572.07 | 103.31 |  37.33 |  0.14 |    0.05 |     3.98 | 162.53 |    1.74 |   2.67 | 99.18 | | nvme3n1 | 12753.07 |  53.97 | 1063.87 |  7.70 |    0.08 | 4.33 |  26560.47 | 103.26 |  37.40 |  0.14 |    0.05 |     3.98 | 162.47 |    1.73 |   2.58 | 99.15 | | nvme4n1 | 12787.27 |  54.10 | 1062.80 |  7.67 |    0.09 | 4.33 |  26492.73 | 102.99 |  35.67 |  0.13 |    0.05 |     3.98 | 162.53 |    1.69 |   2.67 | 99.07 | | nvme5n1 | 12796.53 |  54.15 | 1065.60 |  7.69 |    0.09 | 4.33 |  26505.67 | 103.04 |  35.73 |  0.13 |    0.05 |     3.98 | 162.53 |    1.66 |   2.56 | 98.95 | BTW, if these tables are mangled in transport or by email clients, I posted this email to https://pastebin.com/raw/ZmsH0T5M. A 4K chunk is obviously not optimal, but I should still be getting around a 2x uplift. In the past, when tuning PostgreSQL, I tweaked various RAID parameters like chunk size, stripe_cache_size, etc. but it never occurred to me to check the performance against a single drive. Hence this email. I'm not sure if this is expected or if there is something wrong with my setup. Full system details are at the end of the message. While exploring this, I went deep down the rabbit hole, running hundreds of tests and trying dozens of configurations. The best I achieved was 31K TPS with a plain RAID0 256KB chunk, nvme poll_queues=4 and io_pool=1. This resulted in a measly 2% improvement compared to a single disk. A sample of results: Single disk nvme poll_queues=0 pgbench -j4 -c X    tps    avg latency ms  latency stddev ms 1                   477         2.096             0.258 4                   1167        3.426             0.295 16                  4408        3.623             0.545 64                  12533       5.089             0.999 128                 21295       5.979             1.538 256                 28022       9.048             3.014 512                 28870      17.524             6.904 Single disk nvme poll_queues=4 pgbench -j4 -c X    tps    avg latency ms  latency stddev ms 128                 22284       5.711           1.448 256                 27390       9.240           2.848 512                 30596       16.452          6.090 1024                26352       38.481          19.513 4 disk RAID 0 4KB Chunk nvme poll_queues=0 pgbench -j4 -c X    tps     avg latency ms  latency stddev ms 128                 17614       7.231           2.43 256                 22347       11.37           4.922 512                 22742       22.269          10.825 1024                20896       48.57           26.975 4 disk LVM RAID 0 4KB Chunk nvme poll_queues=4 pgbench -j4 -c X    tps     avg latency ms  latency stddev ms 128                 17423       7.312           1.991 256                 22064      11.521           4.044 512                 24875      20.373           9.421 1024                21242      47.692           25.843 4 disk RAID 0 8KB Chunk nvme poll_queues=4 pgbench -j4 -c X    tps     avg latency ms  latency stddev ms 128                 18907       6.736           2.094 256                 24909       10.184          3.446 512                 24878       20.331          8.448 1024                20535       49.665          27.462 4 disk RAID 0 64KB Chunk nvme poll_queues=4 pgbench -j4 -c X    tps     avg latency ms  latency stddev ms 128                 21393       5.951           2.102 256                 27231       9.293           2.752 512                 30261      16.624           5.986 1024                25245      40.285          21.317 4 disk RAID 0 256KB Chunk nvme poll_queues=4 pgbench -j4 -c X    tps     avg latency ms  latency stddev ms 128                 21055       6.046           1.972 256                 27226       9.289           2.736 512                 31174      16.120           5.858 1024                27069      37.355           17.797 4 disk RAID 5 4KB Chunk nvme poll_queues=4 group_thread_cnt=0 stripe_cache_size=256 pgbench -j4 -c X    tps     avg latency ms  latency stddev ms 128                 15368       8.291           3.527 256                 19110       13.312          6.151 512                 19796       25.667          13.191 4 disk RAID 5 4KB Chunk nvme poll_queues=4 group_thread_cnt=4 stripe_cache_size=256 pgbench -j4 -c X    tps     avg latency ms  latency stddev ms 128                 17143       7.431           2.353 256                 21777      11.665           4.175 512                 22602      22.433           9.239 1024                20291      50.189          26.191 4 disk RAID 5 4KB Chunk nvme poll_queues=4 group_thread_cnt=8 stripe_cache_size=256 pgbench -j4 -c X    tps     avg latency ms  latency stddev ms 128                 16767       7.598           2.698 256                 21646      11.733           4.288 512                 22161      22.879           9.811 4 disk RAID 5 256KB Chunk nvme poll_queues=4 group_thread_cnt=8 stripe_cache_size=4096 pgbench -j4 -c X    tps     avg latency ms  latency stddev ms 128                 15997       7.966          2.754 256 512                 21745      23.293          9.805 The obvious conclusion is that there is something broken with Linux MD RAID, so I examined whether RAID0 scales with the simplest workloads, such as sequential reads. Unfortunately, I did not reach a definite conclusion. I posted my tests to the Linux RAID list but received no response (https://lore.kernel.org/linux-raid/1af8f1e0-4f41-4f25-bc34-f655a4c141b4@pkm-inc.com/T/#u). So my question is this: has anyone else tested whether MD RAID scales with PostgreSQL, and what results did you get? What is your typical setup when you have a bunch of NVMe drives in a shiny new server? System specs: Dell PowerEdge R7525, Dual AMD EPYC 7313, 32G DDR4 3200 Disks used for tests: "Samsung SSD 980 PRO with Heatsink 2TB" drives All drives under test are connected to the same processor. AlmaLinux release 9.4 (Seafoam Ocelot) Kernel 5.14.0-427.20.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC tuned-adm profile postgresql tweaks: poll_queues=4, io_poll = 1 version --------------------------------------------------------------------------------------------------------------  PostgreSQL 16.1 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), 64-bit (1 row)              name             |  current_setting   | source ------------------------------+--------------------+--------------------  application_name             | psql               | client  checkpoint_completion_target | 0.9                | configuration file  client_encoding              | UTF8               | client  DateStyle                    | ISO, MDY           | configuration file  default_statistics_target    | 100                | configuration file  default_text_search_config   | pg_catalog.english | configuration file  dynamic_shared_memory_type   | posix              | configuration file  effective_cache_size         | 24GB               | configuration file  effective_io_concurrency     | 200                | configuration file  lc_messages                  | C.UTF-8            | configuration file  lc_monetary                  | C.UTF-8            | configuration file  lc_numeric                   | C.UTF-8            | configuration file  lc_time                      | C.UTF-8            | configuration file  log_filename                 | postgresql-%a.log  | configuration file  log_rotation_age             | 1d                 | configuration file  log_rotation_size            | 0                  | configuration file  log_timezone                 | Europe/Belgrade    | configuration file  log_truncate_on_rotation     | on                 | configuration file  logging_collector            | on                 | configuration file  maintenance_work_mem         | 2GB                | configuration file  max_connections              | 1024               | configuration file  max_wal_size                 | 8GB                | configuration file  min_wal_size                 | 2GB                | configuration file  random_page_cost             | 1.1                | configuration file  shared_buffers               | 8GB                | configuration file  TimeZone                     | Europe/Belgrade    | configuration file  wal_buffers                  | 16MB               | configuration file  work_mem                     | 4MB                | configuration file Thanks Dragan