Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.80) (envelope-from ) id 1aaZ32-0006BB-TS for pgsql-performance@arkaria.postgresql.org; Tue, 01 Mar 2016 01:24:25 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84) (envelope-from ) id 1aaZ32-0002VE-G5 for pgsql-performance@arkaria.postgresql.org; Tue, 01 Mar 2016 01:24:24 +0000 Received: from makus.postgresql.org ([2001:4800:1501:1::229]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84) (envelope-from ) id 1aaZ1N-0008ME-4t for pgsql-performance@postgresql.org; Tue, 01 Mar 2016 01:22:41 +0000 Received: from sss.pgh.pa.us ([66.207.139.130]) by makus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84) (envelope-from ) id 1aaZ1F-0004dL-UF for pgsql-performance@postgresql.org; Tue, 01 Mar 2016 01:22:39 +0000 Received: from sss1.sss.pgh.pa.us (localhost [127.0.0.1]) by sss.pgh.pa.us (8.14.4/8.14.4) with ESMTP id u211MSaF006639; Mon, 29 Feb 2016 20:22:28 -0500 From: Tom Lane To: David Rowley cc: James Parks , postgres performance list Subject: Re: Merge joins on index scans In-reply-to: References: Comments: In-reply-to David Rowley message dated "Sun, 28 Feb 2016 23:06:35 +1300" Date: Mon, 29 Feb 2016 20:22:28 -0500 Message-ID: <6638.1456795348@sss.pgh.pa.us> X-Pg-Spam-Score: -1.9 (-) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org David Rowley writes: > On 27 February 2016 at 11:07, James Parks wrote: >> If you force the query planner to use a merge join on the above query, it >> takes 10+ minutes to complete using the data as per below. If you force the >> query planner to use a hash join on the same data, it takes ~200 >> milliseconds. > I believe I know what is going on here, but can you please test; > SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a ON b.a_id = a.id AND > a.nonce = ?) ORDER BY b.id ASC; > using the merge join plan. > If this performs much better then the problem is due to the merge join > mark/restore causing the join to have to transition through many > tuples which don't match the a.nonce = ? predicate. Clearly we are rescanning an awful lot of the "a" table: -> Index Scan using a_pkey on a (cost=0.00..26163.20 rows=843 width=8) (actual time=5.706..751385.306 rows=83658 loops=1) Filter: (nonce = 64) Rows Removed by Filter: 2201063696 Buffers: shared hit=2151024418 read=340 I/O Timings: read=1.015 The other explain shows a scan of "a" reading about 490k rows and returning 395 of them, so there's a factor of about 200 re-read here. I wonder if the planner should have inserted a materialize node to reduce that. However, I think the real problem is upstream of that: if that indexscan was estimated at 26163.20 units, how'd the mergejoin above it get costed at only 7850.13 units? The answer has to be that the planner thought the merge would stop before reading most of "a", as a result of limited range of b.a_id. It would be interesting to look into what the actual maximum b.a_id value is. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance