Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1usgyO-009Ozx-W3 for pgsql-hackers@arkaria.postgresql.org; Sun, 31 Aug 2025 12:15:55 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1usgyO-001PsD-Aa for pgsql-hackers@arkaria.postgresql.org; Sun, 31 Aug 2025 12:15:52 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1usgyN-001Ps5-GE for pgsql-hackers@lists.postgresql.org; Sun, 31 Aug 2025 12:15:52 +0000 Received: from forwardcorp1a.mail.yandex.net ([178.154.239.72]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1usgyJ-002fwo-2Y for pgsql-hackers@postgresql.org; Sun, 31 Aug 2025 12:15:51 +0000 Received: from mail-nwsmtp-smtp-corp-main-69.vla.yp-c.yandex.net (mail-nwsmtp-smtp-corp-main-69.vla.yp-c.yandex.net [IPv6:2a02:6b8:c1f:3a87:0:640:845c:0]) by forwardcorp1a.mail.yandex.net (Yandex) with ESMTPS id A6332C0165; Sun, 31 Aug 2025 15:15:44 +0300 (MSK) Received: from smtpclient.apple (unknown [2a02:6bf:8080:54d::1:38]) by mail-nwsmtp-smtp-corp-main-69.vla.yp-c.yandex.net (smtpcorp/Yandex) with ESMTPSA id hFcrI72Gw4Y0-TCMDON4c; Sun, 31 Aug 2025 15:15:44 +0300 X-Yandex-Fwd: 1 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex-team.ru; s=default; t=1756642544; bh=vGaCatZ0+SIzfCr7kcUD4ckBRQ2mjvIl54tQ41MxypM=; h=References:To:Cc:In-Reply-To:Date:From:Message-Id:Subject; b=Vclfssjg7mOdQo21gGo1x3N5X7bLYZcyK+rNYEaJmKYheSOkqBAy5e42d5fG47efL BrhNhh30N+o4K69f3GxDvmCQxxMgZyeBnCZzgs/0sjBhdH4eG86lM0/2KO0+Qy9JPB 3EAOVc4eCyOUn/mblibzsrB+JpYttuXETamuHbrk= Authentication-Results: mail-nwsmtp-smtp-corp-main-69.vla.yp-c.yandex.net; dkim=pass header.i=@yandex-team.ru From: Andrey Borodin Message-Id: Content-Type: multipart/mixed; boundary="Apple-Mail=_76A30C9C-86C3-47A6-8ADC-C164D48DCA06" Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3826.600.51.1.1\)) Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat Date: Sun, 31 Aug 2025 17:15:32 +0500 In-Reply-To: <3F0FC1DC-3653-4F61-A46F-46FF81C14915@yandex-team.ru> Cc: pgsql-hackers , Kirk Wolak , Nikolay Samokhvalov To: Peter Geoghegan , boekewurm+postgres@gmail.com References: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru> <3F0FC1DC-3653-4F61-A46F-46FF81C14915@yandex-team.ru> X-Mailer: Apple Mail (2.3826.600.51.1.1) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --Apple-Mail=_76A30C9C-86C3-47A6-8ADC-C164D48DCA06 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii > On 29 Aug 2025, at 13:39, Andrey Borodin wrote: >=20 > I think to establish baseline for locking correctness we are going to = start from writing index scan tests, that fail with proposed merge patch = and pass on current HEAD. I want to observe that forward scan is showing = duplicates and backward scan misses tuples. Well, that was unexpectedly easy. See patch 0001. It brings a test where = we create sparse tree, and injection point that will wait on a scan = stepping into some middle leaf page. Then the test invokes vacuum. There are ~35 leaf pages, most of them = will be merged into just a few pages. As expected, both scans produce incorrect results. t/008_btree_merge_scan_correctness.pl .. 1/?=20 # Failed test 'Forward scan returns correct count' # at t/008_btree_merge_scan_correctness.pl line 132. # got: '364' # expected: '250' # Failed test 'Backward scan returns correct count' # at t/008_btree_merge_scan_correctness.pl line 133. # got: '142' # expected: '250' # Looks like you failed 2 tests of 2. > =46rom that we will try to design locking that does not affect = performance significantly, but allows to merge pages. Perhaps, we can = design a way to switch new index scans to "safe mode" during index = vacuum and waiting for existing scans to complete. What if we just abort a scan, that stepped on the page where tuples were = moved out? I've prototype this approach, please see patch 0002. Maybe in future we = will improve locking protocol if we will observe high error rates. Unfortunately, this approach leads to default mergefactor 0 instead of = 5%. What do you think? Should we add this to CF or the idea is too wild for = a review? Best regards, Andrey Borodin. --Apple-Mail=_76A30C9C-86C3-47A6-8ADC-C164D48DCA06 Content-Disposition: attachment; filename=v2-0001-btree-Add-page-merge-during-vacuum-to-reduce-inde.patch Content-Type: application/octet-stream; x-unix-mode=0644; name="v2-0001-btree-Add-page-merge-during-vacuum-to-reduce-inde.patch" Content-Transfer-Encoding: quoted-printable =46rom=20b3317ab787b6674ac411f67c1bcb4a23fbcffd4f=20Mon=20Sep=2017=20= 00:00:00=202001=0AFrom:=20Andrey=20Borodin=20=0ADate:=20= Wed,=2020=20Aug=202025=2023:24:36=20+0500=0ASubject:=20[PATCH=20v2=20= 1/2]=20btree:=20Add=20page=20merge=20during=20vacuum=20to=20reduce=20= index=0A=20bloat=0A=0AImplement=20automatic=20merging=20of=20= nearly-empty=20B-tree=20leaf=20pages=20during=20vacuum.=0AWhen=20a=20= page=20exceeds=20the=20mergefactor=20threshold=20(default=205%=20=3D=20= 95%=20empty),=20move=0Aremaining=20tuples=20to=20the=20right=20sibling=20= and=20delete=20the=20source=20page,=20reducing=0Aindex=20bloat.=0A=0A= Changes:=0A-=20Add=20mergefactor=20reloption=20(0-50%,=20default=205%)=20= for=20configurable=20merge=20threshold=0A-=20Detect=20mergefactor-empty=20= pages=20in=20btvacuumpage()=20for=20merge=20consideration=0A-=20Add=20= merge=20logic=20in=20_bt_unlink_halfdead_page()=20with=20tuple=20= movement=0A-=20Perform=20feasibility=20checks=20for=20right=20sibling=20= space=20and=20level=0A-=20Handle=20WAL=20replay=20by=20reading=20tuples=20= from=20target=20page=20before=20reinit=0A-=20Skip=20merge=20when=20= vacuum=20deletions=20are=20pending=20to=20avoid=20WAL=20conflicts=0A-=20= Update=20assertions=20to=20allow=20non-empty=20pages=20for=20merge=20= scenarios=0A=0AThe=20implementation=20maintains=20crash=20safety=20= through=20existing=20critical=20sections=0Aand=20WAL=20logging,=20= preserves=20B-tree=20sort=20order,=20and=20coordinates=20with=20vacuum=0A= operations=20to=20prevent=20offset=20conflicts=20during=20standby=20= replay.=0A=0AUsage:=20CREATE=20INDEX=20ON=20table=20(col)=20WITH=20= (mergefactor=3D10);=0A---=0A=20src/backend/access/common/reloptions.c=20=20= =20=20=20=20=20=20|=20=2010=20+=0A=20src/backend/access/nbtree/nbtpage.c=20= =20=20=20=20=20=20=20=20=20=20|=20275=20+++++++++++++++++-=0A=20= src/backend/access/nbtree/nbtree.c=20=20=20=20=20=20=20=20=20=20=20=20|=20= =2028=20+-=0A=20src/backend/access/nbtree/nbtsearch.c=20=20=20=20=20=20=20= =20=20|=20=2019=20++=0A=20src/backend/access/nbtree/nbtutils.c=20=20=20=20= =20=20=20=20=20=20|=20=20=203=20+-=0A=20= src/backend/access/nbtree/nbtxlog.c=20=20=20=20=20=20=20=20=20=20=20|=20=20= 82=20+++++-=0A=20src/include/access/nbtree.h=20=20=20=20=20=20=20=20=20=20= =20=20=20=20=20=20=20=20=20|=20=20=208=20+=0A=20= src/include/access/nbtxlog.h=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20= =20=20=20|=20=20=204=20+-=0A=20.../t/008_btree_merge_scan_correctness.pl=20= =20=20=20=20|=20137=20+++++++++=0A=209=20files=20changed,=20544=20= insertions(+),=2022=20deletions(-)=0A=20create=20mode=20100644=20= src/test/modules/test_misc/t/008_btree_merge_scan_correctness.pl=0A=0A= diff=20--git=20a/src/backend/access/common/reloptions.c=20= b/src/backend/access/common/reloptions.c=0Aindex=20= 0af3fea68fa..b16fd32dc9b=20100644=0A---=20= a/src/backend/access/common/reloptions.c=0A+++=20= b/src/backend/access/common/reloptions.c=0A@@=20-192,6=20+192,16=20@@=20= static=20relopt_int=20intRelOpts[]=20=3D=0A=20=09=09},=0A=20=09=09= BTREE_DEFAULT_FILLFACTOR,=20BTREE_MIN_FILLFACTOR,=20100=0A=20=09},=0A+=09= {=0A+=09=09{=0A+=09=09=09"mergefactor",=0A+=09=09=09"Minimum=20= percentage=20of=20free=20space=20to=20trigger=20btree=20page=20merge=20= during=20vacuum",=0A+=09=09=09RELOPT_KIND_BTREE,=0A+=09=09=09= ShareUpdateExclusiveLock=09/*=20since=20it=20applies=20only=20to=20= vacuum=0A+=09=09=09=09=09=09=09=09=09=09=20*=20operations=20*/=0A+=09=09= },=0A+=09=09BTREE_DEFAULT_MERGEFACTOR,=200,=2050=0A+=09},=0A=20=09{=0A=20= =09=09{=0A=20=09=09=09"fillfactor",=0Adiff=20--git=20= a/src/backend/access/nbtree/nbtpage.c=20= b/src/backend/access/nbtree/nbtpage.c=0Aindex=20c79dd38ee18..792bc52558b=20= 100644=0A---=20a/src/backend/access/nbtree/nbtpage.c=0A+++=20= b/src/backend/access/nbtree/nbtpage.c=0A@@=20-1805,6=20+1805,9=20@@=20= _bt_pagedel(Relation=20rel,=20Buffer=20leafbuf,=20BTVacState=20*vstate)=0A= =20=09bool=09=09rightsib_empty;=0A=20=09Page=09=09page;=0A=20=09= BTPageOpaque=20opaque;=0A+=09OffsetNumber=20maxoff;=0A+=09OffsetNumber=20= minoff;=0A+=09bool=09=09page_has_tuples;=0A=20=0A=20=09/*=0A=20=09=20*=20= Save=20original=20leafbuf=20block=20number=20from=20caller.=20=20Only=20= deleted=20blocks=0A@@=20-1876,8=20+1879,12=20@@=20_bt_pagedel(Relation=20= rel,=20Buffer=20leafbuf,=20BTVacState=20*vstate)=0A=20=0A=20=09=09/*=0A=20= =09=09=20*=20We=20can=20never=20delete=20rightmost=20pages=20nor=20root=20= pages.=20=20While=20at=20it,=0A-=09=09=20*=20check=20that=20page=20is=20= empty,=20since=20it's=20possible=20that=20the=20leafbuf=20page=0A-=09=09=20= *=20was=20empty=20a=20moment=20ago,=20but=20has=20since=20had=20some=20= inserts.=0A+=09=09=20*=20check=20that=20page=20is=20empty=20or=20nearly=20= empty,=20since=20it's=0A+=09=09=20*=20possible=20that=20the=20leafbuf=20= page=20was=20empty=20a=20moment=20ago,=20but=20has=20since=0A+=09=09=20*=20= had=20some=20inserts.=0A+=09=09=20*=0A+=09=09=20*=20For=20pages=20that=20= have=20tuples,=20attempt=20to=20move=20them=20to=20the=20right=20sibling=0A= +=09=09=20*=20if=20there's=20enough=20space.=20This=20enables=20merging=20= of=20nearly-empty=20pages.=0A=20=09=09=20*=0A=20=09=09=20*=20To=20keep=20= the=20algorithm=20simple,=20we=20also=20never=20delete=20an=20= incompletely=0A=20=09=09=20*=20split=20page=20(they=20should=20be=20rare=20= enough=20that=20this=20doesn't=20make=20any=0A@@=20-1893,9=20+1900,11=20= @@=20_bt_pagedel(Relation=20rel,=20Buffer=20leafbuf,=20BTVacState=20= *vstate)=0A=20=09=09=20*=20we=20know=20we=20stepped=20right=20from=20a=20= page=20that=20passed=20these=20tests,=20so=0A=20=09=09=20*=20it's=20OK.=0A= =20=09=09=20*/=0A-=09=09if=20(P_RIGHTMOST(opaque)=20||=20= P_ISROOT(opaque)=20||=0A-=09=09=09P_FIRSTDATAKEY(opaque)=20<=3D=20= PageGetMaxOffsetNumber(page)=20||=0A-=09=09=09= P_INCOMPLETE_SPLIT(opaque))=0A+=09=09maxoff=20=3D=20= PageGetMaxOffsetNumber(page);=0A+=09=09minoff=20=3D=20= P_FIRSTDATAKEY(opaque);=0A+=09=09page_has_tuples=20=3D=20(minoff=20<=3D=20= maxoff);=0A+=0A+=09=09if=20(P_RIGHTMOST(opaque)=20||=20P_ISROOT(opaque)=20= ||=20P_INCOMPLETE_SPLIT(opaque))=0A=20=09=09{=0A=20=09=09=09/*=20Should=20= never=20fail=20to=20delete=20a=20half-dead=20page=20*/=0A=20=09=09=09= Assert(!P_ISHALFDEAD(opaque));=0A@@=20-1904,6=20+1913,88=20@@=20= _bt_pagedel(Relation=20rel,=20Buffer=20leafbuf,=20BTVacState=20*vstate)=0A= =20=09=09=09return;=0A=20=09=09}=0A=20=0A+=09=09/*=0A+=09=09=20*=20If=20= page=20has=20tuples,=20check=20if=20they=20can=20be=20merged=20with=20= right=20sibling.=0A+=09=09=20*=20Actual=20merge=20will=20be=20done=20= later=20in=20_bt_unlink_halfdead_page()=20within=0A+=09=09=20*=20the=20= same=20WAL=20record=20as=20the=20page=20deletion=20for=20atomicity.=0A+=09= =09=20*/=0A+=09=09if=20(page_has_tuples)=0A+=09=09{=0A+=09=09=09= BlockNumber=20right_sibling=20=3D=20opaque->btpo_next;=0A+=09=09=09= Buffer=09=09rbuf;=0A+=09=09=09Page=09=09rpage;=0A+=09=09=09BTPageOpaque=20= ropaque;=0A+=09=09=09Size=09=09space_needed=20=3D=200;=0A+=09=09=09= OffsetNumber=20i;=0A+=09=09=09bool=09=09merge_feasible=20=3D=20false;=0A= +=0A+=09=09=09/*=20Calculate=20total=20space=20needed=20for=20all=20= tuples=20*/=0A+=09=09=09for=20(i=20=3D=20minoff;=20i=20<=3D=20maxoff;=20= i++)=0A+=09=09=09{=0A+=09=09=09=09ItemId=09=09itemid=20=3D=20= PageGetItemId(page,=20i);=0A+=09=09=09=09space_needed=20+=3D=20= ItemIdGetLength(itemid)=20+=20sizeof(ItemIdData);=0A+=09=09=09}=0A+=0A+=09= =09=09/*=20Check=20if=20merge=20with=20right=20sibling=20is=20feasible=20= */=0A+=09=09=09if=20(right_sibling=20!=3D=20P_NONE)=0A+=09=09=09{=0A+=09=09= =09=09rbuf=20=3D=20_bt_getbuf(rel,=20right_sibling,=20BT_READ);=0A+=09=09= =09=09rpage=20=3D=20BufferGetPage(rbuf);=0A+=09=09=09=09ropaque=20=3D=20= BTPageGetOpaque(rpage);=0A+=0A+=09=09=09=09/*=20Verify=20right=20sibling=20= is=20a=20leaf=20page=20at=20same=20level=20*/=0A+=09=09=09=09if=20= (P_ISLEAF(ropaque)=20&&=20ropaque->btpo_level=20=3D=3D=20= opaque->btpo_level=20&&=0A+=09=09=09=09=09!P_ISDELETED(ropaque)=20&&=20= !P_ISHALFDEAD(ropaque))=0A+=09=09=09=09{=0A+=09=09=09=09=09Size=20= freespace=20=3D=20PageGetFreeSpace(rpage);=0A+=0A+=09=09=09=09=09if=20= (freespace=20>=3D=20space_needed)=0A+=09=09=09=09=09{=0A+=09=09=09=09=09=09= merge_feasible=20=3D=20true;=0A+=09=09=09=09=09=09elog(DEBUG1,=20"Page=20= merge=20feasible=20for=20index=20\"%s\":=20can=20move=20%d=20tuples=20= from=20page=20%u=20to=20page=20%u=20(need=20%zu=20bytes,=20have=20%zu=20= bytes=20free)",=0A+=09=09=09=09=09=09=09=20RelationGetRelationName(rel),=20= (int)(maxoff=20-=20minoff=20+=201),=0A+=09=09=09=09=09=09=09=20= BufferGetBlockNumber(leafbuf),=20right_sibling,=20space_needed,=20= freespace);=0A+=09=09=09=09=09}=0A+=09=09=09=09=09else=0A+=09=09=09=09=09= {=0A+=09=09=09=09=09=09elog(DEBUG1,=20"Page=20merge=20not=20feasible=20= for=20index=20\"%s\":=20right=20sibling=20page=20%u=20has=20insufficient=20= space=20(need=20%zu=20bytes,=20have=20%zu=20bytes=20free)",=0A+=09=09=09=09= =09=09=09=20RelationGetRelationName(rel),=20right_sibling,=20= space_needed,=20freespace);=0A+=09=09=09=09=09}=0A+=09=09=09=09}=0A+=09=09= =09=09else=0A+=09=09=09=09{=0A+=09=09=09=09=09elog(DEBUG1,=20"Page=20= merge=20not=20feasible=20for=20index=20\"%s\":=20right=20sibling=20page=20= %u=20is=20not=20suitable=20(is_leaf=3D%d,=20deleted=3D%d,=20halfdead=3D%d,= =20level=3D%d=20vs=20%d)",=0A+=09=09=09=09=09=09=20= RelationGetRelationName(rel),=20right_sibling,=20P_ISLEAF(ropaque),=0A+=09= =09=09=09=09=09=20P_ISDELETED(ropaque),=20P_ISHALFDEAD(ropaque),=20= ropaque->btpo_level,=20opaque->btpo_level);=0A+=09=09=09=09}=0A+=0A+=09=09= =09=09_bt_relbuf(rel,=20rbuf);=0A+=09=09=09}=0A+=09=09=09else=0A+=09=09=09= {=0A+=09=09=09=09elog(DEBUG1,=20"Page=20merge=20not=20feasible=20for=20= index=20\"%s\":=20page=20%u=20has=20no=20right=20sibling",=0A+=09=09=09=09= =09=20RelationGetRelationName(rel),=20BufferGetBlockNumber(leafbuf));=0A= +=09=09=09}=0A+=0A+=09=09=09/*=20If=20merge=20is=20not=20feasible,=20= abort=20the=20page=20deletion=20*/=0A+=09=09=09if=20(!merge_feasible)=0A= +=09=09=09{=0A+=09=09=09=09_bt_relbuf(rel,=20leafbuf);=0A+=09=09=09=09= return;=0A+=09=09=09}=0A+=0A+=09=09=09/*=0A+=09=09=09=20*=20Merge=20is=20= feasible.=20Mark=20page=20as=20candidates=20for=20merge=20and=20proceed=0A= +=09=09=09=20*=20with=20deletion.=20The=20actual=20merge=20will=20happen=20= in=20_bt_unlink_halfdead_page().=0A+=09=09=09=20*/=0A+=09=09}=0A+=0A+=09=09= /*=0A+=09=09=20*=20Page=20might=20not=20be=20empty=20if=20we're=20doing=20= a=20merge,=20but=20we've=20validated=0A+=09=09=20*=20that=20merge=20is=20= feasible.=20Proceed=20with=20deletion=20-=20the=20actual=20merge=20will=0A= +=09=09=20*=20happen=20in=20_bt_unlink_halfdead_page().=0A+=09=09=20*/=0A= +=0A=20=09=09/*=0A=20=09=09=20*=20First,=20remove=20downlink=20pointing=20= to=20the=20page=20(or=20a=20parent=20of=20the=0A=20=09=09=20*=20page,=20= if=20we=20are=20going=20to=20delete=20a=20taller=20subtree),=20and=20= mark=20the=0A@@=20-2104,9=20+2195,21=20@@=20= _bt_mark_page_halfdead(Relation=20rel,=20Relation=20heaprel,=20Buffer=20= leafbuf,=0A=20=09page=20=3D=20BufferGetPage(leafbuf);=0A=20=09opaque=20=3D= =20BTPageGetOpaque(page);=0A=20=0A+=09/*=0A+=09=20*=20Assert=20page=20is=20= suitable=20for=20deletion.=20Page=20must=20be=20a=20leaf,=20not=20root,=0A= +=09=20*=20not=20rightmost,=20and=20not=20ignored.=20For=20traditional=20= deletion,=20page=20must=20be=0A+=09=20*=20empty.=20For=20merge=20= scenarios=20(nearly=20empty=20pages),=20page=20may=20have=20tuples.=0A+=09= =20*/=0A=20=09Assert(!P_RIGHTMOST(opaque)=20&&=20!P_ISROOT(opaque)=20&&=0A= -=09=09=20=20=20P_ISLEAF(opaque)=20&&=20!P_IGNORE(opaque)=20&&=0A-=09=09=20= =20=20P_FIRSTDATAKEY(opaque)=20>=20PageGetMaxOffsetNumber(page));=0A+=09=09= =20=20=20P_ISLEAF(opaque)=20&&=20!P_IGNORE(opaque));=0A+=0A+=09/*=0A+=09=20= *=20Page=20should=20be=20either=20empty=20(traditional=20deletion)=20or=20= nearly=20empty=0A+=09=20*=20with=20tuples=20that=20can=20be=20merged=20= (new=20merge=20feature).=0A+=09=20*/=0A+=09Assert(P_FIRSTDATAKEY(opaque)=20= >=20PageGetMaxOffsetNumber(page)=20||=0A+=09=09=20=20=20= (P_FIRSTDATAKEY(opaque)=20<=3D=20PageGetMaxOffsetNumber(page)=20&&=0A+=09= =09=09PageGetFreeSpace(page)=20>=3D=20((BLCKSZ=20-=20= SizeOfPageHeaderData)=20*=20BTGetMergeFactor(rel))=20/=20100));=0A=20=09= Assert(heaprel=20!=3D=20NULL);=0A=20=0A=20=09/*=0A@@=20-2231,7=20= +2334,13=20@@=20_bt_mark_page_halfdead(Relation=20rel,=20Relation=20= heaprel,=20Buffer=20leafbuf,=0A=20=09opaque=20=3D=20= BTPageGetOpaque(page);=0A=20=09opaque->btpo_flags=20|=3D=20= BTP_HALF_DEAD;=0A=20=0A-=09Assert(PageGetMaxOffsetNumber(page)=20=3D=3D=20= P_HIKEY);=0A+=09/*=0A+=09=20*=20Page=20should=20only=20have=20high=20key=20= in=20traditional=20deletion=20scenario.=0A+=09=20*=20In=20merge=20= scenarios,=20page=20may=20have=20data=20tuples=20that=20will=20be=20= moved=20later.=0A+=09=20*/=0A+=09Assert(PageGetMaxOffsetNumber(page)=20= =3D=3D=20P_HIKEY=20||=0A+=09=09=20=20=20(PageGetMaxOffsetNumber(page)=20= >=20P_HIKEY=20&&=0A+=09=09=09PageGetFreeSpace(page)=20>=3D=20((BLCKSZ=20= -=20SizeOfPageHeaderData)=20*=20BTGetMergeFactor(rel))=20/=20100));=0A=20= =09MemSet(&trunctuple,=200,=20sizeof(IndexTupleData));=0A=20=09= trunctuple.t_info=20=3D=20sizeof(IndexTupleData);=0A=20=09if=20= (topparent=20!=3D=20leafblkno)=0A@@=20-2333,9=20+2442,19=20@@=20= _bt_unlink_halfdead_page(Relation=20rel,=20Buffer=20leafbuf,=20= BlockNumber=20scanblkno,=0A=20=09FullTransactionId=20safexid;=0A=20=09= bool=09=09rightsib_is_rightmost;=0A=20=09uint32=09=09targetlevel;=0A+=09= IndexTuple=09tuple;=0A+=09Size=09=09tupsz;=0A+=09int=09=09=09idx;=0A=20=09= IndexTuple=09leafhikey;=0A=20=09BlockNumber=20leaftopparent;=0A=20=0A+=09= /*=20Variables=20for=20page=20merge=20functionality=20*/=0A+=09= IndexTuple=09*merge_tuples=20=3D=20NULL;=0A+=09Size=09=09*merge_tupsz=20= =3D=20NULL;=0A+=09OffsetNumber=20*merge_deletable=20=3D=20NULL;=0A+=09= int=09=09=09merge_ntuples=20=3D=200;=0A+=09bool=09=09do_merge=20=3D=20= false;=0A+=0A=20=09page=20=3D=20BufferGetPage(leafbuf);=0A=20=09opaque=20= =3D=20BTPageGetOpaque(page);=0A=20=0A@@=20-2480,11=20+2599,24=20@@=20= _bt_unlink_halfdead_page(Relation=20rel,=20Buffer=20leafbuf,=20= BlockNumber=20scanblkno,=0A=20=0A=20=09if=20(target=20=3D=3D=20= leafblkno)=0A=20=09{=0A-=09=09if=20(P_FIRSTDATAKEY(opaque)=20<=3D=20= PageGetMaxOffsetNumber(page)=20||=0A-=09=09=09!P_ISLEAF(opaque)=20||=20= !P_ISHALFDEAD(opaque))=0A+=09=09/*=0A+=09=09=20*=20Validate=20leaf=20= page=20status.=20Page=20must=20be=20leaf=20and=20half-dead.=0A+=09=09=20= *=20For=20traditional=20deletion,=20page=20must=20be=20empty.=20For=20= merge=20scenarios,=0A+=09=09=20*=20page=20may=20have=20data=20tuples=20= that=20will=20be=20moved=20to=20right=20sibling.=0A+=09=09=20*/=0A+=09=09= if=20(!P_ISLEAF(opaque)=20||=20!P_ISHALFDEAD(opaque))=0A=20=09=09=09= elog(ERROR,=20"target=20leaf=20page=20changed=20status=20unexpectedly=20= in=20block=20%u=20of=20index=20\"%s\"",=0A=20=09=09=09=09=20target,=20= RelationGetRelationName(rel));=0A=20=0A+=09=09/*=20Check=20if=20page=20= is=20empty=20(traditional=20deletion)=20or=20suitable=20for=20merge=20*/=0A= +=09=09if=20(P_FIRSTDATAKEY(opaque)=20<=3D=20= PageGetMaxOffsetNumber(page))=0A+=09=09{=0A+=09=09=09/*=20Page=20has=20= tuples=20-=20validate=20it's=20suitable=20for=20merge=20*/=0A+=09=09=09= if=20(PageGetFreeSpace(page)=20<=20((BLCKSZ=20-=20SizeOfPageHeaderData)=20= *=20BTGetMergeFactor(rel))=20/=20100)=0A+=09=09=09=09elog(ERROR,=20= "target=20leaf=20page=20changed=20status=20unexpectedly=20in=20block=20= %u=20of=20index=20\"%s\"",=0A+=09=09=09=09=09=20target,=20= RelationGetRelationName(rel));=0A+=09=09}=0A+=0A=20=09=09/*=20Leaf=20= page=20is=20also=20target=20page:=20don't=20set=20leaftopparent=20*/=0A=20= =09=09leaftopparent=20=3D=20InvalidBlockNumber;=0A=20=09}=0A@@=20-2591,6=20= +2723,71=20@@=20_bt_unlink_halfdead_page(Relation=20rel,=20Buffer=20= leafbuf,=20BlockNumber=20scanblkno,=0A=20=09=20*=20Here=20we=20begin=20= doing=20the=20deletion.=0A=20=09=20*/=0A=20=0A+=09/*=0A+=09=20*=20= Prepare=20merge=20data=20before=20critical=20section=20if=20needed.=0A+=09= =20*=20We'll=20use=20local=20variables=20and=20perform=20the=20merge=20= within=20the=20critical=20section.=0A+=09=20*/=0A+=0A+=09if=20(target=20= =3D=3D=20leafblkno)=0A+=09{=0A+=09=09Page=09=09leafpage=20=3D=20= BufferGetPage(leafbuf);=0A+=09=09BTPageOpaque=20leafopaque=20=3D=20= BTPageGetOpaque(leafpage);=0A+=09=09OffsetNumber=20minoff=20=3D=20= P_FIRSTDATAKEY(leafopaque);=0A+=09=09OffsetNumber=20maxoff=20=3D=20= PageGetMaxOffsetNumber(leafpage);=0A+=0A+=09=09/*=20Check=20if=20leaf=20= page=20has=20tuples=20that=20need=20to=20be=20merged=20*/=0A+=09=09if=20= (minoff=20<=3D=20maxoff=20&&=20P_ISLEAF(leafopaque)=20&&=20= P_ISHALFDEAD(leafopaque))=0A+=09=09{=0A+=09=09=09Page=09=09rightpage=20=3D= =20BufferGetPage(rbuf);=0A+=09=09=09BTPageOpaque=20rightopaque=20=3D=20= BTPageGetOpaque(rightpage);=0A+=09=09=09Size=09=09space_needed=20=3D=20= 0;=0A+=09=09=09OffsetNumber=20i;=0A+=0A+=09=09=09merge_ntuples=20=3D=20= maxoff=20-=20minoff=20+=201;=0A+=0A+=09=09=09/*=20Calculate=20total=20= space=20needed=20for=20all=20tuples=20*/=0A+=09=09=09for=20(i=20=3D=20= minoff;=20i=20<=3D=20maxoff;=20i++)=0A+=09=09=09{=0A+=09=09=09=09itemid=20= =3D=20PageGetItemId(leafpage,=20i);=0A+=09=09=09=09space_needed=20+=3D=20= ItemIdGetLength(itemid)=20+=20sizeof(ItemIdData);=0A+=09=09=09}=0A+=0A+=09= =09=09/*=20Verify=20that=20right=20sibling=20has=20enough=20space=20= (should=20have=20been=20checked=20earlier)=20*/=0A+=09=09=09if=20= (P_ISLEAF(rightopaque)=20&&=20rightopaque->btpo_level=20=3D=3D=20= leafopaque->btpo_level=20&&=0A+=09=09=09=09!P_ISDELETED(rightopaque)=20= &&=20!P_ISHALFDEAD(rightopaque)=20&&=0A+=09=09=09=09= PageGetFreeSpace(rightpage)=20>=3D=20space_needed)=0A+=09=09=09{=0A+=09=09= =09=09elog(DEBUG1,=20"Performing=20page=20merge=20in=20index=20\"%s\":=20= moving=20%d=20tuples=20from=20page=20%u=20to=20page=20%u",=0A+=09=09=09=09= =09=20RelationGetRelationName(rel),=20merge_ntuples,=20leafblkno,=20= rightsib);=0A+=0A+=09=09=09=09/*=20Pre-allocate=20and=20copy=20all=20= tuples=20outside=20critical=20section=20*/=0A+=09=09=09=09merge_tuples=20= =3D=20palloc(sizeof(IndexTuple)=20*=20merge_ntuples);=0A+=09=09=09=09= merge_tupsz=20=3D=20palloc(sizeof(Size)=20*=20merge_ntuples);=0A+=09=09=09= =09merge_deletable=20=3D=20palloc(sizeof(OffsetNumber)=20*=20= merge_ntuples);=0A+=0A+=09=09=09=09for=20(i=20=3D=20minoff;=20i=20<=3D=20= maxoff;=20i++)=0A+=09=09=09=09{=0A+=09=09=09=09=09itemid=20=3D=20= PageGetItemId(leafpage,=20i);=0A+=09=09=09=09=09tuple=20=3D=20= (IndexTuple)=20PageGetItem(leafpage,=20itemid);=0A+=09=09=09=09=09tupsz=20= =3D=20ItemIdGetLength(itemid);=0A+=09=09=09=09=09idx=20=3D=20i=20-=20= minoff;=0A+=0A+=09=09=09=09=09merge_tuples[idx]=20=3D=20palloc(tupsz);=0A= +=09=09=09=09=09memcpy(merge_tuples[idx],=20tuple,=20tupsz);=0A+=09=09=09= =09=09merge_tupsz[idx]=20=3D=20tupsz;=0A+=09=09=09=09=09= merge_deletable[idx]=20=3D=20i;=0A+=09=09=09=09}=0A+=0A+=09=09=09=09= do_merge=20=3D=20true;=0A+=09=09=09}=0A+=09=09=09else=0A+=09=09=09{=0A+=09= =09=09=09elog(DEBUG1,=20"Cannot=20perform=20page=20merge=20in=20index=20= \"%s\":=20right=20sibling=20conditions=20not=20met",=0A+=09=09=09=09=09=20= RelationGetRelationName(rel));=0A+=09=09=09}=0A+=09=09}=0A+=09}=0A+=0A=20= =09/*=20No=20ereport(ERROR)=20until=20changes=20are=20logged=20*/=0A=20=09= START_CRIT_SECTION();=0A=20=0A@@=20-2611,6=20+2808,38=20@@=20= _bt_unlink_halfdead_page(Relation=20rel,=20Buffer=20leafbuf,=20= BlockNumber=20scanblkno,=0A=20=09Assert(opaque->btpo_prev=20=3D=3D=20= target);=0A=20=09opaque->btpo_prev=20=3D=20leftsib;=0A=20=0A+=09/*=0A+=09= =20*=20Perform=20page=20merge=20if=20needed.=20Move=20tuples=20from=20= the=20leaf=20page=20to=20the=20right=0A+=09=20*=20sibling=20before=20= marking=20the=20leaf=20page=20as=20deleted.=20This=20must=20be=20done=20= within=0A+=09=20*=20the=20critical=20section=20to=20ensure=20atomicity=20= with=20the=20page=20deletion.=0A+=09=20*/=0A+=09if=20(do_merge)=0A+=09{=0A= +=09=09Page=09=09leafpage=20=3D=20BufferGetPage(leafbuf);=0A+=09=09Page=09= =09rightpage=20=3D=20BufferGetPage(rbuf);=0A+=09=09BTPageOpaque=20= rightopaque=20=3D=20BTPageGetOpaque(rightpage);=0A+=09=09OffsetNumber=20= insert_at=20=3D=20P_FIRSTDATAKEY(rightopaque);=0A+=09=09int=09=09=09i;=0A= +=0A+=09=09/*=20Move=20all=20saved=20tuples=20to=20right=20sibling=20*/=0A= +=09=09for=20(i=20=3D=200;=20i=20<=20merge_ntuples;=20i++)=0A+=09=09{=0A= +=09=09=09if=20(PageAddItem(rightpage,=20(Item)=20merge_tuples[i],=20= merge_tupsz[i],=0A+=09=09=09=09=09=09=09insert_at,=20false,=20false)=20= =3D=3D=20InvalidOffsetNumber)=0A+=09=09=09{=0A+=09=09=09=09elog(PANIC,=20= "failed=20to=20move=20tuple=20during=20page=20merge=20in=20index=20= \"%s\"=20-=20space=20should=20have=20been=20checked",=0A+=09=09=09=09=09=20= RelationGetRelationName(rel));=0A+=09=09=09}=0A+=09=09=09insert_at++;=0A= +=09=09}=0A+=0A+=09=09/*=20Clear=20all=20tuples=20from=20the=20leaf=20= page=20using=20pre-allocated=20array=20*/=0A+=09=09= PageIndexMultiDelete(leafpage,=20merge_deletable,=20merge_ntuples);=0A+=0A= +=09=09elog(DEBUG1,=20"Page=20merge=20completed=20in=20index=20\"%s\":=20= moved=20%d=20tuples=20from=20page=20%u=20to=20page=20%u",=0A+=09=09=09=20= RelationGetRelationName(rel),=20merge_ntuples,=20leafblkno,=20rightsib);=0A= +=09}=0A+=0A=20=09/*=0A=20=09=20*=20If=20we=20deleted=20a=20parent=20of=20= the=20targeted=20leaf=20page,=20instead=20of=20the=20leaf=0A=20=09=20*=20= itself,=20update=20the=20leaf=20to=20point=20to=20the=20next=20remaining=20= child=20in=20the=0A@@=20-2676,10=20+2905,15=20@@=20= _bt_unlink_halfdead_page(Relation=20rel,=20Buffer=20leafbuf,=20= BlockNumber=20scanblkno,=0A=20=0A=20=09=09XLogBeginInsert();=0A=20=0A-=09= =09XLogRegisterBuffer(0,=20buf,=20REGBUF_WILL_INIT);=0A+=09=09/*=20= Always=20use=20REGBUF_STANDARD=20for=20target=20page=20to=20allow=20= reading=20during=20replay=20*/=0A+=09=09XLogRegisterBuffer(0,=20buf,=20= REGBUF_STANDARD);=0A=20=09=09if=20(BufferIsValid(lbuf))=0A=20=09=09=09= XLogRegisterBuffer(1,=20lbuf,=20REGBUF_STANDARD);=0A-=09=09= XLogRegisterBuffer(2,=20rbuf,=20REGBUF_STANDARD);=0A+=09=09/*=20Always=20= use=20REGBUF_FORCE_IMAGE=20for=20right=20sibling=20if=20we're=20merging=20= tuples=20*/=0A+=09=09if=20(merge_ntuples=20>=200)=0A+=09=09=09= XLogRegisterBuffer(2,=20rbuf,=20REGBUF_FORCE_IMAGE);=0A+=09=09else=0A+=09= =09=09XLogRegisterBuffer(2,=20rbuf,=20REGBUF_STANDARD);=0A=20=09=09if=20= (target=20!=3D=20leafblkno)=0A=20=09=09=09XLogRegisterBuffer(3,=20= leafbuf,=20REGBUF_WILL_INIT);=0A=20=0A@@=20-2694,8=20+2928,12=20@@=20= _bt_unlink_halfdead_page(Relation=20rel,=20Buffer=20leafbuf,=20= BlockNumber=20scanblkno,=0A=20=09=09xlrec.leafrightsib=20=3D=20= leafrightsib;=0A=20=09=09xlrec.leaftopparent=20=3D=20leaftopparent;=0A=20= =0A+=09=09/*=20Note:=20merge=20information=20is=20inferred=20during=20= WAL=20replay=20by=20checking=20target=20page=20content=20*/=0A+=0A=20=09=09= XLogRegisterData(&xlrec,=20SizeOfBtreeUnlinkPage);=0A=20=0A+=09=09/*=20= Note:=20merged=20tuple=20data=20is=20read=20directly=20from=20target=20= page=20during=20WAL=20replay=20*/=0A+=0A=20=09=09if=20= (BufferIsValid(metabuf))=0A=20=09=09{=0A=20=09=09=09= XLogRegisterBuffer(4,=20metabuf,=20REGBUF_WILL_INIT=20|=20= REGBUF_STANDARD);=0A@@=20-2739,6=20+2977,19=20@@=20= _bt_unlink_halfdead_page(Relation=20rel,=20Buffer=20leafbuf,=20= BlockNumber=20scanblkno,=0A=20=0A=20=09END_CRIT_SECTION();=0A=20=0A+=09= /*=20Clean=20up=20merge-related=20memory=20allocations=20*/=0A+=09if=20= (do_merge=20&&=20merge_tuples=20!=3D=20NULL)=0A+=09{=0A+=09=09for=20(int=20= i=20=3D=200;=20i=20<=20merge_ntuples;=20i++)=0A+=09=09{=0A+=09=09=09if=20= (merge_tuples[i]=20!=3D=20NULL)=0A+=09=09=09=09pfree(merge_tuples[i]);=0A= +=09=09}=0A+=09=09pfree(merge_tuples);=0A+=09=09pfree(merge_tupsz);=0A+=09= =09pfree(merge_deletable);=0A+=09}=0A+=0A=20=09/*=20release=20metapage=20= */=0A=20=09if=20(BufferIsValid(metabuf))=0A=20=09=09_bt_relbuf(rel,=20= metabuf);=0Adiff=20--git=20a/src/backend/access/nbtree/nbtree.c=20= b/src/backend/access/nbtree/nbtree.c=0Aindex=20fdff960c130..8294336ba7a=20= 100644=0A---=20a/src/backend/access/nbtree/nbtree.c=0A+++=20= b/src/backend/access/nbtree/nbtree.c=0A@@=20-1620,6=20+1620,9=20@@=20= backtrack:=0A=20=09=09=20*=20since=20that=20would=20require=20teaching=20= _bt_pagedel()=20about=20backtracking=0A=20=09=09=20*=20(doesn't=20seem=20= worth=20adding=20more=20complexity=20to=20deal=20with=20that).=0A=20=09=09= =20*=0A+=09=09=20*=20We=20also=20attempt=20page=20deletion=20if=20the=20= page=20is=20mostly=20empty=20(by=20bytes).=0A+=09=09=20*=20This=20= enables=20merging=20of=20nearly-empty=20pages=20to=20reduce=20bloat.=0A+=09= =09=20*=0A=20=09=09=20*=20We=20don't=20count=20the=20number=20of=20live=20= TIDs=20during=20cleanup-only=20calls=20to=0A=20=09=09=20*=20btvacuumscan=20= (i.e.=20when=20callback=20is=20not=20set).=20=20We=20count=20the=20= number=0A=20=09=09=20*=20of=20index=20tuples=20directly=20instead.=20=20= This=20avoids=20the=20expense=20of=0A@@=20-1630,12=20+1633,33=20@@=20= backtrack:=0A=20=09=09=20*/=0A=20=09=09if=20(minoff=20>=20maxoff)=0A=20=09= =09=09attempt_pagedel=20=3D=20(blkno=20=3D=3D=20scanblkno);=0A-=09=09= else=20if=20(callback)=0A+=09=09else=20if=20(blkno=20=3D=3D=20scanblkno)=0A= +=09=09{=0A+=09=09=09/*=20Check=20if=20page=20meets=20merge=20threshold=20= by=20space=20usage=20*/=0A+=09=09=09Size=09=09freespace=20=3D=20= PageGetFreeSpace(page);=0A+=09=09=09Size=09=09pagesize=20=3D=20BLCKSZ=20= -=20SizeOfPageHeaderData;=0A+=09=09=09int=09=09=09mergefactor=20=3D=20= BTGetMergeFactor(rel);=0A+=0A+=09=09=09/*=0A+=09=09=09=20*=20Attempt=20= page=20merge=20if=20page=20meets=20merge=20threshold=20by=20space=20= usage.=0A+=09=09=09=20*/=0A+=09=09=09if=20(freespace=20>=3D=20(pagesize=20= *=20mergefactor)=20/=20100)=0A+=09=09=09=09attempt_pagedel=20=3D=20true;=0A= +=09=09}=0A+=0A+=0A+=09=09if=20(callback)=0A=20=09=09=09= stats->num_index_tuples=20+=3D=20nhtidslive;=0A=20=09=09else=0A=20=09=09=09= stats->num_index_tuples=20+=3D=20maxoff=20-=20minoff=20+=201;=0A=20=0A-=09= =09Assert(!attempt_pagedel=20||=20nhtidslive=20=3D=3D=200);=0A+=09=09/*=0A= +=09=09=20*=20Assert=20that=20either=20we're=20not=20attempting=20page=20= deletion,=20or=20if=20we=20are,=0A+=09=09=20*=20it's=20either=20because=20= the=20page=20is=20empty=20(nhtidslive=20=3D=3D=200)=20or=20because=0A+=09= =09=20*=20we're=20attempting=20a=20merge=20of=20a=20mostly=20empty=20= page=20with=20tuples.=0A+=09=09=20*/=0A+=09=09Assert(!attempt_pagedel=20= ||=20nhtidslive=20=3D=3D=200=20||=0A+=09=09=09=20=20=20(blkno=20=3D=3D=20= scanblkno=20&&=20PageGetFreeSpace(page)=20>=3D=20((BLCKSZ=20-=20= SizeOfPageHeaderData)=20*=20BTGetMergeFactor(rel))=20/=20100));=0A=20=09= }=0A=20=0A=20=09if=20(attempt_pagedel)=0Adiff=20--git=20= a/src/backend/access/nbtree/nbtsearch.c=20= b/src/backend/access/nbtree/nbtsearch.c=0Aindex=20= d69798795b4..6ff31f87033=20100644=0A---=20= a/src/backend/access/nbtree/nbtsearch.c=0A+++=20= b/src/backend/access/nbtree/nbtsearch.c=0A@@=20-21,6=20+21,7=20@@=0A=20= #include=20"miscadmin.h"=0A=20#include=20"pgstat.h"=0A=20#include=20= "storage/predicate.h"=0A+#include=20"utils/injection_point.h"=0A=20= #include=20"utils/lsyscache.h"=0A=20#include=20"utils/rel.h"=0A=20=0A@@=20= -2174,6=20+2175,8=20@@=20_bt_steppage(IndexScanDesc=20scan,=20= ScanDirection=20dir)=0A=20=09if=20(so->numKilled=20>=200)=0A=20=09=09= _bt_killitems(scan);=0A=20=0A+=0A+=0A=20=09/*=0A=20=09=20*=20Before=20we=20= modify=20currPos,=20make=20a=20copy=20of=20the=20page=20data=20if=20= there=20was=20a=0A=20=09=20*=20mark=20position=20that=20needs=20it.=0A@@=20= -2222,6=20+2225,22=20@@=20_bt_steppage(IndexScanDesc=20scan,=20= ScanDirection=20dir)=0A=20=0A=20=09BTScanPosUnpinIfPinned(so->currPos);=0A= =20=0A+=09/*=0A+=09=20*=20Injection=20point=20for=20testing=20scan=20= correctness=20during=20page=20merge.=0A+=09=20*=20This=20allows=20tests=20= to=20pause=20scans=20between=20pages=20to=20simulate=20concurrent=0A+=09=20= *=20page=20operations.=20Only=20pause=20scans=20on=20the=20specific=20= test=20index.=0A+=09=20*=20We=20pause=20here=20after=20unpinning=20the=20= current=20buffer=20to=20avoid=20blocking=20VACUUM.=0A+=09=20*/=0A+=09{=0A= +=09=09Relation=20rel=20=3D=20scan->indexRelation;=0A+=09=09BlockNumber=20= blkno=20=3D=20so->currPos.currPage;=0A+=09=09/*=20Only=20pause=20scans=20= on=20our=20test=20index=20(btree_test_idx=20has=20OID=20around=2016400+)=20= */=0A+=09=09if=20(rel=20&&=20RelationGetRelid(rel)=20>=2016384=20&&=20= blkno=20=3D=3D=2020)=0A+=09=09{=0A+=09=09=09= INJECTION_POINT("btree-scan-between-pages",=20&blkno);=0A+=09=09}=0A+=09= }=0A+=0A=20=09/*=20Walk=20to=20the=20next=20page=20with=20data=20*/=0A=20= =09if=20(ScanDirectionIsForward(dir))=0A=20=09=09blkno=20=3D=20= so->currPos.nextPage;=0Adiff=20--git=20= a/src/backend/access/nbtree/nbtutils.c=20= b/src/backend/access/nbtree/nbtutils.c=0Aindex=20= 9aed207995f..30b986f35be=20100644=0A---=20= a/src/backend/access/nbtree/nbtutils.c=0A+++=20= b/src/backend/access/nbtree/nbtutils.c=0A@@=20-3649,7=20+3649,8=20@@=20= btoptions(Datum=20reloptions,=20bool=20validate)=0A=20=09=09= {"vacuum_cleanup_index_scale_factor",=20RELOPT_TYPE_REAL,=0A=20=09=09= offsetof(BTOptions,=20vacuum_cleanup_index_scale_factor)},=0A=20=09=09= {"deduplicate_items",=20RELOPT_TYPE_BOOL,=0A-=09=09offsetof(BTOptions,=20= deduplicate_items)}=0A+=09=09offsetof(BTOptions,=20deduplicate_items)},=0A= +=09=09{"mergefactor",=20RELOPT_TYPE_INT,=20offsetof(BTOptions,=20= mergefactor)}=0A=20=09};=0A=20=0A=20=09return=20(bytea=20*)=20= build_reloptions(reloptions,=20validate,=0Adiff=20--git=20= a/src/backend/access/nbtree/nbtxlog.c=20= b/src/backend/access/nbtree/nbtxlog.c=0Aindex=20d31dd56732d..7ee05d122f8=20= 100644=0A---=20a/src/backend/access/nbtree/nbtxlog.c=0A+++=20= b/src/backend/access/nbtree/nbtxlog.c=0A@@=20-813,6=20+813,8=20@@=20= btree_xlog_unlink_page(uint8=20info,=20XLogReaderState=20*record)=0A=20=09= Buffer=09=09rightbuf;=0A=20=09Page=09=09page;=0A=20=09BTPageOpaque=20= pageop;=0A+=09IndexTuple=20*merge_tuples;=0A+=09uint16=09=09= merge_ntuples;=0A=20=0A=20=09leftsib=20=3D=20xlrec->leftsib;=0A=20=09= rightsib=20=3D=20xlrec->rightsib;=0A@@=20-823,6=20+825,10=20@@=20= btree_xlog_unlink_page(uint8=20info,=20XLogReaderState=20*record)=0A=20=09= /*=20No=20leaftopparent=20for=20level=200=20(leaf=20page)=20or=20level=20= 1=20target=20*/=0A=20=09Assert(!BlockNumberIsValid(xlrec->leaftopparent)=20= ||=20level=20>=201);=0A=20=0A+=09/*=20Initialize=20merge=20variables=20= */=0A+=09merge_tuples=20=3D=20NULL;=0A+=09merge_ntuples=20=3D=200;=0A+=0A= =20=09/*=0A=20=09=20*=20In=20normal=20operation,=20we=20would=20lock=20= all=20the=20pages=20this=20WAL=20record=0A=20=09=20*=20touches=20before=20= changing=20any=20of=20them.=20=20In=20WAL=20replay,=20we=20at=20least=20= lock=0A@@=20-847,12=20+853,57=20@@=20btree_xlog_unlink_page(uint8=20= info,=20XLogReaderState=20*record)=0A=20=09else=0A=20=09=09leftbuf=20=3D=20= InvalidBuffer;=0A=20=0A-=09/*=20Rewrite=20target=20page=20as=20empty=20= deleted=20page=20*/=0A-=09target=20=3D=20XLogInitBufferForRedo(record,=20= 0);=0A-=09page=20=3D=20(Page)=20BufferGetPage(target);=0A+=09/*=0A+=09=20= *=20Handle=20target=20page.=20For=20page=20merges,=20we=20first=20read=20= existing=20content=0A+=09=20*=20to=20extract=20tuples,=20then=20= reinitialize.=20For=20simple=20deletions,=20we=20can=0A+=09=20*=20= initialize=20directly.=0A+=09=20*/=0A+=09if=20= (XLogReadBufferForRedo(record,=200,=20&target)=20=3D=3D=20= BLK_NEEDS_REDO)=0A+=09{=0A+=09=09page=20=3D=20(Page)=20= BufferGetPage(target);=0A+=09=09pageop=20=3D=20BTPageGetOpaque(page);=0A=20= =0A-=09_bt_pageinit(page,=20BufferGetPageSize(target));=0A-=09pageop=20=3D= =20BTPageGetOpaque(page);=0A+=09=09/*=20Check=20if=20this=20is=20a=20= leaf=20page=20merge=20case=20*/=0A+=09=09if=20(isleaf=20&&=20rightsib=20= !=3D=20P_NONE=20&&=20!P_RIGHTMOST(pageop))=0A+=09=09{=0A+=09=09=09= OffsetNumber=20maxoff=20=3D=20PageGetMaxOffsetNumber(page);=0A+=09=09=09= OffsetNumber=20minoff=20=3D=20P_FIRSTDATAKEY(pageop);=0A+=0A+=09=09=09/*=20= If=20page=20has=20tuples,=20save=20them=20for=20merging=20*/=0A+=09=09=09= if=20(minoff=20<=3D=20maxoff)=0A+=09=09=09{=0A+=09=09=09=09OffsetNumber=20= offnum;=0A+=09=09=09=09uint16=09=09i=20=3D=200;=0A+=0A+=09=09=09=09= merge_ntuples=20=3D=20maxoff=20-=20minoff=20+=201;=0A+=09=09=09=09= merge_tuples=20=3D=20(IndexTuple=20*)=20palloc(merge_ntuples=20*=20= sizeof(IndexTuple));=0A+=0A+=09=09=09=09/*=20Copy=20all=20tuples=20from=20= the=20page=20*/=0A+=09=09=09=09for=20(offnum=20=3D=20minoff;=20offnum=20= <=3D=20maxoff;=20offnum++)=0A+=09=09=09=09{=0A+=09=09=09=09=09ItemId=09=09= itemid=20=3D=20PageGetItemId(page,=20offnum);=0A+=09=09=09=09=09= IndexTuple=09tuple=20=3D=20(IndexTuple)=20PageGetItem(page,=20itemid);=0A= +=09=09=09=09=09Size=09=09tupsz=20=3D=20IndexTupleSize(tuple);=0A+=0A+=09= =09=09=09=09merge_tuples[i]=20=3D=20(IndexTuple)=20palloc(tupsz);=0A+=09=09= =09=09=09memcpy(merge_tuples[i],=20tuple,=20tupsz);=0A+=09=09=09=09=09= i++;=0A+=09=09=09=09}=0A+=09=09=09}=0A+=09=09}=0A+=0A+=09=09/*=20Now=20= reinitialize=20target=20page=20as=20empty=20deleted=20page=20*/=0A+=09=09= _bt_pageinit(page,=20BufferGetPageSize(target));=0A+=09=09pageop=20=3D=20= BTPageGetOpaque(page);=0A+=09}=0A+=09else=0A+=09{=0A+=09=09/*=20Page=20= doesn't=20need=20redo,=20but=20we=20still=20need=20to=20get=20the=20= buffer=20*/=0A+=09=09target=20=3D=20XLogInitBufferForRedo(record,=200);=0A= +=09=09page=20=3D=20(Page)=20BufferGetPage(target);=0A+=09=09= _bt_pageinit(page,=20BufferGetPageSize(target));=0A+=09=09pageop=20=3D=20= BTPageGetOpaque(page);=0A+=09}=0A=20=0A=20=09pageop->btpo_prev=20=3D=20= leftsib;=0A=20=09pageop->btpo_next=20=3D=20rightsib;=0A@@=20-865,17=20= +916,36=20@@=20btree_xlog_unlink_page(uint8=20info,=20XLogReaderState=20= *record)=0A=20=09PageSetLSN(page,=20lsn);=0A=20=09= MarkBufferDirty(target);=0A=20=0A-=09/*=20Fix=20left-link=20of=20right=20= sibling=20*/=0A+=09/*=20Fix=20left-link=20of=20right=20sibling=20and=20= replay=20merged=20tuples=20if=20any=20*/=0A=20=09if=20= (XLogReadBufferForRedo(record,=202,=20&rightbuf)=20=3D=3D=20= BLK_NEEDS_REDO)=0A=20=09{=0A=20=09=09page=20=3D=20(Page)=20= BufferGetPage(rightbuf);=0A=20=09=09pageop=20=3D=20= BTPageGetOpaque(page);=0A=20=09=09pageop->btpo_prev=20=3D=20leftsib;=0A=20= =0A+=09=09/*=0A+=09=09=20*=20Note:=20If=20tuples=20were=20merged=20= during=20the=20original=20operation,=20the=20right=0A+=09=09=20*=20= sibling=20buffer=20was=20registered=20with=20REGBUF_FORCE_IMAGE,=20so=20= the=20page=0A+=09=09=20*=20is=20automatically=20restored=20to=20its=20= final=20post-merge=20state.=20No=20explicit=0A+=09=09=20*=20tuple=20= insertion=20is=20needed=20during=20replay.=0A+=09=09=20*=0A+=09=09=20*=20= For=20non-merge=20operations,=20we=20only=20need=20to=20update=20the=20= left-link.=0A+=09=09=20*/=0A+=0A=20=09=09PageSetLSN(page,=20lsn);=0A=20=09= =09MarkBufferDirty(rightbuf);=0A=20=09}=0A=20=0A+=09/*=20Clean=20up=20= merge=20tuples=20*/=0A+=09if=20(merge_tuples=20!=3D=20NULL)=0A+=09{=0A+=09= =09uint16=09=09i;=0A+=0A+=09=09for=20(i=20=3D=200;=20i=20<=20= merge_ntuples;=20i++)=0A+=09=09=09pfree(merge_tuples[i]);=0A+=09=09= pfree(merge_tuples);=0A+=09}=0A+=0A=20=09/*=20Release=20siblings=20*/=0A=20= =09if=20(BufferIsValid(leftbuf))=0A=20=09=09= UnlockReleaseBuffer(leftbuf);=0Adiff=20--git=20= a/src/include/access/nbtree.h=20b/src/include/access/nbtree.h=0Aindex=20= e709d2e0afe..6fd1bcd142d=20100644=0A---=20a/src/include/access/nbtree.h=0A= +++=20b/src/include/access/nbtree.h=0A@@=20-201,6=20+201,7=20@@=20= typedef=20struct=20BTMetaPageData=0A=20#define=20= BTREE_DEFAULT_FILLFACTOR=0990=0A=20#define=20BTREE_NONLEAF_FILLFACTOR=09= 70=0A=20#define=20BTREE_SINGLEVAL_FILLFACTOR=0996=0A+#define=20= BTREE_DEFAULT_MERGEFACTOR=095=0A=20=0A=20/*=0A=20=20*=09In=20general,=20= the=20btree=20code=20tries=20to=20localize=20its=20knowledge=20about=0A= @@=20-1153,6=20+1154,7=20@@=20typedef=20struct=20BTOptions=0A=20=09int=09= =09=09fillfactor;=09=09/*=20page=20fill=20factor=20in=20percent=20= (0..100)=20*/=0A=20=09float8=09=09vacuum_cleanup_index_scale_factor;=09= /*=20deprecated=20*/=0A=20=09bool=09=09deduplicate_items;=09/*=20Try=20= to=20deduplicate=20items?=20*/=0A+=09int=09=09=09mergefactor;=09/*=20= page=20merge=20factor=20in=20percent=20(0..100)=20*/=0A=20}=20BTOptions;=0A= =20=0A=20#define=20BTGetFillFactor(relation)=20\=0A@@=20-1168,6=20= +1170,12=20@@=20typedef=20struct=20BTOptions=0A=20=09=09=09=09=20= relation->rd_rel->relam=20=3D=3D=20BTREE_AM_OID),=20\=0A=20=09= ((relation)->rd_options=20?=20\=0A=20=09=20((BTOptions=20*)=20= (relation)->rd_options)->deduplicate_items=20:=20true))=0A+#define=20= BTGetMergeFactor(relation)=20\=0A+=09= (AssertMacro(relation->rd_rel->relkind=20=3D=3D=20RELKIND_INDEX=20&&=20\=0A= +=09=09=09=09=20relation->rd_rel->relam=20=3D=3D=20BTREE_AM_OID),=20\=0A= +=09=20(relation)->rd_options=20?=20\=0A+=09=20((BTOptions=20*)=20= (relation)->rd_options)->mergefactor=20:=20\=0A+=09=20= BTREE_DEFAULT_MERGEFACTOR)=0A=20=0A=20/*=0A=20=20*=20Constant=20= definition=20for=20progress=20reporting.=20=20Phase=20numbers=20must=20= match=0Adiff=20--git=20a/src/include/access/nbtxlog.h=20= b/src/include/access/nbtxlog.h=0Aindex=20fbbe115c771..5ba5eb69b1d=20= 100644=0A---=20a/src/include/access/nbtxlog.h=0A+++=20= b/src/include/access/nbtxlog.h=0A@@=20-325,7=20+325,9=20@@=20typedef=20= struct=20xl_btree_unlink_page=0A=20=09BlockNumber=20leafrightsib;=0A=20=09= BlockNumber=20leaftopparent;=09/*=20next=20child=20down=20in=20the=20= subtree=20*/=0A=20=0A-=09/*=20xl_btree_metadata=20FOLLOWS=20IF=20= XLOG_BTREE_UNLINK_PAGE_META=20*/=0A+=09/*=0A+=09=20*=20xl_btree_metadata=20= FOLLOWS=20IF=20XLOG_BTREE_UNLINK_PAGE_META=0A+=09=20*/=0A=20}=20= xl_btree_unlink_page;=0A=20=0A=20#define=20SizeOfBtreeUnlinkPage=09= (offsetof(xl_btree_unlink_page,=20leaftopparent)=20+=20= sizeof(BlockNumber))=0Adiff=20--git=20= a/src/test/modules/test_misc/t/008_btree_merge_scan_correctness.pl=20= b/src/test/modules/test_misc/t/008_btree_merge_scan_correctness.pl=0Anew=20= file=20mode=20100644=0Aindex=2000000000000..abb40f6a4ff=0A---=20= /dev/null=0A+++=20= b/src/test/modules/test_misc/t/008_btree_merge_scan_correctness.pl=0A@@=20= -0,0=20+1,137=20@@=0A+#=20Copyright=20(c)=202025,=20PostgreSQL=20Global=20= Development=20Group=0A+=0A+#=20This=20test=20verifies=20scan=20= correctness=20during=20B-tree=20page=20merge=20operations.=0A+#=20It=20= demonstrates=20the=20race=20condition=20where=20moving=20tuples=20= between=20pages=20during=0A+#=20merge=20can=20cause=20forward=20scans=20= to=20see=20duplicates=20and=20backward=20scans=20to=20miss=20tuples.=0A+=0A= +use=20strict;=0A+use=20warnings=20FATAL=20=3D>=20'all';=0A+=0A+use=20= PostgreSQL::Test::Cluster;=0A+use=20PostgreSQL::Test::Utils;=0A+use=20= Test::More;=0A+=0A+#=20Check=20if=20injection=20points=20are=20available=20= =0A+if=20(!defined($ENV{enable_injection_points})=20||=20= $ENV{enable_injection_points}=20ne=20'yes')=0A+{=0A+=09plan=20skip_all=20= =3D>=20'Injection=20points=20not=20supported=20by=20this=20build';=0A+}=0A= +=0A+my=20$node=20=3D=20= PostgreSQL::Test::Cluster->new('btree_merge_scan_test');=0A+$node->init;=0A= +$node->append_conf('postgresql.conf',=20=0A+=09= 'shared_preload_libraries=20=3D=20\'injection_points\'');=0A= +$node->append_conf('postgresql.conf',=20'autovacuum=20=3D=20off');=0A= +$node->start;=0A+=0A+$node->safe_psql('postgres',=20'CREATE=20EXTENSION=20= injection_points');=0A+=0A+#=20Create=20test=20table=20and=20index=20= with=20conditions=20that=20should=20trigger=20merge=0A= +$node->safe_psql('postgres',=20q{=0A+=09CREATE=20TABLE=20btree_test=20(=0A= +=09=09id=20INTEGER,=0A+=09=09data=20TEXT=0A+=09);=0A+=09=0A+=09--=20= Insert=20data=20to=20create=20multiple=20pages=20(more=20data=20for=20= multi-page=20index)=0A+=09INSERT=20INTO=20btree_test=20=0A+=09SELECT=20= i,=20'data_'=20||=20i=20=0A+=09FROM=20generate_series(1,=205000)=20i;=0A= +=09=0A+=09--=20Create=20index=20with=20low=20fillfactor=20and=20high=20= mergefactor=20to=20encourage=20merging=0A+=09CREATE=20INDEX=20= btree_test_idx=20ON=20btree_test=20(id)=20=0A+=09WITH=20(fillfactor=20=3D=20= 30,=20mergefactor=20=3D=2050);=0A+});=0A+=0A+#=20Check=20index=20OID=20= for=20debugging=20our=20injection=20point=0A+my=20$index_oid=20=3D=20= $node->safe_psql('postgres',=20q{=0A+=09SELECT=20oid=20FROM=20pg_class=20= WHERE=20relname=20=3D=20'btree_test_idx';=0A+});=0A+note("Index=20OID:=20= $index_oid=20(injection=20point=20triggers=20for=20OID=20>=2016384)");=0A= +=0A+#=20Make=20index=20sparse=20to=20create=20merge=20candidates=20=20=0A= +$node->safe_psql('postgres',=20q{=0A+=09--=20Delete=2095%=20of=20rows=20= to=20make=20pages=20very=20sparse=20but=20with=20enough=20remaining=20= data=0A+=09DELETE=20FROM=20btree_test=20WHERE=20id=20%=2020=20!=3D=200;=0A= +});=0A+=0A+#=20Force=20index=20usage=20by=20disabling=20seqscan=0A= +$node->safe_psql('postgres',=20q{=0A+=09SET=20enable_seqscan=20=3D=20= off;=0A+=09SET=20enable_bitmapscan=20=3D=20off;=0A+});=0A+=0A+#=20Create=20= result=20tables=0A+$node->safe_psql('postgres',=20q{=0A+=09CREATE=20= TABLE=20forward_results=20(id=20INTEGER);=0A+=09CREATE=20TABLE=20= backward_results=20(id=20INTEGER);=0A+});=0A+=0A+#=20Set=20up=20= injection=20point=20to=20pause=20scans=20between=20pages=0A= +$node->safe_psql('postgres',=20q{=0A+=09SELECT=20= injection_points_attach('btree-scan-between-pages',=20'wait');=0A+});=0A= +=0A+#=20Launch=20background=20scans=20that=20will=20hit=20injection=20= points=0A+my=20$forward_scan=20=3D=20$node->background_psql('postgres');=0A= +my=20$backward_scan=20=3D=20$node->background_psql('postgres');=0A+=0A= +#=20Start=20queries=20without=20waiting=20for=20completion=20(they'll=20= hang=20at=20injection=20point)=0A+$forward_scan->query_until(qr/starting=20= forward=20scan/,=20q{=0A+=09SET=20enable_seqscan=20=3D=20off;=0A+=09SET=20= enable_bitmapscan=20=3D=20off;=0A+=09\echo=20starting=20forward=20scan=0A= +=09INSERT=20INTO=20forward_results=20(id)=20=0A+=09SELECT=20id=20FROM=20= btree_test=20ORDER=20BY=20id;=0A+=09\echo=20forward=20scan=20completed=0A= +});=0A+=0A+$backward_scan->query_until(qr/starting=20backward=20scan/,=20= q{=0A+=09SET=20enable_seqscan=20=3D=20off;=0A+=09SET=20enable_bitmapscan=20= =3D=20off;=0A+=09\echo=20starting=20backward=20scan=0A+=09INSERT=20INTO=20= backward_results=20(id)=0A+=09SELECT=20id=20FROM=20btree_test=20ORDER=20= BY=20id=20DESC;=0A+=09\echo=20backward=20scan=20completed=0A+});=0A+=0A= +#=20Give=20scans=20time=20to=20start=20and=20pause=20at=20injection=20= point=0A+sleep(1);=0A+=0A+#=20Run=20VACUUM=20while=20scans=20are=20= paused=20-=20this=20may=20trigger=20page=20merge=0A= +$node->safe_psql('postgres',=20q{=0A+=09VACUUM=20btree_test;=0A+});=0A+=0A= +#=20Release=20waiting=20scans=0A+$node->safe_psql('postgres',=20q{=0A+=09= SELECT=20injection_points_detach('btree-scan-between-pages');=0A+=09= SELECT=20injection_points_wakeup('btree-scan-between-pages');=0A+=09= SELECT=20injection_points_wakeup('btree-scan-between-pages');=0A+});=0A+=0A= +$forward_scan->quit;=0A+$backward_scan->quit;=0A+=0A+#=20Analyze=20= results=20for=20scan=20correctness=20issues=0A+my=20$expected_count=20=3D=20= $node->safe_psql('postgres',=20=0A+=09'SELECT=20count(*)=20FROM=20= btree_test');=0A+=0A+my=20$forward_count=20=3D=20= $node->safe_psql('postgres',=0A+=09'SELECT=20count(*)=20FROM=20= forward_results');=0A+=0A+my=20$backward_count=20=3D=20= $node->safe_psql('postgres',=20=0A+=09'SELECT=20count(*)=20FROM=20= backward_results');=0A+=0A+#=20Report=20results=0A+note("Expected=20= rows:=20$expected_count");=0A+note("Forward=20scan=20rows:=20= $forward_count");=0A+note("Backward=20scan=20rows:=20$backward_count");=0A= +=0A+#=20Test=20assertions=20-=20these=20should=20fail=20with=20page=20= merge,=20showing=20the=20race=20condition=0A+is($forward_count,=20= $expected_count,=20'Forward=20scan=20returns=20correct=20count');=0A= +is($backward_count,=20$expected_count,=20'Backward=20scan=20returns=20= correct=20count');=0A+=0A+$node->stop('fast');=0A+=0A+done_testing();=0A= \=20No=20newline=20at=20end=20of=20file=0A--=20=0A2.39.5=20(Apple=20= Git-154)=0A=0A= --Apple-Mail=_76A30C9C-86C3-47A6-8ADC-C164D48DCA06 Content-Disposition: attachment; filename=v2-0002-btree-Add-scan-abort-mechanism-for-page-merge-wit.patch Content-Type: application/octet-stream; x-unix-mode=0644; name="v2-0002-btree-Add-scan-abort-mechanism-for-page-merge-wit.patch" Content-Transfer-Encoding: quoted-printable =46rom=201314d4be1d0d0b5ab472884a01804ae910e3c610=20Mon=20Sep=2017=20= 00:00:00=202001=0AFrom:=20Andrey=20Borodin=20=0ADate:=20= Sun,=2031=20Aug=202025=2010:00:24=20+0500=0ASubject:=20[PATCH=20v2=20= 2/2]=20btree:=20Add=20scan=20abort=20mechanism=20for=20page=20merge=20= with=0A=20tuple=20movement=0A=0AWhen=20B-tree=20page=20merge=20moves=20= tuples=20between=20pages,=20concurrent=20scans=20can=0Amiss=20tuples=20= or=20see=20duplicates.=20This=20adds=20a=20BTP_HAD_TUPLES_MOVED=20flag=20= to=0Amark=20pages=20that=20had=20tuples=20moved=20during=20merge,=20and=20= aborts=20scans=20that=0Aencounter=20such=20deleted=20pages=20with=20a=20= serialization=20failure=20error.=0A=0AThe=20default=20mergefactor=20is=20= set=20to=200=20to=20disable=20merging=20by=20default,=0Aensuring=20no=20= behavior=20change=20unless=20explicitly=20configured.=20Includes=0ATAP=20= test=20demonstrating=20the=20scan=20abort=20mechanism.=0A---=0A=20= src/backend/access/nbtree/nbtpage.c=20=20=20=20=20=20=20=20=20=20=20|=20=20= 4=20++=0A=20src/backend/access/nbtree/nbtsearch.c=20=20=20=20=20=20=20=20= =20|=2061=20++++++++++++++++++-=0A=20src/include/access/nbtree.h=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20|=20=204=20+-=0A=20= .../t/008_btree_merge_scan_correctness.pl=20=20=20=20=20|=2037=20= ++++++-----=0A=204=20files=20changed,=2083=20insertions(+),=2023=20= deletions(-)=0A=0Adiff=20--git=20a/src/backend/access/nbtree/nbtpage.c=20= b/src/backend/access/nbtree/nbtpage.c=0Aindex=20792bc52558b..664f5f62b6b=20= 100644=0A---=20a/src/backend/access/nbtree/nbtpage.c=0A+++=20= b/src/backend/access/nbtree/nbtpage.c=0A@@=20-2818,6=20+2818,7=20@@=20= _bt_unlink_halfdead_page(Relation=20rel,=20Buffer=20leafbuf,=20= BlockNumber=20scanblkno,=0A=20=09=09Page=09=09leafpage=20=3D=20= BufferGetPage(leafbuf);=0A=20=09=09Page=09=09rightpage=20=3D=20= BufferGetPage(rbuf);=0A=20=09=09BTPageOpaque=20rightopaque=20=3D=20= BTPageGetOpaque(rightpage);=0A+=09=09BTPageOpaque=20leafopaque=20=3D=20= BTPageGetOpaque(leafpage);=0A=20=09=09OffsetNumber=20insert_at=20=3D=20= P_FIRSTDATAKEY(rightopaque);=0A=20=09=09int=09=09=09i;=0A=20=0A@@=20= -2838,6=20+2839,9=20@@=20_bt_unlink_halfdead_page(Relation=20rel,=20= Buffer=20leafbuf,=20BlockNumber=20scanblkno,=0A=20=0A=20=09=09= elog(DEBUG1,=20"Page=20merge=20completed=20in=20index=20\"%s\":=20moved=20= %d=20tuples=20from=20page=20%u=20to=20page=20%u",=0A=20=09=09=09=20= RelationGetRelationName(rel),=20merge_ntuples,=20leafblkno,=20rightsib);=0A= +=0A+=09=09/*=20Mark=20that=20this=20page=20had=20tuples=20moved=20for=20= scan=20detection=20*/=0A+=09=09leafopaque->btpo_flags=20|=3D=20= BTP_HAD_TUPLES_MOVED;=0A=20=09}=0A=20=0A=20=09/*=0Adiff=20--git=20= a/src/backend/access/nbtree/nbtsearch.c=20= b/src/backend/access/nbtree/nbtsearch.c=0Aindex=20= 6ff31f87033..28ce65faa71=20100644=0A---=20= a/src/backend/access/nbtree/nbtsearch.c=0A+++=20= b/src/backend/access/nbtree/nbtsearch.c=0A@@=20-2223,6=20+2223,8=20@@=20= _bt_steppage(IndexScanDesc=20scan,=20ScanDirection=20dir)=0A=20=09=09= Assert(!scan->parallel_scan);=0A=20=09}=0A=20=0A+=0A+=0A=20=09= BTScanPosUnpinIfPinned(so->currPos);=0A=20=0A=20=09/*=0A@@=20-2233,11=20= +2235,11=20@@=20_bt_steppage(IndexScanDesc=20scan,=20ScanDirection=20= dir)=0A=20=09=20*/=0A=20=09{=0A=20=09=09Relation=20rel=20=3D=20= scan->indexRelation;=0A-=09=09BlockNumber=20blkno=20=3D=20= so->currPos.currPage;=0A+=09=09BlockNumber=20ip_blkno=20=3D=20= so->currPos.currPage;=0A=20=09=09/*=20Only=20pause=20scans=20on=20our=20= test=20index=20(btree_test_idx=20has=20OID=20around=2016400+)=20*/=0A-=09= =09if=20(rel=20&&=20RelationGetRelid(rel)=20>=2016384=20&&=20blkno=20=3D=3D= =2020)=0A+=09=09if=20(rel=20&&=20RelationGetRelid(rel)=20>=2016384=20&&=20= ip_blkno=20=3D=3D=2020)=0A=20=09=09{=0A-=09=09=09= INJECTION_POINT("btree-scan-between-pages",=20&blkno);=0A+=09=09=09= INJECTION_POINT("btree-scan-between-pages",=20&ip_blkno);=0A=20=09=09}=0A= =20=09}=0A=20=0A@@=20-2446,6=20+2448,19=20@@=20= _bt_readnextpage(IndexScanDesc=20scan,=20BlockNumber=20blkno,=0A=20=09=09= page=20=3D=20BufferGetPage(so->currPos.buf);=0A=20=09=09opaque=20=3D=20= BTPageGetOpaque(page);=0A=20=09=09lastcurrblkno=20=3D=20blkno;=0A+=0A+=09= =09/*=0A+=09=09=20*=20Check=20if=20this=20is=20a=20deleted=20page=20that=20= had=20tuples=20moved=20during=20merge.=0A+=09=09=20*=20If=20so,=20abort=20= the=20scan=20to=20prevent=20incorrect=20results.=0A+=09=09=20*/=0A+=09=09= if=20(P_ISDELETED(opaque)=20&&=20P_HAD_TUPLES_MOVED(opaque))=0A+=09=09{=0A= +=09=09=09ereport(ERROR,=0A+=09=09=09=09=09= (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),=0A+=09=09=09=09=09=20= errmsg("scan=20aborted=20due=20to=20concurrent=20page=20merge=20with=20= tuple=20movement"),=0A+=09=09=09=09=09=20errhint("Retry=20the=20= operation.")));=0A+=09=09}=0A+=0A=20=09=09if=20= (likely(!P_IGNORE(opaque)))=0A=20=09=09{=0A=20=09=09=09/*=20see=20if=20= there=20are=20any=20matches=20on=20this=20page=20*/=0A@@=20-2549,6=20= +2564,20=20@@=20_bt_lock_and_validate_left(Relation=20rel,=20BlockNumber=20= *blkno,=0A=20=09=09=09=09/*=20Found=20desired=20page,=20return=20it=20*/=0A= =20=09=09=09=09return=20buf;=0A=20=09=09=09}=0A+=0A+=09=09=09/*=0A+=09=09= =09=20*=20Check=20if=20this=20is=20a=20deleted=20page=20that=20had=20= tuples=20moved=20during=20merge.=0A+=09=09=09=20*=20If=20so,=20abort=20= the=20scan=20to=20prevent=20incorrect=20results.=0A+=09=09=09=20*/=0A+=09= =09=09if=20(P_ISDELETED(opaque)=20&&=20P_HAD_TUPLES_MOVED(opaque))=0A+=09= =09=09{=0A+=09=09=09=09_bt_relbuf(rel,=20buf);=0A+=09=09=09=09= ereport(ERROR,=0A+=09=09=09=09=09=09= (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),=0A+=09=09=09=09=09=09=20= errmsg("scan=20aborted=20due=20to=20concurrent=20page=20merge=20with=20= tuple=20movement"),=0A+=09=09=09=09=09=09=20errhint("Retry=20the=20= operation.")));=0A+=09=09=09}=0A+=0A=20=09=09=09if=20= (P_RIGHTMOST(opaque)=20||=20++tries=20>=204)=0A=20=09=09=09=09break;=0A=20= =09=09=09/*=20step=20right=20*/=0A@@=20-2568,6=20+2597,19=20@@=20= _bt_lock_and_validate_left(Relation=20rel,=20BlockNumber=20*blkno,=0A=20=09= =09opaque=20=3D=20BTPageGetOpaque(page);=0A=20=09=09if=20= (P_ISDELETED(opaque))=0A=20=09=09{=0A+=09=09=09/*=0A+=09=09=09=20*=20= Check=20if=20this=20is=20a=20deleted=20page=20that=20had=20tuples=20= moved=20during=20merge.=0A+=09=09=09=20*=20If=20so,=20abort=20the=20scan=20= to=20prevent=20incorrect=20results.=0A+=09=09=09=20*/=0A+=09=09=09if=20= (P_HAD_TUPLES_MOVED(opaque))=0A+=09=09=09{=0A+=09=09=09=09= _bt_relbuf(rel,=20buf);=0A+=09=09=09=09ereport(ERROR,=0A+=09=09=09=09=09=09= (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),=0A+=09=09=09=09=09=09=20= errmsg("scan=20aborted=20due=20to=20concurrent=20page=20merge=20with=20= tuple=20movement"),=0A+=09=09=09=09=09=09=20errhint("Retry=20the=20= operation.")));=0A+=09=09=09}=0A+=0A=20=09=09=09/*=0A=20=09=09=09=20*=20= It=20was=20deleted.=20=20Move=20right=20to=20first=20nondeleted=20page=20= (there=0A=20=09=09=09=20*=20must=20be=20one);=20that=20is=20the=20page=20= that=20has=20acquired=20the=20deleted=0A@@=20-2585,6=20+2627,19=20@@=20= _bt_lock_and_validate_left(Relation=20rel,=20BlockNumber=20*blkno,=0A=20=09= =09=09=09opaque=20=3D=20BTPageGetOpaque(page);=0A=20=09=09=09=09if=20= (!P_ISDELETED(opaque))=0A=20=09=09=09=09=09break;=0A+=0A+=09=09=09=09/*=0A= +=09=09=09=09=20*=20Check=20if=20this=20is=20a=20deleted=20page=20that=20= had=20tuples=20moved=20during=20merge.=0A+=09=09=09=09=20*=20If=20so,=20= abort=20the=20scan=20to=20prevent=20incorrect=20results.=0A+=09=09=09=09=20= */=0A+=09=09=09=09if=20(P_HAD_TUPLES_MOVED(opaque))=0A+=09=09=09=09{=0A+=09= =09=09=09=09_bt_relbuf(rel,=20buf);=0A+=09=09=09=09=09ereport(ERROR,=0A+=09= =09=09=09=09=09=09(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),=0A+=09=09=09= =09=09=09=09=20errmsg("scan=20aborted=20due=20to=20concurrent=20page=20= merge=20with=20tuple=20movement"),=0A+=09=09=09=09=09=09=09=20= errhint("Retry=20the=20operation.")));=0A+=09=09=09=09}=0A=20=09=09=09}=0A= =20=09=09}=0A=20=09=09else=0Adiff=20--git=20= a/src/include/access/nbtree.h=20b/src/include/access/nbtree.h=0Aindex=20= 6fd1bcd142d..041bf0596ae=20100644=0A---=20a/src/include/access/nbtree.h=0A= +++=20b/src/include/access/nbtree.h=0A@@=20-83,6=20+83,7=20@@=20typedef=20= BTPageOpaqueData=20*BTPageOpaque;=0A=20#define=20BTP_HAS_GARBAGE=20(1=20= <<=206)=09/*=20page=20has=20LP_DEAD=20tuples=20(deprecated)=20*/=0A=20= #define=20BTP_INCOMPLETE_SPLIT=20(1=20<<=207)=09/*=20right=20sibling's=20= downlink=20is=20missing=20*/=0A=20#define=20BTP_HAS_FULLXID=09(1=20<<=20= 8)=09/*=20contains=20BTDeletedPageData=20*/=0A+#define=20= BTP_HAD_TUPLES_MOVED=20(1=20<<=209)=09/*=20page=20was=20deleted=20after=20= moving=20tuples=20*/=0A=20=0A=20/*=0A=20=20*=20The=20max=20allowed=20= value=20of=20a=20cycle=20ID=20is=20a=20bit=20less=20than=2064K.=20=20= This=20is=0A@@=20-201,7=20+202,7=20@@=20typedef=20struct=20= BTMetaPageData=0A=20#define=20BTREE_DEFAULT_FILLFACTOR=0990=0A=20#define=20= BTREE_NONLEAF_FILLFACTOR=0970=0A=20#define=20BTREE_SINGLEVAL_FILLFACTOR=09= 96=0A-#define=20BTREE_DEFAULT_MERGEFACTOR=095=0A+#define=20= BTREE_DEFAULT_MERGEFACTOR=090=09/*=20Disabled=20by=20default=20for=20= safety=20*/=0A=20=0A=20/*=0A=20=20*=09In=20general,=20the=20btree=20code=20= tries=20to=20localize=20its=20knowledge=20about=0A@@=20-228,6=20+229,7=20= @@=20typedef=20struct=20BTMetaPageData=0A=20#define=20= P_HAS_GARBAGE(opaque)=09(((opaque)->btpo_flags=20&=20BTP_HAS_GARBAGE)=20= !=3D=200)=0A=20#define=20P_INCOMPLETE_SPLIT(opaque)=09= (((opaque)->btpo_flags=20&=20BTP_INCOMPLETE_SPLIT)=20!=3D=200)=0A=20= #define=20P_HAS_FULLXID(opaque)=09(((opaque)->btpo_flags=20&=20= BTP_HAS_FULLXID)=20!=3D=200)=0A+#define=20P_HAD_TUPLES_MOVED(opaque)=20= (((opaque)->btpo_flags=20&=20BTP_HAD_TUPLES_MOVED)=20!=3D=200)=0A=20=0A=20= /*=0A=20=20*=20BTDeletedPageData=20is=20the=20page=20contents=20of=20a=20= deleted=20page=0Adiff=20--git=20= a/src/test/modules/test_misc/t/008_btree_merge_scan_correctness.pl=20= b/src/test/modules/test_misc/t/008_btree_merge_scan_correctness.pl=0A= index=20abb40f6a4ff..ca06cd2cf28=20100644=0A---=20= a/src/test/modules/test_misc/t/008_btree_merge_scan_correctness.pl=0A+++=20= b/src/test/modules/test_misc/t/008_btree_merge_scan_correctness.pl=0A@@=20= -76,7=20+76,7=20@@=20$node->safe_psql('postgres',=20q{=0A=20my=20= $forward_scan=20=3D=20$node->background_psql('postgres');=0A=20my=20= $backward_scan=20=3D=20$node->background_psql('postgres');=0A=20=0A-#=20= Start=20queries=20without=20waiting=20for=20completion=20(they'll=20hang=20= at=20injection=20point)=0A+#=20Start=20queries=20without=20waiting=20for=20= completion=20(they=20should=20abort=20with=20serialization=20error)=0A=20= $forward_scan->query_until(qr/starting=20forward=20scan/,=20q{=0A=20=09= SET=20enable_seqscan=20=3D=20off;=0A=20=09SET=20enable_bitmapscan=20=3D=20= off;=0A@@=20-100,9=20+100,13=20@@=20sleep(1);=0A=20=0A=20#=20Run=20= VACUUM=20while=20scans=20are=20paused=20-=20this=20may=20trigger=20page=20= merge=0A=20$node->safe_psql('postgres',=20q{=0A+=09SET=20= client_min_messages=20TO=20DEBUG1;=0A=20=09VACUUM=20btree_test;=0A=20});=0A= =20=0A+#=20Get=20current=20log=20position=20to=20check=20for=20new=20= errors=0A+my=20$log_offset=20=3D=20-s=20$node->logfile;=0A+=0A=20#=20= Release=20waiting=20scans=0A=20$node->safe_psql('postgres',=20q{=0A=20=09= SELECT=20injection_points_detach('btree-scan-between-pages');=0A@@=20= -110,28=20+114,23=20@@=20$node->safe_psql('postgres',=20q{=0A=20=09= SELECT=20injection_points_wakeup('btree-scan-between-pages');=0A=20});=0A= =20=0A-$forward_scan->quit;=0A-$backward_scan->quit;=0A-=0A-#=20Analyze=20= results=20for=20scan=20correctness=20issues=0A-my=20$expected_count=20=3D=20= $node->safe_psql('postgres',=20=0A-=09'SELECT=20count(*)=20FROM=20= btree_test');=0A+#=20Wait=20for=20scans=20to=20abort=20with=20= serialization=20errors=0A+$node->wait_for_log('scan=20aborted=20due=20to=20= concurrent=20page=20merge=20with=20tuple=20movement',=0A+=09= $log_offset);=0A=20=0A-my=20$forward_count=20=3D=20= $node->safe_psql('postgres',=0A-=09'SELECT=20count(*)=20FROM=20= forward_results');=0A+#=20Clean=20up=20background=20processes=20-=20they=20= should=20have=20failed=0A+$forward_scan->{run}->finish;=0A= +$backward_scan->{run}->finish;=0A=20=0A-my=20$backward_count=20=3D=20= $node->safe_psql('postgres',=20=0A-=09'SELECT=20count(*)=20FROM=20= backward_results');=0A+$node->stop('fast');=0A=20=0A-#=20Report=20= results=0A-note("Expected=20rows:=20$expected_count");=0A-note("Forward=20= scan=20rows:=20$forward_count");=0A-note("Backward=20scan=20rows:=20= $backward_count");=0A+#=20Verify=20that=20scans=20were=20aborted=20by=20= checking=20the=20log=20file=0A+my=20$log_contents=20=3D=20= slurp_file($node->logfile);=0A+my=20$error_count=20=3D=20()=20=3D=20= $log_contents=20=3D~=20/scan=20aborted=20due=20to=20concurrent=20page=20= merge=20with=20tuple=20movement/g;=0A=20=0A-#=20Test=20assertions=20-=20= these=20should=20fail=20with=20page=20merge,=20showing=20the=20race=20= condition=0A-is($forward_count,=20$expected_count,=20'Forward=20scan=20= returns=20correct=20count');=0A-is($backward_count,=20$expected_count,=20= 'Backward=20scan=20returns=20correct=20count');=0A+note("Found=20= $error_count=20scan=20abort=20errors=20in=20log");=0A=20=0A= -$node->stop('fast');=0A+#=20We=20should=20see=20at=20least=20two=20scan=20= abort=20error=20(possibly=20two,=20one=20for=20each=20scan)=0A= +ok($error_count=20>=3D=202,=20'At=20least=20twp=20scan=20was=20aborted=20= due=20to=20tuple=20movement');=0A=20=0A=20done_testing();=0A\=20No=20= newline=20at=20end=20of=20file=0A--=20=0A2.39.5=20(Apple=20Git-154)=0A=0A= --Apple-Mail=_76A30C9C-86C3-47A6-8ADC-C164D48DCA06--