MIME-Version: 1.0
References: 
 <CAAAe_zCvHz=tHkggP37OoQH8R0ux4j_CJdBTiPUg-L=6cVMyWg@mail.gmail.com>
 <20260309.120802.1845076903520202301.ishii@postgresql.org>
 <CAAAe_zAn2nFgM_gfsEDYu+MXCArRFoP6s9bRz2bP4X5HNmnYww@mail.gmail.com>
 <20260309.142202.1739855502263731478.ishii@postgresql.org>
In-Reply-To: <20260309.142202.1739855502263731478.ishii@postgresql.org>
From: SungJun Jang <sjjang112233@gmail.com>
Date: Wed, 11 Mar 2026 13:37:59 +0900
Message-ID: 
 <CAE+cgNgVWChqF-f-s4zT18V+oK1y9UOvORq9JX7jFKj4=r_taw@mail.gmail.com>
Subject: Re: Row pattern recognition
To: Tatsuo Ishii <ishii@postgresql.org>, assam258@gmail.com
Cc: vik@postgresfriends.org, er@xs4all.nl, jacob.champion@enterprisedb.com,
	david.g.johnston@gmail.com, peter@eisentraut.org,
	pgsql-hackers@postgresql.org
Content-Type: multipart/alternative; boundary="00000000000016ac39064cb83402"
Archived-At: 
 <https://www.postgresql.org/message-id/CAE%2BcgNgVWChqF-f-s4zT18V%2BoK1y9UOvORq9JX7jFKj4%3Dr_taw%40mail.gmail.com>
Precedence: bulk

--00000000000016ac39064cb83402
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi hackers,

I ran a cross-validation test comparing PostgreSQL RPR (current patch) and
Trino MATCH_RECOGNIZE to verify both result correctness and performance
characteristics across different scales.

The dataset consists of two segments where rows are arranged sequentially
as:

A → B → C → D

Each segment contains a fixed distribution of categories A/B/C/D. Category
E does not exist in the dataset and is used to test match failure behavior.

Test scales ranged from 20,000 to 100,000 rows.

All tests were executed using:

AFTER MATCH SKIP PAST LAST ROW

Test Environment

Item        | PostgreSQL           | Trino
------------+----------------------+-----------------------
Version     | 19devel (RPR patch)  | 471
Runtime     | Local                | Local Docker container
Platform    | Linux x86_64         | Linux x86_64
RPR syntax  | WINDOW clause        | MATCH_RECOGNIZE

Dataset Structure

Each segment contains sequential categories:

A : ~1/3 of rows
B : ~1/3 of rows
C : ~1/3 of rows
D : 1 row

Example at 1x scale:

Segment size : 10,000 rows
Total rows : 20,000
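
As a rough illustration of the layout described above, the dataset can be
sketched as below. The exact split is an assumption: the message only says
"~1/3" per category, so make_segment and make_dataset are hypothetical
helpers that divide the rows evenly and put any remainder into the A run.

```python
from collections import Counter


def make_segment(segment_size):
    """Return one segment: runs of A, B and C followed by a single D row."""
    run, extra = divmod(segment_size - 1, 3)
    # Assumption: leftover rows (if any) are added to the A run.
    return ["A"] * (run + extra) + ["B"] * run + ["C"] * run + ["D"]


def make_dataset(scale, base_segment_size=10_000, segments=2):
    """Concatenate `segments` segments, each scaled by `scale`."""
    rows = []
    for _ in range(segments):
        rows.extend(make_segment(scale * base_segment_size))
    return rows


rows = make_dataset(scale=1)
counts = Counter(rows)
print(len(rows), dict(counts))
```

Category E never appears in the generated rows, which is what makes the
Test 2 pattern fail on every attempt.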

Test Cases

Test 1
Pattern:
PATTERN (A+ B+ C+ D)

Expected result:
1 match per segment (2 total)

Reason:
Data is arranged exactly as A → B → C → D.

Test 2
Pattern:
PATTERN (A+ B+ C+ E)

Expected result:
0 rows (no match)

Reason:
Category E does not exist in the dataset.

Correctness Results

Across all scales both systems returned identical results.

Test 1 → 1 match per segment (2 total)
Test 2 → 0 rows

Performance Results

Scale | Total Rows | PG Test1 | PG Test2 | Trino Test1 | Trino Test2
------+------------+----------+----------+-------------+-------------
1x    |  20,000    | 19 ms    | 17 ms    | ~0.3 s      | ~358 s
2x    |  40,000    | 37 ms    | 37 ms    | ~0.8 s      | ~1,364 s
3x    |  60,000    | 57 ms    | 51 ms    | ~1.5 s      | ~4,424 s
4x    |  80,000    | 73 ms    | 68 ms    | ~2.3 s      | ~9,989 s
5x    | 100,000    | 99 ms    | 92 ms    | ~3.3 s      | ~20,014 s

Note:
Trino measurements are adjusted to remove JVM startup overhead (~2.4s
measured via SELECT 1).

Observations

Match success (Test 1)

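Both systems scale roughly linearly, because each segment is consumed by a
single match and AFTER MATCH SKIP PAST LAST ROW resumes only after the
matched rows. Trino's adjusted timings grow somewhat faster than
PostgreSQL's (~0.3 s → ~3.3 s versus 19 ms → 99 ms), but nowhere near the
growth seen in Test 2.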

PostgreSQL shows lower absolute latency.

Match failure (Test 2)

PostgreSQL maintains near-linear scaling.

This appears to be due to its NFA implementation combined with Context
Absorption, which discards redundant matching contexts when they reach the
same NFA state but start later in the input.

Example:

Ctx1 start=0 length=2
Ctx2 start=1 length=1

Ctx1 subsumes Ctx2, so Ctx2 is discarded.

This keeps the number of active contexts small and prevents quadratic
growth.
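
A toy sketch of the absorption idea follows. It is not the patch's code:
scan() is a hypothetical matcher that assumes distinct pattern variables,
treats every element except the last as a '+' quantifier, and needs at
least two elements. It only counts how many contexts stay alive with and
without absorption; the no-absorption mode is a counting baseline, not a
model of how Trino executes the query.

```python
def scan(rows, pattern, absorb):
    """Toy matcher for patterns like A+ B+ C+ D over category letters.

    Returns (matches, peak_live_contexts, transition_steps).
    """
    last = len(pattern) - 1
    contexts = []  # (start_row, pattern_index), ordered by start_row
    matches, peak, steps = [], 0, 0
    for pos, cat in enumerate(rows):
        survivors, matched = [], None
        for start, state in contexts:
            steps += 1  # one transition attempt per live context
            if cat == pattern[state + 1]:
                state += 1  # advance to the next pattern element
            elif cat != pattern[state]:
                continue  # can neither repeat nor advance: context dies
            if state == last:
                matched = (start, pos)
                break  # contexts are ordered, so the earliest start wins
            survivors.append((start, state))
        if matched:
            matches.append(matched)
            contexts = []  # AFTER MATCH SKIP PAST LAST ROW
            continue
        if cat == pattern[0]:
            survivors.append((pos, 0))  # try a match starting at this row
        if absorb:
            seen, kept = set(), []
            for start, state in survivors:
                if state not in seen:  # keep the earliest start per state
                    seen.add(state)
                    kept.append((start, state))
            survivors = kept
        contexts = survivors
        peak = max(peak, len(contexts))
    return matches, peak, steps


segment = ["A"] * 300 + ["B"] * 300 + ["C"] * 300 + ["D"]
rows = segment * 2
for absorb in (False, True):
    matches, peak, steps = scan(rows, "ABCE", absorb)
    print(f"absorb={absorb} matches={len(matches)} peak={peak} steps={steps}")
```

With absorption, at most one context per pattern state survives each row,
so the work per row is bounded by the pattern length rather than by the
number of rows already seen.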

Measured scaling:

17 ms → 92 ms (1x → 5x)

a 5.4x increase for 5x the rows, which is consistent with O(n).

Trino also uses an NFA approach but appears to lack Context Absorption or a
similar optimization.

As a result, it explores matching contexts from all possible start
positions, leading to rapidly increasing backtracking cost.

Measured scaling:

358 s → 20,014 s (1x → 5x)

This is a roughly 56x increase for 5x the rows. It exceeds the 25x that
O(n²) would predict but stays below the 125x of O(n³), so growth is close
to O(n^2.5) in practice, possibly compounded by JVM memory and GC overhead.
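
One way to check these growth rates is a least-squares fit on a log-log
scale, using the timings from the table above. This is only a sketch:
scaling_exponent is a hypothetical helper, and five data points give a
rough estimate of k in time ~ rows^k, not a proof of complexity.

```python
import math

scales = [1, 2, 3, 4, 5]
timings = {
    "PG Test1 (ms)": [19, 37, 57, 73, 99],
    "PG Test2 (ms)": [17, 37, 51, 68, 92],
    "Trino Test1 (s)": [0.3, 0.8, 1.5, 2.3, 3.3],
    "Trino Test2 (s)": [358, 1364, 4424, 9989, 20014],
}


def scaling_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. k in y ~ x**k."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


exponents = {name: scaling_exponent(scales, ys) for name, ys in timings.items()}
for name, k in exponents.items():
    print(f"{name}: k = {k:.2f}")
```

An exponent near 1 indicates linear scaling, and an exponent between 2 and
3 indicates growth between quadratic and cubic.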

Summary

Correctness

PostgreSQL RPR and Trino MATCH_RECOGNIZE produce identical matching results
across all tested scales.

Performance

Match success:
Both systems scale roughly linearly.

Match failure:
PostgreSQL maintains O(n) scaling while Trino shows O(n²) or worse behavior.

At the largest tested scale:

PostgreSQL : ~92 ms
Trino : ~20,014 s (≈ 5.6 hours)

PostgreSQL is therefore approximately 217,000× faster in this scenario.

Conclusion

Both systems use NFA-based pattern matching.

The key difference appears to be Context Absorption in PostgreSQL, which
removes redundant matching contexts and keeps scaling linear in these tests
even when the pattern fails to match.

For this workload, the optimization avoids the super-linear slowdown that
failed matches otherwise cause, as seen in the Trino results.

Best regards

SungJun

On Mon, Mar 9, 2026 at 2:22 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:

> Hi Henson,
>
> >> Excellent findings!  BTW, I realized that we cannot use $1 of function
> >> in PATTERN clause like: A{$1}.
> >>
> >> ERROR:  42601: syntax error at or near "$1"
> >> LINE 10:         PATTERN (A{$1})
> >>                             ^
> >> LOCATION:  scanner_yyerror, scan.l:1211
> >>
> >> Should we document somewhere?
> >>
> >
> > The PATTERN quantifier {n} only accepts Iconst (integer literal) in the
> > grammar.  When a host variable or function parameter is used (e.g.,
> > A{$1}), the user gets a generic syntax error.
>
> Ok.
>
> > Oracle accepts broader syntax and validates later, producing an error
> > at a later stage rather than a syntax error at parse time.
> >
> > PostgreSQL itself already has precedent for this pattern -- in fact,
> > within the same window clause, frame offset (ROWS/RANGE/GROUPS) accepts
> > a_expr in the grammar and then rejects variables in parse analysis via
> > transformFrameOffset() -> checkExprIsVarFree().
> >
> > I'd lean against documenting this.  The SQL standard already defines
> > the quantifier bound as <unsigned integer literal>, so there is nothing
> > beyond the standard to call out, and documenting what is *not* allowed
> > tends to raise questions that wouldn't otherwise occur to users.
> >
> > Rather, I think accepting a broader grammar and validating later would
> > be the more appropriate response, producing a descriptive error like:
> >
> >   "argument of bounded quantifier must be an integer literal"
> >
> > I can either include this in the current patch set or handle it as a
> > separate follow-up after the main series is committed.  What do you
> > think?
>
> I think handling it as a separate follow-up after the commit is enough
> unless other developers complain.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>

--00000000000016ac39064cb83402--