Date: Mon, 09 Feb 2026 16:05:20 +0900 (JST)
Message-Id: <20260209.160520.873483599710391856.ishii@postgresql.org>
To: assam258@gmail.com
Cc: jacob.champion@enterprisedb.com, david.g.johnston@gmail.com,
 vik@postgresfriends.org, er@xs4all.nl, peter@eisentraut.org,
 pgsql-hackers@postgresql.org
Subject: Re: Row pattern recognition
From: Tatsuo Ishii <ishii@postgresql.org>
In-Reply-To: <CAAAe_zBpgH_06Utd9r1rVYDLaF-UuTtT1_W69bVeEP4MBzvDBQ@mail.gmail.com>
References: <20260207.133143.1951390390459539453.ishii@postgresql.org>
	<CAAAe_zDoO47FzUBHQU-0CgYD63boZNi-BJSG5Q87Jr5w5bGcdw@mail.gmail.com>
	<CAAAe_zBpgH_06Utd9r1rVYDLaF-UuTtT1_W69bVeEP4MBzvDBQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Archived-At: <https://www.postgresql.org/message-id/20260209.160520.873483599710391856.ishii%40postgresql.org>
Precedence: bulk

Hi Henson,

> Sorry, I forgot to attach the patch file in my previous email.

Thanks for the patch!  Patch cleanly applied and regression test
passed here.

>> On a side note, I feel that the automata characteristics of RPR --
>> state transitions, context management, absorbability, etc. -- tend to
>> require a much larger volume of test cases than typical RDBMS/SQL
>> features.  This seems unavoidable given the combinatorial nature of
>> NFA execution.

I agree. A complex feature requires complex tests. We cannot avoid it.

BTW, I have tested RPR with large partition/reduced frame (up to 100k
rows). Following query took over 3 minutes on my laptop.

EXPLAIN (ANALYZE)
SELECT aid, count(*) OVER w
FROM generate_series(1,100000) AS g(aid) WHERE aid <= 100000
WINDOW w AS (
ORDER BY aid
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
AFTER MATCH SKIP PAST LAST ROW
INITIAL
PATTERN (START UP+)
DEFINE START AS TRUE, UP AS aid > PREV(aid)
);
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=4337.40..4504.08 rows=33333 width=14) (actual time=224171.589..224320.055 rows=100000.00 loops=1)
   Window: w AS (ORDER BY aid ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
   Pattern: start up+
   Storage: Disk  Maximum Storage: 4096kB
   NFA States: 200000 peak, 5000050001 total, 0 merged
   NFA Contexts: 100001 peak, 100001 total, 1 pruned
   NFA: 1 matched (len 100000/100000/100000.0), 0 mismatched
   NFA: 0 absorbed, 99998 skipped (len 2/99999/50000.5)
   Buffers: shared hit=3, temp read=400968 written=391
   ->  Sort  (cost=3754.09..3837.42 rows=33333 width=4) (actual time=14.938..38.911 rows=100000.00 loops=1)
         Sort Key: aid
         Sort Method: quicksort  Memory: 3073kB
         Buffers: shared hit=3, temp read=171 written=171
         ->  Function Scan on generate_series g  (cost=0.00..1250.00 rows=33333 width=4) (actual time=5.799..11.672 rows=100000.00 loops=1)
               Filter: (aid <= 100000)
               Buffers: temp read=171 written=171
 Planning:
   Buffers: shared hit=6 read=7
 Planning Time: 0.185 ms
 Execution Time: 224324.532 ms
 
It ran without eating all memory on my box (according to top command,
it was around 200MB). Great! However it took nearly 4 minutes to
complete the query.  I think this is because there are 100k rows
partition/reduced frame, and RPR needs to keep 100k contexts plus 200k
states at the peak. I also ran "perf top" and took following profile.

  19.87%  postgres                               [.] nfa_match
  18.92%  postgres                               [.] nfa_advance_state
  13.60%  postgres                               [.] nfa_cleanup_dead_contexts
  11.25%  postgres                               [.] row_is_in_reduced_frame
  11.06%  postgres                               [.] nfa_state_clone
   5.52%  postgres                               [.] nfa_route_to_elem
   4.39%  postgres                               [.] nfa_add_state_unique.isra.0
   4.32%  postgres                               [.] nfa_state_alloc
   2.36%  libc.so.6                              [.] __memset_avx2_unaligned_erms
   0.63%  postgres                               [.] memset@plt

The reason for nfa_match is called so many times is, it's in a a loop
which iterate over all the contexts:

	for (ctx = winstate->nfaContext; ctx != NULL; ctx = ctx->next)

I don't try to say we should enhance nfa modules in the case when
there is a large reduced frame, because I don't think this is a
realistic use case and I doubt it's worth to fight against unrealistic
scenario.

Note that even if the partition size is large, if we change the PATTERN to:

PATTERN (START UP{,3})

which generates only 4 rows reduced frames, we get very good
performance:

test=# test=# test=# EXPLAIN (ANALYZE)
SELECT aid, count(*) OVER w
FROM generate_series(1,100000) AS g(aid) WHERE aid <= 100000
WINDOW w AS (
ORDER BY aid
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
AFTER MATCH SKIP PAST LAST ROW
INITIAL
PATTERN (START UP{,3})
DEFINE START AS TRUE, UP AS aid > PREV(aid)
);
---------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=4337.40..4504.08 rows=33333 width=14) (actual time=36.586..91.988 rows=100000.00 loops=1)
   Window: w AS (ORDER BY aid ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
   Pattern: start up{0,3}
   Storage: Memory  Maximum Storage: 17kB
   NFA States: 8 peak, 325001 total, 0 merged
   NFA Contexts: 5 peak, 100001 total, 0 pruned
   NFA: 25000 matched (len 4/4/4.0), 0 mismatched
   NFA: 0 absorbed, 75000 skipped (len 1/3/2.0)
   Buffers: temp read=171 written=171
   ->  Sort  (cost=3754.09..3837.42 rows=33333 width=4) (actual time=36.561..40.933 rows=100000.00 loops=1)
         Sort Key: aid
         Sort Method: quicksort  Memory: 3073kB
         Buffers: temp read=171 written=171
         ->  Function Scan on generate_series g  (cost=0.00..1250.00 rows=33333 width=4) (actual time=14.297..28.692 rows=100000.00 loops=1)
               Filter: (aid <= 100000)
               Buffers: temp read=171 written=171
 Planning Time: 0.087 ms
 Execution Time: 94.810 ms
(18 rows)

I think this is more realistic use case than former. If I were
correct, we don't need to work on current nfa code to squeeze better
performance for unrealistic 100k reduced frame case. I maybe wrong
and I would love to hear from others especially, Henson, Vik, Jacob.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp