Re: Row pattern recognition

public inbox for [email protected]  
help / color / mirror / Atom feed

From: jian he <[email protected]>
To: [email protected]
Cc: Tatsuo Ishii <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Subject: Re: Row pattern recognition
Date: Tue, 2 Jun 2026 13:46:22 +0800
Message-ID: <CACJufxHVADC8e77pnQxSZRk7SYHCZFk6ZCM2HfTsKyD_kUji0A@mail.gmail.com> (raw)
In-Reply-To: <CAAAe_zCZc8s4zWfmVVUOt0y_FU=v7YTcJJJ4UL2gBzJ2_KkUmQ@mail.gmail.com>
References: <CAAAe_zDDuJofafXyhggPPPLzUXeejH-19NfcLR7jNNQxZchtog@mail.gmail.com>
	<[email protected]>
	<CAAAe_zBhzHyazC_Ot+HXBe_nToKc7AHs4r-s0nNcaBPw0L17wA@mail.gmail.com>
	<[email protected]>
	<CAAAe_zCZc8s4zWfmVVUOt0y_FU=v7YTcJJJ4UL2gBzJ2_KkUmQ@mail.gmail.com>

On Mon, Jun 1, 2026 at 1:35 PM Henson Choi <[email protected]> wrote:
>
> Hi Tatsuo, Jian,
>
> While tidying RPR comments I found a small inconsistency in the varId bounds.
> The comment/README side I'm already fixing in the in-progress series; whether
> to also change the bounds is a separate follow-up.  As lead author that one is
> ultimately your call, Tatsuo, but I'd welcome Jian's and the list's input on
> it first.
>
> The current state, in src/include/optimizer/rpr.h:
>
>   #define RPR_VARID_MAX   251
>   #define RPR_VARID_BEGIN 252   /* control codes 252..255 */
>   ... END 253, ALT 254, FIN 255
>
>   RPRElemIsVar(e)  ==  ((e)->varId <= RPR_VARID_MAX)   /* 0..251 */
>
> and the limit enforced in parse_rpr.c:
>
>   if (list_length(*varNames) >= RPR_VARID_MAX)   /* reject the 252nd */
>       ereport(ERROR, "too many pattern variables", "Maximum is 251");
>
> So 251 variables are accepted as varId 0..250, leaving 251 a hole: never
> assigned, yet the macro still classifies it as a variable -- one wider than
> the comment's own "0 to RPR_VARID_MAX - 1".
>
> RPRVarId is a uint8, kept small on purpose: varId is the likely per-row
> match-history key, and since a match can run arbitrarily long the history
> grows with it -- so one byte per row, not two, is what keeps that footprint
> in check.
>
> The catch of staying in uint8: the four control codes already fill 252..255,
> so 251 is the only free slot for any future sentinel (anchor ^/$, exclusion
> {- -}) short of widening to uint16.  So the hole is really the last reserve.
>
> Three ways, by what the gap is spent on:
>
> (1) Leave it -- just the doc alignment already underway: 251 stays a documented
>     reserve, macro unchanged.  No follow-up commit.  The one free slot is then
>     on hand for a single future control code, should one ever be needed.
>
> (2) Fill it as a 252nd variable (0..251).  Compatible and doable anytime; a few
>     lines in parse_rpr.c / rpr.h plus the boundary test.  But it spends the
>     last free slot, so a future control code would then force either a
>     compatibility-breaking narrow of RPR_VARID_MAX or a widen to two bytes
>     (doubling history).  Maximal variables now, the control question deferred.
>
> (3) Reserve 16 control codes now (4 used + 12 spare) at the 0xF0 boundary:
>     vars 0..239, control 240..255, existing sentinels unmoved, macro becomes
>     (varId & 0xF0) != 0xF0.  Buys 12-code headroom inside the byte, so history
>     stays 1 byte and (2)'s fork never arises.  Same edit shape as (2); costs
>     only the nominal drop to 240 variables -- but it is a narrowing, so free
>     only pre-release.
>
> Which would you prefer?
>

3.

240 variables is enough, as each variable supports multiple complex AND/OR
conditions. Additionally, since PostgreSQL regular expressions use 14 special
characters, reserving the remaining ones in advance is a future-proof approach.

----------------------------------------------------
src/backend/executor/README.rpr

XII-4. Memory Pool Management
  Choice: Custom free list

  Rationale:
  - NFA states are created and destroyed in large numbers per row
  - Avoids palloc/pfree overhead
  - State size is variable (counts[] array), but within a single query
    maxDepth is fixed, so all states have the same size

It would be better simply to mention that:
RPRNFAState and RPRNFAContext are allocated in a partition-lifespan
memory context; they will be destroyed in release_partition.
--------------------------
in ExecRPRFreeContext:
{
    if (ctx->states != NULL)
        nfa_state_free_list(winstate, ctx->states);
    if (ctx->matchedState != NULL)
        nfa_state_free(winstate, ctx->matchedState);
}

If ctx->matchedState points to one of the states already in ctx->states, will
nfa_state_free() be called on the same RPRNFAState twice? Is this double-free
permitted, or do we have a mechanism in place to guard against it?
--------------------------
In ExecRPRProcessRow

            if (currentPos == ctxFrameEnd)
            {
                /* Frame boundary reached: force mismatch */
                nfa_match(winstate, ctx, NULL);
                continue;
            }

If I comment out the CONTINUE, the entire regression still succeeds.



--
jian
https://www.enterprisedb.com/

view thread (109+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Row pattern recognition
  In-Reply-To: <CACJufxHVADC8e77pnQxSZRk7SYHCZFk6ZCM2HfTsKyD_kUji0A@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox