pg_tre 1.1.1 released -- an approximate-REGEX index AM for PostgreSQL 18+

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Greg Burd via PostgreSQL Announce <[email protected]>
To: PostgreSQL Announce <[email protected]>
Subject: pg_tre 1.1.1 released -- an approximate-REGEX index AM for PostgreSQL 18+
Date: Fri, 22 May 2026 18:51:48 +0000
Message-ID: <[email protected]> (raw)

I am pleased to announce the first public release of [pg_tre] (https://codeberg.org/gregburd/pg_tre), a native PostgreSQL 18+ index access method for approximate-regex matching.

pg_tre indexes text columns through a three-tier filter funnel (BRIN-style range bloom -> sparsemap trigram postings -> per-tuple bloom) backed by Ville Laurikari's TRE library for the heap recheck. The result is genuine Levenshtein-distance regex matching ("find text within k edits of this pattern") driven through a real IndexAmRoutine, with WAL coverage, VACUUM awareness, and REINDEX CONCURRENTLY support.

Highlights
----------

  * Custom IndexAmRoutine registered as USING tre, with custom
    rmgr (id 140) and full crash-recovery / streaming-replication
    coverage validated by TAP tests.

  * Edit-distance regex with per-subexpression budgets, e.g.
    body %~~ tre_pattern('(error){~1}.*(42[0-9]){~0}', 1)
    runs as a single indexed bitmap heap scan.

  * UTF-8 codepoint trigrams: CJK, accented characters, and
    emoji are indexed correctly; ASCII pays zero overhead.

  * DoS protection via configurable caps on NFA states, compile
    time, and per-match runtime.

  * Three-tier funnel cuts heap I/O dramatically: queries that
    return a handful of rows out of millions typically finish in
    sub-millisecond time.

  * Backward-compatible tre_amatch* UDFs from 0.1.0 are
    preserved.

How pg_tre fits alongside what you already have
-----------------------------------------------

PostgreSQL already ships strong text-search primitives. pg_tre is meant to complement them, not replace them.

  * pg_trgm (GIN/GiST):  exact regex, LIKE, and trigram-set
    similarity. Battle-tested. pg_trgm % is Jaccard similarity
    over trigram sets, which is not the same as edit distance:
    two strings with overlapping trigrams can score "similar"
    even when their Levenshtein distance is huge. Use pg_trgm
    when you need exact-substring or LIKE acceleration.

  * tsvector / tsquery (built-in FTS):  word-level linguistic
    search with stemming, stopwords, ranking, and language
    configuration. Use FTS for natural-language prose. pg_tre
    is language-agnostic, which is a feature for identifiers,
    SKUs, error codes, and log lines, and a non-feature for
    "running" matching "run".

  * pgvector / pgvectorscale:  semantic similarity over float
    embeddings. Orthogonal to pg_tre -- meaning vs. lexical
    structure. The two compose naturally as a hybrid filter
    (lexical pre-filter with pg_tre, semantic rank with
    pgvector).

  * pg_tre:  approximate regex with explicit edit budgets and
    full regex semantics (character classes, alternation,
    anchors, {m,n} repetition) composable with the {~k} edit
    operator. No other in-tree or out-of-tree PostgreSQL
    extension answers "is this text within N edits of this
    regex?" through an index.

A few things only pg_tre does well
----------------------------------

  * Typo-tolerant log and trace search:
    body %~~ tre_pattern('(timeout){~1}.*(connection){~1}', 1)
    finds "timeoutt" and "conection" in the same query.

  * Catalog and SKU lookup with edit tolerance:
    sku %~~ tre_pattern('AB-9?[0-9]{4}', 1)
    catches dropped digits or transposed characters without
    expensive post-filtering.

  * Per-phrase edit budgets in a single index query:
    body %~~ tre_pattern('(postgres){~2}.*(system){~0}', 0)
    expresses "postgres-ish" with strict "system" -- one round
    trip, no application-level reranking.

  * Hybrid retrieval for agent / RAG pipelines: pg_tre is the
    fuzzy-lexical leg that pg_trgm, FTS, and pgvector cannot
    cover on their own. An LLM-generated identifier with a
    typo, an OCR error in a scanned document, a near-duplicate
    error code -- pg_tre catches them all without sacrificing
    regex expressivity.

Installation
------------

    -- Requires shared_preload_libraries = 'pg_tre'
    CREATE EXTENSION pg_tre;

    CREATE INDEX docs_body_tre ON docs USING tre (body);

    SELECT id FROM docs
     WHERE body %~~ tre_pattern('database', 1);

A note on stability and feedback
--------------------------------

pg_tre is new software. The on-disk format is stable from 1.0.0 forward (1.1.x is byte-compatible with 1.0.0), the SQL surface is fixed, the WAL records are versioned, and the test suite covers the storage, query, recovery, and replication paths -- but the project is young and almost certainly has bugs that the existing tests do not yet provoke. It is not a beta and not a research toy; it is real software released early, and it should be used with appropriate caution.

Bug reports, feature requests, and pull requests are very welcome. If you can attach a reproduction case (a minimal schema and a query that misbehaves), that is the most useful form, but anything is better than nothing.

Links
-----

* Repository:   [https://codeberg.org/gregburd/pg_tre](https://codeberg.org/gregburd/pg_tre) 
* Issues:       [https://codeberg.org/gregburd/pg_tre/issues](https://codeberg.org/gregburd/pg_tre/issues)
* TRE library:  [https://github.com/laurikari/tre](https://github.com/laurikari/tre)

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: pg_tre 1.1.1 released -- an approximate-REGEX index AM for PostgreSQL 18+
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox