Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL


* Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL
@ 2026-03-02 07:25  Blessy Thomas <[email protected]>
  0 siblings, 1 reply; 2+ messages in thread

From: Blessy Thomas @ 2026-03-02 07:25 UTC (permalink / raw)
  To: [email protected]

Hello PostgreSQL Community,

I would like to introduce a PostgreSQL extension called
multilingual_fuzzy_match. This extension enables multilingual name
normalization, transliteration, and fuzzy phonetic matching directly inside
PostgreSQL at query time.

1. What Problem It Solves
In multilingual datasets (especially Indian language datasets), the same
name may appear in:
- Different scripts
- Different transliterations
- Slight spelling variations
- Multiple languages

For example:
राम ≈ Raam ≈ رَام ≈ ராம்
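Traditional equality or LIKE queries fail in such cases because they
compare characters, not sounds. Trigram matching (pg_trgm) also falls short:
it can catch spelling variations within one script, but strings written in
different scripts share no character trigrams, so it cannot capture
cross-script phonetic similarity.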
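
To illustrate the trigram limitation, here is a minimal sketch in plain
Python. It is a simplified imitation of the trigram idea (lowercase, pad,
take 3-character windows, compare sets), not pg_trgm's actual
implementation; the function names are hypothetical.

```python
def trigrams(text: str) -> set[str]:
    """Return the 3-character windows of a lowercased, space-padded string."""
    # Pad with two leading spaces and one trailing space, as trigram
    # matchers commonly do so that word boundaries count as characters.
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets (0.0 means nothing shared)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


# Same script: the spelling variants share several trigrams.
print(trigram_similarity("Raam", "Ram"))
# Different scripts: no character window can match, so the score is zero.
print(trigram_similarity("राम", "Raam"))
```

The second call scores zero because no Devanagari code point ever equals a
Latin one, however similar the names sound.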

2. What This Extension Does

- Detects the script of the input text
- Performs transliteration and normalization
- Generates a phonetic key
- Uses Levenshtein distance (via python-Levenshtein)
- Returns similarity-scored results
All of this happens inside PostgreSQL using PL/Python (plpython3u).
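
The first and fourth steps above can be sketched with the Python standard
library alone. This is an illustration under stated assumptions, not the
extension's code: detect_script and levenshtein are hypothetical names, the
script guess relies only on Unicode character names, and the edit distance
is a pure-Python stand-in for what python-Levenshtein provides.

```python
import unicodedata


def detect_script(text: str) -> str:
    """Guess the script from the Unicode name of the first alphabetic character."""
    for ch in text:
        if ch.isalpha():
            # Unicode names begin with the script, e.g. "DEVANAGARI LETTER RA".
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(detect_script("राहुल"), detect_script("Rahul"))
print(levenshtein("raam", "ram"))
```

Comparing phonetic keys rather than raw strings is what lets a small
distance threshold work across scripts: the distance is only meaningful
once both inputs are in the same alphabet.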

3. Key Features
- No schema changes required
- Query-level matching
- Supports major Indian scripts and languages (the ten named here):
Devanagari, Tamil, Telugu, Bengali, Urdu, Malayalam, Kannada, Odia,
Gujarati, Punjabi
- Works on existing tables

4. Requirements
- PostgreSQL 17 (compiled with Python support)
- Python 3.12+
- plpython3u
- Python packages:
   pip install indic-transliteration python-Levenshtein
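
A quick way to confirm the two packages are importable is sketched below.
It assumes the usual import names (indic_transliteration and Levenshtein)
and must be run with the same Python interpreter that plpython3u is linked
against; missing_packages is a hypothetical helper, not part of the
extension.

```python
from importlib.util import find_spec

# pip package name -> module name it is expected to install (assumption).
REQUIRED_MODULES = {
    "indic-transliteration": "indic_transliteration",
    "python-Levenshtein": "Levenshtein",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names whose module cannot be found by this interpreter."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


# An empty list means both dependencies are visible to this interpreter.
print(missing_packages())
```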

5. Example Usage
-----------------------------------------------------------------------------------------------------------------------------
postgres=# SELECT * FROM fuzzy_match('names_native_dist', 'name', 'Rahul')
postgres-# WHERE distance <= 1;
 id | name  | translit | normalized | fuzzy | distance
----+-------+----------+------------+-------+----------
  1 | राहुल  | rAhula   | rahul      | rahul |        0
  2 | রাহুল  | rAhula   | rahul      | rahul |        0
  4 | ರಾಹುಲ್ | rAhul    | rahul      | rahul |        0
  5 | Rahul | Rahul    | rahul      | rahul |        0
(4 rows)
--------------------------------------------------------------------------------------------------------------------------------
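
The step from the translit column to the normalized column in the output
above can be approximated as follows. This is a guess that reproduces only
the four rows shown (lowercase, then drop a word-final inherent "a" after a
consonant); the extension's real rules are not described here and very
likely differ, and normalize is a hypothetical name.

```python
VOWELS = "aeiou"


def normalize(translit: str) -> str:
    """Lowercase a transliteration and drop a trailing inherent 'a'."""
    key = translit.lower()
    # Indic transliterations often end in an inherent "a" that is not
    # pronounced in many names; remove it when it follows a consonant.
    if len(key) > 1 and key.endswith("a") and key[-2] not in VOWELS:
        key = key[:-1]
    return key


for translit in ["rAhula", "rAhula", "rAhul", "Rahul"]:
    print(translit, "->", normalize(translit))
```

A rule this blunt would also shorten names that genuinely end in "a", which
is one reason the final ranking relies on an edit-distance threshold rather
than exact key equality.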

6. Feedback Requested

I would really appreciate feedback from the community on:
- Extension design approach
- Performance considerations
- Suitability for PGXN submission

I would also welcome suggestions for improvement and any guidance on making
this production-ready. I am sharing this not only as a project but also as a
starting point for discussion about multilingual data handling inside
PostgreSQL.

Looking forward to your thoughts and critiques.
Thank you!

Regards
Blessy Thomas


Attachments:

  [image/png] Screenshot from 2026-03-02 12-29-45.png (73.7K, 3-Screenshot%20from%202026-03-02%2012-29-45.png)

* Fwd: Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL
@ 2026-03-23 05:52  Blessy Thomas <[email protected]>
  parent: Blessy Thomas <[email protected]>
  0 siblings, 0 replies; 2+ messages in thread

From: Blessy Thomas @ 2026-03-23 05:52 UTC (permalink / raw)
  To: pgsql-general

---------- Forwarded message ---------
From: Blessy Thomas <[email protected]>
Date: Mon, 2 Mar 2026 at 12:55
Subject: Extension - multilingual_fuzzy_match : Multilingual phonetic
matching extension for PostgreSQL


