Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL

* Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL
@ 2026-03-02 07:25  Blessy Thomas <[email protected]>

From: Blessy Thomas @ 2026-03-02 07:25 UTC
  To: pgsql-hackers

Hello PostgreSQL Community,

I would like to introduce a PostgreSQL extension called
multilingual_fuzzy_match. This extension enables multilingual name
normalization, transliteration, and fuzzy phonetic matching directly inside
PostgreSQL at query time.

1. What Problem It Solves
In multilingual datasets (especially Indian language datasets), the same
name may appear in:
- Different scripts
- Different transliterations
- Slight spelling variations
- Multiple languages

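For example:

राम ≈ Raam ≈ رَام ≈ ராம்

Equality and LIKE queries fail in such cases because the strings share no
characters across scripts. Even trigram matching (pg_trgm) does not fully
address cross-script phonetic similarity, since it also compares characters
rather than sounds.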
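
To make the cross-script failure concrete, here is a small illustrative
sketch. The trigram function is a simplified imitation of trigram matching
written for this example (lowercase, pad, collect 3-character windows); it is
not pg_trgm itself, and the function names are hypothetical.

```python
def trigrams(text: str) -> set[str]:
    """Collect 3-character windows from a lowercased, space-padded string."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Same name, two scripts: no code points in common.
    print("राम" == "Raam")
    print(trigram_similarity("राम", "Raam"))
    # Same script, slightly different spelling: some overlap remains.
    print(trigram_similarity("Rahul", "Rahool"))
```

Across scripts the two trigram sets are disjoint, so the score drops to zero
no matter how similar the names sound.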

2. What This Extension Does

- Detects the script of the input text
- Performs transliteration and normalization
- Generates a phonetic key
- Uses Levenshtein distance (via python-Levenshtein)
- Returns similarity-scored results
All of this happens inside PostgreSQL using PL/Python (plpython3u).
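
The first step in the list above, script detection, can be approximated with
the Python standard library alone. The sketch below is an assumption about
one possible approach, not the extension's actual code: it takes the first
word of each character's Unicode name as the script and returns the most
common one. The function name `detect_script` is hypothetical.

```python
import unicodedata
from collections import Counter


def detect_script(text: str) -> str:
    """Guess the dominant script from Unicode character names."""
    counts = Counter()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "DEVANAGARI LETTER RA" -> "DEVANAGARI"
            counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


if __name__ == "__main__":
    for sample in ["राम", "Raam", "رَام", "ராம்"]:
        print(sample, detect_script(sample))
```

Note that this reports Unicode script names, so Urdu text is reported as
ARABIC and Punjabi text written in Gurmukhi as GURMUKHI.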

3. Key Features
- No schema changes required
- Query-level matching
- Supports the following major Indian scripts:
Devanagari, Tamil, Telugu, Bengali, Urdu, Malayalam, Kannada, Odia,
Gujarati, Punjabi
- Works on existing tables

4. Requirements
- PostgreSQL 17 (compiled with Python support)
- Python 3.12+
- plpython3u
- Python packages:
   pip install indic-transliteration python-Levenshtein
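
Because PL/Python runs inside the server, the packages need to be importable
by the Python interpreter that PostgreSQL was built against, which may differ
from the one on your shell's PATH. A small check along these lines may help.
The function name is hypothetical; `indic_transliteration` and `Levenshtein`
are the import names of the two pip packages listed above.

```python
import importlib.util
import sys


def missing_requirements(min_python=(3, 12),
                         modules=("indic_transliteration", "Levenshtein")):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {sys.version.split()[0]}")
    for module in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(module) is None:
            problems.append(f"module not importable: {module}")
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```

Running the same function body from a plpython3u function would check the
interpreter the server actually uses.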

5. Example Usage
-----------------------------------------------------------------------------------------------------------------------------
postgres=# SELECT * FROM fuzzy_match('names_native_dist', 'name', 'Rahul')
postgres-# WHERE distance <= 1;
 id | name  | translit | normalized | fuzzy | distance
----+-------+----------+------------+-------+----------
  1 | राहुल  | rAhula   | rahul      | rahul |        0
  2 | রাহুল  | rAhula   | rahul      | rahul |        0
  4 | ರಾಹುಲ್ | rAhul    | rahul      | rahul |        0
  5 | Rahul | Rahul    | rahul      | rahul |        0
(4 rows)
--------------------------------------------------------------------------------------------------------------------------------
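
The translit, normalized, and distance columns above suggest a
normalize-then-compare step. A minimal sketch of that step follows, under
stated assumptions: the `normalize` rule (lowercase, then drop one trailing
"a") is inferred only from the four rows shown and is not the extension's
actual rule, and `levenshtein` is a plain-Python stand-in for the
python-Levenshtein call. All function names here are hypothetical.

```python
def normalize(translit: str) -> str:
    """Lowercase and drop one trailing 'a' (assumed rule, inferred from the sample rows)."""
    key = translit.lower()
    if len(key) > 1 and key.endswith("a"):
        key = key[:-1]
    return key


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match(rows, query, max_distance=1):
    """Return (id, translit, normalized, distance) for rows within max_distance."""
    target = normalize(query)
    results = []
    for row_id, translit in rows:
        key = normalize(translit)
        distance = levenshtein(key, target)
        if distance <= max_distance:
            results.append((row_id, translit, key, distance))
    return results


if __name__ == "__main__":
    # Transliterations copied from the sample output above.
    rows = [(1, "rAhula"), (2, "rAhula"), (4, "rAhul"), (5, "Rahul")]
    for result in match(rows, "Rahul"):
        print(result)
```

Under this assumed rule all four sample transliterations reduce to "rahul"
and land at distance 0, which agrees with the table.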

6. Feedback Requested

I would really appreciate feedback from the community on:
- Extension design approach
- Performance considerations
- Suitability for PGXN submission
I would love suggestions, improvements, and any guidance on making this
production-ready. I’m sharing this not just as a project, but as a starting
point for discussion about multilingual data handling inside PostgreSQL.

Looking forward to your thoughts and critiques.
Thank you!

Regards
Blessy Thomas


Attachments:

  [image/png] Screenshot from 2026-03-02 12-29-45.png (73.7K, 3-Screenshot%20from%202026-03-02%2012-29-45.png)

* Fwd: Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL
@ 2026-03-23 05:52  Blessy Thomas <[email protected]>
  parent: Blessy Thomas <[email protected]>

From: Blessy Thomas @ 2026-03-23 05:52 UTC
  To: [email protected]

---------- Forwarded message ---------
From: Blessy Thomas <[email protected]>
Date: Mon, 2 Mar 2026 at 12:55
Subject: Extension - multilingual_fuzzy_match : Multilingual phonetic
matching extension for PostgreSQL



* Re: Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL
@ 2026-04-13 06:57  lakshmi <[email protected]>
  parent: Blessy Thomas <[email protected]>

From: lakshmi @ 2026-04-13 06:57 UTC
  To: Blessy Thomas <[email protected]>; +Cc: [email protected]

Hello all,

I hope this mail finds you well.

I would like to inform you that, as my friend has moved on to another
offer, I will be taking over her work on the multilingual_fuzzy_match
extension. My name is Lakshmi. Please feel free to reach out to me with any
queries, discussions, or updates.

Looking forward to working with you all.

Thank you.

Regards,
Lakshmi

