public inbox for [email protected]
* Re: pg_kazsearch: Full-text search extension for Kazakh language
@ 2026-04-08 14:55 Darkhan <[email protected]>
From: Darkhan @ 2026-04-08 14:55 UTC (permalink / raw)
To: Adrien Nayrat <[email protected]>; +Cc: [email protected]
Thanks for the suggestion!
I did look into Snowball early on. There is actually a Turkish stemmer in
Snowball already and Turkish is structurally very similar to Kazakh (both
agglutinative Turkic languages). But honestly the Turkish one is pretty
limited: it only handles nominal suffixes and doesn’t account for verb
morphology at all. The author even mentions this in the comments. So it
kind of works for basic noun cases but falls apart on real text.
The reason I went with a standalone extension is that Kazakh has suffix
chains where vowel harmony interacts with each layer and you need
context-aware decisions, not just stripping patterns from the end of the
word. My stemmer uses a penalty-scored BFS over possible suffix
decompositions instead of the linear step-by-step stripping that Snowball
does. With 5-6 suffixes stacked on one word you really need to evaluate
multiple decomposition paths to find the best one.
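To make the idea concrete, here is a minimal sketch of a penalty-scored BFS over suffix decompositions. It is written in Python for brevity and is not the pg_kazsearch implementation (whose core is in Rust). The suffix table, the penalty values, the scoring rule (stem length plus accumulated penalties), and the names `harmony`, `decompose`, and `stem` are all assumptions made for illustration, and the vowel-harmony check is simplified to front/back agreement with the last vowel of the remaining stem.

```python
from collections import deque

# Simplified vowel classes; vowels outside these sets are ignored (assumption).
BACK_VOWELS = set("аоұы")
FRONT_VOWELS = set("әеөүі")

# Toy table: (suffix, harmony class it requires, penalty for stripping it).
# Penalties are made-up numbers; short ambiguous endings cost more.
TOY_SUFFIXES = [
    ("лар", "back", 0.5), ("лер", "front", 0.5),    # plural
    ("тар", "back", 0.5), ("тер", "front", 0.5),    # plural after voiceless consonants
    ("ымыз", "back", 0.5), ("іміз", "front", 0.5),  # 1st person plural possessive
    ("да", "back", 0.5), ("де", "front", 0.5),      # locative
    ("а", "back", 2.0), ("е", "front", 2.0),        # short ending, easy to confuse
                                                    # with a stem-final vowel
]
MIN_STEM_LEN = 2  # never strip below this many letters (assumption)


def harmony(stem):
    """Return 'back' or 'front' based on the last classified vowel, else None."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return "back"
        if ch in FRONT_VOWELS:
            return "front"
    return None


def decompose(word):
    """Breadth-first search over all ways to peel suffixes off the end of word.

    Returns a list of (score, stem, stripped_suffixes), where
    score = len(stem) + total penalty, and lower is better.
    """
    best_penalty = {word: 0.0}          # cheapest known way to reach each stem
    queue = deque([(word, (), 0.0)])    # (current stem, stripped suffixes, penalty)
    candidates = []
    while queue:
        current, stripped, penalty = queue.popleft()
        if penalty > best_penalty[current]:
            continue                    # a cheaper path to this stem was found later
        candidates.append((len(current) + penalty, current, stripped))
        for suffix, required, cost in TOY_SUFFIXES:
            if not current.endswith(suffix):
                continue
            rest = current[: -len(suffix)]
            if len(rest) < MIN_STEM_LEN or harmony(rest) != required:
                continue                # too short, or vowel harmony is violated
            new_penalty = penalty + cost
            if new_penalty < best_penalty.get(rest, float("inf")):
                best_penalty[rest] = new_penalty
                queue.append((rest, stripped + (suffix,), new_penalty))
    return candidates


def stem(word):
    """Pick the stem of the lowest-scoring decomposition."""
    return min(decompose(word))[1]
```

With this toy table, `stem("балаларымызда")` peels off "да", "ымыз", and "лар" and returns "бала". The search also reaches the over-stripped candidate "бал", but the high penalty on the one-letter ending makes it score worse, which is the kind of decision a fixed strip-in-order pass cannot make.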
That said, contributing a simplified Kazakh stemmer to Snowball is something
I’d like to explore longer term. Even a basic version would be better than
nothing, which is what exists today. I would need to figure out how much of
the BFS logic can fit into the Snowball language, or whether a simpler
approach gets close enough.
Appreciate the pointer!
Darkhan
On Wed, 8 Apr 2026 at 19:42 Adrien Nayrat <[email protected]>
wrote:
> On 4/5/26 3:32 PM, Darkhan wrote:
> > Hi all,
> >
> > I built pg_kazsearch, a PostgreSQL extension that adds full-text search
> > support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
> > stop word list available in PostgreSQL, so anyone searching Kazakh text
> > is stuck with trigram matching or application-level workarounds.
> >
> > Kazakh is agglutinative: a single word can carry 5-6 suffixes, which
> > makes standard search approaches miss most relevant results.
> > pg_kazsearch provides a custom Kazakh stemmer (core written in Rust),
> > a stop word list, and a text search dictionary that plugs into the
> > standard PostgreSQL FTS infrastructure: GIN indexes, ts_rank, and
> > phrase search all work out of the box.
> >
> > I tested it on a dataset of 3,000 real Kazakh news articles. On the same
> > query, pg_kazsearch returns 61 relevant articles vs 1 with trigram
> > search, with a 23% improvement in recall overall.
> >
> > You can install it with a single command via deb package or Docker image,
> > no compilation needed.
> >
> > Repo: https://github.com/darkhanakh/pg-kazsearch
> >
> > I'd appreciate any feedback, especially from anyone working on text
> > search internals or with experience supporting non-Latin or
> > agglutinative languages in PostgreSQL.
> >
> > Thanks, Darkhan
> >
>
> Hello,
>
> Thanks for your work.
> I don't know anything about Kazakh.
>
> But have you tried adding it to the Snowball stemmer [1]?
> Since Postgres uses it, Kazakh would have a better chance of being
> supported in future versions.
>
>
> 1: https://github.com/snowballstem/snowball
>
> --
> Adrien NAYRAT
> https://pro.anayrat.info
>
* Re: pg_kazsearch: Full-text search extension for Kazakh language
@ 2026-04-10 15:24 ` Philip Johnston <[email protected]>
From: Philip Johnston @ 2026-04-10 15:24 UTC (permalink / raw)
To: Darkhan <[email protected]>; +Cc: Adrien Nayrat <[email protected]>; [email protected]
Darkhan,
Great work! As a former archaeologist, I was reminded by your comment about
Kazakh being agglutinative of ancient Sumerian, which has a similar structure.
Your work might find some interest among philologists and historians of the
ancient Near East.
Philip