From: Darkhan
Date: Wed, 8 Apr 2026 19:55:26 +0500
Subject: Re: pg_kazsearch: Full-text search extension for Kazakh language
To: Adrien Nayrat
Cc: pgsql-general@lists.postgresql.org

Thanks for the suggestion!

I did look into Snowball early on. There is actually a Turkish stemmer in Snowball already, and Turkish is structurally very similar to Kazakh (both are agglutinative Turkic languages). But honestly the Turkish one is pretty limited: it only handles nominal suffixes and doesn’t account for verb morphology at all. The author even mentions this in the comments. So it kind of works for basic noun cases but falls apart on real text.

The reason I went with a standalone extension is that Kazakh has suffix chains where vowel harmony interacts with each layer, so you need context-aware decisions, not just stripping patterns from the end of the word. My stemmer uses a penalty-scored BFS (breadth-first search) over possible suffix decompositions instead of the linear step-by-step stripping that Snowball does. With 5-6 suffixes stacked on one word, you really need to evaluate multiple decomposition paths to find the best one.
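
To make the idea concrete, here is a minimal sketch of what a penalty-scored BFS over suffix decompositions could look like. It is written in Python for brevity and is not the pg_kazsearch code (the real core is in Rust). The suffix table, the penalty weights, the minimum stem length, and all function names are assumptions made up for illustration.

```python
from collections import deque

# Hypothetical toy suffix inventory (back- and front-vowel variants).
# Illustrative only; this is not pg_kazsearch's actual suffix table.
SUFFIXES = ["лар", "лер", "тар", "тер", "ымыз", "іміз", "да", "де"]

FRONT_VOWELS = set("әеіөү")
BACK_VOWELS = set("аоұы")

MIN_STEM_LEN = 3       # assumed lower bound on stem length
STRIP_COST = 0.5       # assumed flat cost per stripped suffix
HARMONY_PENALTY = 5.0  # assumed cost when suffix and stem vowels disagree


def vowel_class(text):
    """Return 'front' or 'back' for the last classified vowel, else None."""
    for ch in reversed(text):
        if ch in FRONT_VOWELS:
            return "front"
        if ch in BACK_VOWELS:
            return "back"
    return None


def harmony_penalty(stem, suffix):
    """Penalize a suffix whose vowel class disagrees with the stem's last vowel."""
    stem_cls, suffix_cls = vowel_class(stem), vowel_class(suffix)
    if stem_cls is None or suffix_cls is None or stem_cls == suffix_cls:
        return 0.0
    return HARMONY_PENALTY


def stem_word(word):
    """Breadth-first search over suffix decompositions; lowest score wins."""
    best_penalty = {word: 0.0}  # cheapest known penalty per candidate stem
    queue = deque([word])
    while queue:
        current = queue.popleft()
        for suffix in SUFFIXES:
            if not current.endswith(suffix):
                continue
            candidate = current[: -len(suffix)]
            if len(candidate) < MIN_STEM_LEN:
                continue  # reject stems that are implausibly short
            penalty = (best_penalty[current] + STRIP_COST
                       + harmony_penalty(candidate, suffix))
            if penalty < best_penalty.get(candidate, float("inf")):
                best_penalty[candidate] = penalty
                queue.append(candidate)
    # Score = accumulated penalty + stem length, so every extra strip
    # has to pay for itself by shortening the stem.
    return min(best_penalty, key=lambda s: best_penalty[s] + len(s))
```

With this toy table, `stem_word("мектептерімізде")` (roughly "in our schools") peels off де, іміз, and тер in turn and returns `мектеп`, while a harmony-breaking form such as `мектептар` is left unstripped because the penalty outweighs the shorter stem. A linear stripper commits to one suffix at each step, whereas the search keeps every candidate stem and compares them at the end.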

That said, contributing a simplified Kazakh stemmer to Snowball is something I’d like to explore longer term. Even a basic version would be better than nothing, which is what exists today. I would need to figure out how much of the BFS logic can fit into the Snowball language, or whether a simpler approach gets close enough.

Appreciate the pointer!

Darkhan

On Wed, 8 Apr 2026 at 19:42 Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:

> On 4/5/26 3:32 PM, Darkhan wrote:
> > Hi all,
> >
> > I built pg_kazsearch, a PostgreSQL extension that adds full-text search
> > support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
> > stop word list available in PostgreSQL, so anyone searching Kazakh text is
> > stuck with trigram matching or application-level workarounds.
> >
> > Kazakh is agglutinative - a single word can carry 5-6 suffixes, which makes
> > standard search approaches miss most relevant results. pg_kazsearch
> > provides a custom Kazakh stemmer (core written in Rust), a stop word list,
> > and a text search dictionary that plugs into the standard PostgreSQL FTS
> > infrastructure - GIN indexes, ts_rank, phrase search all work out of the
> > box.
> >
> > I tested it on a dataset of 3,000 real Kazakh news articles. On the same
> > query, pg_kazsearch returns 61 relevant articles vs 1 with trigram search,
> > with a 23% improvement in recall overall.
> >
> > You can install it with a single command via deb package or Docker image,
> > no compilation needed.
> >
> > Repo: https://github.com/darkhanakh/pg-kazsearch
> >
> > I'd appreciate any feedback, especially from anyone working on text search
> > internals or with experience supporting non-Latin or agglutinative
> > languages in PostgreSQL.
> >
> > Thanks, Darkhan
> >
>
> Hello,
>
> Thanks for your work.
> I don't know anything about Kazakh.
>
> But have you tried to add it to the Snowball stemmer [1]?
> As Postgres uses it, you have more chances to have Kazakh
> supported in future versions.
>
> 1: https://github.com/snowballstem/snowball
>
> --
> Adrien NAYRAT
> https://pro.anayrat.info