From: Philip Johnston
Date: Fri, 10 Apr 2026 10:24:38 -0500
Subject: Re: pg_kazsearch: Full-text search extension for Kazakh language
To: Darkhan
Cc: Adrien Nayrat, pgsql-general@lists.postgresql.org

Darkhan,

Great work! As a former archaeologist, I found that your comment about Kazakh being agglutinative reminded me of ancient Sumerian, which has a similar structure.

You might find some interest in your work among philologists and ancient Near Eastern historians.

Philip

On Wed, Apr 8, 2026 at 9:56 AM Darkhan <darkhanahmetov2005@gmail.com> wrote:
Thanks for the suggestion!

I did look into Snowball early on. There is actually a Turkish stemmer in Snowball already, and Turkish is structurally very similar to Kazakh (both are agglutinative Turkic languages). But honestly the Turkish one is pretty lobotomized: it only handles nominal suffixes and doesn't account for verb morphology at all. The author even mentions this in the comments. So it kind of works for basic noun cases but falls apart on real text.

The reason I went with a standalone extension is that Kazakh has suffix chains where vowel harmony interacts with each layer, so you need context-aware decisions, not just stripping patterns from the end of the word. My stemmer uses a penalty-scored breadth-first search (BFS) over possible suffix decompositions instead of the linear step-by-step stripping that Snowball does. With 5-6 suffixes stacked on one word, you really need to evaluate multiple decomposition paths to find the best one.
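
To illustrate the idea described above, here is a minimal Python sketch of a penalty-scored BFS over suffix decompositions. It is not the pg_kazsearch implementation (whose core is written in Rust). The suffix list is a tiny subset of Kazakh endings, and the penalty values, the length-based final score, and the names `SUFFIX_PENALTY`, `vowel_class`, and `stem` are all assumptions chosen for illustration.

```python
from collections import deque

# Toy suffix inventory with a base penalty per suffix. The list and the
# numbers are illustrative assumptions, not pg_kazsearch's actual rules.
SUFFIX_PENALTY = {
    "лар": 0.5, "лер": 0.5, "дар": 0.5, "дер": 0.5, "тар": 0.5, "тер": 0.5,  # plural
    "ымыз": 0.5, "іміз": 0.5,                                                # 1st person plural possessive
    "да": 0.5, "де": 0.5, "та": 0.5, "те": 0.5,                              # locative
}
FRONT_VOWELS = set("әеіөү")
BACK_VOWELS = set("аоұы")
HARMONY_PENALTY = 5.0  # extra cost when a suffix disagrees with the remaining stem
MIN_STEM_LEN = 3       # never strip a word below this many characters


def vowel_class(text):
    """Return 'front' or 'back' for the last classified vowel, else None."""
    for ch in reversed(text):
        if ch in FRONT_VOWELS:
            return "front"
        if ch in BACK_VOWELS:
            return "back"
    return None


def stem(word):
    """Explore all suffix decompositions breadth-first; return the cheapest stem."""
    best_penalty = {word: 0.0}  # lowest accumulated penalty per candidate stem
    queue = deque([word])
    while queue:
        current = queue.popleft()
        for suffix, base in SUFFIX_PENALTY.items():
            rest = current[: -len(suffix)]
            if not current.endswith(suffix) or len(rest) < MIN_STEM_LEN:
                continue
            cost = best_penalty[current] + base
            # Vowel harmony check at this layer of the suffix chain.
            if vowel_class(rest) != vowel_class(suffix):
                cost += HARMONY_PENALTY
            if cost < best_penalty.get(rest, float("inf")):
                best_penalty[rest] = cost
                queue.append(rest)
    # Final score: accumulated penalty plus stem length, so a harmonic suffix
    # (penalty below its length) is worth stripping and a disharmonic one is not.
    return min(best_penalty, key=lambda s: best_penalty[s] + len(s))


print(stem("балаларымызда"))  # бала + лар + ымыз + да
print(stem("мектептерде"))    # мектеп + тер + де
```

Because every reachable decomposition is scored before one is chosen, a stripping step that looks fine locally but breaks vowel harmony can be outweighed by a better path, which a purely linear stripper has no way to reconsider.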

That said, contributing a simplified Kazakh stemmer to Snowball is something I'd like to explore longer term. Even a basic version would be better than nothing, which is what exists today. I would need to figure out how much of the BFS logic can fit into the Snowball language, or whether a simpler approach gets close enough.

Appreciate the pointer!

Darkhan

On Wed, 8 Apr 2026 at 19:42 Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:
On 4/5/26 3:32 PM, Darkhan wrote:
> Hi all,
>
> I built pg_kazsearch, a PostgreSQL extension that adds full-text search
> support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
> stop word list available in PostgreSQL, so anyone searching Kazakh text is
> stuck with trigram matching or application-level workarounds.
>
> Kazakh is agglutinative: a single word can carry 5-6 suffixes, which makes
> standard search approaches miss most relevant results. pg_kazsearch
> provides a custom Kazakh stemmer (core written in Rust), a stop word list,
> and a text search dictionary that plugs into the standard PostgreSQL FTS
> infrastructure, so GIN indexes, ts_rank, and phrase search all work out of the
> box.
>
> I tested it on a dataset of 3,000 real Kazakh news articles. On the same
> query, pg_kazsearch returns 61 relevant articles vs 1 with trigram search,
> with a 23% improvement in recall overall.
>
> You can install it with a single command via deb package or Docker image,
> no compilation needed.
>
> Repo: https://github.com/darkhanakh/pg-kazsearch
>
> I'd appreciate any feedback, especially from anyone working on text search
> internals or with experience supporting non-Latin or agglutinative
> languages in PostgreSQL.
>
> Thanks, Darkhan
>
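
As background for the trigram comparison in the quoted message, the following Python sketch approximates how pg_trgm scores a single word: the word is padded with two leading spaces and one trailing space, split into three-character windows, and similarity is the number of shared trigrams divided by the number of distinct trigrams in either word. The function names `trigrams` and `similarity` are made up for this example, and real pg_trgm adds locale-dependent details that are ignored here. The sample word is a regular Kazakh form (бала + лар + ымыз + да) picked for illustration, not taken from the author's dataset.

```python
def trigrams(word):
    """Set of trigrams for one word, padded the way pg_trgm pads words."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by distinct trigrams across both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


base = "бала"
inflected = "балаларымызда"  # base word carrying three stacked suffixes
score = similarity(base, inflected)

# pg_trgm's default similarity threshold is 0.3.
print(f"{score:.3f}", "match" if score >= 0.3 else "no match")
```

With three suffixes stacked on a four-letter base, only 4 of the 14 distinct trigrams are shared, so the pair falls below the default threshold of 0.3. A stemmer that reduces both forms to the same lexeme avoids that dilution.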

Hello,

Thanks for your work.
I don't know anything about Kazakh.

But have you tried adding it to the Snowball stemmer [1]?
Since Postgres uses it, Kazakh would have a better chance of being
supported in future versions.


[1] https://github.com/snowballstem/snowball

--
Adrien NAYRAT
https://pro.anayrat.info