Message-ID: <34cf74ff-5466-44e0-9a3f-e626708f893a@anayrat.info>
Date: Wed, 8 Apr 2026 16:42:21 +0200
Subject: Re: pg_kazsearch: Full-text search extension for Kazakh language
To: Darkhan, pgsql-general@lists.postgresql.org
From: Adrien Nayrat

On 4/5/26 3:32 PM, Darkhan wrote:
> Hi all,
>
> I built pg_kazsearch, a PostgreSQL extension that adds full-text search
> support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
> stop word list available in PostgreSQL, so anyone searching Kazakh text is
> stuck with trigram matching or application-level workarounds.
>
> Kazakh is agglutinative: a single word can carry 5-6 suffixes, which makes
> standard search approaches miss most relevant results. pg_kazsearch
> provides a custom Kazakh stemmer (core written in Rust), a stop word list,
> and a text search dictionary that plugs into the standard PostgreSQL FTS
> infrastructure: GIN indexes, ts_rank, phrase search all work out of the
> box.
>
> I tested it on a dataset of 3,000 real Kazakh news articles. On the same
> query, pg_kazsearch returns 61 relevant articles vs 1 with trigram search,
> with a 23% improvement in recall overall.
>
> You can install it with a single command via deb package or Docker image,
> no compilation needed.
>
> Repo: https://github.com/darkhanakh/pg-kazsearch
>
> I'd appreciate any feedback, especially from anyone working on text search
> internals or with experience supporting non-Latin or agglutinative
> languages in PostgreSQL.
>
> Thanks,
> Darkhan
>

Hello,

Thanks for your work. I don't know anything about Kazakh, but have you
tried adding it to the Snowball stemmer [1]? Since Postgres uses Snowball,
that would give Kazakh a better chance of being supported in future
versions.

[1]: https://github.com/snowballstem/snowball

--
Adrien NAYRAT
https://pro.anayrat.info