From: James Addison
Date: Sun, 23 Jan 2022 12:49:07 +0000
Subject: Mailing list search engine: surprising missing results?
To: pgsql-www@postgresql.org

Hello,

I noticed that the mailing list search engine[1] seems to unexpectedly miss results for some queries.

For example:

A search for "boyer"[2] returns five results, including result snippets that contain the text "Boyer-More-Horspool" [sic] and "Boyer-Moore-Horspool".

However, a more specific search for "boyer-moore"[3] does not return any results -- that seems surprising.

Specializing the query further and searching for "boyer-moore-horspool"[4] *does* again return results -- two documents -- with the terms "boyer" and "horspool" highlighted.

Although it's not a significant problem, I do have a theory that could explain the behaviour (offered in case it may save time on investigation):

It seems possible that the term "more" -- and nearby misspellings, like "moore" -- may be filtered out as stopwords (meaning: they're not present in the search index), and that the search engine is configured to require a minimum percentage match rate for query terms.
Under those conditions, a search for "boyer" would produce a 100% match rate, "boyer-moore" would produce 50% (since "moore" would not be found in the term index), and "boyer-moore-horspool" would match at roughly 66.7% (two of its three terms). Given a required match rate of around two thirds, that could explain the behaviour (it might not be the true reason, but it seems like one possibility).

Thanks,
James

[1] https://www.postgresql.org/search/
[2] https://www.postgresql.org/search/?m=1&q=boyer&l=1&d=365&s=r
[3] https://www.postgresql.org/search/?m=1&q=boyer-moore&l=1&d=365&s=r
[4] https://www.postgresql.org/search/?m=1&q=boyer-moore-horspool&l=1&d=365&s=r
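
To make the match-rate arithmetic in the theory above concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the stopword set, the splitting of queries on hyphens, the two-thirds threshold, and the names STOPWORDS, REQUIRED_RATE, match_rate and would_return_results are all hypothetical and are not taken from the actual postgresql.org search code.

```python
from fractions import Fraction

# Hypothetical stopword set: the theory assumes "more" and nearby
# misspellings such as "moore" are absent from the term index.
STOPWORDS = {"more", "moore"}

# Hypothetical threshold: "around two thirds" of the query terms must match.
# Fraction is used so that 2/3 compares exactly, without float rounding.
REQUIRED_RATE = Fraction(2, 3)


def match_rate(query: str) -> Fraction:
    """Return the fraction of hyphen-separated terms that are not stopwords."""
    terms = [t for t in query.lower().split("-") if t]
    if not terms:
        return Fraction(0)
    found = [t for t in terms if t not in STOPWORDS]
    return Fraction(len(found), len(terms))


def would_return_results(query: str) -> bool:
    """Apply the assumed minimum-match-rate rule to a query."""
    return match_rate(query) >= REQUIRED_RATE


for q in ("boyer", "boyer-moore", "boyer-moore-horspool"):
    print(q, match_rate(q), would_return_results(q))
# boyer 1 True
# boyer-moore 1/2 False
# boyer-moore-horspool 2/3 True
```

Under these assumptions the sketch reproduces the observed pattern: results for "boyer" and "boyer-moore-horspool", none for "boyer-moore". It says nothing about whether the real search engine works this way.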