Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1dhLaJ-0004yC-PU for pgsql-performance@arkaria.postgresql.org; Mon, 14 Aug 2017 20:03:35 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1dhLaJ-0006eN-CT for pgsql-performance@arkaria.postgresql.org; Mon, 14 Aug 2017 20:03:35 +0000 Received: from makus.postgresql.org ([2001:4800:1501:1::229]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1dhLYX-000276-NN for pgsql-performance@postgresql.org; Mon, 14 Aug 2017 20:01:45 +0000 Received: from mail-qt0-x22d.google.com ([2607:f8b0:400d:c0d::22d]) by makus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1dhLYV-0004hq-5t for pgsql-performance@postgresql.org; Mon, 14 Aug 2017 20:01:44 +0000 Received: by mail-qt0-x22d.google.com with SMTP id p3so57691549qtg.2 for ; Mon, 14 Aug 2017 13:01:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=FAQxFmC75mW+c0/o7Fip8bbhnePwBBqRiKQnkgvpW/0=; b=LPZ+EHpIM4jzApiiHYQ0YRA3s5voMEhvoM1pS6IUXVrKdLQuZJtbnD3YKQVMZMMA0F N3lwKfIxutBVbbGwaTiI2tki7zwcyj2NIGKM/efO6lhGNabpJ/8scDReaHHNk68K1kOE jxWYht9Fi9CIfNXA8U2icd5r9xJ7iVyA1AstJP/8LCk29IErDU7qNuikXz0hlOd19S2X 1DiJAUNCW79dLpSkUNmCLggaZdXWZSb2sO9/L2CX2IRJabfBhFhNB19YjYdUIQK9LiZ+ WXiI2YoeaNM8iyCjeWPbY381gej5+4lJNJY7sy1G9k5Hj96Pqi5nhxmrgu6NAcb+wovI s3Cw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=FAQxFmC75mW+c0/o7Fip8bbhnePwBBqRiKQnkgvpW/0=; b=j2HURFJ4WMOiYG3Hh3kYKjjfYYp8tudCi3EuaP8MetWHaP6vyuoD1OS+8QHaNhe4Ev Zee+ngv7mdkwqp2zMJebRiOyXtB2ox9TF6eFOlveOZFovH70MlV2WVQwF7+LGPOQShWs hXX81mJWH3JZrBwzTtd5qftLwInt09XoqJfFnr87Lmx5iv5FRUcKB83DOX2TpsH/TX5h tOlPVuA+N1DBbs1kUrgYvG59IkGPoc+7PdiJMKgeZRzNnjZEYHFnC49CJSG9nWhSkYJS XqEgam4GIRNeaLeyg67uuf1zHsr6mdCCF71iKJnFq8ZzTu9ecH0UwAyR3r2rSCpgl41J x41g== X-Gm-Message-State: AHYfb5jzgxYY584QpomxipwkNco3hOfskIpO6pFel/7R9BAVoykma72F seZNH5Sbt0yaGo08E5hISHpk0VKtsw== X-Received: by 10.200.35.12 with SMTP id a12mr32725482qta.271.1502740902520; Mon, 14 Aug 2017 13:01:42 -0700 (PDT) MIME-Version: 1.0 Received: by 10.140.105.164 with HTTP; Mon, 14 Aug 2017 13:01:42 -0700 (PDT) In-Reply-To: References: From: Scott Marlowe Date: Mon, 14 Aug 2017 14:01:42 -0600 Message-ID: Subject: Re: Odd sudden performance degradation related to temp object churn To: Jeremy Finzel Cc: postgres performance list Content-Type: text/plain; charset="UTF-8" List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org On Mon, Aug 14, 2017 at 1:53 PM, Jeremy Finzel wrote: > This particular db is on 9.3.15. Recently we had a serious performance > degradation related to a batch job that creates 4-5 temp tables and 5 > indexes. It is a really badly written job but what really confuses us is > that this job has been running for years with no issue remotely approaching > this one. We are also using pgpool. > > The job would kick off with 20-30 of similar queries running at once. The > thing normally takes only 30ms or so to run - it only operates on 1 customer > at a time (yes, it's horribly written). All of a sudden the cluster started > thrashing and performance seriously degraded. We tried a number of things > with no success: > > Analyzed the whole database > Turned off full logging > Turned off synchronous commit > Vacuumed several of the catalog tables > Checked if we had an abnormal high amount of traffic this time - we didn't > No abnormal disk/network issues (we would have seen much larger issues if > that had been the case) > Tried turning down the number of app nodes running > > What ended up completely resolving the issue was converting the query to use > ctes instead of temp tables. That means we avoided the disk writing and the > catalog churn, and useless indexes. However, we are baffled as to why this > could make such a big difference when we had no issue like this before, and > we have seen no systematic performance degradation in our system. > > Any insights would be greatly appreciated, as we are concerned not knowing > the root cause. How are your disks setup? One big drive with everything on it? Separate disks for pg_xlog and pg's data dir and the OS logging? IO contention is one of the big killers of db performance. Logging likely isn't your problem, but yeah you don't need to log ERRYTHANG to see the problem either. Log long running queries temp usage, buffer usage, query plans on slow queries, stuff like that. You've likely hit a "tipping point" in terms of data size. Either it's cause the query planner to make a bad decision, or you're spilling to disk a lot more than you used to. Be sure to log temporary stuff with log_temp_files = 0 in your postgresql.conf and then look for temporary file in your logs. I bet you've started spilling into the same place as your temp tables are going, and by default that's your data directory. Adding another drive and moving pgsql's temp table space to it might help. Also increasing work_mem (but don't go crazy, it's per sort, so can multiply fast on a busy server) Also log your query plans or run explain / explain analyze on the slow queries to see what they're doing that's so expensive. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance