Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bBM8a-0001MB-Ka for pgsql-performance@arkaria.postgresql.org; Fri, 10 Jun 2016 13:06:12 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1bBM8Z-0000Ve-Pt for pgsql-performance@arkaria.postgresql.org; Fri, 10 Jun 2016 13:06:11 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1bBM6t-00071s-Hh for pgsql-performance@postgresql.org; Fri, 10 Jun 2016 13:04:27 +0000 Received: from mail-qk0-x232.google.com ([2607:f8b0:400d:c09::232]) by magus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1bBM6p-0005K5-1u for pgsql-performance@postgresql.org; Fri, 10 Jun 2016 13:04:26 +0000 Received: by mail-qk0-x232.google.com with SMTP id p10so5208916qke.3 for ; Fri, 10 Jun 2016 06:04:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:from:date:message-id:subject:to; bh=51H9LM/qxrW1YCaOC6zIuHo8bkUyb+5I0uURZu78F/8=; b=hk3kga6i4dxXtha8bYtbr8ZEtT+a+gGwcYlbSJQl57plUpU0Sl2yPNJOrQk9qUgAMh a4s2ad8r6uTVIXPhibSvt27KVB/zZ8ck0wDKAJ3FJoovdJEkcFxjlQ+qc58zB6MECkTJ r1l/kyg+tqg0L6nTT5hAA9fvyleokgZmWIrCRk/4EP70qSJbDCSKYQoScIfxjjYzSccg M9BelnBBjZIAffQx+UAXlMk11B8gy2uBGZczRbB+86Y+v+0BNhqTDlcJKW6Qf/GBYo4/ NBIPXIaR7kHR7nQKNvtcOjzHh4tUJRZOsZKe2agEZSIwTp92AiB9ddbjDcumFGJQt2by NP7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=51H9LM/qxrW1YCaOC6zIuHo8bkUyb+5I0uURZu78F/8=; b=N/TeQAQbuGLHKY/WuMAUaUYthLEnr5bSj4SMQr5YLRsJJICgWbfHI7eXdcmcP312PQ kDUpdWh8mdXD8VlfHZvOjVvoLxLHDYwbRAS74yfQrRpXF6cwBxDpNOEPlc8FfMj6ZXLd +VD8pjRFYl5Rw/BrM3SBIg7HO3HGpkjQN9rQdbESyBqnmHOO7JN6Lb0X0GdwzwNpxpO3 w7TJkNCDId/Yje+l4e+RTfYDDMqn+2tVW/8cpd9K+eyWfQEydf6k28qbgcIiVkIsilLj 70sOqcwaZd2rhc4ksat4bLy5h//vAHV3oU2ZVGGBg0bgW4goy4PxQt1YsT7770fsDbd6 8t7Q== X-Gm-Message-State: ALyK8tKXApc/ho5w7KQarnIT1TAHQk14VBDN9OQCC7+hHwhX8Do21h7ld9wxd1nD4eAVjJliYBsdbInVidAqwQ== X-Received: by 10.55.110.65 with SMTP id j62mr1204340qkc.112.1465563861261; Fri, 10 Jun 2016 06:04:21 -0700 (PDT) MIME-Version: 1.0 Received: by 10.140.85.40 with HTTP; Fri, 10 Jun 2016 06:04:01 -0700 (PDT) From: Rowan Seymour Date: Fri, 10 Jun 2016 15:04:01 +0200 Message-ID: Subject: Many-to-many performance problem To: pgsql-performance@postgresql.org Content-Type: multipart/alternative; boundary=94eb2c05940c58a4180534ec2dd0 X-Pg-Spam-Score: -2.7 (--) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org --94eb2c05940c58a4180534ec2dd0 Content-Type: text/plain; charset=UTF-8 In our Django app we have messages (currently about 7 million in table msgs_message) and labels (about 300), and a join table to associate messages with labels (about 500,000 in msgs_message_labels). Not sure you'll need them, but here are the relevant table schemas: CREATE TABLE msgs_message ( id INTEGER PRIMARY KEY NOT NULL, type VARCHAR NOT NULL, text TEXT NOT NULL, is_archived BOOLEAN NOT NULL, created_on TIMESTAMP WITH TIME ZONE NOT NULL, contact_id INTEGER NOT NULL, org_id INTEGER NOT NULL, case_id INTEGER, backend_id INTEGER NOT NULL, is_handled BOOLEAN NOT NULL, is_flagged BOOLEAN NOT NULL, is_active BOOLEAN NOT NULL, has_labels BOOLEAN NOT NULL, CONSTRAINT msgs_message_contact_id_5c8e3f216c115643_fk_contacts_contact_id FOREIGN KEY (contact_id) REFERENCES contacts_contact (id), CONSTRAINT msgs_message_org_id_81a0adfcc99151d_fk_orgs_org_id FOREIGN KEY (org_id) REFERENCES orgs_org (id), CONSTRAINT msgs_message_case_id_51998150f9629c_fk_cases_case_id FOREIGN KEY (case_id) REFERENCES cases_case (id) ); CREATE UNIQUE INDEX msgs_message_backend_id_key ON msgs_message (backend_id); CREATE INDEX msgs_message_6d82f13d ON msgs_message (contact_id); CREATE INDEX msgs_message_9cf869aa ON msgs_message (org_id); CREATE INDEX msgs_message_7f12ca67 ON msgs_message (case_id); CREATE TABLE msgs_message_labels ( id INTEGER PRIMARY KEY NOT NULL, message_id INTEGER NOT NULL, label_id INTEGER NOT NULL, CONSTRAINT msgs_message_lab_message_id_1dfa44628fe448dd_fk_msgs_message_id FOREIGN KEY (message_id) REFERENCES msgs_message (id), CONSTRAINT msgs_message_labels_label_id_77cbdebd8d255b7a_fk_msgs_label_id FOREIGN KEY (label_id) REFERENCES msgs_label (id) ); CREATE UNIQUE INDEX msgs_message_labels_message_id_label_id_key ON msgs_message_labels (message_id, label_id); CREATE INDEX msgs_message_labels_4ccaa172 ON msgs_message_labels (message_id); CREATE INDEX msgs_message_labels_abec2aca ON msgs_message_labels (label_id); Users can search for messages, and they are returned page by page in reverse chronological order. There are several partial multi-column indexes on the message table, but the one used for the example queries below is CREATE INDEX msgs_inbox ON msgs_message(org_id, created_on DESC) WHERE is_active = TRUE AND is_handled = TRUE AND is_archived = FALSE AND has_labels = TRUE; So a typical query for the latest page of messages looks like ( https://explain.depesz.com/s/G9ew): SELECT "msgs_message".* FROM "msgs_message" WHERE ("msgs_message"."org_id" = 7 AND "msgs_message"."is_active" = true AND "msgs_message"."is_handled" = true AND "msgs_message"."has_labels" = true AND "msgs_message"."is_archived" = false AND "msgs_message"."created_on" < '2016-06-10T07:11:06.381000 +00:00'::timestamptz ) ORDER BY "msgs_message"."created_on" DESC LIMIT 50 But users can also search for messages that have one or more labels, leading to queries that look like: SELECT DISTINCT "msgs_message".* FROM "msgs_message" INNER JOIN "msgs_message_labels" ON ( "msgs_message"."id" = "msgs_message_labels"."message_id" ) WHERE ("msgs_message"."org_id" = 7 AND "msgs_message"."is_active" = true AND "msgs_message"."is_handled" = true AND "msgs_message_labels"."label_id" IN (127, 128, 135, 136, 137, 138, 140, 141, 143, 144) AND "msgs_message"."has_labels" = true AND "msgs_message"."is_archived" = false AND "msgs_message"."created_on" < '2016-06-10T07:11:06.381000 +00:00'::timestamptz ) ORDER BY "msgs_message"."created_on" DESC LIMIT 50 Most of time, this query performs like https://explain.depesz.com/s/ksOC (~15ms). It's no longer using the using the msgs_inbox index, but it's plenty fast. However, sometimes it performs like https://explain.depesz.com/s/81c (67000ms) And if you run it again, it'll be fast again. Am I correct in interpreting that second explain as being slow because msgs_message_pkey isn't cached? It looks like it read from that index 3556 times, and each time took 18.559 (?) ms, and that adds up to 65,996ms. The database server says it has lots of free memory so is there something I should be doing to keep that index in memory? Generally speaking, is there a good strategy for optimising queries like these which involve two tables? - I tried moving the label references into an int array on msgs_message, and then using btree_gin to create a multi-column index involving the array column, but that doesn't appear to be very useful for these ordered queries because it's not an ordered index. - I tried adding created_on to msgs_message_labels table but I couldn't find a way of avoiding the in-memory sort. - Have thought about dynamically creating partial indexes for each label using an array column on msgs_message to hold label ids, and index condition like WHERE label_ids && ARRAY[123] but not sure what other problems I'll run into with hundreds of indexes on the same table. Server is an Amazon RDS instance with default settings and Postgres 9.3.10, with one other database in the instance. All advice very much appreciated, thanks -- *Rowan Seymour* | +260 964153686 --94eb2c05940c58a4180534ec2dd0 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
In our Django app we have messages (currently about 7= million in table msgs_message) and labels (about 300), and a join table to= associate messages with labels (about 500,000 in msgs_message_labels). Not= sure you'll need them, but here are the relevant table schemas:
<= div>
CREATE TABLE msgs_message
(
=C2= =A0 =C2=A0 id INTEGER PRIMARY KEY NOT NULL,
=C2=A0 =C2=A0 type VA= RCHAR NOT NULL,
=C2=A0 =C2=A0 text TEXT NOT NULL,
=C2= =A0 =C2=A0 is_archived BOOLEAN NOT NULL,
=C2=A0 =C2=A0 created_on= TIMESTAMP WITH TIME ZONE NOT NULL,
=C2=A0 =C2=A0 contact_id INTE= GER NOT NULL,
=C2=A0 =C2=A0 org_id INTEGER NOT NULL,
= =C2=A0 =C2=A0 case_id INTEGER,
=C2=A0 =C2=A0 backend_id INTEGER N= OT NULL,
=C2=A0 =C2=A0 is_handled BOOLEAN NOT NULL,
=C2= =A0 =C2=A0 is_flagged BOOLEAN NOT NULL,
=C2=A0 =C2=A0 is_active B= OOLEAN NOT NULL,
=C2=A0 =C2=A0 has_labels BOOLEAN NOT NULL,
=
=C2=A0 =C2=A0 CONSTRAINT msgs_message_contact_id_5c8e3f216c115643_fk_c= ontacts_contact_id FOREIGN KEY (contact_id) REFERENCES contacts_contact (id= ),
=C2=A0 =C2=A0 CONSTRAINT msgs_message_org_id_81a0adfcc99151d_f= k_orgs_org_id FOREIGN KEY (org_id) REFERENCES orgs_org (id),
=C2= =A0 =C2=A0 CONSTRAINT msgs_message_case_id_51998150f9629c_fk_cases_case_id = FOREIGN KEY (case_id) REFERENCES cases_case (id)
);
CRE= ATE UNIQUE INDEX msgs_message_backend_id_key ON msgs_message (backend_id);<= /div>
CREATE INDEX msgs_message_6d82f13d ON msgs_message (contact_id);<= /div>
CREATE INDEX msgs_message_9cf869aa ON msgs_message (org_id);
CREATE INDEX msgs_message_7f12ca67 ON msgs_message (case_id);

CREATE TABLE msgs_message_labels
(
=C2=A0 =C2=A0 id INTEGER PRIMARY KEY NOT NULL,
=C2=A0 = =C2=A0 message_id INTEGER NOT NULL,
=C2=A0 =C2=A0 label_id INTEGE= R NOT NULL,
=C2=A0 =C2=A0 CONSTRAINT msgs_message_lab_message_id_= 1dfa44628fe448dd_fk_msgs_message_id FOREIGN KEY (message_id) REFERENCES msg= s_message (id),
=C2=A0 =C2=A0 CONSTRAINT msgs_message_labels_labe= l_id_77cbdebd8d255b7a_fk_msgs_label_id FOREIGN KEY (label_id) REFERENCES ms= gs_label (id)
);
CREATE UNIQUE INDEX msgs_message_label= s_message_id_label_id_key ON msgs_message_labels (message_id, label_id);
CREATE INDEX msgs_message_labels_4ccaa172 ON msgs_message_labels (m= essage_id);
CREATE INDEX msgs_message_labels_abec2aca ON msgs_mes= sage_labels (label_id);

Users can search for= messages, and they are returned page by page in reverse chronological orde= r. There are several partial multi-column indexes on the message table, but= the one used for the example queries below is

CREATE INDEX msgs_inbox ON msgs_message(org_id, created_on DESC)
WHERE is_active =3D TRUE AND is_handled =3D TRUE AND is_archived =3D FALS= E AND has_labels =3D TRUE;

So a typical quer= y for the latest page of messages looks like (https://explain.depesz.com/s/G9ew):<= /div>

SELECT "msgs_message".*=C2=A0
= FROM "msgs_message"=C2=A0
WHERE ("msgs_message&quo= t;."org_id" =3D 7=C2=A0
=C2=A0 =C2=A0 AND "msgs_me= ssage"."is_active" =3D true=C2=A0
=C2=A0 =C2=A0 AN= D "msgs_message"."is_handled" =3D true=C2=A0
= =C2=A0 =C2=A0 AND "msgs_message"."has_labels" =3D true= =C2=A0
=C2=A0 =C2=A0 AND "msgs_message"."is_archiv= ed" =3D false=C2=A0
=C2=A0 =C2=A0 AND "msgs_message&quo= t;."created_on" < '2016-06-10T07:11:06.381000+00:00'::time= stamptz
) ORDER BY "msgs_message"."created_on"= ; DESC LIMIT 50

But users can also search for mess= ages that have one or more labels, leading to queries that look like:
=

SELECT DISTINCT "msgs_message".*=C2=A0
FROM "msgs_message"=C2=A0
INNER JOIN "msg= s_message_labels" ON ( "msgs_message"."id" =3D &qu= ot;msgs_message_labels"."message_id" )=C2=A0
WHERE= ("msgs_message"."org_id" =3D 7=C2=A0
=C2=A0 = =C2=A0 AND "msgs_message"."is_active" =3D true=C2=A0
=C2=A0 =C2=A0 AND "msgs_message"."is_handled" = =3D true=C2=A0
=C2=A0 =C2=A0 AND "msgs_message_labels".= "label_id" IN (127, 128, 135, 136, 137, 138, 140, 141, 143, 144)= =C2=A0
=C2=A0 =C2=A0 AND "msgs_message"."has_label= s" =3D true=C2=A0
=C2=A0 =C2=A0 AND "msgs_message"= ."is_archived" =3D false=C2=A0
=C2=A0 =C2=A0 AND "= msgs_message"."created_on" < '2016-06-10T07:11:06.381000+= 00:00'::timestamptz
) ORDER BY "msgs_message"."= ;created_on" DESC LIMIT 50

Most of time, this query performs like https://explain.depesz.com/= s/ksOC (~15ms). It's no longer using the using the msgs_inbox index= , but it's plenty fast. However, sometimes it performs like=C2=A0https://explain.de= pesz.com/s/81c (67000ms)

And if you run it aga= in, it'll be fast again. Am I correct in interpreting that second expla= in as being slow because msgs_message_pkey isn't cached? It looks like = it read from that index 3556 times, and each time took=C2=A018.559 (?)=C2= =A0ms, and that adds up to=C2=A065,996ms. The database server says it has l= ots of free memory so is there something I should be doing to keep that ind= ex in memory?

Generally speaking, is there a good = strategy for optimising queries like these which involve two tables?
<= div>
  • I tried moving the label references into an int array on msgs_m= essage, and then using btree_gin to create a multi-column index involving t= he array column, but that doesn't appear to be very useful for these or= dered queries because it's not an ordered index.
  • I tried adding= created_on to=C2=A0msgs_message_labels table but I couldn't find a way= of avoiding the in-memory sort.
  • Have thought about dynamically cre= ating partial indexes for each label using an array column on=C2=A0msgs_mes= sage to hold label ids, and index condition like WHERE label_ids &&= ARRAY[123] but not sure what other problems I'll run into with hundred= s of indexes on the same table.
Server is an Amazon RDS insta= nce with default settings and Postgres 9.3.10, with one other database in t= he instance.

All advice very much appreciate= d, thanks

--
Rowan Sey= mour | +260 964153686
--94eb2c05940c58a4180534ec2dd0--