Fix performance issue on RQLExpressions using EXISTS()

This backout the changeset dfcc3f7 which introduced wrapping all
{E,R}RQLExpression where clause with EXISTS().

It appear to have very bad performance on PostgreSQL on queries already using
EXISTS(), in this case rql was generating a query with double EXISTS(), leading
to a very bad query plan:

=# create table t as select * from generate_series(1, 1000000) as id;
SELECT 1000000
=# create unique index on t(id);
CREATE INDEX

For the RQLExpression "EXISTS(X identity X)" the generated sql was: Any X WHERE EXISTS(EXISTS(X identity X)), X eid %(eid)s, which is equivalent to:

=# explain analyze select id from t where exists(select 1 where exists(select 1 from t as x where t.id = x.id) and t.id = 42);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..8469425.00 rows=500000 width=4) (actual time=0.277..1426.342 rows=1 loops=1)
Filter: (SubPlan 2)
Rows Removed by Filter: 999999
SubPlan 2
-> Result (cost=8.45..8.46 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1000000)
One-Time Filter: ($1 AND (t.id = 42))
InitPlan 1 (returns $1)
-> Index Only Scan using t_id_idx on t x (cost=0.42..8.44 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1000000)
Index Cond: (id = t.id)
Heap Fetches: 1000000
Planning Time: 0.190 ms
Execution Time: 1426.384 ms


The planner wasn't able to optimise this (bad written) query, it produce a full table read (Seq Scan).

With a single EXISTS, the query perform much better: Any X WHERE EXISTS(X identity X), X eid %(eid)s, which is equivalent to:

=# explain analyze select id from t where exists(select 1 from t as x where t.id = x.id) and t.id = 42;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Nested Loop Semi Join (cost=0.85..16.90 rows=1 width=4) (actual time=0.093..0.095 rows=1 loops=1)
-> Index Only Scan using t_id_idx on t (cost=0.42..8.44 rows=1 width=4) (actual time=0.081..0.082 rows=1 loops=1)
Index Cond: (id = 42)
Heap Fetches: 1
-> Index Only Scan using t_id_idx on t x (cost=0.42..8.44 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: (id = 42)
Heap Fetches: 1
Planning Time: 0.383 ms
Execution Time: 0.136 ms


An alternative patch would be to insert the EXISTS only if the original query
doesn't already contains a EXISTS. But I think it's better to drop the magic
here and let user control what they really want in their expressions.

I added a note to the 3.27 changelog for this.

Also add some tests to RQLExpressionTC with EXISTS().

authorPhilippe Pepiot <philippe.pepiot@logilab.fr>
changesetc71e5bbead3d
branchdefault
phasedraft
hiddenno
parent revision#a4d465a3e77d fix(ci): manually remove the .tox/doc directory
child revision#0d5b9482f40d Better peformance for EmailAddress permissions
files modified by this revision
cubicweb/schema.py
# HG changeset patch
# User Philippe Pepiot <philippe.pepiot@logilab.fr>
# Date 1611764179 -3600
# Wed Jan 27 17:16:19 2021 +0100
# Node ID c71e5bbead3d959b1cd08cfd2c8e440d448b15cc
# Parent a4d465a3e77d07cf6a79c121c10b2d6484cd7468
Fix performance issue on RQLExpressions using EXISTS()

This backout the changeset dfcc3f7 which introduced wrapping all
{E,R}RQLExpression where clause with EXISTS().

It appear to have very bad performance on PostgreSQL on queries already using
EXISTS(), in this case rql was generating a query with double EXISTS(), leading
to a very bad query plan:

=# create table t as select * from generate_series(1, 1000000) as id;
SELECT 1000000
=# create unique index on t(id);
CREATE INDEX

For the RQLExpression "EXISTS(X identity X)" the generated sql was: Any X WHERE EXISTS(EXISTS(X identity X)), X eid %(eid)s, which is equivalent to:

=# explain analyze select id from t where exists(select 1 where exists(select 1 from t as x where t.id = x.id) and t.id = 42);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..8469425.00 rows=500000 width=4) (actual time=0.277..1426.342 rows=1 loops=1)
Filter: (SubPlan 2)
Rows Removed by Filter: 999999
SubPlan 2
-> Result (cost=8.45..8.46 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1000000)
One-Time Filter: ($1 AND (t.id = 42))
InitPlan 1 (returns $1)
-> Index Only Scan using t_id_idx on t x (cost=0.42..8.44 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1000000)
Index Cond: (id = t.id)
Heap Fetches: 1000000
Planning Time: 0.190 ms
Execution Time: 1426.384 ms


The planner wasn't able to optimise this (bad written) query, it produce a full table read (Seq Scan).

With a single EXISTS, the query perform much better: Any X WHERE EXISTS(X identity X), X eid %(eid)s, which is equivalent to:

=# explain analyze select id from t where exists(select 1 from t as x where t.id = x.id) and t.id = 42;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Nested Loop Semi Join (cost=0.85..16.90 rows=1 width=4) (actual time=0.093..0.095 rows=1 loops=1)
-> Index Only Scan using t_id_idx on t (cost=0.42..8.44 rows=1 width=4) (actual time=0.081..0.082 rows=1 loops=1)
Index Cond: (id = 42)
Heap Fetches: 1
-> Index Only Scan using t_id_idx on t x (cost=0.42..8.44 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: (id = 42)
Heap Fetches: 1
Planning Time: 0.383 ms
Execution Time: 0.136 ms


An alternative patch would be to insert the EXISTS only if the original query
doesn't already contains a EXISTS. But I think it's better to drop the magic
here and let user control what they really want in their expressions.

I added a note to the 3.27 changelog for this.

Also add some tests to RQLExpressionTC with EXISTS().

diff --git a/cubicweb/schema.py b/cubicweb/schema.py
@@ -315,11 +315,10 @@
1                  keyarg = rqlst.selection[0].name.lower()
2              else:
3                  keyarg = None
4              rqlst.recover()
5              return rql, found, keyarg
6 -        rqlst.where = nodes.Exists(rqlst.where)
7          return rqlst.as_string(), None, None
8 
9      def _check(self, _cw, **kwargs):
10          """return True if the rql expression is matching the given relation
11          between fromeid and toeid