Opened 4 years ago

Closed 4 years ago

#815 closed defect (fixed)

RDR query does too much work

Reported by: bryan b. thompson Owned by: mikepersonick
Priority: blocker Milestone: Query
Component: Query Plan Generator Version: RDR
Keywords: RDR Cc: mrpersonick, thompsonbry

Description

The following query is doing a full scan on

select * {<<?s ?p ?o>> ?p1 ?o1 } LIMIT 10

If you look at the EXPLAIN of this query, you can see that it visited ALL statements on SPOPredicate[1](?src, ?p2, ?tgt, ?--anon-2 then did the (pipeline) join with SPOPredicate[4](?-sid-1, ?p, ?o, ?--anon-5).

The right way to run this query is to do a scan of the SID region of the SPO index and then unpack the <<?s ?p ?o>> from the subject position of the visited triples. There is no other join.

The RDR syntax <<x y z>> p o implies that <<x y z>> must exist (in a query) and asserts that statement (when parsing data). An interpretation that does not respect this might be the root of the problem.

Attachments (1)

RDR-query-explain.html (9.2 KB) - added by bryan b. thompson 4 years ago.
Explain of the query for this ticket.

Download all attachments as: .zip

Change History (10)

comment:1 Changed 4 years ago by bryan b. thompson

  • Status changed from new to assigned

Changed 4 years ago by bryan b. thompson

Explain of the query for this ticket.

comment:2 Changed 4 years ago by bryan b. thompson

The problem with this specific query is that we have 5M statements in the database. 1/2 of them are link attributes. The other half is the ground triple for the link. The query is being executed with what amounts to two fully unbound triple patterns, so it begins with a full scan of 2.5M statements before it finds the first statement that can join with the 2nd triple pattern.

There are two ways to fix this:

  1. The two statement patterns could be combined as a single statement pattern which uses a prefix constraint to only visit the SIDs portion of the SPO index. The Subject of the visited statements will be a SID. We can then decompose that SID, obtaining bindings for its embedded s, p, and o. (Actually, this is not correct since it fails to verify the existence of the ground triple and the RDR syntax requires that we do this.)
  1. The second statement pattern can have a key-range constraint that is lifted onto the access path and the optimizer can notice this key-range constraint. The constraint is that the S variable MUST be a SID, so it will have the appropriate prefix bytes. If the optimizer notices this constraint and the range counts are collected with the constraint applied, then the JOINs will be executed in the other order and the query will be fast (because each statement visited on the 1st join will have a SID for its Subject position and the nexted index join for the 2nd triple pattern will always succeed).

In fact, both of this really rely on recognizing the key-range constraint that can be lifted onto the access path. (1) just goes a step further and eliminates one join.

(1) is not valid. The RDR syntax for data asserts the existence of the ground triple. The RDR syntax for query requires the presence of the ground triple. Therefore we really do need to run both joins. Hence, I think that (2) is correct while (1) is not.

We have added some unit tests that help to analyze the problem, but the basic performance issue remains.

comment:3 Changed 4 years ago by bryan b. thompson

comment:4 Changed 4 years ago by bryanthompson

  • Owner changed from mrpersonick to mikepersonick
  • Version set to RDR

comment:5 Changed 4 years ago by bryanthompson

Mike and I have talked this through. It appears that we need to modify
the SPOPredicate.asBound() method to unify the SID variable (when present)
and the S,P,O and C (if in quads mode) positions on the SPO. Since this
unification can fail, we have decided to modify the contract for asBound()
to allow it to return null if the unification does not succeed. Mike is
going to handle this.

I am going modify the PipelineJoin? code to handle the case where
SPOPredicate.asBound() returns null. In this case, the incoming binding
set simply fails the join. The logic needs to be updated to let that
happen naturally.

comment:6 Changed 4 years ago by bryanthompson

IPredicate.asBound() - javadoc. now allows a null return.

BOpContext.bind() - the bind() version associated with the IElement based PipelineJoin? code path has been deprecated.

PipelineJoin?:

  • the handleJoin() method has been deprecated. It is associated with the IElement process rather than "solutions" based AP reads.
  • the code in the BindingSetConsumerTask? has been modified to accomodate the possible null return from SPOPreciate.asBound() when the SPOPredicate is associated with an RDR access path (the SID variable is defined). Incoming binding sets that cause SPOPredicate.asBound() to fail will cause the join to fail for that source solution. For these cases, the join for that source solution is simply not run (we do not create the associated AccessPathTest?).

Committed revision r7868.

comment:7 Changed 4 years ago by mikepersonick

Fixed SPOPredicate asBound to deal with the case where the sid var is bound in the incoming solution. It will do some consistency checking (filters out solutions where the sid bound to sidVar is either not a sid at all or has values that do not match the constants in the SPOPredicate). It will then bind the values from the incoming sid onto any variables defined by the SPOPredicate before doing the asBound. This will give us the point tests to make sure the statements are ground.

Also changed the StaticOptimizer? to pick a better ordering when two statement patterns have the same cardinality but one has a sid var and the other does not.

comment:8 Changed 4 years ago by bryanthompson

  • Resolution set to fixed
  • Status changed from assigned to closed

Fix verified in r7870.

Note: See TracTickets for help on using tickets.