The problem with this specific query is that we have 5M statements in the database. 1/2 of them are link attributes. The other half is the ground triple for the link. The query is being executed with what amounts to two fully unbound triple patterns, so it begins with a full scan of 2.5M statements before it finds the first statement that can join with the 2nd triple pattern.
There are two ways to fix this:
1. The two statement patterns could be combined as a single statement pattern which uses a prefix constraint to only visit the SIDs portion of the SPO index. The Subject of the visited statements will be a SID. We can then decompose that SID, obtaining bindings for its embedded s, p, and o. (Actually, this is not correct since it fails to verify the existence of the ground triple and the RDR syntax requires that we do this.)
2. The second statement pattern can have a key-range constraint that is lifted onto the access path and the optimizer can notice this key-range constraint. The constraint is that the S variable MUST be a SID, so it will have the appropriate prefix bytes. If the optimizer notices this constraint and the range counts are collected with the constraint applied, then the JOINs will be executed in the other order and the query will be fast (because each statement visited on the 1st join will have a SID for its Subject position and the nexted index join for the 2nd triple pattern will always succeed).
In fact, both of this really rely on recognizing the key-range constraint that can be lifted onto the access path. (1) just goes a step further and eliminates one join.
(1) is not valid. The RDR syntax for data asserts the existence of the ground triple. The RDR syntax for query requires the presence of the ground triple. Therefore we really do need to run both joins. Hence, I think that (2) is correct while (1) is not.
We have added some unit tests that help to analyze the problem, but the basic performance issue remains.