SPARQL Order Matters

From Blazegraph
Jump to: navigation, search

In SPARQL, Order Matters

In the Blazegraph 1.5.2 release, we’ve fixed some internal issues regarding the evaluation order within join groups, making Blazegraph’s join group evaluation strategy compliant with the official SPARQL W3C semantics. As a consequence, for some query patterns, the behavior of Blazegraph changed and if you’re upgrading your Blazegraph installation to 1.5.2 or higher it might make sense to review your queries for such patterns, in order to avoid regressions. This Wiki page explains why order in SPARQL matters and indicates changes in the evaluation semantics that you may need to consider when upgrading.

Introduction

One common mistake made when writing SPARQL queries is the implicit assumption that, within SPARQL join groups, the order in which components of the group are specified does not matter. The point here is that, by definition, SPARQL follows a sequential semantics when evaluating join groups and, although in many cases the order of evaluating patterns does not change the outcome of the query , there are some situations where the order is crucial to get the expected query result.

A Simple Example

To illustrate why order in SPARQL indeed matters, assume we want to find all person URIs and, if present, associated images. Instantiating this scenario, let us consider query

Q1a

 SELECT * WHERE {
   ?person rdf:type <http://example.com/Person>
   OPTIONAL { ?person <http://example.com/image> ?image }
 }

over a database containing the following triples:

 <http://example.com/Alice> rdf:type                    <http://example.com/Person> .
 <http://example.com/Alice> <http://example.com/image>  "Alice.jpg" .
 <http://example.com/Bob>   rdf:type                    <http://example.com/Person> .

The result of this query is quite obvious and in line with our expectations: We get one result row for Alice, including the existing image, plus one result row for Bob, without an image:

 ?person                    | ?image
 ---------------------------|------------
 <http://example.com/Alice> | "Alice.jpg"
 <http://example.com/Bob>   |

Now let’s see what happens when swapping the order of the triple pattern and the OPTIONAL block in the query:

Q1b

 SELECT * WHERE {
   OPTIONAL { ?person <http://example.com/image> ?image } .
   ?person rdf:type <http://example.com/Person>
 }

The result of Q1b may come at a surprise:

 ?person                    | ?image
 ---------------------------|------------
 <http://example.com/Alice> | "Alice.jpg"

Where’s Bob gone? As mentioned before, SPARQL evaluates the query sequentially. This is, in the first step it evaluates the OPTIONAL { ?person :image ?image } block. As a result of this first step, we obtain:

 ?person                    | ?image
 ---------------------------|------------
 <http://example.com/Alice> | "Alice.jpg"

No Bob in sight here. In a second step, this partial result joined with the (non-optional) triple pattern ?person rdf:type <http://example.com/Person>. Intuitively speaking, the join with the triple patterns now serves as an assertion, retaining those result rows for which the value of variable ?person can be shown to be of rdf:type <http://example.com/Person>. Obviously, this assertion holds for our (single) result, the URI of Alice, so the previous intermediate result is left unmodified.

So what we get in this case is the “set of all persons that have an image associated, including the respective images”. But is this informal description really capturing what Q2b does? Consider the evaluation of Q2b over a database where none of the persons has an image associated, say:

 <http://example.com/Alice> rdf:type <http://example.com/Person> .
 <http://example.com/Bob>   rdf:type <http://example.com/Person> .

Here’s the result:

 ?person                    | ?image
 ---------------------------|------------------------------------
 <http://example.com/Alice> | 
 <http://example.com/Bob>   | 

Surprised again? Well, here’s the explanation: SPARQL evaluation starts out over the so-called “empty mapping” (also called “universal mapping”): the OPTIONAL block is not able to retrieve any results, and given the semantics of OPTIONAL, this step simply retains the empty mapping. This empty mapping is then joined with the non-optional triple pattern, giving us the two bindings for  ?person as result, with the ?image being unbound in both.

Taking our observations together, the semantics of Q1b could be phrased as: “If there is at least one triple with predicate <http://example.com/image> in our database, return all persons that have an image and including the image, otherwise return all persons.”

Practical Implications

The example above illustrates an interesting edge case, which (given that the semantics sketched informally above is somewhat odd) does not frequently show up in practice though. For join groups composed out of (non-optional) statement patterns only, for instance, the order in which you denote the triple patterns does not matter at all. But for operators OPTIONAL and MINUS we may run into such ordering problems.

Interestingly, even for patterns involving OPTIONAL (and MINUS), in many cases there are different orders that lead to the same result (independently on the data in the underlying database). Let’s look at the following variant of our initial query Q1a which, in addition, extracts the person label (and assume we have labels in our sample database for the persons, as well):

Q2a

 SELECT * WHERE {
   ?person rdf:type <http://example.com/Person> .
   ?person rdfs:label ?label .
   OPTIONAL { ?person <http://example.com/image> ?image } .
 }

Following the same arguments as in our example before, it is not allowed to move the OPTIONAL to the beginning, but the following query

Q2b

 SELECT * WHERE {
   ?person rdf:type <http://example.com/Person> .
   OPTIONAL { ?person <http://example.com/image> ?image } .
   ?person rdfs:label ?label .
 }

can be shown to be semantically equivalent. Without going into details, the key difference here towards our example from the beginning is that variable ?person has definitely been bound (in both Q2a and Q2b) when evaluating the OPTIONAL; this rules out strange effects as discussed in the beginning .

If you’re confused now about where to place your OPTIONAL’s, here are some rules of thumb:

  1. Be aware that order matters
  2. First specify your non-optional parts of the query (unless there’s some good to not do so)
  3. Then specify your OPTIONAL clauses to include optional patterns of your graph


For short, unless you want to encode somewhat odd patterns in SPARQL: it usually makes sense to put OPTIONAL patterns in the end of your join groups. If you’re still unsure about some of your queries, just have a look at the semantics section in the official W3C specs.

Optimizations within Blazegraph 1.5.2

While the official SPARQL semantics uses a sequential approach, one key optimization approach of the Blazegraph query optimizer is to identify a semantics-preserving reordering of join groups in the query that minimizes the amount of work spent in evaluating the query. Thereby, amongst other heuristics (such as cardinality estimations for the individual triple patterns), optimizers typically try to evaluate non-optional patterns first (this holds particularly in the presence of more complex OPTIONAL expressions).

As a consequence, queries like those sketched above challenge the Blazegraph query optimizer: reordering must be done carefully and based on sound theoretical foundations, in order to retain the semantics of the original query within the rewriting process. In Blazegraph 1.5.2, we’ve reworked Blazegraph’s internal join group reordering strategy.

For instance, Blazegraph 1.5.2 is able to detect the equivalence transformations (as sketched in Q2a/Q2b) when evaluating queries and uses this knowledge to find more efficient query execution plans: if you run Q2b query above in Blazegraph and turn the EXPLAIN mode on, you’ll see that the OPTIONAL clause is moved to the end by the optimizer in the optimized abstract syntax tree.