Federated Query

From Blazegraph

Bigdata supports the SPARQL 1.1 Federated Query Extension. However, the trick with federated query is managing the order in which local and remote joins are evaluated. This page provides some guidance on how to do that with bigdata. Also see the QueryHints page.

Background

If the service reference is a variable, then it MUST be bound before the SERVICE call can be evaluated. The query will throw an exception if that SERVICE URI is never bound.

The default behavior for a SERVICE is to wait until all source solutions have been fully buffered and then evaluate the SERVICE call once. This "atOnce" evaluation can be explicitly disabled using the atOnce query hint. When run using atOnce evaluation, all source solutions for the SERVICE will be vectored into a single ServiceCallJoin, and the ServiceCallJoin will be evaluated exactly once in the query (for a given SERVICE clause).

If the service reference is a variable, then there will still be one service end point invocation per distinct bound value for that variable (the solutions are re-grouped by the service end point and then vectored to each service end point). However, there should be only one remote service end point call per service end point within the query.

Example

Here is a sample federated query. It joins local solutions for ?s and ?o1 with remote solutions for ?s and ?o2 produced by the SERVICE clause.

PREFIX : <http://example.org/> 

SELECT ?s ?o1 ?o2 
{
  ?s ?p1 ?o1 .
  SERVICE <http://example.org/endpoint1> {
    ?s ?p2 ?o2
  }
} 
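
A variant of the sample query can illustrate the variable service reference described in the Background section. This is only a sketch: the second end point URI (http://example.org/endpoint2) is hypothetical, and VALUES is just one way to make sure that ?endpoint is bound before the SERVICE is evaluated. With two distinct bindings for ?endpoint, one remote invocation per end point would be expected.

```sparql
PREFIX : <http://example.org/>

SELECT ?s ?o1 ?o2
{
  # Bind the service reference up front (endpoint2 is a hypothetical URI).
  VALUES ?endpoint { <http://example.org/endpoint1> <http://example.org/endpoint2> }
  ?s ?p1 ?o1 .
  # Source solutions are re-grouped by ?endpoint and vectored to each end point.
  SERVICE ?endpoint {
    ?s ?p2 ?o2
  }
}
```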

Evaluation Order

The default evaluation order for SERVICEs is as follows. You can customize the evaluation order using QueryHints, NamedSubquery, or the ServiceRegistry.

  1. Services in a NamedSubquery will run before services in the main WHERE clause.
  2. Services may be configured to always run as early as possible within the join group. This is done by setting the runFirst property for the IServiceOptions for the service URI. Note that this directive only works when the service URI is a constant in the query. If the service URI is a variable, then the query planner cannot resolve the IServiceOptions for the end point until after the join evaluation order has been locked in.
  3. If the serviceRef is a constant, then it runs before SERVICEs with a variable reference, but after other required joins.
  4. If the serviceRef is a variable, then the service will be evaluated after all other required joins.

Bigdata does not currently optimize the order of SERVICE evaluation within those groups.

Large Result Sets

If the remote end point produces a large result set, then that result set might have to be fully materialized within bigdata before your query can be completed. However, this is not always true. It becomes true if the query plan cannot be pipelined. Examples of queries that cannot be pipelined include:

  1. ORDER BY - all results must be materialized. ORDER BY is applied before a LIMIT, so a LIMIT will not help.
  2. Sub-Select, Optional, and Sub-Group - these are evaluated using a hash join. This forces all upstream results to be materialized before the hash join can run. So, if you have a LOT of results from that remote service, they will all be materialized into a hash join. This is a good time to enable the AnalyticQuery mode.
  3. SERVICE calls - if there are any downstream SERVICE calls, then they also use a hash join.

You can work around this issue in a number of ways:

  1. Use a LIMIT within the SERVICE's graph pattern
  2. Rewrite your query so it can be pipelined (no ORDER BY, no Sub-Select, no Sub-Groups, no Optional, and no additional SERVICE calls)
  3. Enable the AnalyticQuery mode. This will not prevent the remote solutions from being materialized, but it will take the burden off of the Java heap.
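
The first workaround can be sketched as follows. SPARQL 1.1 only allows LIMIT on a query or sub-query, so this sketch assumes the LIMIT is applied through a sub-select inside the SERVICE block. The value 1000 is an arbitrary illustration.

```sparql
PREFIX : <http://example.org/>

SELECT ?s ?o1 ?o2
{
  ?s ?p1 ?o1 .
  SERVICE <http://example.org/endpoint1> {
    # The sub-select caps the number of solutions returned by the remote end point.
    SELECT ?s ?o2 {
      ?s ?p2 ?o2
    }
    LIMIT 1000
  }
}
```

Because the limit is applied by the remote end point, it can discard solutions that would otherwise have joined with the local data. It trades completeness for a bounded result set.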

Query Hints

See the QueryHints page for more information.

hint:runFirst

You can specify that a SERVICE should be the first thing that is evaluated within a join group using the runFirst query hint.
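
As a rough sketch, the hint might be attached as shown below. This assumes that the hint is written as a hint:Prior triple immediately after the SERVICE clause, and that the hint: prefix resolves to the bigdata query hints namespace. Check the QueryHints page for the authoritative syntax and scopes before relying on it.

```sparql
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT ?s ?o1 ?o2
{
  ?s ?p1 ?o1 .
  SERVICE <http://example.org/endpoint1> {
    ?s ?p2 ?o2
  }
  # Assumption: hint:Prior binds the hint to the SERVICE clause just above.
  hint:Prior hint:runFirst true .
}
```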

hint:runLast

You can specify that a SERVICE should be the last thing that is evaluated within a join group using the runLast query hint.

hint:optimizer "None"

If you disable the static query optimizer, then the joins will be run in the exact specified order.
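
A minimal sketch of this approach, assuming the hint is applied at query scope with hint:Query. Here the SERVICE is written first, so it would be evaluated before the local triple pattern.

```sparql
PREFIX hint: <http://www.bigdata.com/queryHints#>

SELECT ?s ?o1 ?o2
{
  # Disable the static optimizer: joins run in the order written below.
  hint:Query hint:optimizer "None" .
  SERVICE <http://example.org/endpoint1> {
    ?s ?p2 ?o2
  }
  ?s ?p1 ?o1 .
}
```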

hint:maxParallel

The number of concurrent service end point invocations for a query is governed by the maximum operator concurrency for the ServiceCallJoin operator. The default is 5, which means that requests for up to 5 distinct service end points will be processed in parallel for a single query. This can be overridden using the maxParallel query hint.

ServiceRegistry

The ServiceRegistry provides a place to register and configure service end points. You do not need to register a service URI before you can query it, but the default configuration might not work for all services. For example, the service might not support SPARQL 1.1, in which case we have to use a different technique to vector solutions to that service end point.

SPARQL 1.0 or SPARQL 1.1

All queries to remote service end points are vectored. This is done using different techniques for SPARQL 1.0 and SPARQL 1.1.

Default Behavior

You can specify the default assumption for all service end points using the ServiceRegistry:

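import com.bigdata.rdf.sparql.ast.service.RemoteServiceFactoryImpl;
import com.bigdata.rdf.sparql.ast.service.RemoteServiceOptions;
import com.bigdata.rdf.sparql.ast.service.ServiceRegistry;

// Options applied to any service end point that has no explicit registration.
final RemoteServiceOptions options = new RemoteServiceOptions();

options.setSparql11(...); // true or false

final RemoteServiceFactoryImpl remoteServiceFactory = new RemoteServiceFactoryImpl(options);

ServiceRegistry.getInstance().setDefaultServiceFactory(remoteServiceFactory);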

Note: The default assumption is SPARQL 1.1. You can change this using the incantation above. Review the javadoc for other service configuration options.

Custom Services

Custom services are a great integration option. Using a custom service, you can extend the bigdata RDF database and add your own application or domain specific behaviors. You can observe updates as statements are added to or removed from the RDF database and use that information to maintain your own indices. Best of all, custom services are invoked using the SPARQL 1.1 SERVICE syntax so they are automatically accessible to your application at the SPARQL layer.

If your service is an RDF application, then it merely interprets the SERVICE group graph pattern. Remote SPARQL end points already do exactly this. Custom services integrated with bigdata can do the same thing, but can also observe database updates. Otherwise, you need to interpret the group graph pattern as "magic triples" providing service specific directives. See Full Text Search in Bigdata and FullTextSearch for how we do this for full text search.

Custom services implement the ServiceFactory or CustomServiceFactory interface. Custom services can be either bigdata "aware" (they will interchange bigdata IBindingSet objects) or openrdf "aware" (they will interchange openrdf BindingSet objects). Bigdata aware services are more efficient, but openrdf aware services are easier to write. Either way, solutions will be vectored into your custom service, and solutions produced by your custom service will be vectored back into bigdata.

Custom Service Examples

Bigdata's internal full text search component is implemented as a custom service. See the SearchServiceFactory. The bigdata search service is integrated pretty tightly into the code, but that is mainly for historical reasons (before SPARQL 1.1 there was no mechanism in the query language to integrate custom services). As another example, see the ExternalFullTextSearch service.

The opensahara project maintains a geospatial and full text search integration based on a custom service.

Monitoring Updates

Many times custom services need to be able to observe updates against the database. You can do this by writing an IChangeLog implementation. By registering as a CustomServiceFactory, your service will automatically be notified at each mutable connection start. This notice provides an opportunity to register an IChangeLog against the BigdataSailConnection.

Configuring Service URLs Whitelist

If you want to restrict the Federated Query service URIs that are allowed in SPARQL queries, you need to configure a service URLs whitelist. This can be done by adding the 'serviceWhitelist' parameter into the 'web.xml' file. For example, specifying the following parameter will allow using 'http://www.bigdata.com/rdf/search#search' and 'http://www.bigdata.com/rdf/geospatial#search' services only. You can also use the whitelist facility to restrict SPARQL Federated query to specific SPARQL end points.

<context-param>
  <description>List of allowed services.</description>
  <param-name>serviceWhitelist</param-name>
  <param-value>http://www.bigdata.com/rdf/search#search, http://www.bigdata.com/rdf/geospatial#search</param-value>
</context-param>

Attempting to use a service URL that is not whitelisted causes an IllegalArgumentException.

See com.bigdata.rdf.sparql.ast.service.ServiceRegistry (in the code) for a list of pre-defined services that are automatically registered.