Inference And Truth Maintenance

From Blazegraph
Revision as of 11:33, 21 July 2015 by Igor Kim (Talk | contribs) (updated documentation 1.5.2)


Bigdata supports inference and incremental truth maintenance for rules that can be expressed as conjunctive queries. Bigdata bundles support for RDFS+ (RDFS plus a little bit of OWL) as well as some simpler profiles, and it can be extended to draw additional inferences. Support for inference is primarily based on the eager materialization of additional triples in the database. This can occur either when the data is loaded or when the inference engine is explicitly invoked.

Less is more

Many inferences do not add any information to your application, and inference can consume significant computational and disk resources.

For example, rdfs:domain and rdfs:range are basically useless inferences. They do not provide a constraint on the data (you need to at least be using OWL to have constraints, which are a type of negation - negation is not supported by RDFS and cannot be expressed in a simple conjunctive query). All they do is annotate the data with instance / class relationships. Those same relationships can nearly always be asserted directly by your application when posting data to the database. If you can do this in your application, then you can avoid the relatively expensive inference and truth maintenance for these rules.

(u rdf:type x) <= (a rdfs:domain x), (u a y). // rdfs:domain rule (RDFS 2)

(v rdf:type x) <= (a rdfs:range x), (u a v). // rdfs:range rule (RDFS 3)
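
To make concrete what these two rules produce, here is a minimal in-memory sketch in plain Java. It does not use any Bigdata classes: the class name DomainRangeSketch, the triple-as-list representation, and the :worksFor example data are all assumptions made for illustration. It shows that the rules only emit rdf:type triples, which an application that already knows its schema could assert directly when posting data.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy model of RDFS 2 and RDFS 3, not Bigdata code.
class DomainRangeSketch {

    // A triple is modelled as an immutable 3-element list: [subject, predicate, object].
    static List<String> triple(String s, String p, String o) {
        return List.of(s, p, o);
    }

    /** Returns the rdf:type triples entailed by the domain (RDFS 2) and range (RDFS 3) rules. */
    static Set<List<String>> inferTypes(Set<List<String>> data) {
        Set<List<String>> inferred = new LinkedHashSet<>();
        for (List<String> schema : data) {
            boolean isDomain = schema.get(1).equals("rdfs:domain");
            boolean isRange = schema.get(1).equals("rdfs:range");
            if (!isDomain && !isRange) {
                continue;
            }
            String property = schema.get(0);
            String cls = schema.get(2);
            for (List<String> t : data) {
                if (!t.get(1).equals(property)) {
                    continue;
                }
                // The domain rule types the subject; the range rule types the object.
                String instance = isDomain ? t.get(0) : t.get(2);
                inferred.add(triple(instance, "rdf:type", cls));
            }
        }
        return inferred;
    }

    public static void main(String[] args) {
        Set<List<String>> data = new LinkedHashSet<>();
        data.add(triple(":worksFor", "rdfs:domain", ":Person"));
        data.add(triple(":worksFor", "rdfs:range", ":Company"));
        data.add(triple(":alice", ":worksFor", ":acme"));
        System.out.println(inferTypes(data));
    }
}
```

In this example the only entailments are that :alice is a :Person and :acme is a :Company, both of which the loading application could have written itself.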

Another example of a useless inference is RDFS4. This rule states that everything is an rdfs:Resource. This inference is part of the model theory, but has no practical application. You will never write a query to find all rdfs:Resources so you do not need to compute this inference.

(?u rdf:type rdfs:Resource) <= (?u ?a ?x).

(?v rdf:type rdfs:Resource) <= (?u ?a ?v).

Modes of Inference

Bigdata supports two very different modes of inference. The first is database-at-once closure. The other is incremental truth maintenance.

Database at once closure

In this mode, the inferences are computed for all triples in the graph at once. This can be significantly more efficient if there are a large number of triples in the graph. However, because SPARQL UPDATE does not define a means to manage inference, database-at-once closure cannot currently be controlled through the REST API. Instead you should use the DataLoader class or the BigdataSailConnection. Both of these allow you to explicitly control when the inferences are computed.

Incremental Truth Maintenance

If you enable inference for the graph, then incremental truth maintenance is performed every time data is added to, or removed from, the graph. When new triples are added, new inferences (also called entailments) may be produced and there may be new ways in which existing inferences could be proven. All of this is recorded in the graph. The inferences are written into the statement indices and marked as "inferences" (vs explicit statements or axioms). The proof chains are written into a justifications index. When triples are removed from the graph, the proof chains are used to decide whether any inferences that were supported by those triples can still be proven. If an inference can no longer be proven, it will be retracted from the graph.
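
The bookkeeping described above can be sketched with a toy in-memory model. This is not how Bigdata implements it: the class TruthMaintenanceSketch, its maps, and the single subclass rule it applies are assumptions chosen for illustration, and for brevity add() re-evaluates the rule over all statements rather than only over the newly added ones. What it does show is the role of the justifications: every proof of an inference is recorded, and on removal an inference survives only if some recorded proof is still grounded in explicit statements.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a toy justification store, not the Bigdata implementation.
class TruthMaintenanceSketch {

    final Set<List<String>> explicit = new HashSet<>();
    // Inferred triple -> every known proof; each proof is the set of premises it used.
    final Map<List<String>, Set<Set<List<String>>>> justifications = new HashMap<>();

    static List<String> t(String s, String p, String o) {
        return List.of(s, p, o);
    }

    Set<List<String>> all() {
        Set<List<String>> s = new HashSet<>(explicit);
        s.addAll(justifications.keySet());
        return s;
    }

    boolean holds(List<String> triple) {
        return all().contains(triple);
    }

    /** Adds an explicit triple, then applies (x rdf:type D) <= (C rdfs:subClassOf D), (x rdf:type C). */
    void add(List<String> triple) {
        explicit.add(triple);
        boolean changed = true;
        while (changed) { // repeat until no new inference and no new proof is found
            changed = false;
            List<List<String>> snapshot = new ArrayList<>(all());
            for (List<String> sub : snapshot) {
                if (!sub.get(1).equals("rdfs:subClassOf")) continue;
                for (List<String> type : snapshot) {
                    if (!type.get(1).equals("rdf:type") || !type.get(2).equals(sub.get(0))) continue;
                    List<String> head = t(type.get(0), "rdf:type", sub.get(2));
                    Set<List<String>> proof = Set.of(sub, type);
                    changed |= justifications.computeIfAbsent(head, k -> new HashSet<>()).add(proof);
                }
            }
        }
    }

    /** Removes an explicit triple and retracts every inference that can no longer be proven. */
    void remove(List<String> triple) {
        explicit.remove(triple);
        // A triple is grounded if it is explicit or has a proof whose premises are all grounded.
        Set<List<String>> grounded = new HashSet<>(explicit);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<List<String>, Set<Set<List<String>>>> e : justifications.entrySet()) {
                if (grounded.contains(e.getKey())) continue;
                for (Set<List<String>> proof : e.getValue()) {
                    if (grounded.containsAll(proof)) {
                        grounded.add(e.getKey());
                        changed = true;
                        break;
                    }
                }
            }
        }
        justifications.keySet().retainAll(grounded);
        for (Set<Set<List<String>>> proofs : justifications.values()) {
            proofs.removeIf(p -> !grounded.containsAll(p));
        }
        justifications.values().removeIf(Set::isEmpty);
    }
}
```

Computing groundedness from the explicit statements upward, rather than just counting remaining proofs, keeps two inferences that only support each other from surviving a retraction.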

Incremental truth maintenance is fast for small updates. For large updates it can be very expensive. Therefore bulk load should ALWAYS use the database-at-once closure method.

Inference and Quads

Bigdata does not support inference in the quads mode out of the box. This is because there is no standard that specifies which named graphs are the sources for the told triples and which named graphs are the destinations for the inferred triples. There are actually a variety of interesting ways to set this up. Some examples and some possible ways to handle them are outlined below.

  1. Some named graphs provide the ontology that is used by all other named graphs. Each named graph is then combined with the ontology graphs and the inferences are written back into the named graph.
  2. All named graphs are used as the source for inference and the results are written into the default graph (this does not work for bigdata, since bigdata interprets the default graph as the RDF merge of the named graphs).
  3. Each named graph is completely independent. The inferences are computed solely based on the triples in the named graphs. For this case, you can use the Multi-Tenancy API and place each named graph into its own triple store within a single bigdata instance. If you need to query across those graphs, you can do this using SPARQL Basic Federated Query.
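
As one possible reading of the first arrangement in the list above, the following plain-Java sketch combines each data graph with the ontology graphs, computes a closure, and writes the new triples back into that data graph only. Nothing here is Bigdata API: the class PerGraphInferenceSketch, the map-of-sets model of named graphs, and the single subclass rule are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a toy model of "ontology graphs shared by all data graphs".
class PerGraphInferenceSketch {

    static List<String> t(String s, String p, String o) {
        return List.of(s, p, o);
    }

    // Applies (x rdf:type D) <= (C rdfs:subClassOf D), (x rdf:type C) until nothing new appears.
    static Set<List<String>> close(Set<List<String>> input) {
        Set<List<String>> db = new HashSet<>(input);
        boolean changed = true;
        while (changed) {
            Set<List<String>> fresh = new HashSet<>();
            for (List<String> sub : db) {
                if (!sub.get(1).equals("rdfs:subClassOf")) continue;
                for (List<String> type : db) {
                    if (type.get(1).equals("rdf:type") && type.get(2).equals(sub.get(0))) {
                        fresh.add(t(type.get(0), "rdf:type", sub.get(2)));
                    }
                }
            }
            changed = db.addAll(fresh);
        }
        return db;
    }

    /** Writes the inferences for each data graph back into that graph; ontology graphs are left unchanged. */
    static void inferPerGraph(Map<String, Set<List<String>>> graphs, Set<String> ontologyGraphs) {
        Set<List<String>> ontology = new HashSet<>();
        for (String name : ontologyGraphs) {
            ontology.addAll(graphs.getOrDefault(name, Set.of()));
        }
        for (Map.Entry<String, Set<List<String>>> e : graphs.entrySet()) {
            if (ontologyGraphs.contains(e.getKey())) continue;
            Set<List<String>> combined = new HashSet<>(ontology);
            combined.addAll(e.getValue());
            Set<List<String>> closed = close(combined);
            closed.removeAll(ontology); // do not copy ontology triples into the data graph
            e.getValue().addAll(closed);
        }
    }
}
```

The sets stored in the map must be mutable, since the inferred triples are added to them in place.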

Scaling Inference with a Compute Cluster

Inference is a relatively heavyweight operation. However, it is possible to scale this operation using a compute cluster and durable work queues. This approach has been used by a number of customers that have to manage the data for a large number of tenants. It can also be combined very nicely with Map/Reduce processing for data ingest and with the use of an HA replication cluster to load balance and scale the query workload.


Configuring Inference

(Big thanks to Antoni, on the developer's mailing list, for summarizing the following.)

Note: When configuring inference, you need to understand the performance tradeoffs and be sure that you really need ALL the rules. Every new rule means a slower database. The default settings are fast and scalable.

Bigdata supports RDFS+. This means that it is possible to configure bigdata to provide RDF inference, RDFS inference, plus any other inferences that can be defined using conjunctive queries.

Inference in Bigdata depends on exactly three things:

  1. The axioms - i.e. the set of triples that are in the graph in the beginning and cannot be removed. They also serve to define the inferences that will be drawn by the closure program.
  2. The closure program - i.e. the set of inference rules and how they are applied.
  3. The InferenceEngine, which is obtained from the AbstractTripleStore and contains additional configuration like forwardChainRdfTypeRdfsResource, etc. The InferenceEngine config is read by the FastClosure and FullClosure classes (see below).

To be sure that inferencing behaves the way you need, you have to understand the meaning of the twelve configuration options and make sure they have the correct values, especially:

  1. The two options declared in AbstractTripleStore.Options whose values are the names of the Axioms class and the BaseClosure class, respectively.
  2. The properties defined in com.bigdata.rdf.rules.InferenceEngine and especially in InferenceEngine.Options

The default settings of the above options (OwlAxioms, FastClosure, default InferenceEngine.Options settings) yield a rule set that is not formally defined anywhere. It is neither full RDFS (e.g. rules RDFS4a and RDFS4b are disabled by default) nor OWL. It is meant to be a useful set of inference rules, not one that corresponds directly to a standard.

If you want to follow a written standard, and have all the RDFS entailment rules or the OWL 2 RL/RDF rules from their respective specifications, then you need to take care of this yourself. There are (at present) no canned "standard" settings that you can enable with the flick of a switch. If you need any of that, you will need to set the configuration options mentioned above and maybe even write your own Axioms and BaseClosure subclasses. The classes have to be packaged in a jar, placed in WEB-INF/lib and shipped with your Bigdata distribution (or contributed back into the core project distribution).

The only complete and authoritative documentation of ALL available Bigdata configuration options is in the code. You need to search for all interfaces named "Options" and read the javadocs of the constants there. Each constant X is accompanied by a DEFAULT_X constant with the default value.
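
Because the options follow the X / DEFAULT_X naming convention described above, they can also be listed with plain Java reflection. The sketch below is a hedged illustration: the nested Options interface and its com.example property names are hypothetical stand-ins, not real Bigdata options, and the helper assumes the constants of interest are of type String. To use it against the real code you would pass one of the actual "Options" interfaces instead.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: pairs each String constant X of an interface with DEFAULT_X.
class OptionsScanSketch {

    // Hypothetical stand-in for an "Options" interface; these are not real Bigdata options.
    interface Options {
        String EXAMPLE_FLAG = "com.example.Options.exampleFlag";
        String DEFAULT_EXAMPLE_FLAG = "true";
        String EXAMPLE_CLASS = "com.example.Options.exampleClass";
        String DEFAULT_EXAMPLE_CLASS = "com.example.DefaultImpl";
    }

    /** Maps each property name (the value of X) to its default (the value of DEFAULT_X, or null if absent). */
    static Map<String, String> optionDefaults(Class<?> options) {
        Map<String, String> constants = new TreeMap<>();
        for (Field f : options.getFields()) {
            if (f.getType() != String.class) continue;
            try {
                constants.put(f.getName(), (String) f.get(null)); // interface fields are static
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read " + f, e);
            }
        }
        Map<String, String> defaults = new TreeMap<>();
        for (Map.Entry<String, String> e : constants.entrySet()) {
            if (e.getKey().startsWith("DEFAULT_")) continue;
            defaults.put(e.getValue(), constants.get("DEFAULT_" + e.getKey()));
        }
        return defaults;
    }

    public static void main(String[] args) {
        optionDefaults(Options.class).forEach((name, def) -> System.out.println(name + " = " + def));
    }
}
```

The javadocs remain the authoritative description of what each option does; a listing like this only helps you find the names and defaults.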

Quads No Inference


Triples Modes

All of the triples modes optionally support Reification Done Right (aka RDR aka RDF*/SPARQL*). This mode can be enabled as follows:

Triples No Inference


Triples + RDFS with Incremental Truth Maintenance


Triples + subset of OWL with Incremental Truth Maintenance


Writing your own inference rules

It is easy to write your own rules. For example, RDFS10 is the rule (?u,rdfs:subClassOf,?u) <= (?u,rdf:type,rdfs:Class). Its full source code is:

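import org.openrdf.model.vocabulary.RDF;
import org.openrdf.model.vocabulary.RDFS;

import com.bigdata.rdf.spo.SPOPredicate;
import com.bigdata.rdf.vocab.Vocabulary;
import com.bigdata.relation.rule.Rule;

public class RuleRdfs10 extends Rule {

    private static final long serialVersionUID = -2964784545354974663L;

    public RuleRdfs10(String relationName, Vocabulary vocab) {

        super(  "rdfs10",//
                // head: (?u, rdfs:subClassOf, ?u)
                new SPOPredicate(relationName,var("u"), vocab.getConstant(RDFS.SUBCLASSOF), var("u")),//
                // tail: (?u, rdf:type, rdfs:Class)
                new SPOPredicate[]{
                    new SPOPredicate(relationName,var("u"), vocab.getConstant(RDF.TYPE), vocab.getConstant(RDFS.CLASS))//
                },//
                null // constraints
                );

    }

}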

Once you have written your own rule, you need to incorporate it into one of the "inference programs", typically the FullClosure program.

Fast Closure

The FastClosure program supports RDFS+ (RDFS plus a little bit of OWL). This is a good choice if you are using the built-in rules. If you declare your own rules, then you need to either modify this class (tricky) or use the FullClosure program instead.

Full Closure

The FullClosure program iteratively applies a set of rules until a fixed point is reached. It is possible to extend this program with user-defined inference rules and it will simply do the right thing. However, the FastClosure program described above is more efficient and should be used unless you have defined your own rules.
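
The idea of applying rules until a fixed point, and why user-defined rules slot in without special handling, can be sketched in plain Java. This is not the FullClosure class: FixedPointSketch, its Rule interface, and the :partOf transitivity rule are assumptions made for illustration. Each rule is just a function from the current triples to the triples it entails, and the loop stops when a full pass over the rules adds nothing.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Illustrative only: naive fixed-point evaluation with pluggable rules.
class FixedPointSketch {

    static List<String> t(String s, String p, String o) {
        return List.of(s, p, o);
    }

    // A rule maps the current set of triples to a fresh set of triples it entails.
    interface Rule extends Function<Set<List<String>>, Set<List<String>>> {}

    // rdfs10: (?u rdfs:subClassOf ?u) <= (?u rdf:type rdfs:Class)
    static final Rule RDFS10 = db -> {
        Set<List<String>> out = new HashSet<>();
        for (List<String> s : db) {
            if (s.get(1).equals("rdf:type") && s.get(2).equals("rdfs:Class")) {
                out.add(t(s.get(0), "rdfs:subClassOf", s.get(0)));
            }
        }
        return out;
    };

    // A user-defined rule: (?a p ?c) <= (?a p ?b), (?b p ?c) for a given predicate p.
    static Rule transitive(String predicate) {
        return db -> {
            Set<List<String>> out = new HashSet<>();
            for (List<String> a : db) {
                if (!a.get(1).equals(predicate)) continue;
                for (List<String> b : db) {
                    if (b.get(1).equals(predicate) && a.get(2).equals(b.get(0))) {
                        out.add(t(a.get(0), predicate, b.get(2)));
                    }
                }
            }
            return out;
        };
    }

    /** Applies every rule repeatedly until a full pass adds no new triple. */
    static Set<List<String>> closure(Set<List<String>> explicit, List<Rule> rules) {
        Set<List<String>> db = new HashSet<>(explicit);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Rule r : rules) {
                changed |= db.addAll(r.apply(Collections.unmodifiableSet(db)));
            }
        }
        return db;
    }

    public static void main(String[] args) {
        Set<List<String>> explicit = Set.of(
                t(":A", "rdf:type", "rdfs:Class"),
                t(":a", ":partOf", ":b"),
                t(":b", ":partOf", ":c"));
        System.out.println(closure(explicit, List.of(RDFS10, transitive(":partOf"))));
    }
}
```

A specialised program can order and combine its built-in rules to avoid repeated passes, which is one reason a generic loop like this is the slower option.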