Custom Function

This page describes how to write custom SPARQL functions for bigdata.

Custom functions are written using internal APIs which are still evolving.

Background

Bigdata uses a vectored query engine. Chunks of solutions flow through the query plan operators, which we refer to as "bigdata operators" or "bops" for short. There is parallelism across queries, across operators within a query, and within an operator (multiple instances of the same operator can be evaluated in parallel). Operators broadly break down into those which operate on solutions (PipelineOp) and those which operate on value expressions (e.g. math expressions, filters, etc). The former are vectored, operate on chunks of solutions at a time, and have access to the indices. The latter are not currently vectored (we are looking into this), operate on a single solution at a time, and do not have access to the indices (index access is always vectored to reduce latency and is therefore performed by pipeline ops).

In this page, we will look at writing custom function bops.

IVs

The IV interface is bigdata's internal model for an RDF Value. IVs may be either "inline" (they can be directly converted into an RDF Value) or non-inline (you have to read on an index in order to materialize the corresponding RDF Value). Inline IVs are assigned by the ILexiconConfiguration. There are many types of inline IVs, including IVs to represent most of the xsd datatype values, IVs for pre-declared vocabulary, etc. Non-inline IVs are assigned by writing on the TERM2ID or BLOBS index, depending on the size of the RDF Value. The statement indices are modeled directly using IVs.

Bigdata vectors both IV resolution (Value => IV) and Value materialization (IV => Value). The query plan generator is responsible for inserting the appropriate pipelined operator steps. Therefore, you do not need to worry about this yourself.

Function bops

Function bops:

  1. Extend IVValueExpression.
  2. Operate on a single solution at a time.
  3. Do not have access to the indices.
  4. Evaluate their arguments (which are value expressions) recursively.
  5. Return IVs when they are evaluated.
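
The recursive, one-solution-at-a-time evaluation described above can be pictured with a small toy model. This is only a sketch: Expr, variable, constant and add are hypothetical stand-ins invented for illustration, plain long values stand in for IVs, and none of this is Blazegraph API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of recursive value-expression evaluation. Every type here is a
 * hypothetical stand-in; none of them is a Blazegraph class.
 */
final class RecursiveEvalSketch {

    /** Stand-in for a value expression: evaluated against ONE solution. */
    interface Expr {
        long eval(Map<String, Long> solution);
    }

    /** A variable looks up its binding in the solution. */
    static Expr variable(final String name) {
        return solution -> {
            final Long bound = solution.get(name);
            if (bound == null) {
                // Stand-in for the type error raised on an unbound variable.
                throw new IllegalStateException("unbound variable: " + name);
            }
            return bound;
        };
    }

    /** A constant ignores the solution. */
    static Expr constant(final long value) {
        return solution -> value;
    }

    /** A function evaluates its arguments recursively, then combines them. */
    static Expr add(final Expr left, final Expr right) {
        return solution -> left.eval(solution) + right.eval(solution);
    }

    /** Evaluates ?x + (1 + 1) against the single solution { x = 40 }. */
    static long demo() {
        final Map<String, Long> solution = new HashMap<>();
        solution.put("x", 40L);
        return add(variable("x"), add(constant(1), constant(1))).eval(solution);
    }

    public static void main(final String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```

Note that nothing in the expression tree touches an index: each node only sees the solution it is handed and the results of its own arguments.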

Computing an IV

Some functions can generate an arbitrary RDF Value. That RDF Value must be turned into an IV, which is then returned from the function. From within a function bop, you can use the following method to generate IVs from RDF Values. If possible, it will return an inline IV. Otherwise it will stamp a "mock" IV:

    /**
     * Return an {@link IV} for the {@link Value}.
     * 
     * @param value
     *            The {@link Value}.
     * @param bsetIsIgnored
     *            The bindings on the solution are ignored, but the reference is
     *            used to obtain the {@link ILexiconConfiguration}.
     *            
     * @return An {@link IV} for that {@link Value}.
     */
     protected IV asIV(final Value value, IBindingSet bsetIsIgnored);

Sometimes there are ways to do this faster or smarter, but this method is for general purposes.

If the custom function is numerical in nature, it may be possible to bypass the creation of a materialized Value and go straight to the more compact IV representation. XSDNumericIV handles byte, short, int, long, and double. XSDBooleanIV handles booleans and has built-ins for TRUE and FALSE. XSDIntegerIV handles the type xsd:integer using the BigInteger class. XSDDecimalIV handles the type xsd:decimal using the BigDecimal class.

Writing a Custom Function

Functions that compute RDF Values should extend AbstractLiteralBOp. See LcaseBOp for a documented example.

Functions which compute Boolean values should extend XSDBooleanIVValueExpression.

Handling arguments

Most function bops have arguments which are themselves value expressions. Evaluation of the value expression arguments to a function is handled through recursion. Two kinds of dynamic errors can arise when evaluating the function arguments: type errors (including when a variable is not bound) and "not materialized" errors. Neither exception should be caught unless the semantics of your function demand it. Type errors typically cause a solution to be dropped. Not materialized exceptions are handled by the query plan, as described below.

    /**
     * Get the function argument (a value expression) and evaluate it against
     * the source solution. The evaluation of value expressions is recursive.
     * 
     * @param i
     *            The index of the function argument ([0...n-1]).
     * @param bs
     *            The source solution.
     * 
     * @return The result of evaluating that argument of this function.
     * 
     * @throws IndexOutOfBoundsException
     *             if the index is not the index of an operand for this
     *             operator.
     * 
     * @throws SparqlTypeErrorException
     *             if the value expression at that index can not be evaluated.
     * 
     * @throws NotMaterializedException
     *             if evaluation encountered an {@link IV} whose {@link IVCache}
     *             was not set when the value expression required a materialized
     *             RDF {@link Value}.
     */
     protected IV getAndCheck(final int i, final IBindingSet bs);
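
To illustrate why a function normally should not catch type errors itself, here is a toy model of a filter whose type error is handled by the caller, which drops the solution. It is a sketch only: TypeError, Filter, IS_ADULT and apply are hypothetical stand-ins, not Blazegraph classes, and the real engine's handling is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: hypothetical stand-ins, not Blazegraph classes. */
final class TypeErrorSketch {

    /** Hypothetical stand-in for a SPARQL type error. */
    static final class TypeError extends RuntimeException {
        TypeError(final String message) {
            super(message);
        }
    }

    /** Stand-in for a boolean function bop. */
    interface Filter {
        boolean accept(Map<String, Object> solution);
    }

    /** The function itself does NOT catch the type error; it just throws. */
    static final Filter IS_ADULT = solution -> {
        final Object age = solution.get("age");
        if (!(age instanceof Integer)) {
            throw new TypeError("?age is unbound or not an integer");
        }
        return ((Integer) age) >= 18;
    };

    /** Stand-in for the engine: a type error drops the solution. */
    static List<Map<String, Object>> apply(final Filter filter,
            final List<Map<String, Object>> chunk) {
        final List<Map<String, Object>> kept = new ArrayList<>();
        for (final Map<String, Object> solution : chunk) {
            try {
                if (filter.accept(solution)) {
                    kept.add(solution);
                }
            } catch (TypeError e) {
                // Dropped: the filter could not be evaluated for this solution.
            }
        }
        return kept;
    }

    /** Builds a solution; a null age leaves ?age unbound. */
    static Map<String, Object> solution(final String name, final Object age) {
        final Map<String, Object> s = new HashMap<>();
        s.put("name", name);
        if (age != null) {
            s.put("age", age);
        }
        return s;
    }

    /** Returns the names of the solutions that survive the filter. */
    static List<String> demo() {
        final List<Map<String, Object>> chunk = Arrays.asList(
                solution("a", 30),        // kept
                solution("b", null),      // unbound: type error, dropped
                solution("c", "thirty"),  // wrong type: type error, dropped
                solution("d", 10));       // filter evaluates to false, dropped
        final List<String> names = new ArrayList<>();
        for (final Map<String, Object> s : apply(IS_ADULT, chunk)) {
            names.add((String) s.get("name"));
        }
        return names;
    }
}
```

Keeping the error handling in one place (the caller) means every function gets the same "drop the solution" behaviour without repeating it.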

RDF Value Materialization

Function bops differ in their ability to handle IVs versus materialized RDF Values. They indicate their capabilities using the INeedsMaterialization interface.

  1. NEVER - The function never requires materialized RDF Values. For example, isBNode() can be decoded by inspecting the IV.
  2. SOMETIMES - The function sometimes requires materialized RDF Values. For example, math operations can often be performed on the inline IV.
  3. ALWAYS - The function always requires materialized RDF Values.

Your function will almost certainly be NEVER or SOMETIMES. In the case of NEVER, there is no need to implement the INeedsMaterialization interface. To determine whether your function is NEVER or SOMETIMES, think about the methods that you will need on the IVs to evaluate your function. The IVs implement the Sesame interfaces for Values. They do so either by answering the methods directly (in the case of inline IVs that do not need materialization) or by delegating to materialized Values from an index. For example, an xsd:int can be represented compactly as an IV without creating an entry in the dictionary index, so xsd:int IVs can answer Sesame's Literal.getLabel() method directly. A string literal cannot be represented compactly as an inline IV in the statement indices; for those we use an IV that acts as a reference to a term in the dictionary index. These IVs cannot answer the getLabel() method without first materializing their Value from the dictionary index. If your function limits itself strictly to the IV API, then it is probably a NEVER function. If your function uses the Sesame API, then it is probably a SOMETIMES function. ALWAYS is a legacy mode that will eventually be pruned from the codebase.

Blazegraph evaluates SOMETIMES functions on a solution before materializing the RDF Value from the IV. If the evaluation succeeds, then the solution is routed around the materialization step in the data flow and all is good. If it fails, it will throw a NotMaterializedException. That exception is caught by the query plan, which then routes the solution through the RDF Value materialization step and then re-evaluates the solution against the value expression.
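
The try-first, materialize-on-demand flow for SOMETIMES functions can be sketched with a toy model. Everything below is a hypothetical stand-in invented for illustration (Term for a non-inline IV, a Map for the dictionary index, NotMaterialized for the exception); the real query plan routes whole chunks of solutions through a vectored materialization step rather than doing single lookups.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: hypothetical stand-ins, not Blazegraph classes. */
final class MaterializationSketch {

    /** Stand-in for the "not materialized" exception. */
    static final class NotMaterialized extends RuntimeException {
    }

    /** Stand-in for a non-inline IV: an id plus an optional cached label. */
    static final class Term {
        final long id;
        String cachedLabel; // null until materialized from the dictionary

        Term(final long id) {
            this.id = id;
        }
    }

    /** A SOMETIMES-style function: it needs the label and throws without it. */
    static int labelLength(final Term term) {
        if (term.cachedLabel == null) {
            throw new NotMaterialized();
        }
        return term.cachedLabel.length();
    }

    /**
     * Stand-in for the query plan: try first, materialize only on failure,
     * then re-evaluate. reads[0] counts the dictionary lookups performed.
     */
    static int evaluate(final Term term, final Map<Long, String> dictionary,
            final int[] reads) {
        try {
            return labelLength(term); // fast path: no index read needed
        } catch (NotMaterialized e) {
            term.cachedLabel = dictionary.get(term.id); // materialization step
            reads[0]++;
            return labelLength(term); // re-evaluate against the same term
        }
    }

    public static void main(final String[] args) {
        final Map<Long, String> dictionary = new HashMap<>();
        dictionary.put(7L, "hello");
        final Term term = new Term(7L);
        final int[] reads = new int[1];
        System.out.println(evaluate(term, dictionary, reads)); // prints 5
        System.out.println(reads[0]); // prints 1
    }
}
```

The function itself never catches the exception; it only states what it needs, and the caller decides when paying for the index read is unavoidable.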

If you need the Sesame API methods, you can turn an IV into an RDF Value using the following methods. They will throw an exception if the IV needs materialization, is unbound, or is of the wrong type. Except for some highly unusual functions, those exceptions should NOT be caught within your function bop. They will be handled automatically by bigdata.

    /**
     * Return the {@link Value} for the {@link IV}.
     * 
     * @param iv
     *            The {@link IV}.
     * 
     * @return The {@link Value}.
     * 
     * @throws SparqlTypeErrorException
     *             if the argument is <code>null</code>.
     * @throws NotMaterializedException
     *             if the {@link IVCache} is not set and the {@link IV} can not
     *             be turned into a {@link Value} without an index read.
     */
    @SuppressWarnings("rawtypes")
    final static public Value asValue(final IV iv)

    /**
     * Return the {@link Literal} for the {@link IV}.
     * 
     * @param iv
     *            The {@link IV}.
     * 
     * @return The {@link Literal}.
     * 
     * @throws SparqlTypeErrorException
     *             if the argument is <code>null</code>.
     * @throws SparqlTypeErrorException
     *             if the argument does not represent a {@link Literal}.
     * @throws NotMaterializedException
     *             if the {@link IVCache} is not set and the {@link IV} can not
     *             be turned into a {@link Literal} without an index read.
     */
    protected Literal asLiteral(final IV iv)

Example Function: SecurityFilter

Here is a short example of a function that checks solutions against an internal security validator. This type of function is useful if you want to limit the visibility of results based on the current user's credentials. A filter function will always evaluate to TRUE (keep the solution) or FALSE (drop the solution). If you are writing a simple boolean filter, you can extend XSDBooleanIVValueExpression.

    public class SecurityFilter extends XSDBooleanIVValueExpression
            implements INeedsMaterialization
    {

        /**
         * Required deep copy constructor.
         * 
         * @param op
         */
        public SecurityFilter(final SecurityFilter op) {
            super(op);
        }

        /**
         * Required shallow copy constructor.
         * 
         * @param args
         *            The function arguments.
         * @param anns
         *            The function annotations.
         */
        public SecurityFilter(final BOp[] args, final Map<String, Object> anns) {
            super(args, anns);
        }

        /**
         * The function needs two pieces of information to operate: the user to
         * check against (argument 0) and the document to check (argument 1).
         */
        public SecurityFilter(
                final IVariable<? extends IV> user,
                final IVariable<? extends IV> document,
                final GlobalAnnotations globals) {

            // Note: "super" may not be referenced inside an explicit this(...) call.
            this(new BOp[] { user, document }, anns(globals));

        }

        @Override
        protected boolean accept(final IBindingSet bset) {

            // get the bound term for the ?user var
            final Value user = asValue(getAndCheckBound(0, bset));

            // get the bound term for the ?document var
            final Value document = asValue(getAndCheckBound(1, bset));

            return GlobalSecurityValidator.validate(user, document);

        }

        @Override
        public Requirement getRequirement() {
            
            return Requirement.SOMETIMES;
            
        }

    }

Registering a Custom Function

Once you have written your custom function, you need to register it before it will be available for SPARQL evaluation. This is done using the bigdata FunctionRegistry.

Implement a Function Factory

You will need to implement a Factory, which creates instances of your function from value expression arguments. There are alternative constructor patterns for custom functions depending on how many arguments they require (or can accept for functions with variable numbers of arguments).

final FunctionRegistry.Factory securityFactory = new FunctionRegistry.Factory() {

    @Override
    public IValueExpression<? extends IV> create(
            GlobalAnnotations globals,
            Map<String, Object> scalarValues,
            ValueExpressionNode... args) {
      
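      // Validate your argument(s): this function takes exactly two.
      FunctionRegistry.checkArgs(args, ValueExpressionNode.class, ValueExpressionNode.class);

      // Turn them into physical (executable) bops. SecurityFilter expects
      // variables, so the casts assume that both arguments are variables.
      // The exact toVE signature may differ between Blazegraph versions.
      final IVariable<? extends IV> user = (IVariable<? extends IV>) AST2BOpUtility.toVE(globals, args[0]);
      final IVariable<? extends IV> document = (IVariable<? extends IV>) AST2BOpUtility.toVE(globals, args[1]);
      
      // Return your custom function.
      return new SecurityFilter(user, document, globals);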

    }

};

Register your custom function in an Embedded Blazegraph Instance

If you are using the Blazegraph jar as a dependency, you must register your custom function on each program run before it can be used in queries:

URI myFunctionURI = new URIImpl("http://www.example.com/validate");

FunctionRegistry.add(myFunctionURI, securityFactory);
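
One way to picture why registration has to be repeated on every run is to treat the registry as plain in-memory state that maps a function URI to a factory. The toy model below is a sketch under that assumption; RegistrySketch, Fn, Factory, add and resolve are hypothetical names and do not reproduce the FunctionRegistry implementation. Rejecting duplicate registrations is a choice made for this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model only: hypothetical stand-ins, not Blazegraph classes. */
final class RegistrySketch {

    /** Stand-in for an executable function. */
    interface Fn {
        String call(String... args);
    }

    /** Stand-in for a factory: builds a function instance on demand. */
    interface Factory {
        Fn create();
    }

    /** In-memory only: the contents are lost when the JVM exits. */
    private static final Map<String, Factory> FACTORIES = new ConcurrentHashMap<>();

    /** Registers a factory under a function URI; duplicates are rejected. */
    static void add(final String functionUri, final Factory factory) {
        if (FACTORIES.putIfAbsent(functionUri, factory) != null) {
            throw new IllegalStateException("already registered: " + functionUri);
        }
    }

    /** What a query parser would do when it meets a function URI. */
    static Fn resolve(final String functionUri) {
        final Factory factory = FACTORIES.get(functionUri);
        if (factory == null) {
            throw new IllegalArgumentException("unknown function: " + functionUri);
        }
        return factory.create();
    }

    public static void main(final String[] args) {
        add("http://www.example.com/upper", () -> a -> a[0].toUpperCase());
        System.out.println(resolve("http://www.example.com/upper").call("john")); // prints JOHN
    }
}
```

A query that mentions an unregistered URI fails at resolve time, which is why registration must happen before the first query is parsed.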

Register your custom function in the Blazegraph NanoSparqlServer

If you are running the NanoSparqlServer as a standalone process, the custom function must likewise be registered on each run, and its classes must be on the server's class path. To do this, extend BigdataRDFServletContextListener and override its contextInitialized method to register the function. Then update web.xml so that its listener entry points at your subclass.

public class MyBigdataRDFServletContextListener extends BigdataRDFServletContextListener {

      @Override
      public void contextInitialized(final ServletContextEvent e) {

              super.contextInitialized(e);

              // securityFactory is the FunctionRegistry.Factory defined above.
              final URI myFunctionURI = new URIImpl("http://www.example.com/validate");

              FunctionRegistry.add(myFunctionURI, securityFactory);

      }

}

web.xml update

  <listener>
   <listener-class>MyBigdataRDFServletContextListener</listener-class>
  </listener>

It is also possible to do this with the executable jar by using the Jetty override descriptor (the jetty.overrideDescriptor system property).

java -cp /path/to/my/classes -server -Xmx4g -Djetty.overrideDescriptor=/path/to/my/updated/web.xml -jar bigdata-bundled.jar

Use it in SPARQL queries

You can use registered custom functions in SPARQL queries. Here is how you would retrieve the list of documents visible to the user "John".

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://www.example.com/>
SELECT ?doc
WHERE {
  VALUES ?user { ex:John }
  ?doc rdf:type ex:Document .
  FILTER(ex:validate(?user, ?doc))
}