RDF GAS API

From Blazegraph
Jump to: navigation, search

Status

This feature was first available in releases starting with 1.3.1.

Think like a vertex

The "think like a vertex" abstraction was popularized by the Google paper on the Pregel architecture (2010). This was followed closely by the original GraphLab (2010) and Signal/Collect (2010) publications. These systems all use a vertex-centric API that makes it relatively easy to write a wide variety of graph traversal, graph mining, and similar classes of algorithms.

Concepts

This section lays out the relationship between RDF Values (URIs, Literals, and blank nodes) and the concepts of vertices, vertex property sets, links, and link property sets.

Parameter Definition
Vertex A URI appearing in the Subject or Object position of an RDF Statement. If a vertex has out-edges, then it will appear in the Subject position of at least one Statement. If a vertex has in-edges, then it will appear in the Object position of at least one statement.
Link An RDF Statement connecting two vertices.
Vertex property set The set of RDF Statements associating a simple property value (an RDF Literal) with a vertex.
Link property set The set of RDF Statements associating a simple property value (an RDF Literal) with a link. Link attributes are a kind of statement about statements where the object of the meta-statement is an RDF Literal. Statements about Statements are modeled in RDF as reified statement models. There is a lot of bad press about link attributes in RDF, however bigdata uses Reification Done Right (RDR) to model, index, and query link attributes efficiently. It does NOT store 5 statements to model a link attribute, just one.
Frontier The set of vertices on which the graph algorithm operates in a given iteration. This is often used loosely to refer to the set of vertices that are scheduled for the next iteration, that is, that have been added to the frontier at t+1.
Iteration GAS graph algorithms are iterative. Each iteration executes the Gather, Apply, and Scatter phases of the GAS algorithm. Note that many GAS algorithms do not use each of these GAS phases. There is no overhead for GAS phases that are not executed.
Directed graph RDF statements are directed. For example :tom :loves :mary. GAS algorithms are written for directed graph semantics. If you need to execute the algorithm with undirected graph semantics, you simply specify traversalDirection:=Undirected when you run the algorithm. An algorithm that would have operated on only the in-edges or out-edges during its Gather or Scatter phase will now operate on all-edges. You can also specify traversalDirection:=Reverse to interpret the edges of the graph as directed, but use a reverse direction for the traversal.
Predecessor Some graph algorithms define the concept of a predecessor. The predecessor p of a vertex v is the vertex that was the used to schedule v into the frontier. For some algorithms, a vertex can only be scheduled once into the frontier (this is true of BFS). For other algorithms, it is possible to identify new paths that would cause the vertex v to be rescheduled into the frontier (this is true of SSSP and occurs when the new path has a shorter distance than the previously discovered path).

Gather Apply Scatter

The Gather Apply Scatter (GAS) model was developed in a series of papers on the GraphLab and GraphChi platforms (GraphLab (2010), Distributed GraphLab (2012), [https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf GraphChi (2012)). Part of the evolution of this API has been learning about the constraints that allow for scalable, multi-machine, and (with MapGraph), massively parallel graph algorithms.

Bigdata implements a version of the GAS abstraction that is designed to work efficiently with RDF, including efficient link attributes. This implementation is exposed to SPARQL end points as a SERVICE. See below for examples on how to use this GAS SERVICE.

The bigdata GAS implementation is efficient for the single machine and HA replication cluster deployment models. The bigdata scale-out architecture uses a 1D-partitioning scheme. Scale-out performance on graph traversal algorithms requires a 2D-partitioning scheme in order to minimize the communications volume (any other approach communicates with N*N nodes for each Gather and Scatter operation, a 2D layout communicates with only N nodes). We plan to support a 2D edge partitioning scheme for the bigdata scale-out architecture and integrate that 2D layout MapGraph.

2D-edge-partitioning1.png

Graph traversal algorithms are bandwidth limited. If you are running against disk, SSD is very important and bigdata can achieve close to 1 million traversed edges per second on a MacBook Air. This is substantially faster than any other graph database platform that we have tested. This is because the GAS abstraction runs on the database server, rather than doing round trips between the server and the client as required by APIs such as blueprints. The visited vertex set state is maintained in the Java managed object heap. This places a practical constraint on the size of the graphs that you can run. You can always give Java a lot of memory if you are trying operations on large graphs. However, algorithms like Page Rank ALWAYS begin by visiting all of the vertices in the graph multiple times. This sort of access pattern will both slam the disk, and drive your heap heavily. High performance on graph traversal is limited by bandwidth. The best performance is obtained on the hardware with the highest possible bandwidth - the GPU (MapGraph).

The GAS abstraction breaks down the algorithm into three core methods:

Gather

The Gather phase collects information from the 1-hop neighborhood of a vertex using a generalized binary operator to perform a parallel reduction over the 1-hop edges and vertices in that 1-hop neighborhood. Typical implementations of that binary operator include sum, min, and max. Union is sometimes used, but this can lead to scaling problems when the vertex neighborhood is large since the intermediate state is now O(nedges).

In the image below, the green vertices are the active vertices (the vertices in the current frontier). The little (i) icons represent messages as the information from the 1-hop neighborhood is aggregated in a parallel reduction over that neighborhood. The numbers inside of those green vertices represent the current state of that vertex (not the vertex identifier). After the Apply phase, the state of the active vertices may have changed. You can see this in the image for the Scatter phase below.

GAS graph Gather1.jpg

Apply

The Apply phase integrates the information from the Gather phase, updating the state of the vertex. In the picture, the green vertices are the active vertices. Their state may change as a result of the apply operation.

GAS graph Apply1.jpg

Scatter

The Scatter phase redistributes information to the 1-hop neighborhood of a vertex. Again the green vertices are the active vertices. You can see that their state has changed. This change occurred during the Apply phase (above). The little (i) icons show the movement of messages along the edges leading to the 1-hop neighbors of the active vertices.

In some versions of this abstraction, the scatter phase can also propagate updates to remote vertices. We call this a push-style scatter. The use of a push-style scatter can eliminate the gather phase for algorithms such as SSSP, doubling throughput by reducing the number of traversed edges by half.

GAS graph Scatter1.jpg

MapGraph

For outrageous speed, MapGraph provides the world's fastest implementation of the GAS abstraction, with over 3 billion Traversed Edges Per Second (TEPS) on a single GPU. The bottleneck for graph traversal is memory bandwidth and GPUs have 10 times more memory bandwidth than CPUs. However, writing good CUDA kernels is hard. This is where MapGraph comes in. The MapGraph library delivers an ultra high performance on parallel graph algorithms using the GAS API. All of the really difficult bits required to load balance the GPU are handled by the MapGraph library, freeing you to write your algorithms using a simple high-level abstraction.

Running GAS Programs

Bigdata exposes a SPARQL enabled com.bigdata.rdf.graph.impl.bd.GASService that makes it trivial to run GAS algorithms against RDF data.

Parameter Definition
gas:program This is the subject for the magic predicate assertions that are used to pass parameters into (and out of) the GASService.
gas:gasClass This magic predicate is used to specify the Java IGASProgram implementation to execute (required).
gas:in This magic predicate is used to specify the vertex (or vertices) in the initial frontier. This predicate may appear more than once. It is only required for algorithms that do not put all vertices into the initial frontier. For example, for BFS and SSSP you must explicitly specify which vertices are in the initial frontier. However, algorithms like CC and PR put all vertices into the initial frontier (this is done automatically - the algorithms declare that their initial frontier includes all vertices so you do not have to do anything).
gas:out, gas:out1, etc. These magic predicates are used to bind values discovered by the GAS algorithm onto the solution sets. The resulting solutions and variable bindings become the output of the GAS SERVICE call. gas:out is always the visited vertex identifier. The magic predicates gas:out1, gas:out2, etc. are used to bind other variables in an algorithm specific fashion. See the documentation for each algorithm to understand what variables it can bind.
gas:traversalDirection This magic predicate may be used to indicate whether the graph traversal will interpret edges as forward, reversed, or undirected. The legal values are Forward (the default), Reverse, and Undirected.
gas:nthreads This magic predicate may be used to specify how many threads will be used to execute the Gather, Apply, and Scatter phases in parallel (optional).
gas:maxIterations This magic predicate may be used to specify the maximum number of iterations before the GAS algorithm terminates (default is Integer.MAX_VALUE).
gas:maxVisited This magic predicate may be used to specify the maximum number of visited vertices before the GAS algorithm terminates (default is Integer.MAX_VALUE).
gas:linkType This magic predicate may be used to constrain the GAS algorithm to only visit edges that have the specified URI as their predicate.
gas:linkAttrType This magic predicate may be used to constrain the GAS algorithm to only visit link attributes that have the specified URI as their link attribute type.

GAS Examples

Some examples on how to use the GASService are given below.

BFS Histogram

The following query will show you a histogram giving the #of vertices at each traversal depth up to 4. The ?out variables are interpreted differently by different algorithms, but all of them report ?out as the vertices visited by the algorithm. For BFS, ?out1 is the traversal depth. This is documented on the different IGASProgram classes. See the GASService.Options interface for a complete list of the magic predicates that you can use with this service.

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?depth (count(?out) as ?cnt) {
  SERVICE gas:service {
     gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
     gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier.
     gas:program gas:out ?out . # exactly once - will be bound to the visited vertices.
     gas:program gas:out1 ?depth . # exactly once - will be bound to the depth of the visited vertices.
     gas:program gas:maxIterations 4 . # optional limit on breadth first expansion.
     gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices.
  }
} 
group by ?depth
order by ?depth

BFS with Predecessor

This SERVICE call will extract the predecessor and join against the graph to identify the different ways in which the predecessor and the target vertex (out) are connected in the graph. Note that the predecessor for BFS is just the first vertex to discover a given target vertex during the breadth first expansion. See the query below for how to get out all edges for the visited vertices, not just those between a specific predecessor and target vertex.

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?depth ?predecessor ?linkType ?out {
  SERVICE gas:service {
     gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
     gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier.
     gas:program gas:out ?out . # exactly once - will be bound to the visited vertices.
     gas:program gas:out1 ?depth . # exactly once - will be bound to the depth of the visited vertices.
     gas:program gas:out2 ?predecessor . # exactly once - will be bound to the predecessor.
     gas:program gas:maxIterations 4 . # optional limit on breadth first expansion.
     gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices.
  }
  ?predecessor ?linkType ?out . # figure out what link type(s) connect a vertex with a predecessor
} 
limit 100

BFS with target vertices

Sometimes you are only interested in specific target vertices. Algorithms that define the concept of a predecessor allow you to eliminate visited vertices that are not along the path to one or more target vertices of interest. This does not reduce work performed by the algorithm, but it can be used to reduce the data reported by the algorithm substantially, and that can reduce work for subsequent stages of processing.

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?depth ?predecessor ?linkType ?out {
  SERVICE gas:service {
     gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
     gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier.
     gas:program gas:target <ip:/112.174.135.230> . # only retain vertices along paths to these target vertices.
     gas:program gas:target <ip:/112.174.164.226> . # only retain vertices along paths to these target vertices.
     gas:program gas:target <ip:/112.174.221.150> . # only retain vertices along paths to these target vertices.
     gas:program gas:out ?out . # exactly once - will be bound to the visited vertices.
     gas:program gas:out1 ?depth . # exactly once - will be bound to the depth of the visited vertices.
     gas:program gas:out2 ?predecessor . # exactly once - will be bound to the predecessor.
     gas:program gas:maxIterations 4 . # optional limit on breadth first expansion.
     gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices.
  }
  ?predecessor ?linkType ?out . # figure out what link type(s) connect a vertex with a predecessor
} 
order by desc(?depth)
limit 100

BFS with extracted subgraph

You often want to recover the subgraph containing the vertices visited by the GAS algorithm. Below are some examples of queries that will help you to do this. Note that you can often optimize such queries by only recovering the data that you actually need rather than joining in all vertex attributes, all link types, and all link attributes for all link types.

This query extracts all links and all vertex attributes for any vertex visited by the BFS traversal.

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?depth ?out ?p ?o {
  SERVICE gas:service {
     gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
     gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier.
     gas:program gas:out ?out . # exactly once - will be bound to the visited vertices.
     gas:program gas:out1 ?depth . # exactly once - will be bound to the depth of the visited vertices.
     gas:program gas:out2 ?predecessor . # exactly once - will be bound to the predecessor.
     gas:program gas:maxIterations 4 . # optional limit on breadth first expansion.
     gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices.
  }
  ?out ?p ?o . # extract all links and vertex attributes for the visited vertices.
} 
limit 100

Note: This query does not extract the link attributes. To do that, you need to add an OPTIONAL join:

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?depth ?out ?p ?o ?linkType ?linkValue {
  SERVICE gas:service {
     gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
     gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier.
     gas:program gas:out ?out . # exactly once - will be bound to the visited vertices.
     gas:program gas:out1 ?depth . # exactly once - will be bound to the depth of the visited vertices.
     gas:program gas:out2 ?predecessor . # exactly once - will be bound to the predecessor.
     gas:program gas:maxIterations 4 . # optional limit on breadth first expansion.
     gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices.
  }
  ?out ?p ?o . # extract all edges and vertex attributes for the visited vertices.
  OPTIONAL {<<?out ?p ?o>> ?linkType ?linkValue} # extract the link attributes as well.
} 
limit 100

UNION of two BFS traversals

The following query will execute two n-hop BFS traversals, each with a different starting point but having the same set of target vertices. The visited vertices for both traversals are extracted and reported along with their traversal depth. Since there are two traversals, it is possible that the same target vertex may be discovered by both traversals. When this occurs, there will be two solutions reported for that target vertex and each will be associated with the depth at which that target vertex was encountered during the corresponding BFS traversal.

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?out ?depth {
  {SERVICE gas:service {
     gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
     gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier.
     gas:program gas:target <ip:/112.174.135.230> . # only retain vertices along paths to these target vertices.
     gas:program gas:target <ip:/112.174.164.226> .
     gas:program gas:target <ip:/211.229.203.138> .
     gas:program gas:out ?out . # exactly once - will be bound to the visited vertices.
     gas:program gas:out1 ?depth . # exactly once - will be bound to the depth of the visited vertices.
     gas:program gas:out2 ?predecessor . # exactly once - will be bound to the predecessor.
     gas:program gas:maxIterations 4 . # optional limit on breadth first expansion.
     gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices.
  }} UNION {SERVICE gas:service {
     gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
     gas:program gas:in <ip:/61.76.33.170> . # one or more times, specifies the initial frontier.
     gas:program gas:target <ip:/112.174.135.230> . # only retain vertices along paths to these target vertices.
     gas:program gas:target <ip:/112.174.164.226> .
     gas:program gas:target <ip:/112.174.221.150> .
     gas:program gas:out ?out . # exactly once - will be bound to the visited vertices.
     gas:program gas:out1 ?depth . # exactly once - will be bound to the depth of the visited vertices.
     gas:program gas:out2 ?predecessor . # exactly once - will be bound to the predecessor.
     gas:program gas:maxIterations 4 . # optional limit on breadth first expansion.
     gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices.
  }}
} 
order by ?out ?depth
limit 100

GAS Algorithm Library

The following GAS algorithms are bundled with bigdata. It is straightforward to implement new algorithms by following a simple pattern. You need create a concrete class that implements the com.bigdata.rdf.graph.IGASProgram interface and fill in the behaviors for the Gather, Apply, and Scatter phases. If the algorithm uses a Gather phase, you need to specify the binary operator used to combine the information over the 1-hop neighborhood during a gather. Each algorithm also needs to specify a factory for the vertex state associated with the visited vertices. If the algorithm needs to maintain state for the visited edges, then it must specify a factory for that edge state as well. Each algorithm also needs to implement an interface that allows its vertex state to become bound onto variables when GASService executes this.

See the implementations linked below for concrete examples.

BFS

com.bigdata.rdf.graph.analytics.BFS
Parameter Definition
out The vertex.
out1 The depth of that vertex in hops from the initial frontier.
out2 The predecessor, which is the first vertex to discover a given vertex during the iterative expansion of the frontier.

SSSP

com.bigdata.rdf.graph.analytics.SSSP

Single Source Shortest Path. This algorithm computes the shortest path to each vertex in the graph.

Parameter Definition
out The vertex.
out1 The minimum distance between the initial frontier and this vertex.

CC

com.bigdata.rdf.graph.analytics.CC

Connected Components - this algorithm identifies and labels the distinct connected components in the graph. A connected component is a subgraph whose vertices are connected, but is not connected with any vertex not in the subgraph.

Parameter Definition
out The vertex.
out1 The label associated with the connected component. The label is a vertex identifier and can be used to jump into the subgraph.

PageRank

com.bigdata.rdf.graph.analytics.PR

Page Rank computes the relative importance of each vertex in the graph based on the number of links between the vertices in the graph.

parameter definition
out The vertex.
out1 The computed page rank for the vertex.

PageRank Example

The SPARQL Query below will execute a PageRank on a general RDF knowledge base.

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?node ?rank {
  SERVICE gas:service {
     gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.PR" .
     gas:program gas:out ?node . # exactly once - will be bound to the visited vertices.
     gas:program gas:out1 ?rank . # Computed PageRank value for the node
  }
  FILTER (?rank<100)
} ORDER BY DESC(?rank)