Difference between revisions of "RDF GAS API"
(updated documentation 1.5.2) 
Brad Bebee (Talk  contribs) (Updated for PageRank Example) 

Line 348:  Line 348:  
 The computed page rank for the vertex.   The computed page rank for the vertex.  
}  }  
+  
+  === PageRank Example ===  
+  
+  The SPARQL Query below will execute a PageRank on a general RDF knowledge base.  
+  
+  <pre>  
+  PREFIX gas: <http://www.bigdata.com/rdf/gas#>  
+  SELECT ?node ?rank {  
+  SERVICE gas:service {  
+  gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.PR" .  
+  gas:program gas:out ?node . # exactly once  will be bound to the visited vertices.  
+  gas:program gas:out1 ?rank . # Computed PageRank value for the node  
+  }  
+  FILTER (?rank<100)  
+  } ORDER BY DESC(?rank)  
+  </pre> 
Revision as of 14:28, 2 August 2015
Contents
Status
This feature was first available in releases starting with 1.3.1.
Think like a vertex
The "think like a vertex" abstraction was popularized by the Google paper on the Pregel architecture (2010). This was followed closely by the original GraphLab (2010) and Signal/Collect (2010) publications. These systems all use a vertexcentric API that makes it relatively easy to write a wide variety of graph traversal, graph mining, and similar classes of algorithms.
Concepts
This section lays out the relationship between RDF Values (URIs, Literals, and blank nodes) and the concepts of vertices, vertex property sets, links, and link property sets.
Parameter  Definition 

Vertex  A URI appearing in the Subject or Object position of an RDF Statement. If a vertex has outedges, then it will appear in the Subject position of at least one Statement. If a vertex has inedges, then it will appear in the Object position of at least one statement. 
Link  An RDF Statement connecting two vertices. 
Vertex property set  The set of RDF Statements associating a simple property value (an RDF Literal) with a vertex. 
Link property set  The set of RDF Statements associating a simple property value (an RDF Literal) with a link. Link attributes are a kind of statement about statements where the object of the metastatement is an RDF Literal. Statements about Statements are modeled in RDF as reified statement models. There is a lot of bad press about link attributes in RDF, however bigdata uses Reification Done Right (RDR) to model, index, and query link attributes efficiently. It does NOT store 5 statements to model a link attribute, just one. 
Frontier  The set of vertices on which the graph algorithm operates in a given iteration. This is often used loosely to refer to the set of vertices that are scheduled for the next iteration, that is, that have been added to the frontier at t+1. 
Iteration  GAS graph algorithms are iterative. Each iteration executes the Gather, Apply, and Scatter phases of the GAS algorithm. Note that many GAS algorithms do not use each of these GAS phases. There is no overhead for GAS phases that are not executed. 
Directed graph  RDF statements are directed. For example :tom :loves :mary . GAS algorithms are written for directed graph semantics. If you need to execute the algorithm with undirected graph semantics, you simply specify traversalDirection:=Undirected when you run the algorithm. An algorithm that would have operated on only the inedges or outedges during its Gather or Scatter phase will now operate on alledges. You can also specify traversalDirection:=Reverse to interpret the edges of the graph as directed, but use a reverse direction for the traversal.

Predecessor  Some graph algorithms define the concept of a predecessor. The predecessor p of a vertex v is the vertex that was the used to schedule v into the frontier. For some algorithms, a vertex can only be scheduled once into the frontier (this is true of BFS). For other algorithms, it is possible to identify new paths that would cause the vertex v to be rescheduled into the frontier (this is true of SSSP and occurs when the new path has a shorter distance than the previously discovered path). 
Gather Apply Scatter
The Gather Apply Scatter (GAS) model was developed in a series of papers on the GraphLab and GraphChi platforms (GraphLab (2010), Distributed GraphLab (2012), [https://www.usenix.org/system/files/conference/osdi12/osdi12final126.pdf GraphChi (2012)). Part of the evolution of this API has been learning about the constraints that allow for scalable, multimachine, and (with MapGraph), massively parallel graph algorithms.
Bigdata implements a version of the GAS abstraction that is designed to work efficiently with RDF, including efficient link attributes. This implementation is exposed to SPARQL end points as a SERVICE. See below for examples on how to use this GAS SERVICE.
The bigdata GAS implementation is efficient for the single machine and HA replication cluster deployment models. The bigdata scaleout architecture uses a 1Dpartitioning scheme. Scaleout performance on graph traversal algorithms requires a 2Dpartitioning scheme in order to minimize the communications volume (any other approach communicates with N*N nodes for each Gather and Scatter operation, a 2D layout communicates with only N nodes). We plan to support a 2D edge partitioning scheme for the bigdata scaleout architecture and integrate that 2D layout MapGraph.
Graph traversal algorithms are bandwidth limited. If you are running against disk, SSD is very important and bigdata can achieve close to 1 million traversed edges per second on a MacBook Air. This is substantially faster than any other graph database platform that we have tested. This is because the GAS abstraction runs on the database server, rather than doing round trips between the server and the client as required by APIs such as blueprints. The visited vertex set state is maintained in the Java managed object heap. This places a practical constraint on the size of the graphs that you can run. You can always give Java a lot of memory if you are trying operations on large graphs. However, algorithms like Page Rank ALWAYS begin by visiting all of the vertices in the graph multiple times. This sort of access pattern will both slam the disk, and drive your heap heavily. High performance on graph traversal is limited by bandwidth. The best performance is obtained on the hardware with the highest possible bandwidth  the GPU (MapGraph).
The GAS abstraction breaks down the algorithm into three core methods:
Gather
The Gather phase collects information from the 1hop neighborhood of a vertex using a generalized binary operator to perform a parallel reduction over the 1hop edges and vertices in that 1hop neighborhood. Typical implementations of that binary operator include sum, min, and max. Union is sometimes used, but this can lead to scaling problems when the vertex neighborhood is large since the intermediate state is now O(nedges).
In the image below, the red vertices are the active vertices (the vertices in the current frontier). The little (i) icons represent messages as the information from the 1hop neighborhood is aggregated in a parallel reduction over that neighborhood. The numbers inside of those red vertices represent the current state of that vertex (not the vertex identifier). After the Apply phase, the state of the active vertices may have changed. You can see this in the image for the Scatter phase below.
Apply
The Apply phase integrates the information from the Gather phase, updating the state of the vertex. In the picture, the red vertices are the active vertices. Their state may change as a result of the apply operation.
Scatter
The Scatter phase redistributes information to the 1hop neighborhood of a vertex. Again the red vertices are the active vertices. You can see that their state has changed. This change occurred during the Apply phase (above). The little (i) icons show the movement of messages along the edges leading to the 1hop neighbors of the active vertices.
In some versions of this abstraction, the scatter phase can also propagate updates to remote vertices. We call this a pushstyle scatter. The use of a pushstyle scatter can eliminate the gather phase for algorithms such as SSSP, doubling throughput by reducing the number of traversed edges by half.
MapGraph
For outrageous speed, MapGraph provides the world's fastest implementation of the GAS abstraction, with over 3 billion Traversed Edges Per Second (TEPS) on a single GPU. The bottleneck for graph traversal is memory bandwidth and GPUs have 10 times more memory bandwidth than CPUs. However, writing good CUDA kernels is hard. This is where MapGraph comes in. The MapGraph library delivers an ultra high performance on parallel graph algorithms using the GAS API. All of the really difficult bits required to load balance the GPU are handled by the MapGraph library, freeing you to write your algorithms using a simple highlevel abstraction.
Running GAS Programs
Bigdata exposes a SPARQL enabled com.bigdata.rdf.graph.impl.bd.GASService that makes it trivial to run GAS algorithms against RDF data.
Parameter  Definition 

gas:program  This is the subject for the magic predicate assertions that are used to pass parameters into (and out of) the GASService. 
gas:gasClass  This magic predicate is used to specify the Java IGASProgram implementation to execute (required). 
gas:in  This magic predicate is used to specify the vertex (or vertices) in the initial frontier. This predicate may appear more than once. It is only required for algorithms that do not put all vertices into the initial frontier. For example, for BFS and SSSP you must explicitly specify which vertices are in the initial frontier. However, algorithms like CC and PR put all vertices into the initial frontier (this is done automatically  the algorithms declare that their initial frontier includes all vertices so you do not have to do anything). 
gas:out, gas:out1, etc.  These magic predicates are used to bind values discovered by the GAS algorithm onto the solution sets. The resulting solutions and variable bindings become the output of the GAS SERVICE call. gas:out is always the visited vertex identifier. The magic predicates gas:out1, gas:out2, etc. are used to bind other variables in an algorithm specific fashion. See the documentation for each algorithm to understand what variables it can bind. 
gas:traversalDirection  This magic predicate may be used to indicate whether the graph traversal will interpret edges as forward, reversed, or undirected. The legal values are Forward (the default), Reverse, and Undirected. 
gas:nthreads  This magic predicate may be used to specify how many threads will be used to execute the Gather, Apply, and Scatter phases in parallel (optional). 
gas:maxIterations  This magic predicate may be used to specify the maximum number of iterations before the GAS algorithm terminates (default is Integer.MAX_VALUE). 
gas:maxVisited  This magic predicate may be used to specify the maximum number of visited vertices before the GAS algorithm terminates (default is Integer.MAX_VALUE). 
gas:linkType  This magic predicate may be used to constrain the GAS algorithm to only visit edges that have the specified URI as their predicate. 
gas:linkAttrType  This magic predicate may be used to constrain the GAS algorithm to only visit link attributes that have the specified URI as their link attribute type. 
GAS Examples
Some examples on how to use the GASService are given below.
BFS Histogram
The following query will show you a histogram giving the #of vertices at each traversal depth up to 4. The ?out
variables are interpreted differently by different algorithms, but all of them report ?out
as the vertices visited by the algorithm. For BFS, ?out1
is the traversal depth. This is documented on the different IGASProgram classes. See the GASService.Options interface for a complete list of the magic predicates that you can use with this service.
PREFIX gas: <http://www.bigdata.com/rdf/gas#> SELECT ?depth (count(?out) as ?cnt) { SERVICE gas:service { gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" . gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier. gas:program gas:out ?out . # exactly once  will be bound to the visited vertices. gas:program gas:out1 ?depth . # exactly once  will be bound to the depth of the visited vertices. gas:program gas:maxIterations 4 . # optional limit on breadth first expansion. gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices. } } group by ?depth order by ?depth
BFS with Predecessor
This SERVICE call will extract the predecessor and join against the graph to identify the different ways in which the predecessor and the target vertex (out) are connected in the graph. Note that the predecessor for BFS is just the first vertex to discover a given target vertex during the breadth first expansion. See the query below for how to get out all edges for the visited vertices, not just those between a specific predecessor and target vertex.
PREFIX gas: <http://www.bigdata.com/rdf/gas#> SELECT ?depth ?predecessor ?linkType ?out { SERVICE gas:service { gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" . gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier. gas:program gas:out ?out . # exactly once  will be bound to the visited vertices. gas:program gas:out1 ?depth . # exactly once  will be bound to the depth of the visited vertices. gas:program gas:out2 ?predecessor . # exactly once  will be bound to the predecessor. gas:program gas:maxIterations 4 . # optional limit on breadth first expansion. gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices. } ?predecessor ?linkType ?out . # figure out what link type(s) connect a vertex with a predecessor } limit 100
BFS with target vertices
Sometimes you are only interested in specific target vertices. Algorithms that define the concept of a predecessor allow you to eliminate visited vertices that are not along the path to one or more target vertices of interest. This does not reduce work performed by the algorithm, but it can be used to reduce the data reported by the algorithm substantially, and that can reduce work for subsequent stages of processing.
PREFIX gas: <http://www.bigdata.com/rdf/gas#> SELECT ?depth ?predecessor ?linkType ?out { SERVICE gas:service { gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" . gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier. gas:program gas:target <ip:/112.174.135.230> . # only retain vertices along paths to these target vertices. gas:program gas:target <ip:/112.174.164.226> . # only retain vertices along paths to these target vertices. gas:program gas:target <ip:/112.174.221.150> . # only retain vertices along paths to these target vertices. gas:program gas:out ?out . # exactly once  will be bound to the visited vertices. gas:program gas:out1 ?depth . # exactly once  will be bound to the depth of the visited vertices. gas:program gas:out2 ?predecessor . # exactly once  will be bound to the predecessor. gas:program gas:maxIterations 4 . # optional limit on breadth first expansion. gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices. } ?predecessor ?linkType ?out . # figure out what link type(s) connect a vertex with a predecessor } order by desc(?depth) limit 100
BFS with extracted subgraph
You often want to recover the subgraph containing the vertices visited by the GAS algorithm. Below are some examples of queries that will help you to do this. Note that you can often optimize such queries by only recovering the data that you actually need rather than joining in all vertex attributes, all link types, and all link attributes for all link types.
This query extracts all links and all vertex attributes for any vertex visited by the BFS traversal.
PREFIX gas: <http://www.bigdata.com/rdf/gas#> SELECT ?depth ?out ?p ?o { SERVICE gas:service { gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" . gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier. gas:program gas:out ?out . # exactly once  will be bound to the visited vertices. gas:program gas:out1 ?depth . # exactly once  will be bound to the depth of the visited vertices. gas:program gas:out2 ?predecessor . # exactly once  will be bound to the predecessor. gas:program gas:maxIterations 4 . # optional limit on breadth first expansion. gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices. } ?out ?p ?o . # extract all links and vertex attributes for the visited vertices. } limit 100
Note: This query does not extract the link attributes. To do that, you need to add an OPTIONAL join:
PREFIX gas: <http://www.bigdata.com/rdf/gas#> SELECT ?depth ?out ?p ?o ?linkType ?linkValue { SERVICE gas:service { gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" . gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier. gas:program gas:out ?out . # exactly once  will be bound to the visited vertices. gas:program gas:out1 ?depth . # exactly once  will be bound to the depth of the visited vertices. gas:program gas:out2 ?predecessor . # exactly once  will be bound to the predecessor. gas:program gas:maxIterations 4 . # optional limit on breadth first expansion. gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices. } ?out ?p ?o . # extract all edges and vertex attributes for the visited vertices. OPTIONAL {<<?out ?p ?o>> ?linkType ?linkValue} # extract the link attributes as well. } limit 100
UNION of two BFS traversals
The following query will execute two nhop BFS traversals, each with a different starting point but having the same set of target vertices. The visited vertices for both traversals are extracted and reported along with their traversal depth. Since there are two traversals, it is possible that the same target vertex may be discovered by both traversals. When this occurs, there will be two solutions reported for that target vertex and each will be associated with the depth at which that target vertex was encountered during the corresponding BFS traversal.
PREFIX gas: <http://www.bigdata.com/rdf/gas#> SELECT ?out ?depth { {SERVICE gas:service { gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" . gas:program gas:in <ip:/112.174.24.90> . # one or more times, specifies the initial frontier. gas:program gas:target <ip:/112.174.135.230> . # only retain vertices along paths to these target vertices. gas:program gas:target <ip:/112.174.164.226> . gas:program gas:target <ip:/211.229.203.138> . gas:program gas:out ?out . # exactly once  will be bound to the visited vertices. gas:program gas:out1 ?depth . # exactly once  will be bound to the depth of the visited vertices. gas:program gas:out2 ?predecessor . # exactly once  will be bound to the predecessor. gas:program gas:maxIterations 4 . # optional limit on breadth first expansion. gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices. }} UNION {SERVICE gas:service { gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" . gas:program gas:in <ip:/61.76.33.170> . # one or more times, specifies the initial frontier. gas:program gas:target <ip:/112.174.135.230> . # only retain vertices along paths to these target vertices. gas:program gas:target <ip:/112.174.164.226> . gas:program gas:target <ip:/112.174.221.150> . gas:program gas:out ?out . # exactly once  will be bound to the visited vertices. gas:program gas:out1 ?depth . # exactly once  will be bound to the depth of the visited vertices. gas:program gas:out2 ?predecessor . # exactly once  will be bound to the predecessor. gas:program gas:maxIterations 4 . # optional limit on breadth first expansion. gas:program gas:maxVisited 2000 . # optional limit on the #of visited vertices. }} } order by ?out ?depth limit 100
GAS Algorithm Library
The following GAS algorithms are bundled with bigdata. It is straightforward to implement new algorithms by following a simple pattern. You need create a concrete class that implements the com.bigdata.rdf.graph.IGASProgram interface and fill in the behaviors for the Gather, Apply, and Scatter phases. If the algorithm uses a Gather phase, you need to specify the binary operator used to combine the information over the 1hop neighborhood during a gather. Each algorithm also needs to specify a factory for the vertex state associated with the visited vertices. If the algorithm needs to maintain state for the visited edges, then it must specify a factory for that edge state as well. Each algorithm also needs to implement an interface that allows its vertex state to become bound onto variables when GASService executes this.
See the implementations linked below for concrete examples.
BFS
com.bigdata.rdf.graph.analytics.BFS
Parameter  Definition 

out  The vertex. 
out1  The depth of that vertex in hops from the initial frontier. 
out2  The predecessor, which is the first vertex to discover a given vertex during the iterative expansion of the frontier. 
SSSP
com.bigdata.rdf.graph.analytics.SSSP
Single Source Shortest Path. This algorithm computes the shortest path to each vertex in the graph.
Parameter  Definition 

out  The vertex. 
out1  The minimum distance between the initial frontier and this vertex. 
CC
com.bigdata.rdf.graph.analytics.CC
Connected Components  this algorithm identifies and labels the distinct connected components in the graph. A connected component is a subgraph whose vertices are connected, but is not connected with any vertex not in the subgraph.
Parameter  Definition 

out  The vertex. 
out1  The label associated with the connected component. The label is a vertex identifier and can be used to jump into the subgraph. 
PageRank
com.bigdata.rdf.graph.analytics.PR
Page Rank computes the relative importance of each vertex in the graph based on the number of links between the vertices in the graph.
parameter  definition 

out  The vertex. 
out1  The computed page rank for the vertex. 
PageRank Example
The SPARQL Query below will execute a PageRank on a general RDF knowledge base.
PREFIX gas: <http://www.bigdata.com/rdf/gas#> SELECT ?node ?rank { SERVICE gas:service { gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.PR" . gas:program gas:out ?node . # exactly once  will be bound to the visited vertices. gas:program gas:out1 ?rank . # Computed PageRank value for the node } FILTER (?rank<100) } ORDER BY DESC(?rank)