Nano Sparql Server

From Blazegraph
Revision as of 17:16, 27 May 2016 by Brad Bebee (Servlet Container (Tomcat, Jetty, etc))

NanoSparqlServer provides a lightweight REST API for RDF. It is implemented using the Servlet API. You can run NanoSparqlServer from the command line or embed it within your application using the bundled Jetty dependencies. You can also deploy the REST API servlets into a standard servlet engine.

Deploying NanoSparqlServer

It is not necessary to deploy the Sesame Web Archive (WAR) to run NanoSparqlServer. NanoSparqlServer can be run from the command line (using Jetty), embedded (using Jetty), or deployed in a servlet container such as Tomcat. The easiest way to deploy it is in a servlet container.

Downloading the Executable Jar

Download the latest blazegraph.jar file and run it:

java -server -Xmx4g -jar blazegraph.jar

Alternatively, you can build the blazegraph.jar file yourself. Check out the code and use Maven to generate the jar; see the Installation guide for details.
The following commands generate target/blazegraph-X_Y_Z.jar:

cd blazegraph-jar
mvn package

Run target/blazegraph-X_Y_Z.jar:

java -server -Xmx4g -jar target/blazegraph-X_Y_Z.jar


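Once the server has started, it prints the URL to open. The default port is 9999; the jar shown here serves the application at http://localhost:9999/blazegraph/, while deployments that use the bigdata context path serve it at http://localhost:9999/bigdata/.
For example, if you start it with blazegraph.jar: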

java -server -Xmx4g -jar blazegraph.jar 

...
Welcome to the Blazegraph(tm) Database.

Go to http://localhost:9999/blazegraph/ to get started.

You can specify the properties file to use with -Dbigdata.propertyFile=<path>:

java -server -Xmx4g -Dbigdata.propertyFile=/etc/blazegraph/RWStore.properties -jar blazegraph.jar

Customizing the web.xml

You can override the default web.xml values in the executable jar using the jetty.overrideWebXml property. The file you specify only needs to contain the values that you want to replace. The defaults that ship with blazegraph.jar are in its web.xml.

-Djetty.overrideWebXml=/path/to/override.xml

A full example is below.

java -server -Xmx4g -Djetty.overrideWebXml=/path/to/override.xml -Dbigdata.propertyFile=/etc/blazegraph/RWStore.properties -jar blazegraph.jar

Changing the default port

Blazegraph defaults to port 9999. This may be changed in the executable jar using the jetty.port property.

-Djetty.port=19999

A full example is below.

java -server -Xmx4g -Djetty.port=19999 -jar blazegraph.jar

Command line (using Jetty)

To run the server from the command line (using Jetty), you first need to know how your classpath should be set. The bundleJar target of the top-level build.xml file can be invoked to generate a bundle-<version>.jar file to simplify the classpath definition. Look in the bigdata-perf directories for examples of Ant scripts which do this.

Once you set your classpath you can run the NanoSparqlServer from the command line by executing the class com.bigdata.rdf.sail.webapp.NanoSparqlServer providing the connection port, the namespace and a property file:

java -cp ... -server com.bigdata.rdf.sail.webapp.NanoSparqlServer <port> <namespace> <propertiesFile>

The ... should be your classpath.

The port is the HTTP port on which the server should listen.

The namespace is the namespace of the triple or quads store instance within bigdata to which you want to connect. If no such namespace exists, a default kb instance is created.

The propertiesFile is where you configure bigdata. You can start with RWStore.properties and then edit it to match your requirements. There are a variety of example property files in samples for quads, triples, inference, provenance, and other interesting variations.

Embedded (using Jetty)

The following code example starts a server from code. See StandaloneNanoSparqlServer.java for a full example; it is also the code we use for the executable jar.

            // Use this if you are embedding with the blazegraph.jar file to access the jetty.xml
            // in the jar classpath as a resource.
            String jettyXml = System.getProperty(SystemProperties.JETTY_XML, "jetty.xml");
            System.setProperty("jetty.home", jettyXml.getClass().getResource("/war").toExternalForm());
            
            server = NanoSparqlServer.newInstance(port, indexManager,
                    initParams);

            server.start();

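            // Jetty 9: getLocalPort() is declared on ServerConnector, not on Connector.
            final int actualPort = ((org.eclipse.jetty.server.ServerConnector) server
                    .getConnectors()[0]).getLocalPort();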

            String hostAddr = NicUtil.getIpAddress("default.nic",
                    "default", true/* loopbackOk */);

            if (hostAddr == null) {

                hostAddr = "localhost";

            }

            final String serviceURL = new URL("http", hostAddr, actualPort, ""/* file */)
                    .toExternalForm();
            
            System.out.println("serviceURL: " + serviceURL);

            // Block and wait. The NSS is running.
            server.join();

Servlet Container (Tomcat, Jetty, etc)

Download WAR

Download, install, and configure a servlet container. See the documentation for your servlet container, as they are all different.

Download the latest bigdata.war file. Alternatively, you can build the bigdata.war file:

ant clean bundleJar war

This generates ant-build/bigdata.war.

Drop the WAR into the webapps directory of your servlet container and unpack it.

Build Jetty deployer

Alternatively, you can build a deployer for Jetty. This approach may be used for both highly available (HA) and non-HA deployments. It produces a directory structure that is suitable for installation as a service. The web.xml, jetty.xml, log4j.properties and related files are all located within the generated directory structure. See HAJournalServer for details on the structure and configuration of the generated distribution.

ant stage

Configuration

Note: It is strongly advised that you unpack the WAR before you start it and edit the RWStore.properties and/or the web.xml deployment descriptor. The web.xml file controls the location of the RWStore.properties file. The RWStore.properties file controls the behavior of the bigdata database instance, the location of the database instance on your disk, and the configuration for the default triple and/or quad store instance that will be created when the webapp starts for the first time. Take a moment to review and edit the web.xml and RWStore.properties before you go any further. See GettingStarted if you need help setting up the KB for triples versus quads, enabling inference, etc.

Note: As of r6797 and releases after 1.2.2, you can specify the following property to override the location of the bigdata property file, where FILE is the fully qualified path of the bigdata property file (e.g., RWStore.properties):

-Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=FILE


You should specify JAVA_OPTS with at least the following properties. As a guideline, the maximum Java heap size should be no more than half of the available RAM. Heap sizes of 2G to 8G are recommended to avoid long GC pauses. Larger heaps are possible with the G1 collector (in Java 7).

export JAVA_OPTS="-server -Xmx2g"
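
The heap guideline above is simple arithmetic, sketched below for illustration. The class and method names are hypothetical, the 1G floor is an assumption for very small machines, and the 8G cap reflects the recommended range rather than a hard limit (larger heaps remain possible with G1).

```java
// Hypothetical helper: turns the heap-size guideline into a JAVA_OPTS value.
final class HeapGuideline {

    /** Upper end of the recommended heap range, in gigabytes. */
    static final long MAX_RECOMMENDED_GB = 8;

    /** Half of the available RAM, capped at the recommended maximum. */
    static long suggestedHeapGb(long availableRamGb) {
        long halfOfRam = availableRamGb / 2;
        // Assumption: never suggest less than 1G, even on very small machines.
        return Math.max(1, Math.min(halfOfRam, MAX_RECOMMENDED_GB));
    }

    /** Formats the suggestion in the same shape as the JAVA_OPTS example. */
    static String javaOpts(long availableRamGb) {
        return "-server -Xmx" + suggestedHeapGb(availableRamGb) + "g";
    }

    public static void main(String[] args) {
        System.out.println(javaOpts(4));   // -server -Xmx2g
        System.out.println(javaOpts(64));  // -server -Xmx8g
    }
}
```

A machine with 4G of RAM therefore lands on the -Xmx2g value shown above.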


To support large POST requests (large queries or bulk loading), you need to configure the Jetty maximum form size in a jetty-web.xml:

<Configure class="org.eclipse.jetty.webapp.WebAppContext">
  ...
  <!-- Configure 10M POST size -->
  <Set name="maxFormContentSize">10000000</Set>
  ...
</Configure>

Adding Additional Namespace Declarations

Starting in version 2.0.2, Blazegraph supports additional default namespace prefix declarations. The feature is enabled by an optional Java property that specifies the path to a file containing the prefixes to be initialized by default.

-Dcom.bigdata.rdf.sail.sparql.PrefixDeclProcessor.additionalDeclsFile=/path/to/file

The file is expected to contain one prefix declaration per line, as below.

PREFIX wdref: <http://www.wikidata.org/reference/>
PREFIX wikibase: <http://wikiba.se/ontology#>
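
Because a malformed declarations file is easy to miss, it can be worth checking the file before pointing the server at it. The sketch below is a hypothetical stand-alone checker, not Blazegraph's own parser; it assumes each non-blank line has exactly the shape shown above (upper-case PREFIX keyword, a prefix name, a colon, and an IRI in angle brackets).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker for a prefix declarations file; not Blazegraph's parser.
final class PrefixDeclCheck {

    // One declaration per line: PREFIX name: <IRI>
    private static final Pattern DECL =
            Pattern.compile("PREFIX\\s+([A-Za-z][A-Za-z0-9_-]*):\\s*<([^<>\\s]+)>");

    /** Maps each prefix to its IRI, or throws on the first malformed line. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            Matcher m = DECL.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Not a prefix declaration: " + line);
            }
            prefixes.put(m.group(1), m.group(2));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of(
                "PREFIX wdref: <http://www.wikidata.org/reference/>",
                "PREFIX wikibase: <http://wikiba.se/ontology#>")));
    }
}
```

In practice the lines would come from the file itself, for example via java.nio.file.Files.readAllLines.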

Adding a Jetty Startup Timeout (optional)

You can override the jetty startup timeout with the -Djetty.start.timeout= parameter where the value is the timeout in seconds.

-Djetty.start.timeout=60

Setting up SSL on Jetty (optional)

Generate keys and certificates:

$ keytool -keystore keystore -alias jetty -genkey -keyalg RSA

This command generates a private key and certificate and stores them in the key store, located in the file named keystore.

Configure SslContextFactory (etc/jetty-ssl-context.xml):

<New id="sslContextFactory" class="org.eclipse.jetty.util.ssl.SslContextFactory">
  <Set name="KeyStorePath"><Property name="jetty.home" default="." />/etc/keystore</Set>
  <Set name="KeyStorePassword">123456</Set>
  <Set name="KeyManagerPassword">123456</Set>
  <Set name="TrustStorePath"><Property name="jetty.home" default="." />/etc/keystore</Set>
  <Set name="TrustStorePassword">123456</Set>
</New>

KeyStorePath should point to the keystore file created in the previous step.

The TrustStorePath is used when validating client certificates and is typically set to the same keystore.

KeyStorePassword, KeyManagerPassword, and TrustStorePassword are the passwords specified in the previous step.

Configure the SSL connector and port (etc/jetty-https.xml):

<Call id="sslConnector" name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.ServerConnector">
      <Arg name="server"><Ref refid="Server" /></Arg>
      <Arg name="factories">
        <Array type="org.eclipse.jetty.server.ConnectionFactory">
          <Item>
            <New class="org.eclipse.jetty.server.SslConnectionFactory">
              <Arg name="next">http/1.1</Arg>
              <Arg name="sslContextFactory"><Ref refid="sslContextFactory"/></Arg>
            </New>
          </Item>
          <Item>
            <New class="org.eclipse.jetty.server.HttpConnectionFactory">
              <Arg name="config"><Ref refid="tlsHttpConfig"/></Arg>
            </New>
          </Item>
        </Array>
      </Arg>
      <Set name="host"><Property name="jetty.host" /></Set>
      <Set name="port"><Property name="jetty.ssl.port" default="8443" /></Set>
      <Set name="idleTimeout">30000</Set>
    </New>
  </Arg>
</Call>

For advanced SSL configuration, see the Jetty manual.

Logging

A log4j.properties file is deployed to the WEB-INF/classes directory in the WAR. This will be located automatically during startup. Releases through 1.0.2 will log a warning indicating that the log4j configuration could not be located, but the log4j.properties file is still in effect.

By default, the log4j.properties file logs to the ConsoleAppender. You can edit the log4j.properties file to specify a different appender, e.g., a FileAppender and log file.

You can override the log4j.properties file with your own version by passing a Java property at the command line:

-Dlog4j.configuration=file:/opt/blazegraph/my-log4j.properties

Common Startup Problems

The default web.xml and RWStore.properties files use path names which are relative to the directory in which you start the servlet engine. To use the defaults for those files with Tomcat you must start Tomcat from the 'bin' directory. For example:

cd bin
./startup.sh

If you have any problems getting the bigdata WAR to start, please consult the servlet log files for detailed information which can help you to locate a configuration error. For Tomcat6 on Ubuntu 10.04 the servlet log is called /var/lib/tomcat6/logs/catalina.out. It may have another name or location in another environment. If you see a permissions error on attempting to open the file rules.log, then your servlet engine may have been started from the wrong directory.

If you cannot start Tomcat from the 'bin' directory as described above, then you can instead change bigdata file paths from relative to absolute:

  1. In webapps/bigdata/WEB-INF/RWStore.properties, change this line to use an absolute path:
    com.bigdata.journal.AbstractJournal.file=bigdata.jnl
  2. In webapps/bigdata/WEB-INF/classes/log4j.properties, change these three lines to use absolute paths:
    1. log4j.appender.ruleLog.File=rules.log
    2. log4j.appender.queryLog.File=queryLog.csv
    3. log4j.appender.queryRunStateLog.File=queryRunState.log
  3. In webapps/bigdata/WEB-INF/web.xml, change this line to use an absolute path:
    <param-value>../bigdata/RWStore.properties</param-value>
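
To see why the start directory matters, the following sketch resolves a configured path against a working directory, roughly as a JVM does for relative file names. The class name and the /opt/tomcat locations are hypothetical and only illustrate the effect on a POSIX file system.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrates how relative configuration paths depend on the start directory.
final class RelativePathDemo {

    /** Resolves a configured path against the directory the server was started from. */
    static String resolve(String workingDir, String configuredPath) {
        Path configured = Paths.get(configuredPath);
        if (configured.isAbsolute()) {
            // Absolute paths are unaffected by the start directory.
            return configured.normalize().toString();
        }
        return Paths.get(workingDir).resolve(configured).normalize().toString();
    }

    public static void main(String[] args) {
        // Hypothetical Tomcat locations, for illustration only.
        System.out.println(resolve("/opt/tomcat/bin", "bigdata.jnl"));
        System.out.println(resolve("/opt/tomcat", "bigdata.jnl"));
        System.out.println(resolve("/opt/tomcat", "/var/lib/bigdata/bigdata.jnl"));
    }
}
```

The first two calls point at different journal files even though the configuration is identical, while the absolute path in the third call is the same wherever the server starts.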

Active URLs

When deployed normally, the following URLs should be active (make sure you use the correct port number for your servlet engine):

  1. http://localhost:8080/bigdata - help page / console. (This is also called the serviceURL.)
  2. http://localhost:8080/bigdata/sparql - REST API (This is also called the SparqlEndpoint and uses the default namespace.)
  3. http://localhost:8080/bigdata/status - Status page
  4. http://localhost:8080/bigdata/counters - Performance counters

For example, you can select a single statement from the database using:

http://localhost:8080/bigdata/sparql?query=select * where { ?s ?p ?o } limit 1

This will be an empty result set for a new quad store.

URL encoded, this would be:

http://localhost:8080/bigdata/sparql?query=select%20*%20where%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20limit%201
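
Encoding by hand is error-prone, so a client would normally build the URL in code. A minimal sketch using only the JDK (Java 10 or later) follows; the class and method names are hypothetical. It percent-encodes every reserved character in the query, including the braces and question marks, and writes spaces as %20 rather than the '+' that form encoding produces.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds a GET URL for the SPARQL end point.
final class SparqlGetUrl {

    /** Percent-encodes a SPARQL query for use as a URL query parameter value. */
    static String encodeQuery(String sparql) {
        // URLEncoder emits '+' for spaces; literal '+' characters become %2B,
        // so replacing the remaining '+' with %20 is safe.
        return URLEncoder.encode(sparql, StandardCharsets.UTF_8).replace("+", "%20");
    }

    /** Appends the encoded query to the end point URL. */
    static String queryUrl(String endpoint, String sparql) {
        return endpoint + "?query=" + encodeQuery(sparql);
    }

    public static void main(String[] args) {
        System.out.println(queryUrl("http://localhost:8080/bigdata/sparql",
                "select * where { ?s ?p ?o } limit 1"));
    }
}
```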

web.xml

The following context-param entries are defined. Also see HAJournalServer and HALoadBalancer.

Name | Default | Definition | Since
propertyFile | WEB-INF/RWStore.properties | The property file (for a standalone database instance) or the jini configuration file (for a federation). The file MUST end with either ".properties" or ".config". This path is relative to the directory from which you start the servlet container, so you may have to edit it for your installation, e.g., by specifying an absolute path. Also, it is a good idea to review the RWStore.properties file and specify the location of the database file on which it will persist your data. Note: You MAY override this parameter using "-Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=FILE" when starting the servlet container. |
namespace | kb | The default bigdata namespace of the triple or quad store instance to be exposed. |
create | true | When true, a new triple or quads store instance will be created if none is found at that namespace. |
queryThreadPoolSize | 16 | The size of the thread pool used to service SPARQL queries -OR- ZERO (0) for an unbounded thread pool (which is not recommended). |
readOnly | false | When true, the REST API will not permit mutation operations. |
queryTimeout | 0 | When non-zero, the timeout for queries (milliseconds). |
warmupTimeout | 0 | When non-zero, the timeout for the warm-up period (milliseconds). The warm-up period pulls in the non-leaf index pages and reduces the impact of sudden heavy query workloads on the disk and on GC. The end points are not available during the warm-up period. | 1.5.2
warmupNamespaceList | | A list of the namespaces to be exercised during the warm-up period (optional). When the list is empty, all namespaces will be warmed up. | 1.5.2
warmupThreadPoolSize | 20 | The number of parallel threads to use for the warm-up period. At most one thread will be used per index. | 1.5.2
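
When the server is embedded, settings like these are passed as a string map (the initParams argument in the embedded example above). The sketch below assumes that the map keys are the same names as the context-param entries in the table; the class itself is a hypothetical helper and omits the warm-up parameters for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that mirrors the context-param table as a string map.
final class ContextParamDefaults {

    /** Defaults copied from the table above. */
    static Map<String, String> defaults() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("propertyFile", "WEB-INF/RWStore.properties");
        params.put("namespace", "kb");
        params.put("create", "true");
        params.put("queryThreadPoolSize", "16");
        params.put("readOnly", "false");
        params.put("queryTimeout", "0");
        return params;
    }

    /** Returns the defaults with the given overrides applied on top. */
    static Map<String, String> withOverrides(Map<String, String> overrides) {
        Map<String, String> params = defaults();
        params.putAll(overrides);
        return params;
    }

    public static void main(String[] args) {
        // A read-only end point with a 60 second query timeout.
        System.out.println(withOverrides(Map.of("readOnly", "true", "queryTimeout", "60000")));
    }
}
```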

Read Only Configuration with the Jetty Override and Executable Jar

To enable readOnly mode with the executable jar, use the jetty.overrideWebXml property to pass this context parameter to the server and override the default. This technique may be used for any of the values listed in the web.xml section above.

Create a file called readonly.xml with the contents below.

<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns="http://java.sun.com/xml/ns/javaee"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_1.xsd"
      version="3.1">
  <context-param>
   <description>When true, the REST API will not permit mutation operations.</description>
   <param-name>readOnly</param-name>
   <param-value>true</param-value>
  </context-param>
</web-app>

Execute the command as below.

java -server -Xmx4g -Djetty.overrideWebXml=./readonly.xml -jar blazegraph.jar
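
Since the same technique applies to any context parameter, an override file can also be generated rather than written by hand. The following sketch is a hypothetical generator, not part of Blazegraph; it reuses the web-app header from readonly.xml and emits one context-param element per map entry.

```java
import java.util.Map;

// Hypothetical generator for a jetty.overrideWebXml file.
final class OverrideWebXml {

    /** Renders an override web.xml containing one context-param per entry. */
    static String render(Map<String, String> contextParams) {
        StringBuilder xml = new StringBuilder()
                .append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
                .append("<web-app xmlns=\"http://java.sun.com/xml/ns/javaee\"\n")
                .append("      xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n")
                .append("      xsi:schemaLocation=\"http://java.sun.com/xml/ns/javaee ")
                .append("http://java.sun.com/xml/ns/javaee/web-app_3_1.xsd\"\n")
                .append("      version=\"3.1\">\n");
        for (Map.Entry<String, String> param : contextParams.entrySet()) {
            xml.append("  <context-param>\n")
               .append("   <param-name>").append(escape(param.getKey())).append("</param-name>\n")
               .append("   <param-value>").append(escape(param.getValue())).append("</param-value>\n")
               .append("  </context-param>\n");
        }
        return xml.append("</web-app>\n").toString();
    }

    /** Escapes the characters that are significant in XML text content. */
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Prints the equivalent of readonly.xml (without the description element).
        System.out.print(render(Map.of("readOnly", "true")));
    }
}
```

Redirect the output to a file and pass that file to -Djetty.overrideWebXml as shown above.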

Highly Available Replication Cluster (HA)

See HAJournalServer for information on deploying the HA Replication Cluster.

Scale-out (cluster / federation)

The NanoSparqlServer will automatically create a KB instance for a given namespace if none exists. However, the default KB configuration is not appropriate for a scale-out. In order to create a KB instance which is appropriate for scale-out you need to override the properties object which will be seen by the NanoSparqlServer (actually, by the BigdataRDFServletContext). You can do this by editing the "com.bigdata.service.jini.JiniClient" component block in the configuration file. The line that you want to change is:

old:
    // properties = new NV[] {};
new:
   properties =	lubm.properties;

This will direct the NanoSparqlServer to use the configuration for the KB instance described as the "lubm" component in the file, which gives a KB configuration which is appropriate for the LUBM benchmark. You can then modify the "lubm" component to reflect your use case, e.g., triples versus quads, etc.

To set up for quads, change the following lines in the "lubm" configuration block:


old: 
    static private namespace = "U"+univNum+"";
new:
    static private namespace = "PUT-YOUR_NAMESPACE_HERE"; // Note: This MUST be the same value you will specify to the NanoSparqlServer.

old:
	//new NV(BigdataSail.Options.AXIOMS_CLASS, "com.bigdata.rdf.axioms.RdfsAxioms"),
new:
         new NV(BigdataSail.Options.AXIOMS_CLASS,"com.bigdata.rdf.axioms.NoAxioms"),

new:
	new NV(BigdataSail.Options.QUADS_MODE,"true"),

old:
        new NV(BigdataSail.Options.FORWARD_CHAIN_OWL_INVERSE_OF, "true"),
        new NV(BigdataSail.Options.FORWARD_CHAIN_OWL_TRANSITIVE_PROPERTY, "true"),
new:
//        new NV(BigdataSail.Options.FORWARD_CHAIN_OWL_INVERSE_OF, "true"),
//        new NV(BigdataSail.Options.FORWARD_CHAIN_OWL_TRANSITIVE_PROPERTY, "true"),

Note that you have to specify the namespace both in the configuration file and on the command line to the NanoSparqlServer, since the configuration file is parameterized to override various indices based on the namespace.

Start the NanoSparqlServer using nanoSparqlServer.sh. You need to specify the port and the default KB namespace on the command line:

nanoSparqlServer.sh port namespace

The NanoSparqlServer will echo the serviceURL to the console. The actual URL depends on your installation, however it will be similar to this:

serviceURL: http://192.168.1.10:8090/bigdata

The "serviceURL" is actually the URI of the NanoSparqlServer web application. You can interact directly with the web application. If you want to use the SPARQL end point, you need to append "/sparql" to that URL. For example:

http://192.168.1.10:8090/bigdata/sparql
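
A client can derive the end point from the echoed serviceURL in the same way. The sketch below uses the java.net.http API from Java 11 to build (but not send) a SPARQL query request as a form-encoded POST, which is one of the request styles defined by the SPARQL 1.1 Protocol. The class name is hypothetical, and the Accept header is just one standard SPARQL results media type chosen for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side helper; it only builds the request.
final class SparqlPostRequest {

    /** Appends "/sparql" to the serviceURL, tolerating a trailing slash. */
    static String sparqlEndpoint(String serviceUrl) {
        String base = serviceUrl.endsWith("/")
                ? serviceUrl.substring(0, serviceUrl.length() - 1)
                : serviceUrl;
        return base + "/sparql";
    }

    /** Builds a form-encoded POST carrying the query in the "query" parameter. */
    static HttpRequest queryRequest(String serviceUrl, String sparql) {
        String form = "query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(sparqlEndpoint(serviceUrl)))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Accept", "application/sparql-results+json")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = queryRequest("http://192.168.1.10:8090/bigdata",
                "select * where { ?s ?p ?o } limit 1");
        System.out.println(request.method() + " " + request.uri());
        // To send it against a running server:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```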

Read Lock

By default, the nanoSparqlServer.sh script will assert a read lock for the lastCommitTime on the federation. This removes the need to obtain a transaction per query on a cluster which reduces the coordination overhead of reads. This approach is also consistent with using concurrent parallel data load via the scale-out data loader combined with read-behind snapshot isolation on the last globally consistent commit point.

See the nanoSparqlServer.sh script and NanoSparqlServer for more information (look at the javadoc for main()).


Issues:

  1. log4j configuration complaints.
  2. reload of the webapp causes complaints.
  3. refer people to JVM settings for decent performance.