Getting Started


This page will help you get started with bigdata and is focused on its use as an embedded RDF/graph database.

Support

Blazegraph has several different deployment modes (embedded, standalone, highly available replication cluster, scale-out), different operating modes (triples, provenance, and quads), and hundreds of configuration options. This makes bigdata extremely flexible, yet somewhat bewildering. A Help Forum is available for open source support, the configuration options are extensively documented in the javadoc, and whitepapers available at http://www.blazegraph.com/blog provide extensive background on the bigdata architecture, internals, and deployment models.

We also offer developer support, deployment support, and commercial license subscriptions.

Where do I get the code?

Java Requirement

Java 7 is required to run Blazegraph.

Download

  • You can download the WAR or the executable JAR.
  • There are also a Debian Deployer, an RPM Deployer, and a Tarball Deployer, each with its own setup guide.

Maven Central

Starting with the 2.0.0 release, Blazegraph is available via Maven Central.

    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-core</artifactId>
        <version>2.0.0</version>
    </dependency>
    <!-- Use if Tinkerpop 2.5 support is needed; see also Tinkerpop3 below. -->
    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-blueprints</artifactId>
        <version>2.0.0</version>
    </dependency>

If you'd like just the Blazegraph Database dependencies without any of the external libraries, use the bigdata-runtime artifact.

    <dependency>
        <groupId>com.blazegraph</groupId>
        <artifactId>bigdata-runtime</artifactId>
        <version>2.0.0</version>
    </dependency>

GIT

Starting with 2.0.0-RC1, Blazegraph is available on GitHub in the Blazegraph database repository. Tagged releases with artifacts are published there as well.

git init  # only if you need to initialize a new repo in the working directory
git clone -b BLAZEGRAPH_RELEASE_2_0_0 --single-branch https://github.com/blazegraph/database.git BLAZEGRAPH_RELEASE_2_0_0

cd BLAZEGRAPH_RELEASE_2_0_0  # For those who can't wait to get started
./scripts/mavenInstall.sh
./scripts/startBlazegraph.sh


Pre 2.0.0

Pre-2.0.0 releases are hosted on SourceForge. You can check out bigdata from GIT. Older branches and tagged releases have names like BIGDATA_RELEASE_1_5_1.

Cloning the latest branch:


      git init  # only if you need to initialize a new repo in the working directory

      git clone -b BLAZEGRAPH_RELEASE_X_Y_Z --single-branch git://git.code.sf.net/p/bigdata/git BLAZEGRAPH_RELEASE_X_Y_Z

Tagged releases are available in GIT with tags in the form BLAZEGRAPH_RELEASE_X_Y_Z.

    http://sourceforge.net/p/bigdata/git/ci/master/tree/

Eclipse

For embedded development, we highly recommend checking out bigdata from GIT into Eclipse as its own project. You will need to follow the steps at Developing with Eclipse.

Other Environments

If you check out the source from GIT, then use ./scripts/mavenInstall.sh to build the code and collect the dependencies in a single location.

Where do I start?

Bigdata supports several different deployment models (embedded, standalone, replicated, and scale-out). We generally recommend that applications decouple themselves at the REST layer (SPARQL, SPARQL UPDATE, and the REST API). Decoupled applications can scale more gracefully and there is an easy path from the single machine deployment model to the highly available replication cluster deployment model. However, there are applications where an embedded RDF/graph database makes more sense.
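
If you take the decoupled route, your application only needs an HTTP client and the SPARQL protocol; no bigdata classes are required on the client side. A minimal sketch, assuming a NanoSparqlServer-style end point at http://localhost:9999/blazegraph/sparql (adjust the URL to match your deployment):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SparqlClientSketch {
    public static void main(String[] args) throws Exception {
        // endpoint URL is an assumption; adjust host, port, and path to your deployment
        String endpoint = "http://localhost:9999/blazegraph/sparql";
        String query = "SELECT * WHERE { ?s ?p ?o } LIMIT 10";

        // the SPARQL protocol allows the query to be sent as a URL-encoded GET parameter
        URL url = new URL(endpoint + "?query=" + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/sparql-results+json");

        // print the SPARQL JSON result set returned by the server
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}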

Non-Embedded Deployment Models

  • See NanoSparqlServer for easy steps to deploy a bigdata SPARQL end point + REST API either using an embedded jetty server (same JVM as your application), Executable Jar file, or as a WAR (in a servlet container such as tomcat).
  • See HAJournalServer for deploying a highly available replication cluster (SPARQL end point + REST API).
  • See Using_Bigdata_with_the_OpenRDF_Sesame_HTTP_Server for the significantly more complicated procedure required to deploy inside of the Sesame WAR. Note: We do NOT recommend this approach. The Sesame Server does not use the non-blocking query mode of bigdata. This can significantly limit the query throughput. The NanoSparqlServer and HAJournalServer both deliver non-blocking query.
  • See CommonProblems page for a FAQ on common problems and how to fix them.

Embedded RDF/Graph Database

We have implemented the Sesame API over bigdata. Sesame is an open source framework for storage, inferencing, and querying of RDF data, much like Jena. The best place to start would be to head to openrdf.org (http://www.openrdf.org), download Sesame, read their User Guide (specifically Chapter 8, “The Repository API”), and maybe try writing some code using their pre-packaged memory or disk based triple stores. If you have a handle on this, you are 90% of the way to being able to use bigdata as an embedded RDF store.

The rest of this page is focused on how to use bigdata as an embedded RDF/graph database.

Ok, I understand how to use Sesame. What now?

If you understand Sesame, then you are no doubt familiar with the concept of a SAIL (Storage and Inference Layer). Well, we have implemented a SAIL over bigdata. So all you have to do is take the code you’ve written for the Sesame API and instantiate a different SAIL class, specifically:

com.bigdata.rdf.sail.BigdataSail

You can get this Sesame implementation either by downloading the source tree (see the GIT section above) or by downloading the binary and/or source release from the Blazegraph SourceForge download page.

So how do I put the database in triple store versus quad store mode?

We’ve created some configuration files that represent various common “modes” with which you might want to run bigdata:

  • Full Feature Mode. This turns on all of bigdata’s goodies - statement identifiers, free-text index, incremental inference and truth maintenance. This is how you would use bigdata in a system that requires statement-level provenance, free-text search, and incremental load and retraction.
  • RDF-Only Mode. This turns off all inference and truth maintenance, for when you just need to store triples.
  • Fast Load Mode. This is how we run bigdata when we are evaluating load and query performance, for example with the LUBM harness. This turns off some features that are unnecessary for this type of evaluation (statement identifiers and the free text index), which increases throughput. This mode still does inference, but it is database-at-once instead of incremental. It also turns off the recording of justification chains, meaning it is an extremely inefficient mode if you need to retract statements (all inferences would have to be wiped and re-computed). This is a highly specialized mode for highly specialized problem sets.

You can find these and other modes in the form of properties files in the bigdata source tree, in the “bigdata-sails” module, at:

bigdata-sails/src/samples/com/bigdata/samples [1]

Or let us help you devise the mode that is right for your particular problem. We offer development support, production support, and custom services around the platform.

[1] https://github.com/blazegraph/database/tree/master/bigdata-sails/src/samples/com/bigdata/samples

We've set up three modes for bigdata that configure the store properly for triples, triples with provenance, and quads. Look for the TRIPLES_MODE, TRIPLES_MODE_WITH_PROVENANCE, and QUADS_MODE on AbstractTripleStore.Options and BigdataSail.Options.

Currently bigdata does not support inference or provenance for quads, so those features are automatically turned off in QUADS_MODE.
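
A minimal sketch of selecting a mode programmatically, assuming the mode options listed above are boolean-valued properties (check AbstractTripleStore.Options and BigdataSail.Options in the javadoc for the authoritative names and defaults):

// select a mode by setting the corresponding option before creating the sail
// (assumption: the mode options accept "true"/"false" string values)
Properties properties = new Properties();
properties.setProperty(AbstractTripleStore.Options.QUADS_MODE, "true");
// or, for triples with statement-level provenance:
// properties.setProperty(AbstractTripleStore.Options.TRIPLES_MODE_WITH_PROVENANCE, "true");
properties.setProperty(BigdataSail.Options.FILE, "/tmp/bigdata.jnl"); // hypothetical journal file
BigdataSail sail = new BigdataSail(properties);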

Can I use inference with quads?

Bigdata does not support quads mode inference out of the box. The basic issue is that there is no standard concerning which named graphs (data and ontologies) should be combined when performing quads mode inference, or where to write the new entailments (inferences).

People often ask about "quads plus inference." Our question is always, "what are you trying to accomplish?" Sometimes people use quads to support provenance - bigdata has a dedicated mode for this. Sometimes people use quads to have multiple graphs in the same database, but you can have an effectively unlimited number of distinct triple or quad stores in each bigdata instance.

Here are some possible approaches to problems that either appear to require quads plus inference or that actually do require quads plus inference:

  • Use property paths for runtime inference (see the sketch after this list). You can cover an interesting subset of inference through property path expansions. If you combine property paths with inference, then you can explicitly trade off eager materialization against runtime evaluation.
  • Use triples mode with inference, but store multiple triple store instances in the same journal. We have customers with 15,000 triple stores in a single journal. This option works well if you are using quads mode to circumvent a limit in the number of triples mode instances you can use with some platforms. You can also query across those triple store instances using SPARQL federated query.
  • If you are using quads mode to track provenance, then the SIDS mode allows you to track statement level provenance without the overhead of quads mode indices.
  • You can use an external process to explicitly manage the inference process by combining various named graphs within a journal or temporary store and applying the delta for the update. The output delta can be recovered using a change log listener and then conveyed as a simple update to the appropriate named graphs and target journals. This pattern is used by several large customers to manage updates, sometimes as part of a map/reduce job which collects, transforms, and organizes the update process. This can also be used to decouple the inference workload from the query workload, making it possible to scale both processes independently for high data volume and data rate systems. (You can scale the query workload linearly using the highly available replication cluster.)
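
To illustrate the property path approach from the first bullet, here is a sketch that expands a transitive relationship at query time rather than materializing the entailments up front (the class URI is hypothetical, and cxn is a RepositoryConnection as shown later on this page):

// runtime "inference" via a SPARQL 1.1 property path: walk rdfs:subClassOf transitively
String query =
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
    "SELECT ?super " +
    "WHERE { <http://example.org/MyClass> rdfs:subClassOf+ ?super }";
TupleQuery tupleQuery = cxn.prepareTupleQuery(QueryLanguage.SPARQL, query);
TupleQueryResult result = tupleQuery.evaluate();
// each solution binds ?super to a direct or indirect superclass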

Ok, I’ve picked the Blazegraph configuration setting I want to work with and I want to use Sesame. Help me write some code.

It’s easy. For the most part it’s the same as any Sesame 2 repository. This code is taken from [1].

// use one of our pre-configured option-sets or "modes"
Properties properties =
    sampleCode.loadProperties("fullfeature.properties");

// create a backing file for the database
File journal = File.createTempFile("bigdata", ".jnl");
properties.setProperty(
    BigdataSail.Options.FILE,
    journal.getAbsolutePath()
    );

// instantiate a sail and a Sesame repository
BigdataSail sail = new BigdataSail(properties);
Repository repo = new BigdataSailRepository(sail);
repo.initialize();

We now have a Sesame repository that is ready to use. Anytime we want to “do” anything (load data, query, delete, etc), we need to obtain a connection to the repository. This is how I usually use the Sesame API:

RepositoryConnection cxn = repo.getConnection();
cxn.setAutoCommit(false);
try {

    ... // do something interesting

    cxn.commit();
} catch (Exception ex) {
    cxn.rollback();
    throw ex;
} finally {
    // close the repository connection
    cxn.close();
}

Make sure to always use autoCommit=false! Otherwise the SAIL automatically does a commit after every single operation! This causes severe performance degradation and also causes the bigdata journal to grow very large.

Inside that “do something interesting” section you might want to add a statement:

Resource s = new URIImpl("http://www.bigdata.com/rdf#Mike");
URI p = new URIImpl("http://www.bigdata.com/rdf#loves");
Value o = new URIImpl("http://www.bigdata.com/rdf#RDF");
Statement stmt = new StatementImpl(s, p, o);
cxn.add(stmt);

Or maybe you’d like to load an entire RDF document:

String baseURL = ... // the base URL for the document
InputStream is = ... // input stream to the document
Reader reader = new InputStreamReader(new BufferedInputStream(is));
cxn.add(reader, baseURL, RDFFormat.RDFXML);
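
For example, a sketch that loads a local RDF/XML file (the file path and base URL are hypothetical):

// load a local RDF/XML file; path and base URL are hypothetical
InputStream is = new FileInputStream("/tmp/data.rdf");
Reader reader = new InputStreamReader(new BufferedInputStream(is));
try {
    cxn.add(reader, "http://example.org/data", RDFFormat.RDFXML);
    cxn.commit();
} finally {
    reader.close();
}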

Once you have data loaded you might want to read some data from your database. Note that by casting the statement to a “BigdataStatement”, you can get at additional information like the statement type (Explicit, Axiom, or Inferred):

URI uri = ... // a Resource that you’d like to know more about
RepositoryResult<Statement> stmts =
    cxn.getStatements(uri, null, null, true /* includeInferred */);
while (stmts.hasNext()) {
    Statement stmt = stmts.next();
    Resource s = stmt.getSubject();
    URI p = stmt.getPredicate();
    Value o = stmt.getObject();
    // do something with the statement

    // cast to BigdataStatement to get at additional information
    BigdataStatement bdStmt = (BigdataStatement) stmt;
    if (bdStmt.isExplicit()) {
        // do one thing
    } else if (bdStmt.isInferred()) {
        // do another thing
    } else { // bdStmt.isAxiom()
        // do something else
    }
}
// close the iteration so the connection's resources are released
stmts.close();

Of course one of the most interesting things you can do is run high-level queries against the database. Sesame 2 repositories support the open-standard query language SPARQL [1] and the native Sesame query language SeRQL [2]. Formulating high-level queries is outside the scope of this document, but assuming you have formulated your query, you can execute it as follows:

final QueryLanguage ql = ... // the query language
final String query = ... // a “select” query
TupleQuery tupleQuery = cxn.prepareTupleQuery(ql, query);
tupleQuery.setIncludeInferred(true /* includeInferred */);
TupleQueryResult result = tupleQuery.evaluate();
// do something with the results
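
For example, the result of a “select” query is a sequence of binding sets that you can iterate (the variable name "x" is just illustrative):

// iterate the solutions; each solution binds the variables from the SELECT clause
try {
    while (result.hasNext()) {
        BindingSet bindingSet = result.next();
        Value x = bindingSet.getValue("x"); // "x" is a hypothetical variable name
        // do something with the binding
    }
} finally {
    result.close();
}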

Personally I find “construct” queries to be more useful; they allow you to grab a real subgraph from your database:

// construct queries can produce duplicate statements, so collect the results in a set
final Set<Statement> results = new LinkedHashSet<Statement>();
final GraphQuery graphQuery = cxn.prepareGraphQuery(ql, query);
graphQuery.setIncludeInferred(true /* includeInferred */);
graphQuery.evaluate(new StatementCollector(results));
// do something with the results
for (Statement stmt : results) {
    ...
}

While we’re at it, using the bigdata free-text index is as simple as writing a high-level query. Bigdata uses a magic predicate to indicate that the free-text index should be used to find bindings for a particular variable in a high-level query. The free-text index is a Lucene-style index that will match whole words or prefixes.

RepositoryConnection cxn = repo.getConnection();
cxn.setAutoCommit(false);
try {
    cxn.add(new URIImpl("http://www.bigdata.com/A"), RDFS.LABEL,
            new LiteralImpl("Yellow Rose"));
    cxn.add(new URIImpl("http://www.bigdata.com/B"), RDFS.LABEL,
            new LiteralImpl("Red Rose"));
    cxn.add(new URIImpl("http://www.bigdata.com/C"), RDFS.LABEL,
            new LiteralImpl("Old Yellow House"));
    cxn.add(new URIImpl("http://www.bigdata.com/D"), RDFS.LABEL,
            new LiteralImpl("Loud Yell"));
    cxn.commit();
} catch (Exception ex) {
    cxn.rollback();
    throw ex;
} finally {
    // close the repository connection
    cxn.close();
}

String query = "select ?x where { ?x <"+BNS.SEARCH+"> \"Yell\" . }";
executeSelectQuery(repo, query, QueryLanguage.SPARQL);
// will match A, C, and D

You can find all of this code and more in the source tree at bigdata-sails/src/samples/com/bigdata/samples.[3]

[1] http://www.w3.org/TR/rdf-sparql-query/
[2] http://www.openrdf.org/doc/sesame/users/ch06.html
[3] https://github.com/blazegraph/database/blob/master/bigdata-sails/src/samples/com/bigdata/samples/

You claim that you've "solved" the provenance problem for RDF with statement identifiers. Can you show me how that works?

Please see the Reification Done Right page (a guide to making efficient statements about statements in bigdata, a.k.a. RDF* and SPARQL*).

Note: The older RDF/XML interchange for the Statement identifiers mode is no longer available.
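
As a taste of what that page covers, here is a minimal SPARQL* sketch (the dc:source predicate and the URIs are only illustrative; the << >> syntax and the required configuration are described on the Reification Done Right page):

// ask for provenance metadata attached to a single statement (requires an
// RDR/statement-identifiers enabled store; predicate and URIs are hypothetical)
String query =
    "PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
    "SELECT ?source WHERE { " +
    "  << <http://www.bigdata.com/rdf#Mike> " +
    "     <http://www.bigdata.com/rdf#loves> " +
    "     <http://www.bigdata.com/rdf#RDF> >> dc:source ?source . " +
    "}";
TupleQuery tupleQuery = cxn.prepareTupleQuery(QueryLanguage.SPARQL, query);
TupleQueryResult result = tupleQuery.evaluate();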

Running Bigdata

Make sure you are running with the -server JVM option and provide at least several GB of RAM for the embedded database (e.g., -Xmx4G). You should see extremely good load and query performance. If you do not, please contact us and let us help you get the most out of our product. Also see QueryOptimization, IOOptimization, and PerformanceOptimization.

Bundling Bigdata

Maven

You can use maven - see MavenRepository for the POM.

Scala

SBT is a popular and very powerful build tool. To add bigdata to an sbt project, add the following to your build definition:

1) The bigdata dependency:

libraryDependencies ++= Seq(
    "com.bigdata" % "bigdata" % bigDataVersion 
)

2) Several Maven repositories added to the resolvers:

  resolvers += "nxparser-repo" at "http://nxparser.googlecode.com/svn/repository/",

  resolvers += "Bigdata releases" at "http://systap.com/maven/releases/",

  resolvers += "Sonatype OSS Releases" at "https://oss.sonatype.org/content/repositories/releases",

  resolvers += "apache-repo-releases" at "http://repository.apache.org/content/repositories/releases/"

Bigdata Modules and Dependencies

There are several project modules at this time. Each module bundles all necessary dependencies in its lib subdirectory.

  • bigdata (indices, journals, services, etc)
  • bigdata-rdf (the RDFS++ database)
  • bigdata-sails (the Sesame integration for the RDFS++ database)
  • bigdata-jini (jini integration providing for distributed services - this is NOT required for embedded or standalone deployments)

The following dependencies are required only for the scale-out architecture:

  • jini
  • zookeeper

ICU is required only if you want to take advantage of compressed Unicode sort keys. This is a great feature if you are using Unicode and care about this sort of thing, and it is available for both scale-up and scale-out deployments. ICU will be used by default if the ICU dependencies are on the classpath. See the com.bigdata.btree.keys package for further notes on ICU and Unicode options. For the brave, ICU also has an optional JNI library.

Removing jini and zookeeper can save you 10 MB. Removing ICU can save you 30 MB.

The fastutils dependency is quite large, but it is automatically pruned in our WAR release to just those classes that bigdata actually uses.