Getting Started

From Blazegraph
Revision as of 22:43, 31 July 2009 by Thompsonbry (Talk | contribs) (Imported from wikispaces)


Where do I start?

I can answer this question best with another question - do you know how to use the Sesame 2 API? We have implemented the Sesame 2 API over bigdata. Sesame is an open source framework for storage, inferencing and querying of RDF data, much like Jena. The best place to start would be to head to openrdf.org[1], download Sesame 2.2, read their most excellent User Guide[2] (specifically Chapter 8 - “The Repository API”), and maybe try writing some code using their pre-packaged memory or disk based triple stores. If you have a handle on this you are 90% of the way to being able to use the bigdata RDF store.

[1] http://www.openrdf.org
[2] http://www.openrdf.org/doc/sesame2/users/


Where do I get the code?

This command will check out the bigdata trunk, which is what you want.

svn co https://bigdata.svn.sourceforge.net/svnroot/bigdata/bigdata-trunk bigdata

Ok, I understand how to use Sesame 2. What now?

If you understand Sesame 2 then you are no doubt familiar with the concept of a SAIL (Storage and Inference Layer). Well, we have implemented a SAIL over bigdata. So all you have to do is take the code you’ve written for the Sesame 2 API and instantiate a different SAIL class, specifically:

com.bigdata.rdf.sail.BigdataSail

You can get this Sesame 2 implementation either by checking out the source tree from SVN (see above) or by downloading the binary and/or source release from Sourceforge[1].

I would highly recommend checking out the bigdata trunk from SVN directly into Eclipse as its own project, because you will get a .classpath and .project that will automatically build everything for you.

There are several project modules at this time: bigdata (indices, journals, services, etc), bigdata-jini (jini integration providing for distributed services), bigdata-rdf (the RDFS++ database), and bigdata-sails (the Sesame 2.0 integration for the RDFS++ database). Each module bundles all necessary dependencies in its lib subdirectory.

If you are concerned about the size of the distribution, note the following dependencies are required only for the scale-out architecture:

- jini
- zookeeper

If you are doing a scale-up installation, then you do not need any of the jars in the bigdata-jini/lib directory.

In addition, ICU is required only if you want to take advantage of compressed Unicode sort keys. This is a great feature if you are using Unicode and you care about this sort of thing, and it is available for both scale-up and scale-out deployments. ICU will be used by default if the ICU dependencies are on the classpath. See the com.bigdata.btree.keys package for further notes on ICU and Unicode options. For the brave, ICU also has an optional JNI library.

Removing jini and zookeeper can save you 10M. Removing ICU can save you 30M.

The fastutils dependency is also quite large. We plan to prune it in subsequent releases to only the class files bigdata actually needs.

[1] http://sourceforge.net/project/showfiles.php?group_id=191861


Is it really that easy?

No, of course not; life is never that easy. Bigdata currently has 70 configurable options, which makes it extremely flexible, yet somewhat bewildering. (This is why we encourage you to keep us in the loop as you evaluate bigdata, so that we can make sure you’re getting the most out of the database. Or better yet, buy a support contract.) Luckily, we’ve created some configuration files that represent various common “modes” with which you might want to run bigdata:

- Full Feature Mode. This turns on all of bigdata’s goodies - statement identifiers, free-text index, incremental inference and truth maintenance. This is how you would use bigdata in a system that requires statement-level provenance, free-text search, and incremental load and retraction.
- RDF-Only Mode. This turns off all inference and truth maintenance, for when you just need to store triples.
- Fast Load Mode. This is how we run bigdata when we are evaluating load and query performance, for example with the LUBM harness. This turns off some features that are unnecessary for this type of evaluation (statement identifiers and the free text index), which increases throughput. This mode still does inference, but it is database-at-once instead of incremental. It also turns off the recording of justification chains, meaning it is an extremely inefficient mode if you need to retract statements (all inferences would have to be wiped and re-computed). This is a highly specialized mode for highly specialized problem sets.

You can find these and other modes in the form of properties files in the bigdata source tree, in the “bigdata-sails” module, at:

bigdata-sails/src/samples/com/bigdata/samples[1]

Or let us help you devise the mode that is right for your particular problem. Of course we will always answer questions, but also please consider buying a support contract!

[1] http://bigdata.svn.sourceforge.net/viewvc/bigdata/trunk/bigdata-sails/src/samples/com/bigdata/samples/
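
Since each mode is an ordinary Java properties file, one way to experiment is to load a shipped mode and override individual settings in a copy before handing it to the SAIL. The sketch below uses only java.util.Properties. The class name ModeProperties and the keys example.inference and example.textIndex are hypothetical, chosen for illustration; the real option names are the ones found in the shipped properties files.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ModeProperties {

    /** Reads a mode definition from any Reader (file, classpath resource, string). */
    static Properties load(Reader in) {
        Properties mode = new Properties();
        try {
            mode.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mode;
    }

    /** Returns a copy of base with one key overridden; base itself is left unchanged. */
    static Properties override(Properties base, String key, String value) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.setProperty(key, value);
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical keys, for illustration only. Real keys come from the shipped files.
        String text = "example.inference=true\nexample.textIndex=true\n";
        Properties mode = load(new StringReader(text));
        Properties tuned = override(mode, "example.textIndex", "false");
        System.out.println("shipped: " + mode.getProperty("example.textIndex"));
        System.out.println("tuned:   " + tuned.getProperty("example.textIndex"));
    }
}
```

Working on a copy keeps the shipped mode intact, which makes it easier to compare your changes against a known-good baseline.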


Ok, I’ve picked the bigdata configuration setting I want to work with. Help me write some code.

It’s easy. For the most part it’s the same as any Sesame 2 repository. This code is taken from bigdata-sails/src/samples/com/bigdata/samples/SampleCode.java

// use one of our pre-configured option-sets or "modes"
Properties properties =
    sampleCode.loadProperties("fullfeature.properties");

// create a backing file for the database
File journal = File.createTempFile("bigdata", ".jnl");
properties.setProperty(
    BigdataSail.Options.FILE,
    journal.getAbsolutePath()
    );

// instantiate a sail and a Sesame repository
BigdataSail sail = new BigdataSail(properties);
Repository repo = new BigdataSailRepository(sail);
repo.initialize();

We now have a Sesame repository that is ready to use. Anytime we want to “do” anything (load data, query, delete, etc), we need to obtain a connection to the repository. This is how I usually use the Sesame API:

RepositoryConnection cxn = repo.getConnection();
cxn.setAutoCommit(false);
try {

    ... // do something interesting

    cxn.commit();
} catch (Exception ex) {
    cxn.rollback();
    throw ex;
} finally {
    // close the repository connection
    cxn.close();
}

Make sure to always use autoCommit=false! Otherwise the SAIL automatically does a commit after every single operation! This causes severe performance degradation and also causes the bigdata journal to grow very large.

Inside that “do something interesting” section you might want to add a statement:

Resource s = new URIImpl("http://www.bigdata.com/rdf#Mike");
URI p = new URIImpl("http://www.bigdata.com/rdf#loves");
Value o = new URIImpl("http://www.bigdata.com/rdf#RDF");
Statement stmt = new StatementImpl(s, p, o);
cxn.add(stmt);

Or maybe you’d like to load an entire RDF document:

String baseURL = ... // the base URL for the document
InputStream is = ... // input stream to the document
Reader reader = new InputStreamReader(new BufferedInputStream(is));
cxn.add(reader, baseURL, RDFFormat.RDFXML);

Once you have data loaded you might want to read some data from your database. Note that by casting the statement to a “BigdataStatement”, you can get at additional information like the statement type (Explicit, Axiom, or Inferred):

URI uri = ... // a Resource that you’d like to know more about
RepositoryResult<Statement> stmts =
    cxn.getStatements(uri, null, null, true /* includeInferred */);
try {
    while (stmts.hasNext()) {
        Statement stmt = stmts.next();
        Resource s = stmt.getSubject();
        URI p = stmt.getPredicate();
        Value o = stmt.getObject();
        // do something with the statement

        // cast to BigdataStatement to get at additional information
        BigdataStatement bdStmt = (BigdataStatement) stmt;
        if (bdStmt.isExplicit()) {
            // do one thing
        } else if (bdStmt.isInferred()) {
            // do another thing
        } else { // bdStmt.isAxiom()
            // do something else
        }
    }
} finally {
    // release the resources held by the result iterator
    stmts.close();
}

Of course one of the most interesting things you can do is run high-level queries against the database. Sesame 2 repositories support the open-standard query language SPARQL[1] and a native Sesame query language SERQL[2]. Formulating high-level queries is outside the scope of this document, but assuming you have formulated your query you can execute it as follows:

final QueryLanguage ql = ... // the query language
final String query = ... // a “select” query
TupleQuery tupleQuery = cxn.prepareTupleQuery(ql, query);
tupleQuery.setIncludeInferred(true /* includeInferred */);
TupleQueryResult result = tupleQuery.evaluate();
// do something with the results

Personally, I find “construct” queries to be more useful: they allow you to grab a real subgraph from your database.

// silly construct queries, can't guarantee distinct results
final Set<Statement> results = new LinkedHashSet<Statement>();
final GraphQuery graphQuery = cxn.prepareGraphQuery(ql, query);
graphQuery.setIncludeInferred(true /* includeInferred */);
graphQuery.evaluate(new StatementCollector(results));
// do something with the results
for (Statement stmt : results) {
    ...
}

While we’re at it, using the bigdata free-text index is as simple as writing a high-level query. Bigdata uses a magic predicate to indicate that the free-text index should be used to find bindings for a particular variable in a high-level query. The free-text index is a Lucene-style index that will match whole words or prefixes.

RepositoryConnection cxn = repo.getConnection();
cxn.setAutoCommit(false);
try {
    cxn.add(new URIImpl("http://www.bigdata.com/A"), RDFS.LABEL,
            new LiteralImpl("Yellow Rose"));
    cxn.add(new URIImpl("http://www.bigdata.com/B"), RDFS.LABEL,
            new LiteralImpl("Red Rose"));
    cxn.add(new URIImpl("http://www.bigdata.com/C"), RDFS.LABEL,
            new LiteralImpl("Old Yellow House"));
    cxn.add(new URIImpl("http://www.bigdata.com/D"), RDFS.LABEL,
            new LiteralImpl("Loud Yell"));
    cxn.commit();
} catch (Exception ex) {
    cxn.rollback();
    throw ex;
} finally {
    // close the repository connection
    cxn.close();
}

String query = "select ?x where { ?x <"+BNS.SEARCH+"> \"Yell\" . }";
executeSelectQuery(repo, query, QueryLanguage.SPARQL);
// will match A, C, and D

You can find all of this code and more in the source tree at bigdata-sails/src/samples/com/bigdata/samples.[3]

[1] http://www.w3.org/TR/rdf-sparql-query/
[2] http://www.openrdf.org/doc/sesame/users/ch06.html
[3] http://bigdata.svn.sourceforge.net/viewvc/bigdata/trunk/bigdata-sails/src/samples/com/bigdata/samples/


You claim that you've "solved" the provenance problem for RDF with statement identifiers. Can you show me how that works?

Sure. The concept here is that RDF is very bad for making statements about statements. Well at least it used to be. With the introduction of the concept of named graphs, we can now exploit the context position in a clever way to allow statements about statements without cumbersome reification. All that was required was a custom extension to RDF/XML to model quads. This is best illustrated through an example. Let's start with some RDF/XML:

<rdf:Description rdf:about="#Mike" >
    <rdfs:label bigdata:sid="_S1">Mike</rdfs:label>
    <bigdata:loves bigdata:sid="_S2" rdf:resource="#RDF" />
</rdf:Description>

<rdf:Description rdf:nodeID="_S1" >
    <bigdata:source>www.systap.com</bigdata:source>
</rdf:Description>

<rdf:Description rdf:nodeID="_S2" >
    <bigdata:source>www.systap.com</bigdata:source>
</rdf:Description>

You can see that we first assert two statements, assigning each a "sid" or statement identifier in the form of a bnode. Then we can use that bnode ID to make statements about the statements. In this case, we simply assert the source. We could assert all sorts of other things as well, including access control information, author, date, etc. Bigdata then maps these bnode IDs into internal statement identifiers. Each explicit statement in the database gets a unique statement identifier. You can then write a SPARQL query using the named graph feature to get at this information. So if I wanted to write a query to get at all the provenance information for the statement { Mike, loves, RDF }, it would look as follows:

String NS = "http://www.bigdata.com/rdf#";
String MIKE = NS + "Mike";
String LOVES = NS + "loves";
String RDF = NS + "RDF";
String query =
    "construct { ?sid ?p ?o } " +
    "where { " +
    "  ?sid ?p ?o ." +
    "  graph ?sid { <"+MIKE+"> <"+LOVES+"> <"+RDF+"> } " +
    "}";
executeConstructQuery(repo, query, QueryLanguage.SPARQL);

This example is codified with the rest of the sample code in bigdata-sails/src/samples/com/bigdata/samples[1].

[1] http://bigdata.svn.sourceforge.net/viewvc/bigdata/trunk/bigdata-sails/src/samples/com/bigdata/samples/


Anything else I need to know?

Make sure you are running with the -server JVM option and if possible, expand the heap size using the -Xmx option as well. You should see extremely good load and query performance. If you are not, please contact us and let us help you get the most out of our product.
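
A quick way to confirm that these options actually took effect is to ask the running JVM. The sketch below uses only standard JDK calls; the class name JvmCheck is made up for illustration. Note that -server may not appear in the argument list when the server VM is already the default, so the VM name is printed as well.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class JvmCheck {

    /** True if any JVM argument starts with the given prefix, e.g. "-Xmx". */
    static boolean hasArg(List<String> jvmArgs, String prefix) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("VM name:      " + System.getProperty("java.vm.name"));
        System.out.println("-server set:  " + hasArg(jvmArgs, "-server"));
        System.out.println("-Xmx set:     " + hasArg(jvmArgs, "-Xmx"));
        System.out.println("max heap MiB: " + maxHeapMiB());
    }
}
```

Run it with the same options you pass to your application, for example java -server -Xmx4g JvmCheck, and compare the reported heap with what you expected.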


Problem: How do I build / install bigdata on a single machine?

1. bigdata is an Eclipse project, so you can just check it out from within Eclipse and it should build automatically.

2. There is an ant build script (build.xml). Use "ant" to generate the jar.

3. Use "ant bundleJar" to generate the jar and bundle all of the dependencies into the build/lib directory. You can then copy those jars to wherever you need them.


Problem: How do I install bigdata on a cluster?

"ant install" is the cluster install.

There are notes in build.properties and in build.xml for the "install" target on how to set up a cluster install. There are examples of bigdata configuration files for a 3-node cluster and for a 15-node cluster in src/resources/config. The cluster install is currently Linux specific, but we would be happy to help with ports to other platforms. It installs a bash script which should be run from a cron job. There is also a dependency on sysstat, http://pagesperso-orange.fr/sebastien.godard/, for collecting performance counters from the O/S.

Please see the ClusterGuide for more detail.

We recommend that you ask for help before attempting your first cluster install.


Problem: What are all these pieces?

Solution: There are several layers to the bigdata architecture. At its most basic layer, you can create and manage named indices on a com.bigdata.journal.Journal. The journal is a fast, append-only persistence store suitable for purely local applications. Scale-out applications are written to the com.bigdata.service.IBigdataClient and com.bigdata.service.IBigdataFederation APIs. There are several implementations of the IBigdataFederation interface:

com.bigdata.journal.Journal: This is not a federation at all. However, the Journal may be used for a fast local persistence store with named indices.

com.bigdata.service.LocalDataServiceFederation: Provides a lightweight federation instance backed by a single com.bigdata.service.DataService. The DataService provides the building block for the scale-out architecture and handles concurrency control for write access to indices hosted by the DataService. This federation class does NOT support key-range partitioned indices, but it is plug-and-play compatible with the federations that do, which makes this a good place to develop your applications.

com.bigdata.service.EmbeddedDataServiceFederation: Provides an embedded (in-process) federation instance supporting key-range partitioned indices. This is mainly used for testing those aspects of bigdata or of specific applications which are sensitive to key-range partitioning of indices and to overflow events. An overflow event occurs when the live journal absorbing writes for a DataService reaches its target maximum extent. Synchronous overflow processing is very fast. It creates a new "live" journal and defines new views of the indices found on the old live journal on the new journal. A background process then provides asynchronous compacting merges and related operations for the index views and also makes decisions concerning whether to split, join, or move index partitions.

com.bigdata.service.jini.JiniFederation: This is the scale-out architecture deployed using jini. Services may be started on machines throughout a cluster. The services use jini to register themselves and to discover other services.


Problem: You see a javac internal error from the ant build script.

Solution: Use jdk1.6.0_07 or better. Several problems with javac parsing were apparently resolved in 1.6.0_07 that were present in the 1.5 releases of the jdk.


Problem: You see an ArrayIndexOutOfBoundsException from the KeyDecoder in a SparseRowStore write.

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at com.bigdata.rdf.sail.BigdataSail.setUp(BigdataSail.java:405)
        at com.bigdata.rdf.sail.BigdataSail.<init>(BigdataSail.java:430)
        at Test.main(Test.java:73)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at com.bigdata.rdf.sail.BigdataSail.setUp(BigdataSail.java:398)
        ... 2 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at com.bigdata.sparse.KeyDecoder.<init>(KeyDecoder.java:261)
        at com.bigdata.sparse.AbstractAtomicRowReadOrWrite.atomicRead(AbstractAtomicRowReadOrWrite.java:234)
        at com.bigdata.sparse.AbstractAtomicRowReadOrWrite.atomicRead(AbstractAtomicRowReadOrWrite.java:152)
        at com.bigdata.sparse.AtomicRowWriteRead.apply(AtomicRowWriteRead.java:167)
        at com.bigdata.sparse.AtomicRowWriteRead.apply(AtomicRowWriteRead.java:25)

Solution: You don't have the ICU libraries on the classpath. The ICU libraries are in the bigdata/lib/icu folder. ICU provides fast, correct Unicode support for C and Java. The JDK's Unicode support is based on ICU, but does not support compressed Unicode sort keys. Hence we recommend the ICU package instead. If you DO NOT want Unicode sort keys (that is, if all String data in your index keys is ASCII) then you can use the com.bigdata.btree.keys.KeyBuilder.Options.COLLATOR option to disable Unicode support. This can also be done on a per-index basis when the index is provisioned.
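
If you are not sure whether ICU made it onto the classpath, a small probe can tell you before you open a store. The sketch below is an assumption-laden check, not part of bigdata: it tries to resolve com.ibm.icu.text.Collator, a well-known ICU4J class, without initializing it. Which ICU classes bigdata itself looks for is not stated here, and the class name IcuCheck is made up for illustration.

```java
class IcuCheck {

    /** True if the named class can be resolved by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, IcuCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // com.ibm.icu.text.Collator is used here only as a probe for ICU4J
        System.out.println("ICU4J present: " + isOnClasspath("com.ibm.icu.text.Collator"));
    }
}
```

Run it with exactly the classpath your application uses; a result of false suggests the jars under bigdata/lib/icu are missing from that classpath.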


Problem: You are seeing a LOT of log statements.

Solution: Configure log4j correctly!!! Bigdata uses log4j and conditional logging throughout. Some parts of bigdata (especially the B+Trees) produce an absolutely ENORMOUS amount of logging data unless you have configured logging correctly. Also, logging is an ENORMOUS performance burden, as the StringBuilder operations required to (a) generate the log messages and (b) generate and parse stack traces in order to give you nice metadata in your log (e.g., the class name and line number at which the log message was issued) drive the heap at a frantic pace.
By default, log4j will log at a DEBUG level. This is NOT acceptable. You MUST configure log4j to log at no more than ERROR or possibly WARN for bigdata.
In general, you configure log4j using a command line option such as:

-Dlog4j.configuration=file:src/resources/logging/log4j.properties

Notice that the log4j configuration file is specified as a URL, not a file name!

Note: This issue is resolved in CVS. The log level for com.bigdata will be defaulted to WARN if no default has been specified. This should prevent most surprises.
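
As a starting point, the sketch below writes a minimal log4j 1.x properties file that caps both the root logger and com.bigdata at WARN, then prints the matching JVM option with the file expressed as a URL. The appender name "console", the conversion pattern, and the class name Log4jSetup are assumptions made for illustration. The pattern deliberately avoids class-name and line-number conversions, since those are the expensive metadata described above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Log4jSetup {

    /** A minimal log4j 1.x configuration: WARN everywhere, console output only. */
    static final String CONFIG = String.join("\n",
            "log4j.rootLogger=WARN, console",
            "log4j.logger.com.bigdata=WARN",
            "log4j.appender.console=org.apache.log4j.ConsoleAppender",
            "log4j.appender.console.layout=org.apache.log4j.PatternLayout",
            "log4j.appender.console.layout.ConversionPattern=%-5p %c{1} - %m%n") + "\n";

    /** Writes CONFIG to dir/log4j.properties and returns the path of the file. */
    static Path writeConfig(Path dir) {
        try {
            Path file = dir.resolve("log4j.properties");
            Files.write(file, CONFIG.getBytes(StandardCharsets.ISO_8859_1));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds the JVM option; the value is a file: URL, not a plain file name. */
    static String jvmOption(Path configFile) {
        return "-Dlog4j.configuration=" + configFile.toUri();
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"));
        Path file = writeConfig(dir);
        System.out.println(jvmOption(file));
    }
}
```

Copy the printed option onto the command line of your application. Raise individual loggers above WARN only while you are chasing a specific problem.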


Problem: You see a stack trace out of com.bigdata.btree.Node.dump() or Leaf.dump()

  ERROR child[0] does not have parent reference.
Exception in thread "main" java.lang.RuntimeException: While loading: /tmp/data/test.owl
        at com.bigdata.rdf.store.DataLoader.loadFiles(DataLoader.java:800)
        at com.bigdata.rdf.store.DataLoader.loadData2(DataLoader.java:706)
        at com.bigdata.rdf.store.DataLoader.loadData(DataLoader.java:552)
        at com.bigdata.rdf.store.DataLoader.loadData(DataLoader.java:513)
        at TestDataLoader.main(TestDataLoader.java:26)
Caused by: java.lang.NullPointerException
        at com.bigdata.btree.Node.dump(Node.java:2545)

Solution: Configure log4j correctly!!!! (see above).

The dump() method is only invoked when the log4j level is at DEBUG. The BTree code makes some assertions within dump() that are not valid during some kinds of mutation (rotation of a key, split or join of a node or leaf). We've tracked down most of these cases and just commented out the dump() invocations, but there are clearly some left over. This does not indicate a problem with the BTree -- you just need to configure log4j correctly!

Note: This issue is resolved in CVS. The log level for com.bigdata will be defaulted to WARN if no default has been specified. This should prevent most surprises.


Problem: I am using the Journal and the file size grows very quickly.


Solution: Turn off auto-commit in the SAIL.

The journal is an append only data structure. A write on the B+Tree never overwrites the old data. Instead it writes a new revision of the node or leaf. If you are doing a commit for each statement loaded then that is the worst possible case. The scale-out architecture uses the Journal as a write buffer. Each time the journal on a data service fills up, a new journal is created and the buffered writes from the old journal are migrated onto read-optimized B+Tree files known as index segments (.seg files). If you would like to see a persistence store that can reclaim old space, please log a feature request.
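
To get a feel for why a commit per statement is the worst case, here is a deliberately simplified toy model of an append-only store. It is not bigdata's on-disk format: the leaf capacity, the record size, and the class name AppendOnlyToy are made-up numbers and names. The only idea it captures is that every commit appends a new revision of each leaf touched since the last commit, plus a new root.

```java
import java.util.HashSet;
import java.util.Set;

class AppendOnlyToy {

    static final int LEAF_CAPACITY = 32;   // statements per leaf (made-up number)
    static final int RECORD_BYTES = 4096;  // bytes per leaf or root revision (made-up number)

    private final Set<Integer> dirtyLeaves = new HashSet<>();
    private long bytesAppended = 0;

    /** Buffers a write; the leaf holding the key becomes dirty. */
    void add(int key) {
        dirtyLeaves.add(key / LEAF_CAPACITY);
    }

    /** Appends a new revision of every dirty leaf plus one new root. Old revisions stay. */
    void commit() {
        if (dirtyLeaves.isEmpty()) {
            return;
        }
        bytesAppended += (long) (dirtyLeaves.size() + 1) * RECORD_BYTES;
        dirtyLeaves.clear();
    }

    long bytesAppended() {
        return bytesAppended;
    }

    /** Loads sequential keys, committing after every commitEvery statements. */
    static long load(int statements, int commitEvery) {
        AppendOnlyToy store = new AppendOnlyToy();
        for (int i = 0; i < statements; i++) {
            store.add(i);
            if ((i + 1) % commitEvery == 0) {
                store.commit();
            }
        }
        store.commit(); // flush any remainder
        return store.bytesAppended();
    }

    public static void main(String[] args) {
        System.out.println("commit per statement: " + load(1000, 1) + " bytes");
        System.out.println("single commit:        " + load(1000, 1000) + " bytes");
    }
}
```

In this model the same leaf is rewritten once per statement under auto-commit, but only once in total when the load is committed as a single batch. The absolute numbers mean nothing; the gap between the two is the point.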


Problem: How do I create a scale-out RDF database?

The BigdataSail is just a wrapper over a com.bigdata.rdf.AbstractTripleStore. It provides some constructors that make life easy by creating the backing persistence store and the AbstractTripleStore for you. If you are trying to create a scale-out RDF database, then you need to work with the constructor that accepts the AbstractTripleStore object. The RDFDataLoadMaster class (http://www.bigdata.com/bigdata/docs/api/com/bigdata/rdf/load/RDFDataLoadMaster.html) has some code which creates an RDF database instance automatically, if one does not already exist, based on the description of an RDF database instance in the bigdata configuration file. You are basically running a distributed data load job. This is a good time to ask for help.


Problem: How do I use bigdata with Hadoop?

Bigdata uses zookeeper, but does not have any other integration points with Hadoop at this time. This is something that we are interested in doing. It should be possible to deploy bigdata over HDFS using FUSE, but we have not tried it.

One of the most common things that people want to do is pre-process a (huge) amount of data using map/reduce and then bulk load that data into a scale-out RDF database instance where they can then use high-level query (SPARQL). The easiest way to do this is to have bigdata clients running on the same hosts as your reduce operations, which are presumably aggregating RDF/XML, N3, etc. for bulk load into bigdata. You can use the file system loader to bulk load files out of a named directory where they are being written by a map/reduce job. Files will be deleted once they are restart safe on the bigdata RDF database instance. If you are trying to do this, let us know and we can work with you to get things set up.


Problem: You have errors when compiling with Sesame 2.2.4.

Exception in thread "main" java.lang.NoSuchMethodError:
org.openrdf.sail.helpers.SailBase.getConnection()Lorg/openrdf/sail/NotifyingSailConnection;
    at com.bigdata.rdf.sail.BigdataSail.getConnection(BigdataSail.java:794)

Solution: We will fix this, but 2.2 is supported for now and is bundled in bigdata-rdf/lib.

The problem is an API compatibility change. We'll get a fix out shortly.


Problem: Blank nodes appear in the subject position when loading RDF/XML.

Solution: Make sure that the bigdata JARs appear before the Sesame JARs in your CLASSPATH.

The problem arises from an extension to the RDF/XML handling to support statement identifiers. We override some of the Sesame classes for RDF/XML handling to support that extension. We will introduce a MIME type and file extension so that improper handling of RDF/XML will not occur for standard RDF/XML when the JARs are out of order. In the meantime, you can fix this issue by ordering the bigdata JARs before the Sesame JARs.
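
When JAR order is in doubt, it can help to ask the JVM where a given class was actually loaded from. The sketch below uses only standard JDK reflection; the class name WhichJar is made up for illustration. The class names to inspect are passed on the command line and are up to you, since this page does not name the overridden Sesame classes.

```java
import java.security.CodeSource;

class WhichJar {

    /** Returns the JAR or directory a class was loaded from, or a placeholder if unknown. */
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass fully qualified class names as arguments, using your application's classpath.
        for (String name : args) {
            System.out.println(name + " <- " + locationOf(Class.forName(name)));
        }
    }
}
```

If a class you expect bigdata to override reports a Sesame JAR as its origin, the JARs are in the wrong order on the classpath.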